Unable to parse pdf by Jpedal - java

I'm facing a problem while parsing a PDF with JPedal.
When I read the word list from JPedal, it contains garbled characters. The same thing happens when I use OCR, and when I copy the text from the PDF and paste it into Word or a plain text editor. As far as I understand, this PDF was generated by the Quartz PDF context on Mac OS X 10.6.4, which is used to compress the file size; the file is still easily viewable in PDF viewers. I searched for a Java API that supports decoding this kind of PDF, but was unsuccessful. I'm looking for any application or Java API that I can use to decode it; it must be usable on a Linux machine.

Hi everybody,
I'm posting a possible solution to this problem. Here is a link describing how Quartz parses the PDF. It still needs to be implemented in code, because so far I haven't found any ready-made API for it. I believe Stack Overflow is all about taking the initiative and answering questions that have not been asked or answered before.
Regards,
Rituraj

Related

How to extract text from pdf and doc file without downloading

I searched a lot before asking this question. I have a Java program that crawls web pages looking for .doc and .pdf files. It can download them, but a single .pdf or .doc can take up 3-4 MB, which is not good because there are millions of files. So I decided to extract their text without downloading the whole file. Basically, I need to view a PDF or DOC file online and download only its text, but I could not figure out how to do that. If necessary I can provide my code.
Edit: This question can be closed now since I got the idea and (no) solution.
Thanks for the help.
And what's up with those downvotes on the question?
That is not possible. You can only start extracting the document once you download the bytes.
(unless you also have control over the server, you could do the extraction server-side and provide a txt download link)
Reading a file from a website on the Internet without downloading it is impossible.
If you have control of the server you could write a web service that can parse the files on demand and extract the parts you are interested in, which would then be sent to the client.
If not, and if you're up for a more challenging problem, you could write an HTTP client that starts downloading the file and parses it on the fly, downloading only as much as you need to extract the part(s) you need. This might or might not be feasible (or worthwhile) depending on where in the files the "interesting" bits were located. If they're close to the beginning in most cases then you might be able to reduce the download size significantly.
A detailed explanation of how to accomplish this is probably beyond the guidelines for StackOverflow answer length.
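As a rough illustration of the partial-download idea above, here is a minimal sketch using the standard java.net.http client (Java 11 or later) and an HTTP Range header. The class and method names are made up for this example, and it assumes the server supports range requests; a server that ignores the header answers with status 200 and sends the whole file. Whether the first bytes are enough depends on the format: a PDF, for instance, normally keeps its cross-reference table at the end of the file.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: the class and method names are hypothetical.
class RangeRequestSketch {

    // Builds a GET request that asks for the first `length` bytes of the resource.
    static HttpRequest firstBytes(URI uri, long length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        return HttpRequest.newBuilder(uri)
                .header("Range", "bytes=0-" + (length - 1))
                .GET()
                .build();
    }

    // Sends the request and returns the body.
    // Status 206 means the server honoured the range; status 200 means it
    // ignored the header, in which case the body is the complete file.
    static byte[] fetchPrefix(URI uri, long length) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response =
                client.send(firstBytes(uri, length), HttpResponse.BodyHandlers.ofByteArray());
        int status = response.statusCode();
        if (status != 206 && status != 200) {
            throw new IOException("Unexpected HTTP status " + status);
        }
        return response.body();
    }
}
```

Calling RangeRequestSketch.fetchPrefix(uri, 65536) would then request at most the first 64 KB, which the parser can inspect before deciding whether to fetch more.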

Reading a PDF Document in Android

This is my first question, or rather set of questions. I am in my last semester of college, and reading PDFs is one of the components I'm developing for my thesis.
I have been reading questions about reading PDF documents, but there are no solid answers. What are the ways to read a PDF document? From what I have read, there are APIs available that can read PDF documents, such as PDFBox, MuPDF, and iText. I have not seen any other APIs, but these are the ones I have read about in other posts.
The problem is this. PDFBox: I read that it cannot be used because of its AWT dependencies, and Android has no AWT or Swing classes, so PDFBox is out of the question. MuPDF: I have not read anything about it; it was recommended to me, but I want to know whether it is usable for reading PDF documents. iText: this is the API I encounter most often in PDF and Android related questions; the problem here is its license (correct me if I'm wrong). I have not tried any of these three yet because I want to know whether there are other solutions besides them.
Other than APIs, I think PDF reader applications can be used too, if I'm not mistaken. If so, how? I'm not looking for code, but for an explanation of how you did it and how you implemented it in your application.
I have thought of another way, but I do not know if it is possible: converting the PDF document into a .txt or .doc file on the Android device. When I load a PDF document, code would convert it into a .txt or .doc file, and the application would search and extract text from that file rather than from the PDF document.
If you are asking why I need this kind of component: I'm working on an application that will search and extract text from PDF documents on Android.
These are my questions:
What are the ways to read a PDF Document in Android?
What are your experiences in using this kind of method?
How did you do it using this kind of method (just a flow/explanation would do)?
If the method has a License what would be the problem in the future?
PS: Correct me if I'm wrong.
Thank you.
This is a very iText-specific answer, so it does not answer all your questions, but it may still be of help to you.
What are the ways to read a PDF Document in Android?
I use the iText Java library (keep reading to find out about LGPL licensing).
What are your experiences in using this kind of method?
Great! It covers all things PDF related. (Older versions may be a little different.)
How did you do it using this kind of method (just a flow/explanation would do)?
I assume this question relates to the "other way" that you thought of, so it is not relevant to the iText PDF library?
If the method has a License what would be the problem in the future?
I still encourage you to use the current version of iText. However, iText 2.1.7 and older fall under the old LGPL license, which gives you far more free rein and is better suited to commercial or private/closed use, with no real problems. From what I can tell, all the functionality you are after is available in version 2.1.7.
The AGPL license for the current version of iText is pretty decent. From what I understand, you do need to publish your program under a similar license and make the code freely available to others (it would pay to check the details, though). If sharing code is not a problem, then the latest version of iText is worth looking into.
References:
LGPL License: http://www.gnu.org/licenses/lgpl.html
iText AGPL License: http://itextpdf.com/terms-of-use/agpl.php
I was also working on an Android application in which I needed to open PDF files. At first I thought about doing this with third-party APIs. I tried to use MuPDF, but it was not a good experience.
Then our team leader suggested first installing a PDF viewer application on the device, and then using code to open the PDF through the installed viewer.
You can easily download a PDF viewer application from Google Play. Then you can use the following code to open the PDF from your application:
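// Point at the PDF on external storage (the path is just an example).
File file = new File("/sdcard/MyPDFfile.pdf");
Uri path = Uri.fromFile(file);
// Ask Android to hand the file to an installed app that can view PDFs.
Intent intent = new Intent(Intent.ACTION_VIEW);
intent.setDataAndType(path, "application/pdf");
intent.setFlags(Intent.FLAG_ACTIVITY_CLEAR_TOP);
startActivity(intent);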
This code will automatically look for an installed PDF application on your device and pass the intent to that application to open the PDF file.

PDF Open Office or MS Word

I am new to Java. I have to read a PDF, OpenOffice, or MS Word file, make changes to it, and render it as a PDF document on my web page. Could someone tell me which of these file formats has an API or SDK that is easy to use, and which SDK is best for this, so that I can read, update, and render easily? The file also contains a table, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying PDF is going to be a problem; it's not intended for editing. You might have to find some way of converting the PDF to something else first, then edit that. Depending on the types of changes you want to make and the documents you are working from, even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (a bad idea), by converting to OpenOffice first and then editing, or, if they are DOCX, by unzipping and processing the XML.
Good luck.
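To make the "unzip and process the XML" route a bit more concrete, here is a minimal sketch that uses only java.util.zip to pull one XML part out of a zip-based document. The class and method names are made up for this example. It assumes the usual layout of these containers, where the main text of a DOCX file lives in word/document.xml and that of an OpenOffice Writer file in content.xml, and it assumes the XML is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch only: the class and method names are hypothetical.
class OfficeXmlSketch {

    // Returns the named entry of a zip container as UTF-8 text,
    // or null if the container has no such entry.
    static String readEntry(InputStream container, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // readAllBytes stops at the end of the current entry.
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }
}
```

The returned string can be handed to any XML parser. Writing the modified XML back means creating a new zip with the same entries, which this sketch does not cover.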

how to convert a HTML web page into a PDF file using Java

I've been searching the internet for how to convert an HTML page into a PDF file using Java. I found a lot of pointers, and in short, they either don't work or are too difficult to implement. I also downloaded a commercial product, pd4ml; the API is something I'd be happy to work with, except that when I crawled a simple page on Wikipedia, I got an out-of-memory error (with Xmx set to 1024 MB). Some approaches suggest converting HTML -> XHTML -> FO -> PDF. However, I am getting a lot of exceptions from the XHTML-to-FO XSL file, and judging from the documentation, it's not something I have enough time to understand right now.
Here are my questions/concerns.
1. Is there another cohesive API out there that will easily convert HTML to PDF (commercial or not)?
2. Is there a way I can simply capture an HTML page and store it as a single file? This approach would be similar to Internet Explorer's way of saving a web page as a web archive (single file, MHT format).
Any help is appreciated. (BTW, I know this question has been asked repeatedly, but in addition to the original spirit of the question, I'm open to other ways.) Thanks.
Try wkhtmltopdf, which uses WebKit. Another option (the one I'm currently using) is OpenOffice, remote controlled via macros.
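Since wkhtmltopdf is a command-line tool, one way to use it from Java is to launch it as an external process. The following is a minimal sketch of that idea; the class and method names are made up for this example, and it assumes the wkhtmltopdf binary is installed and on the PATH. It relies only on the tool's basic usage, wkhtmltopdf followed by the input URL or file and then the output file.

```java
import java.io.IOException;
import java.util.List;

// Sketch only: the class and method names are hypothetical.
class WkhtmltopdfSketch {

    // Basic command line: wkhtmltopdf <input URL or file> <output file>.
    static List<String> command(String source, String outputPdf) {
        return List.of("wkhtmltopdf", source, outputPdf);
    }

    // Runs the tool, forwarding its console output, and returns the exit code.
    // Throws IOException if the binary cannot be started (for example, not installed).
    static int convert(String source, String outputPdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(source, outputPdf))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

A call such as WkhtmltopdfSketch.convert("http://example.com", "example.pdf") would then write the PDF next to the running program; a non-zero return value should be treated as a failed conversion.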
You may use the iText open source Java lib for that, and read this
or use the YaHPConverter open source Java lib.
or do this with the help of icepdf, a popular open source lib
or use pd4ml, but it is not free, only a trial.
or use this, and this is man for it.
My 2 cents using open source tools:
You can capture a screenshot of the HTML page with Selenium or WebDriver and save it to an image file from your Java code. Once you have the image file, you can convert it to PDF, again from your Java code.
EDIT:
It seems you can do all that in one step using iText HTML to PDF.
I am not sure, but you could try:
1) Cobra HTML rendering engine http://lobobrowser.org/cobra.jsp
2) HTMLEditorKit, part of the JDK
3) JWebPane
Use the rendering kit to parse and render the HTML. The rendered output is a Swing component, which iText can use to generate the PDF file.
You can try out Pdfcrowd. It is an easy to use commercial online API with many options and with support for Java.
It can create PDF either from web pages or raw HTML code.

Extract text from PDF (google app engine)

Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps another way to extract text from PDF? I tried http://www.pdfdownload.org/; unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PDFBox does not run on GAE. It uses Java classes that are not allowed.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PDFBox (0.7.3) to be GAE compliant. Now I'm able to extract text from a PDF (whole page or rectangular area). I only modified the minimum needed for PDF text extraction, not the whole of PDFBox. :)
The idea was to remove references to java.awt.Rectangle and the like, using my own "rectangle" class.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-Snapshot) version to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.
Following a simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, as Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
but I've never tested it.
Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf would work with Java. Anyway, you can try this tool.
