How to extract text from pdf and doc file without downloading - java

I searched a lot before asking this question. I have a Java program that crawls web pages looking for .doc and .pdf files. It can download them, but a single .pdf or .doc can be 3-4 MB, which is a problem because there are millions of files. So I decided to extract their text without downloading the whole file. Basically, I need to read a PDF or DOC file online and download only its text, but I could not figure out how to do that. If necessary, I can provide my code.
Edit: This question can be closed now since I got the idea and (no) solution.
Thanks for the help.
And what's up with those downvotes on the question?

That is not possible. You can only start extracting the document once you download the bytes.
(Unless you also have control over the server, in which case you could do the extraction server-side and provide a .txt download link.)

Reading a file from a website on the Internet without downloading it is impossible.
If you have control of the server you could write a web service that can parse the files on demand and extract the parts you are interested in, which would then be sent to the client.
If not, and if you're up for a more challenging problem, you could write an HTTP client that starts downloading the file and parses it on the fly, downloading only as much as you need to extract the part(s) you need. This might or might not be feasible (or worthwhile) depending on where in the files the "interesting" bits are located. If they're close to the beginning in most cases then you might be able to reduce the download size significantly.
A detailed explanation of how to accomplish this is probably beyond the guidelines for StackOverflow answer length.
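For what it's worth, here is a minimal sketch of the "download only as much as you need" idea, using only the JDK. It asks the server for the first `maxBytes` bytes with an HTTP `Range` header and also caps the read on the client side, since a server is free to ignore `Range` and send the whole file. The class and method names (`PartialFetch`, `readPrefix`, `fetchPrefix`) are made up for illustration, and the format-specific parsing of the prefix is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of any library.
final class PartialFetch {

    /** Reads at most maxBytes from the stream and returns what was read. */
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // the stream ended before the cap was reached
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    /** Requests only the first maxBytes of the resource, then disconnects. */
    static byte[] fetchPrefix(String url, int maxBytes) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // A server that honours this answers 206 Partial Content;
        // one that ignores it answers 200 with the full body.
        conn.setRequestProperty("Range", "bytes=0-" + (maxBytes - 1));
        try (InputStream in = conn.getInputStream()) {
            return readPrefix(in, maxBytes); // the cap applies either way
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] head = fetchPrefix(args[0], 64 * 1024);
        System.out.println("Fetched " + head.length + " bytes");
    }
}
```

Whether the prefix is enough depends on the format. A PDF, for example, normally keeps its cross-reference table at the end of the file, so a parser may need bytes from the tail as well.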

Related

PDF Open Office or MS Word

I am new to Java. I have to read a PDF, OpenOffice, or MS Word file, make changes to it, and render it as a PDF document on my web page. Could someone tell me which of these formats has the easiest API or SDK to use, and which SDK is best for this, so that I can read, update, and render easily? The file also contains tables, but no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something first, then edit. Depending on the types of changes you want to make and the documents you are working from even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (bad idea), by converting to OpenOffice first and then editing, or, if DOCX, by unzipping and processing the XML.
Good luck.
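To illustrate the "unzip and process the XML" route: a DOCX file is a ZIP archive whose main body lives in the entry `word/document.xml`, and an ODT file keeps its body in `content.xml`. The sketch below only reads one entry out as a string, using the JDK's `java.util.zip`. The class and method names (`DocxXmlReader`, `readEntry`) are made up, and modifying the XML and writing the archive back is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of POI or any other library.
final class DocxXmlReader {

    /** Returns the named ZIP entry decoded as UTF-8, or null if it is absent. */
    static String readEntry(InputStream zipped, String entryName) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // readAllBytes stops at the end of the current entry
                    return new String(zin.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            // Use "content.xml" instead for an OpenOffice Writer (.odt) file.
            System.out.println(readEntry(in, "word/document.xml"));
        }
    }
}
```

The returned string can then be fed to any XML parser. For anything beyond simple text changes, a library such as POI is probably less painful than editing the XML by hand.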

Verifying integrity of documents

What are the steps to verify the integrity of these documents: doc, docx, docm, odt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm?
Or at least of some of them. Usually when they are uploaded to a content repository.
I guess that the InputStream is almost always (99.99%) read properly from a multipart HTTP request; otherwise an exception would be thrown and action taken. But a user can upload an already corrupted file. Do I use third-party libraries to check for that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, Apache POI or Tika.
There are many kinds of "corrupt".
Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.
Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading / writing document formats tend to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.
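As a rough illustration of the "easy to detect" case for ZIP-based formats (ODF files, and also DOCX/XLSX/PPTX), the sketch below streams through every entry with the JDK's `ZipInputStream`. Reading an entry to its end makes the JDK check the stored CRC-32, and a file cut off in the middle of an entry fails with an `IOException`. The class and method names (`ZipCheck`, `isReadableZip`) are made up, and passing this check says nothing about whether the XML inside is a valid document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

// Hypothetical helper: a basic readability check, not a format validator.
final class ZipCheck {

    /** Returns true if the stream holds at least one ZIP entry and all entries read cleanly. */
    static boolean isReadableZip(InputStream in) {
        int entries = 0;
        try (ZipInputStream zin = new ZipInputStream(in)) {
            byte[] buf = new byte[8192];
            while (zin.getNextEntry() != null) {
                // Drain the entry; a CRC mismatch or premature end throws here.
                while (zin.read(buf) != -1) {
                    // data is discarded, only readability matters
                }
                entries++;
            }
        } catch (IOException e) {
            return false; // ZipException and EOFException are both IOExceptions
        }
        // Data that is not a ZIP at all simply yields no entries.
        return entries > 0;
    }
}
```

A truncation that only removes the trailing central directory is not caught by this streaming check. For files on disk, opening them with `java.util.zip.ZipFile`, which reads that directory, is a stricter test.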
For all of the file formats listed above there are third-party libraries that can open them. I don't know of a "verification only" tool, but I think being able to open a file without exceptions is at least a basic check that it conforms to the specified format. One such (commercial) library is Aspose (not affiliated, just a happy customer).
You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
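A minimal sketch of the hashing side, assuming SHA-256 as the secure hash (the answer does not name one). It uses the JDK's `MessageDigest` and streams the input so large files are not held in memory. The class and method names (`FileHash`, `sha256Hex`) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; SHA-256 is an assumed choice of hash.
final class FileHash {

    /** Returns the SHA-256 digest of the stream as lowercase hex. */
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(sha256Hex(in));
        }
    }
}
```

Note that this only shows the file was not altered in transit or storage. A file that was already corrupt before hashing will still match its own checksum.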
Go check out the LibreOffice project (which already handles these formats). It has parts written in Java, and you could surely find and use its mechanisms for checking for corrupted files.
I think you can get the code from here:
http://www.libreoffice.org/get-involved/developers/

Upload a file from Flex/Flash interface to a Java server using a webservice

I would like to transfer a file from the Flex front end to a back-end Java web service. How can I achieve this?
Would a byte array be a good option for the transfer?
It would be appreciated if you can give a hint as to how to achieve the solution or point me in the right direction.
Note: the file is a small .jpg file, and I am new to Java
Have a look at http://www.adobe.com/devnet/flex/articles/file_upload.html
You can use the Flex FileReference class to upload a file to the server (see "Flex: Working with file upload and download").
On the server there is commonly a servlet that accepts the multipart request using Apache Commons FileUpload. "Servlet File Upload Example" is a useful example of such a servlet.
Hope that helps.
I have used a byte array to transfer files when I know they will be small. It can be a lot simpler to post them when dealing with https/cert issues, etc. that FileReference does not work well with. A FileReference upload is your other option (typical solution). Either way you'll use FileReference to select the file and then either use .upload to upload it or .load to load bytes in. Then you'll use .data to get the byte array. If your Jpg is coming from a snapshot taken from a flex component in memory, you'll need to work with a special Jpeg image encoder. I can tell you how to do that if that is what you are doing. Really beyond the scope of your original question, though.

How to read content of scanned pdf file in java / jsp or in javascript

How can I read the content of a scanned PDF file in Java/JSP or in JavaScript? Can you tell me how to achieve this in code?
Thanks in advance for any reply.
You can convert the scanned PDF to an image using Ghostscript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.
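A rough sketch of that pipeline driven from Java, assuming the `gs` and `tesseract` command-line tools are installed and on the PATH (on Windows the Ghostscript executable is usually named differently, e.g. `gswin64c`). It renders the pages to PNG at 300 dpi and then runs Tesseract on the first page only. The class and method names, the output file pattern, and the resolution are all illustrative choices.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper around the two command-line tools.
final class ScannedPdfOcr {

    /** Ghostscript command that renders each PDF page to a PNG image. */
    static List<String> ghostscriptCommand(String pdf, String outPattern, int dpi) {
        return List.of("gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
                "-r" + dpi, "-sOutputFile=" + outPattern, pdf);
    }

    /** Tesseract command that writes the recognised text to outBase + ".txt". */
    static List<String> tesseractCommand(String image, String outBase) {
        return List.of("tesseract", image, outBase);
    }

    /** Runs a command, forwarding its output, and fails on a non-zero exit code. */
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException(command.get(0) + " exited with code " + exit);
        }
    }

    public static void main(String[] args) throws Exception {
        // %03d is expanded by Ghostscript to the page number: page-001.png, ...
        run(ghostscriptCommand(args[0], "page-%03d.png", 300));
        run(tesseractCommand("page-001.png", "page-001")); // first page only
    }
}
```

For a multi-page document you would loop over the generated images. Scans of poor quality may need a higher resolution or some image clean-up before the OCR step.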
What you are trying to do (I think) is use OCR to extract text from an image PDF produced by a scanner. Java is probably the best choice for this. There are a number of options, depending on whether you are prepared to pay for software. Google for Java (or JavaScript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.
Actually, I am working on the same kind of project at the moment. I do it in the following steps, and the result works well:
1. The user uploads a scanned PDF to a PDFUploader servlet, which returns a server-side file name to the front end to indicate that the upload succeeded.
2. The front end uses this file name and the default page 0 to ask a PDFReader servlet to retrieve the first page of the PDF and display it. You can convert the page to an image, or use an iframe with an embedded PDF reader.
3. The front end uses the same file name and page 0 to ask an OCRServlet to perform OCR. I am using WeOCR and Tesseract as my OCR engine behind an Apache HTTP server. I have modified parts of submit.cgi in the WeOCR server, since I know which formats the server will receive.
I still have some problems converting the scanned PDF to an image (I am using PDFBox).
Google for anything OCR related.
Your best bet will be to use an existing library like http://asprise.com/product/ocr/index.php?lang=java

Unable to parse pdf by Jpedal

I'm facing a problem while parsing a PDF with Jpedal.
While reading the word list from Jpedal, I get garbled characters in it. This also happens when using OCR, and when I copy the text from the PDF and paste it into Word or a simple text editor. As far as I understand, this PDF was generated by the Quartz PDF context on Mac OS X 10.6.4, which is used to compress the file size, but it is easily viewable in PDF viewers. I searched for a Java API that supports decoding this kind of PDF but was unsuccessful. I'm looking for any application or Java API which I can use to decode it; it must be usable on a Linux machine.
Hi everybody,
I'm posting a possible solution to the problem. Here is a link describing how Quartz parses the PDF, which of course still needs to be implemented in code, because so far I haven't found any ready-made API for it. I believe Stack Overflow is all about taking the initiative and answering questions that have not been asked or answered before.
Regards,
Rituraj
