How to Sanitize PDF with a opensource Java tool ( like PDFBox)? - java

I am trying to enhance security of a file upload segment in Spring based web application.
It uses a antivirus to do virus screening before upload, However it is additionally required to sanitize the files or restrict the files with active contents(javascript, autoaction) from being uploaded .
Allowed file formats are XLS, DOCX, PDF along with few image formats.

For anyone who might need it later. This seems like a good starting point.
(Even though archived) DocBleach project has the implementation to detect malicious content and to sanitize the file. it also supports other office formats. It is build on top of pdfbox.
https://github.com/docbleach/DocBleach

Related

Converting docx/ODT to image using Java

I am working on a Spring REST service based web application (UI is based on HTML5, backbone.js). The actual requirement is, an uploaded document (could be any document like excel, word, ppt, pdf etc) requires an preview option using which an user can view the document in the browser (user may or may not have office installed).
My idea is to convert the documents into images and display them to the user. On searching, i found multiple ways to convert a PDF to image but not much ODT to image (Note: I am looking for an open source). JODConverter, docx4j can be used to convert the documents to pdf. Then I can convert these PDFs to images. But is this the right way. Is there any other efficient way to achieve the same. Please suggest and point me to the right direction.
Thanks in advance.
Gopi
Yes, you won't do any better than .docx to .pdf to image. You really need a stable workflow, and this is as good as you'll find for this purpose, unless you're running on a Microsoft server and you have access to the official Microsoft Office stuff.
For previews, docx4j or similar will do just fine. Not everything converts perfectly, but it should be fine for a preview.

PDF Open Office or MS Word

I am new to java, I have to read a PDF, Open Office or MS Word file and make changes in the file and render as PDF document on my web page. Please someone tell me which of these file's API or SDK is easy to use and also tell me best SDK for this. So I can read, Update and render easily. file also contains Table but there is no image.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something first, then edit. Depending on the types of changes you want to make and the documents you are working from even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping then modifying the XML or using the UNO API. Word documents can be editied by using MS Office automation (bad idea), converting to OpenOffice first then editing, or if DOCX, unzipping and processing the XML.
Good luck.

Verifying integrity of documents

What are the steps to verify integrity of these documents ? doc,docx,docm,odt,rtf,pdf,odf,odp,xls,xlsx,xlsm,ppt,pptm
Or at least of some of them. Usually when uploaded to a content repository.
I guess that inputStream is always 99,99% read properly from MultiPart http request otherwise exception would be thrown and action taken. But user can upload already corrupted file - do I use third party libraries for checking that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, apache poi or tika
There are many kinds of "corrupt".
Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.
Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading / writing document formats tend to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.
For all of the above listed file formats there are 3rd-party libraries which can open etc. - I don't know of a "verification only" but I think being able to open them without exceptions etc. is at least a basic check that the file is within the specified format... One such (commercial) library is Aspose - not affiliated, just a happy customer...
You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
Go to check LibreOffice project (that already handles these archives), it has parts written in Java, and for sure you could find and use their mecanisms to check for corrupted files.
I think you can get the code from here:
http://www.libreoffice.org/get-involved/developers/

How to read content of scanned pdf file in java / jsp or in javascript

How can i read content of scanned pdf file in java/jsp or in javascript, can you tell how to achieve this with developing code?
advance thanks for reply
You can convert the scanned PDF to a image using GhostScript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.
What you are trying to do (I think) is use OCR to extract text from a image PDF produced by a scanner. Java is probably the best for doing this. There are a number of options for doing this, depending on whether you are prepared to pay for software to do this. Google for Java (or Javascript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.
Actually, I am working on the same project at the moment, I am doing this in the following steps and the result works well.
User upload a scanned pdf to PDFUploader servlet, returns a server side file name to front end, which indicates upload is successful.
Front end uses this file name and default page 0 to ask PDFReader servlet to retrieve the first page of pdf file and display is at the front end, you can convert this pdf to a image for use an iframe to have the embedded pdf reader.
Front end uses this file name and default page 0 to ask OCRServlet to perform OCR. I am using WeOCR and tesseract as my OCR engine in an Apache http server. I have modified some parts of the submit.cgi in WeOCR server since I know what types of the format that the WeOCR server will receive. I still have some problems while I convert the scanned pdf to an image (I am using pdfbox )
Google for anything OCR related,
best bet will be to use existing libraries like http://asprise.com/product/ocr/index.php?lang=java

Extract text from PDF (google app engine)

Is there any free Java library for extracting text from PDF, that is compatible with Google Application Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps other way how to extract text from PDF? I tried http://www.pdfdownload.org/, unfortunately they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PdfBox does not run on GAE. It uses not-allowed java classes.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PdfBox (0.7.3) to be GAE complaiant. Now I'm able to extract text from PDF (whole page or rectangular area). I only modified a minumum part of the pdf text extraction and not the whole PdfBox. :)
The idea was to remove refences to java.awt.retangle & C. using my own "rectangle" class.
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-Snapshot) version to run on Google AppEngine. Had to disable one Unit-Test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach i had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, heres the compiled jar, dependencies for text extraction, and the patch. Note that it might not work for every usecase (i.e. rectangle based extraction). Used it to extract text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
Apache PDFBox is an open source Java
PDF library for working with PDF
documents. This project allows
creation of new PDF documents,
manipulation of existing documents and
the ability to extract content from
documents.
but I've never tested it.
Last month, I'd just finished extracting text from pdf file in my project. I used XPDF tool for getting text, and text coordinates, but I used it in Xcode (Objective-C). This tool was open source, written by C++, and able to be encoded in many language. However, I didn't know whether XPdf would be work on your java, or not. Anyway, You can try this tool.

Categories