Verifying integrity of documents - java

What are the steps to verify the integrity of these document types: doc, docx, docm, odt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm?
Or at least some of them, typically when they are uploaded to a content repository.
I assume the InputStream is read correctly from the multipart HTTP request 99.99% of the time; otherwise an exception would be thrown and action taken. But a user can upload a file that is already corrupted. Should I use third-party libraries to check for that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, apache poi or tika.

There are many kinds of "corrupt".
Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.
Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading / writing document formats tend to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.
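For the ZIP-based formats (ODF files, and also docx/xlsx/pptx), the "easy to detect" case can be sketched with nothing but java.util.zip. This is a minimal sketch, not a full validator: the class and method names are made up for illustration, and it only shows that the container is readable and that each entry matches its recorded CRC-32. It says nothing about whether the XML inside is a valid document.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: checks only the ZIP container, not the document inside it.
final class ZipIntegrityCheck {

    /** Returns true if the archive opens and every entry matches its recorded CRC-32. */
    static boolean isIntactZip(File file) {
        // Opening a ZipFile reads the central directory at the end of the file,
        // so most truncated archives already fail at this point.
        try (ZipFile zip = new ZipFile(file)) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Decompress the whole entry and compute its checksum ourselves.
                CRC32 crc = new CRC32();
                try (InputStream in = zip.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        crc.update(buffer, 0, n);
                    }
                }
                // getCrc() returns -1 when the checksum is unknown.
                if (entry.getCrc() != -1 && entry.getCrc() != crc.getValue()) {
                    return false;
                }
            }
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // ZipException (an IOException) covers unreadable or truncated archives.
            return false;
        }
    }
}
```

An upload handler would have to spool the stream to a temporary file first, since ZipFile needs random access to the archive.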

For all of the file formats listed above there are third-party libraries that can open them. I don't know of a "verification only" tool, but I think being able to open a file without exceptions is at least a basic check that the file conforms to the specified format. One such (commercial) library is Aspose (not affiliated, just a happy customer).

You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
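A minimal sketch of the checksum idea using SHA-256 from the JDK's MessageDigest. The class and method names are illustrative only; how the checksum travels alongside the upload (a form field, a header, a separate request) is left open here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for computing a hex-encoded SHA-256 digest.
final class Sha256Checksum {

    /** Hashes a stream without loading it into memory all at once. */
    static String sha256Hex(InputStream in) throws IOException {
        MessageDigest digest = newDigest();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        return toHex(digest.digest());
    }

    /** Convenience overload for data that is already in memory. */
    static String sha256Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The client computes the digest before uploading, the server computes it again on the received bytes, and the two hex strings are compared. Note that this only detects changes in transit or storage; a file that was already corrupt before hashing will still match.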

Have a look at the LibreOffice project, which already handles these file types. Parts of it are written in Java, and you could probably find and reuse its mechanisms for detecting corrupted files.
I think you can get the code from here:
http://www.libreoffice.org/get-involved/developers/

Related

How to decompress a pkAES-256 Deflate encrypted zip file?

I need to unzip zip files with Java which are compressed and password protected, with the following properties:
Method: pkAES-256 Deflate
Characteristics: 0xD StrongCrypto : Encrypt StrongCrypto
I tried to use zip4j but it always gives me this stack trace:
net.lingala.zip4j.exception.ZipException: java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
at net.lingala.zip4j.tasks.AsyncZipTask.performTaskWithErrorHandling(AsyncZipTask.java:51)
at net.lingala.zip4j.tasks.AsyncZipTask.execute(AsyncZipTask.java:38)
at net.lingala.zip4j.ZipFile.extractFile(ZipFile.java:494)
at net.lingala.zip4j.ZipFile.extractFile(ZipFile.java:460)
at Main.main(Main.java:29)
Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid code lengths set
at net.lingala.zip4j.io.inputstream.InflaterInputStream.read(InflaterInputStream.java:55)
at net.lingala.zip4j.io.inputstream.ZipInputStream.read(ZipInputStream.java:141)
at net.lingala.zip4j.io.inputstream.ZipInputStream.read(ZipInputStream.java:121)
at net.lingala.zip4j.tasks.AbstractExtractFileTask.unzipFile(AbstractExtractFileTask.java:82)
at net.lingala.zip4j.tasks.AbstractExtractFileTask.extractFile(AbstractExtractFileTask.java:64)
at net.lingala.zip4j.tasks.ExtractFileTask.executeTask(ExtractFileTask.java:39)
at net.lingala.zip4j.tasks.ExtractFileTask.executeTask(ExtractFileTask.java:21)
at net.lingala.zip4j.tasks.AsyncZipTask.performTaskWithErrorHandling(AsyncZipTask.java:44)
... 4 more
Caused by: java.util.zip.DataFormatException: invalid code lengths set
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at net.lingala.zip4j.io.inputstream.InflaterInputStream.read(InflaterInputStream.java:45)
... 11 more
Does anybody know how to deal with this kind of encryption? I can only open these files with 7zip, but I need to do it with Java.
Thank you for your help.
The ZIP file format, at least the one that is universally understood and supported by tons of libraries, only supports one kind of encryption. It is called 'ZipCrypto' and it is of dubious quality: it's not completely broken, but it's rather easy to end up in a scenario where someone who shouldn't be able to read that zip file will figure it out. It is, for example, quite easy to try tons of passwords, so if the password is a simple dictionary word, it's mostly useless. This is the crypto you get when you run zip -e on the command line for just about every distribution of the 'zip' executable.
WinZip added, all on its own, an extension to the ZIP format called StrongCrypto which is AES-256 based. It sounds like you have that.
zip is more or less public domain. It's tricky: PKWare as a company more or less owns various parts of it, but nevertheless the /bin/unzip command in your Linux distro, for example, is fully open source, and the legal fate of zip is somewhat tricky to explain. So when WinZip, on its own, just added features to the zip concept, that was quite idiotic: neither the open source community at large nor PKWare would agree to this random flyby upgrade. For a long while, these 'WinZip based strongcrypto zip files that end in .zip' just weren't zip files, and if that's confusing, the blame falls entirely on WinZip, Inc.'s shoulders. What you have just isn't a zip file, even if it looks like one.
Since then, WinZip and PKWare have at least reached an agreement and can decrypt each other's stronger crypto offerings. However, the open source community has mostly washed its hands of this and doesn't consider these strongcrypto options to be 'zip files'. That explains why the library you have cannot decrypt this file, and probably never will.
Thus, because of this mess entirely due to PKWare and WinZip's shenanigans: if you want to encrypt a zip file, I STRONGLY suggest you don't use zip's built in stuff (neither ZipCrypto which is bad, nor StrongCrypto which is badly supported), but to just zip as normal with no encryption, and then encrypt the resulting file (and then don't name that file foo.zip, as it is no longer a zip file. foo.zip.enc would be a better name).
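As a rough sketch of the "zip as normal, then encrypt the resulting file" approach, the following uses only the JDK's javax.crypto classes. Everything specific in it is an assumption made for illustration: the class name, AES-256 in GCM mode, PBKDF2 for turning the password into a key, the iteration count, and the salt + IV + ciphertext layout of the output. It is not a standard container format, so both sides have to agree on it.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: encrypts the bytes of an ordinary, unencrypted zip file.
final class ZipThenEncrypt {
    private static final int SALT_LEN = 16;
    private static final int IV_LEN = 12;
    private static final int TAG_BITS = 128;
    private static final int ITERATIONS = 65_536; // assumed value, tune for your hardware

    /** Output layout (an assumption of this sketch): salt, then IV, then ciphertext with tag. */
    static byte[] encrypt(byte[] zipBytes, char[] password) throws GeneralSecurityException {
        SecureRandom random = new SecureRandom();
        byte[] salt = new byte[SALT_LEN];
        byte[] iv = new byte[IV_LEN];
        random.nextBytes(salt);
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(TAG_BITS, iv));
        byte[] cipherText = cipher.doFinal(zipBytes);
        return ByteBuffer.allocate(SALT_LEN + IV_LEN + cipherText.length)
                .put(salt).put(iv).put(cipherText).array();
    }

    /** Throws a GeneralSecurityException if the password is wrong or the data was altered. */
    static byte[] decrypt(byte[] encBytes, char[] password) throws GeneralSecurityException {
        ByteBuffer buf = ByteBuffer.wrap(encBytes);
        byte[] salt = new byte[SALT_LEN];
        byte[] iv = new byte[IV_LEN];
        buf.get(salt);
        buf.get(iv);
        byte[] cipherText = new byte[buf.remaining()];
        buf.get(cipherText);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(password, salt), new GCMParameterSpec(TAG_BITS, iv));
        return cipher.doFinal(cipherText);
    }

    private static SecretKey deriveKey(char[] password, byte[] salt) throws GeneralSecurityException {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, 256);
        byte[] keyBytes = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES");
    }
}
```

The result would be written out as something like foo.zip.enc. Because GCM is authenticated, decryption also acts as an integrity check on the archive.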
If you're stuck on this, and there is no possibility to change the format of the file being sent, you need 7zip. 7zip is open source and can probably decrypt this file, whereas most open source 'zip' libraries can't. A big problem is that there is no all-java 7zip impl that I'm aware of. There is the 7zip-binding project, which just farms out the work to a C library, which means you need a so-called 'native' file with your java project (a DLL on windows, a .SO file on linux, and a .JNILIB file on mac), and you need one such file for every architecture/OS combo you want to support. Kinda painful, it ruins the 'write once run anywhere' promise of java, but it's what you'd have to do. The site looks like it's old enough to order beers, but as far as I know it is being maintained, so there's that. But, seriously, don't use zip's built in encryption stuff, it sucks. Try to avoid it.
NB: The reason 7zip can do it is difference of opinion: the open source communities supporting plain zip endeavour to keep it simple to ensure as many platforms can do it, which is probably why there are various all-java zip impls around. 7zip tries to go for awesome support, at the cost of making it a lot harder to port 7zip around, which is probably why there isn't an all-java 7zip impl, only a binding. As a consequence, 7zip is willing to try to figure out how to decrypt this winzip stuff, plain zip isn't.

PDF Compression - HTML to PDF (wkhtmltopdf)

Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files are a bit complicated, but it should be noted that we can't make one big PDF off the bat and have to make the individual files first.
The issue
Initially we had no problems with this approach. However, as the system has grown, so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there are any obvious compression methods or techniques we could use.
There is a lot of shared content across the individual files but since we're just merging them as isolated/independent files (as in, none of the content that is the same across the individual files is being reused to save space) it doesn't make a difference in bringing down the file size.
If I manually ZIP the final PDF file it greatly reduces the file size, as obviously there is a lot of repeated content.
So one option might just be to zip the PDF after I've finished the merging, but I would prefer to compress it during the merger process or conversion process.
Any ideas?
You could try Sejda to merge. It's Java, open source and based on a fork of PDFBox. It can generate PDF files using object streams (which PDFBox did not support at the time of writing) and, in case that doesn't reduce the size much, you can pipe the result through its 'compress' task, which goes through the document removing unused resources and compressing images.
It's battle tested as the engine behind PDFsam, so if you want to give it a quick test and see the outcome, just download PDFsam, use the merge module with your files (with the compression flag on), and the result is what Sejda will generate.

How to extract text from pdf and doc file without downloading

I searched a lot before asking this question. I have a Java program which crawls some web pages looking for .doc and .pdf files. It can download them, but a single .pdf or .doc can take up 3-4 MB, which is a problem because there are millions of files. So I decided to extract their text without downloading the whole file. Basically, I need to view the PDF or DOC file online and download only its text, but I could not figure out how to do that. If necessary I can provide my code.
Edit: This question can be closed now since I got the idea and (no) solution.
Thanks for the help.
And what's up with those downvotes on the question?
That is not possible. You can only start extracting the document once you download the bytes.
(unless you also have control over the server, you could do the extraction server-side and provide a txt download link)
Reading a file from a website on the Internet without downloading it is impossible.
If you have control of the server you could write a web service that can parse the files on demand and extract the parts you are interested in, which would then be sent to the client.
If not, and if you're up for a more challenging problem, you could write an HTTP client that starts downloading the file and parses it on the fly, downloading only as much as you need to extract the part(s) you need. This might or might not be feasible (or worthwhile) depending on where in the files the "interesting" bits were located. If they're close to the beginning in most cases then you might be able to reduce the download size significantly.
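A very rough sketch of the "download only as much as you need" idea. It asks for just the first bytes of the resource with an HTTP Range header and, in case the server ignores that header, also stops reading once the limit is reached. The class and method names are made up for illustration, and the hard part (deciding how many bytes are enough and parsing a partial PDF or DOC) is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: fetches at most the first maxBytes bytes of a URL.
final class PartialDownload {

    static byte[] fetchPrefix(URL url, int maxBytes) throws IOException {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        URLConnection conn = url.openConnection();
        // Range is a request for partial content; a server is free to ignore it
        // and send the whole body, so the read loop below enforces the limit too.
        conn.setRequestProperty("Range", "bytes=0-" + (maxBytes - 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[4096];
            int remaining = maxBytes;
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n == -1) {
                    break; // resource was shorter than maxBytes
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        }
        return out.toByteArray();
    }
}
```

Whether this saves anything depends on the format. A PDF, for instance, keeps its cross-reference table at the end of the file, so the first few kilobytes alone are often not enough to locate the text.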
A detailed explanation of how to accomplish this is probably beyond the guidelines for StackOverflow answer length.

Validation of files based on their file extensions

I get files from queues in Java. They may be of following formats.
docx
pdf
doc
xls
xlsx
txt
rtf
After reading their extensions, I want to validate whether they are actually files of these types.
For example, I got a file and checked that it has extension .xls. Afterwards, I want to check whether it is actually an .xls file or someone uploaded file of some other format after changing its extension.
EDIT: I'd like to check the file's MIME type by actually checking its content, not its extension. How can it be done?
I don't think this is a problem you should be solving. Any solution to this problem would be brittle and based upon your current understanding of what constitutes a valid file of a particular type.
For example, take a XLS file. Do you know for sure what Excel accepts when opening such a file? Can you be sure you'll keep abreast of any changes in future releases that might support a different encoding style?
Ask yourself: what's the worst that could happen if the user uploads a file of the wrong type? Perhaps you'll pass the file to the application that handles that file extension and you'll get an error? Not a problem, just pass that to the user!
Without using external libraries:
You can get the file MIME type using MimetypesFileTypeMap (from the javax.activation package, which is no longer bundled with the JDK since Java 11):
File f = new File(...);
System.out.println(new MimetypesFileTypeMap().getContentType(f));
You can get a similar result with:
URLConnection.guessContentTypeFromName
Both these solutions, according to the documentation, look only at the extension.
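A better option: URLConnection.guessContentTypeFromStream
File f = new File(...);
// The stream must support mark/reset, so wrap the FileInputStream in a BufferedInputStream.
InputStream in = new BufferedInputStream(new FileInputStream(f));
System.out.println(URLConnection.guessContentTypeFromStream(in));
This tries to guess from the first bytes of the file. Be warned that this is only a guess: I found it works in most cases, but it fails to detect some obvious types and returns null when it cannot tell.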
I recommend a combination of both.
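For the formats listed in the question, a content-based check can also be sketched by comparing the first bytes of the file against well-known signatures. The class, enum and method names below are invented for this example. Note the limits: docx and xlsx are both ZIP containers, and doc and xls are both OLE2 containers, so this can tell "xls renamed to pdf" apart but not "docx renamed to xlsx", and plain text has no signature at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: guesses the container type from the leading "magic" bytes.
final class MagicBytes {

    enum Kind { PDF, ZIP_CONTAINER, OLE2_CONTAINER, RTF, UNKNOWN }

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** head should hold at least the first 8 bytes of the file. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, PDF)) return Kind.PDF;
        if (startsWith(head, ZIP)) return Kind.ZIP_CONTAINER;   // docx, xlsx, odt, ...
        if (startsWith(head, OLE2)) return Kind.OLE2_CONTAINER; // legacy doc, xls, ppt
        if (startsWith(head, RTF)) return Kind.RTF;
        return Kind.UNKNOWN;
    }

    /** Checks whether the sniffed kind is plausible for the claimed extension. */
    static boolean extensionMatches(String extension, Kind kind) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "pdf":  return kind == Kind.PDF;
            case "docx":
            case "xlsx": return kind == Kind.ZIP_CONTAINER;
            case "doc":
            case "xls":  return kind == Kind.OLE2_CONTAINER;
            case "rtf":  return kind == Kind.RTF;
            case "txt":  return kind == Kind.UNKNOWN; // no signature to check
            default:     return false;
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A file that fails this check has almost certainly been mislabeled; a file that passes may still be corrupt further in, so this is a cheap first filter rather than a validation.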

creating own file extension [duplicate]

This question already has answers here:
How to create my own file extension like .odt or .doc? [closed]
(3 answers)
Closed 8 years ago.
I'm developing a desktop application using NetBeans (Java Desktop Application) and I need to implement my own file format which is specific to that application only. I'm quite uncertain how to get started. What code should I use so that my Java application reads that file and opens it the way I want?
If it's character data, use Reader/Writer. If it's binary data, use InputStream/OutputStream. That's it. They are available in several flavors, like BufferedReader, which eases reading a text file line by line, and so on.
They're part of the Java IO API. Start learning it here: Java IO tutorial.
By the way, Java on its own really doesn't care about the file extension or format. It's the code logic you write that has to handle each character or byte of the file according to some file format specification (which you in turn have to write up first if you'd like to invent one yourself).
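To make that concrete, here is a minimal sketch of an invented binary format, read and written with DataInputStream/DataOutputStream. The format itself (a 4-byte magic number, a version, then a title and a body) and all names are made up for this example; the point is only that the "specification" is whatever your reading and writing code agree on.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical format: [int magic][int version][UTF title][UTF body]
final class NoteFile {
    private static final int MAGIC = 0x4E4F5445; // the ASCII bytes "NOTE"
    private static final int VERSION = 1;

    static void write(OutputStream out, String title, String body) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(MAGIC);
        data.writeInt(VERSION);
        // writeUTF is limited to 65535 encoded bytes per string; fine for a sketch.
        data.writeUTF(title);
        data.writeUTF(body);
        data.flush();
    }

    /** Returns {title, body}; rejects files that do not start with the expected header. */
    static String[] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != MAGIC) {
            throw new IOException("Not a NOTE file");
        }
        int version = data.readInt();
        if (version != VERSION) {
            throw new IOException("Unsupported version " + version);
        }
        String title = data.readUTF();
        String body = data.readUTF();
        return new String[] {title, body};
    }
}
```

In the application you would pass a FileOutputStream or FileInputStream for a file with whatever extension you choose. The leading magic number and version let the reader reject foreign files early and leave room to change the layout later.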
I am not sure this directly addresses your question, but since you mentioned a custom file format, it is worth noting that applications launched using Java Web Start can declare a file association. If the user double clicks one of those file types, the file name will be passed to the main(String[]) of the app.
This ability is used in the File Service demo. of the JNLP API - available at my site.
As to the exact format of the file & the best ways to load and save it, there are a large number of possibilities that can be narrowed down with more details of the information it contains.
Choosing a new or existing file extension does not affect your application (or anyone else's). It is up to the programmer which files the app reads.
For example, you cannot read a PDF or DOC file directly as plain text. That is not because the bytes are stored in some special way on disk, but because the file contains headers and structures that your app does not understand. So you would use a plugin or library that understands that structure (the grammar of the PDF/DOC file), strips it away, and tells your app what text (or anything else) the file contains.
So if you want your own extension, and specifically want no other application to be able to read it, write the data in a way that only your program understands. Writing the file in binary pretty much ensures that a user cannot read it just by opening it, but it is still possible to read it if it is merely a collection of raw characters.
If you are asking for code to hide data, there are plenty of algorithms you could use, usually filed under encryption because you are basically trying to lock or hide your content. If you do not care about all that fuss, only want to keep the file from being read directly, and a successful attempt to read it would not harm your application, write it in binary.
