Validation of files based on their file extensions - java

I get files from queues in Java. They may be in any of the following formats:
docx
pdf
doc
xls
xlsx
txt
rtf
After reading their extensions, I want to validate whether they are actually files of these types.
For example, suppose I receive a file and see that it has the extension .xls. I then want to check whether it is really an .xls file, or whether someone uploaded a file of some other format after changing its extension.
EDIT: I'd like to determine the file's MIME type by actually inspecting its content, not its extension. How can this be done?

I don't think this is a problem you should be solving. Any solution to this problem would be brittle and based upon your current understanding of what constitutes a valid file of a particular type.
For example, take an XLS file. Do you know for sure what Excel accepts when opening such a file? Can you be sure you'll keep abreast of any changes in future releases that might support a different encoding style?
Ask yourself: what's the worst that could happen if the user uploads a file of the wrong type? Perhaps you'll pass the file to the application that handles that file extension and you'll get an error? Not a problem, just pass that error on to the user!

Without using external libraries:
You can get the file's MIME type using MimetypesFileTypeMap (from javax.activation, which shipped with the JDK until it was removed in Java 11):
File f = new File(...);
System.out.println(new MimetypesFileTypeMap().getContentType(f));
You can get a similar result with:
URLConnection.guessContentTypeFromName
Both these solutions, according to the documentation, look only at the extension.
A better option: URLConnection.guessContentTypeFromStream
File f = new File(...);
try (InputStream in = new BufferedInputStream(new FileInputStream(f))) { // the stream must support mark/reset
    System.out.println(URLConnection.guessContentTypeFromStream(in));
}
This tries to guess from the first bytes of the file. Be warned that this is only a guess: I found it works in most cases, but it fails to detect some obvious types.
I recommend a combination of both.
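A minimal sketch of what such a combination could look like, assuming you already have the first bytes of the file in memory. The class name ContentTypeGuess and the method guess are made up for illustration; only the two URLConnection calls come from the JDK. The content-based guess wins when there is one, and the name-based guess is the fallback:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;

// Hypothetical helper, not part of any library.
final class ContentTypeGuess {

    /** Returns the content-based guess if there is one, otherwise the name-based guess (may be null). */
    static String guess(String fileName, byte[] head) {
        try {
            // ByteArrayInputStream supports mark/reset, which guessContentTypeFromStream requires.
            String fromContent = URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(head));
            return fromContent != null ? fromContent : URLConnection.guessContentTypeFromName(fileName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The eight-byte PNG signature, deliberately paired with a misleading .txt name.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(guess("fake.txt", png));
    }
}
```

If the two guesses disagree, you may prefer to reject the file rather than silently trust one of them; that depends on how strict the validation has to be.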

Related

How to read in a file extension and then perform a certain action, Java

I have made a simple text editor in Java; however, I need to know how to recognize a certain file extension and then perform a certain action. Specifically, I have two files containing Java and Python keywords. The user should be able to save the file as .java or .py, or open a .java or .py file, and the keywords should be shown in a different color from the rest of the text. I am confused as to how to read these extensions in.
It seems that the simple case is something like filename.endsWith(".java") to check whether the filename has a .java extension. There may be other ways to discover a file's type by MIME type, but I am afraid I do not remember them at the moment.
Edit: You should probably convert the string to lowercase first, so that the extension is recognized even if it is written in a different case, regardless of whether your filesystem is case-sensitive.
Edit2: Actually, I found that you can use Java 7 features to detect the file type. If you have a java.nio.file.Path object, you can do:
String type = Files.probeContentType(filepath);
The type returned is the mimetype, for example text/x-java for java source code. Not sure how good it is, and not sure how it recognizes things, but I have tested that it works on example java and python source files.
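Putting the two ideas together, here is a rough sketch. The class SourceLanguage and its method fromFileName are invented for this example; Files.probeContentType is the real JDK call, but what it returns depends on the platform and it may be null, so I would only treat it as a hint:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper for an editor that highlights Java and Python keywords.
final class SourceLanguage {

    /** Maps a file name to "java", "python", or "unknown" by its lower-cased extension. */
    static String fromFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".java")) return "java";
        if (lower.endsWith(".py")) return "python";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        // "Example.java" is only a placeholder default for the demo.
        Path file = Paths.get(args.length > 0 ? args[0] : "Example.java");
        System.out.println(fromFileName(file.getFileName().toString()));
        // Platform-dependent and possibly null, so no particular output is promised here.
        System.out.println(Files.probeContentType(file));
    }
}
```

The editor can then pick the keyword file based on the string returned by fromFileName.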

Java: what file extension should I use when writing objects with ObjectOutputStream

The API shows .tmp, my text book uses .dat and I've seen .ser
Does it matter?
I'm writing an arraylist of objects
Extensions don't matter. You could even use your name, nicolas, as the extension. Extensions exist so that the OS can associate files with a particular program.
You can use any custom extension. Extensions are meant to identify the default program for the file. For example, a file named abc.txt will be opened with a text program. You can change the extension of a video file to .txt and your computer will try to open it with a text-processing program. Hence, you can use any extension unless you want the file to be opened by a particular program.
Many people use .ser. You could use .bin, anything you like really, except the ones that indicate text: .txt, .doc, etc. It's binary, not text.
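To illustrate that the extension plays no role in serialization, here is a small sketch that writes an ArrayList to a temp file with an arbitrary extension and reads it back. The class name ListStore and the methods save and load are just names I picked for the example:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical helper; the extension of the target file is irrelevant to ObjectOutputStream.
final class ListStore {

    static void save(Path file, ArrayList<String> items) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(items);
        }
    }

    @SuppressWarnings("unchecked")
    static ArrayList<String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ArrayList<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Any suffix works, including one that is not a conventional extension at all.
        Path file = Files.createTempFile("names", ".nicolas");
        save(file, new ArrayList<>(Arrays.asList("a", "b")));
        System.out.println(load(file));
        Files.delete(file);
    }
}
```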

How to know file type without extension

While trying to come up with a servlet-based application to read files and manipulate them (image type conversion), the following question occurred to me:
Is it possible to inspect a file content and know the filetype?
Is there a standard that specifies that each file MUST provide some type of marker in their content so that the application will not have to rely on the file extension constraints?
Consider an application scenario:
I am creating an application that will be able to convert different file formats to a set of output formats. Say a user uploads a PDF; my application can suggest that the possible conversion formats are Microsoft Word, TIFF, JPEG, etc.
As my application will gradually support more file formats over time, I want it to inspect the input file instead of having the user specify the format, and then suggest the possible output formats to the user.
I understand this is an open-ended, broad question. Please let me know if it needs to be modified.
Thanks,
Ayusman
Yep, you can figure out the type without an extension by using the file's magic number.
Also, the file command actually figures it out through a three-step check:
1. Check filesystem properties to identify empty files, folders, etc.
2. Check the magic number
3. For text files, check which language the text is written in
Here's a library that'll help you with Magic Numbers: jmimemagic
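If you just want to see the idea without a library, a magic-number check can be sketched by hand. This is only an illustration with a handful of well-known signatures; the class MagicSniffer and its labels are invented here, and a real tool knows far more formats:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical hand-rolled sniffer covering only four signatures.
final class MagicSniffer {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] RTF = "{\\rtf".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Classifies the leading bytes of a file as "pdf", "rtf", "zip", "ole2", or "unknown". */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, RTF)) return "rtf";
        if (startsWith(head, ZIP)) return "zip";   // container used by docx, xlsx, odt, ...
        if (startsWith(head, OLE2)) return "ole2"; // container used by legacy doc, xls, ppt
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Note that a ZIP or OLE2 signature only identifies the container. Telling a docx from an xlsx, or a doc from an xls, requires looking inside the container, which is where a dedicated library earns its keep.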

Verifying integrity of documents

What are the steps to verify the integrity of these documents: doc, docx, docm, odt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm?
Or at least of some of them. Usually when they are uploaded to a content repository.
I guess that the InputStream from a multipart HTTP request is read properly 99.99% of the time; otherwise an exception would be thrown and action taken. But a user can upload an already corrupted file. Do I use third-party libraries to check for that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, Apache POI or Tika.
There are many kinds of "corrupt".
Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.
Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading / writing document formats tend to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.
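For the ZIP-based formats, the "fails when you attempt to open it" check can be sketched with the JDK's own ZIP reader. This is a rough illustration only: the class ZipContainerCheck is a name I made up, and passing this check says nothing about whether the XML inside is a valid document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a structural check of the ZIP container only.
final class ZipContainerCheck {

    /** Returns true if the file opens as a ZIP archive and every entry can be read to the end. */
    static boolean isReadable(Path file) {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                try (InputStream in = zip.getInputStream(entries.nextElement())) {
                    while (in.read(buffer) != -1) {
                        // Reading to the end forces the whole entry to be decompressed.
                    }
                }
            }
            return true;
        } catch (IOException e) {
            // ZipException is a subclass of IOException, so broken archives end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isReadable(Paths.get(args[0])));
    }
}
```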
For all of the file formats listed above there are third-party libraries that can open them. I don't know of a "verification only" library, but I think being able to open a file without exceptions is at least a basic check that it conforms to the specified format. One such (commercial) library is Aspose (not affiliated, just a happy customer).
You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
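A minimal sketch of the hashing side using the JDK's MessageDigest with SHA-256. The class name FileChecksum and the method sha256Hex are my own; how the checksum travels alongside the upload is left open:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for comparing a file before upload and after download.
final class FileChecksum {

    /** Returns the SHA-256 digest of the data as a lower-case hex string. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file into memory; for large files, feed the digest in chunks instead.
        System.out.println(sha256Hex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Keep in mind that this only detects changes made after the checksum was computed. A file that was already corrupt on the user's machine will hash consistently and pass.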
Take a look at the LibreOffice project, which already handles these file types. Parts of it are written in Java, and you could surely find and reuse its mechanisms for detecting corrupted files.
I think you can get the code from here:
http://www.libreoffice.org/get-involved/developers/

Biff exception in Java

When I tried to read an Excel file in Java, it threw a "biff exception".
What does this mean? I tried to Google it but wasn't able to find a proper explanation.
jxl.read.biff.BiffException: Unable to recognize OLE stream
at jxl.read.biff.CompoundFile.<init>(CompoundFile.java:116)
at jxl.read.biff.File.<init>(File.java:127)
at jxl.Workbook.getWorkbook(Workbook.java:221)
at jxl.Workbook.getWorkbook(Workbook.java:198)
at Com.Parsing.ExcelFile.excel(Extract.java:20)
at Com.Parsing.Extract.main(Extract.java:55)
I also faced a similar problem and was able to fix it.
I was using an .xlsx file, and when I changed it to an .xls file, it worked just fine. It seems JXL doesn't support the .xlsx format.
Please correct me if somebody knows that it does.
The Javadoc for BiffException says:
Exception thrown when reading a biff file.
This exception has a number of messages that should provide some information about the cause:
excelFileNotFound
excelFileTooBig
expectedGlobals
passwordProtected
streamNotFound
unrecognizedBiffVersion
unrecognizedOLEFile
Edit:
unrecognizedOLEFile seems to mean that something is embedded in the file that cannot be read.
An Excel workbook with several sheets (from BIFF5 on) is stored using the compound document file format (also known as “OLE2 storage file format” or “Microsoft Office compatible storage file format”). It contains several streams for different types of data.
A complete documentation of the format of compound document files can be found at
http://sc.openoffice.org/compdocfileformat.pdf
I think the exception means that your parsing library cannot recognise the file (for example, the BIFF5 format cannot be parsed by POI or JExcelApi).
You can check your file's version by opening it in Office and clicking 'Save As'; the format shown in the file dialog is the file's current version.
