detecting binary files and character encodings in zipfiles - java

When reading zip files from an unknown source (using Java's ZipInputStream or any other library), is there any way of detecting which entries are "character data" (and, if so, the encoding) or "binary data"? And, if binary, is there any way of determining more information (MIME type, etc.)?
EDIT: Does the byte order mark (BOM) occur in zip entries, and if so, do we have to handle it specially?

It basically boils down to heuristics for determining the contents of files. For instance, for text files (ASCII) it should be possible to make a fairly good guess by checking the range of byte values used in the file -- although this will never be completely fool-proof.
You should try to limit the classes of file types you want to identify, e.g. is it enough to discern between "text data" and "binary data"? If so, you should be able to get a fairly high success rate for detection. A rough sketch of such a text-versus-binary check is below.
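A minimal sketch of that heuristic applied to zip entries. The 8 KB sample size and the 5% control-character threshold are arbitrary choices, not anything standard, and the class name is made up; it also shows where a UTF-8 BOM check could go, since an entry will start with one if the original file did:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipEntrySniffer {

    // Rough heuristic: sample the first few KB of an entry and treat it as text
    // if it contains no NUL bytes and mostly printable/whitespace characters.
    static boolean looksLikeText(byte[] sample, int len) {
        if (len >= 3 && (sample[0] & 0xFF) == 0xEF
                && (sample[1] & 0xFF) == 0xBB && (sample[2] & 0xFF) == 0xBF) {
            return true; // UTF-8 BOM; be prepared to skip these three bytes when decoding
        }
        int suspicious = 0;
        for (int i = 0; i < len; i++) {
            int b = sample[i] & 0xFF;
            if (b == 0) return false;                             // NUL almost certainly means binary
            if (b < 0x09 || (b > 0x0D && b < 0x20)) suspicious++; // unusual control characters
        }
        return len == 0 || (double) suspicious / len < 0.05;
    }

    public static void main(String[] args) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(System.in)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) { zin.closeEntry(); continue; }
                int n = readUpTo(zin, buf);
                System.out.println(entry.getName() + " -> "
                        + (looksLikeText(buf, n) ? "text?" : "binary?"));
                zin.closeEntry();
            }
        }
    }

    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0, n;
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) > 0) {
            total += n;
        }
        return total;
    }
}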
For UNIX systems, there is always the file command which tries to identify file types based on (mostly) content.

Maybe implement a Java component that is capable of applying the rules defined in /usr/share/file/magic. I would love to have something like that. (You would basically have to be able to look at the first few bytes.) A much cruder hand-rolled version, checking only a handful of well-known signatures, is sketched below.
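For illustration only, a crude signature check against a few well-known magic numbers (the class name is mine; the byte signatures and MIME mappings are the usual published values, not read from the magic database):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class MagicSniffer {

    // A few well-known signatures; a real implementation would parse the magic database instead.
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    static String guessMimeType(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";
        if (startsWith(head, GIF)) return "image/gif";
        return "application/octet-stream"; // unknown: fall back to generic binary
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws Exception {
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            in.read(head);
        }
        System.out.println(guessMimeType(head));
    }
}

The JDK also ships java.net.URLConnection.guessContentTypeFromStream, which performs a similar sniff for a small set of formats and may be enough for simple cases.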

Related

how to compare a value's encoding of string type with a specific encoding in java?

I'm told to write code that takes a string of text and checks whether its encoding is equal to a specific encoding that we want. I've searched a lot, but I didn't find anything. I found a method (getEncoding()), but it only works with files, and that is not what I want. I'm also told that I should use the Java standard library, not Mozilla or Apache methods.
I really appreciate any help. Thanks in advance.
What you are thinking of is "Internationalization". There are libraries for this, like Loc4j, but you can also get this using java.util.Locale in Java. However, in general text is just text. It is a token with a certain value. No localization information is stored in the character. This is why a file normally provides the encoding in its header. A console or terminal can also provide localization using certain commands/functions.
Unless you know the source encoding and the token used, you have limited ability to guess what encoding was used on the other end. If you still want to do this, you will need to go into deeper areas such as cryptanalysis, where this kind of thing is usually done with statistical analysis. That in turn requires databases on the usage of different tokens, and depending on the quality of the text, the databases, and the algorithms, a certain amount of text is required. Special cases, like writing Swedish with e.g. a US encoding (using a for å and ä, or o for ö), will require more advanced analysis.
EDIT
Since I got a comment that encoding and internationalization are different things, I will add some remarks. It is possible to work with different encodings while working plainly with English (like some English special characters). It is also possible to work with encodings using, for example, Charset. However, for many applications that use different encodings it may still be efficient to use Locale, since that library can do a lot of operations on text in different encodings.
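If the goal is simply to test whether a given string can be represented in a specific charset, which is one possible reading of the question, a minimal sketch using only java.nio.charset from the standard library (class and method names here are mine, for illustration):

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {

    // True if every character of the text can be encoded in the given charset.
    static boolean isEncodableIn(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(text);
    }

    // True if the raw bytes decode and re-encode losslessly as the given charset;
    // this is about as close as you can get to "these bytes are in encoding X",
    // and it cannot distinguish charsets in which every byte sequence is valid.
    static boolean roundTrips(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        return Arrays.equals(decoded.getBytes(charset), data);
    }

    public static void main(String[] args) {
        System.out.println(isEncodableIn("håll", StandardCharsets.US_ASCII)); // false
        System.out.println(isEncodableIn("hall", StandardCharsets.US_ASCII)); // true
    }
}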
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but they didn't seem to work for me because I was thinking of getting the encoding directly and then comparing it with the specific one.
This is one of them
This is another one.

Apache FileUtils when comparing two identical pdfs keeps returning false

I am using FileUtils to compare two identical pdfs. This is the code:
boolean comparison = FileUtils.contentEquals(pdfFile1, pdfFile2);
Despite the fact that both pdf files are identical, I keep getting false. I also noticed that when I execute:
byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
System.out.println(byteArray);
System.out.println(byteArrayTwo);
I get the following bytecode for the two pdf files:
[B#3a56f631
[B#233d28e3
So even though both pdf files are absolutely identical visually, their byte-code is different and hence the boolean test fails. Is there any way to test whether two visually identical pdf files are identical?
Unfortunately for PDF there is a big difference between having "identical files" and having files that are "visually identical". So the first question is what you are looking for.
One very simple example, information in a PDF file can be compressed or not, and can be compressed with different compression filters. Taking a file where some of the content is not compressed, and compressing that content with a ZIP compression filter for example, would give you two files that are very different on a byte level, yet very much the same visually.
So you can do a number of different things to compare PDF files:
1) If you want to check whether you have "the same file", read them in and calculate some sort of checksum, as answered before by Peter Petrov.
2) If you want to know whether or not files are visually identical, the most common method is some kind of rendering. Render all pages to images and compare the images. In practice this is not as simple as it sounds, and there are both simple (for example callas pdfToolbox) and complex (for example Global Vision DigitalPage) applications that implement some kind of "sameness" algorithm (caution: I'm related to both of those vendors).
So define very well what exactly you need first, then choose carefully which approach would work best.
Yes, generate an MD5 sum for both files and see if these sums are identical. If they are, then your files are identical too, with a certainty which is practically 100%. If the sums are not identical, then your files are different for sure.
To generate the MD5 sums, on Linux there's an md5sum command; for Windows there's a small tool called fciv: http://www.microsoft.com/en-us/download/details.aspx?id=11533
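If you'd rather do the same check from Java instead of the command line, a small sketch using only java.security.MessageDigest (class name and argument handling are mine, for illustration):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Md5Compare {

    static byte[] md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // hash the file incrementally
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] a = md5Of(Paths.get(args[0]));
        byte[] b = md5Of(Paths.get(args[1]));
        System.out.println(java.util.Arrays.equals(a, b) ? "same content" : "different content");
    }
}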
Just to note, the two identifiers you wrote
[B#3a56f631
[B#233d28e3
are different because they belong to two different objects. They are the arrays' default toString() output (a type tag plus an identity hash code), not the file contents or "bytecode". Two objects can be logically equal even if they are not the same object (i.e. they have different identities).
Otherwise, calculating an MD5 checksum as peter.petrov wrote is a good idea.
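For completeness, a small sketch (class name and argument handling are mine) showing how to compare the actual contents of the two arrays rather than printing their identity hashes:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import org.apache.commons.io.FileUtils;

public class PdfByteCompare {
    public static void main(String[] args) throws IOException {
        File pdfFile1 = new File(args[0]);
        File pdfFile2 = new File(args[1]);
        byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
        byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
        // Arrays.equals compares the contents element by element;
        // println(byteArray) only prints the array's default toString().
        System.out.println(Arrays.equals(byteArray, byteArrayTwo));
    }
}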

Batch, Parse, and Convert Meta-Data from .1sc Files

TLDR: Questions are after the break.
I am looking to convert and store information from a large (3 TB) set of *.1sc images (Bio-Rad, Quantity One). In addition to holding the actual image, each file contains a good deal of information regarding where/how the image was taken (metadata). All of this seems to be held in the Intel Hex format (or at least they all open with "Stable File Version 2.0 Intel Format" in a hex editor).
The ImageJ plugin Bio-Formats can handle the image, and includes functionality in MetadataTools. To capture just the batch images, I had great success using the batchTiffconvert plugin. The metadata that seems to be available in ImageJ is incomplete for this format, but I'm not certain how to use MetadataTools (any good guide references would be appreciated; I'm currently going over the API).
My real problem isn't actually parsing the hex to find what I'm looking for. Where I'm failing is converting the hex into something meaningful. Example:
I can parse the hex for scan_area, but I haven't been able to convert 00 10 00 16 00 EC B5 86 00 into something meaningful.
Approaching this from the same direction as a similar DM3 question, I was able to make an XML file, but even when I wrote out the whole XML file, much of the metadata wasn't included (it did have things like the date stamp, which is good). I think this is because of the information passed to GelReader.java from BioRadReader.java, in particular this section:
if (getMetadataOptions().getMetadataLevel() != MetadataLevel.MINIMUM) {
  String units = firstIFD.getIFDStringValue(MD_FILE_UNITS);
  String lab = firstIFD.getIFDStringValue(MD_LAB_NAME);

  addGlobalMeta("Scale factor", scale);
  addGlobalMeta("Lab name", lab);
  addGlobalMeta("Sample info", info);
  addGlobalMeta("Date prepared", prepDate);
  addGlobalMeta("Time prepared", prepTime);
  addGlobalMeta("File units", units);
  addGlobalMeta("Data format",
    fmt == SQUARE_ROOT ? "square root" : "linear");
}
This is because the MetadataLevel set in all the Bio-Rad scripts is MetadataLevel.MINIMUM. I tried adding the additional metadata I wanted here, but again it couldn't be converted/decoded usefully.
Is it possible to retrieve more of the metadata using this system? If so, am I working in the right section of code? The source for Bio-Formats is quite large, and I won't even pretend to have a good grasp of it (though I'm trying). Am I just running into a proprietary-format problem? Can anyone tell me how to convert the hex values, or point me to a resource that explains it?
First of all: note that neither of the sources you linked above actually corresponds to the .1sc file format reader of Bio-Formats. You want the BioRadGelReader.
The Bio-Formats library parses three types of metadata. From the About Bio-Formats page:
There are three types of metadata in Bio-Formats, which we call core metadata, original metadata, and OME metadata.
Core metadata only includes things necessary to understand the basic structure of the pixels: image resolution; number of focal planes, time points, channels, and other dimensional axes; byte order; dimension order; color arrangement (RGB, indexed color or separate channels); and thumbnail resolution.
Original metadata is information specific to a particular file format. These fields are key/value pairs in the original format, with no guarantee of cross-format naming consistency or compatibility. Nomenclature often differs between formats, as each vendor is free to use their own terminology.
OME metadata is information from the first two categories converted by Bio-Formats into the OME data model. Performing this conversion is the primary purpose of Bio-Formats. Bio-Formats uses its ability to convert proprietary metadata into OME-XML as part of its integration with the OME and OMERO servers; essentially, they are able to populate their databases in a structured way because Bio-Formats sorts the metadata into the proper places. This conversion is nowhere near complete or bug free, but we are constantly working to improve it. We would greatly appreciate any and all input from users concerning missing or improperly converted metadata fields.
The Bio-Formats command line tools are capable of dumping all original metadata key/value pairs for a given dataset, as well as the converted OME-XML.
In your case, if what you want is quantity over quality, you probably want to record all the original metadata somehow. The showinf command line tool does that automatically (you actually have to pass the -nometa flag to suppress it).
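If you'd rather do this from Java than with showinf, a rough sketch using the Bio-Formats reader API (class names are from the loci.formats packages as I remember them; check them against the version you're actually using):

import java.util.Hashtable;
import loci.formats.ImageReader;
import loci.formats.in.DefaultMetadataOptions;
import loci.formats.in.MetadataLevel;

public class DumpOriginalMetadata {
    public static void main(String[] args) throws Exception {
        ImageReader reader = new ImageReader();
        // Ask for everything the reader is willing to parse, not just MINIMUM.
        reader.setMetadataOptions(new DefaultMetadataOptions(MetadataLevel.ALL));
        reader.setId(args[0]); // path to the .1sc file
        Hashtable<String, Object> meta = reader.getGlobalMetadata();
        for (String key : meta.keySet()) {
            System.out.println(key + " = " + meta.get(key));
        }
        reader.close();
    }
}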
If you look over the complete list of original metadata key/value pairs and the information you seek is still not there, then we'd have to go to the next level and improve the BioRadGelReader to parse more metadata.
Unfortunately, inspecting the source code, it looks like essentially nothing is parsed into the original metadata table for that file format. It was likely reverse engineered, since the Bio-Rad Gel format page says that we do not have a specification document for it.
So what that means is that the Bio-Formats developers are as clueless about the file structure as you are, and would do the same thing you are doing: stare at a hex editor and try to figure things out. Some tricks include:
Look up metadata values using the official Bio-Rad software, then search for those values in various encodings using your hex editor.
Edit one metadata value (if possible) using the official Bio-Rad software—or by doing multiple acquisitions as similarly as possible except for one variable—then diff the output files to see what effect changing that value had.
Check whether the first few hundred bytes match a known pattern for container formats such as Microsoft OLE-based data, TIFF-based data, or HDF-based data, since many formats reuse these general container structures (a rough signature check is sketched after this list).
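A rough sketch of that last check, hard-coding the commonly published signatures for those three container families (class name is mine; verify the byte values before relying on them):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ContainerSniff {

    // Published signatures: OLE compound file, TIFF (little/big endian), HDF5.
    private static final byte[] OLE = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] TIFF_LE = {'I', 'I', 0x2A, 0x00};
    private static final byte[] TIFF_BE = {'M', 'M', 0x00, 0x2A};
    private static final byte[] HDF5 = {(byte) 0x89, 'H', 'D', 'F', 0x0D, 0x0A, 0x1A, 0x0A};

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            in.read(head);
        }
        if (matches(head, OLE)) System.out.println("looks like an OLE compound file");
        else if (matches(head, TIFF_LE) || matches(head, TIFF_BE)) System.out.println("looks like TIFF-based data");
        else if (matches(head, HDF5)) System.out.println("looks like HDF5-based data");
        else System.out.println("no known container signature");
    }

    private static boolean matches(byte[] data, byte[] sig) {
        for (int i = 0; i < sig.length; i++) {
            if (i >= data.length || data[i] != sig[i]) return false;
        }
        return true;
    }
}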
You could also email Bio-Rad to ask whether they are willing to send a spec, and if so, use it to improve the file format reader, and/or forward it on to the Bio-Formats developers.

Writing a file in binary or bytecode

I am storing large amounts of information inside of text files that are written via java. I have two questions relating to this:
Is there any efficiency boost to writing in binary or bytecode over Strings?
What would I use to write the data types into a file?
I already have a setup based around Strings, but I want to compare and at least know how to write the file in bytecode or binary.
When I read the file back in, it will be translated into Strings again, but according to my research, if I write the file straight into binary it removes the translation step between Strings and bytes on both ends, both when writing the file and when reading it.
cHao has a good point about just using Strings anyway, but I am still interested in how to write varied data types to the file.
In other words, can I still use FileReader and BufferedReader to read and translate back to Strings, or is there something else to use? Also, for writing binary, is it still the FileWriter class that I use?
If you want to write it in "binary" and you want to save space, why not just zip it using the JDK? That meets all your requirements.
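A minimal sketch of that idea, combining java.util.zip.GZIPOutputStream for compression with DataOutputStream/DataInputStream for typed binary values (the file name and the particular fields written are just examples):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BinaryGzipExample {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data.bin.gz"); // example file name

        // Write typed values straight to a compressed binary file.
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(42);
            out.writeDouble(3.14);
            out.writeUTF("some label"); // length-prefixed modified UTF-8 string
        }

        // Read the values back in exactly the order they were written.
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(Files.newInputStream(file)))) {
            int anInt = in.readInt();
            double aDouble = in.readDouble();
            String aString = in.readUTF();
            System.out.println(anInt + " " + aDouble + " " + aString);
        }
    }
}

The key point is that the reads have to mirror the writes exactly; a binary format like this is not self-describing the way text is, so it's only worth the trouble if the String parsing actually shows up as a cost in your measurements.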

File upload-download in its actual format

I have to write code to upload/download a file to/from a remote machine. But when I upload the file, newlines are not saved, and some binary characters are inserted automatically. Also, I'm not able to save the file in its actual format; I have to save it as "filename.ser". I'm using Java's serialization/deserialization mechanism.
Thanks in advance.
Thanks in advance.
How exactly are you transmitting the files? If you're using implementations of InputStream and OutputStream, they work on a byte-by-byte level so you should end up with a binary-equal output.
If you're using implementations of Reader and Writer, they convert the bytes to characters according to some character mapping, and then perform the reverse process when saving. Depending on the platform encodings of the various machines (and possibly other effects if you're not specifying the charset explicitly), you could well end up with differences in the binary file.
The fact that you mention newlines makes me think that you're using Readers to send strings (and possibly that you're stitching the strings back together yourself by manually adding newlines). If you want the files to be binary equal, then send them as a stream of bytes and store that stream verbatim. If you want them to be equal as strings in a given character set, then use Readers and Writers but specify the character set explicitly. If you want them to be transmitted as strings in the platform default set (not very useful), then accept that they're not going to be binary equal as files.
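A minimal sketch of the byte-stream approach (the class name is mine and the demo copies local files; the same copy method works with whatever socket streams your transfer code already has):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamCopy {

    // Copies bytes verbatim; no character decoding, so newlines and binary
    // content pass through untouched.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            copy(in, out);
        }
    }
}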
(Also, your question really doesn't provide much information to solve it. To me, it basically reads "I wrote some code to do X, and it doesn't work. Where did I go wrong?" You seem to assume that your code is correct by not listing it, but at the same time recognise that it's not...)
