I think there's a way to do this but I'm not sure how? Basically, I was writing a compression program that resulted in a crc error when I tried to unzip the compressed data. Normally this means that the decompressor actually recognized my data as being in the right format and decompressed it, but when it compared the result to the expected length as indicated by the CRC, they weren't the same.
However, for comparison reasons, I actually do want to see the output to see if it's just a concatenation issue (which should be relatively obvious if the decompressed output isn't gibberish but just in the wrong order).
You said "unzip", but the question says "gzip". Which is it? Those are two different programs that operate on two different formats. I will assume gzip. Also the length is not "indicated by the CRC". The gzip trailer contains a CRC and an uncompressed length (modulo 2^32), which are two different things.
The gzip command will decompress all valid deflate data and write it out before checking the crc. So if, for example, I take a .gz file and corrupt just the crc (or length) at the end, and do:
gzip -dc < corrupt.gz > result
then result will be the entire, correct uncompressed data stream. There is no need to modify and recompile gzip, nor to write your own ungzipper. gzip will complain about the crc, but all of the data will be written nevertheless.
As far as I'm aware, the CRC check is part of the GZIP wrapper, not part of the actual compressed data in DEFLATE format.
So you should be able to take literally just the bytes that are the compressed data stream, ignoring the GZIP header and CRC at the end, and pass it through an Inflater.
In other words, you need to take just the bytes corresponding to those referred to as "compressed blocks" in the GZIP File format specification and try to decompress using a Java Inflater object. A little bit of work but possibly less than re-compiling the GZIP code as Greg suggests (though his option would also work in principle).
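As a rough sketch of the Inflater approach described above: the code below skips the gzip header, feeds the rest to a raw (nowrap) Inflater, and never looks at the trailer, so a bad CRC or length cannot cause an error. The class and method names are made up for illustration, and it assumes the plain 10-byte header with no optional fields (FLG == 0), which is what Java's GZIPOutputStream writes; files with a stored file name or comment would need the header parsed properly.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

final class RawGzipInflate {
    // Inflates the DEFLATE payload of a gzip member without checking the trailer.
    // Assumption: fixed 10-byte header, no optional fields (FLG == 0).
    static byte[] inflateIgnoringTrailer(byte[] gz) throws DataFormatException {
        if (gz.length < 18 || (gz[0] & 0xff) != 0x1f || (gz[1] & 0xff) != 0x8b) {
            throw new IllegalArgumentException("not a gzip stream");
        }
        if (gz[3] != 0) {
            throw new IllegalArgumentException("optional header fields are not handled in this sketch");
        }
        Inflater inflater = new Inflater(true); // true = raw DEFLATE, no zlib wrapper
        try {
            // Skip the 10-byte header. The 8 trailer bytes (CRC-32, ISIZE) are passed in
            // too, but the inflater stops at the end of the DEFLATE data and ignores them.
            inflater.setInput(gz, 10, gz.length - 10);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: keep whatever was recovered so far
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```

If the DEFLATE data itself is malformed, inflate() throws DataFormatException, which at least tells you the problem is in the compressed blocks rather than in the trailer.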
I was presented with a situation where a file with a proprietary format was compressed to a .gz, then renamed back to its original extension and compressed again. I would like to catch such a scenario and wonder whether there is a way to detect when a file has been compressed twice.
I am reading the .gz files as follows:
GZIPInputStream gzip = new GZIPInputStream(Files.newInputStream(inFile));
BufferedReader breader = new BufferedReader(new InputStreamReader(gzip));
You can check for a valid gzip header within the file. A gzip file should start with a defined header whose first two bytes are 0x1f and 0x8b (see the gzip specification, RFC 1952). You can check these bytes to see if they match the header values:
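import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

// try-with-resources closes the stream once the check is done
try (InputStream is = new FileInputStream(new File(filePath))) {
    byte[] b = new byte[2];
    int n = is.read(b);
    if (n != 2) {
        // fewer than two bytes available: not a gzip file
    } else if ((b[0] == (byte) 0x1f) && (b[1] == (byte) 0x8b)) {
        // 2-byte gzip header found
    }
}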
These two bytes alone have a ~1/65k chance of occurring randomly, which, depending on the data you expect to receive, can be enough to base your decision on. To be more confident of the call, you can read further into the header to check that it follows valid spec values (see link above; e.g. the third byte is typically, but not always, an 8 for DEFLATE compression, and so on...)
A brute force way would be: uncompress the file, and if that works, try to uncompress the result again. If that works too, you know that it was compressed at least twice. But in the worst case, the result could still be compressed.
And actually, I don't know of other ways to figure that out.
You see, in the end, compression is about changing the bytes of your file. So even when the second compression doesn't do much to the content of the file, it still changes some bytes. So, just from looking at those bytes, you won't see what is going on.
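The brute-force idea above could look roughly like this. It is only a sketch: the class and method names are invented, it holds the whole file in memory, and it uses InputStream.readAllBytes(), so it assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

final class GzipLayers {
    // True if the data starts with the two gzip magic bytes.
    static boolean hasGzipMagic(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    // Returns how many times the data can be gunzipped in a row (0 = not gzip at all).
    static int countLayers(byte[] data) {
        int layers = 0;
        byte[] current = data;
        while (hasGzipMagic(current)) {
            try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(current))) {
                current = in.readAllBytes(); // Java 9+
                layers++;
            } catch (IOException e) {
                break; // magic bytes matched by chance, or the stream is corrupt
            }
        }
        return layers;
    }
}
```

A result of 2 or more means the file was compressed again after the first pass. A proprietary payload that legitimately starts with 0x1f 0x8b and happens to be a valid gzip stream would be miscounted, but that seems unlikely in practice.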
I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails "knowledge" of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, and the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering using it for an application that wants to read stuff byte-wise.
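To make the byte-wise approach concrete, here is a minimal sketch that reads one byte at a time and records each number in a java.util.BitSet. The class name, the RANGE constant, and the input format (ASCII decimal numbers of at most 7 digits, separated by any non-digit bytes such as newlines) are assumptions for illustration, not something taken from the book.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.BitSet;

final class SevenDigitBitmap {
    static final int RANGE = 10_000_000; // one bit per value 0000000..9999999, about 1.25 MB

    // Reads ASCII decimal numbers separated by non-digit bytes, one byte per read() call.
    // Assumption: no number in the input has more than 7 digits.
    static BitSet load(InputStream in) throws IOException {
        BitSet seen = new BitSet(RANGE);
        int value = 0;
        boolean inNumber = false;
        int b;
        while ((b = in.read()) != -1) {
            if (b >= '0' && b <= '9') {
                value = value * 10 + (b - '0');
                inNumber = true;
            } else if (inNumber) {
                seen.set(value); // a separator ends the current number
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) {
            seen.set(value); // last number when the file has no trailing newline
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(load(in).cardinality() + " distinct numbers");
        }
    }
}
```

Because the digits are plain ASCII, no character decoding is needed at all. The price of calling read() on a bare FileInputStream is one system call per byte, so this will be slow on large files; that is the trade-off for having no buffer.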
In Java, how would I convert a byte array (TCP packet payload from a pcap file) into some kind of HTTP object that I can use to get HTTP headers and content body?
One of the stupid lovely things about Java is a total lack of unsigned types. So, a good place to start would be taking your byte array and converting it into a short array to make sure that you don't have any rollover problems. (16 bits versus 8 bits per number).
From there, you could use a BufferedOutputStream to write your data to a file and parse it with one of the Java built-in XML readers, such as JaxB or DOM. BufferedOutputStream writes hex directly to a file, and can take an input of an int, byte, or short array. After you write it out, using the OutputStream it should be very simple to parse the HTML out of it.
If you need any help with any of these individual steps, I'd be happy to help.
EDIT: as maerics has pointed out, perhaps I didn't grasp what you were asking. Regardless, writing your byte array with a BufferedOutputStream is the way to go in my opinion, and I could still help you build a parser if you want.
JNetPcap can do exactly this.
Here are examples for
Opening a pcap file
Parsing http (in the example, we extract an image)
Drawback: parsing http in this library is deprecated*, but that doesn't mean it doesn't work
*I can't post anymore links without more reputation. Sorry. You can Google for "jnetpcap http deprecated"
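For what it's worth, if a whole HTTP message happens to sit in a single payload, the headers and body can also be pulled out with plain Java and no pcap library. The sketch below is hypothetical (RawHttpMessage is not an existing class) and makes strong assumptions: the byte array holds one complete message, header lines end in CRLF, and the body is neither chunked nor compressed. Real captures usually need TCP reassembly first.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class RawHttpMessage {
    final String startLine;                                      // request line or status line
    final Map<String, String> headers = new LinkedHashMap<>();  // names lower-cased
    final byte[] body;                                           // everything after the blank line

    RawHttpMessage(byte[] payload) {
        int end = indexOfHeaderEnd(payload);
        if (end < 0) {
            throw new IllegalArgumentException("no blank line after headers");
        }
        // ISO-8859-1 maps every byte to one char, so header bytes are never mangled
        String head = new String(payload, 0, end, StandardCharsets.ISO_8859_1);
        String[] lines = head.split("\r\n");
        startLine = lines[0];
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            lines[i].substring(colon + 1).trim());
            }
        }
        body = Arrays.copyOfRange(payload, end + 4, payload.length);
    }

    // Index of the first CRLF CRLF sequence, or -1 if there is none.
    private static int indexOfHeaderEnd(byte[] p) {
        for (int i = 0; i + 3 < p.length; i++) {
            if (p[i] == '\r' && p[i + 1] == '\n' && p[i + 2] == '\r' && p[i + 3] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

Usage would be something like new RawHttpMessage(payload).headers.get("content-type"). Repeated headers overwrite each other here, which is fine for a quick look but not for anything serious.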
Edit: I really just need to know when a Deflater derived class decides to write header and footer data, and how to exploit those facts. I would really like to do the following:
Prime the dictionary for a Deflater derived class with some bytes (I think I got this).
Send some data to be compressed to the Deflater derived class (I think I got this).
Output all of that compressed data (WITH NO HEADER OR FOOTER DATA) to wherever I want (Not sure how to do this, it would also be okay to have both header/footer, or just one, just as long as it was consistent).
Reuse object by starting again at 1.
Original Q: I am using the Java DeflaterOutputStream to compress some data. I am also modifying this compressed data by modifying the headers and the footers. I would like to give some input to DeflaterOutputStream, and have it only output the compressed data part, not the header or footer of the gzip format. How might I do this?
So far, I have been trying to do something like this:
internalWriter.write(storage, 0, amountRead);
internalWriter.finish();
internalWriter.getDef().reset();
internalWriter here is an extension of DeflaterOutputStream. This outputs the compressed data with header and footer. However, on subsequent calls with the same object, it outputs compressed data and footer. I basically want only the compressed data, or perhaps the same thing to happen each time. Any ideas? A quick explanation of how compression streams use close, flush, and finish might help me out too, with a focus on when the header and footer are created and output.
And every time I use DeflaterOutputStream, I want it to output everything right away. That is why I did the finish right after the write as seen above...
You can see good examples in Java Almanac
Compressing a Byte Array
Decompressing a Byte Array
--- EDIT ---
Let me try to help a little more. The book Java I/O by Elliotte Rusty Harold is perhaps the best reference I have found. You can get it from O'Reilly. I will provide you with some quotes and examples from the book.
Deflating Data
The Deflater class contains methods to compress blocks of data. You
can choose the compression format, the level of compression, and the
compression strategy. Deflating data with the Deflater class requires
nine steps:
Construct a Deflater object.
Choose the strategy (optional).
Set the compression level (optional).
Preset the dictionary (optional).
Set the input.
Deflate the data repeatedly until needsInput( ) returns true.
If more input is available, go back to step 5 to provide additional input data. Otherwise, go to step 8.
Finish the data.
If there are more streams to be deflated, reset the deflater.
More often than not, you don’t use this class directly. Instead, you
use a Deflater object indirectly through one of the compressing stream
classes like DeflaterInputStream or DeflaterOutputStream. These
classes provide more convenient programmer interfaces for
stream-oriented compression than the raw Deflater methods.
Inflating Data
Construct an Inflater object.
Set the input with the compressed data to be inflated.
Call needsDictionary( ) to determine if a preset dictionary is required.
If needsDictionary( ) returns true, call getAdler( ) to get the Adler-32 checksum of the dictionary. Then invoke setDictionary( ) to set the dictionary data.
Inflate the data repeatedly until inflate( ) returns 0.
If needsInput( ) returns true, go back to step 2 to provide additional input data.
The finished( ) method returns true.
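The two step lists quoted above map fairly directly onto code. This is my own minimal sketch, not an example from the book: the class name is invented, all input is supplied in one go (so finish( ) is called before the deflate loop rather than after it), and no preset dictionary is used.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ZlibRoundTrip {
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();               // construct a Deflater
        deflater.setStrategy(Deflater.DEFAULT_STRATEGY);  // choose the strategy (optional)
        deflater.setLevel(Deflater.BEST_COMPRESSION);     // set the level (optional)
        deflater.setInput(input);                         // set the input
        deflater.finish();                                // all input supplied, so finish now
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {                    // deflate repeatedly
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();                                   // release native memory
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();               // construct an Inflater
        try {
            inflater.setInput(compressed);                // set the compressed input
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {                // inflate repeatedly
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated input or missing preset dictionary");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }
}
```

With the default constructors both sides use the zlib wrapper (2-byte header plus Adler-32 trailer), so the output of deflate( ) here is not raw DEFLATE data.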
Now, the book dedicates a whole chapter to compressing and decompressing data, and I do not think it is possible to explain it all here. You'll have to do your part of the task and, if needed, come back with a narrower question.
See the deflater (sic) documentation. If nowrap is true, then there is no header or trailer generated -- just raw compressed data in the deflate format.
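A sketch of how that could be combined with a preset dictionary and object reuse, i.e. the four steps listed in the question. RawDeflateSession is a made-up name, and the sketch assumes each chunk should be an independent raw DEFLATE stream compressed against the same dictionary.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

final class RawDeflateSession {
    // nowrap = true: no zlib header and no Adler-32 trailer, just DEFLATE blocks
    private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    private final byte[] dictionary;

    RawDeflateSession(byte[] dictionary) {
        this.dictionary = dictionary.clone();
    }

    // Compresses one chunk to raw DEFLATE bytes; can be called again and again.
    byte[] compress(byte[] data) {
        deflater.reset();                    // discard state left over from the previous chunk
        deflater.setDictionary(dictionary);  // reset() drops the dictionary, so prime it again
        deflater.setInput(data);
        deflater.finish();                   // emit everything, including the final block
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Call once when the session is no longer needed.
    void close() {
        deflater.end();
    }
}
```

Since every call goes through the same reset, dictionary, input, finish sequence, the first and the later calls produce output of the same shape. The reader needs an Inflater created with nowrap set to true and the same dictionary supplied via setDictionary before inflating, because a raw stream has no header to announce that a dictionary is required.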
It sounds like you want to have two streams, your destination stream and then your compressor stream that decorates the destination stream. Then you'll write your uncompressed data to the base stream and the compressed data to the decorator stream. Make sure that you flush before switching. Reading will be a similar procedure, but you'll need to know where the compressed data begins and ends in your stream.
Suppose the destination stream is a file, something like the following pseudo code...
FileOutputStream dest = new FileOutputStream(foo);
DeflaterOutputStream decorator = new DeflaterOutputStream(dest);
byte[] header = getHeader();
byte[] body = getBody();
byte[] footer = getFooter();
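dest.write(header);
dest.flush();
decorator.write(body);
// finish() writes out all remaining compressed data and ends the deflate stream
// without closing dest; a plain flush() may leave data inside the deflater
decorator.finish();
dest.write(footer);
dest.close();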
I wonder if DeflaterOutputStream is really what you want though. Isn't that part of a zip file? If you're doing something custom, it seems like you'd just want to gzip it.
I read the following from Oracle:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea, say if I have a char array of 4M something long, is it possible for me to compress it to several hundreds of bytes using GZIPOutputStream, and then map the whole file into memory, and do random search on it by comparing bits? Say if I am looking for a char sequence of "abcd", could I somehow get the bit sequence of compressed version of "abcd", and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed, i.e. the only way to determine what a byte means is to read all the previous bytes.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relatively short section of data.
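A rough sketch of the sectioning idea, assuming fixed-size sections that are each gzipped on their own. ChunkedText and its methods are invented for illustration; it uses readAllBytes() (Java 9+), and it does not handle a surrogate pair that straddles a section boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ChunkedText {
    private final int chunkChars;                           // characters per section
    private final List<byte[]> chunks = new ArrayList<>();  // each section gzipped separately

    ChunkedText(String text, int chunkChars) throws IOException {
        this.chunkChars = chunkChars;
        for (int start = 0; start < text.length(); start += chunkChars) {
            String part = text.substring(start, Math.min(text.length(), start + chunkChars));
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(part.getBytes(StandardCharsets.UTF_16BE)); // two bytes per char
            }
            chunks.add(bos.toByteArray());
        }
    }

    // Decompresses only the section that contains the requested index.
    char charAt(int index) throws IOException {
        byte[] packed = chunks.get(index / chunkChars);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            String part = new String(in.readAllBytes(), StandardCharsets.UTF_16BE);
            return part.charAt(index % chunkChars);
        }
    }
}
```

Smaller sections mean less work per lookup but a worse compression ratio, since each section starts with an empty history. A search for something like "abcd" would also have to check matches that span two neighbouring sections.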