Wrap deflated data in gzip format - java

I think I'm missing something very simple. I have a byte array holding deflated data written into it using a Deflater:
deflate(outData, 0, BLOCK_SIZE, SYNC_FLUSH)
The reason I didn't just use GZIPOutputStream was that there is a variable number of threads (4 here), each given a block of data, and each thread compresses its own block before storing the compressed data into a global byte array. If I used GZIPOutputStream, it would mess up the format, because each little block would get a header and trailer and be its own gzip data (I only want to compress it).
So in the end, I've got this byte array, outData, that's holding all of my compressed data, but I'm not really sure how to wrap it. GZIPOutputStream writes from a buffer with uncompressed data, but this array is all set. It's already compressed, and I'm just hitting a wall trying to figure out how to get it into gzip form.
EDIT: Ok, bad wording on my part. I'm writing it to output, not a file, so that it could be redirected if needed. A really simple example is that
cat file.txt | java Jzip | gzip -d | cmp file.txt
should return 0. The problem right now is if I write this byte array as is to output, it's just "raw" compressed data. I think gzip needs all this extra information.
If there's an alternative method, that would be fine too. The whole reason it's like this is because I needed to use multiple threads. Otherwise I would just call GZIPOutputStream.
DOUBLE EDIT: Since the comments provide a lot of good insight, here is another method. I have a bunch of uncompressed blocks of data that were originally one long stream. If gzip can read concatenated streams, I could take those blocks (keeping them in order), give each one to a thread that calls GZIPOutputStream on its own block, then take the results and concatenate them. In essence, each block would then have a header, the compressed data, and a trailer. Would gzip recognize that if I concatenated them?
Example:
cat file.txt
Hello world! How are you? I'm ready to set fire to this assignment.
java Testcase < file.txt > file.txt.gz
So I accept it from input. Inside the program, the stream is split up into
"Hello world!" "How are you?" "I'm ready to set fire to this assignment" (they're not strings, it's just an array of bytes! this is just illustration)
So I've got these three blocks of bytes, all uncompressed. I give each of these blocks to a thread, which uses
public static class DGZIPOutputStream extends GZIPOutputStream
{
    public DGZIPOutputStream(OutputStream out, boolean flush) throws IOException
    {
        super(out, flush);
    }
    public void setDictionary(byte[] b)
    {
        def.setDictionary(b);
    }
    public void updateCRC(byte[] input)
    {
        crc.update(input);
    }
}
As you can see, the only thing here is that I've set the flush to SYNC_FLUSH so I can get the alignment right and have the ability to set the dictionary. If each thread were to use DGZIPOutputStream (which I've tested and it works for one long continuous input), and I concatenated those three blocks (now compressed each with a header and trailer), would gzip -d file.txt.gz work?
If that's too weird, ignore the dictionary completely. It doesn't really matter. I just added it in while I was at it.
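On the concatenation idea: the gzip format (RFC 1952) allows a file to consist of several complete members written back to back, and gzip -d decompresses them in sequence. The following is a minimal sketch that checks this from Java, assuming a reasonably recent JDK (readAllBytes needs Java 9+); the class and method names are made up for illustration, and plain GZIPOutputStream is used instead of the DGZIPOutputStream above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ConcatenatedGzipDemo {
    // Compress one block into a complete gzip member (header + deflate data + trailer).
    static byte[] gzipMember(byte[] block) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(block);
        }
        return bos.toByteArray();
    }

    // Concatenate the members in their original order.
    static byte[] concatMembers(String... blocks) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (String block : blocks) {
            all.write(gzipMember(block.getBytes(StandardCharsets.UTF_8)));
        }
        return all.toByteArray();
    }

    // Decompress every member in sequence and return the joined text.
    static String gunzipAll(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = concatMembers("Hello world! ", "How are you? ", "I'm ready.");
        System.out.println(gunzipAll(gz));
    }
}
```

In this layout each member is decoded independently, so a preset dictionary taken from the previous block would presumably have to be left out; the cost is a little compression ratio plus 18 bytes of header and trailer per block.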

If you set nowrap true when using the Deflater (sic) constructor, then the result is raw deflate. Otherwise it's zlib, and you would have to strip the zlib header and trailer. For the rest of the answer, I am assuming nowrap is true.
To wrap a complete, terminated deflate stream to be a gzip stream, you need to prepend ten bytes:
"\x1f\x8b\x08\0\0\0\0\0\0\xff"
(sorry -- C format, you'll need to convert to Java octal). You need to also append the four byte CRC in little endian order, followed by the four-byte total uncompressed length modulo 2^32, also in little endian order. Given what is available in the standard Java API, you'll need to compute the CRC in serial. It can't be done in parallel. zlib does have a function to combine separate CRCs that are computed in parallel, but that is not exposed in Java.
Note that I said a complete, terminated deflate stream. It takes some care to make one of those with parallel deflate tasks. You would need to make n-1 unterminated deflate streams and one final terminated deflate stream and concatenate those. The last one is made normally. The other n-1 need to be terminated using sync flush in order to end each on a byte boundary and to not mark it as the end of the stream. To do that, you use deflate with the flush parameter SYNC_FLUSH. Don't use finish() on those.
For better compression, you can use setDictionary on each chunk with the last 32K of the previous chunk.
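The steps above can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the class and method names are invented, the chunks are compressed one after another (in the real program each deflateChunk call would run on its own thread), no dictionary is set, and the CRC is computed serially as described.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

class RawDeflateToGzip {
    // The ten header bytes from the answer: magic, CM=8 (deflate), no flags,
    // no mtime, XFL=0, OS=0xff (unknown).
    static final byte[] HEADER = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff};

    // Compress one chunk as raw deflate (nowrap = true).
    // Non-final chunks end with a sync flush; only the last one is finished.
    static byte[] deflateChunk(byte[] chunk, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        def.setInput(chunk);
        if (last) {
            def.finish();
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
        } else {
            int n;
            do {
                // A short return means the flush completed on a byte boundary.
                n = def.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
                out.write(buf, 0, n);
            } while (n == buf.length);
        }
        def.end();
        return out.toByteArray();
    }

    // Write a 32-bit value in little-endian order.
    static void writeIntLE(ByteArrayOutputStream out, int v) {
        out.write(v);
        out.write(v >>> 8);
        out.write(v >>> 16);
        out.write(v >>> 24);
    }

    // Header + concatenated deflate chunks + CRC-32 + length modulo 2^32.
    static byte[] wrap(byte[][] chunks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER);
        CRC32 crc = new CRC32();
        long total = 0;
        for (int i = 0; i < chunks.length; i++) {
            out.write(deflateChunk(chunks[i], i == chunks.length - 1));
            crc.update(chunks[i]);   // serial CRC over the uncompressed data
            total += chunks[i].length;
        }
        writeIntLE(out, (int) crc.getValue());
        writeIntLE(out, (int) total);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[][] chunks = {
            "Hello world! ".getBytes(StandardCharsets.UTF_8),
            "How are you?".getBytes(StandardCharsets.UTF_8)
        };
        byte[] gz = wrap(chunks);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Writing the result of wrap to System.out instead of a byte array should give output that gzip -d accepts as a single gzip stream.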

If you are looking to write the outdata in a file, you may write as:
GZIPOutputStream outStream= new GZIPOutputStream(new FileOutputStream("fileName"));
outStream.write(outData, 0, outData.length);
outStream.close();
Or simply use java.io.FileOutputStream to write:
FileOutputStream outStream= new FileOutputStream("fileName");
outStream.write(outData, 0, outData.length);
outStream.close();

You just want to write a byte array - as is - to a file?
You can use Apache Commons:
FileOutputStream fos = new FileOutputStream("yourFilename");
fos.write(outData);
fos.close();
Or plain old Java:
// try-with-resources closes the stream exactly once, even if write() throws
try (BufferedOutputStream bs = new BufferedOutputStream(
        new FileOutputStream(new File("yourFilename")))) {
    bs.write(outData);
} catch (IOException e) {
    // please handle this
}

Related

Not a gzip format for an obvious gzip text in Java

I have been trying to implement decompression of text compressed in GZIP format.
Below is the method I implemented:
private byte[] decompress(String compressed) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteArrayInputStream in = new ByteArrayInputStream(compressed.getBytes(StandardCharsets.UTF_8));
    GZIPInputStream ungzip = new GZIPInputStream(in);
    byte[] buffer = new byte[256];
    int n;
    while ((n = ungzip.read(buffer)) >= 0) {
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}
And now I am testing the solution for following compressed text:
H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==
And it throws a "Not a gzip format" exception.
I tried different ways, but I still get this error. Does anyone have an idea what I am doing wrong?
That's not gzip formatted. In general, compressed cannot be a string (because compressed data is bytes, and a string isn't bytes. Some languages / tutorials / 1980s thinking conflate the 2, but it's the 2020s. We don't do that anymore. There are more characters than what's used in english).
It looks like perhaps the following has occurred:
Someone has some data.
They gzipped it.
They then turned the gzipped stream (which are bytes) into characters using Base64 encoding.
They sent it to you.
You now want to get back to the data.
Given that 2 transformations occurred (first, gzip it, then, base64 it), you need to also do 2 transformations, in reverse. You need to:
Take the input string, and de-base64 it, giving you bytes.
You then need to take these bytes and decompress them.
and now you have the original data back.
Thus:
byte[] gzipped = java.util.Base64.getDecoder().decode(compressed);
var in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
return in.readAllBytes();
Note:
Pushing the data from input to outputstream like this is a waste of resources and a bunch of finicky code. There is no need to write this; just call readAllBytes.
If the incoming Base64 is large, there are ways to do this in a streaming fashion. This would require that this method takes in a Reader (instead of a String which cannot be streamed), and would return an InputStream instead of a byte[]. Of course if the input is not particularly large, there is no need. The above approach is somewhat wasteful - both the base64-ed data, and the un-base64ed data, and the decompressed data is all in memory at the same time and you can't avoid this nor can the garbage collector collect any of this stuff in between (because the caller continues to ref that base64-ed string most likely).
In other words, if the compressed ratio is, say, 50%, and the total uncompressed data is 100MB in size, this method takes MORE than:
100MB (uncompressed ) + 50MB (compressed) + 50*4/3 = 67MB (compressed but base64ed) = ~ 217MB of memory.
You know better than we do how much heap your VM is running on, and how large the input data is likely to ever get.
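The streaming variant mentioned in the note above could look roughly like this. It is a sketch under assumptions: the class and method names are invented, and it takes an InputStream of Base64 text rather than a Reader, because Base64.Decoder.wrap operates on byte streams (Base64 text is plain ASCII, so nothing is lost).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

class Base64GzipStreaming {
    // Returns a stream of decompressed bytes. Base64 decoding and inflating
    // both happen lazily, as the caller reads from the returned stream.
    static InputStream open(InputStream base64Text) throws IOException {
        return new GZIPInputStream(Base64.getDecoder().wrap(base64Text));
    }

    public static void main(String[] args) throws IOException {
        // The sample string from the question.
        String sample = "H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==";
        InputStream source = new ByteArrayInputStream(sample.getBytes(StandardCharsets.US_ASCII));
        try (InputStream in = open(source)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Here main still reads everything into memory for demonstration; a caller with large data would instead consume the returned stream in pieces, so that only small buffers are held at any one time.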
NB: Base64 transfer is extremely inefficient, taking 4 bytes of base64 content for every 3 bytes of input data, and if the data transfer is in UTF-16, it's 8 bytes per 3, even. Ouch. Given that the content was GZipped, this feels a bit daft: First we painstakingly reduce the size of this thing, and then we casually inflate it by 33% for probably no good reason. You may want to check the 'pipe' that leads you to this, possibly you can just... eliminate the base64 aspect of this.
For example, if you have a wire protocol and someone decided that JSON was a good idea, then.. simply.. don't. JSON is not a good idea if you have the need to transfer a bunch of raw data. Use protobuf, or send a combination of JSON and blobs, etc.

how to write long (4byte) value to binary file in android

I'm writing a binary file header from java, and I had been using fixed values for the file size in the header. That was easy:
OutputStream os = new FileOutputStream(filename);
os.write(0x36);//LSB
os.write(0x10);
os.write(0x0E);
os.write(0x00);//MSB
But now I want to be more dynamic and write whatever size buffer I have to a file. So I might get the size of my array as say 4054; I want to take that and either break it apart and do four os.writes, or maybe there's a way to write it all at once.
OutputStream seems to only take one byte at a time, but I'd like to still use it as all the rest of my header code is already using it.
Use a ByteBuffer, so you can control whether it writes LSB or MSB first.
ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
buf.putInt((int) value); // putInt writes 4 bytes; putLong needs 8 and would overflow this buffer
os.write(buf.array());
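For comparison, here is a small self-contained sketch of both options from the question: one ByteBuffer write, and four single-byte writes with shifts. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianWriter {
    // Write a 32-bit value LSB first using a ByteBuffer.
    static void writeIntLE(OutputStream os, int value) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(value);
        os.write(buf.array());
    }

    // Same result with four single-byte writes; write(int) keeps only the low 8 bits.
    static void writeIntLEManual(OutputStream os, int value) throws IOException {
        os.write(value);          // LSB
        os.write(value >>> 8);
        os.write(value >>> 16);
        os.write(value >>> 24);   // MSB
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        writeIntLE(os, 4054);
        for (byte b : os.toByteArray()) {
            System.out.printf("0x%02X ", b & 0xff);
        }
        System.out.println();
    }
}
```

Either method can be dropped into existing header code that already works with an OutputStream.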

Is it possible to check whether a file (.gz) has been compressed more than once?

I was presented with the situation where a file with a proprietary format was compressed to a .gz, then subsequently renamed it back to its original extension and then compressed again. I would like to capture such scenario and wonder whether there is a way to detect when a file has been compressed twice.
I am reading the .gz files as follows:
GZIPInputStream gzip = new GZIPInputStream(Files.newInputStream(inFile));
BufferedReader breader = new BufferedReader(new InputStreamReader(gzip));
You can check for a valid gzip header within the file. A gzip file starts with a defined header whose first two bytes are 0x1f and 0x8b (see the gzip spec, RFC 1952). You can check these bytes to see if they match the header values:
try (InputStream is = new FileInputStream(new File(filePath))) {
    byte[] b = new byte[2];
    int n = is.read(b);
    if (n == 2 && b[0] == (byte) 0x1f && b[1] == (byte) 0x8b) {
        // 2-byte gzip header found
    } else {
        // not a gzip file
    }
}
These two bytes alone have an ~1/65k chance of randomly occurring, but depending upon the data you expect to receive can be enough to base your decision. To be more confident of the call you can read further into the header to be sure it follows valid spec values (see link above - eg third byte is typically but not always an 8 for DEFLATE compression, and so on...)
A brute-force way would be: uncompress the file and, if that works, try to uncompress the result again. If that works too, you know it was compressed at least twice. But in the worst case, the result could still be compressed once more.
And actually, I don't know of other ways to figure that out.
You see, in the end, compression is about changing the bytes of your file. So even when the second compression doesn't do much to the content of the file, it still changes some bytes. Just from looking at those bytes, you won't see what is going on.
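The two answers can be combined: peel off one gzip layer, then check whether the decompressed bytes again start with the gzip magic number. The sketch below assumes the data fits in memory and uses invented names (GzipLayerCounter, countLayers); for large files one would presumably read only the first two bytes of the decompressed stream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLayerCounter {
    // True if the data starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    // Count how many times the data can be gunzipped in a row.
    static int countLayers(byte[] data) {
        int layers = 0;
        while (looksLikeGzip(data)) {
            try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
                data = in.readAllBytes();
            } catch (IOException e) {
                break; // magic bytes matched by chance, or the stream is corrupt
            }
            layers++;
        }
        return layers;
    }

    // Helper used only to build test input.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "proprietary payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLayers(gzip(gzip(plain))));
    }
}
```

A result greater than 1 means the file was compressed more than once. A payload that legitimately begins with a valid gzip stream would be counted as an extra layer, which matches the false-positive caveat above.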

Data loss when writing bytes to a file

I'm working on a string compressor for a school assignment.
There's one bug that I can't seem to work out. The compressed data, represented by a byte array, is being written to a file using a FileWriter. The compression algorithm returns an input stream, so the data flows as such:
piped input stream
-> input stream reader
-> data stored in char buffer
-> data written to file with file writer.
Now, the bug is, that with some very specific strings, the second to last byte in the byte array is written wrong. and it's always the same bit values "11111100".
Every time it's this bit values and always the second to last byte.
Here are some samples from the code:
InputStream compress(InputStream){
//...
//...
PipedInputStream pin = new PipedInputStream();
PipedOutputStream pout = new PipedOutputStream(pin);
ObjectOutputStream oos = new ObjectOutputStream(pout);
oos.writeObject(someobject);
oos.flush();
DataOutputStream dos = new DataOutputStream(pout);
dos.writeFloat(//);
dos.writeShort(//);
dos.write(SomeBytes); // ---Here
dos.flush();
dos.close();
return pin;
}
void write(char[] cbuf, int off, int len){
//....
//....
InputStreamReader s = new InputStreamReader(
c.compress(new ByteArrayInputStream(str.getBytes())));
s.read(charbuffer);
out.write(charbuffer);
}
A string which triggers it is "hello and good evenin" for example.
I have tried to iterate over the byte array and write them one by one, it didn't help.
It's also worth noting that when I tried to write to a file using the output stream in the algorithm itself it worked fine. This design was not my choice btw.
So I'm not really sure what i'm doing wrong here.
Considering that you're saying:
Now, the bug is, that with some very specific strings, the second to
last byte in the byte array is written wrong. and it's always the same
bit values "11111100".
You are taking a
binary stream (the compressed data)
-> reading it as chars
-> then writing it as chars.
And you are converting bytes to chars without clearly defining the encoding.
I'd say that the problem is that your InputStreamReader is translating some byte sequences in a way that you're not expecting.
Remember that in encodings like utf-8 two or three bytes may become one single char.
It can't be a coincidence that the very byte pattern you pointed out (11111100) is one of the UTF-8 lead-byte patterns (1111110x). Check the UTF-8 table on Wikipedia and you'll see that UTF-8 decoding is destructive for arbitrary binary data: if a byte starts with 1111110x, the next must start with 10xxxxxx, and a sequence that breaks these rules does not survive the conversion.
Meaning that if using utf-8 to convert
bytes1[] -> chars[] -> bytes2[]
in some cases bytes2 will be different from bytes1.
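I recommend changing your code to remove those readers and work with byte streams only. Or, to test the theory, specify a single-byte encoding such as ISO-8859-1, which maps every byte to exactly one char, and see if that prevents the translations (plain ASCII is not enough, since it cannot represent bytes above 0x7F).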
I solved this by encoding and decoding the bytes with Base64.
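The lossy bytes -> chars -> bytes conversion described above can be reproduced in a few lines. This is only an illustration with made-up names; the three input bytes are arbitrary, chosen to include the 11111100 (0xFC) value from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Decode bytes to a String and encode them back with the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] binary = {0x12, (byte) 0xFC, 0x34}; // 0xFC is not valid UTF-8 on its own

        byte[] viaUtf8 = roundTrip(binary, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(binary, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 preserved bytes:      " + Arrays.equals(binary, viaUtf8));
        System.out.println("ISO-8859-1 preserved bytes: " + Arrays.equals(binary, viaLatin1));
    }
}
```

Binary data only survives a trip through chars when every byte value maps to a character and back, which is why byte streams, or an explicit text encoding such as Base64, are the safer choices.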

How to use Java DeflaterOutputStream

Edit: I really just need to know when a Deflater derived class decides to write header and footer data, and how to exploit those facts. I would really like to do the following:
Prime the dictionary for a Deflater derived class with some bytes (I think I got this).
Send some data to be compressed to the Deflater derived class (I think I got this).
Output all of that compressed data (WITH NO HEADER OR FOOTER DATA) to wherever I want (Not sure how to do this, it would also be okay to have both header/footer, or just one, just as long as it was consistent).
Reuse object by starting again at 1.
Original Q: I am using the Java DeflaterOutputStream to compress some data. I am also modifying this compressed data by modifying the headers and the footers. I would like to give some input to DeflaterOutputStream, and have it only output the compressed data part, not the header or footer of the gzip format. How might I do this?
So far, I have been trying to do something like this:
internalWriter.write(storage, 0, amountRead);
internalWriter.finish();
internalWriter.getDef().reset();
internalWriter here is an extension of DeflaterOutputStream. This outputs the compressed data with header and footer. However, on subsequent calls with the same object, it outputs compressed data and footer. I basically want only the compressed data, or perhaps the same thing to happen each time. Any ideas? A quick explanation of how compression streams use close,flush,finish, might help me out too, with a focus on when the header and footer are created and outputted.
And every time I use DeflaterOutputStream, I want it to output everything right away. That is why I did the finish right after the write, as seen above...
You can see good examples in Java Almanac
Compressing a Byte Array
Decompressing a Byte Array
--- EDIT ---
Let me try to help a little more. The book Java I/O by Elliotte Rusty Harold is perhaps the best reference I have found. You can get it from O'Reilly. I will provide you with some quotes and examples from the book.
Deflating Data
The Deflater class contains methods to compress blocks of data. You
can choose the compression format, the level of compression, and the
compression strategy. Deflating data with the Deflater class requires
nine steps:
1. Construct a Deflater object.
2. Choose the strategy (optional).
3. Set the compression level (optional).
4. Preset the dictionary (optional).
5. Set the input.
6. Deflate the data repeatedly until needsInput( ) returns true.
7. If more input is available, go back to step 5 to provide additional input data. Otherwise, go to step 8.
8. Finish the data.
9. If there are more streams to be deflated, reset the deflater.
More often than not, you don’t use this class directly. Instead, you
use a Deflater object indirectly through one of the compressing stream
classes like DeflaterInputStream or DeflaterOutputStream. These
classes provide more convenient programmer interfaces for
stream-oriented compression than the raw Deflater methods.
Inflating Data
1. Construct an Inflater object.
2. Set the input with the compressed data to be inflated.
3. Call needsDictionary( ) to determine if a preset dictionary is required.
4. If needsDictionary( ) returns true, call getAdler( ) to get the Adler-32 checksum of the dictionary. Then invoke setDictionary( ) to set the dictionary data.
5. Inflate the data repeatedly until inflate( ) returns 0.
6. If needsInput( ) returns true, go back to step 2 to provide additional input data.
7. The finished( ) method returns true.
Now, the book dedicates a whole chapter to compressing and decompressing data, and I do not think it possible to explain it all here. You'll have to do your part of the task and, if needed, come back with a narrower question.
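As a rough illustration of the quoted steps (not code from the book), here is a minimal round trip with the default zlib-wrapped format. The class name is invented, the optional steps are skipped, and all input is supplied at once, so the "more input" loops collapse to a single pass.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateInflateSteps {
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();       // construct (strategy, level, dictionary skipped)
        deflater.setInput(input);                 // set the input
        deflater.finish();                        // no more input will follow
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {            // deflate repeatedly
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();                           // or reset() to reuse it for another stream
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();       // construct
        try {
            inflater.setInput(compressed);        // set the compressed input
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {        // inflate repeatedly
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated input or missing dictionary");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] restored = inflate(deflate(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```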
See the deflater (sic) documentation. If nowrap is true, then there is no header or trailer generated -- just raw compressed data in the deflate format.
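To make the effect of nowrap concrete, the sketch below compresses the same input with and without it and reuses each Deflater via reset(). The names are hypothetical, and the sizes are printed rather than asserted, since they depend on the input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class NowrapDemo {
    // Compress one complete input with an existing Deflater.
    // reset() puts the deflater back into its initial state so it can be reused;
    // a preset dictionary, if any, would have to be set again after reset().
    static byte[] compress(Deflater deflater, byte[] input) {
        deflater.reset();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "some data, some data, some data".getBytes(StandardCharsets.UTF_8);

        Deflater zlib = new Deflater(Deflater.DEFAULT_COMPRESSION, false); // zlib header + Adler-32 trailer
        Deflater raw = new Deflater(Deflater.DEFAULT_COMPRESSION, true);   // nowrap: raw deflate only

        byte[] wrapped = compress(zlib, input);
        byte[] bare = compress(raw, input);
        System.out.printf("zlib: %d bytes, first byte 0x%02x%n", wrapped.length, wrapped[0] & 0xff);
        System.out.printf("raw:  %d bytes%n", bare.length);

        zlib.end();
        raw.end();
    }
}
```

Because every call starts from reset() and ends with finish(), repeated calls on the same object produce the same kind of output each time, which is the consistency the question asks for.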
It sounds like you want two streams: your destination stream, and a compressor stream that decorates it. Write the uncompressed header and footer directly to the destination stream, and write the data to be compressed to the decorator stream. Make sure that you finish (not just flush) the decorator before switching back; otherwise compressed bytes may still be sitting in the deflater when the footer is written. Reading will be a similar procedure, but you'll need to know where the compressed data begins and ends in your stream.
Suppose the destination stream is a file, something like the following pseudo code...
FileOutputStream dest = new FileOutputStream(foo);
DeflaterOutputStream decorator = new DeflaterOutputStream(dest);
byte[] header = getHeader();
byte[] body = getBody();
byte[] footer = getFooter();
dest.write(header);      // uncompressed
dest.flush();
decorator.write(body);   // compressed
decorator.finish();      // emits all remaining compressed data without closing dest
dest.write(footer);      // uncompressed
dest.close();
I wonder if DeflaterOutputStream is really what you want though. Isn't that part of a zip file? If you're doing something custom, it seems like you'd just want to gzip it.
