Edit: I really just need to know when a Deflater derived class decides to write header and footer data, and how to exploit those facts. I would really like to do the following:
1. Prime the dictionary for a Deflater derived class with some bytes (I think I got this).
2. Send some data to be compressed to the Deflater derived class (I think I got this).
3. Output all of that compressed data (WITH NO HEADER OR FOOTER DATA) to wherever I want (not sure how to do this; it would also be okay to have both header and footer, or just one, as long as it was consistent).
4. Reuse the object by starting again at step 1.
Original Q: I am using the Java DeflaterOutputStream to compress some data. I am also modifying the headers and footers of this compressed data. I would like to give some input to DeflaterOutputStream and have it output only the compressed data part, not the header or footer of the gzip format. How might I do this?
So far, I have been trying to do something like this:
internalWriter.write(storage, 0, amountRead);
internalWriter.finish();
internalWriter.getDef().reset();
internalWriter here is an extension of DeflaterOutputStream. This outputs the compressed data with header and footer. However, on subsequent calls with the same object, it outputs compressed data and footer only. I basically want only the compressed data, or at least the same thing to happen each time. Any ideas? A quick explanation of how compression streams use close, flush, and finish might help me out too, with a focus on when the header and footer are created and output.
And every time I use DeflaterOutputStream, I want it to output everything right away. That is why I called finish right after the write, as seen above...
You can see good examples in the Java Almanac:
Compressing a Byte Array
Decompressing a Byte Array
--- EDIT ---
Let me try to help a little more. The book Java I/O by Elliotte Rusty Harold is perhaps the best reference I have found. You can get it from O'Reilly. I will provide you with some quotes and examples from the book.
Deflating Data
The Deflater class contains methods to compress blocks of data. You
can choose the compression format, the level of compression, and the
compression strategy. Deflating data with the Deflater class requires
nine steps:
1. Construct a Deflater object.
2. Choose the strategy (optional).
3. Set the compression level (optional).
4. Preset the dictionary (optional).
5. Set the input.
6. Deflate the data repeatedly until needsInput() returns true.
7. If more input is available, go back to step 5 to provide additional input data. Otherwise, go to step 8.
8. Finish the data.
9. If there are more streams to be deflated, reset the deflater.
More often than not, you don’t use this class directly. Instead, you
use a Deflater object indirectly through one of the compressing stream
classes like DeflaterInputStream or DeflaterOutputStream. These
classes provide more convenient programmer interfaces for
stream-oriented compression than the raw Deflater methods.
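To make the book's nine steps concrete, here is a minimal sketch of the raw Deflater API (the input and dictionary bytes are made up for illustration):

byte[] dictionary = "example dictionary".getBytes(); // illustrative
byte[] data = "example data to compress".getBytes(); // illustrative
Deflater deflater = new Deflater();                  // step 1
deflater.setStrategy(Deflater.DEFAULT_STRATEGY);     // step 2 (optional)
deflater.setLevel(Deflater.BEST_COMPRESSION);        // step 3 (optional)
deflater.setDictionary(dictionary);                  // step 4 (optional)
deflater.setInput(data);                             // step 5
ByteArrayOutputStream compressed = new ByteArrayOutputStream();
byte[] buf = new byte[1024];
while (!deflater.needsInput()) {                     // step 6
    compressed.write(buf, 0, deflater.deflate(buf));
}
// step 7: if more input were available, call setInput() again and repeat step 6
deflater.finish();                                   // step 8
while (!deflater.finished()) {
    compressed.write(buf, 0, deflater.deflate(buf));
}
deflater.reset();                                    // step 9: ready for the next stream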
Inflating Data
1. Construct an Inflater object.
2. Set the input with the compressed data to be inflated.
3. Call needsDictionary() to determine if a preset dictionary is required.
4. If needsDictionary() returns true, call getAdler() to get the Adler-32 checksum of the dictionary. Then invoke setDictionary() to set the dictionary data.
5. Inflate the data repeatedly until inflate() returns 0.
6. If needsInput() returns true, go back to step 2 to provide additional input data.
7. The finished() method returns true.
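A corresponding sketch for the inflating steps (assuming compressed and dictionary are the byte arrays from the deflating sketch above; note that inflate() throws the checked DataFormatException):

Inflater inflater = new Inflater();                  // step 1
inflater.setInput(compressed.toByteArray());         // step 2
ByteArrayOutputStream restored = new ByteArrayOutputStream();
byte[] out = new byte[1024];
while (!inflater.finished()) {                       // step 7 ends the loop
    int n = inflater.inflate(out);                   // step 5
    if (n > 0) {
        restored.write(out, 0, n);
    } else if (inflater.needsDictionary()) {         // steps 3 and 4
        inflater.setDictionary(dictionary);          // must match the deflater's dictionary
    } else if (inflater.needsInput()) {
        break;                                       // step 6: supply more input here
    }
}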
Now, the book dedicates a whole chapter to compressing and decompressing data, and I do not think it possible to explain it all here. You'll have to do your part of the task and, if needed, come back with a narrower question.
See the deflater (sic) documentation. If nowrap is true, then there is no header or trailer generated -- just raw compressed data in the deflate format.
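A minimal sketch of that, assuming dest is your destination stream and data is the bytes to compress: passing nowrap = true to the Deflater constructor, and handing that Deflater to the stream, yields raw deflate output with no header or trailer:

Deflater raw = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap = true
DeflaterOutputStream out = new DeflaterOutputStream(dest, raw);
out.write(data);
out.finish(); // terminates the deflate stream; still no header or trailer is written
raw.reset();  // the same Deflater can now be reused for the next stream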
It sounds like you want to have two streams: your destination stream, and a compressor stream that decorates the destination stream. Then you'll write your uncompressed data to the base stream and the compressed data to the decorator stream. Make sure that you finish the decorator before switching back to the base stream. Reading will be a similar procedure, but you'll need to know where the compressed data begins and ends in your stream.
Suppose the destination stream is a file, something like the following pseudo code...
FileOutputStream dest = new FileOutputStream(foo);
DeflaterOutputStream decorator = new DeflaterOutputStream(dest);
byte[] header = getHeader();
byte[] body = getBody();
byte[] footer = getFooter();
dest.write(header);     // uncompressed header goes straight to the base stream
decorator.write(body);  // the body is compressed on the way through
decorator.finish();     // terminate the deflate stream; flush() alone does not
                        // force out the compressor's pending bytes
dest.write(footer);     // uncompressed footer follows the compressed body
I wonder if DeflaterOutputStream is really what you want though. Isn't that part of a zip file? If you're doing something custom, it seems like you'd just want to gzip it.
Related
How to implement seek() function in BufferSink (or BufferedSource) in OKHttp?
We all know that in Java, the RandomAccessFile class has a method seek(long), which enables us to start reading/writing a file from a specific position; the bytes before the position will be discarded. Are there any similar methods in OkHttp?
I have noticed that there is a method in BufferedSink:
write(byteString: ByteString, offset: Int, byteCount: Int)
But unfortunately the "offset" parameter accepts only an int, not a long, which is limiting when transmitting large files.
The API you're looking for is BufferedSource.skip().
In Okio 3.0 (coming soon) we’re adding a new Cursor class that'll make skip() faster if the underlying source is a File.
https://github.com/square/okio/issues/889
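For example, a small sketch (the file name and offset are hypothetical); skip() takes a long, so offsets past Integer.MAX_VALUE are not a problem:

BufferedSource source = Okio.buffer(Okio.source(new File("large.bin")));
source.skip(5_000_000_000L); // discard the first 5 GB; skip() accepts a long
byte[] rest = source.readByteArray();
source.close();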
I am using Okio.buffer to read an image file from the assets folder like this:
BufferedSource img = Okio.buffer(Okio.source(getAssets().open("image.jpg")));
byte[] image = img.readByteArray();
The question may be quite vague; let me expand on it here.
I'm developing an application in which I'll be reading data from a file. I have a FileReader class which opens the file in the following fashion:
currentFileStream = new FileInputStream(currentFile);
fileChannel = currentFileStream.getChannel();
Data is read as follows:
bytesRead = fileChannel.read(buffer); // Data is buffered using a ByteBuffer
I'm processing the data in one of two forms: binary or character.
If it's processed as characters, I do the additional step of decoding the ByteBuffer into a CharBuffer:
CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
Now my problem is that I need to reposition the file to some offset during recovery mode, in case of a failure or crash in the application.
For this, I maintain a byteOffset which keeps track of the number of bytes processed in binary mode, and I persist this variable.
If something happens, I reposition the file like this:
fileChannel.position(byteOffset);
which is pretty straightforward.
But if the processing mode is character, I maintain a recordOffset which keeps track of the character position/offset in the file. During recovery I make repeated read() calls internally until I reach the persisted recordOffset + 1.
Is there any way to get the number of bytes that were consumed to decode the characters? For instance, if recordOffset is 400, its corresponding byteOffset might be 410 or 480 (depending on the charset). With that, while repositioning I could do this:
fileChannel.position(recordOffset); //recordOffset equivalent value in number of bytes
instead of making repeated calls internally in my application.
Another approach I could think of was using InputStreamReader's skip method.
If there is a better approach, or if it is possible to get a byte-to-character mapping, please let me know.
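One hedged approach, since decode() consumes bytes from the ByteBuffer as it produces characters: snapshot the buffer position around each decode call and accumulate both offsets, then persist them as a pair (this assumes the buffer is refilled sequentially, e.g. with compact(), so the cumulative count equals the file position of the next undecoded byte):

int before = byteBuffer.position();
CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
byteOffset   += byteBuffer.position() - before; // bytes consumed by this decode
recordOffset += charBuffer.position();          // characters produced this round
// persist (recordOffset, byteOffset) together; on recovery:
//   fileChannel.position(byteOffset);
//   decoder.reset();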
I think there's a way to do this but I'm not sure how? Basically, I was writing a compression program that resulted in a crc error when I tried to unzip the compressed data. Normally this means that the decompressor actually recognized my data as being in the right format and decompressed it, but when it compared the result to the expected length as indicated by the CRC, they weren't the same.
However, for comparison reasons, I actually do want to see the output to see if it's just a concatenation issue (which should be relatively obvious if the decompressed output isn't gibberish but just in the wrong order).
You said "unzip", but the question says "gzip". Which is it? Those are two different programs that operate on two different formats. I will assume gzip. Also the length is not "indicated by the CRC". The gzip trailer contains a CRC and an uncompressed length (modulo 232), which are two different things.
The gzip command will decompress all valid deflate data and write it out before checking the crc. So if, for example, I take a .gz file and corrupt just the crc (or length) at the end, and do:
gzip -dc < corrupt.gz > result
then result will be the entire, correct uncompressed data stream. There is no need to modify and recompile gzip, nor to write your own ungzipper. gzip will complain about the crc, but all of the data will be written nevertheless.
As far as I'm aware, the CRC check is part of the GZIP wrapper, not part of the actual compressed data in DEFLATE format.
So you should be able to take literally just the bytes that are the compressed data stream, ignoring the GZIP header and CRC at the end, and pass it through an Inflater.
In other words, you need to take just the bytes corresponding to those referred to as "compressed blocks" in the GZIP File format specification and try to decompress using a Java Inflater object. A little bit of work but possibly less than re-compiling the GZIP code as Greg suggests (though his option would also work in principle).
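A hedged sketch of that (assuming gzipBytes holds one complete gzip member whose header has none of the optional FEXTRA/FNAME/FCOMMENT fields, so the header is exactly 10 bytes and the trailer is the last 8):

Inflater inflater = new Inflater(true); // true = raw deflate, no zlib wrapper
inflater.setInput(gzipBytes, 10, gzipBytes.length - 18); // skip header and trailer
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[4096];
while (!inflater.finished()) {
    int n = inflater.inflate(buf); // throws DataFormatException on corrupt data
    if (n == 0 && inflater.needsInput()) break;
    out.write(buf, 0, n);
}
byte[] decompressed = out.toByteArray();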
I think I'm missing something very simple. I have a byte array holding deflated data written into it using a Deflater:
deflater.deflate(outData, 0, BLOCK_SIZE, Deflater.SYNC_FLUSH);
The reason I didn't just use GZIPOutputStream was that there were 4 threads (the number varies), each given a block of data, and each thread compressed its own block before storing that compressed data into a global byte array. If I used GZIPOutputStream, it would mess up the format, because each little block would have a header and trailer and be its own gzip stream (I only want to compress it).
So in the end, I've got this byte array, outData, that's holding all of my compressed data, but I'm not really sure how to wrap it. GZIPOutputStream writes from a buffer of uncompressed data, but this array is all set: it's already compressed, and I'm just hitting a wall trying to figure out how to get it into the right form.
EDIT: OK, bad wording on my part. I'm writing it to standard output, not a file, so that it can be redirected if needed. A really simple example is that
cat file.txt | java Jzip | gzip -d | cmp file.txt
should return 0. The problem right now is that if I write this byte array as-is to output, it's just "raw" compressed data. I think gzip needs all this extra information.
If there's an alternative method, that would be fine too. The whole reason it's like this is that I needed to use multiple threads. Otherwise I would just call GZIPOutputStream.
DOUBLE EDIT: Since the comments provide a lot of good insight, here is another method. I just have a bunch of uncompressed blocks of data that were originally one long stream. If gzip can read concatenated streams, I could take those blocks (keeping them in order), give each one to a thread that calls GZIPOutputStream on its own block, then take the results and concatenate them. In essence, each block would then have a header, the compressed data, and a trailer. Would gzip recognize that if I concatenated them?
Example:
cat file.txt
Hello world! How are you? I'm ready to set fire to this assignment.
java Testcase < file.txt > file.txt.gz
So I accept it from input. Inside the program, the stream is split up into
"Hello world!" "How are you?" "I'm ready to set fire to this assignment" (they're not strings, it's just an array of bytes! this is just illustration)
So I've got these three blocks of bytes, all uncompressed. I give each of these blocks to a thread, which uses
public static class DGZIPOutputStream extends GZIPOutputStream
{
    public DGZIPOutputStream(OutputStream out, boolean flush) throws IOException
    {
        super(out, flush); // 'flush' enables SYNC_FLUSH mode (Java 7+ constructor)
    }

    public void setDictionary(byte[] b)
    {
        def.setDictionary(b); // 'def' is the protected Deflater inherited from DeflaterOutputStream
    }

    public void updateCRC(byte[] input)
    {
        crc.update(input); // 'crc' is the protected CRC32 inherited from GZIPOutputStream
    }
}
As you can see, the only thing here is that I've set the flush to SYNC_FLUSH so I can get the alignment right and have the ability to set the dictionary. If each thread were to use DGZIPOutputStream (which I've tested and it works for one long continuous input), and I concatenated those three blocks (now compressed each with a header and trailer), would gzip -d file.txt.gz work?
If that's too weird, ignore the dictionary completely. It doesn't really matter. I just added it in while I was at it.
If you set nowrap true when using the Deflater (sic) constructor, then the result is raw deflate. Otherwise it's zlib, and you would have to strip the zlib header and trailer. For the rest of the answer, I am assuming nowrap is true.
To wrap a complete, terminated deflate stream to be a gzip stream, you need to prepend ten bytes:
"\x1f\x8b\x08\0\0\0\0\0\0\xff"
(sorry -- C format, you'll need to convert to Java octal). You also need to append the four-byte CRC in little-endian order, followed by the four-byte total uncompressed length modulo 2^32, also in little-endian order. Given what is available in the standard Java API, you'll need to compute the CRC serially. It can't be done in parallel. zlib does have a function (crc32_combine) to combine separate CRCs that are computed in parallel, but that is not exposed in Java.
Note that I said a complete, terminated deflate stream. It takes some care to make one of those with parallel deflate tasks. You would need to make n-1 unterminated deflate streams and one final terminated deflate stream and concatenate those. The last one is made normally. The other n-1 need to be terminated using sync flush in order to end each on a byte boundary and to not mark it as the end of the stream. To do that, you use deflate with the flush parameter SYNC_FLUSH. Don't use finish() on those.
For better compression, you can use setDictionary on each chunk with the last 32K of the previous chunk.
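A hedged sketch of the wrapping step (the method names are mine; the header bytes and little-endian trailer follow the gzip format described above):

static void wrapAsGzip(OutputStream out, byte[] rawDeflate,
                       long crc32, long uncompressedLength) throws IOException {
    // 10-byte header: magic 1f 8b, CM = 8 (deflate), no flags, zero mtime,
    // no extra flags, OS = 255 (unknown)
    out.write(new byte[] { (byte) 0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff });
    out.write(rawDeflate);                     // the complete, terminated deflate stream
    writeIntLE(out, (int) crc32);              // CRC-32 of the uncompressed data
    writeIntLE(out, (int) uncompressedLength); // uncompressed length modulo 2^32
}

static void writeIntLE(OutputStream out, int v) throws IOException {
    out.write(v);          // write(int) emits the low 8 bits
    out.write(v >>> 8);
    out.write(v >>> 16);
    out.write(v >>> 24);
}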
If you are looking to write outData to a file, you may write:
GZIPOutputStream outStream = new GZIPOutputStream(new FileOutputStream("fileName"));
outStream.write(outData, 0, outData.length);
outStream.close();
Or simply use java.io.FileOutputStream to write:
FileOutputStream outStream = new FileOutputStream("fileName");
outStream.write(outData, 0, outData.length);
outStream.close();
You just want to write a byte array - as is - to a file?
You can use Apache Commons IO:
FileUtils.writeByteArrayToFile(new File("yourFilename"), outData);
Or plain old Java:
try (BufferedOutputStream bs = new BufferedOutputStream(
        new FileOutputStream("yourFilename"))) {
    bs.write(outData);
} catch (IOException e) {
    // please handle this
}
I have a bunch of different objects (and object types) that I want to write to a binary file. First of all, I need the file to be structured like this:
Object type1
obj1, obj2 ...
Object type2
obj1, obj2...
....
Being a binary file, this doesn't help a user read it, but I want to have a structure so I can search for, delete, or add an object by its type without parsing the entire file. And this is something I don't know how to do. Is this even possible?
You will have to maintain a header at the beginning of the file (or somewhere else) to mark the position and length of each of your objects.
The kind and layout of the header depend a lot on how you plan to read and write into the file. For example, if you plan to retrieve the objects by name, you could have in your file something like this:
object1 500 1050
object2 1550 800
object3 2350 2000
<some padding to cover 500 bytes>
<the 1050 bytes of object1><the 800 bytes of object2><the 2000 bytes of object3>
And know that object1 starts at offset 500 in the file and has a length of 1050 bytes.
Since it seems that you have different types of objects that you want to store, you will probably need to add some additional data to your header.
Take care of the following:
Each time you add, delete, or modify an object, you will have to update in the header the offsets of all objects that follow (for example, if I remove object2, then the offset for object3 is now 1550).
If you store the header in the same file as the data, then you must take the size of the header into account when computing offsets (this will make things much harder; I suggest you keep the header and binary data separated).
You will have to read and parse the header each time you want to access an object. Consider using a standardized format for your header to avoid problems (YAML or XML).
I'm not aware of any library that will help you implement such a feature but I'm pretty sure there are some. Maybe someone will be able to suggest one.
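As a hedged sketch of the separate-header idea (file names and the record layout are made up for illustration), the index could be written with DataOutputStream and the data file consulted with RandomAccessFile:

// Write one index entry per object: name, offset, length.
try (DataOutputStream index = new DataOutputStream(
        new FileOutputStream("objects.idx"))) {
    index.writeUTF("object1");
    index.writeLong(500);  // offset of object1's bytes in the data file
    index.writeInt(1050);  // length of object1 in bytes
    // ... one entry per object
}

// Later, look up an object in the index and jump straight to it.
try (RandomAccessFile data = new RandomAccessFile("objects.bin", "r")) {
    data.seek(500);                  // offset found in the index
    byte[] object1 = new byte[1050];
    data.readFully(object1);
}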
--
Another solution would be to use something like a ZipFile (which is natively supported by Java) and write each of your objects as a different ZipEntry. This way you won't have to manage object separation yourself, and will only need to worry about knowing the exact ZipEntry you want.
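A hedged sketch of that alternative (the entry names are made up; any scheme that encodes the type in the name would do, and serialize() stands in for whatever turns your objects into bytes):

try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("objects.zip"))) {
    zip.putNextEntry(new ZipEntry("type1/obj1")); // encode the type in the entry name
    zip.write(serialize(obj1));
    zip.closeEntry();
    zip.putNextEntry(new ZipEntry("type2/obj1"));
    zip.write(serialize(otherObj));
    zip.closeEntry();
}

// Random access by entry name, without parsing the whole file:
try (ZipFile zf = new ZipFile("objects.zip")) {
    ZipEntry entry = zf.getEntry("type1/obj1");
    try (InputStream in = zf.getInputStream(entry)) {
        // read just this object's bytes
    }
}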