How to improve GZIP performance - java

Currently I have the problem that this piece of code will be called more than 500k times. The size of the compressed byte[] is less than 1 KB. Every time the method is called, all of the streams have to be created. So I am looking for a way to improve this code.
private byte[] unzip(byte[] data) throws IOException, DataFormatException {
    byte[] unzipData = new byte[4096];
    try (ByteArrayInputStream in = new ByteArrayInputStream(data);
         GZIPInputStream gzipIn = new GZIPInputStream(in);
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        int read = 0;
        while ((read = gzipIn.read(unzipData)) != -1) {
            out.write(unzipData, 0, read);
        }
        return out.toByteArray();
    }
}
I already tried to replace ByteArrayOutputStream with a ByteBuffer, but at the time of creation I don't know how many bytes I need to allocate.
Also, I tried to use Inflater, but I stumbled across the problem described here.
Any other ideas what I could do to improve the performance of this code?
UPDATE#1
Maybe this lib helps someone.
Also there is an open JDK-Bug.

Profile your application, to be sure that you're really spending optimizable time in this function. It doesn't matter how many times you call this function; if it doesn't account for a significant fraction of overall program execution time, then optimization is wasted.
Pre-size the ByteArrayOutputStream. The default buffer size is 32 bytes, and resizes require copying all existing bytes. If you know that your decoded arrays will be around 1k, use new ByteArrayOutputStream(2048).
Rather than reading a byte at a time, read a block at a time, using a pre-allocated byte[]. Beware that you must use the return value from read as an input to write. Better, use something like Jakarta Commons IOUtils.copy() to avoid mistakes.
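A sketch applying these suggestions to the method from the question (the 2048 / 4096 sizes assume the ~1 KB payloads mentioned above and are worth tuning under a profiler):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

private byte[] unzip(byte[] data) throws IOException {
    // Reused block buffer; the return value of read() bounds the write().
    byte[] buf = new byte[4096];
    try (GZIPInputStream gzipIn = new GZIPInputStream(new ByteArrayInputStream(data));
         // Pre-sized so a ~1 KB result never forces the internal array to grow.
         ByteArrayOutputStream out = new ByteArrayOutputStream(2048)) {
        int read;
        while ((read = gzipIn.read(buf)) != -1) {
            out.write(buf, 0, read);
        }
        return out.toByteArray();
    }
}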

I'm not sure if it applies in your case, but I've found an incredible speed difference when comparing the default buffer size of GZIPInputStream with an increased size of 65536.
Example, using a 500 MB input file:
new GZIPInputStream(new FileInputStream(path.toFile())) // takes 4 mins to process
vs
new GZIPInputStream(new FileInputStream(path.toFile()), 65536) // takes 10s
More details can be found here http://java-performance.info/java-io-bufferedinputstream-and-java-util-zip-gzipinputstream/
Both BufferedInputStream and GZIPInputStream have internal buffers. The default size for the former is 8192 bytes and for the latter 512 bytes. Generally it is worth increasing either of these sizes to at least 65536.
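Applied to the in-memory case from the original question, the tweak is just the second constructor argument (a sketch; for ~1 KB compressed payloads the gain will be far smaller than for a 500 MB file, so measure before settling on a size):

// Larger internal inflate buffer for GZIPInputStream (the default is 512 bytes).
GZIPInputStream gzipIn = new GZIPInputStream(new ByteArrayInputStream(data), 65536);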

You can use the Inflater class's reset() method to reuse the Inflater object without having to recreate it each time. You will have a little bit of added programming to do in order to decode the gzip header and perform the integrity check with the gzip trailer. You would then use Inflater with the nowrap option to decompress the raw deflated data after the gzip header and before the trailer.
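A sketch of that approach, assuming the simplest case: a plain 10-byte gzip header with no optional fields (FLG == 0) and the standard 8-byte trailer. Real gzip streams may carry extra header fields that would have to be parsed and skipped.

import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// One Inflater per thread; the class is not thread-safe.
private final Inflater inflater = new Inflater(true); // true = "nowrap", raw deflate data

private byte[] unzipReusingInflater(byte[] gzip) throws DataFormatException {
    // Assumes a fixed 10-byte header and an 8-byte trailer (CRC32 + ISIZE).
    final int headerLen = 10;
    final int trailerLen = 8;

    inflater.reset(); // reuse the native zlib state instead of recreating it
    inflater.setInput(gzip, headerLen, gzip.length - headerLen - trailerLen);

    ByteArrayOutputStream out = new ByteArrayOutputStream(2048);
    byte[] buf = new byte[4096];
    while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && inflater.needsInput()) {
            throw new DataFormatException("truncated deflate stream");
        }
        out.write(buf, 0, n);
    }
    byte[] result = out.toByteArray();

    // Integrity check against the CRC32 stored little-endian in the trailer.
    CRC32 crc = new CRC32();
    crc.update(result);
    long storedCrc = 0;
    for (int i = 0; i < 4; i++) {
        storedCrc |= (gzip[gzip.length - trailerLen + i] & 0xFFL) << (8 * i);
    }
    if (crc.getValue() != storedCrc) {
        throw new DataFormatException("gzip CRC mismatch");
    }
    return result;
}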

Related

Not a gzip format for a obvious gzip text in Java

I have been trying to implement decompressing text compressed in GZIP format. Below is the method I implemented:
private byte[] decompress(String compressed) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteArrayInputStream in = new ByteArrayInputStream(compressed.getBytes(StandardCharsets.UTF_8));
    GZIPInputStream ungzip = new GZIPInputStream(in);
    byte[] buffer = new byte[256];
    int n;
    while ((n = ungzip.read(buffer)) >= 0) {
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}
And now I am testing the solution for following compressed text:
H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==
And I get a "Not a gzip format" exception.
I tried different ways, but the error is still there. Does anyone have an idea what I am doing wrong?
That's not gzip formatted. In general, compressed cannot be a string, because compressed data is bytes, and a string isn't bytes. Some languages / tutorials / 1980s thinking conflate the two, but it's the 2020s; we don't do that anymore. There are more characters than what's used in English.
It looks like perhaps the following has occurred:
Someone has some data.
They gzipped it.
They then turned the gzipped stream (which is bytes) into characters using Base64 encoding.
They sent it to you.
You now want to get back to the data.
Given that 2 transformations occurred (first, gzip it, then, base64 it), you need to also do 2 transformations, in reverse. You need to:
Take the input string, and de-base64 it, giving you bytes.
You then need to take these bytes and decompress them.
and now you have the original data back.
Thus:
byte[] gzipped = java.util.Base64.getDecoder().decode(compressed);
var in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
return in.readAllBytes();
Note:
Pushing the data from an input to an output stream like this is a waste of resources and a bunch of finicky code. There is no need to write this; just call readAllBytes.
If the incoming Base64 is large, there are ways to do this in a streaming fashion. This would require that this method take a Reader (instead of a String, which cannot be streamed) and return an InputStream instead of a byte[]. Of course, if the input is not particularly large, there is no need. The above approach is somewhat wasteful: the Base64-ed data, the un-Base64-ed data, and the decompressed data are all in memory at the same time, and you can't avoid this, nor can the garbage collector collect any of it in between (because the caller most likely continues to reference that Base64-ed string).
In other words, if the compression ratio is, say, 50%, and the total uncompressed data is 100 MB in size, this method takes MORE than:
100 MB (uncompressed) + 50 MB (compressed) + 50 MB × 4/3 ≈ 67 MB (compressed but Base64-encoded) ≈ 217 MB of memory.
You know better than we do how much heap your VM is running on, and how large the input data is likely to ever get.
NB: Base64 transfer is extremely inefficient, taking 4 bytes of base64 content for every 3 bytes of input data, and if the data transfer is in UTF-16, it's 8 bytes per 3, even. Ouch. Given that the content was GZipped, this feels a bit daft: First we painstakingly reduce the size of this thing, and then we casually inflate it by 33% for probably no good reason. You may want to check the 'pipe' that leads you to this, possibly you can just... eliminate the base64 aspect of this.
For example, if you have a wire protocol and someone decided that JSON was a good idea, then.. simply.. don't. JSON is not a good idea if you have the need to transfer a bunch of raw data. Use protobuf, or send a combination of JSON and blobs, etc.
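Putting the two steps together, a minimal sketch of the method, assuming the input really is Base64-encoded gzip as above (and closing the stream this time):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

private byte[] decompress(String compressed) throws IOException {
    // 1. Undo the Base64 layer: characters -> gzip bytes.
    byte[] gzipped = Base64.getDecoder().decode(compressed);
    // 2. Undo the gzip layer: gzip bytes -> original bytes.
    //    readAllBytes() is available on InputStream since Java 9.
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
        return in.readAllBytes();
    }
}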

Reading large file in bytes by chunks with dynamic buffer size

I'm trying to read a large file by chunks and save them in an ArrayList of bytes.
My code, in short, looks like this:
public ArrayList<byte[]> packets = new ArrayList<>();

FileInputStream fis = new FileInputStream("random_text.txt");
byte[] buffer = new byte[512];
while (fis.read(buffer) > 0) {
    packets.add(buffer);
}
fis.close();
But it has a behavior that I don't know how to solve. For example, if a file contains only the words "hello world", the chunk does not necessarily need to be 512 bytes long. In fact, I want each chunk to be at most 512 bytes, not for all of them to necessarily have that size.
First of all, what you are doing is probably a bad idea. Storing a file's contents in memory like this is liable to be a waste of heap space ... and can lead to OutOfMemoryError exceptions and / or a requirement for an excessively large heap if you process large (enough) input files.
The second problem is that your code is wrong. You are repeatedly reading the data into the same byte array. Each time you do, it overwrites what was there before. So you will end up with a list containing lots of references to a single byte array ... containing just the last chunk of data that you read.
To solve the problem that you asked about [1], you will need to copy the chunk that you read to a new (smaller) byte array.
Something like this:
public ArrayList<byte[]> packets = new ArrayList<>();

try (FileInputStream fis = new FileInputStream("random_text.txt")) {
    byte[] buffer = new byte[512];
    int len;
    while ((len = fis.read(buffer)) > 0) {
        packets.add(Arrays.copyOf(buffer, len));
    }
}
Note that this also deals with the second problem I mentioned, and it fixes a potential resource leak by using try-with-resources syntax to manage closing the input stream.
A final issue: If this is really a text file that you are reading, you probably should be using a Reader to read it, and char[] or String to hold it.
But even if you do that, there are some awkward edge cases if your text contains Unicode code points that are not in plane 0, for example emojis. The edge cases involve code points that are represented as a surrogate pair AND the pair being split on a chunk boundary. Reading and storing the text as lines would avoid that (see the sketch below).
[1] - The issue here is not the "wasted" space. Unless you are reading and caching a large number of small files, any space wastage due to "short" chunks will be unimportant. The important issue is knowing which bytes in each byte[] are actually valid data.
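A line-based sketch of that last suggestion, using java.nio.file.Files (the UTF-8 charset is an assumption; use whatever encoding the file actually has):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Decoding happens per line, so a surrogate pair can never be split
// across a fixed-size chunk boundary.
List<String> lines = Files.readAllLines(Paths.get("random_text.txt"), StandardCharsets.UTF_8);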

Reading chunks / streaming through a file with the FileChannel

I am implementing some kind of file viewer/file explorer as a Web-Application. Therefore I need to read files from the hard disk of the system. Of course I have to deal with small files and large files and I want the fastest and most performant way of doing this.
Now I have the following code and want to ask the "big guys" who have a lot of knowledge about efficiently reading (large) files if I am doing it the right way:
RandomAccessFile fis = new RandomAccessFile(filename, "r");
FileChannel fileChannel = fis.getChannel();
// Don't load the whole file into the memory, therefore read 4096 bytes from position on
MappedByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
byte[] buf = new byte[4096];
StringBuilder sb = new StringBuilder();
while (mappedByteBuffer.hasRemaining()) {
    // Math.min(..) to avoid BufferUnderflowException
    mappedByteBuffer.get(buf, 0, Math.min(4096, mappedByteBuffer.remaining()));
    sb.append(new String(buf));
}
LOGGER.debug(sb.toString()); // Debug purposes
I hope you can help me and give me some advices.
When you are going to view arbitrary files, including potentially large files, I'd assume that there's a possibility that these files are not actually text files, or that you may encounter files with different encodings.
So when you are going to view such files as text on a best-effort basis, you should think about which encoding you want to use and make sure that failures do not harm your operation. The constructor you use with new String(buf) does replace invalid characters, but it is redundant to construct a new String instance and append it to a StringBuilder afterwards.
Generally, you shouldn't take so many detours. Since Java 7, you don't need a RandomAccessFile (or FileInputStream) to get a FileChannel. A straightforward solution would look like:
// Instead of StandardCharsets.ISO_8859_1 you could also use Charset.defaultCharset()
CharsetDecoder decoder = StandardCharsets.ISO_8859_1.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
    .replaceWith(".");
try (FileChannel fileChannel = FileChannel.open(Paths.get(filename), StandardOpenOption.READ)) {
    // Don't load the whole file into the memory, therefore read 4096 bytes from position on
    ByteBuffer mappedByteBuffer = fileChannel.map(MapMode.READ_ONLY, position, 4096);
    CharBuffer cb = decoder.decode(mappedByteBuffer);
    LOGGER.debug(cb.toString()); // Debug purposes
}
You can operate on the resulting CharBuffer directly or invoke toString() on it to get a String instance (but of course, avoid doing that multiple times). The CharsetDecoder also allows reusing a CharBuffer; however, that may not have such a big impact on the performance. What you should definitely avoid is concatenating all these chunks into one big string.

Practical usage of ByteArrayInputStream/ByteArrayOutputStream

What are some practical areas where ByteArrayInputStream and/or ByteArrayOutputStream are used? Examples are also welcome.
If one searches for examples, one usually finds something like:
byte[] buf = { 16, 47, 12 };
ByteArrayInputStream byt = new ByteArrayInputStream(buf);
That does not explain where or why one should use it. I know that they are used when working with images, ZIP files, or writing to a ServletOutputStream.
ByteArrayInputStream: every time you need an InputStream (typically because an API takes that as argument), and you have all the data in memory already, as a byte array (or anything that can be converted to a byte array).
ByteArrayOutputStream: every time you need an OutputStream (typically because an API writes its output to an OutputStream) and you want to store the output in memory, and not in a file or on the network.
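For instance, here is a small sketch of both cases using java.util.Properties, which happens to want an OutputStream on one side and an InputStream on the other (any such API would do):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

static Properties roundTrip(Properties props) throws IOException {
    // ByteArrayOutputStream: Properties.store wants an OutputStream,
    // but we want the result in memory rather than in a file.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    props.store(out, "in-memory copy");
    byte[] bytes = out.toByteArray();

    // ByteArrayInputStream: Properties.load wants an InputStream,
    // and we already have all the data in memory as a byte[].
    Properties copy = new Properties();
    copy.load(new ByteArrayInputStream(bytes));
    return copy;
}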

What's the fastest way to write a very small string to a file in Java?

My code needs to take an integer value between 0 and 255 and write it to a file as a string. It needs to be fast as it may be called repeatedly very quickly, so any optimisation will become noticeable when under heavy load. There are other questions on here dealing with efficient ways to write large amounts of data to file, but how about small amounts of data?
Here's my current approach:
public static void writeInt(final String filename, final int value)
{
    try
    {
        // Convert the int to a string representation in a byte array
        final String string = Integer.toString(value);
        final byte[] bytes = new byte[string.length()];
        for (int i = 0; i < string.length(); i++)
        {
            bytes[i] = (byte) string.charAt(i);
        }
        // Now write the byte array to file
        final FileOutputStream fileOutputStream = new FileOutputStream(filename);
        fileOutputStream.write(bytes, 0, bytes.length);
        fileOutputStream.close();
    }
    catch (IOException exception)
    {
        // Error handling here
    }
}
I don't think a BufferedOutputStream will help here: the overhead of building and flushing the buffer is probably counter-productive for a 3-character write, isn't it? Are there any other improvements I can make?
I think this is about as efficient as you can get given the 0-255 range requirement. Using a buffered writer will be less efficient, since it would create some temporary structures that you don't need with so few bytes being written.
static byte[][] cache = new byte[256][];

public static void writeInt(final String filename, final int value)
{
    // time will be spent on integer to string conversion, so cache that
    byte[] bytesToWrite = cache[value];
    if (bytesToWrite == null) {
        bytesToWrite = cache[value] = String.valueOf(value).getBytes();
    }
    // try-with-resources closes the stream even if write() fails
    try (FileOutputStream fileOutputStream = new FileOutputStream(filename)) {
        // Now write the byte array to file
        fileOutputStream.write(bytesToWrite);
    } catch (IOException exception) {
        // Error handling here
    }
}
You cannot make it faster, IMO. BufferedOutputStream would be of no help here, if not the opposite. If we look at the source, we'll see that FileOutputStream.write(byte b[], int off, int len) sends the byte array directly to a native method, while BufferedOutputStream.write(byte b[], int off, int len) is synchronized and copies the array to its buffer first, and on close it flushes the bytes from the buffer to the actual stream.
Besides, the slowest part in this case is opening / closing the file.
I think the bottleneck here is IO, and these two improvements could help:
Think about the granularity of updates. I.e. if you need no more than 20 updates per second out of your app, then you could optimize your app to write no more than 1 update per 1/20 second. This could be very beneficial depending on the environment.
Java NIO has proved to be much faster for large sizes, so it also makes sense to experiment with small sizes, e.g. writing to a Channel instead of an OutputStream (see the sketch below).
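A possible NIO variant of the write, as a sketch to benchmark against the FileOutputStream version (whether a Channel is actually faster for a three-byte file is something to measure rather than assume; writeIntNio is just an illustrative name):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public static void writeIntNio(final String filename, final int value) throws IOException {
    // Digits 0-255 are plain ASCII, matching the manual char-to-byte cast above.
    byte[] bytes = Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    try (FileChannel channel = FileChannel.open(Paths.get(filename),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE,
            StandardOpenOption.TRUNCATE_EXISTING)) {
        channel.write(ByteBuffer.wrap(bytes));
    }
}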
Sorry for coming so late to the party :)
I think trying to optimise the code is probably not the right approach. If you're writing the same tiny file repeatedly, and you have to write it each time rather than buffering it in your application, then by far the biggest consideration will be filesystem and the storage hardware.
The point is that if you're actually hitting the hardware every time then you will upset it severely. If your system is caching the writes, though, then you might be able to have it not hit the hardware very often at all: the data will have been overwritten before it gets there, and only the new data will be written.
But this depends on two things. For one, what does your filesystem do when it gets a new write before it's written the old one? Some filesystems might still end up writing extra entries in a journal, or even writing the old file in one place and then the new file in a different physical location. That would be a killer.
For another, what does your hardware do when asked to overwrite something? If it's a conventional hard drive, it will probably just overwrite the old data. If it's flash memory (as it might well be if this is Android), the wear levelling will kick in, and it'll keep writing to different bits of the drive.
You really need to do whatever you can, in terms of disk caching and filesystem, to ensure that if you send 1000 updates before anything pushes the cache to disk, only the last update gets written.
Since this is Android, you're probably looking at ext2/3/4. Look carefully at the journalling options, and investigate what the effect of the delayed allocation in ext4 would be. Perhaps the best option will be to go with ext4, but turn the journalling off.
A quick Google search brought up a benchmark of different write/read operations with files of various sizes:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
The author comes to the conclusion that WinFileIO.WriteBlocks performs fastest for writing data to a file, although I/O performance relies heavily on multiple factors, such as operating system file caching, file indexing, disk fragmentation, filesystem caching, etc.
