Java Deflater for a large set of random strings

I am using the Deflater class to try to compress a large set of random strings. My compression and decompression methods look like this:
public static String compressAndEncodeBase64(String text) {
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(text.getBytes());
        }
        byte[] bytes = os.toByteArray();
        return new String(Base64.getEncoder().encode(bytes));
    } catch (Exception e) {
        log.info("Caught exception when trying to compress {}: ", text, e);
    }
    return null;
}
public static String decompressB64(String compressedAndEncodedText) {
    try {
        byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(decodedText);
        }
        byte[] decompressedBArray = os.toByteArray();
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    } catch (Exception e) {
        log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
        throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
    }
}
However, when I test this on a large set of random strings, my "compressed" string is larger than the original string. Even for a relatively small random string, the compressed data is longer. For example, this unit test fails:
@Test
public void testCompressDecompressRandomString() {
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue((orig.length() - compressedString.length()) > 0,
            "The original string has length " + orig.length() + ", while the compressed string has length " + compressedString.length());
}
Can anyone explain what's going on, and suggest a possible alternative?
Note: I tried using the deflater without the base64 encoding:
public static String compress(String data) {
    Deflater new_deflater = new Deflater();
    new_deflater.setInput(data.getBytes(StandardCharsets.UTF_8));
    new_deflater.finish();
    byte[] compressed_string = new byte[1024];
    int compressed_size = new_deflater.deflate(compressed_string);
    byte[] returnValues = new byte[compressed_size];
    System.arraycopy(compressed_string, 0, returnValues, 0, compressed_size);
    log.info("The Original String: " + data + "\n Size: " + data.length());
    log.info("The Compressed String Output: " + new String(compressed_string) + "\n Size: " + compressed_size);
    return new String(returnValues, StandardCharsets.UTF_8);
}
My test still fails however.

First off, you aren't going to get much or any compression on short strings. Compressors need more data to both collect statistics on the data and to have previous data in which to look for repeated strings.
Second, if you're testing with random data, you are further crippling the compressor, since now there are no repeated strings. For your test case with random alphanumeric strings, the only compression you can get is to take advantage of the fact that there are only 62 possible values for each byte. That can be compressed by a factor of log(62)/log(256) = 0.744. Even then, you need to have enough input to cancel the overhead of the code description. Your test case of 71 characters will always be compressed to 73 bytes by deflate, which is essentially just copying the data with a small overhead. There isn't enough input to justify the code description to take advantage of the limited character set. If I have 1,000,000 random characters from that set of 62, then deflate can compress that to about 752,000 bytes.
Third, you are then expanding the resulting compressed data by a factor of 1.333 by encoding it using Base64. So if I take that compression by a factor of 0.752 and then expand it by 1.333, I get an overall expansion of 1.002! You won't get anywhere that way on random characters from a set of 62, no matter how long the input is.
Given all that, you need to do your testing on real-world inputs. I suspect that your application does not have randomly-generated data. Don't attempt compression on short strings. Combine your strings into much longer input, so that the compressor has something to work with. If you must encode with Base64, then you must. But expect that there may be expansion instead of compression. You could include in your output format an option for chunks to be compressed or not compressed, indicated by a leading byte. Then when compressing, if it doesn't compress, send it without compression instead. You can also try a more efficient encoding, e.g. Base85, or whatever number of characters you can transmit transparently.
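Here is a minimal sketch of that "compress only when it helps" idea, with a hypothetical one-byte flag in front of the payload (0 = stored as-is, 1 = deflated). The method name and flag convention are made up for illustration, not part of any standard format; Base64 encoding, if you need it, would happen on the returned bytes.
// Sketch only. Requires java.io.* and java.util.zip.DeflaterOutputStream.
public static byte[] compressIfSmaller(byte[] raw) throws IOException {
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
        dos.write(raw);
    }
    byte[] deflated = os.toByteArray();
    byte[] out;
    if (deflated.length < raw.length) {
        out = new byte[deflated.length + 1];
        out[0] = 1;                                     // 1 = deflated payload
        System.arraycopy(deflated, 0, out, 1, deflated.length);
    } else {
        out = new byte[raw.length + 1];
        out[0] = 0;                                     // 0 = stored uncompressed, deflate would have expanded it
        System.arraycopy(raw, 0, out, 1, raw.length);
    }
    return out;
}
When reading, the decoder checks the first byte and either copies the payload through or runs it through an Inflater.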

Related

Base64 Encoded to Decoded File Conversion Problem

I am processing very large files (> 2Gig). Each input file is Base64 encoded, and I am outputting to new files after decoding. Depending on the buffer size (LARGE_BUF) and for a given input file, my input-to-output conversion either works fine, is missing one or more bytes, or throws an exception at the outputStream.write line (IllegalArgumentException: Last unit does not have enough bits). Here is the code snippet (I could not cut and paste, so it may not be perfect):
.
.
final int LARGE_BUF = 1024;
byte[] inBuf = new byte[LARGE_BUF];
try (InputStream inputStream = new FileInputStream(inFile); OutputStream outStream = new FileOutputStream(outFile)) {
    for (int len; (len = inputStream.read(inBuf)) > 0; ) {
        String out = new String(inBuf, 0, len);
        outStream.write(Base64.getMimeDecoder().decode(out.getBytes()));
    }
}
For instance, for my sample input file, if LARGE_BUF is 1024, output file is 4 bytes too small, if 2*1024, I get the exception mentioned above, if 7*1024, it works correctly. Grateful for any ideas. Thank you.
First, you are converting bytes into a String, then immediately back into bytes. So, remove the use of String entirely.
Second, base64 encoding turns each sequence of three bytes into four bytes, so when decoding, you need four bytes to properly decode three bytes of original data. It is not safe to create a new decoder for each arbitrarily read sequence of bytes, which may or may not have a length which is an exact multiple of four.
Finally, Base64.Decoder has a wrap(InputStream) method which makes this considerably easier:
try (InputStream inputStream = Base64.getDecoder().wrap(
        new BufferedInputStream(
            Files.newInputStream(Paths.get(inFile))))) {
    Files.copy(inputStream, Paths.get(outFile));
}
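One caveat, assuming the input really was produced with MIME-style line breaks (as the getMimeDecoder() call in the question suggests): the basic decoder rejects line separators, so in that case wrap with the MIME decoder instead:
try (InputStream inputStream = Base64.getMimeDecoder().wrap(
        new BufferedInputStream(Files.newInputStream(Paths.get(inFile))))) {
    Files.copy(inputStream, Paths.get(outFile));
}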

computing checksum for an input stream

I need to compute a checksum for an InputStream (or a file) to check whether the file contents have changed. The code below generates a different value for each execution even though I'm using the same stream. Can someone help me do this right?
public class CreateChecksum {
    public static void main(String args[]) {
        String test = "Hello world";
        ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
    }
    public static String checkSum(InputStream fis) {
        String checksum = null;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Using MessageDigest update() method to provide input
            byte[] buffer = new byte[8192];
            int numOfBytesRead;
            while ((numOfBytesRead = fis.read(buffer)) > 0) {
                md.update(buffer, 0, numOfBytesRead);
            }
            byte[] hash = md.digest();
            checksum = new BigInteger(1, hash).toString(16); // don't use this, truncates leading zeros
        } catch (Exception ex) {
        }
        return checksum;
    }
}
You're using the same stream object for both calls - after you've called checkSum once, the stream will not have any more data to read, so the second call will be creating a hash of an empty stream. The simplest approach would be to create a new stream each time:
String test = "Hello world";
byte[] bytes = test.getBytes(StandardCharsets.UTF_8);
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
Note that your exception handling in checkSum really needs fixing, along with your hex conversion...
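One possible shape for that fix, a minimal sketch that propagates the exceptions to the caller and uses a zero-padded hex conversion (%032x keeps the leading zeros that toString(16) drops); it needs java.security.MessageDigest, java.math.BigInteger and java.io:
public static String checkSum(InputStream in) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    int numOfBytesRead;
    while ((numOfBytesRead = in.read(buffer)) > 0) {
        md.update(buffer, 0, numOfBytesRead);
    }
    // %032x pads with leading zeros so the digest is always 32 hex characters
    return String.format("%032x", new BigInteger(1, md.digest()));
}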
Check out the code in org/apache/commons/codec/digest/DigestUtils.html
Changes to a file are relatively easy to monitor: File.lastModified() changes each time the file is changed (and closed). There is even a built-in API to get notified of selected changes to the file system: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
The hashCode of an InputStream is not suitable for detecting changes (there is no definition of how an InputStream should calculate its hashCode; most likely it uses Object.hashCode, meaning the hashCode doesn't depend on anything but object identity).
Building an MD5 hash the way you are trying to works, but it requires reading the entire file every time. That is quite a performance killer if the file is large and/or you are watching multiple files.
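A minimal sketch of that notification API (java.nio.file.WatchService), watching a single directory for modifications; the directory path here is just an example:
import java.nio.file.*;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/watched");                // example directory, adjust as needed
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            WatchKey key = watcher.take();                   // blocks until something in dir changes
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println("Modified: " + event.context());
            }
            key.reset();                                     // re-arm the key to keep watching
        }
    }
}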
You are confusing two related, but different responsibilities.
First you have a Stream which provides stuff to be read. Then you have a checksum on that stream; however, your implementation is a static method call, effectively divorcing it from a class, meaning that nobody has the responsibility for maintaining the checksum.
Try reworking your solution like so
public class ChecksumInputStream extends InputStream {
    private InputStream in;
    public ChecksumInputStream(InputStream source) {
        this.in = source;
    }
    @Override
    public int read() throws IOException {
        int value = in.read();
        updateChecksum(value);   // feed the byte into whatever checksum you maintain
        return value;
    }
    // and repeat for all the other read methods.
}
Note that now you only do one read, with the checksum calculator decorating the original input stream.
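For what it's worth, the JDK already ships a decorator along these lines: java.security.DigestInputStream updates a MessageDigest as the stream is read. A sketch of the same idea without writing your own class ("some.file" is just an example path):
public static String md5Of(String path) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = new DigestInputStream(new FileInputStream(path), md)) {
        byte[] buffer = new byte[8192];
        while (in.read(buffer) > 0) {
            // just reading; DigestInputStream updates md as a side effect
        }
    }
    return String.format("%032x", new BigInteger(1, md.digest()));
}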
The issue is that after you first read the InputStream, its position has reached the end. The quick way to resolve your issue is:
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));

How do you write a custom deflation dictionary and implement it

{"playerX":"64","playerY":"224","playerTotalHealth":"100","playerCurrentHealth":"100","playerTotalMana":"50","playerCurrentMana":"50","playerExp":"0","playerExpTNL":"20","playerLevel":"1","points":"0","strength":"1","dexterity":"1","constitution":"1","intelligence":"1","wisdom":"1","items":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"],"currentMapX":"0","currentMapY":"0","playerBody":"1","playerHair":"6","playerClothes":"7"}
This is the String I am trying to compress.
The names of each variable will never change, so I want to add those to a dictionary (this is a JSON object).
There are a lot of things I could put into a dictionary, such as
"playerX":
"playerY":
I am trying to compress this to be as small as I can get it.
I just don't know how to implement the dictionary. I know that I have to use a byte[], but how do I separate words in a byte[]?
Currently the code I provided below compresses it to a length of 253 from 494. I want to get it as small as I can. Since it is a small String, I would rather have more compression than speed.
You don't have to solve it for me, but hints and sources on what I can do to make this string as small as possible would be appreciated.
public static void main(String[] args)
{
    deflater("String");
}
public static String deflater(String str)
{
    System.out.println("Original: " + str + ":End");
    System.out.println("Length: " + str.length());
    byte[] input = str.getBytes();
    Deflater d = new Deflater();
    d.setInput(input);
    d.setLevel(1);
    d.finish();
    ByteArrayOutputStream dbos = new ByteArrayOutputStream(input.length);
    byte[] buffer = new byte[1024];
    while (d.finished() == false)
    {
        int bytesCompressed = d.deflate(buffer);
        System.out.println("Total Bytes: " + bytesCompressed);
        dbos.write(buffer, 0, bytesCompressed);
    }
    try
    {
        dbos.close();
    }
    catch (IOException e1)
    {
        e1.printStackTrace();
        System.exit(0);
    }
    // Dictionary implementation required!
    byte[] compressedArray = dbos.toByteArray();
    String compStr = new String(compressedArray);
    System.out.println("Compressed: " + compStr + ":End");
    System.out.println("Length: " + compStr.length());
    return null;
}
The dictionary is simply your common strings concatenated to make a sequence of bytes less than or equal to 32K in length. You do not need to separate words. There is no structure to the dictionary. It is simply used as a source of data to match the current string to. You should put the more common strings at the end of the dictionary, since it takes fewer bits to encode shorter distances back.
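A minimal sketch of how the preset dictionary gets wired up on both sides with java.util.zip; the dictionary contents here are just an illustration (in practice you would concatenate your common JSON keys, most frequent ones last):
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictionaryDemo {
    // Hypothetical dictionary: the JSON keys you expect, most common ones last.
    static final byte[] DICT = ("\"playerTotalMana\":\"playerCurrentMana\":\"playerX\":\"playerY\":")
            .getBytes(StandardCharsets.UTF_8);

    public static void main(String[] args) throws Exception {
        byte[] input = "{\"playerX\":\"64\",\"playerY\":\"224\"}".getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(DICT);              // set before any data is compressed
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];               // plenty for this tiny input; loop on finished() otherwise
        int compressedLen = deflater.deflate(buf);
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(buf, 0, compressedLen);
        byte[] out = new byte[1024];
        int n = inflater.inflate(out);             // returns 0 because the dictionary is needed first
        if (inflater.needsDictionary()) {
            inflater.setDictionary(DICT);          // must be the exact same bytes used when compressing
            n = inflater.inflate(out);
        }
        inflater.end();
        System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
    }
}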

Compressing a string and storing it on a database as string for later decompression

I have a huge string that I need to cache somewhere, and since I cannot write to a file, my only option is to store it in the database as text. More specifically, the CLOB I have holds a JSON document, and I'm placing the compressed string under a certain key of that JSON object.
I'm compressing the strings, but somewhere along the string manipulation something happens that doesn't allow me to decompress the data, so I'm wondering if I should encode the data to Base64, even though that will cost some of the compression.
What could I do to ensure I can store the compressed string in the database so I can later fetch it?
I cannot change the database, so I'm stuck with that CLOB field
These are my compression functions:
public static String compress(String text) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
        OutputStream out = new DeflaterOutputStream(baos);
        out.write(text.getBytes("UTF-8"));
        out.close();
    } catch (IOException e) {
        //ooops
    }
    return baos.toString();
}
public static String decompress(String bytes) {
    InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes.getBytes()));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) > 0)
            baos.write(buffer, 0, len);
        return new String(baos.toByteArray(), "UTF-8");
    } catch (IOException e) {
        //ooops
        return null;
    }
}
As you found out, you can't store binary data in a CLOB without some corruption, so encoding to text will be required.
Base64 will, on average, add 33% to the size of your binary data. So you will lose some of the compression, but if your compression ratio is better than 25% (often easy to achieve with particular types of text strings), then compression followed by Base64 encoding can still provide a net storage gain. Lots of CPU use, though.
You can't convert arbitrary binary data to a String without breaking it. As you've already stated, if you want to store the data in a clob, you need to base64 encode the data (or use some other valid binary to text encoding).
Have you thought of other solutions, such as using memcached or other caching system? Or do you really want to mess around with compression?
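A minimal sketch of the compress-then-Base64 approach described above, reworking the question's two methods so the binary deflate output survives the trip through a CLOB-safe String (method names kept from the question; needs java.util.Base64, java.util.zip and java.nio.charset.StandardCharsets):
public static String compress(String text) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (OutputStream out = new DeflaterOutputStream(baos)) {
        out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    // Base64 keeps the binary deflate output intact inside a text column
    return Base64.getEncoder().encodeToString(baos.toByteArray());
}
public static String decompress(String stored) throws IOException {
    byte[] compressed = Base64.getDecoder().decode(stored);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) > 0) {
            baos.write(buffer, 0, len);
        }
    }
    return new String(baos.toByteArray(), StandardCharsets.UTF_8);
}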

Out of memory when encoding file to base64

Using Base64 from Apache commons
public byte[] encode(File file) throws FileNotFoundException, IOException {
    byte[] encoded;
    try (FileInputStream fin = new FileInputStream(file)) {
        byte fileContent[] = new byte[(int) file.length()];
        fin.read(fileContent);
        encoded = Base64.encodeBase64(fileContent);
    }
    return encoded;
}
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.codec.binary.BaseNCodec.encode(BaseNCodec.java:342)
at org.apache.commons.codec.binary.Base64.encodeBase64(Base64.java:657)
at org.apache.commons.codec.binary.Base64.encodeBase64(Base64.java:622)
at org.apache.commons.codec.binary.Base64.encodeBase64(Base64.java:604)
I'm making a small app for a mobile device.
You cannot just load the whole file into memory, like here:
byte fileContent[] = new byte[(int) file.length()];
fin.read(fileContent);
Instead, load the file chunk by chunk and encode it in parts. Base64 is a simple encoding: it is enough to load 3 bytes and encode them at a time (this will produce 4 bytes after encoding). For performance reasons, consider loading multiples of 3 bytes, e.g. 3000 bytes should be just fine. Also consider buffering the input file.
An example:
byte[] fileContent = new byte[3000];
try (FileInputStream fin = new FileInputStream(file)) {
    int read;
    while ((read = fin.read(fileContent)) >= 0) {
        // encode only the bytes actually read (Arrays is java.util.Arrays)
        Base64.encodeBase64(Arrays.copyOf(fileContent, read));
    }
}
Note that you cannot simply append the results of Base64.encodeBase64() to an encoded byte array. Actually, it is not loading the file but encoding it to Base64 that causes the out-of-memory problem. This is understandable, because the Base64 version is bigger (and you already have a file occupying a lot of memory).
Consider changing your method to:
public void encode(File file, OutputStream base64OutputStream)
and sending Base64-encoded data directly to the base64OutputStream rather than returning it.
UPDATE: Thanks to @StephenC I developed a much simpler version:
public void encode(File file, OutputStream base64OutputStream) throws IOException {
    InputStream is = new FileInputStream(file);
    OutputStream out = new Base64OutputStream(base64OutputStream);
    IOUtils.copy(is, out);
    is.close();
    out.close();
}
It uses Base64OutputStream that translates input to Base64 on-the-fly and IOUtils class from Apache Commons IO.
Note: you must close the FileInputStream and Base64OutputStream explicitly; closing the Base64OutputStream is what writes the trailing = padding if it is required. Buffering is handled by IOUtils.copy().
Either the file is too big, or your heap is too small, or you've got a memory leak.
If this only happens with really big files, put something into your code to check the file size and reject files that are unreasonably big.
If this happens with small files, increase your heap size by using the -Xmx command line option when you launch the JVM. (If this is in a web container or some other framework, check the documentation on how to do it.)
If the problem recurs, especially with small files, the chances are that you've got a memory leak.
The other point that should be made is that your current approach entails holding two complete copies of the file in memory. You should be able to reduce the memory usage, though you'll typically need a stream-based Base64 encoder to do this. (It depends on which flavor of the base64 encoding you are using ...)
This page describes a stream-based Base64 encoder/decoder library, and includes links to some alternatives.
Well, do not do it for the whole file at once.
Base64 works on 3 bytes at a time, so you can read your file in batches of "multiple of 3" bytes, encode them and repeat until you finish the file:
// the base64 encoding - acceptable estimation of encoded size
StringBuilder sb = new StringBuilder((int) (file.length() / 3 * 4));
FileInputStream fin = null;
try {
    fin = new FileInputStream("some.file");
    // Max size of buffer (a multiple of 3)
    int bSize = 3 * 512;
    // Buffer
    byte[] buf = new byte[bSize];
    // Actual number of bytes read
    int len = 0;
    while ((len = fin.read(buf)) != -1) {
        byte[] encoded = Base64.encodeBase64(Arrays.copyOf(buf, len));
        // Although you might want to write the encoded bytes to another
        // stream, otherwise you'll run into the same problem again.
        sb.append(new String(encoded));
    }
} catch (IOException e) {
    if (null != fin) {
        fin.close();
    }
}
String base64EncodedFile = sb.toString();
You are not reading the whole file, just the first few kb. The read method returns how many bytes were actually read. You should call read in a loop until it returns -1 to be sure that you have read everything.
The file is too big for both it and its base64 encoding to fit in memory. Either
process the file in smaller pieces or
increase the memory available to the JVM with the -Xmx switch, e.g.
java -Xmx1024M YourProgram
This works well for uploading larger images, by scaling the bitmap down before encoding:
bitmap=Bitmap.createScaledBitmap(bitmap, 100, 100, true);
ByteArrayOutputStream stream = new ByteArrayOutputStream();
bitmap.compress(Bitmap.CompressFormat.PNG, 100, stream); //compress to which format you want.
byte [] byte_arr = stream.toByteArray();
String image_str = Base64.encodeBytes(byte_arr);
Well, it looks like your file is too large to keep the multiple copies necessary for an in-memory Base64 encoding in the available heap memory at the same time. Given that this is for a mobile device, it's probably not possible to increase the heap, so you have two options:
Make the file smaller (much smaller), or
do it in a stream-based way, so that you're reading from an InputStream one small part of the file at a time, encoding it and writing it to an OutputStream, without ever keeping the entire file in memory.
In the manifest, inside the application tag, add the following:
android:largeHeap="true"
It worked for me.
Java 8 added Base64 methods, so Apache Commons is no longer needed to encode large files.
public static void encodeFileToBase64(String inputFile, String outputFile) {
    try (OutputStream out = Base64.getEncoder().wrap(new FileOutputStream(outputFile))) {
        Files.copy(Paths.get(inputFile), out);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
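A sketch of the decoding direction along the same lines, using the matching Decoder.wrap (switch to getMimeDecoder() if the encoded input contains line breaks):
public static void decodeBase64ToFile(String inputFile, String outputFile) {
    try (InputStream in = Base64.getDecoder().wrap(
            new BufferedInputStream(new FileInputStream(inputFile)))) {
        Files.copy(in, Paths.get(outputFile), StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}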
