How can I easily compress and decompress Strings to/from byte arrays? - java

I have some strings that are roughly 10K characters each. There is plenty of repetition in them. They are serialized JSON objects. I'd like to easily compress them into a byte array, and uncompress them from a byte array.
How can I most easily do this? I'm looking for methods so I can do the following:
String original = "....long string here with 10K characters...";
byte[] compressed = StringCompressor.compress(original);
String decompressed = StringCompressor.decompress(compressed);
assert(original.equals(decompressed));

You can try
import java.io.*;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

enum StringCompressor {
    ; // no instances: the enum is used as a utility class

    public static byte[] compress(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            // DeflaterOutputStream compresses everything written to it with zlib
            OutputStream out = new DeflaterOutputStream(baos);
            out.write(text.getBytes("UTF-8"));
            out.close();
        } catch (IOException e) {
            throw new AssertionError(e); // in-memory streams should never throw
        }
        return baos.toByteArray();
    }

    public static String decompress(byte[] bytes) {
        InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) > 0)
                baos.write(buffer, 0, len);
            return new String(baos.toByteArray(), "UTF-8");
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }
}

Peter Lawrey's answer can be simplified a bit by using InflaterOutputStream in the decompress method:
public static String decompress(byte[] bytes) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try {
        // InflaterOutputStream decompresses everything written to it,
        // so no manual read loop is needed
        OutputStream out = new InflaterOutputStream(baos);
        out.write(bytes);
        out.close();
        return new String(baos.toByteArray(), "UTF-8");
    } catch (IOException e) {
        throw new AssertionError(e);
    }
}

I made a library to solve the problem of compressing generic Strings (especially short ones).
It tries to compress the String using various algorithms (plain UTF-8, a 5-bit encoding for Latin letters, Huffman coding, gzip for long Strings) and chooses the one with the shortest result (in the worst case it falls back to plain UTF-8, so you never risk wasting space).
I hope it may be useful; here's the link:
https://github.com/lithedream/lithestring
EDIT: I realized that your Strings are always "long"; my library defaults to gzip at those sizes, so I fear I cannot do better for you.

Related

How can I create a truststore from a base64 encoded String?

I have a String that is encoded in base64. I need to take this string, decode it, and create a truststore file, but when I do that, the resulting file is not valid. Here is my code:
public static void buildFile() {
    String exampleofencoded = "asdfasdfasdfadfa";
    File file = new File("folder/file.jks");
    try (FileOutputStream fos = new FileOutputStream(file);
         BufferedOutputStream bos = new BufferedOutputStream(fos);
         DataOutputStream dos = new DataOutputStream(bos)) {
        Base64.Decoder decoder = Base64.getDecoder();
        String decodedString = new String(decoder.decode(exampleofencoded));
        dos.writeBytes(decodedString);
    } catch (IOException e) {
        System.out.println("Error creating file");
    } catch (NullPointerException e) {
        System.out.println(e.getMessage());
    }
}
The problem is two-fold.
You're converting a byte array to a String, which is a lossy operation for arbitrary binary data in most character sets (except perhaps ISO-8859-1).
You're using DataOutputStream, which is not a generic output stream, but is intended for a specific serialization format of primitive types. Specifically, its writeBytes method comes with an important caveat ("Each character in the string is written out, in sequence, by discarding its high eight bits."), which is one more reason why this will likely only work with ISO-8859-1.
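To see that caveat in action, a tiny sketch (a hypothetical fragment; writeBytes declares IOException):
ByteArrayOutputStream sink = new ByteArrayOutputStream();
try (DataOutputStream dos = new DataOutputStream(sink)) {
    dos.writeBytes("€"); // '€' is U+20AC; writeBytes keeps only the low byte, 0xAC
}
System.out.println(sink.size()); // prints 1: the character was mangled to a single byte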
Instead, write the byte array directly to the file:
public static void buildFile() {
    String exampleofencoded = "asdfasdfasdfadfa";
    File file = new File("folder/file.jks");
    try (OutputStream fos = Files.newOutputStream(file.toPath())) {
        Base64.Decoder decoder = Base64.getDecoder();
        byte[] decodedbytes = decoder.decode(exampleofencoded);
        fos.write(decodedbytes);
    } catch (IOException e) {
        System.out.println("Error creating file");
    }
}
As an aside, you shouldn't catch NullPointerException in your code; it is almost always a problem that can be prevented by careful programming and/or validation of inputs. I would usually also advise against catching the IOException here and only printing a message. It is probably better to propagate that exception and let the caller handle it.
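For example, a minimal sketch of the propagating variant (same hypothetical file path as the question):
public static void buildFile() throws IOException {
    String exampleofencoded = "asdfasdfasdfadfa";
    byte[] decodedbytes = Base64.getDecoder().decode(exampleofencoded);
    // Let IOException bubble up to the caller instead of printing and continuing.
    Files.write(Paths.get("folder/file.jks"), decodedbytes);
}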

Java emulation of MySQL COMPRESS and DECOMPRESS functions

I'd like to insert some byte data into a MySQL VARBINARY column. The data is large, so I want to store it in a compressed form.
I'm using Percona MySQL 5.6. I would like to emulate MySQL's COMPRESS function in Java and then insert the result into the database.
I would like to use MySQL's DECOMPRESS function to access this data.
Is there a way to do it?
I have tried using the standard java.util.zip package, but it doesn't work.
Edit: phrased differently, what is the Java equivalent of PHP's gzcompress (ZLIB)?
The result of COMPRESS is a four-byte little-endian length of the uncompressed data, followed by a zlib stream containing the compressed data.
You can use the Deflater class in Java to produce the zlib stream, then precede the result with the four-byte length.
Solution: an implementation of MySQL COMPRESS and DECOMPRESS.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Compress a byte array using zlib (Deflater)
public static byte[] compressZLib(byte[] data) {
    Deflater deflater = new Deflater();
    deflater.setInput(data);
    deflater.finish();
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length);
    byte[] buffer = new byte[1024];
    while (!deflater.finished()) {
        int count = deflater.deflate(buffer); // number of compressed bytes produced
        outputStream.write(buffer, 0, count);
    }
    deflater.end(); // release native memory
    return outputStream.toByteArray();
}

// MySQL COMPRESS: four-byte little-endian uncompressed length, then the zlib stream
public static byte[] compressMySQL(byte[] data) {
    byte[] lengthBeforeCompression = ByteBuffer.allocate(Integer.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN).putInt(data.length).array();
    ByteArrayOutputStream resultStream = new ByteArrayOutputStream();
    try {
        resultStream.write(lengthBeforeCompression);
        resultStream.write(compressZLib(data));
    } catch (IOException e) {
        throw new AssertionError(e); // in-memory streams should never throw
    }
    return resultStream.toByteArray();
}

// Decompress a zlib stream (Inflater)
public static byte[] decompressZLib(byte[] data) {
    Inflater inflater = new Inflater();
    inflater.setInput(data);
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream(data.length);
    byte[] buffer = new byte[1024];
    try {
        while (!inflater.finished()) {
            int count = inflater.inflate(buffer);
            outputStream.write(buffer, 0, count);
        }
    } catch (DataFormatException e) {
        throw new IllegalArgumentException("Invalid zlib data", e);
    }
    inflater.end(); // release native memory
    return outputStream.toByteArray();
}

// MySQL DECOMPRESS
public static byte[] decompressSQLCompression(byte[] input) {
    // skip the first four bytes (the uncompressed length) and inflate the rest
    byte[] data = Arrays.copyOfRange(input, 4, input.length);
    return decompressZLib(data);
}
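A quick round-trip check (a minimal sketch; the sample payload is made up):
byte[] original = "some repetitive payload, some repetitive payload".getBytes(StandardCharsets.UTF_8);
byte[] stored = compressMySQL(original);            // what you INSERT into the VARBINARY column
byte[] restored = decompressSQLCompression(stored); // what MySQL's DECOMPRESS would also yield
assert Arrays.equals(original, restored);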

What's the difference between getBytes and serialize with String?

Just as the title says, I can't tell the difference between getBytes() and the serialization mechanism for String. Below is a test comparing getBytes() and serialization:
public void testUTF() {
    byte[] data = SerializeUtil.serUTFString(str);
    System.out.println(data.length);
    System.out.println(str.getBytes().length);
}
Here is SerializeUtil:
public static byte[] serUTFString(String data) {
    byte[] result = null;
    ObjectOutputStream oos = null;
    ByteArrayOutputStream byteArray = new ByteArrayOutputStream();
    try {
        oos = new ObjectOutputStream(byteArray);
        try {
            oos.writeUTF(data);
            oos.flush();
            result = byteArray.toByteArray();
        } finally {
            oos.close();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
When I store str in Redis, both work correctly, but getBytes() seems more efficient. Since both return a byte array from a String, what's the difference? Is serialization necessary?
String.getBytes() returns a byte array representing the string's characters in the default encoding. ObjectOutputStream.writeUTF writes the string length followed by the bytes in modified UTF-8 format; see the java.io.DataOutput API. The ObjectOutputStream itself also writes a stream header, which is why the serialized form is larger.
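To see the difference concretely, a small sketch (exact numbers can vary; for ASCII text the serialized form is roughly the raw length plus the stream header, a block-data marker, and the two-byte length written by writeUTF):
String str = "hello";
byte[] raw = str.getBytes(StandardCharsets.UTF_8); // 5 bytes: just the encoded characters
byte[] ser = SerializeUtil.serUTFString(str);      // ~13 bytes: header + length prefix + modified UTF-8
System.out.println(raw.length + " vs " + ser.length);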

GZIP decompress string and byte conversion

I have a problem with this code:
private static String compress(String str)
{
    String str1 = null;
    ByteArrayOutputStream bos = null;
    try
    {
        bos = new ByteArrayOutputStream();
        BufferedOutputStream dest = null;
        byte b[] = str.getBytes();
        GZIPOutputStream gz = new GZIPOutputStream(bos, b.length);
        gz.write(b, 0, b.length);
        bos.close();
        gz.close();
    }
    catch (Exception e) {
        System.out.println(e);
        e.printStackTrace();
    }
    byte b1[] = bos.toByteArray();
    return new String(b1);
}
private static String deCompress(String str)
{
    String s1 = null;
    try
    {
        byte b[] = str.getBytes();
        InputStream bais = new ByteArrayInputStream(b);
        GZIPInputStream gs = new GZIPInputStream(bais);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int numBytesRead = 0;
        byte[] tempBytes = new byte[6000];
        try
        {
            while ((numBytesRead = gs.read(tempBytes, 0, tempBytes.length)) != -1)
            {
                baos.write(tempBytes, 0, numBytesRead);
            }
            s1 = new String(baos.toByteArray());
            s1 = baos.toString();
        }
        catch (ZipException e)
        {
            e.printStackTrace();
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    return s1;
}
public String test() throws Exception
{
    String str = "teststring";
    String cmpr = compress(str);
    String dcmpr = deCompress(cmpr);
    return dcmpr;
}
This code throws java.io.IOException: unknown format (magic number ef1f) at
GZIPInputStream gs = new GZIPInputStream(bais);
It turns out that the round trip through new String(b1) and str.getBytes() "spoils" the bytes: after the conversion there are more bytes than before. If you avoid the conversion to a String and work with the bytes directly, everything works. Sorry for my English.
public String unZip(String zipped) throws DataFormatException, IOException {
    String charset = "WINDOWS-1251"; // the encoding used by the server response
    byte[] bytes = zipped.getBytes(charset);
    Inflater decompressed = new Inflater();
    decompressed.setInput(bytes);
    byte[] result = new byte[100];
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int count;
    while ((count = decompressed.inflate(result)) != 0)
        buffer.write(result, 0, count); // write only the bytes actually inflated
    decompressed.end();
    return new String(buffer.toByteArray(), charset);
}
I'm using this function to decompress the server response. Thanks for the help.
You have two problems:
You're using the default character encoding to convert the original string into bytes. That will vary by platform. It's better to specify an encoding - UTF-8 is usually a good idea.
You're trying to represent the opaque binary data of the result of the compression as a string by just calling the String(byte[]) constructor. That constructor is only meant for data which is encoded text... which this isn't. You should use base64 for this. There's a public domain base64 library which makes this easy. (Alternatively, don't convert the compressed data to text at all - just return a byte array.)
Fundamentally, you need to understand how different text and binary data are - when you want to convert between the two, you should do so carefully. If you want to represent "non text" binary data (i.e. bytes which aren't the direct result of encoding text) in a string you should use something like base64 or hex. When you want to encode a string as binary data (e.g. to write some text to disk) you should carefully consider which encoding to use. If another program is going to read your data, you need to work out what encoding it expects - if you have full control over it yourself, I'd usually go for UTF-8.
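For instance, a minimal sketch of the base64 route (shown with java.util.Base64; compress and deCompress here are assumed to be byte-array variants of the methods above, as the answer recommends):
byte[] compressed = compress("some long JSON string");            // byte[] out, not String
String safeText = Base64.getEncoder().encodeToString(compressed); // text-safe representation

byte[] roundTripped = Base64.getDecoder().decode(safeText);
String original = deCompress(roundTripped);                       // back to the original string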
Additionally, the exception handling in your code is poor:
You should almost never catch Exception; catch more specific exceptions
You shouldn't just catch an exception and continue as if it had never happened. If you can't really handle the exception and still complete your method successfully, you should let the exception bubble up the stack (or possibly catch it and wrap it in a more appropriate exception type for your abstraction)
When you GZIP compress data, you always get binary data. This data cannot be converted into a string, as it is not valid character data (in any encoding).
So your compress method should return a byte array, and your decompress method should take a byte array as its parameter.
Furthermore, I recommend using an explicit encoding when you convert the string into a byte array before compression, and again when you turn the decompressed data back into a string.
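A minimal sketch of the signatures Codo describes (GZIP with an explicit UTF-8 encoding; the method names are illustrative):
static byte[] compress(String text) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(text.getBytes(StandardCharsets.UTF_8)); // explicit encoding in
    }
    return bos.toByteArray(); // the opaque binary data stays a byte array
}

static String deCompress(byte[] compressed) throws IOException {
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
         ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
        byte[] buf = new byte[8192];
        int len;
        while ((len = gz.read(buf)) > 0) {
            bos.write(buf, 0, len);
        }
        return new String(bos.toByteArray(), StandardCharsets.UTF_8); // explicit encoding out
    }
}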
When you GZIP compress data, you always get binary data. This data cannot be converted into a string, as it is not valid character data (in any encoding).
Codo is right; thanks a lot for enlightening me. I was trying to decompress a string (converted from the binary data). What I changed was to use InflaterInputStream directly on the input stream returned by my HTTP connection. (My app was retrieving a large JSON string.)
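Something along these lines (a sketch; the URL is hypothetical):
HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/api").openConnection();
try (InputStream in = new InflaterInputStream(conn.getInputStream());
     ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
    byte[] buf = new byte[8192];
    int len;
    while ((len = in.read(buf)) > 0) {
        baos.write(buf, 0, len);
    }
    // Decode the decompressed bytes directly; the binary data never passes through a String.
    String json = new String(baos.toByteArray(), StandardCharsets.UTF_8);
}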

why initialize this byte array to 1024

I'm relatively new to Java and I'm attempting to write a simple Android app. I have a large text file with about 3500 lines in the assets folder of my application, and I need to read it into a string. I found a good example of how to do this, but I have a question about why the byte array is initialized to 1024. Wouldn't I want to initialize it to the length of my text file? Also, wouldn't I want to use char, not byte? Here is the code:
private void populateArray() {
    AssetManager assetManager = getAssets();
    InputStream inputStream = null;
    try {
        inputStream = assetManager.open("3500LineTextFile.txt");
    } catch (IOException e) {
        Log.e("IOException populateArray", e.getMessage());
    }
    String s = readTextFile(inputStream);
    // Add more code here to populate array from string
}

private String readTextFile(InputStream inputStream) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    byte buf[] = new byte[1024];
    int len;
    try {
        while ((len = inputStream.read(buf)) != -1) {
            outputStream.write(buf, 0, len);
        }
        outputStream.close();
        inputStream.close();
    } catch (IOException e) {
        Log.e("IOException readTextFile", e.getMessage());
    }
    return outputStream.toString();
}
EDIT: Based on your suggestions, I tried this approach. Is it any better? Thanks.
private void populateArray() {
    AssetManager assetManager = getAssets();
    InputStream inputStream = null;
    InputStreamReader iStreamReader = null;
    try {
        inputStream = assetManager.open("List.txt");
        iStreamReader = new InputStreamReader(inputStream, "UTF-8");
    } catch (IOException e) {
        Log.e("IOException populateArray", e.getMessage());
    }
    String s = readTextFile(iStreamReader);
    // more code here
}

private String readTextFile(InputStreamReader inputStreamReader) {
    StringBuilder sb = new StringBuilder();
    char buf[] = new char[2048];
    int read;
    try {
        do {
            read = inputStreamReader.read(buf, 0, buf.length);
            if (read > 0) {
                sb.append(buf, 0, read);
            }
        } while (read >= 0);
    } catch (IOException e) {
        Log.e("IOException readTextFile", e.getMessage());
    }
    return sb.toString();
}
This example is not good at all. It's full of bad practices (hiding exceptions, not closing streams in finally blocks, not specifying an explicit encoding, etc.). It uses a 1024-byte buffer because it doesn't have any way of knowing the length of the input stream.
Read the Java IO tutorial to learn how to read text from a file.
You are reading the file into a buffer of 1024 bytes.
Then those 1024 bytes are written to the outputStream.
This process repeats until the whole file has been read into the outputStream.
As JB Nizet mentioned, the example is full of bad practices.
Wouldn't I want to initialize it to the length of my text file? Also, wouldn't I want to use char, not byte?
Yes, and yes ... and as other answers have said, you've picked an example with a number of errors in it.
However, there is a theoretical problem with doing both, i.e. setting the buffer length to the file length and using a character buffer rather than a byte buffer. The problem is that the file size is measured in bytes, but the size of the buffer needs to be measured in characters. This is normally fine, but it is theoretically possible that you will need more characters than the file size in bytes; e.g. if the input file used a 6-bit character set and packed 4 characters into 3 bytes.
To read from a file I usually use a Scanner and a StringBuilder:
Scanner scan = new Scanner(new BufferedInputStream(new FileInputStream(filename)), "UTF-8");
StringBuilder sb = new StringBuilder();
while (scan.hasNextLine()) {
    sb.append(scan.nextLine());
    sb.append("\n");
}
scan.close();
return sb.toString();
Try to throw your exceptions instead of swallowing them. The caller must know there was a problem reading your file.
Edit: Also note that using a BufferedInputStream is important. Otherwise it will try to read byte by byte, which can be slow.