LZ4 is not fast compared to Deflater when compressing a string - Java

I tried to use LZ4 compression to compress a string object, but the results are not in favour of LZ4.
Here is the program I tried:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public class CompressionDemo {

    public static byte[] compressGZIP(String data) throws IOException {
        long start = System.nanoTime();
        ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
        GZIPOutputStream gzip = new GZIPOutputStream(bos);
        gzip.write(data.getBytes());
        gzip.close();
        byte[] compressed = bos.toByteArray();
        bos.close();
        System.out.println(System.nanoTime() - start);
        return compressed;
    }

    public static byte[] compressLZ4(String data) throws IOException {
        long start = System.nanoTime();
        LZ4Factory factory = LZ4Factory.fastestJavaInstance();
        LZ4Compressor compressor = factory.highCompressor();
        byte[] result = compressor.compress(data.getBytes());
        System.out.println(System.nanoTime() - start);
        return result;
    }

    public static byte[] compressDeflater(String stringToCompress) {
        long start = System.nanoTime();
        byte[] returnValues = null;
        try {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(stringToCompress.getBytes("UTF-8"));
            deflater.finish();
            byte[] bytesCompressed = new byte[Short.MAX_VALUE];
            int numberOfBytesAfterCompression = deflater.deflate(bytesCompressed);
            returnValues = new byte[numberOfBytesAfterCompression];
            System.arraycopy(bytesCompressed, 0, returnValues, 0, numberOfBytesAfterCompression);
        } catch (Exception uee) {
            uee.printStackTrace();
        }
        System.out.println(System.nanoTime() - start);
        return returnValues;
    }

    public static void main(String[] args) throws IOException, DataFormatException {
        System.out.println(
                "..it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required."
                        .getBytes().length);
        byte[] arr = compressLZ4(
                "..it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.");
        System.out.println(arr.length);
    }
}
I have collected statistics as above, but LZ4 is not as fast as stated.
Please let me know where I am going wrong.

Your results are meaningless because the size before compression is too small. You are trying to measure the compression of a few thousand bytes at speeds over 100MB/s. The measurement is lost in the time taken by the JVM to warm up. Try again with an input file of several MBs. You should get numbers in line with my LZ4 implementation here: https://github.com/flanglet/kanzi/wiki/Compression-examples.
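In addition to a larger input, a micro-benchmark should warm up the JIT before timing. Note also that highCompressor() is the slow, high-ratio codec in lz4-java; the speed claims refer to fastCompressor(). A minimal sketch of a fairer measurement (the input file name is hypothetical):
import java.nio.file.Files;
import java.nio.file.Paths;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public class Lz4Bench {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("big-input.txt")); // several MB
        LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
        for (int i = 0; i < 10; i++) {   // warm up the JIT before timing
            compressor.compress(data);
        }
        long start = System.nanoTime();
        byte[] out = compressor.compress(data);
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d -> %d bytes in %.1f ms%n",
                data.length, out.length, elapsed / 1e6);
    }
}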

Related

How to write a very long string to a gzip file in Java

I have a very long string and want to write it to a gzip file.
I tried using GZIPOutputStream to write the gzip file,
but there is an exception when I use string.getBytes():
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
Here is my code. What should I do to write the file successfully?
public static void way1() throws IOException {
    String filePath = "foo";
    String content = "very large string";
    try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
         GZIPOutputStream gos = new GZIPOutputStream(os)) {
        gos.write(content.getBytes(StandardCharsets.UTF_8));
    }
}

public static void way2() throws IOException {
    String filePath = "foo";
    String content = "very large string";
    try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
         GZIPOutputStream gos = new GZIPOutputStream(os);
         WritableByteChannel fc = Channels.newChannel(gos)) {
        fc.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
    }
}
If you have ResultSet then try something like:
public static void string2Zipfile(ResultSet rs, int columnIndex, Path outputFile) throws SQLException, IOException {
    try (InputStream os = rs.getBinaryStream(columnIndex)) {
        try (GZIPOutputStream gos = new GZIPOutputStream(Files.newOutputStream(outputFile))) {
            os.transferTo(gos);
        }
    }
}
It seems that when you convert the String to byte[] (using content.getBytes(StandardCharsets.UTF_8)) it simply needs a lot of memory for the byte[]. Instead of converting the full String to byte[] at once, create a ByteBuffer from it using the selected encoding, and then write this ByteBuffer to the GZIPOutputStream; this way you will lower the needed memory by at least half. To create the ByteBuffer you can use:
Charset charset = StandardCharsets.UTF_8;
String content = "very large string";
ByteBuffer byteBuffer = charset.encode(content);
API of ByteBuffer: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html
And this might be useful: How to put the content of a ByteBuffer into an OutputStream?
Alternatively, you can also increase the amount of memory for the Java heap:
Increase heap size in Java
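For instance, the maximum heap can be raised with a JVM option when launching the program (the 4g value and the class name are purely illustrative):
java -Xmx4g YourMainClass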
All together it would be very similar to your way2, something like this (I didn't test it):
public static void way2() throws IOException {
    String filePath = "foo";
    String content = "very large string";
    try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
         GZIPOutputStream gos = new GZIPOutputStream(os);
         WritableByteChannel fc = Channels.newChannel(gos)) {
        Charset charset = StandardCharsets.UTF_8;
        ByteBuffer byteBuffer = charset.encode(content);
        fc.write(byteBuffer);
    }
}
I used @Chaosfire's suggestion and edited the code like this; it writes the file successfully:
public static void way1(List<String> originContent) throws IOException {
    String filePath = "foo";
    try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
         GZIPOutputStream gos = new GZIPOutputStream(os)) {
        Lists.partition(originContent, 1000000).stream().map(part -> String.join("\r\n", part)).forEach(str -> {
            try {
                gos.write(str.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
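Another option worth noting is to avoid materializing any large array at all by streaming the string through a Writer, which encodes and compresses chunk by chunk. A sketch (untested), named way3 to follow the convention above:
public static void way3(String content) throws IOException {
    String filePath = "foo";
    final int chunkSize = 8192;
    try (Writer writer = new OutputStreamWriter(
            new GZIPOutputStream(Files.newOutputStream(Paths.get(filePath))),
            StandardCharsets.UTF_8)) {
        // encode and compress one small chunk at a time,
        // so no full-size byte[] copy of the String is ever allocated
        for (int i = 0; i < content.length(); i += chunkSize) {
            writer.write(content, i, Math.min(chunkSize, content.length() - i));
        }
    }
}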

IllegalArgumentException using Java8 Base64 decoder

I wanted to use Base64.java to encode and decode files. Encoder.wrap(OutputStream) and Decoder.wrap(InputStream) worked but ran slowly. So I used the following code:
public static void decodeFile(String inputFileName,
                              String outputFileName)
        throws FileNotFoundException, IOException {
    Base64.Decoder decoder = Base64.getDecoder();
    InputStream in = new FileInputStream(inputFileName);
    OutputStream out = new FileOutputStream(outputFileName);
    byte[] inBuff = new byte[BUFF_SIZE]; // final int BUFF_SIZE = 1024;
    byte[] outBuff = null;
    while (in.read(inBuff) > 0) {
        outBuff = decoder.decode(inBuff);
        out.write(outBuff);
    }
    out.flush();
    out.close();
    in.close();
}
However, it always throws
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
at java.util.Base64$Decoder.decode0(Base64.java:704)
at java.util.Base64$Decoder.decode(Base64.java:526)
at Base64Coder.JavaBase64FileCoder.decodeFile(JavaBase64FileCoder.java:69)
...
After I changed final int BUFF_SIZE = 1024; into final int BUFF_SIZE = 3*1024;, the code worked. Since BUFF_SIZE is also used to encode the file, I believe there was something wrong with the file as encoded (1024 % 3 = 1, which means padding is added in the middle of the file).
Also, as @Jon Skeet and @Tagir Valeev mentioned, I should not ignore the return value from InputStream.read(). So I modified the code as below.
(However, I have to mention that the code does run much faster than using wrap(). I noticed the speed difference because I had coded and intensively used Base64.encodeFile()/decodeFile() long before JDK 8 was released. Now my fixed JDK 8 code runs as fast as my original code. So I do not know what is going on with wrap()...)
public static void decodeFile(String inputFileName,
                              String outputFileName)
        throws FileNotFoundException, IOException {
    Base64.Decoder decoder = Base64.getDecoder();
    InputStream in = new FileInputStream(inputFileName);
    OutputStream out = new FileOutputStream(outputFileName);
    byte[] inBuff = new byte[BUFF_SIZE];
    byte[] outBuff = null;
    int bytesRead = 0;
    while (true) {
        bytesRead = in.read(inBuff);
        if (bytesRead == BUFF_SIZE) {
            outBuff = decoder.decode(inBuff);
        } else if (bytesRead > 0) {
            byte[] tempBuff = new byte[bytesRead];
            System.arraycopy(inBuff, 0, tempBuff, 0, bytesRead);
            outBuff = decoder.decode(tempBuff);
        } else {
            out.flush();
            out.close();
            in.close();
            return;
        }
        out.write(outBuff);
    }
}
Special thanks to @Jon Skeet and @Tagir Valeev.
I strongly suspect that the problem is that you're ignoring the return value from InputStream.read, other than to check for the end of the stream. So this:
while (in.read(inBuff) > 0) {
    // This always decodes the *complete* buffer
    outBuff = decoder.decode(inBuff);
    out.write(outBuff);
}
should be
int bytesRead;
while ((bytesRead = in.read(inBuff)) > 0) {
    // Base64.Decoder has no decode(byte[], int, int) overload,
    // so copy only the bytes that were actually read
    outBuff = decoder.decode(Arrays.copyOf(inBuff, bytesRead));
    out.write(outBuff);
}
I wouldn't expect this to be any faster than using wrap though.
Try to use decoder.wrap(new BufferedInputStream(new FileInputStream(inputFileName))). With buffering it should be at least as fast as your manually crafted version.
As for why your code doesn't work: that's because the last chunk is likely to be shorter than 1024 bytes, but you try to decode the whole byte[] array. See @JonSkeet's answer for details.
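A minimal sketch of that wrap-based approach (untested; InputStream.transferTo requires Java 9+, and the class name is mine):
import java.io.*;
import java.util.Base64;

public class WrapCoder {
    // decode: wrap the buffered input stream; decoding happens as you read
    public static void decodeFile(String in, String out) throws IOException {
        try (InputStream is = Base64.getDecoder().wrap(
                     new BufferedInputStream(new FileInputStream(in)));
             OutputStream os = new BufferedOutputStream(new FileOutputStream(out))) {
            is.transferTo(os);
        }
    }

    // encode: wrap the buffered output stream; padding is written only on close
    public static void encodeFile(String in, String out) throws IOException {
        try (InputStream is = new BufferedInputStream(new FileInputStream(in));
             OutputStream os = Base64.getEncoder().wrap(
                     new BufferedOutputStream(new FileOutputStream(out)))) {
            is.transferTo(os);
        }
    }
}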
Well, I changed
final int BUFF_SIZE = 1024;
into
final int BUFF_SIZE = 1024 * 3;
It worked!
So I guess there is probably something wrong with the padding... I mean, when encoding the file (since 1024 % 3 = 1) there must be padding, and that might raise problems when decoding...
You should record the number of bytes you have read. Besides this,
you should make sure that your buffer size is divisible by 3, because in Base64 every 3 bytes of input produce 4 characters of output (64 is 2^6, and 3*8 equals 4*6). By doing this you can avoid padding problems (your output will not have a stray "=" in the middle).
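Concretely, the rule of thumb for fixed-size chunking looks like this (the sizes are illustrative):
final int ENCODE_BUFF_SIZE = 3 * 1024; // raw bytes: whole 3-byte groups, so "=" padding
                                       // can only appear at the true end of the output
final int DECODE_BUFF_SIZE = 4 * 1024; // Base64 text: whole 4-char units, so every
                                       // chunk is independently decodable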

How to define a method for reading all InputStreams including ZipInputStream?

I asked this once before and my post was deleted for not providing the code that uses the helper class. This time I have created a full test suite which shows the exact problem.
I am of the opinion that Java's ZipInputStream breaks the Liskov Substitution Principle (LSP) with regards to the InputStream abstract class. For ZipInputStream to be a subtype of InputStream, objects of type InputStream in a program must be replaceable with objects of type ZipInputStream without altering any of the desirable properties of that program (correctness, task performed, etc.).
The way in which LSP is violated here is in the read methods.
InputStream.read(byte[], int, int) states that it returns:
the total number of bytes read into the buffer, or -1 if there is no more data because the end of the stream has been reached.
The problem with ZipInputStream is that it has modified the meaning of a -1 return value. It states:
the actual number of bytes read, or -1 if the end of the entry is reached
(there is actually a hint of a similar problem with the available method in the Android documentation: http://developer.android.com/reference/java/util/zip/ZipInputStream.html)
Now for the code that demonstrates the problem. (This is a cut-down version of what I was actually trying to do, so please excuse any poor style, multithreading problems, or the fact that the stream is advanced, etc.)
Class that accepts any InputStream to generate a SHA1 of the stream:
public class StreamChecker {

    private byte[] lastHash = null;

    public boolean isDifferent(final InputStream inputStream) throws IOException {
        final byte[] hash = generateHash(inputStream);
        final byte[] temp = lastHash;
        lastHash = hash;
        return !Arrays.equals(temp, hash);
    }

    private byte[] generateHash(final InputStream inputStream) throws IOException {
        return DigestUtils.sha1(inputStream);
    }
}
Unit tests:
public class StreamCheckerTest {

    @Test
    public void testByteArrayInputStreamIsSame() throws IOException {
        final StreamChecker checker = new StreamChecker();
        final byte[] bytes = "abcdef".getBytes();
        try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
            Assert.assertTrue(checker.isDifferent(stream));
        }
        try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
            Assert.assertFalse(checker.isDifferent(stream));
        }
        // Passes
    }

    @Test
    public void testByteArrayInputStreamWithDifferentDataIsDifferent() throws IOException {
        final StreamChecker checker = new StreamChecker();
        byte[] bytes = "abcdef".getBytes();
        try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
            Assert.assertTrue(checker.isDifferent(stream));
        }
        bytes = "123456".getBytes();
        try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
            Assert.assertTrue(checker.isDifferent(stream));
        }
        // Passes
    }

    @Test
    public void testZipInputStreamIsSame() throws IOException {
        final StreamChecker checker = new StreamChecker();
        final byte[] bytes = "abcdef".getBytes();
        try (final ZipInputStream stream = createZipStream("test", bytes)) {
            Assert.assertTrue(checker.isDifferent(stream));
        }
        try (final ZipInputStream stream = createZipStream("test", bytes)) {
            Assert.assertFalse(checker.isDifferent(stream));
        }
        // Passes
    }

    @Test
    public void testZipInputStreamWithDifferentEntryDataIsDifferent() throws IOException {
        final StreamChecker checker = new StreamChecker();
        byte[] bytes = "abcdef".getBytes();
        try (final ZipInputStream stream = createZipStream("test", bytes)) {
            Assert.assertTrue(checker.isDifferent(stream));
        }
        bytes = "123456".getBytes();
        try (final ZipInputStream stream = createZipStream("test", bytes)) {
            // Fails here
            Assert.assertTrue(checker.isDifferent(stream));
        }
    }

    private ZipInputStream createZipStream(final String entryName,
                                           final byte[] bytes) throws IOException {
        try (final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
             final ZipOutputStream stream = new ZipOutputStream(outputStream)) {
            stream.putNextEntry(new ZipEntry(entryName));
            stream.write(bytes);
            return new ZipInputStream(new ByteArrayInputStream(
                    outputStream.toByteArray()));
        }
    }
}
So back to the problem... LSP is violated since you can read to the end of the stream for an InputStream but not for a ZipInputStream, and of course this will break the correctness property of any method that tries to use it in such a way.
Is there any way that this can be achieved, or is ZipInputStream fundamentally flawed?
I see no LSP violation. The documentation for ZipInputStream.read(byte[], int, int) says 'Reads from the current ZIP entry into an array of bytes'.
At any one time, the ZipInputStream is really the input stream of the entry, not the whole ZIP file. And it's hard to see what else ZipInputStream.read() could possibly do at end of entry other than return -1.
this will break the correctness property of any method that tries to use it in such a way
Hard to see how the method would ever know.
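If draining the whole archive is the desired behaviour, one workaround is to special-case ZipInputStream in the caller and hash every entry yourself. A minimal sketch (untested; the class and method names are mine):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public final class ZipAwareHashing {
    // Hashes every entry (names and contents), so two archives differing
    // in any entry produce different digests.
    public static byte[] sha1AllEntries(ZipInputStream zip) throws IOException {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                digest.update(entry.getName().getBytes(StandardCharsets.UTF_8));
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    digest.update(buffer, 0, n);
                }
            }
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is required on every JRE", e);
        }
    }
}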

What is the best way to calculate a checksum for a Java class?

I have an application where I am generating a "target file" based on a Java "source" class. I want to regenerate the target when the source changes. I have decided the best way to do this would be to get a byte[] of the class contents and calculate a checksum on the byte[].
I am looking for the best way to get the byte[] for a class. This byte[] would be equivalent to the contents of the compiled .class file. Using ObjectOutputStream does not work. The code below generates a byte[] that is much smaller than the byte contents of the class file.
// Incorrect function to calculate the byte[] contents of a Java class
public static final byte[] getClassContents(Class<?> myClass) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream stream = new ObjectOutputStream(buffer)) {
        stream.writeObject(myClass);
    }
    // This byte array is much smaller than the contents of the *.class file!!!
    byte[] contents = buffer.toByteArray();
    return contents;
}
Is there a way to get the byte[] with the identical contents of the *.class file? Calculating the checksum is the easy part, the hard part is obtaining the byte[] contents used to calculate an MD5 or CRC32 checksum.
This is the solution that I ended up using. I don't know if it's the most efficient implementation, but the following code uses the class loader to get the location of the *.class file and reads its contents. For simplicity, I skipped buffering of the read.
// Function to obtain the byte[] contents of a Java class
public static final byte[] getClassContents(Class<?> myClass) throws IOException {
    String path = myClass.getName().replace('.', '/');
    String fileName = new StringBuffer(path).append(".class").toString();
    URL url = myClass.getClassLoader().getResource(fileName);
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream stream = url.openConnection().getInputStream()) {
        int datum = stream.read();
        while (datum != -1) {
            buffer.write(datum);
            datum = stream.read();
        }
    }
    return buffer.toByteArray();
}
I don't get what you mean, but I think you are looking for this: MD5.
To check the MD5 of a file, you can use this code:
public String getMd5(File file) {
    DigestInputStream stream = null;
    try {
        stream = new DigestInputStream(new FileInputStream(file), MessageDigest.getInstance("MD5"));
        byte[] buffer = new byte[65536];
        // drain the stream; the digest is updated as a side effect of reading
        int read = stream.read(buffer);
        while (read >= 1) {
            read = stream.read(buffer);
        }
    } catch (Exception ignored) {
        return null;
    }
    return String.format("%1$032x", new BigInteger(1, stream.getMessageDigest().digest()));
}
Then you can store the MD5 of a file in any way, for example XML. An example of an MD5 is 49e6d7e2967d1a471341335c49f46c6c, and once the file contents change, the MD5 will change. You can store the MD5 of each file in XML format, and the next time you run the code, compute the MD5 again and compare it with the one recorded in the XML file.
If you really want the contents of the .class file, you should read the contents of the .class file, not the byte[] representation that is in memory. So, something like:
import java.io.*;

public class ReadSelf {

    public static void main(String args[]) throws Exception {
        Class classInstance = ReadSelf.class;
        byte[] bytes = readClass(classInstance);
    }

    public static byte[] readClass(Class classInstance) throws Exception {
        String name = classInstance.getName();
        name = name.replaceAll("[.]", "/") + ".class";
        System.out.println("Reading this: " + name);
        File file = new File(name);
        System.out.println("exists: " + file.exists());
        return read(file);
    }

    public static byte[] read(File file) throws Exception {
        byte[] data = new byte[(int) file.length()]; // can only read a file of size INT_MAX
        DataInputStream inputStream =
                new DataInputStream(
                        new BufferedInputStream(
                                new FileInputStream(file)));
        int total = 0;
        int nRead = 0;
        try {
            // read at the current offset so a short read never overwrites earlier bytes
            while (total < data.length
                    && (nRead = inputStream.read(data, total, data.length - total)) != -1) {
                total += nRead;
            }
        } finally {
            inputStream.close();
        }
        System.out.println("Read " + total
                + " bytes, which should match the file length of "
                + file.length() + " bytes");
        return data;
    }
}
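For what it's worth, on Java 7+ the whole read(File) helper can be replaced by a single library call, assuming the file fits in memory:
byte[] data = java.nio.file.Files.readAllBytes(file.toPath());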

Decoding characters in Java: why is it faster with a reader than using buffers?

I am trying several ways to decode the bytes of a file into characters.
Using java.io.Reader and Channels.newReader(...)
public static void decodeWithReader() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    Reader reader = Channels.newReader(channel, decoder, -1);
    final char[] buffer = new char[4096];
    for (;;) {
        if (-1 == reader.read(buffer)) {
            break;
        }
    }
    fis.close();
}
Using buffers and a decoder manually:
public static void readWithBuffers() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    final long fileLength = channel.size();
    long position = 0;
    final int bufferSize = 1024 * 1024; // 1MB
    CharBuffer cbuf = CharBuffer.allocate(4096);
    while (position < fileLength) {
        MappedByteBuffer bbuf = channel.map(MapMode.READ_ONLY, position, Math.min(bufferSize, fileLength - position));
        for (;;) {
            CoderResult res = decoder.decode(bbuf, cbuf, false);
            if (CoderResult.OVERFLOW == res) {
                cbuf.clear();
            } else if (CoderResult.UNDERFLOW == res) {
                break;
            }
        }
        position += bbuf.position();
    }
    fis.close();
}
For a 200MB text file, the first approach consistently takes 300ms to complete. The second approach consistently takes 700ms. Do you have any idea why the reader approach is so much faster?
Can it run even faster with another implementation?
The benchmark is performed on Windows 7 with JDK7_07.
For comparison, can you try:
public static void readWithBuffersISO_8859_1() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer bbuf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    while (bbuf.remaining() > 0) {
        char ch = (char) (bbuf.get() & 0xFF);
    }
    fis.close();
}
This assumes ISO-8859-1. If you want maximum speed, treating the text like a binary format can help, if it's an option.
As @EJP points out, you are changing a number of things at once, and you need to start with the simplest comparable example and see how much difference each element adds.
Here is a third implementation that does not use mapped buffers. Under the same conditions as before, it runs consistently in 220ms. The default charset on my machine being "windows-1252", if I force the simpler "ISO-8859-1" charset the decoding is even faster (about 150ms).
It looks like the usage of native features like mapped buffers actually hurts performance (for this very use case). Also interesting: if I allocate direct buffers instead of heap buffers (look at the commented lines) then performance is reduced (a run then takes around 400ms).
So far the answer seems to be: to decode characters as fast as possible in Java (provided you can't enforce the usage of a single charset), use a decoder manually, write the decode loop with heap buffers, and do not use mapped buffers or direct ones. I have to admit that I don't really know why this is so.
public static void readWithBuffers() throws Exception {
    FileInputStream fis = new FileInputStream(FILE);
    FileChannel channel = fis.getChannel();
    CharsetDecoder decoder = Charset.defaultCharset().newDecoder();
    // CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
    ByteBuffer bbuf = ByteBuffer.allocate(4096);
    // ByteBuffer bbuf = ByteBuffer.allocateDirect(4096);
    CharBuffer cbuf = CharBuffer.allocate(4096);
    // CharBuffer cbuf = ByteBuffer.allocateDirect(2 * 4096).asCharBuffer();
    for (;;) {
        if (-1 == channel.read(bbuf)) {
            bbuf.flip(); // flip before the final decode so leftover bytes are visible
            decoder.decode(bbuf, cbuf, true);
            decoder.flush(cbuf);
            break;
        }
        bbuf.flip();
        // drain the byte buffer completely before reading more input
        for (;;) {
            CoderResult res = decoder.decode(bbuf, cbuf, false);
            if (CoderResult.OVERFLOW == res) {
                cbuf.clear();
            } else if (CoderResult.UNDERFLOW == res) {
                break;
            }
        }
        bbuf.compact();
    }
    fis.close();
}
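For anyone reproducing these numbers, a hypothetical harness that repeats each run so JIT warm-up does not dominate the measurement (FILE and the method names are from the snippets above):
public static void main(String[] args) throws Exception {
    for (int i = 0; i < 5; i++) {   // the first iterations include JIT warm-up
        long start = System.nanoTime();
        readWithBuffers();
        System.out.printf("readWithBuffers: %.0f ms%n", (System.nanoTime() - start) / 1e6);
    }
}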
