How to write a very long string to a gzip file in Java

I have a very long string that I want to write to a gzip file.
I tried using GZIPOutputStream to write the file,
but an exception is thrown when I call string.getBytes():
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
Here is my code. What should I change so that the file is written successfully?
public static void way1() throws IOException {
String filePath = "foo";
String content = "very large string";
try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
GZIPOutputStream gos = new GZIPOutputStream(os)) {
gos.write(content.getBytes(StandardCharsets.UTF_8));
}
}
public static void way2() throws IOException {
String filePath = "foo";
String content = "very large string";
try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
GZIPOutputStream gos = new GZIPOutputStream(os);
WritableByteChannel fc = Channels.newChannel(gos)) {
fc.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
}
}

If you have a ResultSet, try something like:
public static void string2Zipfile(ResultSet rs, int columnIndex, Path outputFile) throws SQLException, IOException {
try (InputStream os = rs.getBinaryStream(columnIndex)) {
try (GZIPOutputStream gos = new GZIPOutputStream(Files.newOutputStream(outputFile))) {
os.transferTo(gos);
}
}
}

It seems that converting the String to a byte[] (using content.getBytes(StandardCharsets.UTF_8)) simply needs a lot of memory for the byte[]. Instead of converting the full String to a byte[] at once, create a ByteBuffer from it using the selected encoding and then write this ByteBuffer to the GZIPOutputStream; this way you will lower the amount of memory needed by at least half. To create the ByteBuffer you can use:
Charset charset = StandardCharsets.UTF_8;
String content = "very large string";
ByteBuffer byteBuffer = charset.encode(content );
API of ByteBuffer: https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html
And this might be useful: How to put the content of a ByteBuffer into an OutputStream?
Alternatively, you can also increase the amount of memory available to the Java heap:
Increase heap size in Java
All together it would be very similar to your way2, something like this (I didn't test it):
public static void way2() throws IOException {
String filePath = "foo";
String content = "very large string";
try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
GZIPOutputStream gos = new GZIPOutputStream(os);
WritableByteChannel fc = Channels.newChannel(gos)) {
Charset charset = StandardCharsets.UTF_8;
ByteBuffer byteBuffer = charset.encode(content );
fc.write(byteBuffer );
}
}
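A different route, not taken from the answers above, is to avoid materializing the encoded bytes at all: wrap the GZIPOutputStream in an OutputStreamWriter and feed it the string in fixed-size slices, so only a small slice is encoded at a time. The following is a minimal sketch under that assumption. The class name GzipStringWriter, the method names and the 8192-character slice size are illustrative choices, not anything from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; names and the slice size are illustrative.
final class GzipStringWriter {
    private static final int SLICE_CHARS = 8192;

    // Gzips `content` into `target` as UTF-8 without calling content.getBytes().
    // Closing the writer also closes `target`.
    static void writeGzip(String content, OutputStream target) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(target), StandardCharsets.UTF_8)) {
            for (int i = 0; i < content.length(); i += SLICE_CHARS) {
                int len = Math.min(SLICE_CHARS, content.length() - i);
                // Only this slice is encoded and handed to the compressor.
                writer.write(content, i, len);
            }
        }
    }

    // Reads gzip data back into a String; used here only to check the round trip.
    static String readGzip(InputStream source) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gis = new GZIPInputStream(source)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gis.read(buffer)) != -1) {
                bos.write(buffer, 0, n);
            }
        }
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

To write a file, pass Files.newOutputStream(Paths.get("foo")) as the target. The string itself still has to fit in memory; only the extra full-size byte[] is avoided.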

I used @Chaosfire's suggestion and edited the code like this; it now writes the file successfully:
public static void way1(List<String> originContent) throws IOException {
String filePath = "foo";
try (OutputStream os = Files.newOutputStream(Paths.get(filePath));
GZIPOutputStream gos = new GZIPOutputStream(os)) {
Lists.partition(originContent, 1000000).stream().map(part -> String.join("\r\n", part)).forEach(str -> {
try {
gos.write(str.getBytes(StandardCharsets.UTF_8));
} catch (IOException e) {
throw new RuntimeException(e);
}
});
}
}

Related

How to convert String variable back in byte[] in JAVA [duplicate]

I have the following code to zip and unzip the String:
public static void main(String[] args) {
// TODO code application logic here
String Source = "hello world";
byte[] a = ZIP(Source);
System.out.format("answer:");
System.out.format(a.toString());
System.out.format("\n");
byte[] Source2 = a.toString().getBytes();
System.out.println("\nsource 2:" + Source2.toString() + "\n");
String b = unZIP(Source2);
System.out.println("\nunzip answer:");
System.out.format(b);
System.out.format("\n");
}
public static byte[] ZIP(String source) {
ByteArrayOutputStream bos= new ByteArrayOutputStream(source.length()* 4);
try {
GZIPOutputStream outZip= new GZIPOutputStream(bos);
outZip.write(source.getBytes());
outZip.flush();
outZip.close();
} catch (Exception Ex) {
}
return bos.toByteArray();
}
public static String unZIP(byte[] Source) {
ByteArrayInputStream bins= new ByteArrayInputStream(Source);
byte[] buf= new byte[2048];
StringBuffer rString= new StringBuffer("");
int len;
try {
GZIPInputStream zipit= new GZIPInputStream(bins);
while ((len = zipit.read(buf)) > 0) {
rString.append(new String(buf).substring(0, len));
}
return rString.toString();
} catch (Exception Ex) {
return "";
}
}
When "hello world" has been zipped, the resulting byte[] shows up as [B@7bdecdec when it is converted to a String and displayed on the screen. However, if I try to convert the string back into a byte[] with the following code:
byte[] Source2 = a.toString().getBytes();
the value becomes [B@60a1807c instead of [B@7bdecdec. Does anyone know how I can convert the String (the value of a byte[] that has been converted to a String) back into a byte[] in Java?
Why do byte[] Source2 = a.toString().getBytes(); ?
It looks like a double conversion: you convert a byte[] to a String and then back to a byte[].
The real conversion of a byte[] to a String is new String(byte[]), assuming the same charset is used on both sides.
Source2 should be an exact copy of a, hence you should just do byte[] Source2 = a;
Your unZIP is wrong because you convert the bytes back to a String chunk by chunk, and the text might be in some other encoding (let's say UTF-8):
public static String unZIP(byte[] source) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream(source.length*2);
try (ByteArrayInputStream in = new ByteArrayInputStream(source);
GZIPInputStream zis = new GZIPInputStream(in)) {
byte[] buffer = new byte[4096];
// read() returns -1 at the end of the stream
for (int n; (n = zis.read(buffer)) != -1; ) {
bos.write(buffer, 0, n);
}
}
return new String(bos.toByteArray(), StandardCharsets.UTF_8);
}
This one, not tested, will:
Store the bytes from the gzip stream in a ByteArrayOutputStream
Close the GZIPInputStream/ByteArrayInputStream using try-with-resources
Convert the whole content into a String using UTF-8 (you should always specify the encoding and, except in rare cases, UTF-8 is the way to go).
You must not use StringBuffer for two reasons:
The most important one: appending one decoded chunk at a time will not behave well with multi-byte encodings such as UTF-8 or UTF-16.
And second, StringBuffer is synchronized: you should use StringBuilder whenever possible and whenever a string builder is appropriate (e.g. not here!). StringBuffer should be reserved for cases where you share the StringBuffer between several threads; otherwise it is useless.
With those changes, you will also need to change ZIP as per David Conrad's comment and because unZIP uses UTF-8:
public static byte[] ZIP(String source) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream(source.length()* 4);
try (GZIPOutputStream zip = new GZIPOutputStream(bos)) {
zip.write(source.getBytes(StandardCharsets.UTF_8));
}
return bos.toByteArray();
}
As for main, printing a byte[] only gives the default toString() result, which shows the array's type and identity hash code rather than its contents.
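If the compressed bytes really do need to travel as text (for display or storage), a reversible text encoding is required, since toString() on an array is not one. Below is a small sketch that uses Base64 for this. Base64 is an addition here rather than something the answer above proposes, and the class name GzipText and its method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; names are illustrative.
final class GzipText {
    // Compresses the UTF-8 bytes of `source`.
    static byte[] zip(String source) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(source.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return bos.toByteArray();
    }

    // Decompresses and decodes the bytes as UTF-8 in one step.
    static String unzip(byte[] compressed) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                bos.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bos.toByteArray(), StandardCharsets.UTF_8);
    }

    // Reversible byte[] <-> String conversion, unlike byte[].toString().
    static String toPrintable(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromPrintable(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

With these helpers, GzipText.unzip(GzipText.fromPrintable(GzipText.toPrintable(GzipText.zip(s)))) gives back the original string s.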

How to write out percentage of file copying using Binary Stream?

I want to show the percentage while copying a file using a binary stream, but I don't know how to do it.
Below is my code.
public static void binaryStream() throws IOException {
try {
FileInputStream inputStream = new FileInputStream(new File("Untitled.png"));
FileOutputStream outputStream = new FileOutputStream(new File("Untitled-copied.png"));
int data;
while ((data = inputStream.read()) >= 0) {
outputStream.write(data);
}
outputStream.write(data);
inputStream.close();
outputStream.close();
} catch (FileNotFoundException e) {
System.out.println("Error");
} catch (IOException e) {
System.out.println("Error");
}
}
Example of how to do it like other people mentioned in the comments.
import java.io.*;
public class BinaryStream {
public static void binaryStream(String file1, String file2) throws Exception
{
File sourceFile = new File(file1);
try(
FileInputStream inputStream = new FileInputStream(sourceFile);
FileOutputStream outputStream = new FileOutputStream(new File(file2))
) {
long lenOfFile = sourceFile.length();
long currentBytesWritten = 0;
int data;
while ((data = inputStream.read()) != -1) {
outputStream.write(data);
currentBytesWritten += 1;
System.out.printf("%2.2f%%%n",
100*((double)currentBytesWritten)/((double)lenOfFile));
}
}
}
public static void main(String args[]) throws Exception {
binaryStream("Untitled.png", "Untitled-copied.png");
}
}
Note that I've made some changes:
Removed the extra outputStream.write() call you had that was writing extra content incorrectly
Using try-with-resources idiom to close the streams you open even on exceptions
Throw the exceptions instead of catching, as you shouldn't catch them if you can't handle them
Compare to -1, as that is the documented value for end of file (end of stream)
Output is like this on my computer:
0,06%
// removed data
99,89%
99,94%
100,00%
Note also that this code will print something after each byte written, so it is highly inefficient. You might want to do that less often. On that note, you're reading and writing one byte at a time, which is also very inefficient - you might want to use read(byte[]) instead, reading in chunks. Example of that, using 256 byte array:
import java.io.*;
public class BinaryStream {
public static void binaryStream(String file1, String file2) throws Exception {
File sourceFile = new File(file1);
try(
FileInputStream inputStream = new FileInputStream(sourceFile);
FileOutputStream outputStream = new FileOutputStream(new File(file2))
) {
long lenOfFile = sourceFile.length();
long bytesWritten = 0;
int amountOfBytesRead;
byte[] bytes = new byte[256];
while ((amountOfBytesRead = inputStream.read(bytes)) != -1) {
outputStream.write(bytes, 0, amountOfBytesRead);
bytesWritten += amountOfBytesRead;
System.out.printf("%2.2f%%%n",
100*((double)bytesWritten)/((double)lenOfFile));
}
}
}
public static void main(String args[]) throws Exception {
binaryStream("Untitled.png", "Untitled-copied.png");
}
}
Output on my computer:
14,69%
29,37%
44,06%
58,75%
73,44%
88,12%
100,00%
Note that in the first example, return value of .read() is actually the byte that was read, whereas in the second example, return value of .read() is the amount of bytes read and the actual bytes go into the byte array.
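The advice about reporting less often can be sketched as follows: copy in larger blocks and report only when the whole-number percentage changes. This is an illustration under assumptions of my own; the class name ProgressCopy, the IntConsumer callback and the 8192-byte block size do not come from the answer above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.IntConsumer;

// Hypothetical helper; the name and the callback shape are illustrative.
final class ProgressCopy {
    // Copies `in` to `out` and calls `onPercent` only when the whole-number
    // percentage changes, so a report is made at most once per percentage point.
    static long copy(InputStream in, OutputStream out, long totalBytes,
                     IntConsumer onPercent) throws IOException {
        byte[] buffer = new byte[8192];
        long copied = 0;
        int lastPercent = -1;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            copied += n;
            // Guard against a zero or unknown length to avoid dividing by zero.
            int percent = totalBytes > 0 ? (int) (copied * 100 / totalBytes) : 100;
            if (percent != lastPercent) {
                lastPercent = percent;
                onPercent.accept(percent);
            }
        }
        return copied;
    }
}
```

For a file copy, open a FileInputStream and FileOutputStream in a try-with-resources block and call ProgressCopy.copy(in, out, sourceFile.length(), p -> System.out.println(p + "%")).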

LZ4 is not fast compared to deflater compressing string

I tried to use LZ4 compression to compress a string object, but the results are not in LZ4's favour.
Here is the program I tried:
public class CompressionDemo {
public static byte[] compressGZIP(String data) throws IOException {
long start = System.nanoTime ();
ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length());
GZIPOutputStream gzip = new GZIPOutputStream(bos);
gzip.write(data.getBytes());
gzip.close();
byte[] compressed = bos.toByteArray();
bos.close();
System.out.println(System.nanoTime()-start);
return compressed;
}
public static byte[] compressLZ4(String data) throws IOException {
long start = System.nanoTime ();
LZ4Factory factory = LZ4Factory.fastestJavaInstance();
LZ4Compressor compressor = factory.highCompressor();
byte[] result = compressor.compress(data.getBytes());
System.out.println(System.nanoTime()-start);
return result;
}
public static byte[] compressDeflater(String stringToCompress) {
long start = System.nanoTime ();
byte[] returnValues = null;
try {
Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
deflater.setInput(stringToCompress.getBytes("UTF-8"));
deflater.finish();
byte[] bytesCompressed = new byte[Short.MAX_VALUE];
int numberOfBytesAfterCompression = deflater.deflate(bytesCompressed);
returnValues = new byte[numberOfBytesAfterCompression];
System.arraycopy(bytesCompressed, 0, returnValues, 0, numberOfBytesAfterCompression);
} catch (Exception uee) {
uee.printStackTrace();
}
System.out.println(System.nanoTime()-start);
return returnValues;
}
public static void main(String[] args) throws IOException, DataFormatException {
System.out
.println("..it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required."
.getBytes().length);
byte[] arr = compressLZ4("..it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.");
System.out.println(arr.length);
}
}
I collected the timings with the code above, but LZ4 is not as fast as claimed.
Please let me know what I am doing wrong.
Your results are meaningless because the size before compression is too small. You are trying to measure the compression of a few thousand bytes at speeds over 100MB/s. The measurement is lost in the time taken by the JVM to warm up. Try again with an input file of several MBs. You should get numbers in line with my LZ4 implementation here: https://github.com/flanglet/kanzi/wiki/Compression-examples.
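A rough sketch of what such a measurement could look like: build an input of a few MB, run some unmeasured warm-up rounds, then average several timed rounds. Only the JDK's Deflater is timed here because LZ4 needs a third-party library; the same harness shape would apply to the LZ4 call. The class name CompressionTiming, the sample data and the run counts are assumptions for illustration, and a dedicated benchmark tool such as JMH would give more reliable numbers.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical timing harness; names, sizes and run counts are illustrative.
final class CompressionTiming {
    // Keeps results reachable so the JIT cannot discard the measured work.
    static long sink;

    // Builds mildly varied text; 200_000 lines is roughly 5 MB.
    static byte[] sampleInput(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("record ").append(i)
              .append(": payload ").append(i * 31 % 997).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Deflates the whole input, looping until the compressor reports completion.
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native memory
        }
    }

    // Runs unmeasured warm-up rounds first, then returns the mean time per run.
    static long meanNanos(byte[] input, int level, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink += deflate(input, level).length;
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink += deflate(input, level).length;
        }
        return (System.nanoTime() - start) / measuredRuns;
    }
}
```

Unlike compressDeflater in the question, deflate loops until finished(), so inputs that compress to more than one buffer are not truncated.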

How to define a method for reading all InputStreams including ZipInputStream?

I asked this once before and my post was deleted for not providing the code that uses the helper class. This time I have created a full test suite which shows the exact problem.
I am of the opinion that Java's ZipInputStream breaks the Liskov Substitution Principle (LSP) with regards to the InputStream abstract class. For ZipInputStream to be a subtype of InputStream, then objects of type InputStream in a program may be replaced with objects of type ZipInputStream without altering any of the desirable properties of that program (correctness, task performed, etc.).
The way in which LSP is violated here is for the read methods.
InputStream.read(byte[], int, int) states that it returns:
the total number of bytes read into the buffer, or -1 if there is no more data because the end of the stream has been reached.
The problem with ZipInputStream is that it has modified the meaning of a -1 return value. It states:
the actual number of bytes read, or -1 if the end of the entry is reached
(there is actually a hint to a similar problem with the available method in the Android documentation http://developer.android.com/reference/java/util/zip/ZipInputStream.html)
Now for the code that demonstrates the problem. (This is a cut down version of what I was actually trying to do so please excuse any poor style, multithreading problems, or the fact that the stream is advanced etc.).
Class that accepts any InputStream to generate a SHA1 of the stream:
public class StreamChecker {
private byte[] lastHash = null;
public boolean isDifferent(final InputStream inputStream) throws IOException {
final byte[] hash = generateHash(inputStream);
final byte[] temp = lastHash;
lastHash = hash;
return !Arrays.equals(temp, hash);
}
private byte[] generateHash(final InputStream inputStream) throws IOException {
return DigestUtils.sha1(inputStream);
}
}
Unit tests:
public class StreamCheckerTest {
@Test
public void testByteArrayInputStreamIsSame() throws IOException {
final StreamChecker checker = new StreamChecker();
final byte[] bytes = "abcdef".getBytes();
try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
Assert.assertTrue(checker.isDifferent(stream));
}
try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
Assert.assertFalse(checker.isDifferent(stream));
}
// Passes
}
@Test
public void testByteArrayInputStreamWithDifferentDataIsDifferent() throws IOException {
final StreamChecker checker = new StreamChecker();
byte[] bytes = "abcdef".getBytes();
try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
Assert.assertTrue(checker.isDifferent(stream));
}
bytes = "123456".getBytes();
try (final ByteArrayInputStream stream = new ByteArrayInputStream(bytes)) {
Assert.assertTrue(checker.isDifferent(stream));
}
// Passes
}
@Test
public void testZipInputStreamIsSame() throws IOException {
final StreamChecker checker = new StreamChecker();
final byte[] bytes = "abcdef".getBytes();
try (final ZipInputStream stream = createZipStream("test", bytes)) {
Assert.assertTrue(checker.isDifferent(stream));
}
try (final ZipInputStream stream = createZipStream("test", bytes)) {
Assert.assertFalse(checker.isDifferent(stream));
}
// Passes
}
@Test
public void testZipInputStreamWithDifferentEntryDataIsDifferent() throws IOException {
final StreamChecker checker = new StreamChecker();
byte[] bytes = "abcdef".getBytes();
try (final ZipInputStream stream = createZipStream("test", bytes)) {
Assert.assertTrue(checker.isDifferent(stream));
}
bytes = "123456".getBytes();
try (final ZipInputStream stream = createZipStream("test", bytes)) {
// Fails here
Assert.assertTrue(checker.isDifferent(stream));
}
}
private ZipInputStream createZipStream(final String entryName,
final byte[] bytes) throws IOException {
try (final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
final ZipOutputStream stream = new ZipOutputStream(outputStream)) {
stream.putNextEntry(new ZipEntry(entryName));
stream.write(bytes);
return new ZipInputStream(new ByteArrayInputStream(
outputStream.toByteArray()));
}
}
}
So back to the problem... LSP is violated since you can read to the end of the stream for an InputStream but not for a ZipInputStream and of course this will break the correctness property of any method that tries to use it in such a way.
Is there any way that this can be achieved or is ZipInputStream fundamentally flawed?
I see no LSP violation. The documentation for ZipInputStream.read(byte[], int, int) says 'Reads from the current ZIP entry into an array of bytes'.
At any one time, the ZipInputStream is really the input stream of the entry, not the whole ZIP file. And it's hard to see what else ZipInputStream.read() could possibly do at end of entry other than return -1.
this will break the correctness property of any method that tries to use it in such a way
Hard to see how the method would ever know.
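Since the stream stands for the current entry, a generic reader works once an entry has been selected with getNextEntry(). As far as I can tell, the tests in the question never call it, so every read returns -1 immediately and each hash covers an empty stream. The sketch below hashes all entries of an archive with an ordinary read-until--1 loop. It uses the JDK's MessageDigest rather than DigestUtils to stay self-contained, and the class name ZipEntryHasher and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; names are illustrative, not from the question.
final class ZipEntryHasher {
    // Feeds every remaining byte of `in` into `digest`; works for any InputStream.
    static void update(MessageDigest digest, InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
    }

    // Hashes the names and contents of all entries in a ZIP archive.
    static byte[] sha1OfEntries(InputStream zipData) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (ZipInputStream zis = new ZipInputStream(zipData)) {
            ZipEntry entry;
            // getNextEntry() positions the stream at the start of the next entry.
            while ((entry = zis.getNextEntry()) != null) {
                digest.update(entry.getName().getBytes(StandardCharsets.UTF_8));
                update(digest, zis); // -1 here means "end of this entry"
            }
        }
        return digest.digest();
    }

    // Builds a one-entry archive in memory; the writer is closed before the
    // bytes are taken, so the archive is complete.
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }
}
```

The update method takes a plain InputStream, so the same loop serves a ByteArrayInputStream, a FileInputStream or the current entry of a ZipInputStream.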

What is the best way to calculate a checksum for a Java class?

I have an application where I am generating a "target file" based on a Java "source" class. I want to regenerate the target when the source changes. I have decided the best way to do this would be to get a byte[] of the class contents and calculate a checksum on the byte[].
I am looking for the best way to get the byte[] for a class. This byte[] would be equivalent to the contents of the compiled .class file. Using ObjectOutputStream does not work. The code below generates a byte[] that is much smaller than the byte contents of the class file.
// Incorrect function to calculate the byte[] contents of a Java class
public static final byte[] getClassContents(Class<?> myClass) throws IOException {
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
try( ObjectOutputStream stream = new ObjectOutputStream(buffer) ) {
stream.writeObject(myClass);
}
// This byte array is much smaller than the contents of the *.class file!!!
byte[] contents = buffer.toByteArray();
return contents;
}
Is there a way to get the byte[] with the identical contents of the *.class file? Calculating the checksum is the easy part, the hard part is obtaining the byte[] contents used to calculate an MD5 or CRC32 checksum.
This is the solution that I ended up using. I don't know if it's the most efficient implementation, but the following code uses the class loader to get the location of the *.class file and reads its contents. For simplicity, I skipped buffering of the read.
// Function to obtain the byte[] contents of a Java class
public static final byte[] getClassContents(Class<?> myClass) throws IOException {
String path = myClass.getName().replace('.', '/');
String fileName = new StringBuffer(path).append(".class").toString();
URL url = myClass.getClassLoader().getResource(fileName);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
try (InputStream stream = url.openConnection().getInputStream()) {
int datum = stream.read();
while( datum != -1) {
buffer.write(datum);
datum = stream.read();
}
}
return buffer.toByteArray();
}
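For completeness, here is a buffered variant combined with the checksum step, as a sketch only. It assumes the class file is reachable as a resource through Class.getResourceAsStream, which may not hold for dynamically generated classes, and the class name ClassChecksum and its method names are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical helper; names are illustrative.
final class ClassChecksum {
    // Reads a stream to its end in 8 KB blocks.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] block = new byte[8192];
        int n;
        while ((n = in.read(block)) != -1) {
            buffer.write(block, 0, n);
        }
        return buffer.toByteArray();
    }

    // Loads the compiled bytes of `type` as a resource (assumes it is available).
    static byte[] classBytes(Class<?> type) throws IOException {
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        try (InputStream in = type.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("Class file not found: " + resource);
            }
            return readAll(in);
        }
    }

    // CRC32 checksum of the given bytes.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }
}
```

A call such as ClassChecksum.crc32(ClassChecksum.classBytes(MySource.class)), where MySource stands for the source class, gives a value that can be stored next to the target file and compared on the next run.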
I don't quite understand what you mean, but I think you are looking for MD5.
To compute the MD5 of a file, you can use this code:
public String getMd5(File file)
{
try (DigestInputStream stream = new DigestInputStream(new FileInputStream(file), MessageDigest.getInstance("MD5")))
{
byte[] buffer = new byte[65536];
// Reading through the stream updates the digest as a side effect.
int read = stream.read(buffer);
while (read >= 1) {
read = stream.read(buffer);
}
return String.format("%1$032x", new BigInteger(1, stream.getMessageDigest().digest()));
}
catch (Exception ignored)
{
return null;
}
}
Then you can store the MD5 of a file in any way you like, for example in XML. An example of an MD5 is 49e6d7e2967d1a471341335c49f46c6c; once the file's contents change, the MD5 will change. You can store the MD5 of each file in XML format, and the next time you run the code, compute the MD5 of each file again and compare it with the value stored in the XML file.
If you really want the contents of the .class file, you should read the contents of the .class file, not the byte[] representation that is in memory. So something like:
import java.io.*;
public class ReadSelf {
public static void main(String args[]) throws Exception {
Class classInstance = ReadSelf.class;
byte[] bytes = readClass(classInstance);
}
public static byte[] readClass(Class classInstance) throws Exception {
String name = classInstance.getName();
name = name.replaceAll("[.]", "/") + ".class";
System.out.println("Reading this: " + name);
File file = new File(name);
System.out.println("exists: " + file.exists());
return read(file);
}
public static byte[] read(File file) throws Exception {
byte[] data = new byte[(int)file.length()]; // can only read a file of size INT_MAX
DataInputStream inputStream =
new DataInputStream(
new BufferedInputStream(
new FileInputStream(file)));
int total = 0;
int nRead = 0;
try {
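// read() may return fewer bytes than requested, so keep filling from the current offset
while (total < data.length
&& (nRead = inputStream.read(data, total, data.length - total)) != -1) {
total += nRead;
}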
}
finally {
inputStream.close();
}
System.out.println("Read " + total
+ " bytes, which should match file length of "
+ file.length() + " bytes");
return data;
}
}
