How do I decompress large files using Zstd-jni and Byte Buffers - java

I am trying to decompress a lot of 40 MB+ files as I download them in parallel, using ByteBuffers and Channels. I get better throughput with Channels than with Streams, and we need this to be a very high-throughput system: we have to process 40 TB of files every day, and this part of the process is currently the bottleneck. The files are compressed with zstd-jni. Zstd-jni has APIs for decompressing byte buffers, but I get an error when I use them. How do I decompress one byte buffer at a time using zstd-jni?
I found these examples in their tests, but unless I am missing something the examples using ByteBuffers seem to assume the entire input file fits in one ByteBuffer:
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala
Below is my code for compressing and decompressing files. The compression code works great, but the decompression code then fails with an error of -70.
public static long compressFile(String inFile, String outFolder, ByteBuffer inBuffer, ByteBuffer compressedBuffer, int compressionLevel) throws IOException {
File file = new File(inFile);
File outFile = new File(outFolder, file.getName() + ".zs");
long numBytes = 0l;
try (RandomAccessFile inRaFile = new RandomAccessFile(file, "r");
RandomAccessFile outRaFile = new RandomAccessFile(outFile, "rw");
FileChannel inChannel = inRaFile.getChannel();
FileChannel outChannel = outRaFile.getChannel()) {
inBuffer.clear();
while(inChannel.read(inBuffer) > 0) {
inBuffer.flip();
compressedBuffer.clear();
long compressedSize = Zstd.compressDirectByteBuffer(compressedBuffer, 0, compressedBuffer.capacity(), inBuffer, 0, inBuffer.limit(), compressionLevel);
numBytes+=compressedSize;
compressedBuffer.position((int)compressedSize);
compressedBuffer.flip();
outChannel.write(compressedBuffer);
inBuffer.clear();
}
}
return numBytes;
}
public static long decompressFile(String originalFilePath, String inFolder, ByteBuffer inBuffer, ByteBuffer decompressedBuffer) throws IOException {
File outFile = new File(originalFilePath);
File inFile = new File(inFolder, outFile.getName() + ".zs");
outFile = new File(inFolder, outFile.getName());
long numBytes = 0l;
try (RandomAccessFile inRaFile = new RandomAccessFile(inFile, "r");
RandomAccessFile outRaFile = new RandomAccessFile(outFile, "rw");
FileChannel inChannel = inRaFile.getChannel();
FileChannel outChannel = outRaFile.getChannel()) {
inBuffer.clear();
while(inChannel.read(inBuffer) > 0) {
inBuffer.flip();
decompressedBuffer.clear();
long compressedSize = Zstd.decompressDirectByteBuffer(decompressedBuffer, 0, decompressedBuffer.capacity(), inBuffer, 0, inBuffer.limit());
System.out.println(Zstd.isError(compressedSize) + " " + compressedSize);
numBytes+=compressedSize;
decompressedBuffer.position((int)compressedSize);
decompressedBuffer.flip();
outChannel.write(decompressedBuffer);
inBuffer.clear();
}
}
return numBytes;
}

Yes, the static methods you use in your example assume the whole compressed file fits in one ByteBuffer. As far as I understand your requirements, you need streaming decompression using ByteBuffers. ZstdDirectBufferDecompressingStream already provides this:
https://static.javadoc.io/com.github.luben/zstd-jni/1.3.7-1/com/github/luben/zstd/ZstdDirectBufferDecompressingStream.html
Here is an example of how to use it (from the tests):
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala#L261-L302
Note that you also have to subclass it and override the "refill" method.
EDIT: here is a new test I just added that has exactly the same structure as your question, moving data between channels:
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala#L540-L586

Related

How can I read a specific number of bytes from a FileInputStream object using buffers

I have a series of objects stored within a file concatenated as below:
sizeOfFile1 || file1 || sizeOfFile2 || file2 ...
The sizes of the files are serialized Long objects, and the files are just the raw bytes of the files.
I am trying to extract the files from the input file. Below is my code:
FileInputStream fileInputStream = new FileInputStream("C:\\Test.tst");
ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
while (fileInputStream.available() > 0)
{
long size = (long) objectInputStream.readObject();
FileOutputStream fileOutputStream = new FileOutputStream("C:\\" + size + ".tst");
BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
int chunkSize = 256;
final byte[] temp = new byte[chunkSize];
int finalChunkSize = (int) (size % chunkSize);
final byte[] finalTemp = new byte[finalChunkSize];
while(fileInputStream.available() > 0 && size > 0)
{
if (fileInputStream.available() > finalChunkSize)
{
int i = fileInputStream.read(temp);
bufferedOutputStream.write(temp, 0, i);
size = size - i;
}
else
{
int i = fileInputStream.read(finalTemp);
bufferedOutputStream.write(finalTemp, 0, i);
size = 0;
}
}
bufferedOutputStream.close();
}
objectInputStream.close();
My code fails after it reads the first sizeOfFile; it just reads the rest of the input file into one file when there are multiple files stored.
Can anyone see the issue here?
Regards.
Wrap it in a DataInputStream and use readFully(byte[]).
But I question the design. Serialization and random access do not mix. It sounds like you should be using a database.
NB you are misusing available(). See the method's Javadoc page. It is never correct to use it as a count of the total number of bytes in the stream. There are few if any correct uses of available(), and this isn't one of them.
You could try NIO instead:
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
This maps only the first SIZE bytes of the file into memory.
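This uses DataInputStream to read the longs. In this particular case I am not using readFully(), as a segment might be too long to keep in memory:
// needs java.io.DataInputStream, java.io.EOFException, java.io.FileInputStream, java.io.OutputStream
DataInputStream in = new DataInputStream(new FileInputStream(...));
byte[] buf = new byte[64*1024];
while(true) {
long size;
try { size = in.readLong(); } catch (EOFException e) { break; } // no more segments
OutputStream out = ...; // open the target only once a size has been read
while(size > 0) {
int len = (int) Math.min(buf.length, size); // never read past the current segment
len = in.read(buf, 0, len);
if (len < 0) throw new EOFException("segment is shorter than its size prefix");
out.write(buf, 0, len);
size-=len;
}
out.close();
}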
Save yourself a lot of trouble by doing one of these things:
Switch to using Avro; trust me, you would be crazy not to. It's easy to learn and will accommodate schema changes. Using ObjectXXXStream is one of the worst ideas ever: as soon as you change your schema, your old files are garbage.
or use Thrift
or use Hibernate (but this is probably not a great option; Hibernate takes a lot of time to learn and a lot of configuration)
If you really refuse to switch to Avro, I recommend reading up on Apache's IOUtils class. It has a method to copy from an input stream to an output stream, saving you a lot of headaches. Unfortunately, what you want to do is a little more complicated, because you want the size prefixing each file. You might be able to use a combination of SequenceInputStream objects to do that.
There are also GZIPOutputStream and ZipOutputStream; both are part of the JDK (java.util.zip), so they need no extra jars on your classpath.
I'm not going to write an example because I honestly think you should just learn avro or thrift and use that.

Memory problems loading a file, plus converting into hex

I'm trying to make a file hexadecimal converter (input file -> output hex string of the file)
The code I came up with is
static String open2(String path) throws FileNotFoundException, IOException,OutOfMemoryError {
System.out.println("BEGIN LOADING FILE");
StringBuilder sb = new StringBuilder();
//sb.ensureCapacity(2147483648);
int size = 262144;
FileInputStream f = new FileInputStream(path);
FileChannel ch = f.getChannel( );
byte[] barray = new byte[size];
ByteBuffer bb = ByteBuffer.wrap( barray );
while (ch.read(bb) != -1)
{
//System.out.println(sb.capacity());
sb.append(bytesToHex(barray));
bb.clear();
}
System.out.println("FILE LOADED; BRING IT BACK");
return sb.toString();
}
I am sure that "path" is a valid filename.
The problem is with big files (>= 500 MB): the program throws an OutOfMemoryError: Java heap space at StringBuilder.append.
To create this code I followed some tips from http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly but I got a doubt when I tried to force a space allocation for the StringBuilder sb: "2147483648 is too big for an int".
If I want to use this code even with very big files (let's say up to 2gb if I really have to stop somewhere) what's the better way to output a hexadecimal string conversion of the file in terms of speed?
I'm now working on copying the converted string into a file. Anyway I'm having problems of "writing the empty buffer on the file" after the eof of the original one.
static String open3(String path) throws FileNotFoundException, IOException {
System.out.println("BEGIN LOADING FILE (Hope this is the last change)");
FileWriter fos = new FileWriter("HEXTMP");
int size = 262144;
FileInputStream f = new FileInputStream(path);
FileChannel ch = f.getChannel( );
byte[] barray = new byte[size];
ByteBuffer bb = ByteBuffer.wrap( barray );
while (ch.read(bb) != -1)
{
fos.write(bytesToHex(barray));
bb.clear();
}
System.out.println("FILE LOADED; BRING IT BACK");
return "HEXTMP";
}
Obviously the HEXTMP file that gets created has a size that is a multiple of 256k, but if the input file is 257k it will be a 512k file with a LOT of "000000" at the end.
I know I just have to create a last byte array with cut length.
(I used a file writer because i wanted to write the string of hex; otherwise it would have just copied the file as-is)
Why are you loading complete file?
You can load a few bytes from the input file into a buffer, process the bytes in the buffer, then write the processed bytes to the output file. Continue until all bytes from the input file have been processed.
FileInputStream fis = new FileInputStream("in file");
FileOutputStream fos = new FileOutputStream("out");
byte buffer [] = new byte[8192];
while(true){
int count = fis.read(buffer);
if(count == -1)
break;
byte[] processed = processBytesToConvert(buffer, count);
fos.write(processed);
}
fis.close();
fos.close();
So just read a few bytes into the buffer, convert them to a hex string, get the bytes of that hex string, write them to the output file, and continue with the next input bytes.
The problem here is that you try to read the whole file and store it in memory.
You should use a stream: read a chunk of your input file, convert it, and write it to the output file. That way your program can scale, whatever the size of the input file is.
The key would be to read file in chunks instead of reading all of it in one go. Depending on its use you could vary size of the chunk. For example, if you are trying to make a hex viewer / editor determine how much content is being shown in the viewport and read only as much of data from file. Or if you are simply converting and dumping hex to another file use any chunk size that is small enough to fit in memory but big enough for performance. This should be tunable over some runs. Perhaps use filesystem NIO in Java 7 so that you can do all three tasks - reading, processing and writing - concurrently. The link included in question gives good primer on reading files.
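A rough sketch of the chunked approach the answers above describe. The class name ChunkedHexDump and the method toHexFile are made up for illustration, and the 256 KB buffer is simply the size used in the question. The part worth noticing is that only the count bytes actually returned by read() are converted, which should avoid both the OutOfMemoryError and the run of "000000" after the end of the input.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: streams a file to a hex text file one chunk at a time.
final class ChunkedHexDump {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Converts the file at 'in' to lowercase hex text at 'out'; returns the number of input bytes. */
    static long toHexFile(Path in, Path out) throws IOException {
        byte[] buffer = new byte[256 * 1024];      // chunk size, tunable
        char[] hex = new char[buffer.length * 2];  // two hex digits per input byte
        long total = 0;
        try (InputStream is = Files.newInputStream(in);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.US_ASCII)) {
            int count;
            while ((count = is.read(buffer)) != -1) {
                // Convert only the bytes that were actually read in this pass.
                for (int i = 0; i < count; i++) {
                    int b = buffer[i] & 0xFF;
                    hex[2 * i] = HEX[b >>> 4];
                    hex[2 * i + 1] = HEX[b & 0x0F];
                }
                writer.write(hex, 0, 2 * count);
                total += count;
            }
        }
        return total;
    }
}
```

Memory use stays at one input chunk plus its hex form, regardless of file size. A call would look like ChunkedHexDump.toHexFile(Paths.get(path), Paths.get("HEXTMP")).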

Reading from set position in binary file (java)

I am making a small program in Java, and I want it to read from a set position in a binary file, like substring but on file streams. Is there a good way to do this?
byte[] buffer = new byte[1024];
FileInputStream in = new FileInputStream("test.bin");
int bytesRead = 0;
while (bytesRead != -1) {
bytesRead = in.read(buffer, 0, buffer.length);
}
in.close();
One way to do that is to use a java.io.RandomAccessFile and its java.nio.FileChannel to read and/or write data from/to that file, for example:
File file; // initialize somewhere
ByteBuffer buffer; // initialize somewhere
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
fc.position(pos); // position to the byte you want to start reading
fc.read(buffer); // read data into buffer
byte[] data = buffer.array(); // only works for heap (non-direct) buffers
Use skip to move the stream to the desired start location (InputStream has no seek method; skip is the closest equivalent):
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html#skip(long)
I would use RandomAccessFile for the above.
If you are loading a large amount of data, I would use memory mapping, as this will appear to be much faster (and sometimes it is). BTW, you can use FileInputStream for memory mapping as well.
FileChannel in = new FileInputStream("test.bin").getChannel();
MappedByteBuffer mbb = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
// access mbb anywhere
long l = mbb.getLong(40000000); // long at byte 40,000,000
//
in.close();
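If mapping the whole file is more than you need, a positional read on the channel is another option. The sketch below is only an illustration: PositionalRead and readAt are hypothetical names. It relies on FileChannel.read(ByteBuffer, long), which reads at an absolute file offset without moving the channel's own position, and it loops because a single read may return fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: a "substring" for files.
final class PositionalRead {
    /** Reads up to 'length' bytes starting at 'offset'; returns fewer if the file ends first. */
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            while (buffer.hasRemaining()) {
                int n = channel.read(buffer, position); // absolute read at 'position'
                if (n < 0) {
                    break; // reached end of file before the buffer was full
                }
                position += n;
            }
            buffer.flip();
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        }
    }
}
```

For example, PositionalRead.readAt(Paths.get("test.bin"), 1024, 256) would return the 256 bytes that start at byte 1024, or fewer if the file is shorter than that.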

Best way to write String to file using java nio

I need to write (append) a huge string to a flat file using Java NIO. The encoding is ISO-8859-1.
Currently we are writing as shown below. Is there any better way to do the same ?
public void writeToFile(Long limit) throws IOException{
String fileName = "/xyz/test.txt";
File file = new File(fileName);
FileOutputStream fileOutputStream = new FileOutputStream(file, true);
FileChannel fileChannel = fileOutputStream.getChannel();
ByteBuffer byteBuffer = null;
String messageToWrite = null;
for(int i=1; i<limit; i++){
//messageToWrite = get String Data From database
byteBuffer = ByteBuffer.wrap(messageToWrite.getBytes(Charset.forName("ISO-8859-1")));
fileChannel.write(byteBuffer);
}
fileChannel.close();
}
EDIT: Tried both options. Following are the results.
@Test
public void testWritingStringToFile() {
DiagnosticLogControlManagerImpl diagnosticLogControlManagerImpl = new DiagnosticLogControlManagerImpl();
try {
File file = diagnosticLogControlManagerImpl.createFile();
long startTime = System.currentTimeMillis();
writeToFileNIOWay(file);
//writeToFileIOWay(file);
long endTime = System.currentTimeMillis();
System.out.println("Total Time is " + (endTime - startTime));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
/**
*
* @param limit
* Long
* @throws IOException
* IOException
*/
public void writeToFileNIOWay(File file) throws IOException {
FileOutputStream fileOutputStream = new FileOutputStream(file, true);
FileChannel fileChannel = fileOutputStream.getChannel();
ByteBuffer byteBuffer = null;
String messageToWrite = null;
for (int i = 1; i < 1000000; i++) {
messageToWrite = "This is a test üüüüüüööööö";
byteBuffer = ByteBuffer.wrap(messageToWrite.getBytes(Charset
.forName("ISO-8859-1")));
fileChannel.write(byteBuffer);
}
}
/**
*
* @param limit
* Long
* @throws IOException
* IOException
*/
public void writeToFileIOWay(File file) throws IOException {
FileOutputStream fileOutputStream = new FileOutputStream(file, true);
BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(
fileOutputStream, 128 * 100);
String messageToWrite = null;
for (int i = 1; i < 1000000; i++) {
messageToWrite = "This is a test üüüüüüööööö";
bufferedOutputStream.write(messageToWrite.getBytes(Charset
.forName("ISO-8859-1")));
}
bufferedOutputStream.flush();
fileOutputStream.close();
}
private File createFile() throws IOException {
File file = new File(FILE_PATH + "test_sixth_one.txt");
file.createNewFile();
return file;
}
Using ByteBuffer and Channel: took 4402 ms
Using buffered Writer : Took 563 ms
UPDATED:
Since Java 11 there is a specific method to write strings using java.nio.file.Files:
Files.writeString(Paths.get(file.toURI()), "My string to save");
We can also customize the writing with:
Files.writeString(Paths.get(file.toURI()),
"My string to save",
StandardCharsets.UTF_8,
StandardOpenOption.CREATE,
StandardOpenOption.TRUNCATE_EXISTING);
ORIGINAL ANSWER:
There is a one-line solution, using Java nio:
java.nio.file.Files.write(Paths.get(file.toURI()),
"My string to save".getBytes(StandardCharsets.UTF_8),
StandardOpenOption.CREATE,
StandardOpenOption.TRUNCATE_EXISTING);
I have not benchmarked this solution with the others, but using the built-in implementation for open-write-close file should be fast and the code is quite small.
I don't think you will be able to get a strict answer without benchmarking your software. NIO may speed up the application significantly under the right conditions, but it may also make things slower.
Here are some points:
Do you really need strings? If you store and receive bytes from your database, you can avoid string allocation and encoding costs altogether.
Do you really need rewind and flip? Seems like you are creating a new buffer for every string and just writing it to the channel. (If you go the NIO way, benchmark strategies that reuse the buffers instead of wrapping / discarding, I think they will do better).
Keep in mind that wrap and allocateDirect may produce quite different buffers. Benchmark both to grasp the trade-offs. With direct allocation, be sure to reuse the same buffer in order to achieve the best performance.
And the most important thing is: be sure to compare NIO with BufferedOutputStream and/or BufferedWriter approaches (use an intermediate byte[] or char[] buffer with a reasonable size as well). I've seen many, many, many people discovering that NIO is no silver bullet.
If you fancy some bleeding edge... Back to IO Trails for some NIO2 :D.
And here is an interesting benchmark about file copying using different strategies. I know it is a different problem, but I think most of the facts and the author's conclusions also apply to your problem.
Cheers,
UPDATE 1:
Since @EJP tipped me off that direct buffers wouldn't be efficient for this problem, I benchmarked it myself and ended up with a nice NIO solution using memory-mapped files. On my MacBook running OS X Lion this beats BufferedOutputStream by a solid margin, but keep in mind that this might be OS / hardware / VM specific:
public void writeToFileNIOWay2(File file) throws IOException {
final int numberOfIterations = 1000000;
final String messageToWrite = "This is a test üüüüüüööööö";
final byte[] messageBytes = messageToWrite.
getBytes(Charset.forName("ISO-8859-1"));
final long appendSize = numberOfIterations * messageBytes.length;
final RandomAccessFile raf = new RandomAccessFile(file, "rw");
raf.seek(raf.length());
final FileChannel fc = raf.getChannel();
final MappedByteBuffer mbf = fc.map(FileChannel.MapMode.READ_WRITE, fc.
position(), appendSize);
fc.close();
for (int i = 0; i < numberOfIterations; i++) { // exactly numberOfIterations writes, so the mapped region is filled completely
mbf.put(messageBytes);
}
}
I admit that I cheated a little by calculating the total size to append (around 26 MB) beforehand. This may not be possible in many real-world scenarios. Still, you can always use a "big enough" append size for the operations and later truncate the file.
UPDATE 2 (2019):
To anyone looking for a modern (as in, Java 11+) solution to the problem, I would follow @DodgyCodeException's advice and use java.nio.file.Files.writeString:
String fileName = "/xyz/test.txt";
String messageToWrite = "My long string";
Files.writeString(Paths.get(fileName), messageToWrite, StandardCharsets.ISO_8859_1);
A BufferedWriter around a FileWriter will almost certainly be faster than any NIO scheme you can come up with. Your code certainly isn't optimal, with a new ByteBuffer per write, and then doing pointless operations on it when it is about to go out of scope, but in any case your question is founded on a misconception. NIO doesn't 'offload the memory footprint to the OS' at all, unless you're using FileChannel.transferTo/From(), which you can't in this instance.
NB don't use a PrintWriter as suggested in comments, as this swallows exceptions. PW is really only for consoles and log files where you don't care.
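To make the buffered-writer suggestion concrete, here is a minimal sketch. BufferedAppend and appendAll are hypothetical names, and the Iterable stands in for whatever supplies the strings from the database. It uses Files.newBufferedWriter with ISO-8859-1 and the CREATE and APPEND options, so one writer is kept open for all messages and the encoding is done by the writer instead of per-string getBytes calls.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends many strings through a single buffered writer.
final class BufferedAppend {
    /** Appends every message to 'file' as ISO-8859-1, creating the file if it does not exist. */
    static void appendAll(Path file, Iterable<String> messages) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file,
                StandardCharsets.ISO_8859_1,
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND)) {
            for (String message : messages) {
                // This writer throws an IOException for characters that
                // ISO-8859-1 cannot represent, rather than silently writing '?'.
                writer.write(message);
            }
        } // try-with-resources flushes and closes the writer
    }
}
```

Whether this beats the channel-based version on a given machine is something only a benchmark can tell, as noted above.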
Here is a short and easy way. It creates a file and writes the data relative to your code project:
private void writeToFile(String filename, String data) {
Path p = Paths.get(".", filename);
try (OutputStream os = new BufferedOutputStream(
Files.newOutputStream(p, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
os.write(data.getBytes(StandardCharsets.ISO_8859_1)); // write every encoded byte; data.length() counts chars, not bytes
} catch (IOException e) {
e.printStackTrace();
}
}
This works for me:
// Create a BufferedWriter for writing to the file
BufferedWriter napiš = Files.newBufferedWriter(Paths.get(filePath));
napiš.write(what);
// Don't forget to flush (or close) the writer so the buffered data reaches the file:
napiš.flush();

how can i read from a binary file?

I want to read a binary file whose size is 5.5 megabytes (an MP3 file). I tried it with FileInputStream, but it took many attempts. If possible, I want to read the file with as little wasted time as possible.
You should try to use a BufferedInputStream around your FileInputStream. It will improve the performance significantly.
new BufferedInputStream(fileInputStream, 8192 /* default buffer size */);
Furthermore, I'd recommend using the read method that takes a byte array and fills it instead of the plain read.
There are useful utilities in FileUtils for reading a file at once. This is simpler and efficient for modest files up to 100 MB.
byte[] bytes = FileUtils.readFileToByteArray(file); // handles IOException/close() etc.
Try this:
public static void main(String[] args) throws IOException
{
InputStream i = new FileInputStream("a.mp3");
byte[] contents = new byte[i.available()];
i.read(contents);
i.close();
}
A more reliable version, based on helpful comments from @Paul Cager & Liv about the unreliability of available() and read():
public static void main(String[] args) throws IOException
{
File f = new File("c:\\msdia80.dll");
InputStream i = new FileInputStream(f);
byte[] contents = new byte[(int) f.length()];
int read;
int pos = 0;
while ((read = i.read(contents, pos, contents.length - pos)) >= 1)
{
pos += read;
}
i.close();
}
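On Java 7 or later, the read loop above can probably be replaced by java.nio.file.Files.readAllBytes, which opens the file, reads until end of file and closes it again. This is a different technique from the answers above, shown only as a sketch; WholeFileRead and readAll are hypothetical names, and it only makes sense for files that comfortably fit in the heap, such as the 5.5 MB MP3 in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: loads a whole file into a byte array.
final class WholeFileRead {
    /** Reads the entire file into memory and returns its bytes. */
    static byte[] readAll(Path file) throws IOException {
        // readAllBytes handles the open / read-until-EOF / close sequence itself.
        return Files.readAllBytes(file);
    }
}
```

A call would look like byte[] contents = WholeFileRead.readAll(Paths.get("a.mp3")).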
