Reading from a set position in a binary file (Java)

I am making a small program in Java, and I want it to read from a set position in a binary file, like substring but on file streams. Is there a good way to do this?
byte[] buffer = new byte[1024];
FileInputStream in = new FileInputStream("test.bin");
int bytesRead = in.read(buffer, 0, buffer.length);
while (bytesRead != -1) {
    // process buffer[0..bytesRead) here
    bytesRead = in.read(buffer, 0, buffer.length);
}
in.close();

One way to do that is to use a java.io.RandomAccessFile and its java.nio.channels.FileChannel to read and/or write data from/to that file, for example:
File file; // initialize somewhere
ByteBuffer buffer; // initialize somewhere
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
fc.position(pos); // position to the byte you want to start reading
fc.read(buffer); // read data into buffer
byte[] data = buffer.array();

Use skip() to move the stream to the desired start location:
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html#skip(long)
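For example, a minimal sketch of that approach (the file name and offset are just placeholders; note that skip() may skip fewer bytes than requested, so it is called in a loop):
FileInputStream in = new FileInputStream("test.bin");
long toSkip = 1000; // byte offset to start reading from
while (toSkip > 0) {
    long skipped = in.skip(toSkip); // may skip fewer bytes than requested
    if (skipped <= 0) break;
    toSkip -= skipped;
}
byte[] buffer = new byte[1024];
int bytesRead = in.read(buffer); // reads starting at the skipped-to position
in.close();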

I would use RandomAccessFile for the above.
If you are loading a large amount of data, I would use memory mapping, as this appears to be much faster (and sometimes it is). By the way, you can use FileInputStream for memory mapping as well.
FileChannel in = new FileInputStream("test.bin").getChannel();
MappedByteBuffer mbb = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
// access mbb anywhere
long l = mbb.getLong(40000000); // long at byte 40,000,000
in.close();

Related

How do I decompress large files using Zstd-jni and Byte Buffers

I am trying to decompress a lot of 40 MB+ files as I download them in parallel, using ByteBuffers and Channels. I am getting better throughput with Channels than with Streams, and we need this to be a very high-throughput system, since we have to process 40 TB of files every day and this part of the process is currently the bottleneck. The files are compressed with zstd-jni. Zstd-jni has APIs for decompressing byte buffers, but I get an error when I use them. How do I decompress one byte buffer at a time using zstd-jni?
I found these examples in their tests, but unless I am missing something the examples using ByteBuffers seem to assume the entire input file fits in one ByteBuffer:
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala
Below is my code for compressing and decompressing files. The compression code works great, but the decompression code then fails with an error of -70.
public static long compressFile(String inFile, String outFolder, ByteBuffer inBuffer, ByteBuffer compressedBuffer, int compressionLevel) throws IOException {
    File file = new File(inFile);
    File outFile = new File(outFolder, file.getName() + ".zs");
    long numBytes = 0L;
    try (RandomAccessFile inRaFile = new RandomAccessFile(file, "r");
         RandomAccessFile outRaFile = new RandomAccessFile(outFile, "rw");
         FileChannel inChannel = inRaFile.getChannel();
         FileChannel outChannel = outRaFile.getChannel()) {
        inBuffer.clear();
        while (inChannel.read(inBuffer) > 0) {
            inBuffer.flip();
            compressedBuffer.clear();
            long compressedSize = Zstd.compressDirectByteBuffer(compressedBuffer, 0, compressedBuffer.capacity(), inBuffer, 0, inBuffer.limit(), compressionLevel);
            numBytes += compressedSize;
            compressedBuffer.position((int) compressedSize);
            compressedBuffer.flip();
            outChannel.write(compressedBuffer);
            inBuffer.clear();
        }
    }
    return numBytes;
}
public static long decompressFile(String originalFilePath, String inFolder, ByteBuffer inBuffer, ByteBuffer decompressedBuffer) throws IOException {
    File outFile = new File(originalFilePath);
    File inFile = new File(inFolder, outFile.getName() + ".zs");
    outFile = new File(inFolder, outFile.getName());
    long numBytes = 0L;
    try (RandomAccessFile inRaFile = new RandomAccessFile(inFile, "r");
         RandomAccessFile outRaFile = new RandomAccessFile(outFile, "rw");
         FileChannel inChannel = inRaFile.getChannel();
         FileChannel outChannel = outRaFile.getChannel()) {
        inBuffer.clear();
        while (inChannel.read(inBuffer) > 0) {
            inBuffer.flip();
            decompressedBuffer.clear();
            long compressedSize = Zstd.decompressDirectByteBuffer(decompressedBuffer, 0, decompressedBuffer.capacity(), inBuffer, 0, inBuffer.limit());
            System.out.println(Zstd.isError(compressedSize) + " " + compressedSize);
            numBytes += compressedSize;
            decompressedBuffer.position((int) compressedSize);
            decompressedBuffer.flip();
            outChannel.write(decompressedBuffer);
            inBuffer.clear();
        }
    }
    return numBytes;
}
Yes, the static methods you use in your example assume the whole compressed file fits in one ByteBuffer. As far as I understand your requirements, you need streaming decompression using ByteBuffers. ZstdDirectBufferDecompressingStream already provides this:
https://static.javadoc.io/com.github.luben/zstd-jni/1.3.7-1/com/github/luben/zstd/ZstdDirectBufferDecompressingStream.html
and here is an example of how to use it (from the tests):
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala#L261-L302
but you also have to subclass it and override the "refill" method.
EDIT: here is a new test I just added that has exactly the same structure as your question - moving data between channels:
https://github.com/luben/zstd-jni/blob/master/src/test/scala/Zstd.scala#L540-L586
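To make the shape of that concrete, here is a rough, untested sketch based on the linked Javadoc and tests. It reuses inBuffer, decompressedBuffer, inChannel and outChannel from the question, assumes both buffers are direct, and treats the exact constructor/read/refill signatures as assumptions to verify against your zstd-jni version:
inBuffer.clear();
inChannel.read(inBuffer); // preload some compressed data
inBuffer.flip();
ZstdDirectBufferDecompressingStream zis = new ZstdDirectBufferDecompressingStream(inBuffer) {
    @Override
    protected ByteBuffer refill(ByteBuffer toRefill) {
        // called when the compressed input is exhausted: reload it from the channel
        toRefill.clear();
        try {
            inChannel.read(toRefill);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        toRefill.flip();
        return toRefill;
    }
};
while (zis.hasRemaining()) {
    decompressedBuffer.clear();
    zis.read(decompressedBuffer);   // decompress the next chunk into the buffer
    decompressedBuffer.flip();
    outChannel.write(decompressedBuffer);
}
zis.close();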

How can I read a specific number of bytes from a FileInputStream object using buffers

I have a series of objects stored within a file concatenated as below:
sizeOfFile1 || file1 || sizeOfFile2 || file2 ...
The sizes of the files are serialized long objects, and the files are just the raw bytes of the files.
I am trying to extract the files from the input file. Below is my code:
FileInputStream fileInputStream = new FileInputStream("C:\\Test.tst");
ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
while (fileInputStream.available() > 0)
{
    long size = (long) objectInputStream.readObject();
    FileOutputStream fileOutputStream = new FileOutputStream("C:\\" + size + ".tst");
    BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
    int chunkSize = 256;
    final byte[] temp = new byte[chunkSize];
    int finalChunkSize = (int) (size % chunkSize);
    final byte[] finalTemp = new byte[finalChunkSize];
    while (fileInputStream.available() > 0 && size > 0)
    {
        if (fileInputStream.available() > finalChunkSize)
        {
            int i = fileInputStream.read(temp);
            bufferedOutputStream.write(temp, 0, i);
            size = size - i;
        }
        else
        {
            int i = fileInputStream.read(finalTemp);
            bufferedOutputStream.write(finalTemp, 0, i);
            size = 0;
        }
    }
    bufferedOutputStream.close();
}
fileOutputStream.close();
My code fails after it reads the first sizeOfFile; it just reads the rest of the input file into one file when there are multiple files stored.
Can anyone see the issue here?
Regards.
Wrap it in a DataInputStream and use readFully(byte[]).
But I question the design. Serialization and random access do not mix. It sounds like you should be using a database.
NB you are misusing available(). See the method's Javadoc page. It is never correct to use it as a count of the total number of bytes in the stream. There are few if any correct uses of available(), and this isn't one of them.
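A rough sketch of that approach, assuming the length prefix is a raw long written with DataOutputStream.writeLong() (if it really is a serialized Long object, keep reading it with ObjectInputStream) and that each entry fits in memory:
DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream("C:\\Test.tst")));
try {
    while (true) {
        long size;
        try {
            size = in.readLong();           // length prefix for the next entry
        } catch (EOFException e) {
            break;                          // clean end of input
        }
        byte[] data = new byte[(int) size];
        in.readFully(data);                 // blocks until exactly 'size' bytes are read
        FileOutputStream out = new FileOutputStream("C:\\" + size + ".tst");
        out.write(data);
        out.close();
    }
} finally {
    in.close();
}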
you could try NIO instead...
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
This reads only SIZE bytes from the file.
This uses DataInput to read the longs. In this particular case I am not using readFully(), as a segment might be too long to keep in memory:
DataInputStream in = new DataInputStream(new FileInputStream(file));
byte[] buf = new byte[64 * 1024];
while (true) {
    OutputStream out = ...;
    long size;
    try { size = in.readLong(); } catch (EOFException e) { break; }
    while (size > 0) {
        int len = (int) Math.min(size, buf.length);
        len = in.read(buf, 0, len);
        out.write(buf, 0, len);
        size -= len;
    }
    out.close();
}
Save yourself a lot of trouble by doing one of these things:
Switch to using Avro; trust me, you would be crazy not to. It's easy to learn and will accommodate schema changes. Using ObjectXXXStream is one of the worst ideas ever; as soon as you change your schema, your old files are garbage.
or use Thrift
or use Hibernate (but this is probably not a great option; Hibernate takes a lot of time to learn and takes a lot of configuration)
If you really refuse to switch to Avro, I recommend reading up on Apache's IOUtils class. It has a method to copy from one input stream to another, saving you a lot of headaches. Unfortunately what you want to do is a little more complicated: you want the size prefixing each file. You might be able to use a combination of SequenceInputStream objects to do that.
There are also GZIPOutputStream and ZipOutputStream, but I think those require some other jars added to your classpath too.
I'm not going to write an example because I honestly think you should just learn avro or thrift and use that.
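If you do go the IOUtils route mentioned above, a hypothetical sketch of extracting one size-prefixed entry might look like this; it reuses the fileInputStream from the question and assumes a commons-io version that has the copyLarge overload taking an offset and a length:
// Read the size prefix however it was written (readObject()/readLong()),
// then copy exactly that many bytes to the per-file output.
long size = ...; // from the prefix
OutputStream out = new BufferedOutputStream(new FileOutputStream(size + ".tst"));
IOUtils.copyLarge(fileInputStream, out, 0, size); // offset 0 = start at the current stream position
out.close();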

Memory problems loading a file, plus converting into hex

I'm trying to make a file hexadecimal converter (input file -> output hex string of the file)
The code I came up with is
static String open2(String path) throws FileNotFoundException, IOException, OutOfMemoryError {
    System.out.println("BEGIN LOADING FILE");
    StringBuilder sb = new StringBuilder();
    //sb.ensureCapacity(2147483648);
    int size = 262144;
    FileInputStream f = new FileInputStream(path);
    FileChannel ch = f.getChannel();
    byte[] barray = new byte[size];
    ByteBuffer bb = ByteBuffer.wrap(barray);
    while (ch.read(bb) != -1)
    {
        //System.out.println(sb.capacity());
        sb.append(bytesToHex(barray));
        bb.clear();
    }
    System.out.println("FILE LOADED; BRING IT BACK");
    return sb.toString();
}
I am sure that "path" is a valid filename.
The problem is that with big files (>= 500 MB), the JVM throws an OutOfMemoryError: Java heap space on the StringBuilder.append.
To create this code I followed some tips from http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly but I ran into a problem when I tried to force a space allocation for the StringBuilder sb: "2147483648 is too big for an int".
If I want to use this code even with very big files (let's say up to 2 GB, if I really have to stop somewhere), what's the best way to output a hexadecimal string conversion of the file in terms of speed?
I'm now working on copying the converted string into a file. Anyway, I'm having the problem of "writing the empty buffer to the file" after the EOF of the original one.
static String open3(String path) throws FileNotFoundException, IOException {
    System.out.println("BEGIN LOADING FILE (Hope this is the last change)");
    FileWriter fos = new FileWriter("HEXTMP");
    int size = 262144;
    FileInputStream f = new FileInputStream(path);
    FileChannel ch = f.getChannel();
    byte[] barray = new byte[size];
    ByteBuffer bb = ByteBuffer.wrap(barray);
    while (ch.read(bb) != -1)
    {
        fos.write(bytesToHex(barray));
        bb.clear();
    }
    System.out.println("FILE LOADED; BRING IT BACK");
    return "HEXTMP";
}
Obviously the HEXTMP file created has a size that is a multiple of 256 KB, but if the original file is 257 KB it will be a 512 KB file with a LOT of "000000" at the end.
I know I just have to create a last byte array with the cut length.
(I used a FileWriter because I wanted to write the string of hex; otherwise it would have just copied the file as-is.)
Why are you loading the complete file?
You can load a few bytes into a buffer from the input file, process the bytes in the buffer, then write the processed bytes to the output file. Continue this until all bytes from the input file have been processed.
FileInputStream fis = new FileInputStream("in file");
FileOutputStream fos = new FileOutputStream("out");
byte[] buffer = new byte[8192];
while (true) {
    int count = fis.read(buffer);
    if (count == -1)
        break;
    byte[] processed = processBytesToConvert(buffer, count);
    fos.write(processed);
}
fis.close();
fos.close();
So just read a few bytes into the buffer, convert them to a hex string, get the bytes from the converted hex string, write those bytes back to the file, and continue with the next few input bytes.
The problem here is that you try to read the whole file and store it in memory.
You should use streams: read a chunk of your input file, convert it, and write it to the output file. That way your program can scale, whatever the size of the input file is.
The key is to read the file in chunks instead of reading all of it in one go. Depending on the use case you could vary the size of the chunk. For example, if you are trying to make a hex viewer/editor, determine how much content is shown in the viewport and read only that much data from the file. Or, if you are simply converting and dumping hex to another file, use any chunk size that is small enough to fit in memory but big enough for performance; this should be tunable over a few runs. Perhaps use filesystem NIO in Java 7 so that you can do all three tasks - reading, processing and writing - concurrently. The link included in the question gives a good primer on reading files.
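As a concrete variant of the chunked approach that also avoids the trailing zeros described in the question, the loop can convert only the bytes actually read. Here bytesToHex(byte[], int) is a hypothetical variant of the question's helper that formats just the first count bytes:
FileInputStream in = new FileInputStream(path);
FileWriter out = new FileWriter("HEXTMP");
byte[] chunk = new byte[262144];
int count;
while ((count = in.read(chunk)) != -1) {
    out.write(bytesToHex(chunk, count)); // convert only what was actually read
}
out.close();
in.close();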

Java - File To Byte Array - Fast One

I want to read a file into a byte array. So, I am reading it using:
int len1 = (int)(new File(filename).length());
FileInputStream fis1 = new FileInputStream(filename);
byte buf1[] = new byte[len1];
fis1.read(buf1);
However, it is really very slow. Can anyone suggest a very fast approach (possibly the best one) to read a file into a byte array? I can also use a Java library if needed.
Edit: Is there any benchmark showing which one is faster (including the library approach)?
It is not very slow; at least, there is no way to make it faster. BUT it is wrong. If the file is big enough, the read() method will not return all bytes from the first call. This method returns the number of bytes it managed to read as its return value.
The right way is to call this method in a loop:
public static void copy(InputStream input,
                        OutputStream output,
                        int bufferSize)
        throws IOException {
    byte[] buf = new byte[bufferSize];
    int bytesRead = input.read(buf);
    while (bytesRead != -1) {
        output.write(buf, 0, bytesRead);
        bytesRead = input.read(buf);
    }
    output.flush();
}
call this as following:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
copy(new FileInputStream(myfile), baos, 8192);
byte[] bytes = baos.toByteArray();
Something like this is implemented in a lot of packages, e.g. FileUtils.readFileToByteArray() mentioned by @Andrey Borisov (+1).
EDIT
I think the reason for the slowness in your case is the fact that you create such a huge array. Are you sure you really need it? Try to rethink your design. I believe that you do not have to read this file into an array and can process the data incrementally.
apache commons-io FileUtils.readFileToByteArray
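A minimal usage sketch, assuming commons-io (org.apache.commons.io.FileUtils) is on the classpath:
byte[] buf1 = FileUtils.readFileToByteArray(new File(filename)); // handles the read loop internally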

Fastest way of reading relatively huge byte-files in Java

What's probably the fastest way of reading relatively huge files with Java's I/O methods? My current solution uses a BufferedInputStream saving to a byte array with 1024 bytes allocated to it. Each buffer is then saved in an ArrayList for later use. The whole process is called via a separate thread (Callable interface).
Not very fast though.
ArrayList<byte[]> outputArr = new ArrayList<byte[]>();
try {
    BufferedInputStream reader = new BufferedInputStream(new FileInputStream(dir + filename));
    byte[] buffer = new byte[LIMIT]; // == 1024
    int i = 0;
    while (reader.available() != 0) {
        reader.read(buffer);
        i++;
        if (i <= LIMIT) {
            outputArr.add(buffer);
            i = 0;
            buffer = null;
            buffer = new byte[LIMIT];
        }
        else continue;
    }
    System.out.println("FileReader-Elements: " + outputArr.size() + " w. " + buffer.length + " byte each.");
I would use a memory mapped file which is fast enough to do in the same thread.
final FileChannel channel = new FileInputStream(fileName).getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
// when finished
channel.close();
This assumes the file is smaller than 2 GB and will take 10 milliseconds or less.
Don't use available(): it's not reliable. And don't ignore the result of the read() method: it tells you how many bytes were actually read. And if you want to read everything in memory, use a ByteArrayOutputStream rather than using a List<byte[]>:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int read;
while ((read = reader.read(buffer)) >= 0) {
baos.write(buffer, 0, read);
}
byte[] everything = baos.toByteArray();
I think 1024 is a bit small as a buffer size. I would use a larger buffer (something like 16 KB or 32 KB).
Note that Apache commons IO and Guava have utility methods that do this for you, and have been optimized already.
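For example, assuming those libraries are on the classpath, the one-liners look roughly like this:
// Guava (com.google.common.io.Files)
byte[] all = Files.toByteArray(new File(dir + filename));
// commons-io (org.apache.commons.io.IOUtils)
byte[] all2 = IOUtils.toByteArray(new FileInputStream(dir + filename));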
Have a look at the Java NIO (New I/O) API. Also, this question might prove useful.
I don't have much experience with IO, but I've heard that NIO is a much more efficient way of handling large sets of data.
