Fastest way of reading relatively huge byte-files in Java

Fastest way of reading relatively huge byte-files in Java - java

what's the probably fastest way of reading relatively huge files with Java's I/O-methods? My current solution uses the BufferedInputStream saving to an byte-array with 1024 bytes allocated to it. Each buffer is than saved in an ArrayList for later use. The whole process is called via a separate thread (callable-interface).
Not very fast though.
ArrayList<byte[]> outputArr = new ArrayList<byte[]>();
try {
BufferedInputStream reader = new BufferedInputStream(new FileInputStream (dir+filename));
byte[] buffer = new byte[LIMIT]; // == 1024
int i = 0;
while (reader.available() != 0) {
reader.read(buffer);
i++;
if (i <= LIMIT){
outputArr.add(buffer);
i = 0;
buffer = null;
buffer = new byte[LIMIT];
}
else continue;
}
System.out.println("FileReader-Elements: "+outputArr.size()+" w. "+buffer.length+" byte each.");

I would use a memory mapped file which is fast enough to do in the same thread.
final FileChannel channel = new FileInputStream(fileName).getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
// when finished
channel.close();
This assumes the file is smaller than 2 GB and will take 10 milli-seconds or less.

Don't use available(): it's not reliable. And don't ignore the result of the read() method: it tells you how many bytes were actually read. And if you want to read everything in memory, use a ByteArrayOutputStream rather than using a List<byte[]>:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int read;
while ((read = reader.read(buffer)) >= 0) {
baos.write(buffer, 0, read);
}
byte[] everything = baos.toByteArray();
I think 1024 is a bit small as a buffer size. I would use a larger buffer (something like 16 KB or 32KB)
Note that Apache commons IO and Guava have utility methods that do this for you, and have been optimized already.

Have a look at Java NIO (Non-Blocking Input/Output) API. Also, this question might prove being useful.
I don't have much experience with IO, but I've heard that NIO is much more efficient way of handling large sets of data.

Related

File writing in Java using FileOutputStream

String remoteFile2 = "/test/song.mp3";
File downloadFile2 = new File("D:/Downloads/song.mp3");
OutputStream outputStream2 = new BufferedOutputStream(new FileOutputStream(downloadFile2));
InputStream inputStream = ftpClient.retrieveFileStream(remoteFile2);
byte[] bytesArray = new byte[4096];
int bytesRead = -1;
while ((bytesRead = inputStream.read(bytesArray)) != -1) {
outputStream2.write(bytesArray, 0, bytesRead);
}
This is a sample file writing code in java,
byte[] bytesArray = new byte[4096];
In this line what exactly 4096 means, what is the possibility of changing this value?

When deal with stream, you often read bytes in chunk.
If you read / write byte one by one then there are lots of overhead (like init the array to store the byte, put the byte to stream, remember the current position in file... etc) for each byte.
So if you read a group of bytes, you still have those overhead but lesser (For example if you have 4000 bytes, you have 4000x overhead. But if you read 100 bytes per time, you have 4000/100 = 40x overhead only)
The length of chunk is often choosen to balance between the time to read/write the chunk and the size of chunk.
Its often set to 2k or 4k. Might be related with disk sector (512 bytes, 2048 bytes...)

Here 4096 is the buffer size. So whenever you loop is going on it first read 4096 bytes and after that it will go inside the loop.

Base64 encode file by chunks

I want to split a file into multiple chunks (in this case, trying lengths of 300) and base64 encode it, since loading the entire file to memory gives a negative array exception when base64 encoding it. I tried using the following code:
int offset = 0;
bis = new BufferedInputStream(new FileInputStream(f));
while(offset + 300 <= f.length()){
byte[] temp = new byte[300];
bis.skip(offset);
bis.read(temp, 0, 300);
offset += 300;
System.out.println(Base64.encode(temp));
}
if(offset < f.length()){
byte[] temp = new byte[(int) f.length() - offset];
bis.skip(offset);
bis.read(temp, 0, temp.length);
System.out.println(Base64.encode(temp));
}
At first it appears to be working, however, at one point it switches to just printing out "AAAAAAAAA" and fills up the entire console with it, and the new file is corrupted when decoded. What could be causing this error?

skip() "Skips over and discards n bytes of data from the input stream", and read() returns "the number of bytes read".
So, you read some bytes, skip some bytes, read some more, skip, .... eventually reaching EOF at which point read() returns -1, but you ignore that and use the content of temp which contains all 0's, that are then encoded to all A's.
Your code should be:
try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
int len;
byte[] temp = new byte[300];
while ((len = in.read(temp)) > 0)
System.out.println(Base64.encode(temp, 0, len));
}
This code reuses the single buffer allocated before the loop, so it will also cause much less garbage collection than your code.
If Base64.encode doesn't have a 3 parameter version, do this:
try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
int len;
byte[] temp = new byte[300];
while ((len = in.read(temp)) > 0) {
byte[] data;
if (len == temp.length)
data = temp;
else {
data = new byte[len];
System.arraycopy(temp, 0, data, 0, len);
}
System.out.println(Base64.encode(data));
}
}

Be sure to use a buffer size that is a multiple of 3 for encoding and a multiple of 4 for decoding when using chunks of data.
300 fulfills both, so that is already OK. Just as an info for those trying different buffer sizes.
Keep in mind, that reading from a stream into a buffer can in some cicumstances result in a buffer not being fully filled, even though the end of the stream was not yet reached. Might be possible when reading from an internet stream and a timeout occures.
You can heal that, but taking that into account would lead to much more complex coding, that would not be educational anymore.

How can I read a specific number of bytes from a FileInputStream object using buffers

I have a series of objects stored within a file concatenated as below:
sizeOfFile1 || file1 || sizeOfFile2 || file2 ...
The size of the files are serialized long objects and the files are just the raw bytes of the files.
I am trying to extract the files from the input file. Below is my code:
FileInputStream fileInputStream = new FileInputStream("C:\Test.tst");
ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
while (fileInputStream.available() > 0)
{
long size = (long) objectInputStream.readObject();
FileOutputStream fileOutputStream = new FileOutputStream("C:\" + size + ".tst");
BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
int chunkSize = 256;
final byte[] temp = new byte[chunkSize];
int finalChunkSize = (int) (size % chunkSize);
final byte[] finalTemp = new byte[finalChunkSize];
while(fileInputStream.available() > 0 && size > 0)
{
if (fileInputStream.available() > finalChunkSize)
{
int i = fileInputStream.read(temp);
secBufferedOutputStream.write(temp, 0, i);
size = size - i;
}
else
{
int i = fileInputStream.read(finalTemp);
secBufferedOutputStream.write(finalTemp, 0, i);
size = 0;
}
}
bufferedOutputStream.close();
}
fileOutputStream.close();
My code fails after it reads the first sizeOfFile; it just reads the rest of the input file into one file when there are multiple files stored.
Can anyone see the issue here?
Regards.

Wrap it in a DataInputStream and use readFully(byte[]).
But I question the design. Serialization and random access do not mix. It sounds like you should be using a database.
NB you are misusing available(). See the method's Javadoc page. It is never correct to use it as a count of the total number of bytes in the stream. There are few if any correct uses of available(), and this isn't one of them.

you could try NIO instead...
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
This reads only SIZE bytes from the file.
B

This is using DataInput to read longs. In this particular case I am not using readFully() as a segment might be too long to keep it in memory:
DataInputStream in = new DataInputStream(FileInputStream());
byte[] buf = new byte[64*1024];
while(true) {
OutputStream out = ...;
long size;
try { size = in.readLong(); } catch (EOFException e) { break; }
while(size > 0) {
int len = (size > buf.length)?buf.length:size;
len = in.read(buf, 0, len);
out.write(buf, 0, len);
size-=len;
}
out.close();
}

Save yourself a lot of trouble by doing one of these things:
Switch to using Avro, trust me you would be crazy not to. It's easy to learn, and will accomodate schema changes. Using ObjectXXXStream is one of the worst ideas ever, as soon as you change your schema your old files are garbage.
or use Thrift
or use Hibernate (but this is probably not a great option, hibernate takes a lot of time to learn, and takes a lot of configuration)
If you really refuse to switch to avro, I recommend reading up on apache's IOUtils class. It has a method to copy from one input stream to another, saving you a lot of headaches. Unfortunately what you want to do is a little more complicated, you want the size prefixing each file. You might be able to use a combination of SequenceInputStream objects to do that.
There is also GzipOutputStream and ZipOutputStream, but I think those require some other jars added to your classpath too.
I'm not going to write an example because I honestly think you should just learn avro or thrift and use that.

Java - File To Byte Array - Fast One

I want to read a file into a byte array. So, I am reading it using:
int len1 = (int)(new File(filename).length());
FileInputStream fis1 = new FileInputStream(filename);
byte buf1[] = new byte[len1];
fis1.read(buf1);
However, it is realy very slow. Can anyone inform me a very fast approach (possibly best one) to read a file into byte array. I can use java library also if needed.
Edit: Is there any benchmark which one is faster (including library approach).

It is not very slow, at least there is not way to make it faster. BUT it is wrong. If file is big enough the method read() will not return all bytes from fist call. This method returns number of bytes it managed to read as return value.
The right way is to call this method in loop:
public static void copy(InputStream input,
OutputStream output,
int bufferSize)
throws IOException {
byte[] buf = new byte[bufferSize];
int bytesRead = input.read(buf);
while (bytesRead != -1) {
output.write(buf, 0, bytesRead);
bytesRead = input.read(buf);
}
output.flush();
}
call this as following:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
copy(new FileInputStream(myfile), baos);
byte[] bytes = baos.toByteArray();
Something like this is implemented in a lot of packages, e.g. FileUtils.readFileToByteArray() mentioned by #Andrey Borisov (+1)
EDIT
I think that reason for slowness in your case is the fact that you create so huge array. Are you sure you really need it? Try to re-think your design. I believe that you do not have to read this file into array and can process data incrementally.

apache commons-io FileUtils.readFileToByteArray

Reading from set position in binary file (java)

I am making a small program in java, and i want it to read from a set position in a binary file. Like substring only on file streams. Any good way to do this?
byte[] buffer = new byte[1024];
FileInputStream in = new FileInputStream("test.bin");
while (bytesRead != -1) {
int bytesRead = inn.read(buffer, 0 , buffer.length);
}
in.close();

One way to do that is to use a java.io.RandomAccessFile and it's java.nio.FileChannel to read and/or write data from/to that file, for example
File file; // initialize somewhere
ByteBuffer buffer; // initialize somewhere
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
fc.position(pos); // position to the byte you want to start reading
fc.read(buffer); // read data into buffer
byte[] data = buffer.array();

Use seek to move the stream to the desired start location.
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html#skip(long)

I would use RandomAcessFile for the above.
If you are loading a large amount of data I would use memory mapping as this will appear to be much faster (and sometimes it is) BTW You can use FileInputStream for memory mapping as well.
FileChannel in = new FileInputStream("test.bin").getChannel();
MappedByteBuffer mbb = in.map(FileChannel.MapMode, 0, (int) in.size());
// access mbb anywhere
long l = mbb.getLong(40000000); // long at byte 40,000,000
//
in.close();

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Fastest way of reading relatively huge byte-files in Java - java

Have a look at Java NIO (Non-Blocking Input/Output) API. Also, this question might prove being useful. I don't have much experience with IO, but I've heard that NIO is much more efficient way of handling large sets of data.

Related

File writing in Java using FileOutputStream

Base64 encode file by chunks

How can I read a specific number of bytes from a FileInputStream object using buffers

Java - File To Byte Array - Fast One

Reading from set position in binary file (java)

Categories

Resources