Fastest way to read huge number of int from binary file - java

I use Java 1.5 on an embedded Linux device and want to read a binary file containing 2 MB of int values (currently 4 bytes each, big endian, but I can decide the format).
Using DataInputStream via BufferedInputStream and dis.readInt(), these 500 000 calls need 17 s to read, while reading the whole file into one big byte buffer takes 5 s.
How can I read that file faster into one huge int[]?
The reading process should not use more than an additional 512 KB of memory.
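For reference, the readInt() baseline mentioned above looks roughly like this (a sketch of the approach, not the exact code that was timed; exception handling is omitted):
// Baseline: 500 000 individual readInt() calls through a BufferedInputStream
DataInputStream dis = new DataInputStream(
        new BufferedInputStream(new FileInputStream("filename")));
int[] result = new int[500000];
try {
    for (int i = 0; i < result.length; i++) {
        result[i] = dis.readInt(); // one method call per int
    }
} finally {
    dis.close();
}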
The nio code below is not faster than the readInt() approach from java.io.
// assume I already know that there are 500 000 ints to read:
int numInts = 500000;
// the result goes in here
int[] result = new int[numInts];
int cnt = 0;
RandomAccessFile aFile = new RandomAccessFile("filename", "r");
FileChannel inChannel = aFile.getChannel();
ByteBuffer buf = ByteBuffer.allocate(512 * 1024);
int bytesRead = inChannel.read(buf); // read into buffer
while (bytesRead != -1) {
    buf.flip(); // make buffer ready for get()
    while (buf.hasRemaining() && cnt < numInts) {
        // probably slow here since called 500 000 times
        result[cnt] = buf.getInt();
        cnt++;
    }
    buf.clear(); // make buffer ready for writing
    bytesRead = inChannel.read(buf);
}
inChannel.close();
aFile.close();
Update: evaluation of the answers:
On the PC, the memory map with IntBuffer approach was the fastest in my setup.
On the embedded device, without a JIT, java.io's DataInputStream.readInt() was a bit faster (17 s vs. 20 s for the memory map with IntBuffer).
Final conclusion:
A significant speedup is easier to achieve via an algorithmic change (e.g. a smaller file for initialization).

I don't know if this will be any faster than what Alexander provided, but you could try mapping the file.
try (FileInputStream stream = new FileInputStream(filename)) {
    FileChannel inChannel = stream.getChannel();
    ByteBuffer buffer = inChannel.map(FileChannel.MapMode.READ_ONLY, 0, inChannel.size());
    int[] result = new int[500000];
    buffer.order(ByteOrder.BIG_ENDIAN);
    IntBuffer intBuffer = buffer.asIntBuffer();
    intBuffer.get(result);
}

You can use IntBuffer from the nio package -> http://docs.oracle.com/javase/6/docs/api/java/nio/IntBuffer.html
int[] intArray = new int[500000];
IntBuffer intBuffer = IntBuffer.wrap(intArray);
...
Fill in the buffer by making calls to inChannel.read(intBuffer).
Once the buffer is full, your intArray will contain 500 000 integers.
EDIT
It turns out that Channels only support ByteBuffer, so read into a ByteBuffer and use an IntBuffer view instead:
// assume I already know that there are 500 000 ints to read:
int numInts = 500000;
// the result goes in here
int[] result = new int[numInts];
// 4 bytes per int, direct buffer
ByteBuffer buf = ByteBuffer.allocateDirect(numInts * 4);
// BIG_ENDIAN byte order
buf.order(ByteOrder.BIG_ENDIAN);
// Fill the buffer completely
while (buf.hasRemaining()) {
    // Per EJP's suggestion, check the EOF condition
    if (inChannel.read(buf) == -1) {
        // Hit EOF before the buffer was full
        throw new EOFException();
    }
}
buf.flip();
// Create an IntBuffer view
IntBuffer intBuffer = buf.asIntBuffer();
// result will now contain all ints read from the file
intBuffer.get(result);

I ran a fairly careful experiment comparing serialize/deserialize, DataInputStream vs. ObjectInputStream, both based on ByteArrayInputStream to avoid I/O effects. For a million ints, readObject took about 20 msec and readInt about 116 msec. The serialization overhead on a million-int array was 27 bytes. This was on a 2013-ish MacBook Pro.
Having said that, object serialization is sort of evil, and you have to have written the data out with a Java program.
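A minimal sketch of that comparison (my own illustration, not the poster's benchmark; exception handling and proper warm-up/repetition are omitted):
// Write a million-int array with ObjectOutputStream...
int[] data = new int[1000000];
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(bos);
oos.writeObject(data);
oos.close();
// ...and read it back with a single readObject() call
ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
int[] fromObject = (int[]) ois.readObject();
// The DataInputStream variant needs one readInt() call per value
ByteArrayOutputStream bos2 = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos2);
for (int i = 0; i < data.length; i++) dos.writeInt(data[i]);
dos.close();
DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bos2.toByteArray()));
int[] fromData = new int[data.length];
for (int i = 0; i < fromData.length; i++) fromData[i] = dis.readInt();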

Related

How to read Serial data the same way in Processing as C#?

In C#, I use the SerialPort Read function like so:
byte[] buffer = new byte[100000];
int bytesRead = serial.Read(buffer, 0, 100000);
In Processing, I use readBytes as so:
byte[] buffer = new byte[100000];
int bytesRead = serial.readBytes(buffer);
In Processing, I get incorrect byte values when I loop over the buffer array filled by readBytes. When I just use the regular read function I get the proper values, but then I can't grab the data into a byte array. What am I doing wrong in the Processing version that leads to the wrong values in the buffer array?
I print out the data the same way in both versions:
for (int i = 0; i < bytesRead; i++) {
    println(buffer[i]);
}
Java bytes are signed, so any value over 127 will overflow into a negative number.
A quick solution is to do
int anUnsignedByte = (int) aSignedByte & 0xff;
to each of your bytes.
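For example, applied to the print loop above (a sketch reusing the same variable names):
for (int i = 0; i < bytesRead; i++) {
    int unsignedByte = buffer[i] & 0xff; // 0..255 instead of -128..127
    println(unsignedByte);
}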

File writing in Java using FileOutputStream

String remoteFile2 = "/test/song.mp3";
File downloadFile2 = new File("D:/Downloads/song.mp3");
OutputStream outputStream2 = new BufferedOutputStream(new FileOutputStream(downloadFile2));
InputStream inputStream = ftpClient.retrieveFileStream(remoteFile2);
byte[] bytesArray = new byte[4096];
int bytesRead = -1;
while ((bytesRead = inputStream.read(bytesArray)) != -1) {
    outputStream2.write(bytesArray, 0, bytesRead);
}
This is sample file-writing code in Java.
byte[] bytesArray = new byte[4096];
In this line, what exactly does 4096 mean, and can this value be changed?
When dealing with streams, you often read bytes in chunks.
If you read/write bytes one by one, there is a lot of overhead for each byte (initializing the array that stores the byte, putting the byte onto the stream, remembering the current position in the file, etc.).
If you read a group of bytes instead, you still have that overhead, but much less of it: with 4000 bytes read one at a time you pay the overhead 4000 times, but reading 100 bytes at a time you pay it only 4000/100 = 40 times.
The chunk length is usually chosen to balance the time to read/write a chunk against the size of the chunk.
It is often set to 2 KB or 4 KB, which may be related to the disk sector size (512 bytes, 2048 bytes, ...).
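As a rough sketch (my own example, not from the question; file names are made up), the same copy loop with a configurable chunk size looks like this, and only the buffer allocation changes:
int bufferSize = 8 * 1024; // try 2048, 4096, 8192, ... and measure
byte[] bytesArray = new byte[bufferSize];
InputStream in = new FileInputStream("in.bin");
OutputStream out = new FileOutputStream("out.bin");
int bytesRead;
while ((bytesRead = in.read(bytesArray)) != -1) {
    out.write(bytesArray, 0, bytesRead);
}
out.close();
in.close();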
Here 4096 is the buffer size: on each iteration of the loop, up to 4096 bytes are read before the loop body runs.

Base64 encode file by chunks

I want to split a file into multiple chunks (in this case, trying lengths of 300) and Base64-encode it, since loading the entire file into memory gives a negative array size exception when Base64-encoding it. I tried the following code:
int offset = 0;
bis = new BufferedInputStream(new FileInputStream(f));
while (offset + 300 <= f.length()) {
    byte[] temp = new byte[300];
    bis.skip(offset);
    bis.read(temp, 0, 300);
    offset += 300;
    System.out.println(Base64.encode(temp));
}
if (offset < f.length()) {
    byte[] temp = new byte[(int) f.length() - offset];
    bis.skip(offset);
    bis.read(temp, 0, temp.length);
    System.out.println(Base64.encode(temp));
}
At first it appears to work; however, at some point it switches to printing "AAAAAAAAA", fills the entire console with it, and the new file is corrupted when decoded. What could be causing this error?
skip() "Skips over and discards n bytes of data from the input stream", and read() returns "the number of bytes read".
So you read some bytes, skip some bytes, read some more, skip, ... eventually reaching EOF, at which point read() returns -1. You ignore that and use the content of temp, which still contains all zeros, and those are then encoded to all A's.
Your code should be:
try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
    int len;
    byte[] temp = new byte[300];
    while ((len = in.read(temp)) > 0)
        System.out.println(Base64.encode(temp, 0, len));
}
This code reuses the single buffer allocated before the loop, so it will also cause much less garbage collection than your code.
If Base64.encode doesn't have a 3 parameter version, do this:
try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
    int len;
    byte[] temp = new byte[300];
    while ((len = in.read(temp)) > 0) {
        byte[] data;
        if (len == temp.length)
            data = temp;
        else {
            data = new byte[len];
            System.arraycopy(temp, 0, data, 0, len);
        }
        System.out.println(Base64.encode(data));
    }
}
Be sure to use a buffer size that is a multiple of 3 for encoding and a multiple of 4 for decoding when using chunks of data.
300 fulfills both, so that is already OK; this is just a note for those trying different buffer sizes.
Keep in mind that reading from a stream into a buffer can, in some circumstances, leave the buffer not fully filled even though the end of the stream has not yet been reached. This can happen, for example, when reading from a network stream and a timeout occurs.
You can handle that, but taking it into account makes the code considerably more complex and less instructive.
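For completeness, a minimal sketch of handling short reads (my own addition, not part of the original answer; it assumes the same stream in and the three-parameter Base64.encode used above): fill each chunk completely, or up to EOF, before encoding it.
byte[] temp = new byte[300];
while (true) {
    int len = 0;
    // keep reading until the chunk is full or EOF is reached
    while (len < temp.length) {
        int n = in.read(temp, len, temp.length - len);
        if (n == -1) break; // EOF
        len += n;
    }
    if (len == 0) break; // nothing left to encode
    System.out.println(Base64.encode(temp, 0, len));
    if (len < temp.length) break; // EOF reached mid-chunk
}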

Reading from set position in binary file (java)

I am making a small program in Java, and I want it to read from a set position in a binary file, like substring but for file streams. Is there a good way to do this?
byte[] buffer = new byte[1024];
FileInputStream in = new FileInputStream("test.bin");
int bytesRead = 0;
while (bytesRead != -1) {
    bytesRead = in.read(buffer, 0, buffer.length);
}
in.close();
One way to do that is to use a java.io.RandomAccessFile and its java.nio.channels.FileChannel to read and/or write data from/to that file, for example:
File file; // initialize somewhere
ByteBuffer buffer; // initialize somewhere
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel fc = raf.getChannel();
fc.position(pos); // position to the byte you want to start reading
fc.read(buffer); // read data into buffer
byte[] data = buffer.array();
Use seek (or skip on an InputStream) to move to the desired start position.
http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html#skip(long)
I would use RandomAccessFile for the above.
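A short sketch of that approach (file name and offsets are made up for illustration):
RandomAccessFile raf = new RandomAccessFile("test.bin", "r");
raf.seek(1024);               // jump to byte offset 1024
byte[] chunk = new byte[256]; // read 256 bytes starting there
raf.readFully(chunk);
raf.close();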
If you are loading a large amount of data, I would use memory mapping, as this will appear to be much faster (and sometimes it is). BTW, you can use a FileInputStream for memory mapping as well.
FileChannel in = new FileInputStream("test.bin").getChannel();
MappedByteBuffer mbb = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
// access mbb anywhere
long l = mbb.getLong(40000000); // long at byte 40,000,000
in.close();

Fastest way of reading relatively huge byte-files in Java

What is probably the fastest way of reading relatively huge files with Java's I/O methods? My current solution uses a BufferedInputStream reading into a byte array with 1024 bytes allocated to it. Each buffer is then saved in an ArrayList for later use. The whole process runs in a separate thread (Callable interface).
Not very fast though.
ArrayList<byte[]> outputArr = new ArrayList<byte[]>();
try {
    BufferedInputStream reader = new BufferedInputStream(new FileInputStream(dir + filename));
    byte[] buffer = new byte[LIMIT]; // == 1024
    int i = 0;
    while (reader.available() != 0) {
        reader.read(buffer);
        i++;
        if (i <= LIMIT) {
            outputArr.add(buffer);
            i = 0;
            buffer = null;
            buffer = new byte[LIMIT];
        } else {
            continue;
        }
    }
    System.out.println("FileReader-Elements: " + outputArr.size() + " w. " + buffer.length + " byte each.");
I would use a memory mapped file which is fast enough to do in the same thread.
final FileChannel channel = new FileInputStream(fileName).getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
// when finished
channel.close();
This assumes the file is smaller than 2 GB; the mapping itself should take 10 milliseconds or less.
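If the file could be larger than 2 GB, one option (my own sketch, not part of the original answer) is to map it in regions, since a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes:
FileChannel channel = new FileInputStream(fileName).getChannel();
long size = channel.size();
long chunk = 1L << 30; // map 1 GB at a time
for (long pos = 0; pos < size; pos += chunk) {
    long len = Math.min(chunk, size - pos);
    MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
    // process 'region' here
}
channel.close();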
Don't use available(): it's not reliable. And don't ignore the result of the read() method: it tells you how many bytes were actually read. If you want to read everything into memory, use a ByteArrayOutputStream rather than a List<byte[]>:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int read;
while ((read = reader.read(buffer)) >= 0) {
    baos.write(buffer, 0, read);
}
byte[] everything = baos.toByteArray();
I think 1024 is a bit small as a buffer size. I would use a larger buffer (something like 16 KB or 32 KB).
Note that Apache Commons IO and Guava have utility methods that do this for you and have already been optimized.
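For example (a sketch; both methods exist in those libraries, but check the version you have, and exceptions are not handled here):
// Apache Commons IO
byte[] viaCommons = org.apache.commons.io.FileUtils.readFileToByteArray(new java.io.File(dir + filename));
// Guava
byte[] viaGuava = com.google.common.io.Files.toByteArray(new java.io.File(dir + filename));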
Have a look at the Java NIO (New I/O) API. Also, this question might prove useful.
I don't have much experience with I/O, but I've heard that NIO is a much more efficient way of handling large amounts of data.
