How to read a big binary file in Java

I want to read a binary file in Java which contains m datasets. I know that each dataset consists of 3 elements: a long number, a double number and a long number, in that sequence. The datasets are repeated one after another until the end of the file. Supposing that the number m of datasets is known, how can I read the file in Java without loading all the datasets into main memory, so that I can also read large files that do not "fit" in main memory?

If you want sequential access:
import java.io.DataInputStream;
import java.io.FileInputStream;

DataInputStream dis = new DataInputStream(new FileInputStream("input.bin"));
for (int i = 0; i < m; i++) {
    long l1 = dis.readLong();
    double d1 = dis.readDouble();
    long l2 = dis.readLong();
    /* do what you need */
}
dis.close();

If the "records" in your file have a fixed size, you can use RandomAccessFile, and particularly the seek method to move to the position you want to read from. The API also provides methods for reading longs and doubles.
The instruction I have is: "The datasets should not be all read in main memory". How can I tackle this?
Use seek to position the file and only read the datasets that need to be read.
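For the layout in the question (long, double, long), each dataset is 8 + 8 + 8 = 24 bytes, so dataset i starts at byte offset i * 24. A minimal sketch, assuming that layout with no padding and big-endian values (as written by DataOutputStream):

import java.io.IOException;
import java.io.RandomAccessFile;

// reads only the i-th dataset; nothing else is loaded into main memory
static void readDataset(RandomAccessFile raf, long i) throws IOException {
    final int recordSize = 24;        // 8 (long) + 8 (double) + 8 (long)
    raf.seek(i * recordSize);         // jump directly to dataset i
    long l1 = raf.readLong();
    double d1 = raf.readDouble();
    long l2 = raf.readLong();
    /* do what you need */
}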

I used java.nio (http://download.oracle.com/javase/7/docs/api/java/nio/package-summary.html), which provides buffering, as I needed some additional functionality such as setting the byte order.
In addition to johnchen902's solution, you read a buffer's worth of bytes from the stream at a time and decode the datasets out of that buffer.
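A minimal sketch of that approach, assuming the same 24-byte datasets; the file name, buffer size and little-endian byte order are placeholders to adjust to your file:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

try (FileChannel ch = FileChannel.open(Paths.get("input.bin"), StandardOpenOption.READ)) {
    ByteBuffer buf = ByteBuffer.allocate(24 * 1024);   // room for 1024 datasets at a time
    buf.order(ByteOrder.LITTLE_ENDIAN);                // match the order the file was written with
    while (ch.read(buf) != -1) {
        buf.flip();
        while (buf.remaining() >= 24) {                // one full dataset available
            long l1 = buf.getLong();
            double d1 = buf.getDouble();
            long l2 = buf.getLong();
            /* do what you need */
        }
        buf.compact();                                 // keep any partial dataset for the next read
    }
}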

Related

How to find a name in an unordered list of names in an 8GB flat file in Java

OK, so we have this problem, and I know I can use an InputStream to read the data as a stream instead of reading the whole file, since that would cause memory issues.
Referring to this answer: https://stackoverflow.com/a/14037510/1316967
However, the concern is speed, as I would, in this case, be reading each line of the entire file. Considering that this file contains millions of names in an unordered fashion and that this operation has to complete in a few seconds, how do I go about solving this problem?
Because the list is unordered, there is no alternative to reading the entire file.
If you're lucky, the first name is the name you're looking for: O(1).
If you're unlucky, it's the last name: O(n).
Apart from this, it doesn't matter whether you do it the java.io way (Files.newBufferedReader()) or the java.nio way (Files.newByteChannel()); they both perform more or less the same. If the input file is line-based (as in your case), you may use
Files.lines(path).filter(name::equals).findFirst();
which internally uses a BufferedReader.
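A slightly fuller sketch of that call; the stream should be closed so the underlying reader is released (the path and name here are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

Path path = Paths.get("names.txt");
String name = "someName";
// Files.lines reads lazily through a BufferedReader, one line at a time
try (Stream<String> lines = Files.lines(path)) {
    Optional<String> hit = lines.filter(name::equals).findFirst();
    System.out.println(hit.isPresent() ? "found" : "not found");
}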
If you really want to speed things up, you have to sort the names in the file (see How do I sort very large files); then you're able to read from an ordered list using an index.
EDIT:
Once you have an ordered list, you could fast-scan it and create an index using a TreeMap, and then jump right to the correct file position (use a RandomAccessFile or SeekableByteChannel) and read the name.
For example:
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

long blockSize = 1048576L;
Path file = Paths.get("yourFile");
long fileSize = Files.size(file);
RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");

//create the index: one entry per block, keyed by the first full line in the block
TreeMap<String, Long> index = new TreeMap<>();
for (long pos = 0; pos < fileSize; pos += blockSize) {
    //jump to the next block
    raf.seek(pos);
    if (pos > 0) {
        raf.readLine(); //skip the partial line we may have landed in
    }
    long lineStart = raf.getFilePointer();
    String line = raf.readLine();
    if (line != null) {
        index.put(line, lineStart);
    }
}

//get the position of a name
String name = "someName";
//get the beginning and end of the block that may contain it
long offset = Optional.ofNullable(index.floorEntry(name)).map(Map.Entry::getValue).orElse(0L);
long limit = Optional.ofNullable(index.higherEntry(name)).map(Map.Entry::getValue).orElse(fileSize);

//move the pointer to the offset position and scan up to the limit
raf.seek(offset);
long cur;
while ((cur = raf.getFilePointer()) < limit) {
    if (name.equals(raf.readLine())) {
        return cur; //the byte offset of the matching line (this sits inside a lookup method)
    }
}
The block size is a trade-off between index size, index-creation time and data-access time. The larger the blocks, the smaller the index and the shorter the index-creation time, but the longer the data-access time.
I would suggest moving the data to a database (check out SQLite for a serverless option).
If that is not possible, you can try to have multiple threads reading the file, each starting at a different offset in the file and reading only a portion of the file.
You would have to use a RandomAccessFile. This will only be beneficial if you are on a RAID system, as benchmarked here: http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2
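A rough sketch of that idea, with each thread opening its own RandomAccessFile and scanning only its own slice of the file; the file name, target name and thread count here are placeholders, and whether this helps at all depends on the underlying storage, as noted above:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Paths;

String fileName = "names.txt";
String target = "someName";
int threads = 4;
long fileSize = Files.size(Paths.get(fileName));
long slice = fileSize / threads + 1;

for (int t = 0; t < threads; t++) {
    final long start = t * slice;
    final long end = Math.min(start + slice, fileSize);
    new Thread(() -> {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(start);
            if (start > 0) {
                raf.readLine();                        // skip the partial line we landed in
            }
            String line;
            while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                if (target.equals(line)) {
                    System.out.println("found by the thread starting at " + start);
                    return;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }).start();
}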

Handling COMP-3 and EBCDIC conversion to ASCII in Java for large files

I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out of memory exception as the amount of data handled is huge, about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out of memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the correct direction on how to handle this?
Note: the file may contain records of different lengths, so splitting it based on record length seems not possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if it is English text you can do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, total, buf.length);
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    if (num > 0) {
        total += num;
    }
    return total; // number of bytes actually read; 0 means end of stream
}
If all the records are the same length, create a byte array of the record length, and the above method will read one record at a time.
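For example, a hypothetical loop over a file of fixed 100-byte records using the method above (the file name and record length are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

int recordLength = 100;                          // assumed fixed record length
byte[] record = new byte[recordLength];
try (InputStream in = new BufferedInputStream(new FileInputStream("bigfile.dat"))) {
    // readBuffer is the method shown above; it returns the number of bytes actually read
    while (readBuffer(in, record) == recordLength) {
        // decode the EBCDIC text and COMP-3 fields of this one record here
    }
}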
Finally the JRecord project has classes to read fixed length files etc. It can do comp-3 conversion. Note: I am the author of JRecord.
I'm running into an out of memory exception as the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out of memory exception, which I can understand
Me too.
but I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record, or in both cases at least somewhere in the fixed part of the record.
I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length.
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
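A hedged sketch of that approach, assuming a hypothetical layout where a 4-byte record_id sits at the start of every record; lengthFor(...) stands in for your own mapping from that id to the total record length:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("bigfile.dat")))) {
    byte[] header = new byte[4];                 // hypothetical 4-byte record_id
    while (true) {
        try {
            in.readFully(header);
        } catch (EOFException e) {
            break;                               // clean end of file
        }
        int recordLength = lengthFor(header);    // your mapping from record_id to record length
        byte[] rest = new byte[recordLength - header.length];
        in.readFully(rest);
        // decode the EBCDIC text and COMP-3 fields of header + rest here, one record at a time
    }
}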

java reading from a file from a certain point to another

This is my first question here; I hope you kind sirs can help me, and I thank you in advance.
I am trying to write a Java project using threads and the replicated workers paradigm. What I want to do is create a workpool of tasks. The task the workers have to do is simply count the number of words in a specified file between two indices. I want to create tasks like this: (file, startIndex, finishIndex). I have problems figuring out which file-handling class I should use to open a file and read the words from startIndex to finishIndex. I should also mention that I am given a chunk size and am supposed to split the tasks using it; chunkSize is an int representing a number of bytes.
Bottom line: I want to read from a file from startIndex to startIndex + chunkSize.
I think you are looking for the RandomAccessFile class. It has a "seek" method that allows you to skip to a certain position in the file. Example:
int chunkSize = 64;
long startingIndex = 55;
byte[] bytesRead = new byte[chunkSize];
RandomAccessFile file = new RandomAccessFile("file.txt", "r");
file.seek(startingIndex);
int bytesActuallyRead = file.read(bytesRead); // may be fewer than chunkSize near the end of the file
file.close();
Note that this will seek a number of bytes, not words. It's impossible to know where the words in a file are before reading it. Counting the spaces and adding one is a naive method that would work well enough in this case.
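For instance, applying that naive method to the bytes that were actually read (using the variables from the snippet above):

// count the spaces in the chunk and add one; assumes single spaces between words
int words = (bytesActuallyRead > 0) ? 1 : 0;
for (int i = 0; i < bytesActuallyRead; i++) {
    if (bytesRead[i] == ' ') {
        words++;
    }
}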

File splitting loss of data

I wrote a program for file splitting and joining. When I break the file into small pieces, I find that the combined size of the smaller files is not equal to the size of the original one; there is a loss of approximately 30-50 bytes of data, and the combined file doesn't run correctly.
E.g. a file ABC has been broken into 2 parts, ABC1 and ABC2, but the problem is that
sizeof(ABC) is not equal to sizeof(ABC1) + sizeof(ABC2). By sizeof(ABC) I mean the size from Windows's perspective, i.e. from the Windows properties dialog box.
My code is:
for (int i = 0; i < no_of_parts; i++)
{
    copied_data = 0; // counts the number of bytes transferred into this part of the file
    fos = new FileOutputStream(jTextField2.getText() + "\\part" + i);
    bouts = new BufferedOutputStream(fos);
    while ((b = bins.read()) != -1)
    {
        bouts.write(b);
        copied_data++;
        if (copied_data == each_part_size_in_byte)
            break;
    }
}
What about closing your output stream? It will flush the buffer and free the file descriptor you use. Call bouts.close().
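A sketch of the loop from the question with that fix applied, using try-with-resources so each part is flushed and closed before the next one starts (it reuses the question's variables: no_of_parts, bins, each_part_size_in_byte, jTextField2):

for (int i = 0; i < no_of_parts; i++) {
    int copied_data = 0;
    try (BufferedOutputStream bouts = new BufferedOutputStream(
            new FileOutputStream(jTextField2.getText() + "\\part" + i))) {
        int b;
        while ((b = bins.read()) != -1) {
            bouts.write(b);
            copied_data++;
            if (copied_data == each_part_size_in_byte) {
                break;
            }
        }
    } // the buffer is flushed and the file is closed here
}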
When you create a file, it is allocated in fixed-size blocks on disk rather than byte by byte. So when you divide the file into two, both parts occupy whole blocks, and their sizes on disk may be more than the actual amount of data written.

Reading big files and performing some operations in java

First of all, I will try to explain what I need to do.
I need to read a file (whose size could be anything from 1 byte to 2 GB); 2 GB is the maximum because I try to use MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I put into a StringBuilder, and then I put this StringBuilder into an ArrayList.
However, I also need to do the following:
The user can specify a blockSize, which is the number of chars I have to read into the StringBuilder (basically the number of file bytes converted to chars).
Once I have collected the user-defined char count, I create a copy of the StringBuilder and put it into the ArrayList.
All steps are performed for every char read. The problem is with the StringBuilder, since if the file is big (> 500 MB), I get an OutOfMemoryError:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at java.lang.StringBuilder.<init>(StringBuilder.java:106)
at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1
I am posting my code; maybe someone could suggest improvements to it or some alternatives.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;

public class ReadFile {

    //matrix block size
    public int blockSize = 100;
    public int charCounter = 0;

    public ArrayList readFile(File file) throws FileNotFoundException, IOException {
        FileChannel fc = new FileInputStream(file).getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());

        ArrayList characters = new ArrayList();
        int counter = 0;
        StringBuilder sb = new StringBuilder();//blockSize-1

        while (mbb.hasRemaining()) {
            char charAscii = (char) mbb.get();
            counter++;
            charCounter++;

            if (counter == blockSize) {
                sb.append(charAscii);
                characters.add(new StringBuilder(sb));//new StringBuilder(sb)
                sb.delete(0, sb.length());
                counter = 0;
            } else {
                sb.append(charAscii);
            }

            if (!mbb.hasRemaining()) {
                characters.add(sb);
            }
        }
        fc.close();
        return characters;
    }
}
EDIT:
I am doing the Burrows-Wheeler transformation. There I should read every file and then, by block size, create as many matrices as needed. Well, I believe the wiki will explain it better than me:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
If you load large files, it's not entirely surprising that you run out of memory.
How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g. using the -Xmx setting)?
Bear in mind that you will need at least twice as much memory as the file size, because Java uses UTF-16 internally, which uses at least 2 bytes for each character, while your input is one byte per character. So to load a 2 GB file you will need at least 4 GB of heap allocated just for storing this text data.
Also, you need to sort out the logic in your code: you do the same sb.append(charAscii) in the if and the else, and you test !mbb.hasRemaining() in every iteration of a while (mbb.hasRemaining()) loop.
As I asked in your previous question, do you need to store StringBuilders, or would the resulting Strings be OK? Storing strings would save space because StringBuilder allocates memory in big chunks (I think it doubles in size every time it runs out of space!) so may waste a lot.
If you do have to use StringBuilders then pre-sizing them to the value of blockSize would make the code more memory-efficient (and faster).
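A sketch of both suggestions applied to the loop from the question (it reuses the question's mbb and blockSize): the builder is pre-sized to blockSize and plain Strings are stored instead of StringBuilder copies:

import java.util.ArrayList;
import java.util.List;

List<String> blocks = new ArrayList<>();
StringBuilder sb = new StringBuilder(blockSize);   // pre-sized: no incremental re-allocation
while (mbb.hasRemaining()) {
    sb.append((char) mbb.get());
    if (sb.length() == blockSize) {
        blocks.add(sb.toString());                 // a String holds exactly blockSize chars
        sb.setLength(0);
    }
}
if (sb.length() > 0) {
    blocks.add(sb.toString());                     // last, possibly shorter, block
}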
I try to use MappedByteBuffer for fast reading. Maybe later I will try to read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I put into a StringBuilder, and then I put this StringBuilder into an ArrayList.
This sounds more like a problem than a solution. I suggest to you that the file already is ASCII, or character data; that it could be read pretty efficiently using a BufferedReader; and that it can be processed one line at a time.
So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing including the MappedByteBuffer is consuming memory on a truly heroic scale.
If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.
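A minimal sketch of that line-at-a-time approach (the file name is a placeholder); only the current line is ever held in memory:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (BufferedReader reader = Files.newBufferedReader(
        Paths.get("input.txt"), StandardCharsets.US_ASCII)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // transform or accumulate this one line here
    }
}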
