How to read an arbitrary but contiguous n lines from a huge file - Java

I would like to read an arbitrary number of lines. The files are normal ASCII text files for the moment (they may be UTF-8/multibyte character files later).
So what I want is a method that reads only specific lines from a file (for example lines 101-200), and while doing so it should not block anything (i.e. the same file can be read by another thread for lines 201-210 without waiting for the first read operation to finish).
If there are no lines left to read it should gracefully return whatever it could read. The output of the method could be a List.
The solution I have thought of so far is to read the entire file once to find the number of lines as well as the byte position of each newline character, and then use RandomAccessFile to read the bytes and convert them to lines. I have to convert the bytes to Strings (but that can be done after the reading is done). I would avoid an end-of-file exception for reading beyond the file by proper bookkeeping. The solution is a bit inefficient, as it goes through the file twice, but the file size can be really big and we want to keep very little in memory.
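A rough sketch of that two-pass idea, assuming the index of line-start offsets fits in memory (fileName, firstLine and lastLine are illustrative; bounds checking is omitted):
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Pass 1: record the byte offset at which each line starts (line 0 starts at 0).
List<Long> lineStarts = new ArrayList<>();
lineStarts.add(0L);
try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
    long pos = 0;
    int b;
    while ((b = raf.read()) != -1) {
        pos++;
        if (b == '\n') lineStarts.add(pos);
    }
    // Pass 2: read lines firstLine..lastLine (0-based, inclusive) as one block of bytes;
    // converting the bytes to Strings can be done after the reading is finished.
    long start = lineStarts.get(firstLine);
    long end = lineStarts.get(lastLine + 1);
    byte[] block = new byte[(int) (end - start)];
    raf.seek(start);
    raf.readFully(block);
}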
If there is a library for this that would work, but a simpler native Java solution would be great.
As always, I appreciate your clarifying questions and I will edit this question as it evolves.

Why not use a Scanner and just loop through hasNextLine() until you get to the line you want, then grab as many lines as you wish? If the file runs out, it will fail gracefully. That way you're only reading the file once (unless Scanner reads it fully; I've never looked under the hood, but it doesn't sound like you care, so there you go).
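A minimal sketch of that approach (file, startLine and count are illustrative variables):
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Skip to startLine (1-based), then collect up to count lines; stops early
// and returns whatever it got if the file runs out of lines.
List<String> lines = new ArrayList<>();
try (Scanner sc = new Scanner(file)) {
    for (int i = 1; i < startLine && sc.hasNextLine(); i++) {
        sc.nextLine();                       // discard lines before the requested range
    }
    while (lines.size() < count && sc.hasNextLine()) {
        lines.add(sc.nextLine());
    }
}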

If you want to minimise memory consumption, I would use a memory mapped file. This uses almost no heap. The amount of the file kept in memory is handled by the OS so you don't need to tune the behaviour yourself.
// Map the whole file read-only; the OS decides how much of it stays resident.
FileChannel fc = new FileInputStream(fileName).getChannel();
final MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
If you have a file of 2 GB or more, you need multiple mappings. In the simplest case you can scan the data and remember all the line-start offsets. The index itself could take a lot of space, so you might only remember every Nth offset, e.g. every tenth.
e.g. a 2 GB file with 40-byte lines has 50 million lines, and a full index of 8-byte offsets would need 400 MB of memory.
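For illustration, a rough sketch of building such a sparse index from the map above, assuming a single mapping (i.e. a file under 2 GB):
import java.util.ArrayList;
import java.util.List;

// Record the start offset of every 10th line; line 0 starts at offset 0.
List<Long> sparseIndex = new ArrayList<>();
sparseIndex.add(0L);
int newlines = 0;
for (int pos = 0; pos < map.limit(); pos++) {
    if (map.get(pos) == '\n') {
        newlines++;
        if (newlines % 10 == 0) {
            sparseIndex.add((long) (pos + 1));   // the next line starts just after the '\n'
        }
    }
}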
Another way around having a large index is to create another memory mapped file.
// indexFileName is a separate file that will hold the index.
FileChannel indexChannel = new RandomAccessFile(indexFileName, "rw").getChannel();
final MappedByteBuffer map2 = indexChannel.map(FileChannel.MapMode.READ_WRITE, 0, fc.size() / 10);
The problem is that you don't know how big the index file needs to be before you start. Fortunately, if you make it larger than needed, it doesn't consume memory or disk space, so the simplest thing to do is make it very large and truncate it once you know the size it needs to be.
This could also be used to avoid re-indexing the file each time you load it (only re-index when it has changed). If the file is only appended to, you could index from the end of the file each time.
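A rough sketch of filling that second mapping, assuming map2 was mapped large enough to hold one 8-byte offset per line and the data mapping (map) covers the whole file:
// Write one long per line start into the index mapping, then truncate the index
// file to the bytes actually used. Truncating while the mapping is still live can
// fail on some platforms, so you may need to release the mapping or reopen the channel.
long entries = 1;
map2.putLong(0, 0L);                             // line 0 starts at data offset 0
for (int pos = 0; pos < map.limit(); pos++) {
    if (map.get(pos) == '\n') {
        map2.putLong((int) (entries * 8), pos + 1L);
        entries++;
    }
}
indexChannel.truncate(entries * 8);              // drop the unused tail of the index file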
Note: this approach can use a lot of virtual memory. For a 64-bit JVM this is no problem, as your limit is likely to be 256 TB. For a 32-bit application, your limit is likely to be 1.5 - 3.5 GB depending on your OS.

Related

In Java, can I remove specific bytes from a file?

So far I have managed to do something with a byte stream: read the original file, and write to a new file while omitting the desired bytes (and then finish by deleting/renaming the files so that only one is left).
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file. The reason is that this has to be performed when there is low memory and the file is too big, so cloning the file before trimming it may not be the best option.
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file.
There isn't a SAFE way to do it.
The unsafe way to do it involves (for example) mapping the file using a MappedByteBuffer, and shuffling the bytes around.
But the problem is that if something goes wrong while you are doing this, you are liable to end up with a corrupted file.
Therefore, if the user asks to perform this operation when the device's memory is too full to hold a second copy of the file, the best thing is to tell the user to "delete some files first".
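For illustration only, a rough sketch of that unsafe in-place approach, assuming the file fits in a single mapping (under 2 GB); fileName, from and count are illustrative, and a crash part-way through leaves the file corrupted:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Remove `count` bytes starting at offset `from` by shifting the tail left in place,
// then truncating the file. Truncating while the mapping is live can fail on some
// platforms (notably Windows).
int from = 1024, count = 512;                    // byte range to remove (illustrative)
try (RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
     FileChannel ch = raf.getChannel()) {
    int size = (int) ch.size();
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
    for (int i = from; i + count < size; i++) {
        buf.put(i, buf.get(i + count));          // shift the tail left by `count` bytes
    }
    buf.force();                                 // flush the shifted bytes to disk
    ch.truncate(size - count);                   // drop the duplicated tail
}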
The reason is that this has to be performed when there is low memory and the file is too big, so cloning the file before trimming it may not be the best option.
If you are primarily worried about "memory" on the storage device, see above.
If you are worried about RAM, then @RealSkeptic's observation is correct. You shouldn't need to hold the entire file in RAM at the same time. You can read, modify, and write it a buffer at a time.
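A small sketch of that buffer-at-a-time approach, writing everything except the unwanted byte range to a temporary file and then replacing the original (the paths and the from/count range are illustrative):
import java.io.*;
import java.nio.file.*;

long from = 1024, count = 512;                   // byte range to remove (illustrative)
Path src = Paths.get("data.bin");
Path tmp = Paths.get("data.bin.tmp");
try (InputStream in = new BufferedInputStream(Files.newInputStream(src));
     OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
    byte[] buf = new byte[8192];
    long pos = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
        for (int i = 0; i < n; i++, pos++) {
            if (pos < from || pos >= from + count) {
                out.write(buf[i]);               // keep only bytes outside the removed range
            }
        }
    }
}
Files.move(tmp, src, StandardCopyOption.REPLACE_EXISTING);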
You can't remove bytes from the middle of a file without rewriting everything that comes after them. But you can overwrite bytes in place if that helps.

How to evaluate size of file in Java before creating it?

In my Java program I need to create files and write into them something that I can get from an InputStream's read() method. How can I evaluate the size of the file before creating it?
Normally, you don't need to know how big the file will be, but if you really do:
The only way you could do that would be to fully read the content from the InputStream into memory first, and then see how much you have.
You have several options for how to read it all into memory, one of which might be to write it to a ByteArrayOutputStream. (And then, of course, write that out to the file when you're ready.)
But again, the great thing about streams is that you don't have to read things all into memory; if you can avoid needing to know the size in advance, that would be best.
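A small sketch of the ByteArrayOutputStream route described above (in is the InputStream you already have; the output path is illustrative):
import java.io.*;
import java.nio.file.*;

// Buffer the whole stream in memory, check its size, then write it out to the file.
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
while ((n = in.read(chunk)) != -1) {
    buffer.write(chunk, 0, n);
}
int sizeInBytes = buffer.size();                 // now you know how big the file will be
Files.write(Paths.get("out.dat"), buffer.toByteArray());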
Also note that the space the file will occupy on disk won't be exactly the same as the file size; most file systems work in chunks (4k, 8k, 16k, 32k) and so a file that's (say) 12k on a file system using 8k chunks will actually occupy 16k of space.
It depends on the encoding used, but you can write it to an in-memory stream (such as a ByteArrayOutputStream) and get the length.

Reading a file vs loading a file into main memory from disk for processing

How do I load a file into main memory?
I read the files using:
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
What is the advantage of loading the file directly into memory?
How do we do that in Java?
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory? Should I use them? Which of the two should I use ?
Thanks in advance!!!
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
Not exactly. It is reading the file in chunks whose size is the default buffer size (8k characters, I think).
The advantage is that you don't need a huge heap to read a huge file. This is a significant issue since the maximum heap size can only be specified at JVM startup (with Hotspot Java).
You also don't consume the system's physical / virtual memory resources to represent the huge heap.
What is the advantage of loading the file directly into memory?
It reduces the number of system calls, and may read the file faster. How much faster depends on a number of factors. And you have the problem of dealing with really large files.
How do we do that in Java?
Find out how large the file is.
Allocate a byte (or character) array big enough.
Use the relevant read(byte[], int, int) or read(char[], int, int) method to read the entire file.
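A minimal sketch of those three steps, assuming the file fits in a single array (i.e. it is under 2 GB) and fileName is the file in question:
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

File f = new File(fileName);
byte[] data = new byte[(int) f.length()];        // steps 1 and 2: size the array from the file length
try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
    in.readFully(data);                          // step 3: read the entire file into memory
}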
You can also use a memory-mapped file ... but that requires using the Buffer APIs which can be a bit tricky to use.
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory?
No, and no.
Should I use them? Which of the two should I use ?
Do they provide the functionality that you require? Do you need to read / parse text-based data? Do you need to do random access on binary data?
Under normal circumstances, you should choose your I/O APIs based primarily on the functionality that you require, and secondarily on performance considerations. Using a BufferedInputStream or BufferedReader is usually enough to get acceptable* performance if you intend to parse the data as you read it. (But if you actually need to hold the entire file in memory in its original form, then a BufferedXxx wrapper class actually makes reading a bit slower.)
* - Note that acceptable performance is not the same as optimal performance, but your client / project manager probably would not want you to waste time writing code that performs optimally ... if this is not a stated requirement.
If you're reading in the file and then parsing it, walking from beginning to end once to extract your data, then not referencing the file again, a buffered reader is about as "optimal" as you'll get. You can "tune" the performance somewhat by adjusting the buffer size: a larger buffer reads larger chunks from the file. (Make the buffer a power of 2, e.g. 262144.) Reading an entire large file (larger than, say, 1 MB) into memory will generally cost you performance in paging and heap management.
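For example, the buffer size can simply be passed to the constructor (262144 is the power-of-two figure suggested above):
BufferedReader reader = new BufferedReader(new FileReader(fileName), 262144);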

Fastest way of writing the first 10,000 lines of a data file to a new file

I want the first ten thousand lines of a hyuuge (.csv) file.
The naive way of
1) creating a reader & writer
2) reading the original file line by line
3) writing the first ten thousand lines to a new file
can't be the fastest, can it?
This will be a common operation in my app so I'm slightly concerned about speed, but also just curious.
Thanks.
There are a few ways of doing fast I/O in Java, but without benchmarking your particular case it's difficult to give a figure or firm advice. Here are a few approaches you can benchmark:
Buffered readers/writers, perhaps with varying buffer sizes
Reading the entire file into memory (if it fits), splitting it in memory, and writing it all out in a single go
Using the NIO file API for reading/writing files (look into Channels)
If you only want to read/write 10,000 lines or so:
it will probably take longer to start up a new JVM than to read / write the file,
the read / write time should be a fraction of a second ... doing it the naive way, and
the overall speed up from a copying algorithm is unlikely to be worthwhile.
Having said that, you can do better than reading a line at a time using BufferedReader.readLine() or whatever.
Depending on the character encoding of the file, you will get better performance by doing byte-wise I/O with a BufferedInputStream and BufferedOutputStream using large buffer sizes. Just write a loop to read a byte, conditionally update the line counter, and write the byte ... until you have copied the requisite number of lines. (This assumes that you can detect the CR and/or LF characters by examining the bytes, which is true for all character encodings I know about.)
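A rough sketch of that byte-wise copying loop (srcName, dstName and the line limit are illustrative):
import java.io.*;

// Copy bytes until 10,000 '\n' characters have been written, using large buffers.
int maxLines = 10000;
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(srcName), 262144);
     BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(dstName), 262144)) {
    int lines = 0, b;
    while (lines < maxLines && (b = in.read()) != -1) {
        out.write(b);
        if (b == '\n') lines++;                  // conditionally update the line counter
    }
}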
If you use NIO and ByteBuffers, you can further reduce the amount of in-memory copying, though the CR / LF counting logic will be more complicated.
But the first question you should ask is whether it is even worthwhile bothering to optimize this.
Are the lines the same length? If so, you can use RandomAccessFile to read x bytes and then write those bytes to a new file. It may be quite memory intensive, though. I suspect this would be quicker, but it's probably worth benchmarking. This solution only works for fixed-length lines.
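A rough sketch of that fixed-length idea (lineLength, srcName and dstName are illustrative; it assumes the file really has at least 10,000 lines of exactly lineLength bytes):
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

int lineLength = 80;                             // fixed length of every line, in bytes
byte[] block = new byte[lineLength * 10000];
try (RandomAccessFile raf = new RandomAccessFile(srcName, "r");
     FileOutputStream out = new FileOutputStream(dstName)) {
    raf.readFully(block);                        // read the first 10,000 lines in one go
    out.write(block);
}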

Shift the file while writing?

Is it possible to shift the contents of a file while writing to it using FileWriter?
I need to write data constants to the head of the file and if I do that it overwrites the file.
What technique should I use to do this, or should I make copies of the file (with the new data on top) on every file write?
If you want to overwrite certain bytes of the file and not others, you can use seek and write to do so. If you want to change the content of every byte in the file (by, for example, adding a single byte to the beginning of the file) then you need to write a new file and potentially rename it after you've done writing it.
Think of the answer to the question "what will be the contents of the byte at offset x after I'm done?". If, for a large percent of the possible values of x the answer is "not what it used to be" then you need a new file.
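A small sketch of the seek-and-write case, overwriting bytes at the head of the file without touching the rest (the file name and payload are illustrative):
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw")) {
    raf.seek(0);                                 // position at the head of the file
    raf.write("HEADER-CONSTANTS".getBytes(StandardCharsets.US_ASCII));  // overwrites exactly these bytes
}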
Rather than contenting ourselves with the question "what will be the contents of the byte at offset x after I'm done?", let's change the mindset and ask why the file system, or perhaps the hard disk firmware, can't: a) provide another mode of accessing the file (say, inline insertion), b) increase the length of the file by the number of bytes added at the front, in the middle, or even at the end, and c) move every byte from the insertion point onwards by newcontent.length positions.
It would be easier and faster to handle these operations at the disk firmware or file system level rather than leaving that job to the application developer. I hope file system writers or hard disk vendors will offer such a feature soon.
