Reading N lines with optimization - Java

I have a large text file with N lines. I have to read these lines in i iterations, which means reading n = Math.floor(N/i) lines per iteration. In each iteration I have to fill a String array of length n. So the basic question is: how should I read n lines in the shortest time? The simplest way is to use a BufferedReader and read one line at a time with BufferedReader.readLine(), but that will significantly decrease performance if n is very large. Is there a way to read exactly n lines at a time?

To read n lines from a text file, from a system point of view there is no other way than reading as many characters as necessary until you have seen n end-of-line delimiters (unless the file has been preprocessed to index them, but I doubt that is allowed here).
As far as I know, no file I/O system supports a function to read "until the nth occurrence of some character", nor "the n following lines" (but I may well be wrong).
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
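For illustration only, here is a rough sketch of that block-reading approach, assuming lines are terminated by \n; the class name, file name and block size are invented for this example:

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class BlockLineReader {
    // Reads up to n lines by pulling large blocks from the Reader and splitting
    // on '\n' ourselves, instead of calling readLine() once per line.
    // Note: characters read past the nth newline are discarded here; a real
    // implementation would have to keep them for the next call.
    static List<String> readNLines(Reader in, int n, int blockSize) throws IOException {
        List<String> lines = new ArrayList<>(n);
        StringBuilder current = new StringBuilder();
        char[] block = new char[blockSize];
        int read;
        while (lines.size() < n && (read = in.read(block)) != -1) {
            for (int i = 0; i < read && lines.size() < n; i++) {
                char c = block[i];
                if (c == '\n') {
                    lines.add(current.toString());
                    current.setLength(0);
                } else if (c != '\r') {
                    current.append(c);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        try (Reader in = new FileReader("input.txt")) {
            System.out.println(readNLines(in, 1000, 64 * 1024).size());
        }
    }
}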

I agree with Yves Daoust's answer, except for the paragraph recommending
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
There's no need to "detect the end-of-lines yourself". Something like
new BufferedReader(new InputStreamReader(is, charset), 8192);
creates a reader with a buffer of 8192 chars. The question is how useful this is for reading the underlying data in blocks; for that a byte[] buffer is needed, and there is a sun.nio.cs.StreamDecoder in between, which I haven't looked into.
To be sure use
new BufferedReader(new InputStreamReader(new BufferedInputStream(is, 8192), charset));
so you get a byte[] buffer.
Note that 8192 is the default size for both BufferedReader and InputStreamReader, so leaving it out would change nothing in the examples above. Also note that using much larger buffers makes little sense and can even be detrimental to performance.
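Putting the wrappers above to work on the original question (reading n lines per iteration into a String array), a minimal sketch might look like this; the file name input.txt, n = 1000 and the process() helper are assumptions for illustration:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ChunkedLineReader {
    public static void main(String[] args) throws IOException {
        int n = 1000; // lines per iteration, assumed for illustration
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(new FileInputStream("input.txt"), 8192),
                        StandardCharsets.UTF_8))) {
            String line;
            String[] chunk = new String[n];
            int filled = 0;
            while ((line = reader.readLine()) != null) {
                chunk[filled++] = line;
                if (filled == n) {
                    process(chunk, filled); // one "iteration" worth of lines
                    filled = 0;
                }
            }
            if (filled > 0) {
                process(chunk, filled); // leftover lines from the last iteration
            }
        }
    }

    // Hypothetical processing step for each batch of lines.
    static void process(String[] lines, int count) {
        // ... work on lines[0..count-1]
    }
}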
Update
So far you get all the buffering needed and this should suffice. In case it doesn't, you can try:
to avoid the decoder overhead. When your lines are terminated by \n, you can look for (byte) \n in the file content without decoding it (unless you're using some exotic Charset).
to prefetch the data yourself. Normally the OS should take care of this, so that when your buffer becomes empty and Java calls into the OS, the data is already in memory.
to use a memory mapped file, so that no OS calls are needed for fetching more data (as all data "are" there when the mapping gets created).
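To illustrate the last two points together, a rough sketch that memory-maps the file and scans for (byte) '\n' without decoding; it assumes a single-byte newline encoding (ASCII, UTF-8, ISO-8859-1) and a file smaller than 2 GB:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLineCounter {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("input.txt"); // assumed file name
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Maps the whole file; files larger than 2 GB would have to be mapped in windows.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long lineCount = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == (byte) '\n') {
                    lineCount++;
                }
            }
            System.out.println("Lines: " + lineCount);
        }
    }
}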

Related

Is it possible to read a CSV file from the middle?

I can read the file from a start index to an end index:
Files.lines(Paths.get("file.csv")).skip(1000000).limit(1000).forEach(s-> {});
But this is not performant. Is it possible to read from the middle of a CSV file efficiently?
Would the Java RandomAccessFile class methods help? Something like seek, skipBytes, etc. You can find tutorials.
It depends on how predictable the file is, on what exactly you mean by "middle", and on what you consider to be "reading from the middle".
If the "middle" has to be an exact line, then you will have to read all the bytes before that middle, because otherwise you can miss on bytes that end lines (and the only way, with a CSV file, of knowing where line N is, is to have read exactly N-1 end-of-line characters until arriving at that position). Having to read all bytes up to a point is linear in time, and is certainly not as fast as actually jumping there in 1 go - but it may count as "reading from middle" for you.
If the file is highly predictable (all lines have approximately the same length), and you do not care much about getting exactly at the middle, then you can always take the length of the file, L, and jump to the last position before position L/2 which contains a newline character. The next position is, with high probability (since your file is predictable), the "middle line".
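A rough sketch of that second approach; for simplicity it scans forward from L/2 to the next line boundary rather than backward, and note that RandomAccessFile.readLine() converts bytes to chars byte-by-byte, so it is only suitable for single-byte encodings:

import java.io.IOException;
import java.io.RandomAccessFile;

public class MiddleJump {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("file.csv", "r")) {
            raf.seek(raf.length() / 2);          // jump near the middle in one go
            raf.readLine();                      // discard the (probably partial) line we landed in
            String middleLine = raf.readLine();  // first complete line after the middle
            System.out.println(middleLine);
        }
    }
}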

How does RandomAccessFile.seek() work?

As per the API, these are the facts:
The seek(long bytePosition) method, simply put, moves the pointer to the position specified by the bytePosition parameter.
When bytePosition is greater than the file length, the file length does not change unless a byte is written at the (new) end.
If data is present in the range skipped over, such data is left untouched.
However, the situation I'm curious about is: When there is a file with no data (0 bytes) and I execute the following code:
file.seek(100000-1);
file.write(0);
All 100,000 bytes are filled with 0 almost instantly. I can clock over 200 GB in, say, 10 ms.
But when I try to write 100,000 bytes using other methods, such as a BufferedOutputStream, the same process takes dramatically longer.
What is the reason for this difference in time? Is there a more efficient way to create a file of n bytes and fill it with 0s?
EDIT:
If the data is not actually written, how is the file filled with data?
Sample this code:
RandomAccessFile out=new RandomAccessFile("D:/out","rw");
out.seek(100000-1);
out.write(0);
out.close();
This is the output (screenshot not reproduced here).
Plus, if the file is huge enough I can no longer write to the disk due to lack of space.
When you write 100,000 bytes to a BufferedOutputStream, your program is explicitly accessing each byte of the file and writing a zero.
When you use RandomAccessFile.seek() on a local file, you are indirectly using the C library function fseek() (backed by the lseek() system call). How that gets handled depends on the operating system.
In most modern operating systems, sparse files are supported. This means that if you ask for an empty 100,000 byte file, 100,000 bytes of disk space are not actually used. When you write to byte 100,001, the OS still doesn't use 100,001 bytes of disk. It allocates a small amount of space for the block containing "real" data, and keeps track of the empty space separately.
When you read a sparse file, for example, by fseek()ing to byte 50,000, then reading, the OS can say "OK, I have not allocated disk space for byte 50,000 because I have noted that bytes 0 to 100,000 are empty. Therefore I can return 0 for this byte.". This is invisible to the caller.
This has the dual purpose of saving disk space, and improving speed. You have noticed the speed improvement.
More generally, fseek() goes directly to a position in a file, so it's O(1) rather than O(n). If you compare a file to an array, it's like doing x = arr[n] instead of for(i = 0; i<=n; i++) { x = arr[i]; }
This description, and the one on Wikipedia, is probably sufficient to understand why seeking to byte 100,000 and then writing is faster than writing 100,000 zeros. If you want more detail, you can read the Linux kernel source code to see how sparse files are implemented, and the RandomAccessFile source code in the JDK and the JRE's native code to see how they interact. However, that is probably more detail than you need.
Your operating system and filesystem support sparse files, and when that is the case, seek is implemented to make use of this feature.
This is not really related to Java; it's just a feature of the fseek and fwrite functions from the C library, which are most likely the backend behind the file implementation in the JRE you are using.
More info: https://en.wikipedia.org/wiki/Sparse_file
Is there a more efficient way to create a file of n bytes and fill it with 0s?
On operating systems that support it, you could set the file length to the desired size instead of issuing a write call. In Java, RandomAccessFile.setLength(long) comes closest, although whether the extended region is stored sparsely still depends on the OS and filesystem.
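For reference, the seek-and-write pattern from the question, packaged as a small sketch; sparse.bin is an invented file name, and whether the gap is actually stored sparsely depends on the OS and filesystem:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SparseFileDemo {
    public static void main(String[] args) throws IOException {
        long size = 100_000L;
        try (RandomAccessFile out = new RandomAccessFile("sparse.bin", "rw")) {
            out.seek(size - 1);
            out.write(0); // single write at the very end; the gap before it may be stored sparsely
        }
        // On a filesystem with sparse-file support the file reports length 100,000
        // but may occupy far less actual disk space.
    }
}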

Fastest way to read and process string from large file in java?

I have a large string in a file (it's encoded data, in my own custom encoding) and I want to read it and process it into my special format (decode it). I want to know the fastest way I can do this to get the final format. I thought of some ways but am not sure which would be best:
1) Read the entire string in one go and then process that string.
2) Read character by character from the file and process while I am reading.
Can anyone help?
Thanks
Chances are the process will be IO-bound, not CPU-bound, so it probably won't matter much; if it does, it will be because of the decode function, which isn't given in the question.
In theory you have two trade-off situations, which will determine whether (1) or (2) is faster.
The assumption is that the decode is fast and so your process will be IO bound.
If reading the whole file into memory at once means less context switching, then you will waste fewer CPU cycles on those context switches, so reading the whole file is faster.
If by reading the file char by char you don't prematurely yield your CPU time, then in theory you could use the CPU cycles spent waiting on IO to run the decode, so reading char by char will be faster.
Here are some timelines
read char by char good case
TIME -------------------------------------------->
IO: READ CHAR --> wait --> READ CHAR --> wait
DECODE: wait ------> DECODE --> wait ---> DECODE ...
read char by char bad case
TIME -------------------------------------------->
IO: READ CHAR --> YIELD --> READ CHAR --> wait
DECODE: wait ------> YIELD --> DECODE ---> wait DECODE ---> ...
read whole file
TIME -------------------------------------------->
IO: READ CHAR ..... READ CHAR --> FINISH
DECODE: -----------------------------> DECODE --->
If your decode were really slow, then a producer-consumer model would probably be faster. Your best bet is to use a BufferedReader, which will do as much IO as it can while wasting/yielding the fewest CPU cycles.
It's fine to use a BufferedReader or BufferedInputStream and then process character by character; the buffer will read in multiple characters at a time transparently. This should give good enough performance for typical requirements.
Reading the whole file into a string is called "slurping" and, given the memory overhead, is generally considered a last resort for file processing. If you are processing the in-memory string character by character anyway, it may not even have a detectable speed benefit, since all you are doing is maintaining your own (very large) buffer.
With a BufferedReader or BufferedInputStream you can adjust the buffer size so it can be large if really necessary.
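A minimal sketch of that buffered character-by-character approach; the file name, the 64 KB buffer size and the decode() stub are assumptions, and FileReader uses the platform default charset:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CharByCharDecoder {
    public static void main(String[] args) throws IOException {
        // 64 KB char buffer; adjust only if profiling shows the default is too small.
        try (BufferedReader reader = new BufferedReader(new FileReader("encoded.txt"), 64 * 1024)) {
            int c;
            while ((c = reader.read()) != -1) {
                decode((char) c); // hypothetical per-character decode step
            }
        }
    }

    static void decode(char c) {
        // ... custom decoding logic would go here
    }
}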
Given your file size (20-30 MB), note also that a Java char is 16-bit, so for an ASCII text file, or a UTF-8 file with few extended characters, you should allow for roughly double the file size in memory on typical JVM implementations.
It depends on the decode processing.
If you can parallelize it, you might consider a map/reduce approach. Break the file contents into separate map steps and combine them to get the final result in the reduce step.
Most machines have multiple cores. If there's no communication required between the parallel parts, you can cut the processing time to roughly 1/N with N cores. You'll really have something if you have GPUs you can leverage.
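If the custom encoding really can be decoded line by line (a big assumption), a map/reduce pass over the file might look like this sketch; decodeLine() and the file name are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelDecode {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("encoded.txt"))) {
            // Map: decode each line independently on the common fork/join pool.
            // Reduce: join the partial results back together in encounter order.
            String decoded = lines.parallel()
                    .map(ParallelDecode::decodeLine)
                    .collect(Collectors.joining("\n"));
            System.out.println(decoded.length());
        }
    }

    static String decodeLine(String line) {
        return line; // placeholder for the custom per-line decoding
    }
}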

How to find a substring/word if it lies at the boundary of bounded buffer

I am reading from an InputStream into a bounded buffer of 200 bytes and I want to find a substring in it. I used String.indexOf(substring).
But it does not return the right answer if the substring crosses the buffer boundary, e.g. starts at byte 199.
Any suggestions?
There are two approaches that I can think of:
Normalize the circular buffer (*) before executing indexOf(). By "normalize" I mean copy the bytes within the buffer so that the beginning of the buffer is at index 0, and therefore the contents of the buffer are not circular anymore. This will greatly improve the performance of searching through the buffer, but it will incur a performance penalty on the first search that follows a modification of the buffer, because at that moment you will have to first normalize. Since you are only dealing with a 200 byte buffer, the penalty will be negligible, and if you are planning to do multiple searches per buffer modification, the savings might end up being huge.
Write your own indexOf(MyCircularBuffer, String) method which searches inside your circular buffer for the first character of the string and, when found, compares the remainder of the string by generating indexes based on the same logic that your circular buffer uses for generating indexes (see the sketch after the footnote below).
* We are writing software for computers with finite memory, so every single buffer is by definition a bounded buffer, so the term "bounded buffer" does not convey any useful information either with respect to how you are supposed to use it, or with respect to how it is internally structured. What you are referring to as "Bounded Buffer" is in fact a "Circular Buffer". The term "circular" still gives no hint about its use, but at least it gives a hint about its internal structure.
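A sketch of the second approach, assuming the circular buffer is backed by a byte[] with a known logical start index and length; all names here are invented:

public class CircularBufferSearch {
    // Searches for pattern inside a circular byte buffer with the given logical
    // start index and length, using modular indexing so that matches wrapping
    // around the end of the backing array are still found.
    static int indexOf(byte[] buf, int start, int length, byte[] pattern) {
        outer:
        for (int i = 0; i <= length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (buf[(start + i + j) % buf.length] != pattern[j]) {
                    continue outer;
                }
            }
            return i; // offset from the logical start of the buffer
        }
        return -1;
    }

    public static void main(String[] args) {
        // Logical content "abc" stored wrapping around the end of the array.
        byte[] buf = {'b', 'c', 0, 0, 0, 0, 0, 'a'};
        System.out.println(indexOf(buf, 7, 3, "abc".getBytes())); // prints 0
    }
}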

changing the index positioning in InputStream

I have a binary file which contains keys, and after every key there is an image associated with it. I want to jump between different keys but could not find any method which changes the position in the input stream. I have seen the mark() method, but it does not let me jump to different places.
Does anybody have any idea how to do that?
There's a long skip(long n) method that you may be able to use:
Skips over and discards n bytes of data from this input stream. The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned. If n is negative, no bytes are skipped.
As documented, you're not guaranteed that n bytes will actually be skipped, so always check the returned value. Note that this does not allow you to "skip backward", but if markSupported() returns true, you can reset() first and then skip forward to an earlier position if you must.
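A common way to handle that is a small skip-until-done loop; here is a hedged sketch (the class and method names are invented):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class StreamSkipper {
    // Keeps calling skip() until exactly n bytes have been skipped, since a
    // single call is allowed to skip fewer bytes than requested.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() >= 0) {
                n--; // skip() made no progress; fall back to reading one byte
            } else {
                throw new EOFException(n + " bytes left to skip at end of stream");
            }
        }
    }
}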
Other options
You may also use java.io.RandomAccessFile, which as the name implies, permits random access with its seek(long pos) method.
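For instance, a hedged sketch of using seek() to jump straight to a key; the file name, the offset and the 16-byte key size are invented for illustration:

import java.io.IOException;
import java.io.RandomAccessFile;

public class KeyJump {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("keys.bin", "r")) {
            long keyOffset = 4096;      // assumed offset of the key you want
            raf.seek(keyOffset);        // move the file pointer straight there
            byte[] key = new byte[16];  // assumed fixed key size
            raf.readFully(key);
            // ... the image bytes associated with this key follow here
        }
    }
}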
You mentioned images, so if you are using Java Advanced Imaging, another possible option is com.sun.media.jai.codec.FileSeekableStream, which is a SeekableStream that takes its input from a File or RandomAccessFile. Note that this class is not a committed part of the JAI API. It may be removed or changed in future releases of JAI.
