I can read a file from a start index to an end index:
Files.lines(Paths.get("file.csv")).skip(1000000).limit(1000).forEach(s -> {});
But this isn't performant. Is it possible to read a CSV file efficiently starting from the middle of the file?
Would Java's RandomAccessFile class help? It has methods like seek, skipBytes, etc. You can find tutorials for it.
It depends on how predictable the file is, on what exactly you mean by "middle", and on what you consider to be "reading from the middle".
If the "middle" has to be an exact line, then you will have to read all the bytes before that middle, because otherwise you can miss on bytes that end lines (and the only way, with a CSV file, of knowing where line N is, is to have read exactly N-1 end-of-line characters until arriving at that position). Having to read all bytes up to a point is linear in time, and is certainly not as fast as actually jumping there in 1 go - but it may count as "reading from middle" for you.
If the file is highly predictable (all lines have approximately the same length), and you do not care much about getting exactly at the middle, then you can always take the length of the file, L, and jump to the last position before position L/2 which contains a newline character. The next position is, with high probability (since your file is predictable), the "middle line".
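For illustration, a minimal RandomAccessFile sketch of this "jump near the middle" idea (the file name and 1000-line limit come from the question above; it assumes '\n' line endings and an ASCII-compatible encoding):

import java.io.IOException;
import java.io.RandomAccessFile;

public class MiddleReader {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("file.csv", "r")) {
            long middle = raf.length() / 2;
            raf.seek(middle);
            raf.readLine();                 // discard the (probably partial) line we landed in
            String line;
            int count = 0;
            while (count < 1000 && (line = raf.readLine()) != null) {
                // process the line; readLine() decodes bytes as Latin-1, fine for ASCII CSV
                count++;
            }
        }
    }
}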
Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using methods in Random Access File, i.e. we can read the byte at that place. However, since we can be in the middle of the word, we cannot know where this words starts or finishes.
Solution
Here're two possible solutions we came up with, however, we're trying to decide which one will be more space efficient/faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using e.g. writeChars or writeUTF), because in that case we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the characters we store (letters and digits) all have ASCII values above zero.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information? Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).
I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!
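A rough sketch of reading one record with this layout, assuming DataOutput-style big-endian integers, that the value is stored before the offset inside each 8-byte record, and that the offsets are byte offsets into the characters file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class TwoFileReader {
    // Reads record number recordNum: its integer value and its word.
    static void readRecord(RandomAccessFile index, RandomAccessFile chars, int recordNum)
            throws IOException {
        index.seek(recordNum * 8L);
        int value = index.readInt();              // assumed order: value first, then offset
        int start = index.readInt();              // byte offset of the word in the characters file
        int end;
        if ((recordNum + 1) * 8L < index.length()) {
            index.seek((recordNum + 1) * 8L + 4); // the next record's offset, i.e. recordNum * 8 + 12
            end = index.readInt();
        } else {
            end = (int) chars.length();           // last record runs to the end of the characters file
        }
        byte[] buf = new byte[end - start];
        chars.seek(start);
        chars.readFully(buf);
        System.out.println(new String(buf, StandardCharsets.UTF_8) + " -> " + value);
    }
}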
it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to, since you said that your words contain no digits. So you can always parse things like 0124some456word789 unambiguously.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Inserting four zero bytes between records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As the disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add a small header allowing quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between, e.g., the 8th and the 16th piece should be scanned sequentially, as that's both simpler and faster than binary search.
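For illustration, a small sketch of the re-alignment step that Method 3 enables during binary search (it assumes each record ends with a single zero byte and that neither the word bytes nor the encoded integer ever contain a zero byte):

import java.io.IOException;
import java.io.RandomAccessFile;

public class RecordAligner {
    // After a binary-search probe lands at an arbitrary byte position, scan forward to the
    // next 0x00 separator byte; the byte after it is the start of the next complete record.
    static long alignToNextRecord(RandomAccessFile raf, long probe) throws IOException {
        raf.seek(probe);
        int b;
        while ((b = raf.read()) != -1) {
            if (b == 0) {
                return raf.getFilePointer();   // first byte of the following record
            }
        }
        return -1;                             // the probe landed inside the last record
    }
}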
Alright, so we need to store a list of words and their respective positions in a much bigger text. We've been asked whether it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best, since the text "1024" takes up 4*8 = 32 bits, while only 11 bits are needed if it's represented as a number.
The follow up question is should the index be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise-representation in one file?" and that's the reason you'd need two files?
So the question, first and foremost, is: can I store text information (the word) combined with bitwise information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general data compression available: just wrapping your Input/OutputStreams with a Deflater or GZIP stream (already built into the JRE) will get you reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression, there is XZ for Java (implements LZMA compression), open source.
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.
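As an illustration of the stream-wrapping idea, using the GZIP classes built into the JRE (the file name and record layout here are made up):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedIndex {
    // Writes word/position pairs through a GZIP stream.
    static void write(File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream(f)))) {
            out.writeUTF("example");   // the word
            out.writeInt(1024);        // its position
        }
    }

    static void read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new FileInputStream(f)))) {
            String word = in.readUTF();
            int pos = in.readInt();
            System.out.println(word + " @ " + pos);
        }
    }
}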
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would still be better than 9 bytes as a string representation).
Using binary files you'll need a way to know where the string starts and ends, too. You can do this by storing a byte before it with its length, and reading the string like this:
byte[] arr = new byte[in.readByte()]; // in.readByte()*2 if the string is encoded in 16 bits
in.readFully(arr); // in is a RandomAccessFile / DataInputStream; readFully guarantees the array is filled, plain read() may return fewer bytes
String yourString = new String(arr, "US-ASCII");
The other possibility would be terminating your string with a null character (00), but you would need to create your own implementation for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are, because the strings take the same number of bytes either way; in the text version they would simply be separated from their position with a space.
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take about the same space (though splitting it into two separate files will always take a bit more space, and it might be messier to edit).
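For completeness, a small sketch of the writing side of the length-prefixed layout described above (the file name, the US-ASCII encoding and the single length byte are assumptions; one length byte limits words to 127 bytes, which is plenty here):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RecordWriter {
    // Writes one (word, position) record: 1 length byte, the word's bytes, then a 4-byte int.
    static void writeRecord(DataOutputStream out, String word, int position) throws IOException {
        byte[] bytes = word.getBytes(StandardCharsets.US_ASCII);
        out.writeByte(bytes.length);
        out.write(bytes);
        out.writeInt(position);
    }

    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("index.bin"))) {
            writeRecord(out, "example", 124058936);
        }
    }
}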
I have a large text file with N lines. I have to read these lines in i iterations, which means that I have to read n = Math.floor(N/i) lines in a single iteration. In each iteration I have to fill a string array of length n. So the basic question is: how should I read n lines in optimal time? The simplest way to do this is to use a BufferedReader and read one line at a time with BufferedReader.readLine(), but it will significantly decrease performance if n is too large. Is there a way to read exactly n lines at a time?
To read n lines from a text file, from a system point of view there is no other way than reading as many characters as necessary until you have seen n end-of-line delimiters (unless the file has been preprocessed to detect these, but I doubt this is allowed here).
As far as I know, no file I/O system in the world supports a function to read "until the nth occurrence of some character", nor "the n following lines" (but I may be wrong).
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
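A rough sketch of that block-reading idea (the 64 KiB block size is arbitrary, and it assumes single-byte '\n' terminators; a real implementation would also have to remember where inside the last block it stopped):

import java.io.IOException;
import java.io.InputStream;

public class LineBlockReader {
    // Reads raw blocks and counts '\n' bytes until n lines have been seen.
    static long skipLines(InputStream in, long n) throws IOException {
        byte[] block = new byte[64 * 1024];
        long seen = 0;
        int len;
        while (seen < n && (len = in.read(block)) != -1) {
            for (int i = 0; i < len && seen < n; i++) {
                if (block[i] == '\n') {
                    seen++;
                }
            }
        }
        return seen;   // number of end-of-line delimiters actually found
    }
}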
I agree with Yves Daoust's answer, except for the paragraph recommending
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
There's no need to "detect the end-of-lines yourself". Something like
new BufferedReader(new InputStreamReader(is, charset), 8192);
creates a reader with a buffer of 8192 chars. The question is how useful this is for reading data in blocks. For this a byte[] is needed and there's a sun.nio.cs.StreamDecoder in between which I haven't looked into.
To be sure use
new BufferedReader(new InputStreamReader(new BufferedInputStream(is, 8192), charset));
so you get a byte[] buffer.
Note that 8192 is the default size for both BufferedReader and InputStreamReader, so leaving it out would change nothing in my above examples. Note that using much larger buffers makes no sense and can even be detrimental for performance.
Update
So far you get all the buffering needed and this should suffice. In case it doesn't, you can try:
to avoid the decoder overhead. When your lines are terminated by \n, you can look for (byte) \n in the file content without decoding it (unless you're using some exotic Charset).
to prefetch the data yourself. Normally, the OS should take care of it, so when your buffer becomes empty, Java calls into the OS, and it has the data already in memory.
to use a memory mapped file, so that no OS calls are needed for fetching more data (as all data "are" there when the mapping gets created).
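As a sketch of the memory-mapped option, assuming the file fits in a single mapping (under 2 GB), '\n' line terminators, and a placeholder file name:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLineCounter {
    // Maps the file once and scans for '\n' bytes without any further read() calls.
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("big.txt");   // placeholder file name
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long lines = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    lines++;
                }
            }
            System.out.println(lines + " lines");
        }
    }
}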
This was an interview question, and it concerns efficiency. Suppose there is a very large file (gigabytes), something like a log file. How can we find the 10th occurrence of a word like 'error' or 'java' from the end of the file? I can only think of scanning through the entire file and then finding the occurrences in reverse order, but I don't think that is the right way to do it! (Coding preferably in C or Java.)
I would also like to know another thing: when an interviewer specifically mentions that it's a very large file, what are the factors that should be considered when we start writing the code (apart from keeping in mind that scanning is a really costly affair)?
To search a word in a large text, the Boyer Moore algorithm is extensively used.
Principle (see the link for a live example): when starting the comparison at some place (index) in the file, if the first letter of the text being compared does not appear at all in the word being searched, there is no need to compare the other [wordLength - 1] characters against the text, and the index can move forward by the whole word length. If the letter does appear in the word, just not at this exact position but shifted by a few characters, the comparison can also be shifted by a few characters, and so on.
Depending on the word length and its similarity with the text, the search may be accelerated a lot (up to naiveSearchTime / wordLength).
Edit: since you search from the end of the file, the first letter of the word (not the last) is compared first. E.g. searching for "space" in "2001 a space odyssey", the 's' of "space" is first compared with the first 'y' of "odyssey"; since 'y' does not occur in "space", the window shifts left by the whole word length, and the next comparison is the same 's' against the 'c' of the text's "space".
And finally, to find the nth occurrence, a simple counter (initialized to n) is decremented each time the word is found, when it reaches 0, that's it.
The algorithm is easy to understand and to implement. Ideal for interviews.
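To illustrate just the counter part (not Boyer-Moore itself), here is a small sketch that uses String.lastIndexOf as a stand-in for the matcher and assumes the text, or a chunk of it, is already in memory:

public class NthFromEnd {
    // Finds the n-th occurrence of word, counting from the end of text; -1 if there are fewer.
    static int nthOccurrenceFromEnd(String text, String word, int n) {
        int pos = text.length();
        while (n > 0) {
            pos = text.lastIndexOf(word, pos - 1);
            if (pos < 0) {
                return -1;   // fewer than n occurrences
            }
            n--;             // one more occurrence found, keep searching towards the start
        }
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(nthOccurrenceFromEnd("error a error b error", "error", 2)); // prints 8
    }
}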
You may also ask whether the file is to be searched only once or several times. If it is intended to be searched multiple times, you can suggest indexing the words of the file, i.e. creating in memory a structure that allows finding quickly whether a word is in it, where, how many times, etc. I like the trie structure: also easy to understand, and very fast (though it can be pretty memory-greedy, depending on the text). Lookup complexity is O(wordLength).
--
When the interviewer mentions "very large file" there are many factors to be considered, like
search algorithm as above
can the text fit in memory? (for instance when processing all of it) Do I have to implement a file-seek algorithm (i.e. use only part of the file in memory at a time)
where is the file? Memory (fast), hard-disk (slower but at least local), remote (usually slower, connection issues, accesses to remote, firewalls, network speed etc..)
is the file compressed? (will take even more space once uncompressed)
is the file made of one file or several chunks?
Does it contain text or binary? If text, its language gives an indication on the probability of a letter appearance (eg in English the Y appears much more frequently than in French).
Offer to index the file words if relevant
Offer to create a simpler file from big-one (like removing repeated words etc...) in order to have something smaller that can be processed more easily
...
There are two parts of the answer to this question. One is the algorithm used which can be any good string search algorithm (Boyer Moore / KMP / Trie etc). The other part is IO.
To speed things up, since you can't really read a file backwards, a good approach is:
1. Allocate a chunk of memory, say 10 MB.
2. For i = 1, 2, ... while (filesize - 10MB * i) >= 0:
3. Seek to (filesize - 10MB * i) and read 10 MB into memory.
4. Search for the string backwards in the current chunk, incrementing a counter for each match.
5. Stop when the counter reaches 10.
This is a heavily IO oriented question and you can improve this approach using multithreaded systems or multiple machines wherein you can do the search and read from file to memory (i.e., steps 3 and 4) in parallel.
This is only pseudocode, but it maps directly to Java.
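For concreteness, a rough Java sketch of this chunked backward scan (file name and search word are placeholders; matches straddling a chunk boundary are not handled here, which a real implementation would fix by overlapping chunks by word.length() - 1 bytes):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class BackwardChunkSearch {
    public static void main(String[] args) throws IOException {
        final String word = "error";           // placeholder search word
        final int CHUNK = 10 * 1024 * 1024;    // 10 MB chunks
        int found = 0;
        try (RandomAccessFile raf = new RandomAccessFile("big.log", "r")) {
            long pos = raf.length();
            byte[] buf = new byte[CHUNK];
            while (pos > 0 && found < 10) {
                int len = (int) Math.min(CHUNK, pos);   // last chunk near the file start may be shorter
                pos -= len;
                raf.seek(pos);
                raf.readFully(buf, 0, len);
                String chunk = new String(buf, 0, len, StandardCharsets.US_ASCII);
                int i = chunk.length();
                while (found < 10 && (i = chunk.lastIndexOf(word, i - 1)) >= 0) {
                    found++;                             // counted backwards within the chunk
                }
            }
        }
        System.out.println(found >= 10 ? "found the 10th occurrence" : "fewer than 10 occurrences");
    }
}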
Adding to the comment by @AlexeyFrunze, see the related post here: read file backwards (last line first). Perhaps, however, the interviewer was interested in a solution in which you read in the normal forward direction, to see how you would address the problem of limited memory.
Awesome post by @ring0, so I will only mention something about the problem of finding specifically the k-th word from the end, where k is small like 10, in a really large file, which suggests that you should not load the entire file into memory and then search backwards.
You can maintain a first-in-first-out buffer, a.k.a. a queue, of size k in which you store the positions of matches as you encounter them. As you find more and more matches, you forget about the earlier ones. Given const int k = 10; and long match_pos[k]; initialized to zero, plus a match counter count, you can enqueue each match with match_pos[count++ % k] = pos. Once you arrive at the end of the file,
if (count >= k)
{
    long kth_match_pos = match_pos[count % k];
    // ...
}
checks the oldest entry in your buffer, so you can jump back n bytes, where n is pos - kth_match_pos. If the relevant context is stored in the queue as well, no seek will be necessary.
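A minimal Java sketch of the same ring-buffer idea; the array of positions stands in for the matches produced by the forward scan, and the names are illustrative:

public class LastKMatches {
    // Keeps the positions of the k most recent matches in a ring buffer while scanning forward;
    // after the scan, the oldest surviving entry is the k-th match from the end.
    static long kthMatchFromEnd(long[] matchPositions, int k) {
        long[] ring = new long[k];
        int count = 0;
        for (long pos : matchPositions) {
            ring[count % k] = pos;   // overwrite the oldest entry
            count++;
        }
        if (count < k) {
            return -1;               // fewer than k matches in total
        }
        return ring[count % k];      // oldest entry still in the buffer
    }

    public static void main(String[] args) {
        long[] positions = {5, 17, 42, 90, 120};
        System.out.println(kthMatchFromEnd(positions, 3));   // prints 42
    }
}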
I have a binary file which contains keys, and after every key there is an image associated with it. I want to jump between different keys but could not find any method which changes the position in an input stream. I have seen the mark() method, but it does not jump to different places.
Does anybody have any idea how to do that?
There's a long skip(long n) method that you may be able to use:
Skips over and discards n bytes of data from this input stream. The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned. If n is negative, no bytes are skipped.
As documented, you're not guaranteed that n bytes will be skipped, so always double-check the returned value. Note that this does not allow you to "skip backward", but if markSupported() is true, you can reset() first and then skip forward to an earlier position if you must.
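For illustration, a small helper that loops until the requested number of bytes has actually been skipped (names are made up; since Java 12, InputStream.skipNBytes does something similar):

import java.io.IOException;
import java.io.InputStream;

public class SkipFully {
    // Repeatedly calls skip() until n bytes have been skipped or EOF is reached,
    // since a single skip() call may skip fewer bytes than requested.
    static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() made no progress; check for EOF with a single-byte read
                if (in.read() == -1) {
                    break;
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        return n - remaining;   // bytes actually skipped
    }
}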
Other options
You may also use java.io.RandomAccessFile, which as the name implies, permits random access with its seek(long pos) method.
You mentioned images, so if you are using Java Advanced Imaging, another possible option is com.sun.media.jai.codec.FileSeekableStream, which is a SeekableStream that takes its input from a File or RandomAccessFile. Note that this class is not a committed part of the JAI API. It may be removed or changed in future releases of JAI.