Go to file line number in Java

I want to know how to jump directly to a particular line number of a text file in Java.
One method is this:
int line = 0;
BufferedReader read = new BufferedReader(new FileReader(filename));
while (read.readLine() != null) {
    line++;
    if (line == LIMIT) break;
}
But this will create a lot of String objects which won't be freed until the GC runs.
Please provide a solution that is fast and doesn't consume a lot of memory.
PS: I am reading from a file that has millions of lines.

Let's assume that the text file has variable-length lines, and that you haven't preprocessed it to create an index. (Otherwise it would be possible to predetermine the position of the Nth line and "seek" to it.)
First observation is that (with the above assumptions), it is not possible to find the Nth line without examining every character before the start of the Nth line.
But you can still do this in a way that doesn't generate lots of garbage. Here's a simple version:
BufferedReader br = new BufferedReader(new FileReader(filename));
int ch;
for (int i = 1; i < LIMIT; i++) {
    while ((ch = br.read()) != '\n') {
        if (ch == -1) {
            // reached the end of file too soon ...
            throw new IOException("The file has < " + LIMIT + " lines");
        }
    }
}
String line = br.readLine();
The trick is to skip over the lines without forming them into String objects.
Now there is a small flaw in the above: it assumes that the lines of the text file are terminated by a newline character ('\n'), whereas readLine() can cope with all three kinds of line separator. But that could be addressed ... without generating extra garbage. I'll leave it as "an exercise for the reader", along with investigating tweaks like using read(char[]) instead of read().
You could probably get better performance if you opened the file using a FileInputStream, obtained the FileChannel, read the bytes into a ByteBuffer and then searched it for (byte) '\n'. But the code is significantly more complicated.
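For the curious, a minimal sketch of that approach (my own illustration, not part of the original answer): it scans a reused direct ByteBuffer for '\n' to locate the start of the Nth line, assuming '\n' terminators and an ASCII-compatible encoding. RandomAccessFile is used here for the channel, but FileInputStream.getChannel() works the same way; the class and method names are made up.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class LineOffsetFinder {
    // Find the byte offset at which line `target` (1-based) starts.
    static long offsetOfLine(String filename, int target) throws IOException {
        if (target <= 1) return 0;
        try (RandomAccessFile raf = new RandomAccessFile(filename, "r");
             FileChannel ch = raf.getChannel()) {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // reused; no per-line garbage
            long pos = 0;
            int line = 1;
            while (ch.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    pos++;
                    if (b == '\n' && ++line == target) {
                        return pos; // first byte of the target line
                    }
                }
                buf.clear();
            }
            throw new IOException("The file has < " + target + " lines");
        }
    }
}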
However, I'd like to reinforce a point made in the comments. You are probably wasting your time with this. The chances are that your original version runs fast enough for your purposes, despite generating lots of garbage. In reality, GC is fast when the ratio of garbage to non-garbage is high, and for a program that reads and discards lines, you are pretty much guaranteed that will be the case.
Rather than spending time figuring out how to make your program fast based on a false premise, you would be better off writing a simple version and measuring its performance on typical input files. Only optimize if the program is actually too slow.

Instead of reading strings, you can read the data in blocks (say, 1024-byte blocks) and search for the line-terminator characters yourself. The block can be a byte array that you reuse for every read, so there are no memory issues. You have to take care of:
handling both \r and \n characters
the encoding of the file (Unicode or otherwise)
Reading data in blocks instead of byte by byte will also be more efficient; a sketch follows below.
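A minimal sketch of that idea, with the two caveats above deliberately ignored (it assumes '\n' line endings and an ASCII-compatible encoding); the method name is made up:
import java.io.FileInputStream;
import java.io.IOException;

// Count lines by scanning reused 1024-byte blocks for '\n'.
static int countLines(String filename) throws IOException {
    byte[] block = new byte[1024]; // reused for every read, so no per-line garbage
    int lines = 0;
    try (FileInputStream in = new FileInputStream(filename)) {
        int n;
        while ((n = in.read(block)) != -1) {
            for (int i = 0; i < n; i++) {
                if (block[i] == '\n') lines++;
            }
        }
    }
    return lines;
}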

I think this should help (it uses Apache Commons IO's LineIterator):
FileReader fr = new FileReader("file1.txt");
BufferedReader br = new BufferedReader(fr);
LineIterator it = IOUtils.lineIterator(br);
for (int l = 0; it.hasNext(); l++) {
    String line = it.nextLine();
    if (l == LIMIT) {
        return line;
    }
}

Related

Reading from a file, running out of memory

I was recently asked an interview question that had to do with reading from a CSV file and summing up entries in certain cells. When asked to optimize it, I couldn't answer how to deal with running out of memory if we were given a CSV of, say, 100 GB.
In Java, how exactly does reading from a file work? How do we know when something is too big? How do we deal with that? I was told that you could pass around the intermediate reader object instead of trying to read the entire thing.
The interviewer gave you a hint - BufferedReader. It is an efficient choice for reading a large file line by line.
Small example:
String line;
BufferedReader br = new BufferedReader(new FileReader("c:/test.txt"));
while ((line = br.readLine()) != null) {
    // do processing
}
br.close();
See the BufferedReader documentation for details.
There are several ways to read from a file in Java. Some of them involve keeping all of the file's lines (or data) in memory as you "read" the data delimited by something like a newline character (reading line by line, for example).
For large files you want to process smaller chunks at a time, using the Scanner class (or something like it that reads a specific number of bytes at a time).
Sample code:
FileInputStream inputStream = new FileInputStream(path);
Scanner sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
    String line = sc.nextLine();
    // process the line here
}
sc.close();
You can use RandomAccessFile to read the file. It may not be the best solution though.

Java: What's the most efficient way to read relatively large txt files and store its data?

I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
String seq = "";
String line;
try {
    FileReader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);
    while ((line = br.readLine()) != null) {
        seq += line;
    }
    br.close();
}
catch (FileNotFoundException e) { e.printStackTrace(); }
catch (IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, then you have to remember that Strings are immutable: what you are actually doing is creating a whole new String each time you append a line. Use a StringBuilder instead. Also, if possible, don't create the string first and process it afterwards; that way you have to traverse the data twice. Ideally you want to process as you read, but I don't know your situation.
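For example, here is the asker's loop reworked with a StringBuilder (a sketch reusing the file variable from the question):
StringBuilder seq = new StringBuilder();
String line;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    while ((line = br.readLine()) != null) {
        seq.append(line); // appends in place; no new String per iteration
    }
}
String sequence = seq.toString(); // one final copy at the end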
The main element slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes for concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small; however, once the Strings become very long, e.g. hundreds of thousands of characters, memory must be reallocated for the new String, which takes quite a bit of time. StringBuilder will allow you to do these concatenations in place, meaning you will not be creating a new Object every single time.
Your problem is not that the reading takes too much time; it's the concatenating that does. Just to verify this, I ran your code (it didn't finish), then simply commented out the seq += line statement, and it ran in under a second. You could try seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it still hadn't finished after 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice); the ArrayList version worked in about 2-3 seconds with the same input file (so the body of your while loop would be list.add(line);). If you really, really want to store your entire file in a single string, you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
This works in a matter of seconds as well. I should mention that "\\Z" is the end-of-file delimiter, which is why it reads the whole thing in one go.

Java reading nth line

I am trying to read a specific line from a text file, but I don't want to load the file into memory (it can get really big).
I have been looking, but every example I have found requires either reading every line (this would slow my code down, as there are over 100,000 lines) or loading the whole thing into an array and getting the correct element (the file will have a lot of lines to input).
An example of what I want to do:
String line = File.getLine(5);
"code is not actual code, it is made up to show the principle of what i want"
Is there a way to do this?
-----Edit-----
I have just realized this file will be written to in between reading lines (appending to the end of the file).
Is there a way to do this?
Not unless the lines are of a fixed number of bytes each, no.
You don't have to actually keep each line in memory - but you've got to read through the whole file to get to the line you want, as otherwise you won't know where to start reading.
You have to read the file line by line. Otherwise how do you know when you have gotten to line 5 (as in your example)?
Edit:
You might also want to check out Random Access Files which could be helpful if you know how many bytes per line, as Jon Skeet has said.
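For completeness, here is what the fixed-length-record case looks like (a sketch; the method name and the recordLength parameter, i.e. bytes per line including the terminator, are assumptions about your file format):
import java.io.IOException;
import java.io.RandomAccessFile;

// Valid only when every line occupies exactly `recordLength` bytes.
static String readFixedWidthLine(String filename, int lineNumber, int recordLength)
        throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(filename, "r")) {
        raf.seek((long) (lineNumber - 1) * recordLength); // jump straight to the line
        return raf.readLine();                            // reads up to the terminator
    }
}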
The easiest way to do this would be to use a BufferedReader (http://docs.oracle.com/javase/1.5.0/docs/api/java/io/BufferedReader.html), because you can specify your buffer size. You could do something like:
BufferedReader in = new BufferedReader(new FileReader("foo.in"), 1024);
// skip the first four lines ...
in.readLine();
in.readLine();
in.readLine();
in.readLine();
// ... and the fifth is the one we want
String line = in.readLine();
1) read a line which the user selects,
If you only need to read a user-selected line once or infrequently (or if the file is small enough), then you just have to read the file line-by-line from the start, until you get to the selected line.
If, on the other hand you need to read a user-selected line frequently, you should build an index of line numbers and offsets. So, for example, line 42 corresponds to an offset of 2347 bytes into the file. Ideally, then, you would only read the entire file once and store the index--for example, in a map, using the line numbers as keys and offsets as values.
2) read new lines added since the last read. I plan to read the file every 10 seconds. I have got the line count and can find out the new line numbers, but I need to read that line.
For the second point, you can simply save the current offset into the file instead of saving the current line number--but it certainly wouldn't hurt to continue building the index if it will continue to provide a significant performance benefit.
1. Use RandomAccessFile.seek(long offset) to set the file pointer to the most recently saved offset (first confirm that the file is longer than the most recently saved offset; if not, nothing new has been appended).
2. Use RandomAccessFile.readLine() to read a line of the file.
3. Call RandomAccessFile.getFilePointer() to get the current offset after reading the line, and optionally put(currLineNo+1, offset) into the index.
4. Repeat steps 2-3 until reaching the end of the file.
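Put together, one polling pass over those steps might look like this (a sketch; the method name and the exact index bookkeeping are mine, not from the answer):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// `lastOffset` and `lastLineNo` were saved by the previous pass;
// `index` maps line number -> starting offset.
static long readNewLines(RandomAccessFile raf, long lastOffset,
                         int lastLineNo, Map<Integer, Long> index) throws IOException {
    if (raf.length() <= lastOffset) {
        return lastOffset;                      // nothing new has been appended
    }
    raf.seek(lastOffset);                       // step 1
    String line;
    int lineNo = lastLineNo;
    while ((line = raf.readLine()) != null) {   // step 2
        lineNo++;                               // number of the line just read
        long offset = raf.getFilePointer();     // step 3: offset after this line ...
        index.put(lineNo + 1, offset);          // ... is where the next line starts
        // ... process `line` here ...
    }
    return raf.getFilePointer();                // step 4: save this for the next pass
}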
But don't get too carried away with performance optimizations unless the performance is already a problem or is highly likely to be a problem.
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    line = lines.skip(n).findFirst().get();
}
Java 7:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    for (int i = 0; i < n; i++)
        br.readLine();
    line = br.readLine();
}
Source: Reading nth line from file
The only way to do this is to build an index of where each line is (you only need to record the end of each line). Without a way to randomly access a line based on an index, you have to read every byte from the start up to that line.
BTW: Reading 100,000 lines might take only one second on a fast machine.
If performance is a big concern here and you are frequently reading random lines from a static file then you can optimize this a bit by reading through the file and building an index (basically just a long[]) of the starting offset of each line of the file.
Once you have this you know exactly where to jump to in the file and then you can read up to the next newline character to retrieve the full line.
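A sketch of both halves, using a List<Long> for brevity where a raw long[] would do (the method names are made up, and a buffered pass would be faster than the byte-at-a-time loop shown):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Build the index in one pass: offsets.get(n) is the starting offset of
// line n (0-based). Assumes '\n' ends each line ('\r\n' works too).
static List<Long> buildLineIndex(RandomAccessFile raf) throws IOException {
    List<Long> offsets = new ArrayList<>();
    offsets.add(0L);                           // line 0 starts at offset 0
    raf.seek(0);
    int b;
    while ((b = raf.read()) != -1) {
        if (b == '\n') {
            offsets.add(raf.getFilePointer()); // the next line starts here
        }
    }
    return offsets;
}

// Then any line is one seek away:
static String readLine(RandomAccessFile raf, List<Long> offsets, int n) throws IOException {
    raf.seek(offsets.get(n));
    return raf.readLine(); // reads up to the next line terminator
}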
Here is a snippet of some code I had which reads a file and writes every 10th line, including the first line, to a new file (writer). You can always replace the try block with whatever you want to do. To change which lines are read, change the "0" in the condition lc.endsWith("0") to whatever digit you want. Note that if the file is being written to while you read it, this code will only see the data that was in the file when you started.
// count the lines first ...
LineNumberReader lnr = new LineNumberReader(new FileReader(new File(file)));
lnr.skip(Long.MAX_VALUE);
int linecount = lnr.getLineNumber();
lnr.close();
// ... then re-read, writing every 10th line (writer is assumed to be an
// already-opened Writer on the output file)
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
for (int i = 0; i < linecount; i++) {
    String line = bufferedReader.readLine();
    String lc = String.valueOf(i);
    if (lc.endsWith("0")) {
        try {
            writer.append(line + "\n");
            writer.flush();
        } catch (Exception ee) {
        }
    }
}

Reading socket one byte a time, how to change this, optimisation

I need some help with optimisation. I am trying to improve this open-source game server made with Java. Each player has its own thread, and each thread goes something like this:
BufferedReader _in = new BufferedReader(new InputStreamReader(_socket.getInputStream()));
String packet = "";
char[] charCur = new char[1];
while (_in.read(charCur, 0, 1) != -1) {
    if (charCur[0] != '\u0000' && charCur[0] != '\n' && charCur[0] != '\r') {
        packet += charCur[0];
    } else if (!packet.isEmpty()) {
        parsePlayerPacket(packet);
        packet = "";
    }
}
I have been told many times that this code is stupid, and I agree: when profiling it I see that reading each character and appending it with packet += charCur[0] is just slow.
I want to improve this but I don't know how. I'm sure I can find something, but I'm afraid it will be even slower than this, because I have to split packets on '\u0000', '\n', or '\r' to parse them, and I know that splitting three times is very slow.
Can someone give me an idea, or a piece of code for this? It would make my day.
If you're going to explain, please use very simple words with code examples; I'm just a Java beginner. Thanks!
There is no significant performance issue with reading from a BufferedReader either in large chunks, or even one character at a time.
Unless your profiling has identified the BufferedReader.read() method as a specific hotspot in your code, the best thing you can do is make the code simple and readable, and not spend time optimizing it.
For your particular case:
yes that code is a bit lame, but
no it is unlikely to make a lot of difference from the performance perspective.
The real performance bottleneck is most likely the network itself. There are application level things that you can do to address this, but ultimately you can only send / receive data at a rate that the end-to-end network connection will support.
My profiling result says that it's coming from BufferedReader.read(). What does that really mean?
Are you sure that the time is really being spent in the Socket's read method? If it is, then the real issue is that your application threads are spending lots of time waiting for network packets to arrive. If that is the case, then the only thing you could do is to reduce the number of client and server side flushes so that the network doesn't have to deal with so many small packets. Depending on your application, that may be infeasible.
I'd write your code as follows:
BufferedReader _in = new BufferedReader(
        new InputStreamReader(_socket.getInputStream()));
StringBuilder packet = new StringBuilder();
int ch;
while ((ch = _in.read()) != -1) { // note: -1, not 1, signals end of stream
    if (ch != '\u0000' && ch != '\n' && ch != '\r') {
        packet.append((char) ch);
    } else if (packet.length() > 0) {
        parsePlayerPacket(packet.toString());
        packet = new StringBuilder();
    }
}
But I don't think it will make much difference to the performance ... unless the "packets" are typically hundreds of characters long. (The real point of my tweaks is to reduce the number of temporary strings that are created while reading a packet. I don't think that there's a simple way to make it spend less real time in the read calls.)
Perhaps you should look into the readLine() method of BufferedReader. Looks like you're reading Strings, calling BufferedReader.readLine() gives you the next line (sans the newline/linefeed).
Something like this:
String packet = _in.readLine();
while (packet != null) {
    parsePlayerPacket(packet);
    packet = _in.readLine();
}
Just like your implementation, readLine() will block until either the stream is closed or there's a newline/linefeed.
EDIT: yeah, this isn't going to split on '\u0000'. Your best bet is probably a PushbackReader; read in some buffer of chars (like David Oliván Ubieto suggests):
PushbackReader _in = new PushbackReader(
        new InputStreamReader(_socket.getInputStream()), 1024); // pushback capacity must cover what we unread
StringBuilder packet = new StringBuilder();
char[] buffer = new char[1024];
// read in as much as we can
int charsRead = _in.read(buffer);
while (charsRead > 0) {
    boolean process = false;
    int index = 0;
    // see if what we've read contains a delimiter
    for (index = 0; index < charsRead; index++) {
        if (buffer[index] == '\n' ||
            buffer[index] == '\r' ||
            buffer[index] == '\u0000') {
            process = true;
            break;
        }
    }
    if (process) {
        // got a delimiter: process the entire packet and push back the stuff we don't care about
        _in.unread(buffer, index + 1, charsRead - (index + 1)); // don't push back the delimiter itself
        packet.append(buffer, 0, index);
        parsePlayerPacket(packet.toString());
        packet = new StringBuilder();
    } else {
        // no delimiter: append to the current packet and read some more
        packet.append(buffer, 0, charsRead);
    }
    charsRead = _in.read(buffer);
}
I didn't debug that, but you get the idea.
Note that splitting on '\u0000' with String.split has the problem that a packet ending with '\u0000' won't get processed until a newline/linefeed is sent across the stream. Since you're writing some kind of game, I assume it's important to process an incoming packet as soon as you get it.
Read as many bytes as you can using a large buffer (1 KB). Check whether a "terminator" ('\u0000', '\n', '\r') is found. If not, copy to a temporary buffer (larger than the one used to read the socket), read again, and keep copying to the temporary buffer until a terminator is found. When you have all the necessary bytes, copy the temporary buffer to any "final" buffer and process it. The remaining bytes should be considered the "next" message and copied to the start of the temporary buffer.
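A sketch of that carry-over approach, assuming a Reader named in and the parsePlayerPacket method from the question, with a StringBuilder playing the role of the temporary buffer:
StringBuilder pending = new StringBuilder();
char[] buf = new char[1024];
int n;
while ((n = in.read(buf)) != -1) {
    pending.append(buf, 0, n);
    int start = 0;
    // scan for complete messages in what we have so far
    for (int i = 0; i < pending.length(); i++) {
        char c = pending.charAt(i);
        if (c == '\u0000' || c == '\n' || c == '\r') {
            if (i > start) {
                parsePlayerPacket(pending.substring(start, i)); // complete message
            }
            start = i + 1;
        }
    }
    pending.delete(0, start); // keep only the incomplete remainder at the front
}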

Maximum line length for BufferedReader.readLine() in Java?

I use BufferedReader's readLine() method to read lines of text from a socket.
There is no obvious way to limit the length of the line read.
I am worried that the source of the data can (maliciously or by mistake) write a lot of data without any line feed character, and this will cause BufferedReader to allocate an unbounded amount of memory.
Is there a way to avoid that? Or do I have to implement a bounded version of readLine() myself?
The simplest way to do this will be to implement your own bounded line reader.
Or even simpler, reuse the code from this BoundedBufferedReader class.
Actually, coding a readLine() that works the same as the standard method is not trivial. Dealing with the 3 kinds of line terminator CORRECTLY requires some pretty careful coding. It is interesting to compare the different approaches of the above link with the Sun version and Apache Harmony version of BufferedReader.
Note: I'm not entirely convinced that either the bounded version or the Apache version is 100% correct. The bounded version assumes that the underlying stream supports mark and reset, which is certainly not always true. The Apache version appears to read ahead one character if it sees a CR as the last character in the buffer; this would break on classic Mac OS (where a bare CR is the line terminator) when reading input typed by the user. The Sun version handles this by setting a flag so that a possible LF after the CR is skipped on the next read operation; i.e. no spurious read-ahead.
Another option is Apache Commons' BoundedInputStream:
InputStream bounded = new BoundedInputStream(is, MAX_BYTE_COUNT);
BufferedReader reader = new BufferedReader(new InputStreamReader(bounded));
String line = reader.readLine();
The limit for a String is about 2 billion chars (Integer.MAX_VALUE). If you want the limit to be smaller, you need to read the data yourself: read one char at a time from the buffered stream until the limit or a newline char is reached.
Perhaps the easiest solution is to take a slightly different approach. Instead of attempting to prevent a DoS by limiting one particular read, limit the entire amount of raw data read. In this way you don't need to worry about using special code for every single read and loop, so long as the memory allocated is proportionate to incoming data.
You can either meter the Reader, or probably more appropriately, the undecoded Stream or equivalent.
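A sketch of what metering the undecoded stream could look like: a hypothetical FilterInputStream that fails once a total byte budget is exceeded, no matter how the Reader above it buffers (the class name and error message are made up):
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class MeteredInputStream extends FilterInputStream {
    private long remaining;

    MeteredInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1 && --remaining < 0) throw new IOException("Byte limit exceeded");
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && (remaining -= n) < 0) throw new IOException("Byte limit exceeded");
        return n;
    }
}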
There are a few ways round this:
if the amount of data overall is very small, load data in from the socket into a buffer (byte array, bytebuffer, depending on what you prefer), then wrap the BufferedReader around the data in memory (via a ByteArrayInputStream etc);
just catch the OutOfMemoryError, if it occurs; catching this error is generally not reliable, but in the specific case of catching array allocation failures, it is basically safe (but does not solve the issue of any knock-on effect that one thread allocating large amounts from the heap could have on other threads running in your application, for example);
implement a wrapper InputStream that will only read so many bytes, then insert this between the socket and BufferedReader;
ditch BufferedReader and split your lines via the regular expressions framework (implement a CharSequence whose chars are pulled from the stream, and then define a regular expression that limits the length of lines); in principle, a CharSequence is supposed to be random access, but for a simple "line splitting" regex, in practice you will probably find that successive chars are always requested, so that you can "cheat" in your implementation.
In BufferedReader, instead of using String readLine(), use int read(char[] cbuf, int off, int len); you can then use boolean ready() to see if you got it all, and convert it into a String using the constructor String(char[] value, int offset, int count).
If you don't care about the whitespace and you just want a maximum number of characters per line, then the proposal Stephen suggested is really simple:
import java.io.BufferedReader;
import java.io.IOException;

public class BoundedReader extends BufferedReader {

    private final int bufferSize;
    private final char[] buffer;

    BoundedReader(final BufferedReader in, final int bufferSize) {
        super(in);
        this.bufferSize = bufferSize;
        this.buffer = new char[bufferSize];
    }

    @Override
    public String readLine() throws IOException {
        int no;
        /* read up to bufferSize chars */
        if ((no = this.read(buffer, 0, bufferSize)) == -1) return null;
        String input = new String(buffer, 0, no).trim();
        /* skip the rest of the line */
        while (no >= bufferSize && ready()) {
            if ((no = read(buffer, 0, bufferSize)) == -1) break;
        }
        return input;
    }
}
Edit: this is intended to read lines from a user terminal. It blocks until the next line, and returns a bufferSize-bounded String; any further input on the line is discarded.
