I am trying to read a specific line from a text file, but I don't want to load the file into memory (it can get really big).
I have been looking, but every example I have found either reads every line (this would slow my code down, as there are over 100,000 lines) or loads the whole thing into an array and gets the correct element (the file will have a lot of lines to input).
An example of what I want to do:
String line = File.getLine(5);
"code is not actual code, it is made up to show the principle of what i want"
Is there a way to do this?
-----Edit-----
I have just realized this file will be written to in between reading lines (adding to the end of the file).
Is there a way to do this?
Not unless the lines are of a fixed number of bytes each, no.
You don't have to actually keep each line in memory - but you've got to read through the whole file to get to the line you want, as otherwise you won't know where to start reading.
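As a rough sketch of the fixed-width case (file name, record length, and helper names here are all made up for illustration): if every line is exactly the same number of bytes, the byte offset of line n is just n times the record length, so a single seek gets you there without reading anything before it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class FixedWidthSeek {
    // Each record is exactly RECORD_LEN bytes of data followed by a single '\n'.
    static final int RECORD_LEN = 8;

    // Jump straight to line `lineNo` (0-based) without reading anything before it.
    static String getLine(Path file, int lineNo) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek((long) lineNo * (RECORD_LEN + 1));
            byte[] buf = new byte[RECORD_LEN];
            raf.readFully(buf);
            return new String(buf, "US-ASCII");
        }
    }

    // Helper that writes ten fixed-width sample lines and returns the file's path.
    static Path writeSample() throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append(String.format("line%04d", i)).append('\n');
        }
        Path tmp = Files.createTempFile("fixed", ".txt");
        Files.write(tmp, sb.toString().getBytes("US-ASCII"));
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(getLine(writeSample(), 5)); // prints "line0005"
    }
}
```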
You have to read the file line by line. Otherwise how do you know when you have gotten to line 5 (as in your example)?
Edit:
You might also want to check out Random Access Files which could be helpful if you know how many bytes per line, as Jon Skeet has said.
The easiest way to do this would be to use a BufferedReader (http://docs.oracle.com/javase/1.5.0/docs/api/java/io/BufferedReader.html), because you can specify your buffer size. You could do something like:
BufferedReader in = new BufferedReader(new FileReader("foo.in"), 1024);
in.readLine();
in.readLine();
in.readLine();
in.readLine();
String line = in.readLine();
1) read a line which the user selects,
If you only need to read a user-selected line once or infrequently (or if the file is small enough), then you just have to read the file line-by-line from the start, until you get to the selected line.
If, on the other hand you need to read a user-selected line frequently, you should build an index of line numbers and offsets. So, for example, line 42 corresponds to an offset of 2347 bytes into the file. Ideally, then, you would only read the entire file once and store the index--for example, in a map, using the line numbers as keys and offsets as values.
2) read new lines added since the last read. I plan to read the file every 10 seconds. I have got the line count and can find out the new line numbers, but I need to read that line.
For the second point, you can simply save the current offset into the file instead of saving the current line number--but it certainly wouldn't hurt to continue building the index if it will continue to provide a significant performance benefit.
Use RandomAccessFile.seek(long offset) to set the file pointer to the most recently saved offset (confirm the file is longer than the most recently saved offset first--if not, nothing new has been appended).
Use RandomAccessFile.readLine() to read a line of the file
Call RandomAccessFile.getFilePointer() to get the current offset after reading the line and optionally put(currLineNo+1, offset) into the index.
Repeat steps 2-3 until reaching the end of the file.
But don't get too carried away with performance optimizations unless the performance is already a problem or is highly likely to be a problem.
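For what it's worth, steps 1-4 above might be sketched like this (all names are made up; this assumes the file is only ever appended to, per the question):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TailReader {
    // Resume reading a growing file from the last saved offset and collect any
    // newly appended lines; newOffset[0] receives the offset to save for next time.
    static List<String> readNewLines(Path file, long savedOffset, long[] newOffset)
            throws IOException {
        List<String> fresh = new ArrayList<>();
        newOffset[0] = savedOffset;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() <= savedOffset) {
                return fresh;                     // nothing appended since last read
            }
            raf.seek(savedOffset);                // step 1: jump to where we left off
            String line;
            while ((line = raf.readLine()) != null) {
                fresh.add(line);                  // steps 2-4: read lines to end of file
            }
            newOffset[0] = raf.getFilePointer();  // remember where to resume next time
        }
        return fresh;
    }

    // Helper that writes a small sample log and returns its path.
    static Path sample() throws IOException {
        Path p = Files.createTempFile("tail", ".log");
        Files.write(p, "a\nb\nc\n".getBytes("US-ASCII"));
        return p;
    }
}
```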
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    line = lines.skip(n).findFirst().get();
}
Java 7:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    for (int i = 0; i < n; i++) {
        br.readLine();
    }
    line = br.readLine();
}
Source: Reading nth line from file
The only way to do this is to build an index of where each line is (you only need to record the end of each line). Without a way to randomly access a line based on an index from the start, you have to read every byte before that line.
BTW: Reading 100,000 lines might take only one second on a fast machine.
If performance is a big concern here and you are frequently reading random lines from a static file then you can optimize this a bit by reading through the file and building an index (basically just a long[]) of the starting offset of each line of the file.
Once you have this you know exactly where to jump to in the file and then you can read up to the next newline character to retrieve the full line.
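As an illustration of that idea (a sketch only; names are made up, and note that RandomAccessFile.readLine decodes bytes as single-byte characters, so multi-byte encodings need more care):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // One sequential pass: record the byte offset at which each line starts.
    static long[] buildIndex(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        starts.add(0L);
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    starts.add(pos); // the next line starts right after this newline
                }
            }
        }
        long[] index = new long[starts.size()];
        for (int i = 0; i < index.length; i++) {
            index[i] = starts.get(i);
        }
        return index;
    }

    // With the index built, any line is one seek plus one short read away.
    static String readLine(Path file, long[] index, int lineNo) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(index[lineNo]);
            return raf.readLine();
        }
    }
}
```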
Here is a snippet of some code I had which reads a file and writes every 10th line, including the first line, to a new file (writer). You can always replace the try section with whatever you want to do. To change which lines are read, just change the "0" in the condition lc.endsWith("0") to whatever ending you want to match. But if the file is being written to as you read it, this code will only see the data that was in the file when you ran it.
// count the lines first
LineNumberReader lnr = new LineNumberReader(new FileReader(new File(file)));
lnr.skip(Long.MAX_VALUE);
int linecount = lnr.getLineNumber();
lnr.close();

BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
Writer writer = new FileWriter(outputFile); // the destination file mentioned above
for (int i = 0; i <= linecount; i++) {
    // read lines
    String line = bufferedReader.readLine();
    String lc = String.valueOf(i);
    if (lc.endsWith("0")) {
        try {
            writer.append(line + "\n");
            writer.flush();
        } catch (Exception ee) {
            // eat the exception for now
        }
    }
}
bufferedReader.close();
writer.close();
Related
I have a large sequence file with around 60 million entries (almost 4.5GB).
I want to split it. For example, I want to split it into three parts, each having 20 million entries. So far my code is like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable,VectorWritable> part=seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)!
Can anyone suggest a better/valid approach?
Perhaps not the exact answer you are looking for, but it might be worth trying the second method for sequenceFile reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the partitions.
Your code should then look like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
Another thing that may cause problems is that some SequenceFiles are not splittable.
Maybe I'm not understanding your question right, but why not just read your file line by line (= entry by entry?) and build your three files this way?
It would be something like this:
int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
for (String line : Files.readAllLines(Paths.get(fileName))) {
    files.get(i % 3).println(line);
    i++;
}
for (PrintWriter pw : files) {
    pw.close();
}
In this case, every third line goes into the first, the second, and the third file in turn.
Another solution would be to do a binary read, if the file is not a text file, using Files.readAllBytes(Paths.get(inputFileName)) and writing into your output files with Files.write(Paths.get(output1), byteToWrite).
However, I do not have an answer to why the output takes so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default, and your input file might be encoded in ASCII.
I'm writing a java program that does the following:
Reads in a line from a file
Does some action based on that line
Delete the line (or replace it with ""), and if step 2 is not successful, write it to a new file
Continue on to the next line for all lines in file (as opposed to removing an arbitrary line)
Currently I have:
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = br.readLine()) != null) {
        try {
            if (!do_stuff(line)) { // do_stuff returns bool based on success
                write_non_success(line);
            }
        } catch (Exception e) {
            e.printStackTrace(); // eat the exception for now, do something in the future
        }
    }
}
Obviously I can't use just a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter.
This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants, and if possible I'd like to avoid writing the temp file after every line, as my files are approximately 1 million lines.
If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:
If a line is processed successfully, it gets removed from the original file
If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.
It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.
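A minimal sketch of that crash-safe variant (names are illustrative): write the still-unprocessed lines to a temp file in the same directory, then move it over the original, so the original is never left half-written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class Checkpoint {
    // Persist the current state: whatever lines remain unprocessed replace the
    // original file, but only once the temp file is completely written.
    static void checkpoint(Path original, List<String> remaining) throws IOException {
        Path tmp = Files.createTempFile(original.getParent(), "work", ".tmp");
        Files.write(tmp, remaining, StandardCharsets.UTF_8);
        Files.move(tmp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

On filesystems that support it, adding StandardCopyOption.ATOMIC_MOVE to the move makes the replacement atomic as well.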
I want to read nth line from the end of the file. However my file size is very huge like 15MB, so I cannot go through each line to find out the last line. Is there an efficient way to get this nth line ?
I went through the RandomAccessFile API, but my line sizes are not constant, so I was not able to move the file pointer to that nth line location from the end. Can someone help me?
You basically have to read the file backwards. The simplest approach, without using "block" reads, is to get the length of the file, and then use RandomAccessFile to read bytes at (length--) until you have counted the required number of line feeds / carriage returns. You can then read the bytes forward for one line.
Something like this....
RandomAccessFile randomAccessFile = new RandomAccessFile("the.log", "r");
long offset = randomAccessFile.length() - 1;
int found = 0;
while (offset > 0 && found < 10) {
    randomAccessFile.seek(offset--);
    if (randomAccessFile.read() == 10) {
        found++;
    }
}
System.out.println(randomAccessFile.readLine());
Single-byte reads may not be super efficient. If performance becomes a problem, you take the same approach, but read larger blocks of the file (say 8K) at a time, rather than 1 byte at a time.
Have a look at this answer, which shows that you do need to read through the file (15MB is not big). As long as you are only storing the latest 9 rows, you will be able to fly through the file.
I want to know how to directly reach a particular line no of a text file in java.
one Method is this.
int line = 0;
BufferedReader read = new BufferedReader(new FileReader(Filename));
while (read.readLine() != null) {
    line++;
    if (line == LIMIT) break;
}
But this will create a lot of String objects which won't be freed unless the GC runs.
Please provide a solution that will be fast and doesn't consume a lot of memory.
PS:I am reading from a file that has millions of lines.
Lets assume that the text file has variable length lines, and that you haven't preprocessed it to create an index. (Otherwise, it should be possible to predetermine the position of the Nth line, and then "seek" to it.)
First observation is that (with the above assumptions), it is not possible to find the Nth line without examining every character before the start of the Nth line.
But you can still do this in a way that doesn't generate lots of garbage. Here's a simple version:
BufferedReader br = new BufferedReader(new FileReader(filename));
int ch;
for (int i = 1; i < LIMIT; i++) {
    while ((ch = br.read()) != '\n') {
        if (ch == -1) {
            // reached the end of file too soon ...
            throw new IOException("The file has < " + LIMIT + " lines");
        }
    }
}
String line = br.readLine();
The trick is to skip over the lines without forming them into String objects.
Now there is a small flaw in the above. It assumes that the lines of the text file are terminated by a newline character ('\n'), whereas readLine can cope with 3 kinds of line separator. But that could be addressed ... without generating extra garbage. I'll leave it as "an exercise for the reader", along with investigating tweaks like using read(char[]) instead of read().
You could probably get better performance if you opened the file using a FileInputStream, obtained the FileChannel, read the bytes into a ByteBuffer and then searched it for (byte) '\n'. But the code is significantly more complicated.
However, I'd like to reinforce a point made in the comments. You are probably wasting your time with this. The chances are that your original version runs fast enough for your purposes, despite generating lots of garbage. In reality, GC is fast when the ratio of garbage to non-garbage is high. And for a program that reads and discards lines, you are pretty much guaranteed that will be the case.
Rather than spending time figuring out how to make your program fast based on a false premise, you would be better off writing a simple version and measuring its performance on typical input files. Only optimize if the program is actually too slow.
Instead of reading strings, you can read data in blocks (maybe 1024-byte blocks) and search for line characters. To read a block of data, you can use a byte array, which will be reused, so there are no memory issues. You have to take care of:
Handling of both \r and \n characters
Encoding of the file (like Unicode or other)
Reading data in blocks instead of byte by byte will be more efficient.
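A sketch of that block-reading idea (buffer size and names are arbitrary; this assumes '\n' line endings and a single-byte-per-character encoding, per the two caveats above):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BlockScan {
    // Count '\n' bytes in reused 1024-byte blocks until the target line (0-based)
    // starts; no String objects are created while skipping.
    static long offsetOfLine(Path file, int target) throws IOException {
        byte[] buf = new byte[1024]; // reused for every block
        long pos = 0;
        int seen = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (seen == target) {
                        return pos + i; // first byte of the target line
                    }
                    if (buf[i] == '\n') {
                        seen++;
                    }
                }
                pos += n;
            }
        }
        throw new IOException("file has fewer than " + (target + 1) + " lines");
    }
}
```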
I think this should help (LineIterator and IOUtils are from Apache Commons IO):
FileReader fr = new FileReader("file1.txt");
BufferedReader br = new BufferedReader(fr);
LineIterator it = IOUtils.lineIterator(br);
for (int l = 0; it.hasNext(); l++) {
    String line = (String) it.next();
    if (l == LIMIT) {
        return line;
    }
}
I have a semicolon delimited input file where first column is a 3 char fixed width code, while the remaining columns are some string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to divide above file into number of files based on different values of first column.
For e.g. in above example, there are three different values in the first column, so I will divide the file into three files viz. 001.txt, 002.txt, 003.txt
The output file should contain item count as line one and data as remaining lines.
So there are 5 001 rows, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, 002 file will have first line as 4 and then 4 lines of data and 003 file will have first line as 5 and then five lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written below code to read lines from the file:
try {
    FileInputStream fstream = new FileInputStream(this.inputFilePath);
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
for each line
extract chunk name, e.g 001
look for file named "001-tmp.txt"
if one exists, read the first line - it will give you the number of lines; then increment the value and write it back into the same file, using seek with argument 0 and then writeUTF to overwrite the string. Perhaps some string length calculation has to be applied here; leave a placeholder of 10 spaces, for example.
if one does not exist, then create it and write 1 as the first line, padded with 10 spaces
append current line to the file
close current file
proceed with next line of source file
One of the solutions that comes to mind is to keep a Map and only open every file once. But you won't be able to do this, because you have around 100,000 rows, and no OS will allow you that many open file descriptors.
So one way is to open each file in append mode, keep writing to it, and close it. But because of the huge number of file open/close calls, the process may slow down. You can test it for yourself, though.
If the above does not provide satisfying results, you may try a mix of approaches 1 and 2, whereby you only keep 100 files open at any time, and only close a file if a new file that is not already open needs to be written to.
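The "at most 100 open files" mix might be sketched with an access-ordered LinkedHashMap acting as an LRU cache of writers (all names here are made up):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class WriterCache {
    static final int MAX_OPEN = 100;

    // Access-ordered map: the least-recently-used writer is evicted (and closed)
    // once more than MAX_OPEN files are open at the same time.
    private final LinkedHashMap<String, PrintWriter> cache =
            new LinkedHashMap<String, PrintWriter>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, PrintWriter> eldest) {
                    if (size() > MAX_OPEN) {
                        eldest.getValue().close(); // flush and close before eviction
                        return true;
                    }
                    return false;
                }
            };

    PrintWriter writerFor(String code, File dir) throws IOException {
        PrintWriter w = cache.get(code);
        if (w == null) {
            // append mode, so a previously evicted file keeps its earlier lines
            w = new PrintWriter(new FileWriter(new File(dir, code + ".txt"), true));
            cache.put(code, w);
        }
        return w;
    }

    void closeAll() {
        for (PrintWriter w : cache.values()) {
            w.close();
        }
        cache.clear();
    }
}
```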
First, create HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2, with the first element being the code and the second the data.
Then, add decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, scan the map.keySet() and for each key, create a file named as that key and write list's size and list's content to it.
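That final step might look something like this (a sketch; the class and method names are made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GroupWriter {
    // For each code, write the item count as line one, then the collected data lines.
    static void writeGroups(Map<String, List<String>> map, Path dir) throws IOException {
        for (Map.Entry<String, List<String>> e : map.entrySet()) {
            List<String> out = new ArrayList<>();
            out.add(String.valueOf(e.getValue().size()));
            out.addAll(e.getValue());
            Files.write(dir.resolve(e.getKey() + ".txt"), out, StandardCharsets.UTF_8);
        }
    }
}
```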
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files so I suggest you create a map of writers with your three characters code as key and the writer as value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line count values for all files.
Once you are done with reading the source file, you flush and close all writers and read the files one by one again. This time you just add the line count in front of the file. To my knowledge there is no other way but to rewrite the entire file, because it is not directly possible to add anything to the beginning of a file without buffering and rewriting the whole thing. I suggest you use a temporary file for this.
This answer applies only in case your file is too large to be stored fully in memory. If storing it is possible, there are faster solutions, like storing the contents of the file fully in StringBuffer objects before writing them to files.