Using RandomAccessFile along with BufferedReader to speed up file read - java

I have to:
Read a large text file line by line.
Note down the file pointer position after every line read.
Stop the file read if the running time is greater than 30 seconds.
Resume from the last noted file pointer in a new process.
What I am doing:
Using RandomAccessFile.getFilePointer() to note the file pointer.
Wrapping the RandomAccessFile in a BufferedReader to speed up the file read, as per this answer.
When the time is greater than 30 seconds, I stop reading the file. I then restart the process with a new RandomAccessFile and use RandomAccessFile.seek to move the file pointer to where I left off.
Problem:
As I am reading through a BufferedReader wrapped around the RandomAccessFile, the file pointer seems to jump far ahead in a single call to BufferedReader.readLine(). However, if I use RandomAccessFile.readLine() directly, the file pointer moves forward properly, step by step.
Using BufferedReader as a wrapper:
RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
BufferedReader brRafReader = new BufferedReader(new FileReader(randomAccessFile.getFD()));
while ((line = brRafReader.readLine()) != null) {
    System.out.println(line + ", Position : " + randomAccessFile.getFilePointer());
}
Output:
Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040
Using RandomAccessFile.readLine directly:
RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
while ((line = randomAccessFile.readLine()) != null) {
    System.out.println(line + ", Position : " + randomAccessFile.getFilePointer());
}
Output: (This is as expected. The file pointer moves properly with each call to readLine.)
Line goes here, Position : 11011
Line goes here, Position : 11089
Line goes here, Position : 12090
Line goes here, Position : 13040
Could anyone tell what I am doing wrong here? Is there any way I can speed up the reading process while using RandomAccessFile?

The reason for the observed behavior is that, as the name suggests, the BufferedReader is buffered. It reads a larger chunk of data at once (into a buffer), and returns only the relevant parts of the buffer contents - namely, the part up to the next \n line separator.
I think there are, broadly speaking, two possible approaches:
You could implement your own buffering logic.
You could use some ugly reflection hack to obtain the required buffer offset.
For 1., you would no longer use RandomAccessFile#readLine. Instead, you'd do your own buffering via
byte[] buffer = new byte[8192];
...
// In a loop:
int read = randomAccessFile.read(buffer);
// Figure out where a line break `\n` appears in the buffer,
// return the resulting lines, and take the position of the `\n`
// into account when storing the "file pointer"
As the vague comment indicates: This may be cumbersome and fiddly. You'd basically re-implement what the readLine method does in the BufferedReader class. And at this point, I don't even want to mention the headaches that different line separators or character sets could cause.
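To make option 1 more concrete, here is a hedged sketch (my own illustration, not code from the question): it assumes \n line separators and a single-byte charset, and prints each line together with the byte offset at which it starts, which is the value you would persist instead of getFilePointer():
import java.io.IOException;
import java.io.RandomAccessFile;

public class OffsetTrackingReader {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("mybigfile.txt", "r")) {
            byte[] buffer = new byte[8192];
            StringBuilder current = new StringBuilder();
            long lineStart = 0; // offset at which the current line begins
            long consumed = 0;  // bytes consumed from the file so far
            int read;
            while ((read = raf.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    consumed++;
                    if (buffer[i] == '\n') {
                        System.out.println(current + ", Position : " + lineStart);
                        current.setLength(0);
                        lineStart = consumed; // the next line starts right after the '\n'
                    } else {
                        current.append((char) buffer[i]);
                    }
                }
            }
            if (current.length() > 0) { // last line without a trailing '\n'
                System.out.println(current + ", Position : " + lineStart);
            }
        }
    }
}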
For 2., you could simply access the field of the BufferedReader that stores the buffer offset. This is implemented in the example below. Of course, this is a somewhat crude solution, but mentioned and shown here as a simple alternative, depending on how "sustainable" the solution should be and how much effort you are willing to invest.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class LargeFileRead {

    public static void main(String[] args) throws Exception {
        String fileName = "myBigFile.txt";
        long before = System.nanoTime();
        List<String> result = readBuffered(fileName);
        //List<String> result = readDefault(fileName);
        long after = System.nanoTime();
        double ms = (after - before) / 1e6;
        System.out.println("Reading took " + ms + "ms "
            + "for " + result.size() + " lines");
    }

    private static List<String> readBuffered(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        BufferedReader brRafReader = new BufferedReader(
            new FileReader(randomAccessFile.getFD()));
        String line = null;
        long currentOffset = 0;
        long previousOffset = -1;
        while ((line = brRafReader.readLine()) != null) {
            long fileOffset = randomAccessFile.getFilePointer();
            if (fileOffset != previousOffset) {
                if (previousOffset != -1) {
                    currentOffset = previousOffset;
                }
                previousOffset = fileOffset;
            }
            int bufferOffset = getOffset(brRafReader);
            long realPosition = currentOffset + bufferOffset;
            System.out.println("Position : " + realPosition
                + " with FP " + randomAccessFile.getFilePointer()
                + " and offset " + bufferOffset);
            lines.add(line);
        }
        return lines;
    }

    private static int getOffset(BufferedReader bufferedReader) throws Exception {
        Field field = BufferedReader.class.getDeclaredField("nextChar");
        int result = 0;
        try {
            field.setAccessible(true);
            result = (Integer) field.get(bufferedReader);
        } finally {
            field.setAccessible(false);
        }
        return result;
    }

    private static List<String> readDefault(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        String line = null;
        while ((line = randomAccessFile.readLine()) != null) {
            System.out.println("Position : " + randomAccessFile.getFilePointer());
            lines.add(line);
        }
        return lines;
    }
}
(Note: The offsets may still appear to be off by one, but this is because the line separator is not taken into account in the position. This could be adjusted if necessary.)
NOTE: This is only a sketch. The RandomAccessFile objects should be closed properly when reading is finished, but how depends on how the reading is supposed to be interrupted when the time limit is exceeded, as described in the question.

BufferedReader reads a block of data from the file, 8 KB by default. Finding line breaks in order to return the next line is done in the buffer.
I guess this is why you see a huge increment in the physical file position.
RandomAccessFile will not use a buffer when reading the next line. It reads byte after byte, and that's really slow.
How is performance when you just use a BufferedReader and remember the line you need to continue from?
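A rough sketch of that idea (my illustration, not the asker's code; the file name, the saved line count, and the 30-second budget are assumptions taken from the question):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ResumeByLineNumber {
    public static void main(String[] args) throws IOException {
        long linesAlreadyRead = 1000; // hypothetical value saved by the previous run
        try (BufferedReader br = new BufferedReader(new FileReader("mybigfile.txt"))) {
            for (long i = 0; i < linesAlreadyRead; i++) {
                if (br.readLine() == null) return; // file shorter than expected
            }
            long deadline = System.currentTimeMillis() + 30_000; // the 30-second budget
            String line;
            long count = linesAlreadyRead;
            while ((line = br.readLine()) != null) {
                // process(line) would go here
                count++;
                if (System.currentTimeMillis() > deadline) {
                    System.out.println("Stopping; resume from line " + count);
                    break;
                }
            }
        }
    }
}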

Related

How to sort N files

Following this answer:
How do I sort very large files
I need only the merge step, on N already sorted files on disk.
I want to merge them into one big file. My limitation is memory: not more than K lines in memory (K < N), so I cannot fetch all of them and then sort. Java is preferred.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (not more than K lines in memory) and store the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2);
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        //Merge 2 files based on which string is greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2);
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
regards,
You can use something like this
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String, BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());
    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get(file), StandardOpenOption.DELETE_ON_CLOSE)));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String, BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t;
    }
}
Note that this code, unlike the code in your question, handles the header line. Like the original code, it will delete the input files. If that's not intended, you can remove the DELETE_ON_CLOSE option and simplify the entire reader construction to
BufferedReader br = new BufferedReader(new FileReader(file));
It holds exactly as many lines in memory as you have files.
While in principle it is possible to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for a questionably small saving. E.g., you already have N strings in memory when calling this method, due to the fact that you have N file names.
However, when you want to reduce the number of lines held at the same time, at all costs, you can simply use the method shown in your question. Merge the first two files into a temporary file, merge that temporary file with the third to another temporary file, and so on, until merging the temporary file with the last input file to the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering, trying to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge the temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files into the final result, to get a memory consumption scaling with K < N. This approach allows you to tune K to a reasonable ratio to N, trading memory for speed. I think in most practical cases, K == N will work just fine.
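As a hedged sketch of that round-based scheme (my illustration; mergeFiles is the method shown above, the temporary file names are made up, and at least two input files are assumed):
static void mergeInRounds(String target, List<String> files, int k) throws IOException {
    List<String> current = new ArrayList<>(files);
    int tmp = 0;
    while (current.size() > 1) {
        List<String> next = new ArrayList<>();
        for (int i = 0; i < current.size(); i += k) {
            List<String> batch = current.subList(i, Math.min(i + k, current.size()));
            if (batch.size() == 1) { // lone leftover, carry it into the next round
                next.add(batch.get(0));
                continue;
            }
            // the final round (everything fits into one batch) writes to the real target
            String out = current.size() <= k ? target : "merge-tmp-" + (tmp++) + ".txt";
            mergeFiles(out, batch.toArray(new String[0]));
            next.add(out);
        }
        current = next;
    }
}
Each mergeFiles call holds at most K lines (one per open file), so the whole run never exceeds K lines in memory.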
@Holger gave a nice answer assuming that K >= N.
You can extend it to the K < N case by using the mark(int) and reset() methods of BufferedInputStream.
The parameter of mark is how many bytes a single line can have.
The idea is as follows:
Instead of putting all N lines in the set, you can have only K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the stream from which it came, so when you read it again the same data can pop up.
You have to keep track of the maximum line not kept in the set; let's call it the lower bound. Once there are no elements in the set greater than the maintained lower bound, you scan all the files once again and repopulate the set.
I'm not sure if this approach is optimal, but it should be OK.
Moreover, you have to be aware that BufferedInputStream has an internal buffer at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain buffering on your own.
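As a tiny demo of the mark/reset mechanism (shown here on a BufferedReader, which offers the same pair of methods; the file name and the 64 KiB read-ahead limit are assumptions):
BufferedReader br = new BufferedReader(new FileReader("sorted1.txt"));
br.mark(1 << 16);            // remember this position; valid while we read less than 64 KiB ahead
String line = br.readLine(); // consume a line...
br.reset();                  // ...and rewind, so the same line can be read again later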

Is it possible to put a string array into buffered Reader?

I'm working on a project at the moment which requires me to set up a distributed network simulator. I had it working by taking output from a file and parsing through each line with a BufferedReader, as you can see below, but I now want to use a predefined array and make my BufferedReader take input from that instead. I've looked up a few solutions online to help me put this array into the BufferedReader, but none seem to have worked.
I'm getting no errors when running and terminating the code, but it just seems to be stuck in an endless loop at some point, and I presume it's the new BufferedReader segment using the array. The idea behind this was to make the process simpler than rewriting many segments to fit around the array parsing, and instead find a simpler way by having the array in the BufferedReader; but as this is proving difficult I may have to resort to changing it. I have tested that the array is being initialised correctly, so that's not the problem and it's one less thing to take into consideration.
Previous code:
private void parseFile(String fileName) throws IOException {
    System.out.println("Parsing Array");
    Path path = Paths.get(fileName);
    try (BufferedReader br = Files.newBufferedReader(path)) {
        String line = null;
        line = br.readLine(); // Skip first line
        while ((line = br.readLine()) != null) {
            parseLine(line);
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }
}
The recommendation online was to use an input stream with the BufferedReader, but that didn't seem to work at all, as it overwrote the array. Any recommendations on what I can use for the BufferedReader segment would be grand.
The array method below is just a void method creating the array, which is called before the parse method, so the array should be initialised, I presume. If anyone can look over this and let me know where I'm going wrong, that would be amazing; if not, I appreciate your time to read this anyway. Thanks for your time.
New Code Attempt:
//Creating array to parse.
private void createArray() {
    myStringArray[0] = "Node_id Neighbours";
    myStringArray[1] = "1 2 10";
    myStringArray[2] = "2 3 1";
    myStringArray[3] = "3 4 2";
    myStringArray[4] = "4 5 3";
    myStringArray[5] = "5 6 4";
    myStringArray[6] = "6 7 5";
    myStringArray[7] = "7 8 6";
    myStringArray[8] = "8 9 7";
    myStringArray[9] = "9 10 8";
    myStringArray[10] = "10 1 9";
    myStringArray[11] = "ELECT 1 2 3 4 5 6 7 8 9";
}
private void parseArray() throws IOException {
    //InputStreamReader isr = new InputStreamReader(System.in);
    System.out.println("Parsing Array");
    // try (BufferedReader br = Files.newBufferedReader(path))
    try (BufferedReader br = new BufferedReader(isr)) {
        for (int i = 0; i < 12; i++) {
            String line = null;
            line = br.readLine(); // Skip first line
            while ((myStringArray[i] = br.readLine()) != null) {
                parseLine(line);
            }
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }
}
Answer: You cannot do this with BufferedReader. I fixed it like this, if it is of any use to anyone. Thanks a lot to @L.Spillner for the explanation and answer.
code fix:
private void parseArray() throws IOException {
    System.out.println("Parsing Array");
    for (int i = 1; i < 12; i++) { //First row ignored.
        String line = null;
        line = myStringArray[i];
        //Begin parsing process of each entity.
        parseLine(line);
    }
}
Let's kick it off with a precise answer to the question.
You cannot put anything into a BufferedReader directly, especially not a data structure like an array.
The BufferedReader's purpose is to handle I/O operations, input operations to be more precise. According to the javadoc, BufferedReader takes a Reader as an argument. Reader is an abstract class which contains 'tools' to handle character input streams.
The way BufferedReader's readLine() method works is: any character arriving on the input stream gets stored in a buffer until a \n (newline/line feed) or \r (carriage return) arrives. When one of these two special characters shows up, the buffer gets interpreted as a String and is returned to the caller.
Answer is you can't. Thanks for the feedback though, guys; I got it working by just looping through the array and assigning each item to line.

Reading a specific set of lines in a file [duplicate]

In Java, is there any method to read a particular line from a file? For example, read line 32 or any other line number.
For small files:
String line32 = Files.readAllLines(Paths.get("file.txt")).get(31); // list indices are 0-based
For large files:
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    line32 = lines.skip(31).findFirst().get();
}
Unless you have previous knowledge about the lines in the file, there's no way to directly access the 32nd line without reading the 31 previous lines.
That's true for all languages and all modern file systems.
So effectively you'll simply read lines until you've found the 32nd one.
Not that I know of, but what you could do is loop through the first 31 lines doing nothing, using the readLine() method of BufferedReader:
FileInputStream fs = new FileInputStream("someFile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fs));
for (int i = 0; i < 31; ++i)
    br.readLine();
String lineIWant = br.readLine();
Joachim is right on, of course, and an alternate implementation to Chris' (for small files only, because it loads the entire file) might be to use commons-io from Apache. Arguably you might not want to introduce a new dependency just for this; but if you find it useful for other stuff too, it could make sense.
For example:
String line32 = (String) FileUtils.readLines(file).get(31);
http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html#readLines(java.io.File, java.lang.String)
You may try indexed-file-reader (Apache License 2.0). The class IndexedFileReader has a method called readLines(int from, int to) which returns a SortedMap whose key is the line number and the value is the line that was read.
Example:
File file = new File("src/test/resources/file.txt");
reader = new IndexedFileReader(file);
lines = reader.readLines(6, 10);
assertNotNull("Null result.", lines);
assertEquals("Incorrect length.", 5, lines.size());
assertTrue("Incorrect value.", lines.get(6).startsWith("[6]"));
assertTrue("Incorrect value.", lines.get(7).startsWith("[7]"));
assertTrue("Incorrect value.", lines.get(8).startsWith("[8]"));
assertTrue("Incorrect value.", lines.get(9).startsWith("[9]"));
assertTrue("Incorrect value.", lines.get(10).startsWith("[10]"));
The above example reads a text file composed of 50 lines in the following format:
[1] The quick brown fox jumped over the lazy dog ODD
[2] The quick brown fox jumped over the lazy dog EVEN
Disclaimer: I wrote this library.
Although, as said in other answers, it is not possible to get to the exact line without knowing the offset (pointer) beforehand, I've achieved this by creating a temporary index file which stores the offset values of every line. If the file is small enough, you could just store the indexes (offsets) in memory, without needing a separate file for them.
The offsets can be calculated by using the RandomAccessFile
RandomAccessFile raf = new RandomAccessFile("myFile.txt", "r");
//above, 'r' means open in read-only mode
ArrayList<Long> arrayList = new ArrayList<Long>();
arrayList.add(raf.getFilePointer()); // offset of the first line
String cur_line = "";
while ((cur_line = raf.readLine()) != null) {
    arrayList.add(raf.getFilePointer()); // offset of the line that follows
}
//Print the 32nd line:
//seek the file to the particular location where our 32nd line starts
raf.seek(arrayList.get(31));
System.out.println(raf.readLine());
raf.close();
Also visit the Java docs on RandomAccessFile for more information.
Complexity: This is O(n), as it reads the entire file once. Please be aware of the memory requirements. If the file is too big for the offsets to fit in memory, make a temporary file that stores the offsets instead of the ArrayList shown above.
Note: If all you want is the 32nd line, you just have to call readLine(), also available through other classes, 32 times. The above approach is useful if you want to get a specific line (based on line number, of course) multiple times.
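A hedged sketch of that temporary-index variant (the index file name is made up; each offset is written as an 8-byte long, so the offset of line n sits at byte (n - 1) * 8 of the index):
RandomAccessFile data = new RandomAccessFile("myFile.txt", "r");
RandomAccessFile index = new RandomAccessFile("myFile.idx", "rw");
long offset = data.getFilePointer(); // 0, the offset of line 1
while (data.readLine() != null) {
    index.writeLong(offset); // offset at which the line just read began
    offset = data.getFilePointer();
}
// Later: look up line 32 without keeping any offsets in memory
index.seek((32 - 1) * 8L);
data.seek(index.readLong());
System.out.println(data.readLine());
index.close();
data.close();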
Another way.
try (BufferedReader reader = Files.newBufferedReader(
        Paths.get("file.txt"), StandardCharsets.UTF_8)) {
    List<String> line = reader.lines()
        .skip(31)
        .limit(1)
        .collect(Collectors.toList());
    line.stream().forEach(System.out::println);
}
No, unless in that file format the line lengths are pre-determined (e.g. all lines with a fixed length), you'll have to iterate line by line to count them.
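For completeness, a minimal sketch of the fixed-length case just mentioned, where direct access does work (RECORD_LEN and the file name are assumptions):
import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedLengthLineRead {
    static final int RECORD_LEN = 80; // hypothetical fixed line length, separator included

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("fixed.txt", "r")) {
            raf.seek((32 - 1) * (long) RECORD_LEN); // jump straight to where line 32 starts
            System.out.println(raf.readLine());
        }
    }
}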
In Java 8,
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    line = lines.skip(n).findFirst().get();
}
In Java 7
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    for (int i = 0; i < n; i++)
        br.readLine();
    line = br.readLine();
}
Source: Reading nth line from file
If you are talking about a text file, then there is really no way to do this without reading all the lines that precede it - After all, lines are determined by the presence of a newline, so it has to be read.
Use a stream that supports readline, and just read the first X-1 lines and dump the results, then process the next one.
It works for me:
I have combined the answer of
Reading a simple text file
but instead of returning a String I am returning a LinkedList of Strings. Then I can select the line that I want.
public static LinkedList<String> readFromAssets(Context context, String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(context.getAssets().open(filename)));
    LinkedList<String> linkedList = new LinkedList<>();
    // do reading, usually loop until end of file reading
    String mLine = reader.readLine();
    while (mLine != null) {
        linkedList.add(mLine); // process line
        mLine = reader.readLine();
    }
    reader.close();
    return linkedList;
}
Use this code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileWork {
    public static void main(String[] args) throws IOException {
        String line = Files.readAllLines(Paths.get("D:/abc.txt")).get(1);
        System.out.println(line);
    }
}
You can use LineNumberReader instead of BufferedReader. Go through the API; you can find the setLineNumber and getLineNumber methods.
You can also take a look at LineNumberReader, a subclass of BufferedReader. Along with the readLine method, it also has setter/getter methods to access the line number. Very useful to keep track of the number of lines read while reading data from a file.
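A small sketch of that suggestion (file name and target line are assumptions):
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

public class ReadNthLine {
    public static void main(String[] args) throws IOException {
        int wanted = 32;
        try (LineNumberReader lnr = new LineNumberReader(new FileReader("file.txt"))) {
            String line;
            while ((line = lnr.readLine()) != null) {
                if (lnr.getLineNumber() == wanted) { // counter is 1 after the first line is read
                    System.out.println(line);
                    break;
                }
            }
        }
    }
}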
public String readLine(int line) {
    try (FileReader tempFileReader = new FileReader(textFile);
         BufferedReader tempBufferedReader = new BufferedReader(tempFileReader)) {
        for (int i = 0; i < line - 1; i++) {
            tempBufferedReader.readLine(); // skip the lines before the requested one
        }
        String returnStr = tempBufferedReader.readLine();
        return returnStr != null ? returnStr : "ERROR";
    } catch (Exception e) {
        return "ERROR";
    }
}
You can use the skip() function to skip lines from the beginning.
public static void readFile(String filePath, long lineNum) {
    List<String> list = new ArrayList<>();
    try (Stream<String> lines = Files.lines(Paths.get(filePath));
         Stream<String> counter = Files.lines(Paths.get(filePath))) {
        long totalLines = counter.count();
        long startLine = totalLines - lineNum;
        list = lines.skip(startLine).collect(Collectors.toList());
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    list.forEach(System.out::println);
}
EASY WAY - Reading a line using its line number.
Let's say line numbers start from 1 and run until the end of the file.
public class TextFileAssignmentOct {

    private void readData(int rowNum, BufferedReader br) throws IOException {
        int n = 1; // Line number starts from 1
        String row;
        while ((row = br.readLine()) != null) { // Reads every line
            if (n == rowNum) { // When the line number matches the one you want to read
                System.out.println(row);
            }
            n++; // This increments the line number
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File("../JavaPractice/FileRead.txt");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        TextFileAssignmentOct txf = new TextFileAssignmentOct();
        txf.readData(4, br); // Read a specific line using its line number, passing the buffered reader
    }
}
For a text file, you can use an integer counter with a loop to help you get the number of the line. Don't forget to import the classes we are using in this example.
File myObj = new File("C:\\Users\\LENOVO\\Desktop\\test.txt"); //path of the file
FileReader fr = new FileReader(myObj);
BufferedReader bf = new BufferedReader(fr); //BufferedReader over the FileReader fr
String line = bf.readLine();
int lineNumber = 0;
while (line != null) {
    lineNumber = lineNumber + 1;
    if (lineNumber == 7) {
        //show line
        System.out.println("line: " + lineNumber + " has: " + line);
        break;
    }
    //read the next line
    line = bf.readLine();
}
They are all wrong; I just wrote this in about 10 seconds.
With this I managed to just call object.getQuestion("linenumber") in the main method to return whatever line I want.
public class Questions {
    File file = new File("Question2Files/triviagame1.txt");

    public Questions() {
    }

    public String getQuestion(int numLine) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line = "";
        for (int i = 0; i < numLine; i++) {
            line = br.readLine();
        }
        return line;
    }
}

What is the efficient way to process large text files?

I have two files:
1- with 1400000 lines or records --- 14 MB
2- with 16000000 lines or records -- 170 MB
I want to find out if each record or line in file 1 is also in file 2 or not.
I developed a Java app that does the following: it reads file 1 line by line and passes each line to a method that loops over file 2.
Here is my code:
public boolean hasIDin(String bioid) throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
    long bid = Long.parseLong(bioid);
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        if (Long.parseLong(thisLine) == bid)
            return true;
    }
    return false;
}

public void getMBD() throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
    OutputStream os = new FileOutputStream("C://MBD.txt");
    PrintWriter pr = new PrintWriter(os);
    String thisLine;
    int count = 1;
    while ((thisLine = br.readLine()) != null) {
        String bioid = thisLine;
        System.out.println(count);
        if (!hasIDin(bioid))
            pr.println(bioid);
        count++;
    }
    pr.close();
}
When I run it, it seems it will take more than 1944 hours to complete, as each line takes about 5 seconds to process. That is about three months!
Are there any ideas to get it done in much less time?
Thanks in advance.
Why don't you:
read all the lines in file 1 into a set (a Set is fine, but TLongHashSet would be more efficient);
then, for each line in the second file, see if it is in the set?
Here is a tuned implementation which prints the following and uses < 64 MB.
Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids
Code
// Note: TLongHashSet comes from the GNU Trove library; it stores longs without boxing.
public static void main(String... args) throws IOException {
    generateFile("/tmp/DID.txt", 1400000);
    generateFile("/tmp/AllIDs.txt", 16000000);

    long start = System.currentTimeMillis();
    TLongHashSet did = readLongs("/tmp/DID.txt");
    TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt", did);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took " + time + " ms to find " + validIDS.size() + " valid ids");
}

private static TLongHashSet readLongs(String filename) throws IOException {
    System.out.println("Reading ids in " + filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128 * 1024);
    TLongHashSet ids = new TLongHashSet();
    for (String line; (line = br.readLine()) != null; )
        ids.add(Long.parseLong(line));
    br.close();
    return ids;
}

private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
    System.out.println("Reading ids in " + filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128 * 1024);
    TLongHashSet ids = new TLongHashSet();
    for (String line; (line = br.readLine()) != null; ) {
        long val = Long.parseLong(line);
        if (validSet.contains(val))
            ids.add(val);
    }
    br.close();
    return ids;
}

private static void generateFile(String filename, int number) throws IOException {
    System.out.println("Generating " + number + " ids to " + filename);
    PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128 * 1024));
    Random rand = new Random();
    for (int i = 0; i < number; i++)
        pw.println(rand.nextInt(1 << 26));
    pw.close();
}
170 MB + 14 MB are not such huge files.
My suggestion is to load the smaller file into a java.util.Map, parse the bigger file line by line (record by record), and check if the current line is present in this Map.
P.S. The question looks like something trivial in terms of an RDBMS - maybe it's worth using one?
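For illustration, a plain-JDK sketch of this suggestion (paths taken from the question, the rest is my assumption; it also presumes one clean number per line): load the small file into a set, stream the big file once, and whatever is left in the set never appeared in file 2.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class FindMissingIds {
    public static void main(String[] args) throws IOException {
        Set<Long> missing = new HashSet<>();
        // file 1 (small): every id is a candidate for "missing from file 2"
        try (Stream<String> lines = Files.lines(Paths.get("C:/DIDs.txt"))) {
            lines.forEach(l -> missing.add(Long.parseLong(l.trim())));
        }
        // file 2 (big): every id found there is crossed off the candidate set
        try (Stream<String> lines = Files.lines(Paths.get("C:/AllIDs.txt"))) {
            lines.forEach(l -> missing.remove(Long.parseLong(l.trim())));
        }
        missing.forEach(System.out::println); // ids from file 1 that are not in file 2
    }
}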
You can't do an O(N^2) algorithm when each iteration is so long; that's completely unacceptable.
If you have enough RAM, you simply parse file 1, create a map of all numbers, then parse file 2 and check your map.
If you don't have enough RAM, parse file 1, create a map and store it to a file, then parse file 2 and read the map. The key is to make the map as easy to parse as possible - make it a binary format, maybe with a binary tree or something where you can quickly skip around and search. (EDIT: I have to add Michael Borgwardt's Grace Hash Join link, which shows an even better way: http://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)
If there is a limit to the size of your files, option 1 is easier to implement - unless you're dealing with huuuuuuuge files (I'm talking lots of GB), you definitely want to do that.
Usually, memory-mapping is the most efficient way to read large files. You'll need to use java.nio.MappedByteBuffer and java.io.RandomAccessFile.
But your search algorithm is the real problem. Building some sort of index or hash table is what you need.
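A bare-bones sketch of the memory-mapping part (read-only mapping of one of the files from the question; the digit-counting loop is only a stand-in for real parsing):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapFile {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("C://AllIDs.txt", "r");
             FileChannel ch = raf.getChannel()) {
            // map the whole file into virtual memory; no copy into the heap is made
            MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long digits = 0;
            while (mb.hasRemaining()) {
                byte b = mb.get();
                if (b >= '0' && b <= '9') digits++; // stand-in for real parsing
            }
            System.out.println(digits + " digit bytes");
        }
    }
}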

Java: How to read a text file

I want to read a text file containing space separated values. Values are integers.
How can I read it and put it in an array list?
Here is an example of contents of the text file:
1 62 4 55 5 6 77
I want to have it in an arraylist as [1, 62, 4, 55, 5, 6, 77]. How can I do it in Java?
You can use Files#readAllLines() to get all lines of a text file into a List<String>.
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
    // ...
}
Tutorial: Basic I/O > File I/O > Reading, Writing and Creating text files
You can use String#split() to split a String in parts based on a regular expression.
for (String part : line.split("\\s+")) {
    // ...
}
Tutorial: Numbers and Strings > Strings > Manipulating Characters in a String
You can use Integer#valueOf() to convert a String into an Integer.
Integer i = Integer.valueOf(part);
Tutorial: Numbers and Strings > Strings > Converting between Numbers and Strings
You can use List#add() to add an element to a List.
numbers.add(i);
Tutorial: Interfaces > The List Interface
So, in a nutshell (assuming that the file doesn't have empty lines nor trailing/leading whitespace).
List<Integer> numbers = new ArrayList<>();
for (String line : Files.readAllLines(Paths.get("/path/to/file.txt"))) {
    for (String part : line.split("\\s+")) {
        Integer i = Integer.valueOf(part);
        numbers.add(i);
    }
}
If you happen to be at Java 8 already, then you can even use Stream API for this, starting with Files#lines().
List<Integer> numbers = Files.lines(Paths.get("/path/to/test.txt"))
    .map(line -> line.split("\\s+")).flatMap(Arrays::stream)
    .map(Integer::valueOf)
    .collect(Collectors.toList());
Tutorial: Processing data with Java 8 streams
Java 1.5 introduced the Scanner class for handling input from file and streams.
It is used for getting integers from a file and would look something like this:
List<Integer> integers = new ArrayList<Integer>();
Scanner fileScanner = new Scanner(new File("c:\\file.txt"));
while (fileScanner.hasNextInt()) {
    integers.add(fileScanner.nextInt());
}
Check the API though. There are many more options for dealing with different types of input sources, differing delimiters, and differing data types.
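For instance, a throwaway example of the delimiter option (the input string is made up):
Scanner s = new Scanner("1,62,4,55").useDelimiter(","); // comma- instead of whitespace-separated
while (s.hasNextInt()) {
    System.out.println(s.nextInt());
}
s.close();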
This example code shows you how to read file in Java.
import java.io.*;

/**
 * This example code shows you how to read file in Java
 *
 * IN MY CASE RAILWAY IS MY TEXT FILE WHICH I WANT TO DISPLAY YOU CHANGE WITH YOUR OWN
 */
public class ReadFileExample {

    public static void main(String[] args) {
        System.out.println("Reading File from Java code");
        //Name of the file
        String fileName = "RAILWAY.txt";
        try {
            //Create object of FileReader
            FileReader inputFile = new FileReader(fileName);

            //Instantiate the BufferedReader Class
            BufferedReader bufferReader = new BufferedReader(inputFile);

            //Variable to hold the one line data
            String line;

            // Read file line by line and print on the console
            while ((line = bufferReader.readLine()) != null) {
                System.out.println(line);
            }
            //Close the buffer reader
            bufferReader.close();
        } catch (Exception e) {
            System.out.println("Error while reading file line by line: " + e.getMessage());
        }
    }
}
Look at this example, and try to do your own:
import java.io.*;

public class ReadFile {

    public static void main(String[] args) {
        String string = "";
        String file = "textFile.txt";

        // Reading
        try {
            InputStream ips = new FileInputStream(file);
            InputStreamReader ipsr = new InputStreamReader(ips);
            BufferedReader br = new BufferedReader(ipsr);
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                string += line + "\n";
            }
            br.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }

        // Writing
        try {
            FileWriter fw = new FileWriter(file);
            BufferedWriter bw = new BufferedWriter(fw);
            PrintWriter fileOut = new PrintWriter(bw);
            fileOut.println(string + "\n test of read and write !!");
            fileOut.close();
            System.out.println("the file " + file + " is created!");
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
Just for fun, here's what I'd probably do in a real project, where I'm already using all my favourite libraries (in this case Guava, formerly known as Google Collections).
String text = Files.toString(new File("textfile.txt"), Charsets.UTF_8);
List<Integer> list = Lists.newArrayList();
for (String s : text.split("\\s")) {
    list.add(Integer.valueOf(s));
}
Benefit: Not much code of your own to maintain (contrast with, e.g., this). Edit: Although it is worth noting that in this case tschaible's Scanner solution doesn't have any more code!
Drawback: you obviously may not want to add new library dependencies just for this. (Then again, you'd be silly not to make use of Guava in your projects. ;-)
Use Apache Commons (IO and Lang) for simple/common things like this.
Imports:
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.ArrayUtils;
Code:
String contents = FileUtils.readFileToString(new File("path/to/your/file.txt"));
String[] array = ArrayUtils.toArray(contents.split(" "));
Done.
Using Java 7 to read files with NIO.2
Import these packages:
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
This is the process to read a file:
Path file = Paths.get("C:\\Java\\file.txt");
if (Files.exists(file) && Files.isReadable(file)) {
    try {
        // File reader
        BufferedReader reader = Files.newBufferedReader(file, Charset.defaultCharset());
        String line;
        // read each line
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
            // tokenize each number
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            while (tokenizer.hasMoreElements()) {
                // parse each integer in file
                int element = Integer.parseInt(tokenizer.nextToken());
            }
        }
        reader.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
To read all lines of a file at once:
Path file = Paths.get("C:\\Java\\file.txt");
List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
All the answers so far given involve reading the file line by line, taking the line in as a String, and then processing the String.
There is no question that this is the easiest approach to understand, and if the file is fairly short (say, tens of thousands of lines), it'll also be acceptable in terms of efficiency. But if the file is long, it's a very inefficient way to do it, for two reasons:
Every character gets processed twice, once in constructing the String, and once in processing it.
The garbage collector will not be your friend if there are lots of lines in the file. You're constructing a new String for each line, and then throwing it away when you move to the next line. The garbage collector will eventually have to dispose of all these String objects that you don't want any more. Someone's got to clean up after you.
If you care about speed, you are much better off reading a block of data and then processing it byte by byte rather than line by line. Every time you come to the end of a number, you add it to the List you're building.
It will come out something like this:
private List<Integer> readIntegers(File file) throws IOException {
    List<Integer> result = new ArrayList<>();
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte[] buf = new byte[16 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0, fileLength);
    int acc = 0;
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        for (int i = 0; i < len; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                result.add(acc);
                acc = 0;
            }
    }
    ch.close();
    raf.close();
    return result;
}
The code above assumes that this is ASCII (though it could be easily tweaked for other encodings), and that anything that isn't a digit (in particular, a space or a newline) represents a boundary between digits. It also assumes that the file ends with a non-digit (in practice, that the last line ends with a newline), though, again, it could be tweaked to deal with the case where it doesn't.
It's much, much faster than any of the String-based approaches also given as answers to this question. There is a detailed investigation of a very similar issue in this question. You'll see there that there's the possibility of improving it still further if you want to go down the multi-threaded line.
Read the file and then do whatever you want.
Java 8:
Files.lines(Paths.get("c://lines.txt")).collect(Collectors.toList());
