Resume read of huge text file in Java

I am reading a huge text file of words (one word per line), but I have to stop from time to time and resume the read the next day. Right now I'm using Apache Commons IO's LineIterator, but it's totally the wrong solution. My file is 7 GB and I had to interrupt reading it at around 1 GB. To resume the read I saved the number of lines already read, which means I have an if statement inside the while loop. Apache's FileUtils doesn't allow seeking, so that was my workaround.
What is the best/fastest solution? I thought of using RandomAccessFile to get to the right line and continue reading, but I'm not sure whether I can jump to the right place, or how to save the position I last read. Re-reading a couple of lines is fine, so the precision isn't critical, but I haven't found a way to get the pointer. I have a BufferedReader to read the file and a RandomAccessFile to seek to the right place, but I don't know how to periodically save a position while reading with the BufferedReader.
Any hints?
Code (note the "SOMETHING" where I should print the value I can later use as seekToByte):
try {
    RandomAccessFile rand = new RandomAccessFile(file, "r");
    rand.seek(seekToByte);
    startAtByte = rand.getFilePointer();
    rand.close();
} catch (IOException e) {
    // do something
}

// Do it using the BufferedReader
BufferedReader reader = null;
FileReader freader = null;
try {
    freader = new FileReader(file);
    reader = new BufferedReader(freader);
    reader.skip(startAtByte);
    long i = 0;
    for (String line; (line = reader.readLine()) != null; ) {
        lines.add(line);
        System.out.print(i + " ");
        if (lines.size() > 1000) {
            commit(lines);
            System.out.println("");
            lines.clear();
            System.out.println(SOMETHING?);
        }
    }
} catch (Exception e) {
    // handle this
} finally {
    if (reader != null) {
        try { reader.close(); } catch (Exception ignore) {}
    }
}

RandomAccessFile is indeed one way to go. Use
long position = file.getFilePointer();
when you stop reading to save where you are in the file, and then restore with
file.seek(position);
to resume reading at the same place.
However, be careful when using RandomAccessFile, as its readLine method does not properly support Unicode (it treats every byte as a single character).
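For illustration, here is a minimal sketch of the whole save/resume cycle built on RandomAccessFile alone. It assumes the file is ASCII (so RandomAccessFile.readLine is safe) and uses hypothetical loadSavedPosition()/savePosition() helpers that persist the byte offset, plus the commit() and lines from the question:

// Sketch: resume a line-by-line read using RandomAccessFile (ASCII input assumed).
long startAtByte = loadSavedPosition();              // hypothetical: last saved offset, or 0 on the first run
try (RandomAccessFile rand = new RandomAccessFile(file, "r")) {
    rand.seek(startAtByte);                          // jump to where the previous run stopped
    String line;
    while ((line = rand.readLine()) != null) {
        lines.add(line);
        if (lines.size() > 1000) {
            commit(lines);
            lines.clear();
            savePosition(rand.getFilePointer());     // hypothetical: persist the current byte offset
        }
    }
    savePosition(rand.getFilePointer());             // persist the final position as well
} catch (IOException e) {
    // handle / log the error
}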

Can you somehow use predetermined offsets? For instance, chop the file into four pieces (offset0, offset1), (offset1, offset2), etc., and use RecursiveAction (from the Fork/Join API) to take advantage of parallelism.
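A rough sketch of that idea, assuming each worker opens its own RandomAccessFile over its byte range; aligning the chunk boundaries to line breaks is omitted for brevity:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch only: split a [start, end) byte range until it is small enough,
// then process each chunk with its own RandomAccessFile.
class ChunkTask extends RecursiveAction {
    private static final long THRESHOLD = 64L * 1024 * 1024; // 64 MB per leaf task
    private final File file;
    private final long start, end;

    ChunkTask(File file, long start, long end) {
        this.file = file;
        this.start = start;
        this.end = end;
    }

    @Override
    protected void compute() {
        if (end - start <= THRESHOLD) {
            processChunk();
        } else {
            long mid = (start + end) / 2;
            invokeAll(new ChunkTask(file, start, mid), new ChunkTask(file, mid, end));
        }
    }

    private void processChunk() {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                // handle the line (ASCII assumed, same readLine caveat as above)
            }
        } catch (IOException e) {
            // handle / log the error
        }
    }
}

// Usage: new ForkJoinPool().invoke(new ChunkTask(new File("words.txt"), 0, new File("words.txt").length()));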

Related

Unexpected number of lines when writing to a CSV file

A part of my application writes data to a .csv file in the following way:
public class ExampleWriter {

    public static final int COUNT = 10_000;
    public static final String FILE = "test.csv";

    public static void main(String[] args) throws Exception {
        try (OutputStream os = new FileOutputStream(FILE)) {
            os.write(239);
            os.write(187);
            os.write(191);
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
            for (int i = 0; i < COUNT; i++) {
                writer.write(Integer.toString(i));
                writer.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(checkLineCount(COUNT, new File(FILE)));
    }

    public static String checkLineCount(int expectedLineCount, File file) throws Exception {
        BufferedReader expectedReader = new BufferedReader(new FileReader(file));
        try {
            int lineCount = 0;
            while (expectedReader.readLine() != null) {
                lineCount++;
            }
            if (expectedLineCount == lineCount) {
                return "correct";
            } else {
                return "incorrect";
            }
        } finally {
            expectedReader.close();
        }
    }
}
The file will be opened in Excel, and all kinds of languages are present in the data. The os.write calls prefix the file with a byte order mark so that all kinds of characters display correctly.
Somehow the number of lines in the file does not match the count in the loop, and I cannot figure out why. Any help on what I am doing wrong here would be greatly appreciated.
You simply need to flush and close your writer before opening the file for input and counting. Try adding:
writer.flush();
writer.close();
inside your try block, after the for loop in the main method.
(As a side note.)
Note that using a BOM is optional and in many cases reduces the portability of your files, because not all consuming applications handle it well. It also does not guarantee that the file actually has the advertised character encoding. So I would recommend removing the BOM. When opening the file in Excel, just select UTF-8 as the encoding.
You are not flushing the stream. Refer to the Oracle docs for more info, which say:
Flushes this output stream and forces any buffered output bytes to be
written out. The general contract of flush is that calling it is an
indication that, if any bytes previously written have been buffered by
the implementation of the output stream, such bytes should immediately
be written to their intended destination. If the intended destination
of this stream is an abstraction provided by the underlying operating
system, for example a file, then flushing the stream guarantees only
that bytes previously written to the stream are passed to the
operating system for writing; it does not guarantee that they are
actually written to a physical device such as a disk drive.
The flush method of OutputStream does nothing.
You need to flush as well as close the stream. There are two ways:
manually call close() and flush(), or
use try-with-resources.
As I can see from your code, you have already used try-with-resources, and BufferedWriter also implements Closeable and Flushable, so declare the writer inside the try as below:
public static void main(String[] args) throws Exception {
    try (OutputStream os = new FileOutputStream(FILE);
         BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8))) {
        os.write(239);
        os.write(187);
        os.write(191);
        for (int i = 0; i < COUNT; i++) {
            writer.write(Integer.toString(i));
            writer.newLine();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(checkLineCount(COUNT, new File(FILE)));
}
When COUNT is 1, the code in main() writes a file with two lines: a line with data plus an empty line afterwards. Then you call checkLineCount(COUNT, file) expecting it to return 1, but it returns 2 because the file actually has two lines.
Therefore, if you want the counts to match, you must not write a newline after the last line.
(As another side note.)
Notice that writing CSV files the way you are doing it is really bad practice. CSV is not as easy as it may look at first sight! So, unless you really know what you are doing (and are aware of all the CSV quirks), use a library.
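For example, with a library such as Apache Commons CSV the loop above could look roughly like this (a minimal sketch; the dependency and the default format options are assumptions):

// Sketch: let the CSV library handle quoting, escaping and record separators.
// (requires org.apache.commons.csv.CSVFormat and org.apache.commons.csv.CSVPrinter)
try (Writer out = new OutputStreamWriter(new FileOutputStream(FILE), StandardCharsets.UTF_8);
     CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT)) {
    for (int i = 0; i < COUNT; i++) {
        printer.printRecord(Integer.toString(i));
    }
}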

Executing C from Java: strange errors

I am using Java as a front end for a chess AI I am writing. The Java side handles all the graphics and then executes some C using a few command line arguments. Sometimes the C will never finish and never get back to the Java. I have found cases in which this happens and tested them with just the .exe and no Java. When I take out the Java, these cases work every time. I am not sure where to go from here. Here is some code that I think is relevant; the whole project is at https://github.com/AndyGrant/JChess
try {
    Process engine = Runtime.getRuntime().exec(buildCommandLineExecuteString(lastMove));
    engine.waitFor();
    int AImoveIndex = engine.exitValue();
    String line;
    BufferedReader input = new BufferedReader(new InputStreamReader(engine.getInputStream()));
    while ((line = input.readLine()) != null)
        System.out.println(line);
    input.close();
    if (AImoveIndex == -1) {
        activeGame = false;
        System.out.println("Fatal Error");
        while (true) {
        }
    } else {
        JMove AIMove = JChessEngine.getAllValid(types, colors, moved, lastMove, !gameTurn).get(AImoveIndex);
        AIMove.makeMove(types, colors, moved);
        lastMove = AIMove;
        validMoves = JChessEngine.getAllValid(types, colors, moved, lastMove, gameTurn);
    }
    waitingOnComputer = false;
    parent.repaint();
} catch (Exception e) {
    e.printStackTrace();
}
Sometimes the external process gets stuck on I/O while trying to write to its output. If the output (pipe) buffer is full, the next printf will block.
How much text is it writing to the console?
Try moving your engine.waitFor() after the part where you read all the input from it.
An alternative would be to have the external process write to a temp file, and then you read the temp file.
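A sketch of that reordering, based on the code in the question (read and drain the output first, then wait for the exit value):

// Sketch: drain the child's output before waiting for it, so the child can
// never block on a full pipe while we sit in waitFor().
Process engine = Runtime.getRuntime().exec(buildCommandLineExecuteString(lastMove));
try (BufferedReader input = new BufferedReader(new InputStreamReader(engine.getInputStream()))) {
    String line;
    while ((line = input.readLine()) != null) {
        System.out.println(line);
    }
}
int AImoveIndex = engine.waitFor(); // waitFor() returns the exit value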
Maybe remove
while (true){
}
If your AImoveIndex == -1, your program will enter a never-ending loop.

Improve performance of a Java Program

I've made an applet search utility in which I provide a string as input and find that string in a specified file or folder.
I'm done with this, but I'm not happy with its performance.
The process takes too much time to respond.
I decided to profile it to see what is happening, and I noticed that the method scanner.hasNextLine() is taking most of the time.
Though this is a very important method for my program, because I have to read all the lines to find the string, is there any other way I can improve its performance and reduce the execution time?
Here is the code where I am using this method:
fw = new FileWriter("filePath", true);
bw = new BufferedWriter(fw);
for (File file : filenames) {
    if (file.isHidden())
        continue;
    if (!file.isDirectory()) {
        Scanner scanner = new Scanner(file);
        int cnt = 0;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (!exactMatch) {
                if (!caseSensitive) {
                    if (line.toLowerCase().contains(searchString.toLowerCase())) {
                        // System.out.println(line);
                        cnt += StringUtils.countMatches(line.toLowerCase(),
                                searchString.toLowerCase());
                    }
                } else {
                    if (line.contains(searchString)) {
                        // System.out.println(line);
                        cnt += StringUtils.countMatches(line, searchString);
                    }
                }
            }
And yes, the toLowerCase() method is also taking more time than expected.
I have changed my code and I am now using BufferedReader in place of Scanner, as Alex and Nrj suggested, and I found a nice improvement in the performance of my application.
It now runs in one third of the time of the earlier version.
Thanks to all who replied.
Following your question, I examined the code of Scanner and I think you are right: it is not optimized for working with large data. I'd recommend using a simple BufferedReader that wraps an InputStreamReader that wraps a FileInputStream:
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
then read line by line with:
r.readLine()
If this is not enough for you, try reading bulks of lines and then processing them.
Concerning toLowerCase(), you can try using regular expressions instead. The benefit is that you do not have to change the case of the line every time; the disadvantage is that in simple cases a regular expression is a bit slower than a plain string comparison.
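A small sketch of the regex approach, assuming the search string should be matched as a literal (uses java.util.regex.Pattern and Matcher):

// Sketch: case-insensitive literal search without lowercasing every line.
Pattern pattern = Pattern.compile(Pattern.quote(searchString),
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    cnt++; // count every non-overlapping occurrence in the line
}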
I would suggest redesigning your solution and use something like Lucene to do the search for you. You can index and search files with Lucene much more efficiently, tutorial on how to do it with text files can be found here: http://www.avajava.com/tutorials/lessons/how-do-i-use-lucene-to-index-and-search-text-files.html
(Only small optimizations, in response to comment above.)
if (!caseSensitive) {
    searchString = searchString.toLowerCase();
}
while (true) {
    String line = bufferedReader.readLine();
    if (line == null)
        break;
    if (!caseSensitive) {
        line = line.toLowerCase();
    }
    if (!exactMatch) {
        if (line.contains(searchString)) {
            // System.out.println(line);
            cnt += StringUtils.countMatches(line, searchString);
        }
    }
Try using BufferedReader
Make use of threads. You can search the files in parallel, which should reduce the search time.
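A sketch of that, assuming a hypothetical countMatchesInFile(File) helper that contains the per-file search logic from the question:

// Sketch: search files in parallel with a fixed-size thread pool.
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<Integer>> results = new ArrayList<>();
for (File file : filenames) {
    results.add(pool.submit(() -> countMatchesInFile(file))); // hypothetical per-file search
}
int total = 0;
for (Future<Integer> f : results) {
    total += f.get(); // get() throws InterruptedException/ExecutionException; handle or declare them
}
pool.shutdown();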
I would not use Java to search the file system for matches of the string. Instead, invoke a native tool from Java; for example, I would invoke grep using something like this:
ProcessBuilder pb = new ProcessBuilder("grep", "-r", "foo");
pb.directory(new File("myDir"));
Process p = pb.start();
InputStream in = p.getInputStream();
//Do whatever you prefer with the stream

Is there a better way to read from a process's InputStream and then handle it with specific methods?

I am writing a program that does the following:
Run a command using ProcessBuilder (like "svn info" or "svn diff");
Read the output of the command from the process's getInputStream();
With the output of the command, I want either:
Parse the output and get what I want and use it later, OR:
Write the output directly to a specified file.
What I'm doing now is using a BufferedReader to read the command's output line by line and save the lines to an ArrayList, then deciding whether to scan the lines for what I need or write them to a file.
Obviously this is an ugly implementation, because the ArrayList should not be needed when I just want a command's output saved to a file. So what would you suggest to do this in a better way?
Here is some of my codes:
I use this to run a command and read the process's output:
private ArrayList<String> runCommand(String[] command) throws IOException {
    ArrayList<String> result = new ArrayList<>();
    _processBuilder.command(command);
    Process process = null;
    try {
        process = _processBuilder.start();
        try (InputStream inputStream = process.getInputStream();
             InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
             BufferedReader bufferedReader = new BufferedReader(inputStreamReader)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                result.add(line);
            }
        }
    } catch (IOException ex) {
        _logger.log(Level.SEVERE, "Error!", ex);
    } finally {
        if (process != null) {
            try {
                process.waitFor();
            } catch (InterruptedException ex) {
                _logger.log(Level.SEVERE, null, ex);
            }
        }
    }
    return result;
}
In one method I may do this:
ArrayList<String> result = runCommand(command1);
for (String line : result) {
    // ...parse the line here...
}
and in another I may do this:
ArrayList<String> result = runCommand(command2);
File file = new File(...filename, etc...);
try (PrintWriter printWriter = new PrintWriter(new FileWriter(file, false))) {
    for (String line : result) {
        printWriter.println(line);
    }
}
Returning the process output in an ArrayList seems like a fine abstraction to me. Then the caller of runCommand() doesn't need to worry about how the command was run or the output read. The memory used by the extra list is probably not significant unless your command is very prolix.
The only time I could see this being an issue would be if the caller wanted to start processing the output while the command was still running, which doesn't seem to be the case here.
For very big output that you don't want to copy into memory first, one option would be to have runCommand() take a callback like Guava's LineProcessor that it will call for each line of the output. Then runCommand() can still abstract away the whole deal of running the process, reading the output, and closing everything afterwards, but data can be passed out to the callback as it runs rather than waiting for the method to return the whole response in one array.
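A sketch of that callback variant, using a plain java.util.function.Consumer instead of Guava's LineProcessor (the method shape mirrors the runCommand() from the question; parse() in the usage comment is a hypothetical stand-in for the caller's parsing code):

// Sketch: stream each line to a callback instead of collecting into a list.
private void runCommand(String[] command, Consumer<String> lineHandler) throws IOException {
    _processBuilder.command(command);
    Process process = _processBuilder.start();
    try (BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(process.getInputStream()))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            lineHandler.accept(line); // parse it, or write it straight to a file
        }
    } finally {
        try {
            process.waitFor();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}

// Usage: either parse each line as it arrives...
//   runCommand(command1, line -> parse(line));
// ...or dump the output directly to a file without building a list:
//   try (PrintWriter printWriter = new PrintWriter(new FileWriter(file, false))) {
//       runCommand(command2, printWriter::println);
//   }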
I don't think it's a performance issue that you store the text uselessly in some cases. Nonetheless, for cleanliness, it might be better to write two methods:
private ArrayList<String> runCommand(String[] command)
private void runCommandAndDumpToFile(String[] command, File file)
(It wasn't quite clear from your question, but I assume that you know before running your process whether you'll just write the output to file or process it.)
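For the second method, a sketch that streams straight to the file by letting ProcessBuilder redirect standard output, so no Java-side copying is needed:

// Sketch: redirect the child's stdout directly into the target file.
private void runCommandAndDumpToFile(String[] command, File file) throws IOException {
    _processBuilder.command(command);
    _processBuilder.redirectOutput(ProcessBuilder.Redirect.to(file));
    Process process = _processBuilder.start();
    try {
        process.waitFor();
    } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();
    }
}

Note that the redirect setting sticks to the ProcessBuilder, so it would need to be reset (for example back to ProcessBuilder.Redirect.PIPE) before reusing the same builder for the list-returning variant.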

Java: Pause thread and get position in file

I'm writing an application in Java with multithreading which I want to pause and resume.
The thread reads a file line by line while finding lines that match a pattern. It has to continue from the place where I paused it. To read the file I use a BufferedReader in combination with an InputStreamReader and a FileInputStream.
fip = new FileInputStream(new File(file));
fileBuffer = new BufferedReader(new InputStreamReader(fip));
I use this FileInputStream because I need the file pointer for the position in the file.
When processing the lines, the thread writes the matching lines to a MySQL database. To share a MySQL connection between threads I use a connection pool, which makes sure only one thread at a time uses a connection.
The problem is that when I pause the threads and resume them, a few matching lines just disappear. I also tried subtracting the buffer size from the offset, but it still has the same problem.
What is a decent way to solve this problem, or what am I doing wrong?
Some more details:
The loop
// Regex engine
RunAutomaton ra = new RunAutomaton(this.conf.getAuto(), true);
lw = new LogWriter();

while ((line = fileBuffer.readLine()) != null) {
    if (line.length() > 0) {
        if (ra.run(line)) {
            // Write to LogWriter
            lw.write(line, this.file.getName());
            lw.execute();
        }
    }
}
// Loop when paused.
while (pause) { }
}
Calculating place in file
// Get the position in the file
public long getFilePosition() throws IOException {
    long position = fip.getChannel().position() - bufferSize + fileBuffer.getNextChar();
    return position;
}
Putting it into the database
// Get the connector
ConnectionPoolManager cpl = ConnectionPoolManager.getManager();
Connector con = null;
while (con == null)
    con = cpl.getConnectionFromPool();
// Insert the query
con.executeUpdate(this.sql.toString());
cpl.returnConnectionToPool(con);
Here's an example of what I believe you're looking for. You didn't show much of your implementation so it's hard to debug what might be causing gaps for you. Note that the position of the FileInputStream is going to be a multiple of 8192 because the BufferedReader is using a buffer of that size. If you want to use multiple threads to read the same file you might find this answer helpful.
public class ReaderThread extends Thread {

    private final FileInputStream fip;
    private final BufferedReader fileBuffer;
    private volatile boolean paused;

    public ReaderThread(File file) throws FileNotFoundException {
        fip = new FileInputStream(file);
        fileBuffer = new BufferedReader(new InputStreamReader(fip));
    }

    public void setPaused(boolean paused) {
        this.paused = paused;
    }

    public long getFilePos() throws IOException {
        return fip.getChannel().position();
    }

    public void run() {
        try {
            String line;
            while ((line = fileBuffer.readLine()) != null) {
                // process your line here
                System.out.println(line);
                while (paused) {
                    sleep(10);
                }
            }
        } catch (IOException e) {
            // handle I/O errors
        } catch (InterruptedException e) {
            // handle interrupt
        }
    }
}
I think the root of the problem is that you shouldn't be subtracting bufferSize. Rather you should be subtracting the number of unread characters in the buffer. And I don't think there's a way to get this.
The easiest solution I can think of is to create a custom subclass of FilterReader that keeps track of the number of characters read. Then stack the streams as follows:
FileReader
< BufferedReader
< custom filter reader
< BufferedReader(sz == 1)
The final BufferedReader is there so that you can use readLine ... but you need to set the buffer size to 1 so that the character count from your filter matches the position that the application has reached.
Alternatively, you could implement your own readLine() method in the custom filter reader.
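A minimal sketch of such a counting filter reader (error handling and mark/reset support omitted):

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Sketch: a FilterReader that counts how many characters have passed through it,
// giving the application a logical position it can save and restore later.
class CountingReader extends FilterReader {

    private long charsRead = 0;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) charsRead++;
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) charsRead += n;
        return n;
    }

    public long getCharsRead() {
        return charsRead;
    }
}

// Stacked as described above:
// CountingReader counter = new CountingReader(new BufferedReader(new FileReader(file)));
// BufferedReader lineReader = new BufferedReader(counter, 1); // buffer size 1 keeps the count exact

Note that this counts characters, so it only equals the byte offset for single-byte encodings; for UTF-8 you would have to count bytes below the decoder instead.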
After a few days of searching I found out that subtracting the buffer size and adding the position in the buffer was indeed not the right way to do it. The position was never right and I was always missing some lines.
While looking for a new approach I didn't count characters, because there are simply too many to count and it would hurt performance a lot. But I found something else: software engineer Mark S. Kolich created a class, JumpToLine, which uses the Apache IO library to jump to a given line. It can also report the last line it has read, so it is exactly what I need.
There are some examples on his homepage for those interested.
