I need my application to read very large (100GB+) text files and process their content, potentially across several sessions. For instance, it might run for an hour and finish processing a few GBs, and then I shut it down and come back a few days later to resume processing the same file.
To do this I will need to read the files in memory-friendly chunks; each chunk/page/block/etc. will be read in, one at a time, and processed before the next chunk is read into memory.
I need the program to be able to mark where it is inside the input file, so if it shuts down, or if I need to "replay" the last chunk being processed, I can jump right to the point in the file where I am and continue processing. Specifically, I need to be able to do the following things:
When the processing begins, scan a file for a "MARKER" (some marker that indicates where we left off processing the last time)
If the MARKER exists, jump to it and begin processing from that point
Else, if the MARKER doesn't exist, then place a MARKER after the first chunk (for now, let's say that a "chunk" is just a line-of-text, as BufferedReader#readLine() would read in) and begin processing the first chunk/line
For each chunk/line processed, move the MARKER after the next chunk (thus, advancing the MARKER further down the file)
If we reach a point where there are no more chunks/lines after the current MARKER, we've finished processing the file
I tried coding this up myself and noticed that BufferedReader has some interesting methods that sound like they're suited for this very purpose: mark(), reset(), etc. But the Javadocs on them are kind of vague and I'm not sure that these "File Marker" methods will accomplish all the things I need to be able to do. I'm also completely open to a 3rd party JAR/lib that has this capability built into it, but Google didn't turn anything up.
Any ideas here?
Forget about markers. You cannot "insert" text without rewriting the whole file.
Use a RandomAccessFile and store the current position you are reading. When you need to open the file again, just use seek to jump back to that position.
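A minimal sketch of that idea, assuming a "chunk" is one line as in the question. The class name ResumableReader and the method readFrom are made up for illustration. Note that RandomAccessFile.readLine() is unbuffered and turns each byte into one char, so it is only a starting point for plain single-byte text.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

public class ResumableReader {
    /**
     * Seeks to startOffset, reads up to maxLines lines into out, and
     * returns the byte offset at which the next run should resume.
     * (Hypothetical helper, not part of any library.)
     */
    static long readFrom(String path, long startOffset, int maxLines, List<String> out)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(startOffset);
            String line;
            int count = 0;
            while (count < maxLines && (line = raf.readLine()) != null) {
                out.add(line);
                count++;
            }
            // getFilePointer() is the position just past the last line read.
            return raf.getFilePointer();
        }
    }
}
```

The returned offset is the value you would persist somewhere outside the big file and pass back in as startOffset on the next run.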
A Reader's "mark" is not persistent; it solely forms part of the state of the Reader itself.
I suggest that you not store the state information in the text file itself; instead, have a file alongside which stores the byte offset of the most recently processed chunk. That'll eliminate the obvious problems involving overwriting data in the original text file.
The mark of a BufferedReader is not persisted across runs of your application. I also would not change the content of that huge file to mark a position, since this can lead to significant IO and/or filesystem fragmentation, depending on your OS.
I would use a properties file to store the configuration of the program externally. Have a look at the documentation; the API is straightforward:
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
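As a rough sketch, the byte offset could be kept in a small properties file next to the big text file. The class name OffsetStore, the key "offset" and the sidecar file name are assumptions made up for this example; only java.util.Properties and java.nio.file.Files calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class OffsetStore {
    private static final String KEY = "offset"; // hypothetical key name

    /** Returns the saved offset, or 0 if nothing has been saved yet. */
    static long load(Path sidecar) throws IOException {
        if (!Files.exists(sidecar)) {
            return 0L;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(sidecar)) {
            props.load(in);
        }
        return Long.parseLong(props.getProperty(KEY, "0"));
    }

    /** Overwrites the sidecar file with the given offset. */
    static void save(Path sidecar, long offset) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, Long.toString(offset));
        try (OutputStream out = Files.newOutputStream(sidecar)) {
            props.store(out, "last processed byte offset");
        }
    }
}
```

Calling save() after each processed chunk means a restart loses at most the chunk that was in flight, which also covers the "replay the last chunk" case from the question.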
Related
I have used PrintWriter to write to a text file with Java. The process takes time because it iterates, and at the end of each iteration a string is appended to the file. If I wait until the process ends, I can open the file and see its content without a problem. But how can I watch the file being updated while it is still being written to? For example, if I refresh the folder, I would like to see the size grow with each refresh.
You can't see the progress of your data being written into the file, since it might not actually have been written yet.
First: Java streams usually buffer access to the hard drive so they touch it as little as possible, since doing so is slow. To force Java to do what it thinks is writing to disk, call PrintWriter.flush() after you have appended something.
Second: even that might not do the trick, since most operating systems do the same optimisation with the same result, and you can't simply force your OS to flush().
As already mentioned, the best you can do is PrintWriter.flush().
As @Elliott mentioned, you should call pw.flush() after each append to see the size change immediately.
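A small sketch of flushing after every append. The class and method names are invented for the example, and the recorded sizes are there only to show the effect: flush() hands the data to the operating system, so the reported file length normally grows, but when the bytes physically reach the disk is still up to the OS.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class FlushingAppender {
    /**
     * Appends one line per iteration, flushing each time, and returns the
     * file length observed after every flush. (Hypothetical helper.)
     */
    static long[] appendLines(String path, int iterations) throws IOException {
        long[] sizes = new long[iterations];
        File file = new File(path);
        // 'true' opens the file in append mode.
        try (PrintWriter pw = new PrintWriter(new FileWriter(file, true))) {
            for (int i = 0; i < iterations; i++) {
                pw.println("iteration " + i);
                pw.flush(); // push Java's buffered chars out to the OS
                sizes[i] = file.length();
            }
        }
        return sizes;
    }
}
```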
I'm writing save logic for an application and part of it will save a dynamic list of "chunks" of data to a single file. Some of those chunks might have been provided by a plugin though (which would have included logic to read it back), so I need to find a way to properly skip unrecognized chunks of data if the plugin which created it has been removed.
My current solution is to write a length (int32) before each "chunk" so if there's an error the reader can skip past it and continue reading the next chunk.
However, this requires calculating the length of the data before writing any of it, and since our system is somewhat dynamic and allows nested data types, I'd rather avoid the overhead of caching everything just to measure it.
I'm considering using file markers somehow - I could scan the file for a very specific byte sequence that separates chunks. That could be written after each chunk rather than before.
Are there other options I'm not thinking of? My goal is to find a way to write the data immediately, without the need for caching and measuring it.
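For reference, the length-prefixed layout described above (the current solution, not a fix for the caching concern) might look roughly like this. The type id written before the length is an assumption added so the reader can tell which chunks it recognizes; ChunkFile, writeChunk and readKnown are hypothetical names.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ChunkFile {
    /** Writes [typeId:int32][length:int32][payload]. */
    static void writeChunk(DataOutputStream out, int typeId, byte[] payload) throws IOException {
        out.writeInt(typeId);
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Returns payloads of recognized chunks and skips over the rest. */
    static List<byte[]> readKnown(DataInputStream in, Set<Integer> knownTypes) throws IOException {
        List<byte[]> result = new ArrayList<>();
        while (true) {
            int typeId;
            try {
                typeId = in.readInt();
            } catch (EOFException endOfFile) {
                return result; // clean end: no more chunk headers
            }
            int length = in.readInt();
            if (knownTypes.contains(typeId)) {
                byte[] payload = new byte[length];
                in.readFully(payload);
                result.add(payload);
            } else {
                // skipBytes may skip fewer bytes than asked, so loop.
                int remaining = length;
                while (remaining > 0) {
                    int skipped = in.skipBytes(remaining);
                    if (skipped <= 0) {
                        throw new EOFException("truncated chunk");
                    }
                    remaining -= skipped;
                }
            }
        }
    }
}
```

This makes the trade-off visible: writeChunk needs the whole payload in hand to know its length, which is exactly the buffering the question hopes to avoid.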
I'm working on a program to read in various text files and display them without having to read the entire file to memory.
A brief description of what I want the program to do:
Assume a file of say, 3000 lines.
Display 50 lines at a time
Allow user to scroll down to read further lines; these lines are loaded in real time from the reader. Lines further up that are no longer displayed are not kept in memory.
Allow user to scroll up to read previous lines, however these lines are also loaded in real-time or at least in a similar manner as reading forward.
What I want is to tailor the program to be memory-efficient and handle large files without fizzling out. I was looking at BufferedReaders but there doesn't seem to be a dependable way of backwards traversal. I was originally looking at how mark() and reset() functioned but I couldn't find a class that could set multiple marks.
I'm wondering if anybody could help me out and give me a few pointers towards some useful classes I could use. I was starting to poke around NIO classes like ByteBuffers and CharBuffers but I'm rather lost as to how I could implement them towards what I want to accomplish.
Thanks!
Back in the olden days of computing (1980's), this is exactly how we had to process large files for display.
Basically, you need an input method that can read specified lines from a file. Something like
List<String> block = readFile(file, 51, 100);
which would read the 51st through 100th lines of the file.
I see two ways you could accomplish this.
Read from the beginning of the file each time, skipping the first n records and retrieving 50 (or some other number) strings.
Read the file once, and break it up into x files of length 50 (or some other number). Read your temporary files to get blocks of strings.
Either way, you would keep 3 blocks in memory; the current block, the previous block and the next block.
As you move forward through the strings, the current block becomes the previous block, the next block becomes the current block, and you read a new next block.
As you move backward through the strings, the current block becomes the next block, the previous block becomes the current block, and you read a new previous block.
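A sketch of the first approach (re-read from the start and skip), giving a possible body for the readFile call shown earlier. The Path parameter type and the UTF-8 charset are assumptions; line numbers are 1-based and inclusive, matching the "51st through 100th" example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BlockReader {
    /**
     * Returns lines first..last (1-based, inclusive). The list is shorter
     * if the file ends before 'last'.
     */
    static List<String> readFile(Path file, int first, int last) throws IOException {
        List<String> block = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo >= first) {
                    block.add(line);
                }
                if (lineNo >= last) {
                    break; // stop as soon as the block is complete
                }
            }
        }
        return block;
    }
}
```

Scrolling forward would then call readFile for the new "next" block and drop the old "previous" one, and the reverse when scrolling backward. The cost of each call grows with the line number, which is why the temporary-files variant or an offset index can be preferable for big files.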
Random access to a file is available in Java. So you can surf through the bytes of a file pretty easily, and have a region of file mapped to memory at a time.
You can have a Deque<E> implementation for readable region. Then you can add/remove data chunks or "lines" from both ends, to represent a visual data scroll.
If the "lines" are defined as the characters that fit the width of the visual display (such as a console window), then you might just keep loading the next x bytes/characters, and removing the previous x bytes/characters.
Otherwise, you may need to scan ahead, and build some metadata about the file, noting down the positions of lines, or other interesting structures within the file. Then you can use this metadata to quickly navigate the file.
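One way to build such metadata, sketched under the assumption that lines end in '\n' and the text is single-byte: scan the file once, remember the byte offset where each line starts, then seek straight to any line. LineIndex and its methods are hypothetical names. For very large files you could record only every 50th line start to keep the index small.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    /** Scans the file once and records the byte offset at which each line starts. */
    static List<Long> build(String path) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    starts.add(pos);
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true;
                }
                pos++;
            }
        }
        return starts;
    }

    /** Jumps to the given line (0-based) using the index and reads it. */
    static String readLine(String path, List<Long> starts, int lineIndex) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(starts.get(lineIndex));
            return raf.readLine();
        }
    }
}
```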
I am using RandomAccessFile, but writeBytes() overwrites the existing content. Mostly I want to implement this without using any new temporary file; at least method names, a few clues, or techniques will do.
To insert without the use of any temporary file, you'll have to find a way to read in and shift all subsequent lines down by one when you insert a new line.
The sequential storage of lines of text poses the same kind of issues that the use of a standard array (vs. a LinkedList) does: You can't just unhook a pointer, plop in some text, and then hook up the pointer to next item to point to a subsequent line in the file. You have to perform the entire shift.
So I'd guess that you'd want to go to end of file, shift it down by a line, then move up each line and perform the same shift until you hit the position at which you want to insert the new line, at which point, you'll have cleared a space for it.
This seems very inefficient.
Hopefully someone knows of a better way, but this would be my approach, were a temporary file not an option.
(Alternatively, you could always just read the whole thing into a StringBuffer if it were small enough, perform the insert within that, and then write the file back out from scratch, but I imagine that you've considered that option already.)
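The shift-from-the-end approach could be sketched as follows, working on fixed-size byte blocks rather than whole lines (an assumption made to keep the example short). InPlaceInsert and the 8192-byte buffer size are illustrative choices, not anything prescribed above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class InPlaceInsert {
    /** Inserts data at the given byte position, shifting the tail of the file down. */
    static void insert(String path, long position, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            long oldLength = raf.length();
            if (position < 0 || position > oldLength) {
                throw new IllegalArgumentException("position outside file");
            }
            raf.setLength(oldLength + data.length); // make room at the end
            byte[] buffer = new byte[8192];
            long end = oldLength; // exclusive end of the region still to move
            // Move blocks starting from the end of the file so nothing
            // is overwritten before it has been copied.
            while (end > position) {
                int n = (int) Math.min(buffer.length, end - position);
                long start = end - n;
                raf.seek(start);
                raf.readFully(buffer, 0, n);
                raf.seek(start + data.length);
                raf.write(buffer, 0, n);
                end = start;
            }
            raf.seek(position);
            raf.write(data); // the gap is now free for the new content
        }
    }
}
```

Everything after the insertion point is still rewritten once, so the cost is proportional to the size of the tail; the sketch only avoids the temporary file. A crash midway would leave the file in a mixed state.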
I want to read a file incrementally in java while the file is being modified/written by some other process. So suppose Process "A" is writing/logging a file "X" and another process "B" wants to incrementally read the file "X", say every 1 sec (or even continuously) to find a particular pattern. What's the best way to do this in java? I know I can use RandomAccessFile's 'seek' method but will that interfere with the writing of the file? Is there a better way to do this?
Poll the last-modified date of the file. Opening it up for reading can prevent other programs from writing to the file at the same time.
If you're able to use Java 7, you could take advantage of the WatchService ... but it doesn't solve having to parse the whole file.
The only thing I can think of is maintaining some kind of "marker" that would indicate the last position you were up to. The next time you came to read the file, you could skip to this point and read from there (updating the marker when you're done).
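A rough sketch of that marker idea, keeping the marker in memory as a byte offset. IncrementalTail, readNew, the 1 MiB cap per call and the reset-on-shrink behaviour are all assumptions for illustration. Whether opening the file for reading interferes with the writer depends on the OS and on how the writer opened it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class IncrementalTail {
    private static final int MAX_READ = 1 << 20; // hypothetical cap per call (1 MiB)
    private long marker = 0; // byte offset just past what has been read so far

    /** Returns the bytes appended since the previous call (possibly empty). */
    byte[] readNew(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            if (length < marker) {
                marker = 0; // file shrank (e.g. was rotated): start over
            }
            int n = (int) Math.min(length - marker, MAX_READ);
            byte[] fresh = new byte[n];
            raf.seek(marker);
            raf.readFully(fresh);
            marker += n; // advance the marker past what was just read
            return fresh;
        }
    }
}
```

Process "B" would call readNew on a timer (say every second) and run its pattern search on the returned bytes, carrying over any partial line to the next call.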