Reading (Potentially Large) Text Files without Reading Them Entirely into Memory in Java?

I'm working on a program to read in various text files and display them without having to read an entire file into memory.
A brief description of what I want the program to do:
Assume a file of, say, 3,000 lines.
Display 50 lines at a time.
Allow the user to scroll down to read further lines; these lines are loaded on demand from the reader, and lines further up that are no longer displayed are not kept in memory.
Allow the user to scroll up to read previous lines, loaded on demand in the same manner as reading forward.
What I want is to tailor the program to be memory-efficient and to handle large files without falling over. I was looking at BufferedReader, but there doesn't seem to be a dependable way to traverse backwards. I originally looked at how mark() and reset() work, but I couldn't find a class that can set multiple marks.
I'm wondering if anybody could help me out and give me a few pointers towards some useful classes. I've started poking around NIO classes like ByteBuffer and CharBuffer, but I'm rather lost as to how I could apply them to what I want to accomplish.
Thanks!

Back in the olden days of computing (the 1980s), this is exactly how we had to process large files for display.
Basically, you need an input method that can read specified lines from a file. Something like
List<String> block = readFile(file, 51, 100);
which would read the 51st through 100th lines of the file.
I see two ways you could accomplish this.
Read from the beginning of the file each time, skipping the first n lines and retrieving the next 50 (or some other number of) lines; see the sketch after this list.
Read the file once, and break it up into x files of length 50 (or some other number). Read your temporary files to get blocks of strings.
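A minimal sketch of the first approach, assuming Java 8+ streams (readFile and its parameter names are just illustrative, matching the signature above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Reads lines firstLine..lastLine (1-based, inclusive) by re-reading from
// the start of the file and skipping ahead; only the requested block is
// ever held in memory.
static List<String> readFile(Path file, int firstLine, int lastLine) throws IOException {
    try (Stream<String> lines = Files.lines(file)) {
        return lines.skip(firstLine - 1)
                    .limit(lastLine - firstLine + 1)
                    .collect(Collectors.toList());
    }
}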
Either way, you would keep three blocks in memory: the current block, the previous block, and the next block.
As you move forward through the strings, the current block becomes the previous block, the next block becomes the current block, and you read a new next block.
As you move backward through the strings, the current block becomes the next block, the previous block becomes the current block, and you read a new previous block.
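A hedged sketch of that sliding three-block window, reusing the skip/limit idea from the readFile sketch above (the class and field names are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BlockWindow {
    static final int BLOCK_SIZE = 50;
    private final Path file;
    private List<String> previous, current, next;
    private int currentBlock;                        // 0-based index of 'current'

    BlockWindow(Path file) throws IOException {
        this.file = file;
        previous = Collections.emptyList();          // nothing above the first block
        current = readBlock(0);
        next = readBlock(1);
    }

    void scrollDown() throws IOException {           // shift the window one block down
        previous = current;
        current = next;
        currentBlock++;
        next = readBlock(currentBlock + 1);
    }

    void scrollUp() throws IOException {             // shift the window one block up
        if (currentBlock == 0) return;
        next = current;
        current = previous;
        currentBlock--;
        previous = currentBlock == 0 ? Collections.emptyList()
                                     : readBlock(currentBlock - 1);
    }

    List<String> visibleLines() { return current; }

    private List<String> readBlock(int blockIndex) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.skip((long) blockIndex * BLOCK_SIZE)
                        .limit(BLOCK_SIZE)
                        .collect(Collectors.toList());
        }
    }
}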

Random access to a file is available in Java, so you can surf through the bytes of a file pretty easily and have one region of the file mapped into memory at a time.
You can use a Deque<E> implementation for the readable region. Then you can add/remove data chunks or "lines" at both ends to represent a visual data scroll.
If the "lines" are defined as the characters that fit the width of the visual display (such as a console window), then you might just keep loading the next x bytes/characters and removing the previous x bytes/characters, as in the sketch below.
Otherwise, you may need to scan ahead and build some metadata about the file, noting down the positions of lines or other interesting structures within it. You can then use this metadata to navigate the file quickly; a sketch follows.
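A hedged sketch of that metadata idea for variable-length lines: one sequential scan records each line's start offset, after which any line can be reached with a single seek (names are illustrative):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class LineIndex {
    private final List<Long> lineStarts = new ArrayList<>();
    private final RandomAccessFile raf;

    LineIndex(String fileName) throws IOException {
        raf = new RandomAccessFile(fileName, "r");
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(fileName))) {
            long offset = 0;
            lineStarts.add(0L);                      // line 0 starts at the top of the file
            for (int b; (b = in.read()) != -1; ) {
                offset++;
                if (b == '\n') lineStarts.add(offset);
            }
        }
    }

    // Reads the 0-based n-th line by seeking straight to its recorded offset.
    String line(int n) throws IOException {
        raf.seek(lineStarts.get(n));
        return raf.readLine();
    }
}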

Related

Using java.io.RandomAccessFile, how do I write a file and keep adding content to the beginning?

How do I write to a file from the beginning using RandomAccessFile? I am writing into a file in 3 MB chunks until it hits 100 MB, for benchmarking.
You have to move the content already written.
Imagine the hard disk as a Lego base plate.
You can start at one edge to put blocks one after the other. The blocks are the chunks of data you want to write and any consecutive line of blocks is a "file".
However, if you want to put something at the beginning of the "file", you have to take off the blocks already there, put the new block in their position, and put the old blocks back behind it.
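In code, that "take off and put back" amounts to growing the file and shifting the existing bytes toward the end, copying tail-first in chunks so nothing is overwritten before it has been read; calling it with pos = 0 gives the prepend asked about here. A hedged sketch (method and names are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;

// Inserts 'data' at 'pos' by shifting everything from 'pos' onwards toward
// the end of the file, working backwards so unread bytes are never clobbered.
static void insertAt(RandomAccessFile raf, long pos, byte[] data) throws IOException {
    long oldLength = raf.length();
    raf.setLength(oldLength + data.length);          // grow the file first
    byte[] buffer = new byte[8192];
    long remaining = oldLength - pos;                // bytes that must move
    while (remaining > 0) {
        int chunk = (int) Math.min(buffer.length, remaining);
        long readPos = pos + remaining - chunk;      // read a chunk from the tail...
        raf.seek(readPos);
        raf.readFully(buffer, 0, chunk);
        raf.seek(readPos + data.length);             // ...and write it shifted down
        raf.write(buffer, 0, chunk);
        remaining -= chunk;
    }
    raf.seek(pos);
    raf.write(data);                                 // finally write the new content
}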

How to append and delete data in the middle of a text file, shifting the remaining file content using java code?

I am using RandomAccessFile, but writeBytes() overwrites the existing content. Mostly, I want to implement this without using any new temporary file; method names, a few clues, or techniques will do.
To insert without the use of any temporary file, you'll have to find a way to read in and shift all subsequent lines down by one when you insert a new line.
The sequential storage of lines of text poses the same kind of issues that the use of a standard array (vs. a LinkedList) does: you can't just unhook a pointer, plop in some text, and then hook up the pointer of the next item to point to a subsequent line in the file. You have to perform the entire shift.
So I'd guess that you'd want to go to the end of the file, shift it down by a line, then move up a line and perform the same shift, repeating until you hit the position at which you want to insert the new line, at which point you'll have cleared a space for it.
This seems very performance inefficient.
Hopefully someone knows of a better way, but this would be my approach, were a temporary file not an option.
(Alternately, you could also always just read the whole thing into a StringBuffer if it were small enough, perform the insert within that, and then write the file back out from scratch, as sketched below; but I imagine that you've considered that option already.)
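A short sketch of that last alternative, assuming the file fits in memory (offset and text are illustrative; a StringBuilder stands in for the StringBuffer mentioned above):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Read-modify-rewrite: only sensible for files small enough to hold in memory.
static void insertIntoSmallFile(Path file, int offset, String text) throws IOException {
    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder(content);
    sb.insert(offset, text);                         // perform the insert in memory
    Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8));
}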

How to change specific part of a file using java?

I was writing a program that implements a dictionary.
Actually, what I did was just write a Java applet to show the words, which are defined in a .xml file. I did that with the org.w3c.dom package.
Now I want to add a new feature so that users can modify a word in the dictionary in the program, and the modification will then be saved to the original .xml file.
Here is my question: what should I do to save the changes? Note that users can only modify one word at a time, so I don't want to load the whole file, modify the relevant part, and re-write the whole file to disk. Is there a novel way to do that?
An XML file is a sequential text file. This means that there is no formula or other convenient way to locate the n-th word in a dictionary stored in XML. Elements need to be written one after the other, character by character (and one character may or may not encode to a single byte). Thus, what is called a random update is out.
Look at JAXB for a most convenient way to read and write XML, and invest some work so that a user cannot update in memory and terminate the program without saving.
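A minimal JAXB round trip might look like the sketch below, assuming the javax.xml.bind API is on the classpath; the Dictionary/Entry classes are hypothetical stand-ins for the question's actual XML schema:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;
import java.util.List;

@XmlRootElement
class Dictionary {
    public List<Entry> entries;
}

class Entry {
    public String word;
    public String definition;
}

class DictionaryIO {
    static Dictionary load(File xml) throws JAXBException {
        return (Dictionary) JAXBContext.newInstance(Dictionary.class)
                                       .createUnmarshaller()
                                       .unmarshal(xml);
    }

    // Rewrites the whole file; as noted above, a partial in-place XML update is out.
    static void save(Dictionary dict, File xml) throws JAXBException {
        JAXBContext.newInstance(Dictionary.class).createMarshaller().marshal(dict, xml);
    }
}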
Reading and writing files in specific formats is a little bit trickier than you portray.
Seen with "XML eyes", you are only changing a portion of the file - but to do that at the file level you need to seek to the position of the change and write new bytes from there. The problem with that is that the content after that position won't adjust to accommodate the new portion you write.
TL;DR - no - you need to read+write the complete XML file when making changes.

Java File Marker API

I have a need for my application to be able to read large (very large, 100 GB+) text files and process the content in these files, potentially at different times. For instance, it might run for an hour and finish processing a few GBs, and then I shut it down and come back to it a few days later to resume processing the same file.
To do this I will need to read the files in memory-friendly chunks; each chunk/page/block/etc. will be read in, one at a time, and processed before the next chunk is read into memory.
I need the program to be able to mark where it is inside the input file, so if it shuts down, or if I need to "replay" the last chunk being processed, I can jump right to the point in the file where I am and continue processing. Specifically, I need to be able to do the following things:
When the processing begins, scan the file for a "MARKER" (some marker that indicates where we left off processing last time)
If the MARKER exists, jump to it and begin processing from that point
Else, if the MARKER doesn't exist, then place a MARKER after the first chunk (for now, let's say that a "chunk" is just a line-of-text, as BufferedReader#readLine() would read in) and begin processing the first chunk/line
For each chunk/line processed, move the MARKER after the next chunk (thus, advancing the MARKER further down the file)
If we reach a point where there are no more chunks/lines after the current MARKER, we've finished processing the file
I tried coding this up myself and noticed that BufferedReader has some interesting methods on it that sound like they're suited for this very purpose: mark(), reset(), etc. But the Javadocs on them are kind of vague, and I'm not sure that these "File Marker" methods will accomplish all the things I need. I'm also completely open to a 3rd-party JAR/lib that has this capability built in, but Google didn't turn anything up.
Any ideas here?
Forget about markers. You cannot "insert" text without rewriting the whole file.
Use a RandomAccessFile and store the current position you are reading. When you need to open the file again, just use seek() to find the position.
A Reader's "mark" is not persistent; it solely forms part of the state of the Reader itself.
I suggest that you not store the state information in the text file itself; instead, have a file alongside which stores the byte offset of the most recently processed chunk. That'll eliminate the obvious problems involving overwriting data in the original text file.
The mark of the buffered reader is not persisted across different runs of your application. Nor would I change the content of that huge file to mark a position, since this can lead to significant I/O and/or filesystem fragmentation, depending on your OS.
I would use a properties file to store the configuration of the program externally. Have a look at the documentation; the API is straightforward:
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
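Putting those two answers together, a hedged sketch that checkpoints the RandomAccessFile position into a side properties file and seeks back to it on restart; the file names and the process() hook are illustrative, and in practice you would likely checkpoint less often than every line:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Properties;

public class ResumableProcessor {
    public static void main(String[] args) throws IOException {
        File stateFile = new File("progress.properties");
        Properties state = new Properties();
        if (stateFile.exists()) {
            try (FileInputStream in = new FileInputStream(stateFile)) {
                state.load(in);                      // restore the saved offset
            }
        }
        long offset = Long.parseLong(state.getProperty("offset", "0"));

        try (RandomAccessFile raf = new RandomAccessFile("huge-input.txt", "r")) {
            raf.seek(offset);                        // jump straight to where we left off
            String line;
            while ((line = raf.readLine()) != null) {
                process(line);
                state.setProperty("offset", Long.toString(raf.getFilePointer()));
                try (FileOutputStream out = new FileOutputStream(stateFile)) {
                    state.store(out, "resume offset"); // checkpoint after each chunk
                }
            }
        }
    }

    static void process(String line) { /* illustrative processing hook */ }
}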

How to read arbitrary but continuous n lines from a huge file

I would like to read an arbitrary number of lines. The files are normal ASCII text files for the moment (they may be UTF-8/multibyte character files later).
So what I want is a method that reads only specific lines from a file (for example, lines 101-200), and while doing so it should not block anything (i.e. the same file can be read by another thread for lines 201-210, and that read should not wait for the first reading operation).
In the case that there are no more lines to read, it should gracefully return whatever it could read. The output of the method could be a List<String>.
The solution I thought of so far is to read the entire file first to find the number of lines as well as the byte positions of each newline character, then use RandomAccessFile to read bytes and convert them to lines. I have to convert the bytes to Strings (but that can be done after the reading is done). I would avoid end-of-file exceptions for reading beyond the file by proper bookkeeping. The solution is a bit inefficient, as it goes through the file twice, but the file size can be really big and we want to keep very little in memory.
If there is a library for such thing that would work, but a simpler native java solution would be great.
As always I appreciate your clarification questions and I will edit this question as it goes.
Why not use Scanner and just loop through hasNextLine() until you get to the line you want, then grab as many lines as you wish? If it runs out, it'll fail gracefully. That way you're only reading the file once (unless Scanner reads it fully; I've never looked under the hood, but it doesn't sound like you care, so... there you go :). Something like the sketch below.
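A minimal sketch of that Scanner approach; the method and parameter names are illustrative:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Skips to line 'first' (1-based), then collects up to 'count' lines,
// returning fewer if the file runs out.
static List<String> readLines(File file, int first, int count) throws FileNotFoundException {
    List<String> result = new ArrayList<>();
    try (Scanner scanner = new Scanner(file)) {
        for (int i = 1; i < first && scanner.hasNextLine(); i++) {
            scanner.nextLine();                      // skip lines before the window
        }
        while (result.size() < count && scanner.hasNextLine()) {
            result.add(scanner.nextLine());
        }
    }
    return result;
}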
If you want to minimise memory consumption, I would use a memory mapped file. This uses almost no heap. The amount of the file kept in memory is handled by the OS so you don't need to tune the behaviour yourself.
FileChannel fc = new FileInputStream(fileName).getChannel();
final MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
If you have a file of 2 GB or more, you need multiple mappings. In the simplest case you can scan the data and remember all the indexes. The indexes themselves could take lots of space, so you might only remember every Nth, e.g. every tenth.
E.g. a 2 GB file with 40-byte lines could have 50 million lines, requiring 400 MB of memory for an index of 8-byte offsets.
Another way around having a large index is to create another memory mapped file.
FileChannel fc = new RandomAccessFile(fileName, "rw").getChannel();
final MappedByteBuffer map2 = fc.map(FileChannel.MapMode.READ_WRITE, 0, fc.size()/10);
The problem being, you don't know how big the file needs to be before you start. Fortunately if you make it larger than needed, it doesn't consume memory or disk space, so the simplest thing to do is make it very large and truncate it when you know the size it needs to be.
This could also be used to avoid re-indexing the file each time you load it (only re-index when it has changed). If the file is only appended to, you could index from the end of the file each time.
Note: this approach can use a lot of virtual memory. For a 64-bit JVM this is no problem, as your limit is likely to be 256 TB. For a 32-bit application, your limit is likely to be 1.5-3.5 GB, depending on your OS.
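Tying those pieces together, a hedged sketch of the every-Nth index over a single-mapping (sub-2 GB) file; the class name and stride are illustrative, and ASCII lines are assumed, as in the question:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

class SparseLineIndex {
    static final int STRIDE = 10;                    // remember every 10th line start
    private final List<Long> starts = new ArrayList<>();
    private final RandomAccessFile raf;

    SparseLineIndex(String fileName) throws IOException {
        raf = new RandomAccessFile(fileName, "r");
        FileChannel fc = raf.getChannel();
        MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        long line = 0;
        starts.add(0L);                              // line 0 starts at offset 0
        for (int pos = 0; pos < map.limit(); pos++) {
            if (map.get(pos) == '\n' && ++line % STRIDE == 0) {
                starts.add((long) pos + 1);
            }
        }
    }

    // Returns the 0-based n-th line: seek to the nearest indexed start,
    // then skip at most STRIDE - 1 lines forward.
    String line(long n) throws IOException {
        raf.seek(starts.get((int) (n / STRIDE)));
        for (long i = 0; i < n % STRIDE; i++) {
            raf.readLine();
        }
        return raf.readLine();
    }
}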
