Read and write a file in reverse order - Java

I have a very big file (it might even be 1 GB) and I want to create a new file from it with the lines in reverse order (in Java).
For example:
Original file:
This is the first line
This is the 2nd line
This is the 3rd line
The reversed file:
This is the 3rd line
This is the 2nd line
This is the first line
Since the file is very big, loading the entire file to memory at once and reversing the order there might be problematic (there is a limit to the memory I can use).
How can I achieve this in Java?
Thanks

Nothing very direct, I'm afraid. But you can easily create a (say) ReverseBufferedRead class wrapping a RandomAccessFile.

Read the file in chunks of a few hundred lines, reverse the order of lines within each chunk and write them to temporary files. Then join the temporary files in reverse order and clean up.
In other words, use disk instead of memory.
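A rough sketch of that approach, assuming '\n'-terminated lines; the file names (in.txt, out.txt) and chunk size are placeholders:
import java.io.*;
import java.util.*;

public class ReverseViaTempFiles {
    public static void main(String[] args) throws IOException {
        final int CHUNK = 10000;                         // lines per temporary file
        List<File> temps = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("in.txt"))) {
            List<String> lines = new ArrayList<>(CHUNK);
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
                if (lines.size() == CHUNK) {
                    temps.add(writeReversedChunk(lines));
                    lines.clear();
                }
            }
            if (!lines.isEmpty()) {
                temps.add(writeReversedChunk(lines));
            }
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
            Collections.reverse(temps);                  // last chunk comes first
            for (File temp : temps) {
                try (BufferedReader in = new BufferedReader(new FileReader(temp))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine();
                    }
                }
                temp.delete();                           // clean up
            }
        }
    }

    // writes the lines of one chunk to a temporary file in reverse order
    private static File writeReversedChunk(List<String> lines) throws IOException {
        File temp = File.createTempFile("reverse-chunk", ".txt");
        try (BufferedWriter out = new BufferedWriter(new FileWriter(temp))) {
            for (int i = lines.size() - 1; i >= 0; i--) {
                out.write(lines.get(i));
                out.newLine();
            }
        }
        return temp;
    }
}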

I would propose making a RandomAccessFile for the output and using setLength() to make it appropriately sized.
Then start scanning the original file and write it out in chunks starting at the end of the RandomAccessFile in reverse.
Java-ish Pseudo:
RandomAccessFile out = new RandomAccessFile("out_fname", "rw");
out.setLength(size_of_file_to_be_reversed);
out.seek(out.length()); // seek to end
File in = new File("in_fname");
while (hasMoreData(in)) {
    String chunk = in.readsize();
    out.seekBackwardsBy(chunk.length());
    out.write(chunk.reverse());
    out.seekBackwardsBy(chunk.length());
}

Reading a file line-by-line in reverse order is fundamentally tricky.
It's not too bad if you've got a fixed width encoding. It's feasible if you've got a variable width encoding which you can detect the first byte of etc (e.g. UTF-8). It's virtually impossible to do efficiently if the encoding is variable width with no sensible way of determining boundaries (or if it uses "shifting" for example).
I have an implementation in C# in another question, but it would take a fair amount of effort to port that to Java.

If you use a RandomAccessFile as leonbloy suggested, you can use its FileChannel to skip to the end of the file; you can then read lines backwards and write them to another file.
There is a simple example of this in the Java tutorials.
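A minimal sketch of that idea, assuming a '\n'-delimited ASCII/UTF-8 file and placeholder file names: it reads the file backwards in fixed-size blocks through the FileChannel and emits the lines in reverse order (a trailing newline in the input will show up as a leading blank line in the output).
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class ReverseLines {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("in.txt", "r");
             BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
            FileChannel ch = raf.getChannel();
            long pos = ch.size();
            byte[] carry = new byte[0];                 // start of a line split across blocks
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (pos > 0) {
                int len = (int) Math.min(buf.capacity(), pos);
                pos -= len;
                buf.clear().limit(len);
                while (buf.hasRemaining()) {            // read the block that ends at the old position
                    ch.read(buf, pos + buf.position());
                }
                byte[] block = new byte[len + carry.length];
                System.arraycopy(buf.array(), 0, block, 0, len);
                System.arraycopy(carry, 0, block, len, carry.length);
                int end = block.length;
                for (int i = block.length - 1; i >= 0; i--) {
                    if (block[i] == '\n') {             // emit the line that follows this newline
                        out.write(new String(block, i + 1, end - (i + 1), StandardCharsets.UTF_8));
                        out.newLine();
                        end = i;
                    }
                }
                carry = new byte[end];                  // keep the unterminated prefix for the next block
                System.arraycopy(block, 0, carry, 0, end);
            }
            out.write(new String(carry, StandardCharsets.UTF_8)); // first line of the original file
            out.newLine();
        }
    }
}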

I would assume you know how to read a file. One way I would advise you do it is with an ArrayList of generic type String: read each line of the file and store it in that list. After reading, you can reverse the list and print it out or do whatever you want with it.
Just wrote something that might be of help here : http://pastebin.com/iWTVrAvm
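A minimal sketch of that in-memory approach using java.nio.file.Files, with hypothetical file names (note that it loads the whole file at once, which may not fit the memory limit mentioned in the question):
List<String> lines = Files.readAllLines(Paths.get("in.txt"), StandardCharsets.UTF_8);
Collections.reverse(lines);                              // in-place reversal of the line order
Files.write(Paths.get("out.txt"), lines, StandardCharsets.UTF_8);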

Read using RandomAccessFile - position within the file using randomAccessFile.length() - and write using BufferedWriter.

A better solution is to use the ReversedLinesFileReader provided in the Apache Commons IO package. Look at the API here: https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReversedLinesFileReader.html
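A brief usage sketch (assumes commons-io is on the classpath; the file names are placeholders):
try (ReversedLinesFileReader reader = new ReversedLinesFileReader(new File("in.txt"), StandardCharsets.UTF_8);
     BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {   // readLine() returns lines from the bottom up
        out.write(line);
        out.newLine();
    }
}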

Related

Java processing lines in file and data structures

I have read a bit about multidimensional arrays; would it make sense to solve this problem using such a data structure in Java, or how should I proceed?
Problem
I have a text file containing records which contain multiple lines. One record is anything between <SUBBEGIN and <SUBEND.
The lines in a record follow no predefined order and may be absent from a record. In the input file (see below) I am only interested in the MSISDN, CB, CF and ODBIC fields.
For each of these fields I would like to apply a regular expression to extract the value to the right of the equals sign.
The output file would be a comma-separated file containing these values. For example, for MSISDN=431234567893 the value 431234567893 is written to the output file.
Error checking:
NoMSISDNnofound when no MSISDN is found in a record
noCFUALLPROVNONE when no CFU-ALL-PROV-NONE is found in a record
Search and replace operations
CFU-ALL-PROV-NONE should be replaced by CFU-ALL-PROV-1/1/1
CFU-TS10-ACT-914369223311 should be replaced by CFU-TS10-ACT-1/1/0/4369223311
Output for first record
431234567893,BAOC-ALL-PROV,BOIC-ALL-PROV,BOICEXHC-ALL-PROV,BICROAM-ALL-PROV,CFU-ALL-PROV-1/1/1,CFB-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFU-TS10-ACT-1/1/1/4369223311,BAIC,BAOC
Input file
<BEGINFILE>
<SUBBEGIN
IMSI=11111111111111;
MSISDN=431234567893;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
IMEISV=4565676567576576;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-
YES-YES-NO;
ODBIC=BAIC;
ODBOC=BAOC;
ODBROAM=ODBOHC;
ODBPRC=ENTER;
ODBPRC=INFO;
ODBPLMN=NONE;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=YES;
ODBADULTSMS=YES;
<SUBEND
<SUBBEGIN
IMSI=11111111111133;
MSISDN=431234567899;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO+-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-NO-NONE-YES-65535-YES-YES-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFD-TS10-REG-91430000000-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-YES-YES-NO;
ODBIC=BICCROSSDOMESTIC;
ODBOC=BAOC;
ODBROAM=ODBOH;
ODBPRC=INFO;
ODBPLMN=PLMN1
ODBPLMN=PLMN3;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=NO;
ODBADULTSMS=YES;
<SUBEND
From what I understand, you are simply reading a text file and processing it and maybe replacing some words. You do not therefore need a data structure to store the words in. Instead you can simply read the file line by line and pass it through a bunch of if statements (maybe a couple booleans to check if the specific parameters you are searching for have been found?) and then rewrite the line you want to a new file.
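A rough sketch of that line-by-line approach; the file names are placeholders, the continuation of the wrapped CF value is not handled, and the question's exact trimming and error-checking rules are left as comments:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordParser {
    // fields of interest: capture everything between '=' and ';'
    private static final Pattern FIELD = Pattern.compile("^(MSISDN|CB|CF|ODBIC)=([^;]*)");

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("output.csv"))) {
            List<String> values = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("<SUBBEGIN")) {
                    values.clear();                        // start of a new record
                } else if (line.startsWith("<SUBEND")) {
                    out.println(String.join(",", values)); // end of record: write one CSV line
                } else {
                    Matcher m = FIELD.matcher(line.trim());
                    if (m.find()) {
                        // example replacement from the question; the other rules would go here
                        values.add(m.group(2).replace("CFU-ALL-PROV-NONE", "CFU-ALL-PROV-1/1/1"));
                    }
                }
            }
        }
    }
}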
Dealing with big files to feed data into machine learning algorithms, I did it by reading all of the file contents into a variable and then using the String.split(delimiter) method (available since Java 1.4) to break the contents into a one-dimensional array, where each cell holds the text before the delimiter.
First read the file via a Scanner or your own way of doing it (let content be the variable with your info), and then break it with
content.split("<SUBEND");

performance and size limitations on HttpServletResponse.getOutputStream.print(string) vs getWriter(String)

For a web project I'm writing large sections of text to a webpage (table), or even bigger amounts (could be several MB) to CSV files for download.
The Java method dealing with this receives a StringBuilder content string, which originally (by the creator of this module) was being sent char by char in a loop:
response.getOutputStream().write(content.charAt(i));
Upon questioning the loop, the reason given was that he thought the string might be too big to write in one go (using Java 1.6).
I can't find any size restrictions anywhere, and then also the question came which method to use instead: print() or getWriter()?
The data in the string is all text.
He assumed wrong. If anything it's inefficient, or at least useless, to do that one character at a time. If you have a String in memory, you can write it out in one go without worrying.
If you're only writing text, use a Writer. OutputStream is for binary data (although you can wrap it in an OutputStreamWriter to convert between the two). See Writer or OutputStream?
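A minimal sketch of writing the whole thing with the response's Writer; the content type and charset here are assumptions:
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.write(content.toString()); // one call instead of a per-character loop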

Reading ahead with BufferedReader (Java)

I'm writing a parser for files that look like this:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
I want to get information preceded by certain tags (DEFINITION, VERSION etc.) but some descriptions cover multiple lines and I do need all of it. This is a problem when using BufferedReader to read my file.
I almost figured it out by using mark() and reset() but when executing my program I noticed that it only works for one tag and other tags are somehow skipped. This is the code I have so far:
Pattern pTag = Pattern.compile("^[A-Z]{2,}"); // regex: 2 or more uppercase letters is a tag
Matcher mTagCurr = pTag.matcher(line);
if (mTagCurr.find()) {
    reader.mark(1000);
    String nextLine = reader.readLine();
    Matcher mTagNext = pTag.matcher(nextLine);
    if (mTagNext.find()) {
        reader.reset();
        continue;
    }
    Pattern pWhite = Pattern.compile("^\\s{6,}");
    Matcher mWhite = pWhite.matcher(nextLine);
    while (mWhite.find()) {
        line = line.concat(nextLine);
    }
    System.out.println(line);
}
This piece of code is supposed to find tags and concatenate descriptions that cover more than one line. Some answers I found here advised using Scanner. This is not an option for me. The files I work with can be very large (largest I encountered was >50GB) and by using BufferedReader I wish to put less of a strain on my system.
I suggest accumulating the information you get as you read it in a single-pass parser. This will be simpler and faster in this case, I suspect.
BTW, you want to cache your Patterns, as creating them is quite expensive. You may find that you want to avoid using them entirely in some cases.
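A minimal single-pass sketch of that idea (the file name is a placeholder, and the printing is just an example of accumulating the values): a line that starts with a tag begins a new entry, and indented continuation lines are appended to the entry currently being built.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagParser {
    // cached once, as suggested above: a tag followed by whitespace and its value
    private static final Pattern TAG = Pattern.compile("^([A-Z]{2,})\\s+(.*)");

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("sequence.gb"))) {
            String currentTag = null;
            StringBuilder value = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = TAG.matcher(line);
                if (m.find()) {                        // a new tag starts here
                    if (currentTag != null) {
                        System.out.println(currentTag + ": " + value);
                    }
                    currentTag = m.group(1);
                    value = new StringBuilder(m.group(2));
                } else if (currentTag != null) {       // continuation line
                    value.append(' ').append(line.trim());
                }
            }
            if (currentTag != null) {                  // flush the last entry
                System.out.println(currentTag + ": " + value);
            }
        }
    }
}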
The code starts by finding a continuation line and calling reset() if it does not find it, but the code that reads additional lines does not seem to do that. Could it be reading the start of another section in the Genbank file and not putting it back? I don't see all the loop control code here, but what I do see appears to be correct.
If all else fails and you need something easy, there's always BioJava (see How to Read a Genbank File with Biojava3 and see if it helps). I have tried to use BioJava for my own projects, but it always falls a little short.
When I have written FASTA and FASTQ parsers, I read into a byte or char buffer and process it that way, but there is more buffer management code to write. That way, I don't have to worry about putting bytes back in a buffer. This can also avoid regex, which can be expensive in a time-critical application. Of course, this takes more time to implement.
Tip: for the fastest implementation, if you are managing the buffer yourself, check out NIO (Java NIO Tutorial). I have seen it give up to a 10x speedup in some cases (writing data). The only drawback is that I have not found an easy way to read gzipped sequence data with NIO yet.

Reading huge ascii text file quickly in Java. Need help using MappedByteBuffer

I have a text file with thousands of lines of data like the following:
38.48,88.25
48.20,98.11
100.24,181.39
83.01,97.33
... and the list keeps going (thousands of lines just like that).
I figured out how to separate this data into usable tokens using FileReader and Scanner but this method is far too slow.
I created the following delimiter:
src.useDelimiter(",|\n");
and then used the scanner class nextDouble() to get each piece of data.
I have done a lot of research and it looks like the solution is to use a MappedByteBuffer to place the data into memory and access it there. The problem is I don't know how to use MappedByteBuffer to separate this data into usable tokens.
I found this site: http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html - which helps me to map the file into memory and it explains how to read the file but it looks like the data is returned as a byte or perhaps in binary form? The file I am trying to access is ascii and I need to be able to read the data as ascii as well. Can anyone explain how to do that? Is there a way to scan a file mapped into memory in the same way that I have done using scanner with the previous FileReader method? Or is there another method that would be faster? My current method takes nearly 800x the amount of time that it should take.
I know some may say I am trying to reinvent the wheel but this is for academic purposes and thus, I am not allowed to use external libraries.
Thank you!
To get the data loaded into memory you can use the Scanner in the same way you did earlier, then store each row on a list like the following.
List<Pair> data = new ArrayList<Pair>();
Where Pair is defined as
class Pair {
    private final double first;
    private final double second;

    public Pair(double first, double second) {
        this.first = first;
        this.second = second;
    }
    ....
}
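A brief usage sketch with the Scanner set up as in the question; the file name is a placeholder, it assumes Unix '\n' line endings and an even number of values, and nextDouble() is locale-sensitive, so a locale using '.' as the decimal separator is assumed:
Scanner src = new Scanner(new File("data.txt"));
src.useDelimiter(",|\n");
List<Pair> data = new ArrayList<Pair>();
while (src.hasNextDouble()) {
    double first = src.nextDouble();
    double second = src.nextDouble();
    data.add(new Pair(first, second));   // one row of the file
}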
MappedByteBuffer is a subclass of ByteBuffer on which you can call asCharBuffer. That returns a CharBuffer which implements Readable, which can then be supplied to Scanner.
That way you can use Scanner on the file via MappedByteBuffer. Whether that makes it perform any faster I don't know.
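A hedged sketch of that route, decoding the mapped bytes explicitly as ASCII (rather than with asCharBuffer(), which would treat the bytes as UTF-16) and handing the resulting CharBuffer to Scanner; the file name and delimiter mirror the question, and note that decode() allocates chars for the whole file:
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("data.txt"), StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            CharBuffer chars = StandardCharsets.US_ASCII.decode(mapped); // decode the whole mapping
            Scanner s = new Scanner(chars);       // works because CharBuffer implements Readable
            s.useDelimiter(",|\n");
            while (s.hasNextDouble()) {
                double x = s.nextDouble();
                double y = s.nextDouble();
                // use x and y here
            }
        }
    }
}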

What is the fastest method for reading from a text file in Java?

I currently use:
BufferedReader input = new BufferedReader(new FileReader("filename"));
Is there a faster way?
While what you've got isn't necessarily the absolute fastest, it's simple. In fact, I wouldn't use quite that form - I'd use something which allows me to specify a charset, e.g.
// Why is there no method to give this guaranteed charset
// without "risk" of exceptions? Grr.
Charset utf8 = Charset.forName("UTF-8");
BufferedReader input = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("filename"),
        utf8));
You can probably make it go faster using NIO, but I wouldn't until I'd seen an actual problem. If you see a problem, but you're doing other things with the data, make sure they're not the problem first: write a program to just read the text of the file. Don't forget to do whatever it takes on your box to clear file system caches between runs though...
If it's /fast/ you want, keep the character data in encoded form (and I don't mean UTF-16). Although disc I/O is generally slow (unless it's cached), decoding and keeping twice the data can also be a problem. Although the fastest to load is probably through java.nio.channels.FileChannel.map(MapMode.READ_ONLY, ...), that has severe problems with deallocation.
Usual caveats apply.
Look into java.nio.channels.FileChannel.
Have you benchmarked your other options? I imagine that not using a BufferedReader may be faster in some cases - like extremely small files. I would recommend that you at the very least do some small benchmarks and find the fastest implementation that works with your typical use cases.
Depends on what you want to read. The complete file, or from a specific location? Do you need to be able to search through it, or do you want to read the complete text in one go?
File file = new File("querySourceFileName");
Scanner s = new Scanner(file);
while (s.hasNextLine()) {
    System.out.println(s.nextLine());
}
