I'm writing a parser for files that look like this:
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
I want to get the information preceded by certain tags (DEFINITION, VERSION, etc.), but some descriptions span multiple lines and I need all of it. This is a problem when using BufferedReader to read my file.
I almost figured it out using mark() and reset(), but when executing my program I noticed that it only works for one tag; other tags are somehow skipped. This is the code I have so far:
Pattern pTag = Pattern.compile("^[A-Z]{2,}"); // regex: 2 or more uppercase letters is a tag
Matcher mTagCurr = pTag.matcher(line);
if (mTagCurr.find()) {
    reader.mark(1000);
    String nextLine = reader.readLine();
    Matcher mTagNext = pTag.matcher(nextLine);
    if (mTagNext.find()) {
        reader.reset();
        continue;
    }
    Pattern pWhite = Pattern.compile("^\\s{6,}");
    Matcher mWhite = pWhite.matcher(nextLine);
    while (mWhite.find()) {
        line = line.concat(nextLine);
    }
    System.out.println(line);
}
This piece of code is supposed to find tags and concatenate descriptions that cover more than one line. Some answers I found here advised using Scanner, but that is not an option for me. The files I work with can be very large (the largest I have encountered was over 50 GB), and by using BufferedReader I hope to put less strain on my system.
I suggest accumulating the information as you read it, in a single-pass parser. I suspect this will be simpler and faster in this case.
BTW, you want to cache your Patterns, as creating them is quite expensive. You may find that you want to avoid using them entirely in some cases.
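A minimal single-pass sketch along those lines, reusing the question's two patterns as cached constants (the file handling and record output are illustrative, not a full GenBank parser):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class GenbankTags {
    // Compile once and reuse; creating Patterns repeatedly is expensive.
    private static final Pattern TAG = Pattern.compile("^[A-Z]{2,}");
    private static final Pattern CONTINUATION = Pattern.compile("^\\s{6,}");

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            StringBuilder current = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (TAG.matcher(line).find()) {
                    if (current.length() > 0) {
                        System.out.println(current); // flush the previous record
                    }
                    current.setLength(0);
                    current.append(line);
                } else if (CONTINUATION.matcher(line).find()) {
                    current.append(' ').append(line.trim()); // multi-line description
                }
            }
            if (current.length() > 0) {
                System.out.println(current);
            }
        }
    }
}

No mark()/reset() is needed: every line is read exactly once and either starts a new record or is appended to the current one.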
The code starts by finding a continuation line and calling reset() if it does not find it, but the code that reads additional lines does not seem to do that. Could it be reading the start of another section in the Genbank file and not putting it back? I don't see all the loop control code here, but what I do see appears to be correct.
If all else fails and you need something easy, there's always BioJava (see How to Read a Genbank File with Biojava3). I have tried to use BioJava for my own projects, but it always falls a little short.
When I have written FASTA and FASTQ parsers, I read into a byte or char buffer and process it that way, though there is more buffer-management code to write. That way, I don't have to worry about putting bytes back into a buffer. This also avoids regex, which can be expensive in a time-critical application. Of course, this takes more time to implement.
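As a rough illustration of that style (a sketch, assuming FASTA input where a '>' at the start of a line marks a record header):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class FastaCount {
    public static void main(String[] args) throws IOException {
        try (Reader in = new FileReader(args[0])) {
            char[] buf = new char[1 << 16];
            long records = 0;
            boolean atLineStart = true;
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buf[i];
                    if (atLineStart && c == '>') records++; // header line found
                    atLineStart = (c == '\n');
                }
            }
            System.out.println(records + " records");
        }
    }
}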
Tip: for the fastest implementation, if you are managing the buffer yourself, check out NIO (Java NIO Tutorial). I have seen it give up to a 10x speedup in some cases (writing data). The only drawback is that I have not found an easy way to read gzipped sequence data with NIO yet.
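A bare-bones NIO read loop might look like this (a sketch using the Java 7+ FileChannel.open, not a tuned implementation):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MB direct buffer
            long total = 0;
            while (ch.read(buf) != -1) {
                buf.flip();
                total += buf.remaining(); // process the bytes here
                buf.clear();
            }
            System.out.println(total + " bytes");
        }
    }
}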
Related
I'm a bit confused and hope someone can help. I tried searching the web but didn't find a satisfying answer.
Why don't we simply use System.in.somemethod() to take input in Java, just like we use System.out.println() for output? Why is there such a long process for input?
The only methods that System.in, an InputStream, provides are the overloads of read. Sure, you could do something like:
byte[] bytes = new byte[5];
System.in.read(bytes);
System.out.println(Arrays.toString(bytes));
to read five bytes from the console. But this has the following disadvantages:
You need to handle the checked IOException (not shown in the code snippet above).
It's hard to work with raw bytes (unless you want them specifically).
You usually just want to read the input until the end of a line, and with this it's hard to know where the end of a line is.
So that's why people use Scanner to wrap the System.in stream into something more user-friendly.
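For comparison, the Scanner version of reading one line is a single call:

import java.util.Scanner;

Scanner scanner = new Scanner(System.in);
String line = scanner.nextLine(); // buffering, decoding, and line detection handled for you
System.out.println("You typed: " + line);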
Taking input from the command line will always be trickier than just outputting data. This is because there is no way to know that the input is semantically correct, structured correctly or even syntactically correct.
If you just want to read bytes from System.in, then a lot of the uncertainty of the input disappears. In that case there are only two things to take into account: I/O errors and end-of-input, both of which are also present for System.out. The only other thing that may be tricky is that InputStream may not return all the bytes that are requested in a single call to read.
So reading data from System.in isn't hard; interpreting the data - which often comes down to parsing the data or validating the data - is the hard part. And that's why often the Scanner class is used to make sense of the input.
Just as you cannot use System.out.somemethod() instead of System.out.println() in the same way you cannot use System.in.somemethod() instead of System.in.read().
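To illustrate the partial-read point above: if you do read raw bytes, you need a loop to guarantee you actually get all of them (a minimal sketch):

import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    // A single read() may return fewer bytes than requested, so loop until full.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n == -1) throw new IOException("unexpected end of input");
            off += n;
        }
    }
}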
For a web project I'm writing large sections of text to a webpage (a table), or even bigger chunks (possibly several MB) to CSV files for download.
The Java method dealing with this receives the content as a StringBuilder, which originally (by the creator of this module) was being written char by char in a loop:
response.getOutputStream().write(content.charAt(i));
When questioned about the loop, the reason he gave was that he thought the string might be too big to write in one go (this is on Java 1.6).
I can't find any size restrictions anywhere, which raises the question of which method to use instead: print() or getWriter()?
The data in the string is all text.
He assumed wrong. If anything it's inefficient, or at least pointless, to write one character at a time. If you have the String in memory, you can write it out in one go without worrying.
If you're only writing text, use a Writer. OutputStream is for binary data (although you can wrap it in an OutputStreamWriter to convert between the two). See Writer or OutputStream?
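So, using the response and content from the question, the whole loop collapses to a single call:

// Write all the text at once; the writer handles the character encoding.
response.getWriter().write(content.toString());

(Note that within one request you must stick to either getWriter() or getOutputStream(); calling both on the same response is an error in the Servlet API.)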
I'm writing a routine that takes a string and formats it as quoted-printable, and it has to be as fast as possible. My first attempt copied characters from one StringBuffer to another, encoding and line-wrapping along the way. Then I thought it might be quicker to just modify the original StringBuffer rather than copy all that mostly identical data. It turns out the inserts are far worse than the copying: the second version (with the StringBuffer inserts) was 8 times slower, which makes sense, as it must be moving a lot of memory.
What I was hoping for was some kind of gap-buffer data structure, so the inserts wouldn't involve physically moving all the characters in the rest of the StringBuffer.
So any suggestions about the fastest way to rip through a string inserting characters every once in a while?
Suggestions to use the standard mimeutils library are not helpful, because I'm also dot-escaping the string so it can be dumped out to an SMTP server in one shot.
In the end, your gap data structure would have to be transformed into a String, which would require assembling all the chunks into a single array, e.g. by appending them to a StringBuilder.
So using a StringBuilder directly will be faster. I don't think you'll find a faster technique than that. Make sure to initialize the StringBuilder with a large enough size to avoid copies of the whole buffer once the capacity is exhausted.
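For example (a sketch; encodeChar stands in for whatever per-character encoding logic you have):

// Pre-size the builder so append() never triggers a grow-and-copy.
// Worst case for quoted-printable: every char expands to three ("=XX").
StringBuilder out = new StringBuilder(input.length() * 3);
for (int i = 0; i < input.length(); i++) {
    out.append(encodeChar(input.charAt(i))); // hypothetical per-char encoder
}
String result = out.toString();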
So, taking the advice of some of the other answers here, I've written many versions of this function, seeing which goes quickest, and for future reference in case anybody can gain from what I found:
1) The slowest: StringBuffer.append(), but we knew that.
2) Almost twice as fast: StringBuilder.append(). Locks are very expensive, it seems.
3) Another 20% faster is copying from one char[] to another (a sketch follows after this answer).
4) And finally, coming in three times faster than even that: a JNI call to the exact same code compiled in C that copies from one char array to another.
You may consider #4 cheating, but cheaters win. It is by far the fastest way to go.
There is a risk of the GetCharArrayElements call causing the Java char array to be copied so it can be handed to the C code, but I can't tell whether that's happening, and it's still wicked fast compared to any Java implementation.
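For reference, the char[]-to-char[] approach (#3 above) amounts to something like this sketch, where needsEncoding() is a hypothetical, simplified predicate and only single-byte characters are assumed:

public class QpSketch {
    static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Encode by copying between char arrays; no StringBuilder in the hot path.
    static String encode(String input) {
        char[] src = input.toCharArray();
        char[] dst = new char[src.length * 3]; // worst case: every char becomes "=XX"
        int j = 0;
        for (char c : src) {
            if (needsEncoding(c)) {
                dst[j++] = '=';
                dst[j++] = HEX[(c >> 4) & 0xF];
                dst[j++] = HEX[c & 0xF];
            } else {
                dst[j++] = c;
            }
        }
        return new String(dst, 0, j);
    }

    // Hypothetical, simplified rule; real quoted-printable has more cases.
    static boolean needsEncoding(char c) {
        return c == '=' || c < 32 || c > 126;
    }
}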
I think a good balance between speed and coding grace would be to use Matcher.appendReplacement(). Formulate a regex that catches all the insertion points. In a loop, call find(), inspect Matcher.group() to see what exactly matched, and use your program logic to decide what to pass to appendReplacement().
In any case, it is important not to copy the text over char by char. You must copy in the largest chunks possible.
The Matcher API is unfortunately bound to StringBuffer, but, as you found, that only steals the final 5% from you.
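A skeleton of that approach (illustrative only; the pattern here escapes '=' and leading dots, not the full quoted-printable rules):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeSketch {
    // Matches every insertion point in one pass over the input.
    static final Pattern SPECIAL = Pattern.compile("=|^\\.", Pattern.MULTILINE);

    static String escape(String input) {
        Matcher m = SPECIAL.matcher(input);
        StringBuffer sb = new StringBuffer(input.length() + 64);
        while (m.find()) {
            // Decide the replacement based on what actually matched.
            String replacement = m.group().equals("=") ? "=3D" : "..";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb); // copy the untouched remainder in one chunk
        return sb.toString();
    }
}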
I have a very big file (maybe even 1 GB) from which I want to create a new file in reversed line order (in Java).
For example:
Original file:
This is the first line
This is the 2nd line
This is the 3rd line
The reversed file:
This is the 3rd line
This is the 2nd line
This is the first line
Since the file is very big, loading the entire file into memory at once and reversing the order there might be problematic (there is a limit to the memory I can use).
How can I achieve this in Java?
Thanks
Nothing very direct, I'm afraid. But you can easily create a (say) ReverseBufferedRead class wrapping a RandomAccessFile.
See also here.
Read the file in chunks of a few hundred lines, reverse the order of the lines within each chunk, and write the chunks to temporary files. Then join the temporary files in reverse order and clean up.
In other words, use disk instead of memory.
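A sketch of that temp-file scheme (chunk size and error handling simplified):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReverseFile {
    public static void main(String[] args) throws IOException {
        final int CHUNK_LINES = 100_000; // tune to the available memory
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            List<String> lines = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
                if (lines.size() == CHUNK_LINES) {
                    chunks.add(writeReversedChunk(lines));
                    lines.clear();
                }
            }
            if (!lines.isEmpty()) chunks.add(writeReversedChunk(lines));
        }
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            Collections.reverse(chunks); // the last chunk comes first
            for (Path chunk : chunks) {
                for (String line : Files.readAllLines(chunk, StandardCharsets.UTF_8)) {
                    out.write(line);
                    out.newLine();
                }
                Files.delete(chunk); // clean up the temp file
            }
        }
    }

    private static Path writeReversedChunk(List<String> lines) throws IOException {
        Path tmp = Files.createTempFile("rev-chunk", ".txt");
        Collections.reverse(lines);
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        return tmp;
    }
}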
I would propose making a RandomAccessFile for the output and using setLength() to make it appropriately sized.
Then start scanning the original file and writing it out in chunks, starting at the end of the RandomAccessFile and working backwards.
Java-ish pseudocode:
RandomAccessFile out = new RandomAccessFile("out_fname", "rw");
out.setLength(size_of_file_to_be_reversed); // pre-size the output
out.seek(out.length()); // seek to end
File in = new File("in_fname");
while (hasMoreData(in)) {
    String chunk = in.readChunk();       // hypothetical: read the next chunk
    out.seekBackwardsBy(chunk.length()); // hypothetical: step the write position back
    out.write(chunk.reverse());
    out.seekBackwardsBy(chunk.length());
}
Reading a file line-by-line in reverse order is fundamentally tricky.
It's not too bad if you've got a fixed width encoding. It's feasible if you've got a variable width encoding which you can detect the first byte of etc (e.g. UTF-8). It's virtually impossible to do efficiently if the encoding is variable width with no sensible way of determining boundaries (or if it uses "shifting" for example).
I have an implementation in C# in another question, but it would take a fair amount of effort to port that to Java.
If you use a RandomAccessFile as leonbloy suggested, you can use a FileChannel to skip to the end of the file; you can then read a line and write it to another file.
There is a simple example here in the Java tutorials: example
I assume you know how to read a file. One way I would advise you to do it is with an ArrayList of generic type String: read each line of the file and store it in the list. After reading, you can iterate over the list in reverse and print it out, or do whatever you want with it.
Just wrote something that might be of help here : http://pastebin.com/iWTVrAvm
Read using RandomAccessFile: position within the file using randomAccessFile.length() and write using BufferedWriter.
A better solution is to use the ReversedLinesFileReader provided in the Apache Commons IO package. See the API here: https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReversedLinesFileReader.html
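A minimal usage sketch (the three-argument constructor takes a block size along with the charset):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;

public class ReverseDump {
    public static void main(String[] args) throws IOException {
        // Streams lines from the end of the file without loading it all into memory.
        try (ReversedLinesFileReader reader =
                new ReversedLinesFileReader(new File(args[0]), 4096, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}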
I'm wondering if util code already exists to implement some/all of *NIX tail. I'd like to copy the last n lines of some file/reader to another file/reader, etc.
This seems like a good bet: Tailer Library. This implementation is based on it, but isn't the same. Neither implements a lookback to get the last 100 lines, though. :(
You could take a look at this tail implementation in one of Heritrix's utility classes. I didn't write it but I wrote the code that uses it, works correctly as far as I can tell.
This is a UI app - you can look at the source though to see what it does (basically some threading & IO). Follow.
The "last n lines" is quite tricky to do with potentially variable width encodings etc.
I wrote a reverse line iterator in C# in response to another SO question. The code is all there, although it uses iterator blocks, which aren't available in Java - you'd probably be better off passing the desired size into the method and getting it to build a list. (You can then convert the yield return statements in my code into list.add() calls.) You'll need to use a Java Charset instead of an Encoding, of course, and their APIs are slightly different too. Finally, you'll need to reverse the list when you're done.
This is all assuming you don't want to just read the whole file. If you don't mind doing that, you could use a circular buffer to keep "the last n lines seen so far", reading through to the end and returning the buffer afterwards. That would be much, much simpler to implement, but will be much less efficient for very long files. It's easy to make that work with any Reader, though, instead of just a few selected charsets over a stream (which is what my reverse iterator handles).
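A sketch of that circular-buffer idea (an ArrayDeque standing in for the ring):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class Tail {
    public static void main(String[] args) throws IOException {
        int n = Integer.parseInt(args[1]);
        Deque<String> ring = new ArrayDeque<>(n);
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (ring.size() == n) ring.removeFirst(); // evict the oldest line
                ring.addLast(line);
            }
        }
        for (String line : ring) System.out.println(line); // last n lines, in order
    }
}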