Java: How to extract matching lines from a large text file fast?

Although I am aware that there are plenty of solutions on offer for my problem in general,
I am still not satisfied with the runtime they require in my special case.
Consider a 35 GB text file in FASTA format, like this:
>Protein_1 So nice and cute little fella
MTTKKCLQKFHLESLGKLGDSFLKYAISIQLFKSYENHYEGLPSIKKNKIISNAALFKLG
YARKILRFIRNEPFDLKVGLIPSDNSQAYNFGKEFLMPSVKMCSRVK*
>Protein_2 Fancy incredible description of its function
MADDSKFCFFLVSTFLLLAVVVNVTLAANYVPGDDILLNCGGPDNLPDADGRKWGTDIGS
[…] etc.
I need to extract the > lines only.
Using grep '>' proteins.fasta > protein_descriptions.txt to achieve this takes only a couple of minutes.
But the following Java 7 code has now been running for over 90 minutes:
public static void main(String[] args) throws Exception {
    BufferedReader fastaIn = new BufferedReader(new FileReader(args[0]));
    List<String> l = new ArrayList<String>();
    String str;
    while ((str = fastaIn.readLine()) != null) {
        if (str.startsWith(">")) {
            l.add(str);
        }
    }
    fastaIn.close();
    // …
}
Does anyone have an idea of how to speed this up to grep performance?
Your help will be much appreciated.
Cheers!

If you write each matching line to the output file immediately instead of accumulating objects in memory, it will improve performance (and will be more like what you did with grep anyway).
...
BufferedWriter fastaOut = new BufferedWriter(new FileWriter(args[1]));
...
while ((str = fastaIn.readLine()) != null) {
    if (str.startsWith(">")) {
        fastaOut.write(str);
        fastaOut.newLine();
    }
}
...
fastaOut.close();

BioJava (biojava.org) provides a FASTA reader.
For reading huge files you should consider using a SeekableByteChannel together with ByteBuffers.
The BioJava library uses ByteBuffers internally.
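As a rough sketch of that byte-level approach (not the BioJava API itself), the following reads the file through a FileChannel into a reusable ByteBuffer and prints only the '>' lines. It assumes the header lines are plain ASCII; for other encodings a CharsetDecoder would be needed.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelGrep {
    public static void main(String[] args) throws IOException {
        try (FileChannel in = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // read in 1 MiB chunks
            StringBuilder line = new StringBuilder();
            boolean keep = false;        // does the current line start with '>'?
            boolean atLineStart = true;
            while (in.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (atLineStart) {
                        keep = (b == '>');
                        atLineStart = false;
                    }
                    if (b == '\n') {
                        if (keep) {
                            System.out.println(line);
                        }
                        line.setLength(0);
                        atLineStart = true;
                    } else if (keep && b != '\r') {
                        line.append((char) b); // fine for ASCII headers only
                    }
                }
                buf.clear();
            }
            if (keep && line.length() > 0) {
                System.out.println(line);  // file may not end with a newline
            }
        }
    }
}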

You could probably speed this up considerably using multiple threads. If the file is X bytes long and you have n threads, you start each thread at an offset of X/n and have it read X/n bytes. You will want to synchronize your ArrayList to ensure the results are added correctly, and you have to deal with lines that straddle the chunk boundaries.
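A minimal sketch of that chunking idea, under the assumption that the file is on a local disk: each worker seeks to its byte offset, skips the partial line at the start of its chunk (the previous worker owns it), and keeps reading until it passes the chunk end, so boundary lines are neither lost nor counted twice. Note that RandomAccessFile.readLine() is unbuffered and slow, so this only illustrates the splitting logic, not a tuned implementation.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ParallelGrep {
    public static void main(String[] args) throws Exception {
        final String path = args[0];
        final int n = Runtime.getRuntime().availableProcessors();
        final long size = new java.io.File(path).length();
        final List<String> results = Collections.synchronizedList(new ArrayList<String>());

        List<Thread> threads = new ArrayList<Thread>();
        for (int i = 0; i < n; i++) {
            final long start = size * i / n;
            final long end = size * (i + 1) / n;
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                        raf.seek(start);
                        if (start > 0) {
                            raf.readLine(); // skip the partial line at the chunk start
                        }
                        String line;
                        // Read lines that begin inside this chunk, even if they end beyond it.
                        while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                            if (line.startsWith(">")) {
                                results.add(line);
                            }
                        }
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(results.size() + " header lines found");
    }
}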

Related

Java: number of lines in a file without processing it

I need to know the number of lines in a file before processing it, because I need the line count before reading it, or in the worst-case scenario read it twice... so I wrote this code, but it does not work. Maybe it is just not possible?
InputStream inputStream2 = getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(getInputStream()));
String line;
int numLines = 0;
while ((line = reader.readLine()) != null) {
    numLines++;
}
TextFileDataCollection dataCollection = new TextFileDataCollection(numLines, 50);
BufferedReader reader2 = new BufferedReader(new InputStreamReader(inputStream2));
while ((line = reader2.readLine()) != null) {
    StringTokenizer st = new StringTokenizer(line, ","); // tokenize the line just read, not another readLine()
    while (st.hasMoreElements()) {
        System.out.println(st.nextElement());
    }
}
Here's a similar question with Java code, although it's a bit older:
Number of lines in a file in Java
public static int countLines(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
EDIT:
Here's a reference related to InputStreams specifically:
From Total number of rows in an InputStream (or CsvMapper) in Java
"Unless you know the row count ahead of time, it is not possible without looping. You have to read that file in its entirety to know how many lines are in it, and neither InputStream nor CsvMapper have a means of reading ahead and abstracting that for you (they are both stream oriented interfaces).
None of the interfaces that ObjectReader can operate on support querying the underlying file size (if it's a file) or number of bytes read so far.
One possible option is to create your own custom InputStream that also provides methods for grabbing the total size and number of bytes read so far, e.g. if it is reading from a file, it can expose the underlying File.length() and also track the number of bytes read. This may not be entirely accurate, especially if Jackson buffers far ahead, but it could get you something at least."
You write
I need to know the number of lines of a file before processing it
but you don't present any file in your code; rather, you present only an InputStream. This makes a difference, because indeed no, you cannot know the number of lines in the input without examining the input to count them.
If you had a file name, File object, or similar mechanism by which you could access the data more than once, then that would be straightforward, but a stream is not guaranteed to be associated with any persistent file -- it might convey data piped from another process or communicated over a network connection, for example. Therefore, each byte provided by a generic InputStream can be read only once.
InputStream does provide an API for marking (mark()) a position and later returning to it (reset()), but stream implementations are not required to support it, and many do not. Those that do support it typically impose a limit on how far past the mark you can read before invalidating it. Readers support such a facility as well, with similar limitations.
Overall, if your only access to the data is via an InputStream, then your best bet is to process it without relying on advance knowledge of the contents. But if you want to be able to read the data twice, to count lines first, for example, then you need to make your own arrangements to stash the data somewhere in order to ensure your ability to do so. For example, you might copy it to a temporary file, or if you're prepared to rely on the input not being too large for it then you might store the contents in memory as a List of byte, byte[], char, or String.
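For instance, here is a minimal sketch of the temporary-file arrangement; the TextFileDataCollection constructor and the per-line tokenizing are the question's own, everything else is illustrative. The stream is copied to a temp file once, counted in a first pass, and then processed in a second pass.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CountThenProcess {
    static void process(InputStream in) throws IOException {
        // Stash the stream in a temporary file so it can be read twice.
        Path tmp = Files.createTempFile("stream-copy", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);

            // First pass: count the lines.
            int numLines = 0;
            try (BufferedReader r = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                while (r.readLine() != null) {
                    numLines++;
                }
            }

            // Second pass: do the real work now that numLines is known,
            // e.g. new TextFileDataCollection(numLines, 50) and tokenizing each line.
            try (BufferedReader r = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = r.readLine()) != null) {
                    // tokenize / process 'line' here
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}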

Improve performance of string search using Pattern.compile in large files

I have huge text files whose size can range from 500 KB to 500 MB. I have a list of keywords that need to be found in the file content. The number of keywords can be up to 400,000.
Right now I'm using the below code to find the keywords in the file content
public static void main(String[] args) throws IOException {
    StringBuilder fileContent = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\harshita.sethi\\Desktop\\merge\\MNT.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            fileContent.append(line).append("\n");
        }
    }
    String content = fileContent.toString();
    Set<List<String>> keywords = getDbQuery(); // size can be up to 4*10^5
    for (List<String> key : keywords) {
        if (checkOccurence(content, key.get(0))) {
            // Do something
        }
    }
}
private static boolean checkOccurence(String content, String keyword) {
    boolean flag = false;
    try {
        Pattern p = Pattern.compile("\\b" + keyword + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(content);
        flag = m.find();
    } catch (PatternSyntaxException ex) {
        System.out.println("cannot report occurrence of " + keyword);
    }
    return flag;
}
The problem is that with a huge file it takes a lot of time to scan through the content. I have done all sorts of testing and came to the conclusion that Pattern.compile is making the code slow.
I have read on the internet that since Pattern.compile compiles the regex every time the function is called, it consumes a lot of time.
Can anyone please suggest how I can improve the performance of this code so that the string search is faster?
PS: I'm restricted to the Java 6 version.
Edit -
I tried compiling all the keywords before the for loop, as suggested by a few people. I can see there is not much difference in the code execution time.
However, I noticed that removing the boundary regex changed the performance of the code drastically. It took just a few seconds to complete the full run where it was taking 8-10 minutes earlier. But by removing the boundary regex, I'm not getting the desired output.
Question - Is there a way to fine-tune the performance while keeping the boundaries? Why did the performance change so drastically?
My aim (for example) is to get
false if abcd is found while searching for abc, and
true if abc. or abc, or abc etc. is found while searching for abc.
I would prefer to load the keywords and compile all patterns before the search process.
The next step to improve the performance is to use the Java 8 Stream API, which allows you to parallelize the compile and search process.
I think that can help.
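A minimal sketch of the precompiling suggestion; the keyword list and the hit handling stand in for the question's getDbQuery() and "Do something", and Pattern.quote is an illustrative addition to guard against keywords containing regex metacharacters.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class KeywordSearch {
    // Compile every keyword pattern once, up front.
    static List<Pattern> compileAll(List<String> keywords) {
        List<Pattern> patterns = new ArrayList<Pattern>();
        for (String keyword : keywords) {
            try {
                patterns.add(Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                                             Pattern.CASE_INSENSITIVE));
            } catch (PatternSyntaxException ex) {
                System.out.println("cannot compile pattern for " + keyword);
            }
        }
        return patterns;
    }

    // Reuse the precompiled patterns for every search over the file content.
    static void search(String content, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            Matcher m = p.matcher(content);
            if (m.find()) {
                // Do something with the match
            }
        }
    }
}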

Java read csv file as matrix

I'm new to writing Java code as such; my experience is with scripting-type languages. I'm trying to rewrite in Java a piece of code I had in Python.
Python code below -
import pandas as pd
myFile = 'dataFile'
df = pd.DataFrame(pd.read_csv(myFile,skiprows=0))
inData = df.as_matrix()
I'm looking for a method in Java that is equivalent to as_matrix in Python. This function converts the data frame into a matrix.
I have looked for some time now but can't find a method that does the conversion like in Python. Is there a 3rd-party library or something along those lines I could use? Any direction would help me a lot. Thank you heaps.
What you want to do is really simple and requires minimal code on your part, therefore I suggest you code it yourself. Here is an example implementation:
List<String[]> rowList = new ArrayList<String[]>();
try (BufferedReader br = new BufferedReader(new FileReader("pathtocsvfile.csv"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] lineItems = line.split(",");
        rowList.add(lineItems);
    }
} catch (Exception e) {
    // Handle any I/O problems
}
String[][] matrix = new String[rowList.size()][];
for (int i = 0; i < rowList.size(); i++) {
    String[] row = rowList.get(i);
    matrix[i] = row;
}
What this does is really simple: it opens a buffered reader that reads the CSV file line by line and puts the contents into an array of Strings after splitting each line on the comma (which is your delimiter). Then it adds each array to a list. I know this might not be perfect, so afterwards I take the contents of that list of arrays and turn it into a neat 2D matrix. Hope this helps.
Hint: there are a lot of improvements that could be made to this little piece of code (e.g. take care of trailing and leading spaces, add user-defined delimiters, etc.), but this should be a good starting point.
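For illustration, here is a tiny helper sketching those two hints; the delimiter parameter and the trimming are illustrative additions, not part of the code above.
// Split a line on a caller-supplied delimiter and trim each cell.
static String[] parseLine(String line, String delimiter) {
    // The -1 limit keeps trailing empty cells instead of dropping them.
    String[] cells = line.split(java.util.regex.Pattern.quote(delimiter), -1);
    for (int i = 0; i < cells.length; i++) {
        cells[i] = cells[i].trim();
    }
    return cells;
}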

Java large String returned from findWithinHorizon converted to InputStream

I have written an application which, in one of its modules, parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main problem is to reduce memory usage and generally increase performance.
The following code snippet is a small part of the big picture, but it is the most problematic one according to some YourKit profiling. The lines marked with /*HERE*/ allocate a huge amount of memory.
....
Scanner fileScanner = new Scanner(file, "UTF-8");
String scannedFarm;
try {
    Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+", Pattern.MULTILINE);
    String[] tableName = null;
    /*HERE*/ while ((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null) {
        boolean continuePrevStream = false;
        Scanner scanner = new Scanner(scannedFarm);
        String[] tmpTableName = scanner.nextLine().split(getSeparator());
        if (tmpTableName.length == 2) {
            tableName = tmpTableName;
        } else {
            if (tableName == null) {
                continue;
            }
            continuePrevStream = true;
        }
        scanner.close();
        /*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
        ....
It is acceptable to allocate a huge amount of memory, since the String is large (I need it to be such a large chunk). My main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to transfer the findWithinHorizon result directly to an InputStream without allocating memory twice?
Is there a more efficient way to achieve the same functionality?
Not exactly the same approach but instead of findWithinHorizon, you could try reading each line and searching for the pattern within the line context. This is sure to reduce memory pressure as you're not buffering the whole file as the API states:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like:
while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    if (p.matcher(line).find()) { // p: the pattern you are looking for within a single line
        // process the match
    }
}

Using StringBuilder to process csv files to save heap space

I am reading a CSV file that has about 50,000 lines and is 1.1 MiB in size (and can grow larger).
In Code1, I use String to process the CSV, while in Code2 I use StringBuilder (only one thread executes the code, so there are no concurrency issues).
Using StringBuilder makes the code a little bit harder to read than using the normal String class.
Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Code1
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
String line = reader.readLine();
while (line != null) {
    int separator = line.indexOf(',');
    String symbol = line.substring(0, separator);
    int begin = separator;
    separator = line.indexOf(',', begin + 1);
    String price = line.substring(begin + 1, separator);
    // Publish this update
    publisher.publishQuote(symbol, price);
    // Read the next line of fake update data
    line = reader.readLine();
}
Code2
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
StringBuilder stringBuilder = new StringBuilder(reader.readLine());
while (stringBuilder.toString() != null) {
    int separator = stringBuilder.toString().indexOf(',');
    String symbol = stringBuilder.toString().substring(0, separator);
    int begin = separator;
    separator = stringBuilder.toString().indexOf(',', begin + 1);
    String price = stringBuilder.toString().substring(begin + 1, separator);
    publisher.publishQuote(symbol, price);
    stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Edit
I eliminated the toString() call, so fewer String objects are produced.
Code3
while (stringBuilder.length() > 0) {
    int separator = stringBuilder.indexOf(",");
    String symbol = stringBuilder.substring(0, separator);
    int begin = separator;
    separator = stringBuilder.indexOf(",", begin + 1);
    String price = stringBuilder.substring(begin + 1, separator);
    publisher.publishQuote(symbol, price);
    Thread.sleep(10);
    stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Also, the original code is downloaded from http://www.devx.com/Java/Article/35246/0/page/1
Will the optimized code increase performance of the app? - my question
The second code sample will save you neither memory nor computation time. I am afraid you might have misunderstood the purpose of StringBuilder, which is really meant for building strings, not reading them.
Within the loop of your second code sample, every single line contains the expression stringBuilder.toString(), essentially turning the buffered string into a String object over and over again. Your actual string operations are done against these objects. Not only is the first code sample easier to read, it is almost certainly the more performant of the two.
Am I prematurely optimizing things with StringBuilder? - your question
Unless you have profiled your application and have come to the conclusion that these very lines cause a notable slowdown in execution speed, yes. Unless you are really sure that something will be slow (e.g. if you recognize high computational complexity), you definitely want to do some profiling before you start making optimizations that hurt the readability of your code.
What kind of optimizations could be done to this code? - my question
If you have profiled the application and decided this is the right place for an optimization, you should consider looking into the features offered by the Scanner class. Actually, this might give you both better performance (profiling will tell you if this is true) and simpler code.
Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Most probably: yes. But, only one way to find out: profile your code.
Also, I'd use a proper CSV parser instead of what you're doing now: http://ostermiller.org/utils/CSV.html
Code2 is actually less efficient than Code1 because every time you call stringBuilder.toString() you're creating a new java.lang.String instance (in addition to the existing StringBuilder object). This is less efficient in terms of space and time due to the object creation overhead.
Assigning the contents of readLine() directly to a String and then splitting that String will typically be performant enough. You could also consider using the Scanner class.
Memory Saving Tip
If you encounter multiple repeating tokens in your input consider using String.intern() to ensure that each identical token references the same String object; e.g.
String[] tokens = parseTokens(line);
for (String token : tokens) {
    // Construct business object referencing interned version of token.
    BusinessObject bo = new BusinessObject(token.intern());
    // Add business object to collection, etc.
}
StringBuilder is usually used like this:
StringBuilder sb = new StringBuilder();
sb.append("You").append(" can chain ")
.append(" your ").append(" strings ")
.append("for better readability.");
String myString = sb.toString(); // only call once when you are done
System.out.prinln(sb); // also calls sb.toString().. print myString instead
StringBuilder has several good things:
StringBuffer's operations are synchronized but StringBuilder's are not, so using StringBuilder will improve performance in single-threaded scenarios.
Once the buffer has been expanded, it can be reused by invoking setLength(0) on the object. Interestingly, if you step into the debugger and examine the contents of the StringBuilder, you will see that the contents still exist even after invoking setLength(0). The JVM simply resets the pointer to the beginning of the string; the next time you start appending chars, the pointer moves.
If you are not really sure about the length of the string, it is better to use StringBuilder, because once the buffer is expanded you can reuse the same buffer for smaller or equal sizes.
StringBuffer and StringBuilder are almost the same in all operations, except that StringBuffer is synchronized and StringBuilder is not.
If you don't have multithreading, then it is better to use StringBuilder.
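A small sketch of the setLength(0) reuse mentioned above; the record data is made up purely for illustration.
public class BuilderReuse {
    public static void main(String[] args) {
        String[][] records = { {"IBM", "101.25"}, {"MSFT", "42.10"} }; // stand-in data
        // One builder for all iterations: setLength(0) clears the contents
        // but keeps the already-expanded internal buffer.
        StringBuilder sb = new StringBuilder(64);
        for (String[] record : records) {
            sb.setLength(0);
            sb.append(record[0]).append(',').append(record[1]);
            System.out.println(sb); // println calls sb.toString() internally
        }
    }
}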
