Improve performance of string search using Pattern.compile in large files

I have large text files ranging in size from 500 KB to 500 MB, and a list of keywords that must be found in the file content. There can be up to 400,000 keywords.
Right now I'm using the code below to find the keywords in the file content:
public static void main(String[] args) throws IOException {
StringBuilder fileContent = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\harshita.sethi\\Desktop\\merge\\MNT.txt"))) {
String line;
while ((line = reader.readLine()) != null) {
fileContent.append(line).append("\n");
}
}
String content = fileContent.toString();
Set<List<String>> keywords = getDbQuery(); // size can be up to 4*10^5
for (List<String> key : keywords) {
if (checkOccurence(content, key.get(0))) {
// Do something
}
}
}
private static boolean checkOccurence(String content, String keyword) {
boolean flag = false;
try {
Pattern p = Pattern.compile("\\b" + keyword + "\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(content);
flag = m.find();
} catch (PatternSyntaxException ex) {
System.out.println("cannot report occurrence of " + keyword);
}
return flag;
}
The problem is that with large files the scan takes a long time. After testing, I concluded that Pattern.compile is what slows the code down.
I have read that because Pattern.compile compiles the regex every time the function is called, it consumes a lot of time.
Can anyone suggest how to improve the performance of this code so that the string search is faster?
PS: I'm restricted to Java 6.
Edit -
I tried compiling all the keywords before the for loop, as a few people suggested. It made little difference to the execution time.
However, I noticed that removing the boundary regex (\b) changed the performance drastically: the full run took just a few seconds where it had been taking 8-10 minutes. But without the boundary regex I don't get the desired output.
Question - Is there a way to tune the performance while keeping the boundaries? Why did the performance change so drastically?
My aim (for example) is to get
false if abcd is found while searching for abc, and
true if abc. or abc, or abc etc. is found while searching for abc.

I would load the keywords and compile all the patterns before the search starts.
The next step to improve performance would be the Java 8 Stream API, which lets you parallelize the compile and search steps (note that it is not available on Java 6).
I think that can help.
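For the whole-word requirement in the question, one option that avoids the regex engine entirely is a plain indexOf scan followed by a check of the characters on either side of each hit. The following is only a sketch under stated assumptions: the class and method names (WholeWordSearch, containsWord, isWordChar) are made up for illustration, keywords are assumed to start and end with a letter, digit or underscore, and lower-casing stands in for Pattern.CASE_INSENSITIVE, which may differ for non-ASCII text. It sticks to Java 6 syntax.

```java
import java.util.Locale;

/** Sketch: whole-word, case-insensitive search without java.util.regex. */
final class WholeWordSearch {

    private WholeWordSearch() {}

    /** The character class that \w covers by default: [a-zA-Z_0-9]. */
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    /**
     * Both arguments are expected to be lower-cased already, so the file
     * content is converted once rather than once per keyword.
     */
    static boolean containsWord(String lowerContent, String lowerKeyword) {
        if (lowerKeyword.length() == 0) {
            return false;
        }
        int from = 0;
        while (true) {
            int start = lowerContent.indexOf(lowerKeyword, from);
            if (start < 0) {
                return false; // no further occurrence
            }
            int end = start + lowerKeyword.length();
            boolean leftOk = start == 0 || !isWordChar(lowerContent.charAt(start - 1));
            boolean rightOk = end == lowerContent.length() || !isWordChar(lowerContent.charAt(end));
            if (leftOk && rightOk) {
                return true; // hit is not glued to a neighbouring word character
            }
            from = start + 1; // keep scanning after a partial-word hit
        }
    }

    public static void main(String[] args) {
        String content = "xabc abcd, then abc. at last".toLowerCase(Locale.ENGLISH);
        System.out.println(containsWord(content, "abc"));  // true  (matches "abc.")
        System.out.println(containsWord(content, "abcd")); // true  (matches "abcd,")
        System.out.println(containsWord(content, "bcd"));  // false (only inside "abcd")
    }
}
```

Whether this is faster than the regex version on the real data would need to be measured; note also that lower-casing a 500 MB string creates a second copy of it in memory.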

Related

Java String Analysis for complete string regular expression

I am looking for a tool like Java String Analysis (JSA) that can summarize a string as a regex. I have tried to do that with JSA, but there I need to search for a specific method such as StringBuffer.append or other string operations.
I have strings like this:
StringBuilder test=new StringBuilder("hello ");
boolean condition=false;
if(condition){
test.append("world");
}
else{
test.append("other world");
}
test.append(" so far");
for(int i=0;i<args.length;i++){
test.append(" again hello");
}
// regularExpression = "hello (world| other world) so far( again hello)*"
And my JSA implementation looks like this so far:
public static void main(String[] args) {
StringAnalysis.addDirectoryToClassPath("bootstrap.jar");
StringAnalysis.loadClass("org.apache.catalina.loader.Extension");
List<ValueBox> list = StringAnalysis.getArgumentExpressions("<java.lang.StringBuffer: java.lang.StringBuffer append(java.lang.String)>", 0);
StringAnalysis sa = new StringAnalysis(list);
for (ValueBox e : list) {
Automaton a = sa.getAutomaton(e);
if (a.isFinite()) {
Iterator<String> si = a.getFiniteStrings().iterator();
StringBuilder sb = new StringBuilder();
while (si.hasNext()) {
sb.append((String) si.next());
}
System.out.println(sb.toString());
} else if (a.complement().isEmpty()) {
System.out.println(e.getValue());
} else {
System.out.println("common prefix:" + a.getCommonPrefix());
}
}
}
I would appreciate any help with the JSA tool or a hint to another tool. My biggest issue with the regex is the control flow structure around the string constant.

I'm not aware of a tool that yields a regex out of the box.
But since you have issues with the CFG, I would recommend writing a static analysis tailored to your problem. You could use a static analysis/bytecode framework like OPAL (Scala) or Soot (Java). You will find tutorials on each project page.
Once it is set up you can load the target jar. You should then be able to leverage the control flow of the program, as in the following example:
1 public static void example(String unknown) {
2 String source = "hello";
3 if(Math.random() * 20 > 5){
4 source += "world";
5 } else {
6 source += "unknown";
7 }
8 source += unknown;
}
If your analysis finds a String or StringBuilder which is initialized you can start to build your regular expression. Line number two for instance would bring your regex to "hello". If you meet a conditional in the control flow of your program you can analyze each path and combine them via an "|" later on.
Then branch: "world" (line 4)
Else branch: "unknown" (line 6)
This could be summarized at line 7 as (world)|(unknown) and appended to the regex built before the conditional.
If you encounter a variable, you can either trace it back, if you do an inter-procedural analysis, or otherwise fall back to the wildcard ".*".
Final regex: "hello((world)|(unknown)).*"
I hope this leads you to the solution you want to achieve.
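The combination step described above can be sketched in plain Java, independently of any analysis framework. This only shows how the pieces found along each path might be joined into one expression; RegexSummary, alternation and summarizeExample are hypothetical names, and the literals are quoted with Pattern.quote so that regex metacharacters in the constants are taken literally.

```java
import java.util.regex.Pattern;

/** Sketch: assemble a summary regex from string constants found on each path. */
final class RegexSummary {

    private RegexSummary() {}

    /** Joins the quoted literal of each branch into a non-capturing alternation. */
    static String alternation(String... branches) {
        StringBuilder sb = new StringBuilder("(?:");
        for (int i = 0; i < branches.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(branches[i]));
        }
        return sb.append(')').toString();
    }

    /** Regex for the example method: constant, then either branch, then an unknown value. */
    static String summarizeExample() {
        return Pattern.quote("hello")          // line 2: initial constant
                + alternation("world", "unknown") // lines 4 and 6: the two branches
                + ".*";                           // line 8: untraceable parameter
    }
}
```

With this sketch, Pattern.matches(RegexSummary.summarizeExample(), "helloworld42") holds, while "hello" alone does not match because one of the two branches is always appended.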

Apache Lucene has some tools around finite state automata and regular expressions. In particular, you can take the union of automata, so I'd guess you can easily build an automaton accepting a finite number of words.

Fastest way to parse txt file in Java

I have to parse a txt file for a tax calculator that has this form:
Name: Mary Jane
Age: 23
Status: Married
Receipts:
Id: 1
Place: Restaurant
Money Spent: 20
Id: 2
Place: Mall
Money Spent: 30
So, what I have done so far is:
public void read(File file) throws FileNotFoundException {
Scanner scanner = new Scanner(file);
String[] tokens = null;
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
tokens = line.split(":");
String lastToken = tokens[tokens.length - 1];
System.out.println(lastToken);
}
}
So, I want to read only the second column of this file (Mary Jane, 23, Married) into a Taxpayer(name, age, status) class and the receipts' info into an ArrayList.
I thought of taking the last token and saving it to a String array, but I can't do that because I can't save a String to a String array. Can someone help me? Thank you.

The fastest way, if your data is ASCII and you don't need charset conversion, is to use a BufferedInputStream and do all the parsing yourself -- find the line terminators, parse the numbers. Do NOT use a Reader, or create Strings, or create any objects per line, or use parseInt. Just use byte arrays and look at the bytes. It's a little messier, but pretend you're writing C code, and it will be faster.
Also give some thought to how compact the data structure you're creating is, and whether you can avoid creating an object per line there too by being clever.
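To make the byte-level idea concrete, here is a minimal sketch of two helpers that work directly on a byte buffer: one finds the next line terminator and one parses an unsigned decimal number without creating a String. The names (AsciiBytes, indexOfNewline, parseInt) are invented for illustration, the input is assumed to be ASCII, and overflow is not checked.

```java
/** Sketch: scanning and number parsing on raw ASCII bytes, no per-line objects. */
final class AsciiBytes {

    private AsciiBytes() {}

    /** Index of the next '\n' in buf[from, limit), or -1 if there is none. */
    static int indexOfNewline(byte[] buf, int from, int limit) {
        for (int i = from; i < limit; i++) {
            if (buf[i] == '\n') {
                return i;
            }
        }
        return -1;
    }

    /** Parses an unsigned decimal int from buf[from, to); does not guard against overflow. */
    static int parseInt(byte[] buf, int from, int to) {
        if (from >= to) {
            throw new NumberFormatException("empty range");
        }
        int value = 0;
        for (int i = from; i < to; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + digit;
        }
        return value;
    }
}
```

For a buffer holding "Age: 23\n", indexOfNewline(buf, 0, 8) gives 7 and parseInt(buf, 5, 7) gives 23. For a file of the size shown in the question, the simpler Scanner approach is probably fast enough.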

Frankly, I think the "fastest" is a red herring. Unless you have millions of these files, it is unlikely that the speed of your code will be relevant.
And in fact, your basic approach to parsing (read a line using Scanner, split the line using String.split(...)) seems pretty sound.
What you are missing is that the structure of your code needs to match the structure of the file. Here's a sketch of how I would do it.
If you are going to ignore the first field of each line, you need a method that:
reads a line, skipping empty lines
splits it, and
returns the second field.
If you are going to check that the first field contains the expected keyword, then modify the method to take a parameter, and check the field. (I'd recommend this version ...)
Then call the above method in the correct pattern; e.g.
call it 3 times to extract the name, age and marital status
call it 1 time to skip the "Receipts" line
use a while loop to call the method 3 times to read the 3 fields for each receipt.
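The steps above could look roughly like the following sketch. It is one possible reading of the outline, not the answerer's actual code: TaxFileReader, readField, readTaxpayer and readReceipts are hypothetical names, and plain String arrays stand in for the Taxpayer and Receipt classes the question mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Sketch: code structure that mirrors the structure of the input file. */
final class TaxFileReader {

    private TaxFileReader() {}

    /** Reads the next non-empty line, checks its keyword, and returns the trimmed value. */
    static String readField(Scanner in, String expectedKey) {
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.length() == 0) {
                continue; // skip empty lines
            }
            int colon = line.indexOf(':');
            if (colon < 0 || !line.substring(0, colon).trim().equals(expectedKey)) {
                throw new IllegalStateException("Expected '" + expectedKey + "' but found: " + line);
            }
            return line.substring(colon + 1).trim();
        }
        throw new IllegalStateException("Unexpected end of input, expected '" + expectedKey + "'");
    }

    /** Returns {name, age, status} and consumes the "Receipts:" header line. */
    static String[] readTaxpayer(Scanner in) {
        String name = readField(in, "Name");
        String age = readField(in, "Age");
        String status = readField(in, "Status");
        readField(in, "Receipts"); // header only, its value is empty
        return new String[] { name, age, status };
    }

    /** Reads {id, place, moneySpent} triples until no non-blank input remains. */
    static List<String[]> readReceipts(Scanner in) {
        List<String[]> receipts = new ArrayList<String[]>();
        while (in.hasNext()) {
            String id = readField(in, "Id");
            String place = readField(in, "Place");
            String money = readField(in, "Money Spent");
            receipts.add(new String[] { id, place, money });
        }
        return receipts;
    }
}
```

A caller would create one Scanner for the file, call readTaxpayer once and then readReceipts, and convert the arrays into its own domain objects.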

First, why do you need to invest time in the fastest possible solution? Is it because the input file is huge? I also do not understand how you want to store the result of parsing. Consider a new class with all the fields you need to extract from the file per person.
A few tips:
- Avoid unnecessary per-line memory allocations. line.split(":") in your code is an example of this.
- Use buffered input.
- Minimize input/output operations.
If these are not enough for you, try reading this article: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly

Do you really need it to be as fast as possible? In situations like this, it's often fine to create a few objects and do a bit of garbage collection along the way in order to have more maintainable code.
I'd use two regular expressions myself (one for the taxpayer and another for the receipts loop).
My code would look something like:
public class ParsedFile {
private Taxpayer taxpayer;
private List<Receipt> receipts;
// getters and setters etc.
}
public class FileParser {
private static final Pattern TAXPAYER_PATTERN =
// this pattern includes capturing groups in brackets ()
Pattern.compile("Name: (.*?)\\s*Age: (.*?)\\s*Status: (.*?)\\s*Receipts:", Pattern.DOTALL);
public ParsedFile parse(File file) throws IOException {
BufferedReader reader = new BufferedReader(new FileReader(file));
String firstChunk = getNextChunk(reader);
Taxpayer taxpayer = parseTaxpayer(firstChunk);
List<Receipt> receipts = new ArrayList<Receipt>();
String chunk;
while ((chunk = getNextChunk(reader)) != null) {
receipts.add(parseReceipt(chunk));
}
return new ParsedFile(taxpayer, receipts);
}
private Taxpayer parseTaxpayer(String chunk) {
Matcher matcher = TAXPAYER_PATTERN.matcher(chunk);
if (!matcher.matches()) {
// unchecked, so the method needs no throws clause
throw new IllegalArgumentException(chunk + " does not match " + TAXPAYER_PATTERN.pattern());
}
// this is where we use the capturing groups from the regular expression
return new Taxpayer(matcher.group(1), matcher.group(2), ...);
}
private Receipt parseReceipt(String chunk) {
// TODO implement
}
private String getNextChunk(BufferedReader reader) {
// keep reading lines until either a blank line or end of file
// return the chunk as a string
}
}

Java large String returned from findWithinHorizon converted to InputStream

I have written an application, one of whose modules parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main goal is to reduce memory usage and improve performance in general.
The following snippet is a small part of the big picture, but according to YourKit profiling it is the most problematic. The lines marked with /*HERE*/ allocate a huge amount of memory.
....
Scanner fileScanner = new Scanner(file,"UTF-8");
String scannedFarm;
try{
Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+",Pattern.MULTILINE);
String [] tableName = null;
/*HERE*/while((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null){
boolean continuePrevStream = false;
Scanner scanner = new Scanner(scannedFarm);
String[] tmpTableName = scanner.nextLine().split(getSeparator());
if (tmpTableName.length==2){
tableName = tmpTableName;
}else{
if (tableName==null){
continue;
}
continuePrevStream = true;
}
scanner.close();
/*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
....
It is acceptable to allocate a huge amount of memory, since the String is large (I need the chunk to be that large). My main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to transfer the findWithinHorizon result directly to an InputStream without allocating memory twice?
Is there a more efficient way to achieve the same functionality?

Not exactly the same approach, but instead of findWithinHorizon you could try reading each line and searching for the pattern within that line. This should reduce memory pressure, because you are no longer buffering the whole file. As the API documentation states:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like:
while (fileScanner.hasNextLine()) {
String line = fileScanner.nextLine();
// p is the line-level Pattern you are looking for
if (p.matcher(line).find()) {
// handle the matching line
}
}
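Since the pattern in the question matches runs of at least two consecutive non-empty lines, the same grouping can be done while reading line by line, so no multi-megabyte String is ever built. The sketch below is an assumption about what the regex is meant to capture, not a drop-in replacement: BlockReader and nextBlock are hypothetical names, and the upper bound of 100000 lines per run is not enforced.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: group consecutive non-empty lines into blocks without a regex. */
final class BlockReader {

    private BlockReader() {}

    /**
     * Returns the next run of at least two consecutive non-empty lines,
     * or null at end of input. Shorter runs are skipped.
     */
    static List<String> nextBlock(BufferedReader reader) throws IOException {
        List<String> block = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.length() > 0) {
                block.add(line);
            } else {
                if (block.size() >= 2) {
                    return block; // an empty line ends a long-enough run
                }
                block.clear(); // a single stray line is discarded
            }
        }
        return block.size() >= 2 ? block : null;
    }
}
```

The first line of each block can then be split to obtain the table name, and the remaining lines can be handed to the database code directly, which may remove the need for the getBytes copy altogether.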

How to set a java string variable equal to "htp://website htp://website " [closed]

So I have a large list of websites and I want to put them all in a String variable. I know I could go through all of the links individually and escape the //, but there are over a few hundred links. Is there a way to do a "block escape", so that everything inside the "block" is escaped? This is an example of what I want to save in the variable:
String links="http://website http://website http://website http://website http://website http://website";
Also, can anyone think of any other problems I might run into while doing this?
I made it htp instead of http because I am not allowed to post "hyperlinks" according to Stack Overflow, as I am not at that level :p
Thanks so much
Edit: I am writing a program because I have about 50 pages of a Word document filled with both emails and other text. I want to filter out just the emails. I wrote the program to do this, which was very simple; now I just need to figure out a way to store the pages in a String variable that the program will run on.

Your question is not well written. Please improve it; in its current form it will be closed as "too vague".
Do you want to filter e-mails or websites? Your example is about websites, your text about e-mails. Since I don't know, and decided to try to help anyway, I did both.
Here is the code:
private static final Pattern EMAIL_REGEX =
Pattern.compile("[A-Za-z0-9](?:(?:[_\\.\\-]?[a-zA-Z0-9]+)*)#(?:[A-Za-z0-9]+)(?:(?:[\\.\\-]?[a-zA-Z0-9]+)*)\\.(?:[A-Za-z]{2,})");
private static final Pattern WEBSITE_REGEX =
Pattern.compile("https?://[_#\\.\\-/\\?&=a-zA-Z0-9]*");
public static String readFileAsString(String fileName) throws IOException {
File f = new File(fileName);
byte[] b = new byte[(int) f.length()];
DataInputStream is = null;
try {
is = new DataInputStream(new FileInputStream(f));
// readFully keeps reading until the buffer is full; a single read() may return early
is.readFully(b);
return new String(b, "UTF-8");
} finally {
if (is != null) is.close();
}
}
public static List<String> filterEmails(String everything) {
List<String> list = new ArrayList<String>(8192);
Matcher m = EMAIL_REGEX.matcher(everything);
while (m.find()) {
list.add(m.group());
}
return list;
}
public static List<String> filterWebsites(String everything) {
List<String> list = new ArrayList<String>(8192);
Matcher m = WEBSITE_REGEX.matcher(everything);
while (m.find()) {
list.add(m.group());
}
return list;
}
To check that it works, first let's test the filterEmails and filterWebsites methods:
public static void main(String[] args) {
System.out.println(filterEmails("Orange, pizza whatever else joe#somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (luke#starwars.com) hfkjdsh fhdsjf jdhf Paulo <aaa.aaa#bgf-ret.com.br>"));
System.out.println(filterWebsites("Orange, pizza whatever else joe#somewhere.com a lot of text here. Blahblah blah with Luke Skywalker (http://luke.starwars.com/force) hfkjdsh fhdsjf jdhf Paulo <https://darth.vader/blackside?sith=true&midclorians> And the http://www.somewhere.com as x."));
}
It outputs:
[joe#somewhere.com, luke#starwars.com, aaa.aaa#bgf-ret.com.br]
[http://luke.starwars.com/force, https://darth.vader/blackside?sith=true&midclorians, http://www.somewhere.com]
To test the readFileAsString method:
public static void main(String[] args) throws IOException {
System.out.println(readFileAsString("C:\\The_Path_To_Your_File\\SomeFile.txt"));
}
If that file exists, its content will be printed.
If you don't like the fact that it returns List<String> instead of a String with items divided by spaces, this is simple to solve:
public static String collapse(List<String> list) {
StringBuilder sb = new StringBuilder(50 * list.size());
for (String s : list) {
sb.append(" ").append(s);
}
sb.delete(0, 1);
return sb.toString();
}
Putting it all together:
String fileName = ...;
String webSites = collapse(filterWebsites(readFileAsString(fileName)));
String emails = collapse(filterEmails(readFileAsString(fileName)));

I suggest that you save your Word document as plain text. Then you can read the text with a class such as java.util.Scanner or the readers in the java.io package.
To avoid overwriting the String variable each time you read a line, you can use an array or an ArrayList. This is better than holding all the web addresses in a single String, because you can easily access each address individually whenever you like.

For your first problem, take all the text out of Word, put it in an editor that supports regular expressions, and use a regular expression to quote each line and end each line with +. Now edit the last line and change + to ;. Above the first line write String links =. Copy this new file into your Java source.
Here's an example using regexr.
To answer your second question (thinking of problems): there is an upper limit on the size of a Java string literal, if I recall correctly around 2^16 (65,535 bytes in its encoded form in the class file).
Oh, and Perl was basically written for this kind of thing (take 50 pages of text and separate out what is a URL and what is an email)... not to mention grep.

I'm not sure what kind of 'list of websites' you're referring to, but for, say, a comma-separated file of websites you could read the entire file and use String.split to get an array, or you could use a BufferedReader to read the file line by line and add each line to an ArrayList.
From there you can simply loop over the array and append to a String, or if you need to
do a "block escape", so everything in between the "block" is escaped
you can use a regular expression to extract parts of each String according to a pattern:
String oldString = "<someTag>I only want this part</someTag>";
String regExp = "(?i)(<someTag.*?>)(.+?)(</someTag>)";
String newString = oldString.replaceAll(regExp, "$2");
The above expression removes the XML tags because of the "$2", which refers to the second group of the expression; groups are delimited by round brackets ( ).
Using "$1$3" instead gives you only the surrounding XML tags.
Another, much simpler way to remove certain "blocks" from a String is String.replace, passing an empty string as the new value.
I hope some of this helps; otherwise you could provide a full example with your input "list of websites" and the output you want.

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotation marks ("79, 869"), I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotation marks, where the first two numbers are separated by a dash and the third follows a comma ("56-57, 568"), then I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!

I recommend using the version of indexOf that actually takes a char rather than a string, since this method is much faster. (It is a simple loop, without a nested loop.)
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.

The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568

Although it gets a lot of hate these days, I still really like StringTokenizer for this kind of thing. You can set it up to return the delimiters as tokens and, at least to me, it makes the processing trivial without involving regexes.
You'd have to create it using ",- as your delimiters, then just kick it off in a loop.
StringTokenizer st = new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while (st.hasMoreTokens()) {
String token = st.nextToken();
Each case becomes its own little part of the loop:
// Use punctuation to set flags that tell you how to interpret the numbers.
// Strings must be compared with equals(), not ==.
if (token.equals("\"")) {
isQuoted = !isQuoted;
} else if (token.equals(",")) {
...
} else if(...) {
...
} else { // The punctuation has been dealt with, must be a number group
// Apply flags to determine how to parse this number.
}
I realize that StringTokenizer is outdated now, but I'm not really sure why. Parsing regular expressions can't be faster and the syntax is--well split is a pretty sweet syntax I gotta admit.
I guess if you and everyone you work with is really comfortable with Regular Expressions you could replace that with split and just iterate over the resultant array but I'm not sure how to get split to return the punctuation--probably that "+" thing from other answers but I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.

Will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
By the way, this is marginally faster than contains, because contains is itself implemented with indexOf.

Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.

I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).
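In Java, the three-pattern idea could be sketched as follows. The patterns here are a stricter adaptation of the ones above (digits only, with the surrounding quotation marks included so they do not end up in the captured group), and CsvNumberPicker and pick are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: try the most specific pattern first and return its captured number. */
final class CsvNumberPicker {

    private CsvNumberPicker() {}

    // "56-57, 568" -> captures the third number
    private static final Pattern RANGE_THEN_NUMBER = Pattern.compile("\"\\d+-\\d+,\\s*(\\d+)\"");
    // "79, 869" -> captures the first number
    private static final Pattern TWO_NUMBERS = Pattern.compile("\"(\\d+),\\s*\\d+\"");
    // 833 -> captures the number itself
    private static final Pattern SINGLE_NUMBER = Pattern.compile("(\\d+)");

    /** Returns the number of interest as a String, or null if the line fits no format. */
    static String pick(String line) {
        String trimmed = line.trim();
        Pattern[] patterns = { RANGE_THEN_NUMBER, TWO_NUMBERS, SINGLE_NUMBER };
        for (Pattern p : patterns) {
            Matcher m = p.matcher(trimmed);
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }
}
```

For the three sample lines from the question, pick returns 833, 79 and 568 respectively; a line that fits none of the formats yields null so the caller can decide how to report it.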
