Java String Analysis for complete string regular expression - java

I am looking for a tool like Java String Analysis (JSA) that can summarize the values a string may take as a regex. I have tried to do that with JSA, but there I have to search for a specific method such as StringBuffer.append or another string operation.
I have strings like that:
StringBuilder test = new StringBuilder("hello ");
boolean condition = false;
if (condition) {
    test.append("world");
} else {
    test.append("other world");
}
test.append(" so far");
for (int i = 0; i < args.length; i++) {
    test.append(" again hello");
}
// regularExpression = "hello (world|other world) so far( again hello)*"
And my JSA implementation looks like that so far:
public static void main(String[] args) {
StringAnalysis.addDirectoryToClassPath("bootstrap.jar");
StringAnalysis.loadClass("org.apache.catalina.loader.Extension");
List<ValueBox> list = StringAnalysis.getArgumentExpressions("<java.lang.StringBuffer: java.lang.StringBuffer append(java.lang.String)>", 0);
StringAnalysis sa = new StringAnalysis(list);
for (ValueBox e : list) {
Automaton a = sa.getAutomaton(e);
if (a.isFinite()) {
Iterator<String> si = a.getFiniteStrings().iterator();
StringBuilder sb = new StringBuilder();
while (si.hasNext()) {
sb.append((String) si.next());
}
System.out.println(sb.toString());
} else if (a.complement().isEmpty()) {
System.out.println(e.getValue());
} else {
System.out.println("common prefix:" + a.getCommonPrefix());
}
}
}
I would appreciate any help with the JSA tool or a hint to another tool. My biggest issue with the regex is the control flow structure around the string constant.

I'm not aware of a tool which yields you a regex out of the box.
But since your issue is with the CFG, I would recommend writing a static analysis tailored to your problem. You could use a static analysis/bytecode framework like OPAL (Scala) or Soot (Java). You will find tutorials on each project page.
Once you have set it up, you can load the target jar. You should then be able to leverage the control flow of the program, as in the following example:
1 public static void example(String unknown) {
2 String source = "hello";
3 if(Math.random() * 20 > 5){
4 source += "world";
5 } else {
6 source += "unknown";
7 }
8 source += unknown;
}
If your analysis finds a String or StringBuilder which is initialized you can start to build your regular expression. Line number two for instance would bring your regex to "hello". If you meet a conditional in the control flow of your program you can analyze each path and combine them via an "|" later on.
Then branch: "world" (line 4)
Else branch: "unknown" (line 6)
This could be summarized at line 7 as (world)|(unknown) and appended to the regex built before the conditional.
If you encounter a variable you either can trace it back if you do an inter-procedural analysis or you have to use the wildcard operator ".*" otherwise.
Final regex: "hello((world)|(unknown)).*"
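The path-merging idea described above can be sketched without any analysis framework. This is only a minimal sketch: RegexSummary and its methods are hypothetical names, not part of JSA, OPAL or Soot, and the calls in main stand in for the events a real analysis would report (a constant, a conditional with two paths, an untraceable variable).

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of JSA, OPAL or Soot): it records what an
// analysis has learned about a string and renders it as a regex.
class RegexSummary {
    private final StringBuilder regex = new StringBuilder();

    // A string constant: quoted so that metacharacters are matched literally.
    RegexSummary literal(String s) {
        regex.append(Pattern.quote(s));
        return this;
    }

    // A conditional: both paths are analysed and joined with "|".
    RegexSummary branch(String thenPart, String elsePart) {
        regex.append("(?:").append(Pattern.quote(thenPart))
             .append('|').append(Pattern.quote(elsePart)).append(')');
        return this;
    }

    // A loop appending the same constant zero or more times.
    RegexSummary loop(String body) {
        regex.append("(?:").append(Pattern.quote(body)).append(")*");
        return this;
    }

    // A value that cannot be traced back: fall back to the wildcard.
    RegexSummary unknown() {
        regex.append(".*");
        return this;
    }

    Pattern build() {
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        // Mirrors the example method: "hello", then a branch, then an unknown value.
        Pattern p = new RegexSummary()
                .literal("hello").branch("world", "unknown").unknown().build();
        System.out.println(p.matcher("helloworld!").matches()); // true
        System.out.println(p.matcher("hellothere").matches());  // false
    }
}
```

The loop method is there for cases like the for loop in the question, where the same constant is appended an unknown number of times.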
I hope this leads you to the solution you want to achieve.

Apache Lucene has some tools around finite state automata and regular expressions. In particular, you can take the union of automata, so I'd guess you can easily build an automaton accepting a finite number of words.

Related

Improve performance of string search using Pattern.compile in large files

I have huge text files whose size can range from 500KB to 500MB. I have a list of keywords that needs to be found in the file content. The no. of keywords can be up to 400,000.
Right now I'm using the below code to find the keywords in the file content
public static void main(String[] args) {
StringBuilder fileContent = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\harshita.sethi\\Desktop\\merge\\MNT.txt"))) {
String line;
while ((line = reader.readLine()) != null) {
fileContent.append(line).append("\n");
}
}
String content = fileContent.toString();
Set<List<String>> keywords = getDbQuery(); // size can be up to 4*10^5
for (List<String> key : keywords) {
if (checkOccurence(content, key.get(0))) {
// Do something
}
}
}
private static boolean checkOccurence(String content, String keyword) {
Boolean flag = false;
try {
Pattern p = Pattern.compile("\\b" + keyword + "\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(content);
flag = m.find();
} catch (PatternSyntaxException ex) {
System.out.println("cannot report occurrence of " + keyword);
}
return flag;
}
The problem is that with a huge file it takes a lot of time to scan through the content. I have done all sorts of testing and came to the conclusion that Pattern.compile is what slows the code down.
I have read on the internet that, since Pattern.compile compiles the regex every time the function is called, it consumes a lot of time.
Can anyone please suggest how I can improve the performance of this code so that the string search is faster?
PS: I'm restricted to use Java 6 version.
Edit -
I tried compiling all the keywords before the for loop, as suggested by a few people. There is not much difference in the execution time.
However, I noticed that removing the boundary regex changed the performance drastically. The full run took just a few seconds where it was taking 8-10 minutes earlier. But without the boundary regex I do not get the desired output.
Question - Is there a way to fine-tune the performance while keeping the boundaries? Why did the performance change so drastically?
My aim (for example) is to get
false if abcd is found while searching for abc, and
true if abc. or abc, or abc etc. is found while searching for abc.
I would prefer to load the keywords and compile all patterns before the search process.
The next step to improve the performance is to use the Java 8 stream API, which allows you to parallelize the compile and search process.
I think that can help.
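Another option, which is not from the posts above and is only a sketch, is to avoid one full regex scan per keyword: split the content into words once, store them in a hash set, and look each keyword up. It assumes every keyword is a single word made of letters, digits or underscores, so it only approximates the \b behaviour. WordIndex is a hypothetical name, and the code sticks to Java 6 APIs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: indexes the words of a text once so that each
// keyword lookup is a hash lookup instead of a scan of the whole content.
class WordIndex {
    private final Set<String> words = new HashSet<String>();

    WordIndex(String content) {
        // \W+ splits on every run of non-word characters (punctuation, spaces, newlines).
        for (String word : content.split("\\W+")) {
            if (word.length() > 0) {
                words.add(word.toLowerCase(Locale.ROOT)); // case-insensitive matching
            }
        }
    }

    // True if the keyword occurs as a whole word, ignoring case.
    boolean contains(String keyword) {
        return words.contains(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(new WordIndex("see abc, and abc.").contains("ABC")); // true
        System.out.println(new WordIndex("only abcd here").contains("abc"));    // false
    }
}
```

For a 500MB file the set could also be filled line by line while reading, so the whole content never has to sit in a single String.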

Scanning a number and returning the lexeme in the input stream- Java?

I am trying to write a method that will scan the input and return a String representing the lexeme found in the input string.
This is what I have so far but I don't know if I'm going in the right direction-- all help would be appreciated :)
private String scanNumbers(char input)
{
String result= "";
int value = in.read()
if(value != -1)
{
If(isDigit(input))
{
result = Integer.toString(value);
}
}
return result;
}
public static boolean isDigit(char input)
{
return (input >= '0' && input <= '9');
}
Thank you I am new to parsing/lexemes/compilers.
Introduction
Questions that appear to be related to a homework exercise are often slow to be answered on SO. We often wait until the deadline has well passed!
You mention you are new to the topics of parsing/lexemes/compilers, and want some help in writing a Java method to scan the input and return a string representing the lexeme found in the input string. Later you clarify, indicating that you want a method that skips characters until it finds digits.
There is quite a bit of confusion in your question which produces conflicts in what you want to achieve.
It is not clear whether you want to learn about performing lexical analysis in Java as part of a larger compiler project, whether you only want to do it with numbers, and whether you are looking for existing tools or methods that do this or are trying to learn how to program such methods yourself. If you are programming it yourself, it is also unclear whether you only need to know about reading a number, or whether this is just an example of the kind of thing you want to do.
Lexical Analysis
Lexical analysis, also known as scanning, is the process of reading a corpus of text composed of characters. This can be done for several purposes, such as data input, linguistic analysis of written material (such as word frequency counting), or as part of language compilation or interpretation. When done as part of compilation it is one (and usually the first) of a sequence of phases that include parsing, semantic analysis, code generation, optimisation and such. When writing compilers, generator tools are usually used, so if you wanted to write a compiler in Java, a Java lexer generator and a Java parser generator would often be used to create the Java code for those compiler components. Sometimes the lexer and parser are hand-written, but that is not a recommended task for a novice. It would take a compiler-writing specialist to build a compiler by hand that is better than one built with a tool-set. Sometimes, as a class exercise, students are asked to write code to perform a piece of lexical analysis to help them understand the process, but this is often for only a few lexemes, like your digit exercise.
The term lexeme is used to describe a sequence of characters that compose an individual entity recognised by a lexical analyser. Once recognised it is usually represented by a token. The lexeme is therefore replaced by a token as part of the lexical analysis process. A lexical analyser will sometimes record the lexeme in a symbol table for later use before replacing it by the token. This is how identifiers in programs are often recorded in a compiler.
There are several tools for building lexers in Java. Two of the most common are JLex and JFlex. To illustrate how they work, to recognise an integer whilst skipping whitespace, we would use the following rules:
%%
WHITE_SPACE_CHAR=[\n\ \t\b\012]
DIGIT=[0-9]
%%
{WHITE_SPACE_CHAR}+ { }
{DIGIT}+ { return(new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length())); }
%%
which would be processed by the tool to produce Java methods to achieve that task.
The notations used to describe the lexemes are usually written as regular expressions. Computer Science theory can help us with the programming of a lexical analyser. Regular expressions can be represented by a form of finite state automata. There is a particular style of coding for matching lexemes that experienced programmers would recognise and use in this situation, which involves a switch inside a loop:
while ( ! eof ) {
switch ( next_symbol() ) {
case symbol:
...
break;
default:
error(diagnostic); break;
}
}
It is often these concepts that a simple lexical programming exercise is intended to introduce to students.
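To make the switch-inside-a-loop style concrete, here is a minimal Java sketch of it for the digit exercise. TinyLexer and its method names are made up for illustration. It skips whitespace, collects runs of the digits 0-9 as lexemes, and reports anything else as an error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer illustrating the "switch inside a loop" style.
class TinyLexer {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns the number lexemes found in the input, in order.
    static List<String> numbers(String input) {
        List<String> lexemes = new ArrayList<String>();
        int pos = 0;
        while (pos < input.length()) {              // the loop: run until end of input
            char c = input.charAt(pos);
            switch (c) {                            // the switch: one case per symbol class
                case ' ': case '\t': case '\n': case '\r':
                    pos++;                          // whitespace is skipped
                    break;
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9': {
                    int start = pos;                // consume the whole run of digits
                    while (pos < input.length() && isDigit(input.charAt(pos))) {
                        pos++;
                    }
                    lexemes.add(input.substring(start, pos));
                    break;
                }
                default:
                    throw new IllegalArgumentException(
                            "Unexpected character '" + c + "' at position " + pos);
            }
        }
        return lexemes;
    }

    public static void main(String[] args) {
        System.out.println(numbers("  12 345\n6")); // [12, 345, 6]
    }
}
```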
Tokenizing in Java
With all those preliminary explanations out of the way, let's get down to your piece of Java code. As mentioned in the comments, there is a difference in Java between reading bytes from an input stream and reading characters, as Java characters are 16-bit Unicode (UTF-16) code units rather than single bytes. You have used a byte read within a character-processing method.
Recognising simple tokens in an input stream, particularly for data entry, is such a common activity that Java has a specific built-in class for it called StreamTokenizer.
We could implement your task in the following way, for example:
// create a new tokenizer
Reader r = new BufferedReader(new InputStreamReader( System.in ));
StreamTokenizer st = new StreamTokenizer(r);
// print the stream tokens
boolean eof = false;
do {
int token = st.nextToken();
switch (token) {
case StreamTokenizer.TT_EOF:
System.out.println("End of File encountered.");
eof = true;
break;
case StreamTokenizer.TT_EOL:
System.out.println("End of Line encountered.");
break;
case StreamTokenizer.TT_NUMBER:
System.out.println("Number: " + st.nval);
break;
default:
System.out.println((char) token + " encountered.");
if (token == '!') {
eof = true;
}
}
} while (!eof);
However, this does not return the string of the lexeme for a number, only matches the number and gets the value.
I see you have noticed the Java class java.util.Scanner, because your question had that as a tag. This is another class that can perform similar operations.
We could get an integer lexeme from the input like this:
Scanner s = new Scanner(System.in);
System.out.println(s.nextInt());
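If the lexeme itself is wanted as a String rather than the parsed value, Scanner.findInLine can return the matched text. A small sketch follows; LexemeDemo and firstNumberLexeme are made-up names, and the sample input replaces System.in so the result is predictable.

```java
import java.util.Scanner;

class LexemeDemo {
    // Returns the first run of decimal digits on the current line,
    // or null if the line contains none. Leading non-digits are skipped.
    static String firstNumberLexeme(Scanner s) {
        return s.findInLine("\\d+");
    }

    public static void main(String[] args) {
        Scanner s = new Scanner("abc 0042 def");
        System.out.println(firstNumberLexeme(s)); // 0042
    }
}
```

Unlike nextInt(), this keeps leading zeros, because the text is returned exactly as it appeared in the input.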
Solution
Finally, let's rewrite your original code to find the lexeme for an integer, skipping over any unwanted characters, for which I use Java regular expression matching.
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public class ReadNumbers {
    static InputStreamReader in = null; // input source, kept as a global
    static int value = -1;              // current input character, or -1 at end of input

    public static void main(String[] args) {
        try {
            in = new InputStreamReader(System.in); // set up the input
            value = in.read();                     // pre-fill the input state
            System.out.println(scanNumbers());
        } catch (IOException e) {
            e.printStackTrace(); // print error
        }
    }

    private static String scanNumbers() throws IOException {
        String skipCharacters = "\\s"; // characters that can be skipped
        String result = "";            // empty string to store the lexeme
        // Skip optional characters before the number
        while ((value != -1) && Pattern.matches(skipCharacters, "" + (char) value)) {
            value = in.read(); // pre-load the next character
        }
        // Now collect the digits of the number
        while ((value != -1) && isDigit((char) value)) {
            result = result + (char) value; // append digit character to result
            value = in.read();              // pre-load the next character
        }
        return result;
    }

    public static boolean isDigit(char input) {
        return (input >= '0' && input <= '9');
    }
}
Afterword
The comment from @markspace is interesting and useful, as it points out that not all numbers are solely decimal digits.
Consider numbers in other bases, like hexadecimal. Java allows integer constants to be specified in those number bases, which do not just use the digits 0..9.

Java pattern matching going to infinite loop

A friend gave me this piece of code and said there is a bug. And yes, this code runs forever.
The answer I got is:
It runs for >10^15 years before printing anything.
public class Match {
public static void main(String[] args) {
Pattern p = Pattern.compile("(aa|aab?)+");
int count = 0;
for(String s = ""; s.length() < 200; s += "a")
if (p.matcher(s).matches())
count++;
System.out.println(count);
}
}
I don't really understand why I am seeing this behavior. I am new to Java; do you have any suggestions?
The pattern you are using is known as an evil regex according to OWASP (they know what they're talking about most of the time):
https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS
It basically matches aa, or else aa or aab (the b is optional because of the ?), so the two alternatives overlap.
A regex like this is vulnerable to a ReDoS, or Regex Denial of Service, attack.
So yes, sort out what you want to match. I suggest that in the above example you simply match aa, with no need for groups, repetition or alternation (note that with matches() this accepts only the exact string aa):
Pattern p = Pattern.compile("aa");
Also, as someone pointed out (who has since deleted his post), you should not use += to append to strings. You should use a StringBuffer instead:
import java.util.regex.Pattern;

public class Match {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("aa");
        StringBuffer buffy = new StringBuffer(200);
        int count = 0;
        for (int i = 0; i < 200; i++) {
            buffy.append("a");
            if (p.matcher(buffy.toString()).matches()) {
                count++;
            }
        }
        System.out.println(count);
    }
}
The regular expression (aa|aab?)+ is one that takes an especially long time for the regular expression engine to handle. These are colorfully called evil regexes. It is similar to the (a|aa)+ example at the link. This particular one is very slow on a string composed entirely of the letter a.
What this code does is check the evil regex against increasingly long strings of a's, up to length 200, so it certainly ought to take a long time, and it doesn't print until the loop ends. I'd be interested to know where the 10^15 years figure came from.
Edit
OK, the 10^15 (and in fact the entire piece of code in the question) comes from this talk, slide 37. Thanks to zengr for that link. The most relevant piece of information to the question is that the check for this regex takes time that is exponential in the length of the string. Specifically it's O(2^(n/2)), so it takes 2^99 (or so) times longer to check the last string than the first one.
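One way to see that the cost comes from the overlapping alternatives rather than from the set of strings being matched: both alternatives start with aa and differ only in the optional b, so (aab?)+ accepts exactly the same strings but gives the engine only one way to consume each aa. The following is a sketch of that rewrite, not something taken from the talk. MatchFixed and countMatches are illustrative names.

```java
import java.util.regex.Pattern;

class MatchFixed {
    // Counts how many of the strings "", "a", "aa", ... (lengths 0 to maxLength - 1)
    // are fully matched by the given regex.
    static int countMatches(String regex, int maxLength) {
        Pattern p = Pattern.compile(regex);
        StringBuilder s = new StringBuilder();
        int count = 0;
        for (int len = 0; len < maxLength; len++) {
            if (p.matcher(s).matches()) {
                count++;
            }
            s.append('a'); // StringBuilder avoids rebuilding the string with +=
        }
        return count;
    }

    public static void main(String[] args) {
        // Same loop bounds as the question, but with the unambiguous pattern.
        System.out.println(countMatches("(aab?)+", 200)); // 99
    }
}
```

The result is 99 because only the even lengths from 2 to 198 match. The odd lengths are the ones that make the original pattern explore an exponential number of ways to split the string before giving up.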

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotations ("79, 869"), I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotations, where the first two numbers are separated by a dash and the third by a comma ("56-57, 568"), then I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!
I recommend using the version of indexOf that actually takes a char rather than a string, since this method is much faster. (It is a simple loop, without a nested loop.)
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.
The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568
Although it gets a lot of hate these days, I still really like the StringTokenizer for this kind of stuff. You can set it up to return the delimiters as tokens and, at least to me, it makes the processing trivial without involving regexes.
You'd have to create it using ",- as your delimiters, then just kick it off in a loop.
StringTokenizer st = new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while (st.hasMoreTokens()) {
String token = st.nextToken();
Each case becomes its own little part of the loop:
// Use punctuation to set flags that tell you how to interpret the numbers.
if (token.equals("\"")) {
isQuoted = !isQuoted;
} else if (token.equals(",")) {
...
} else if (...) {
...
} else { // The punctuation has been dealt with, must be a number group
// Apply flags to determine how to parse this number.
}
}
I realize that StringTokenizer is outdated now, but I'm not really sure why. Parsing regular expressions can't be faster and the syntax is--well split is a pretty sweet syntax I gotta admit.
I guess if you and everyone you work with is really comfortable with Regular Expressions you could replace that with split and just iterate over the resultant array but I'm not sure how to get split to return the punctuation--probably that "+" thing from other answers but I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.
Will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
By the way, this is faster than contains, because contains is itself implemented with indexOf.
Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.
I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).
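A sketch of the three-regex idea in Java, tightened to digits and to the quoted formats shown in the question. CsvLineParser and extract are hypothetical names, and the patterns assume exactly one space after the comma, as in the samples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvLineParser {
    // Tried in this order, most specific first; group 1 is the number to keep.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("\"\\d+-\\d+, (\\d+)\""), // "56-57, 568" -> 568
        Pattern.compile("\"(\\d+), \\d+\""),      // "79, 869"    -> 79
        Pattern.compile("(\\d+)")                 // 833          -> 833
    };

    // Returns the number of interest, or null if the line has none of the three formats.
    static String extract(String line) {
        String trimmed = line.trim();
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(trimmed);
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract("833"));              // 833
        System.out.println(extract("\"79, 869\""));      // 79
        System.out.println(extract("\"56-57, 568\""));   // 568
    }
}
```

Because matches() requires the whole line to fit a pattern, a line in an unexpected format yields null instead of a silently wrong number.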

String capitalize - better way

What method of capitalizing is better?
mine:
char[] charArray = string.toCharArray();
charArray[0] = Character.toUpperCase(charArray[0]);
return new String(charArray);
or
commons lang - StringUtils.capitalize:
return new StringBuffer(strLen)
.append(Character.toTitleCase(str.charAt(0)))
.append(str.substring(1))
.toString();
I think mine is better, but I would rather ask.
I guess your version will be a little bit more performant, since it does not allocate as many temporary String objects.
I'd go for this (assuming the string is not empty):
StringBuilder strBuilder = new StringBuilder(string);
strBuilder.setCharAt(0, Character.toUpperCase(strBuilder.charAt(0)));
return strBuilder.toString();
However, note that they are not equivalent in that one uses toUpperCase() and the other uses toTitleCase().
From a forum post:
Titlecase <> uppercase
Unicode defines three kinds of case mapping: lowercase, uppercase, and titlecase. The difference between uppercasing and titlecasing a character or character sequence can be seen in compound characters (that is, a single character that represents a compound of two characters).
For example, in Unicode, character U+01F3 is LATIN SMALL LETTER DZ. (Let us write this compound character using ASCII as "dz".) This character uppercases to character U+01F1, LATIN CAPITAL LETTER DZ. (Which is basically "DZ".) But it titlecases to character U+01F2, LATIN CAPITAL LETTER D WITH SMALL LETTER Z. (Which we can write "Dz".)
character  uppercase  titlecase
---------  ---------  ---------
dz         DZ         Dz
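The difference described in the quote can be checked directly with the two Character methods. In this small sketch, TitleCaseDemo and describe are illustrative names only.

```java
class TitleCaseDemo {
    // Formats a char as its Unicode code point, e.g. U+01F3.
    static String describe(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        char dz = '\u01F3'; // LATIN SMALL LETTER DZ
        System.out.println(describe(Character.toUpperCase(dz))); // U+01F1
        System.out.println(describe(Character.toTitleCase(dz))); // U+01F2
    }
}
```

For plain ASCII letters the two methods give the same result, which is why the difference rarely shows up in tests.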
If I were to write a library, I'd try to make sure I got my Unicode right before worrying about performance. Off the top of my head:
int len = str.length();
if (len == 0) {
return str;
}
int head = Character.toUpperCase(str.codePointAt(0));
String tail = str.substring(str.offsetByCodePoints(0, 1));
// String has no (int[]) constructor; the code point form takes an offset and a count
return new String(new int[] { head }, 0, 1).concat(tail);
(I'd probably also look up the difference between title and upper case before I committed.)
Performance is equal.
Your code copies the char[] when calling string.toCharArray() and again in new String(charArray).
The Apache code copies on buffer.append(str.substring(1)) and buffer.toString(). The Apache code has an extra String instance that holds the char[1, length] content, but its characters are not copied when that String instance is created.
StringBuffer is declared to be thread safe, so it might be less efficient to use it (but one shouldn't bet on it before actually doing some practical tests).
StringBuilder (from Java 5 onwards) is faster than StringBuffer if you don't need it to be thread safe but as others have said you need to test if this is better than your solution in your case.
Have you timed both?
Honestly, they're equivalent.. so the one that performs better for you is the better one :)
Not sure what the difference between toUpperCase and toTitleCase is, but it looks as if your solution requires one less instantiation of the String class, while the commons lang implementation requires two (substring and toString create new Strings I assume, since String is immutable).
Whether that's "better" (I guess you mean faster) I don't know. Why don't you profile both solutions?
Look at this question: titlecase-conversion. Apache FTW.
/**
 * Capitalize the first letter of a string.
 *
 * @param s the string to capitalize
 * @return the capitalized string
 */
public static String capitalizeFirst(String s) {
if (s == null || s.length() == 0) {
return "";
}
char first = s.charAt(0);
if (Character.isUpperCase(first)) {
return s;
} else {
return Character.toUpperCase(first) + s.substring(1);
}
}
If you only capitalize a limited set of words, you had better cache the results.
@Test
public void testCase()
{
String all = "At its base, a shell is simply a macro processor that executes commands. The term macro processor means functionality where text and symbols are expanded to create larger expressions.\n" +
"\n" +
"A Unix shell is both a command interpreter and a programming language. As a command interpreter, the shell provides the user interface to the rich set of GNU utilities. The programming language features allow these utilities to be combined. Files containing commands can be created, and become commands themselves. These new commands have the same status as system commands in directories such as /bin, allowing users or groups to establish custom environments to automate their common tasks.\n" +
"\n" +
"Shells may be used interactively or non-interactively. In interactive mode, they accept input typed from the keyboard. When executing non-interactively, shells execute commands read from a file.\n" +
"\n" +
"A shell allows execution of GNU commands, both synchronously and asynchronously. The shell waits for synchronous commands to complete before accepting more input; asynchronous commands continue to execute in parallel with the shell while it reads and executes additional commands. The redirection constructs permit fine-grained control of the input and output of those commands. Moreover, the shell allows control over the contents of commands’ environments.\n" +
"\n" +
"Shells also provide a small set of built-in commands (builtins) implementing functionality impossible or inconvenient to obtain via separate utilities. For example, cd, break, continue, and exec cannot be implemented outside of the shell because they directly manipulate the shell itself. The history, getopts, kill, or pwd builtins, among others, could be implemented in separate utilities, but they are more convenient to use as builtin commands. All of the shell builtins are described in subsequent sections.\n" +
"\n" +
"While executing commands is essential, most of the power (and complexity) of shells is due to their embedded programming languages. Like any high-level language, the shell provides variables, flow control constructs, quoting, and functions.\n" +
"\n" +
"Shells offer features geared specifically for interactive use rather than to augment the programming language. These interactive features include job control, command line editing, command history and aliases. Each of these features is described in this manual.";
String[] split = all.split("[\\W]");
// 10000000
// upper Used 606
// hash Used 114
// 100000000
// upper Used 5765
// hash Used 1101
HashMap<String, String> cache = Maps.newHashMap();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000000; i++)
{
String upper = split[i % split.length].toUpperCase();
// String s = split[i % split.length];
// String upper = cache.get(s);
// if (upper == null)
// {
// cache.put(s, upper = s.toUpperCase());
//
// }
}
System.out.println("Used " + (System.currentTimeMillis() - start));
}
The text is picked from here.
Currently, I need to upper-case table and column names many, many times, but the set of names is limited, so caching them in a HashMap works better.
:-)
Use this method to capitalize a string. It capitalizes the first letter of every space-separated word and works for me without any bugs:
public String capitalizeString(String value)
{
String string = value;
String capitalizedString = "";
System.out.println(string);
for(int i = 0; i < string.length(); i++)
{
char ch = string.charAt(i);
if(i == 0 || string.charAt(i-1)==' ')
ch = Character.toUpperCase(ch);
capitalizedString += ch;
}
return capitalizedString;
}
