Java Regular Expressions using Pattern and Matcher - java

My question is about regular expressions in Java, in particular finding multiple matches for a given search pattern. All of the info I need is on one line, which contains aliases (e.g. SA) that each map to an IP address. The entries are separated by commas, and I need to extract each one.
SA "239.255.252.1", SB "239.255.252.2", SC "239.255.252.3", SD "239.255.252.4"
My Reg Ex looks like this:
Pattern alias = Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");
Matcher match = alias.matcher(lineInFile);
while(match.find()) {
// do something
}
This works but I'm not totally happy with it because since introducing this small piece of code, my program has slowed down a bit (< 1 sec) but enough to notice a difference.
So my question is, am I going about this in the correct manner? Is there a more efficient or possibly lightweight solution without the need for a while(match) loop? and/or Pattern/Matcher classes?

If the line may not contain anything except that alias definition, then using .matches() instead of .find() might speed up the searching on non-matches.

You can improve your regex to: "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying an IP address more explicitly.
Try out the performance of using a StringTokenizer. It does not use regular expressions.
(If you are concerned about using a legacy class, then take a look at its source and see how it is done.)
StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
while(st.hasMoreTokens()){
String key = st.nextToken();
String ip = st.nextToken();
System.out.println(key + " ip: " + ip);
}

I don't know if this will yield a big performance benefit, but you could also first do
string.split(", ") // separate groups
and then
string.split(" ?\"") // separate alias from IP address
on the matches.
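A minimal sketch of that two-step split, assuming the line always has the `ALIAS "ip", ALIAS "ip"` shape from the question. The class and method names (AliasSplitSketch, parse) are made up for illustration, and the delimiters are slightly relaxed to tolerate extra spaces:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AliasSplitSketch {
    // Parses lines such as: SA "239.255.252.1", SB "239.255.252.2"
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        // First split: one entry per comma-separated group.
        for (String entry : line.split(",\\s*")) {
            // Second split: the alias sits before the opening quote,
            // the IP between the quotes (trailing empty strings are dropped).
            String[] parts = entry.trim().split("\\s*\"");
            if (parts.length >= 2) {
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }
}
```

Note that String.split() still uses the regex machinery for these delimiters, so it is worth measuring before assuming it is faster than the original loop.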

Precompiling and reusing the Pattern object is (IMO) likely to be the most effective optimization. Pattern compilation is potentially an expensive step.
Reusing the Matcher instance (e.g. using reset(CharSequence)) might help, but I doubt that it will make much difference.
The regex itself cannot be optimized significantly. One possible speedup would be to replace (\d+\.\d+\.\d+\.\d+) with ([0-9\.]+). This might help because it reduces the number of potential backtrack points ... but you'd need to do some experiments to be sure. And the obvious downside is that it matches character sequences that are not valid IP addresses.
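Roughly, the three suggestions combined could look like the sketch below. AliasScanner and scan are hypothetical names, and the looser [0-9.]+ group accepts things that are not valid IP addresses, as noted above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AliasScanner {
    // Compiled once when the class loads instead of once per line.
    private static final Pattern ALIAS =
            Pattern.compile("(\\S+)\\s+\"([0-9.]+)\"");

    // A single Matcher, re-pointed at each new line with reset(CharSequence).
    // Matcher is not thread-safe, so one AliasScanner belongs to one thread.
    private final Matcher matcher = ALIAS.matcher("");

    // Returns {alias, ip} pairs in the order they appear on the line.
    List<String[]> scan(String line) {
        List<String[]> pairs = new ArrayList<>();
        matcher.reset(line);
        while (matcher.find()) {
            pairs.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return pairs;
    }
}
```

Whether reusing the Matcher buys anything measurable is something only a benchmark on your own data can tell.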

If you're noticing a difference of < 1 sec on that piece of code, then your input string must contain around a million (or at least some 100k) entries. I think that's a pretty fair performance and I cannot see how you could significantly optimize that without writing your own specialized parser.

I'm afraid your code looks pretty efficient already.
Here's my version:
Matcher match = Pattern
.compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
.matcher(lineInFile);
while(match.find()) {
//do something
}
There are two micro-optimizations:
1) No need to keep pattern in an extra variable, inlined that
2) For the alias, search for word characters, not non-space characters
Actually, if you do a lot of processing like this and the pattern never changes, you should keep the compiled pattern in a constant:
private static final Pattern PATTERN = Pattern
.compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");
Matcher match = PATTERN.matcher(lineInFile);
while(match.find()) {
//do something
}
Update: I took some time on RegExr to come up with a much more specific pattern, which should only detect valid IP addresses as a bonus. I know it's ugly as hell, but my guess is that it's pretty efficient, as it eliminates most of the backtracking:
([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
{3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))
(Wrapped for readability, all back-slashes need to be escaped in java, but you can test it on RegExr as it is with the OP's test string)
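For comparison, here is a different way to write an octet-checking pattern in Java source, with the escaping already done. This is not the pattern above but a separate sketch built from one OCTET fragment that is repeated four times; the names StrictAliasPattern, OCTET and firstIp are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictAliasPattern {
    // One octet: 250-255, 200-249, 100-199, or 0-99.
    private static final String OCTET =
            "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Group 1 is the alias, group 2 the dotted-quad address.
    static final Pattern ALIAS = Pattern.compile(
            "([A-Z]+)\\s*\"((?:" + OCTET + "\\.){3}" + OCTET + ")\"");

    // Returns the IP of the first alias found, or null if nothing matches.
    static String firstIp(String line) {
        Matcher m = ALIAS.matcher(line);
        return m.find() ? m.group(2) : null;
    }
}
```

Building the pattern from a named fragment keeps it readable, at no runtime cost since the concatenation happens once.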

Related

Java - Parsing strings - String.split() versus Pattern & Matcher

Given a String containing a comma delimited list representing a proper noun & category/description pair, what are the pros & cons of using String.split() versus Pattern & Matcher approach to find a particular proper noun and extract the associated category/description pair?
The haystack String format will not change. It will always contain comma delimited data in the form of
PROPER_NOUN|CATEGORY/DESCRIPTION
Common variables for both approaches:
String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;
Using String.split():
for (String current : haystack.split(","))
if (current.contains(needle))
{
result=current.split("\\|")[1];
break; // *edit* Not part of original code - added in response to comment from Pshemo
}
Using Pattern & Matcher:
Pattern pattern = Pattern.compile("(" +needle+ "\\|)(\\w+/\\w+)");
Matcher matches = pattern.matcher(haystack);
if (matches.find())
result=matches.group(2);
Both approaches provide the information I require.
I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project so this approach will require imports from java.util.regex
And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.
Thank you for your time!
Conclusion
I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.
For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:
.split()/.contains()/.split(): 304,212,973
Pattern/Matcher w/ Pattern.compile() invoked for each iteration: 230,511,000
Pattern/Matcher w/ Pattern.compile() invoked prior to iteration: 111,545,646
In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.
Most String methods that take regular expressions (such as matches(), split(), replaceAll(), etc.) use Pattern/Matcher internally. They therefore typically compile a Pattern and create a Matcher on every call, which is inefficient when done in a large loop.
Thus if you really want speed, you can use Matcher/Pattern directly and ideally only create a single Matcher object.
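As a rough sketch of what that could look like for the haystack in the question: the pattern is compiled once per needle and reused for every lookup. CategoryLookup is a hypothetical class name; Pattern.quote() and the (?:^|,) prefix are my additions so that the needle is treated as literal text and only matches at the start of an entry:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CategoryLookup {
    private final Pattern pattern;

    CategoryLookup(String needle) {
        // Compiled once here, not on every lookup.
        // Pattern.quote() keeps regex metacharacters in the needle literal.
        this.pattern = Pattern.compile(
                "(?:^|,)" + Pattern.quote(needle) + "\\|(\\w+/\\w+)");
    }

    // Returns CATEGORY/DESCRIPTION for the needle, or null if it is absent.
    String find(String haystack) {
        Matcher m = pattern.matcher(haystack);
        return m.find() ? m.group(1) : null;
    }
}
```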
There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.
You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.
When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
I would say that the split() version is much better here due to the following reasons:
The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
Regular expressions are more complex, and therefore the code becomes more error-prone.

How do I determine if a string is not a regular expression?

I am trying to improve the performance of some code. It looks something like this:
public boolean isImportant(String token) {
for (Pattern pattern : patterns) {
if (pattern.matcher(token).find()) return true;
}
return false;
}
What I noticed is that many of the Patterns seem to be simple string literals with no regular expression constructs. So I want to simply store these in a separate list (importantList) and do an equality test instead of performing a more expensive pattern match, such as follows:
public boolean isImportant(String token) {
if (importantList.contains(token)) return true;
for (Pattern pattern : patterns) {
if (pattern.matcher(token).find()) return true;
}
return false;
}
How do I programmatically determine if a particular string contains no regular expression constructs?
Edit:
I should add that the answer doesn't need to be performance-sensitive (i.e. regular expressions can be used). I'm mainly concerned with the performance of isImportant() because it's called millions of times, while the initialization of the patterns is only done once.
I normally hate answers that say this but...
Don't do that.
It probably won't make the code run faster; in fact, it might even cause the program to take more time.
If you really need to optimize your code, there are likely much, much more effective places to look.
It's going to be difficult. You can check for the non-presence of any regex metacharacters; that should be a good approximation:
Pattern regex = Pattern.compile("[$^()\\[\\]{}.*+?|\\\\]");
Matcher regexMatcher = regex.matcher(subjectString);
boolean regexIsLikely = regexMatcher.find();
Whether it's worth it is another question. Are you sure a regex match is slower than a list lookup (especially since you'll be doing a regex match after that in many cases anyway)? I'd bet it's much faster to just keep the regex match.
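If you do want to try it, the one-time partitioning could be sketched as below. All names (ImportantTokens, META, literals) are made up. One assumption to be aware of: a set lookup is a whole-token equality test, so the sketch uses matches() for the remaining patterns to keep both paths consistent; it cannot reproduce the substring semantics of find().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class ImportantTokens {
    // Any of these characters suggests the source string is a real regex.
    private static final Pattern META =
            Pattern.compile("[$^()\\[\\]{}.*+?|\\\\]");

    private final Set<String> literals = new HashSet<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Done once at start-up: plain strings go to the set, the rest are compiled.
    ImportantTokens(List<String> sources) {
        for (String source : sources) {
            if (META.matcher(source).find()) {
                patterns.add(Pattern.compile(source));
            } else {
                literals.add(source);
            }
        }
    }

    // Hot path: hash lookup first, regex matching only if that misses.
    boolean isImportant(String token) {
        if (literals.contains(token)) return true;
        for (Pattern p : patterns) {
            if (p.matcher(token).matches()) return true;
        }
        return false;
    }
}
```

A HashSet is used rather than a list so the literal check stays constant-time however many literals there are.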
There is no way to determine this, since every regex pattern is just a string. Furthermore, there is nearly no performance difference, as regex engines are smart nowadays, and I'm pretty sure that if the pattern and source lengths are the same, an equality check is the first thing that will be done.
This is wrong
for (Pattern pattern : patterns)
you should create one big regex that ORs all patterns; then for each input you only match once.
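A minimal sketch of that idea, assuming the patterns are available as source strings and share the same flags. CombinedPatterns and combine are hypothetical names. Each source is wrapped in a non-capturing group so the alternation does not change its meaning, but numbered backreferences inside the individual patterns would still break, because group numbers shift once the patterns are joined:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CombinedPatterns {
    // Joins the sources as (?:p1)|(?:p2)|... and compiles the result once.
    static Pattern combine(List<String> sources) {
        String joined = sources.stream()
                .map(source -> "(?:" + source + ")")
                .collect(Collectors.joining("|"));
        return Pattern.compile(joined);
    }
}
```

isImportant() then becomes a single combined.matcher(token).find() call.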

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.
For now, I've got something like this:
Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
So, when value of "string" is variable*func()*20
printout is:
variable
func
Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:
Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
if(matcher.group(3).isEmpty()) {
System.out.println(matcher.group(2));
}
}
It works, the printout is correct, but I don't like the additional check. Any ideas? Please?
EDIT (2011-04-12):
Thank you for all the answers. There were questions about why I would need something like this. And you are right: for bigger, more complicated scripts, the only sane solution would be parsing them. In my case, however, this would be excessive. The scraps of JS I'm working on are intended to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need the list of variables to check whether they can be initialized at this point (and whether they are initialized at all). I realize that all of this could be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped in a bigger script and evaluated in a web browser, so it's more convenient this way.
This may be a bit dirty, but it's assumed that whoever is writing these formulas (probably me, most of the time) knows what they are doing and is able to check that the formulas work correctly.
Anyone who finds this question and wants to do something similar should know the risks/difficulties. I do, at least I hope so ;)
Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):
Pattern regex = Pattern.compile(
"\\b # word boundary\n" +
"[A-Za-z]# 1 ASCII letter\n" +
"\\w* # 0+ alnums\n" +
"\\b # word boundary\n" +
"(?! # Lookahead assertion: Make sure there is no...\n" +
" \\s* # optional whitespace\n" +
" \\( # opening parenthesis\n" +
") # ...at this position in the string",
Pattern.COMMENTS);
This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
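Applied to the formula from the question, the same pattern (written on one line, without the COMMENTS flag) could be used like this. FormulaVariables and variables are names I made up for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormulaVariables {
    // An identifier that is not followed by optional whitespace and "(".
    private static final Pattern IDENTIFIER_NOT_CALLED =
            Pattern.compile("\\b[A-Za-z]\\w*\\b(?!\\s*\\()");

    // Collects every identifier that is not used as a function call.
    static List<String> variables(String formula) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENTIFIER_NOT_CALLED.matcher(formula);
        while (m.find()) {
            names.add(m.group());  // group() is the same as group(0)
        }
        return names;
    }
}
```

For the input variable*func()*20 this collects only variable, with no extra check inside the loop.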
If you are rethinking using regex and wondering what else you could do, you could consider using an AST instead to access your source programmatically. This answer shows you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do similar for Javascript.
A regex won't cut it in this case because Java isn't regular. Your best bet is to get a parser that understands Java syntax and build onto that. Luckily, ANTLR has a Java 1.6 grammar (and 1.5 grammar).
For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve but this will probably be your best bet for a quick and accurate solution.
It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags
As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shouldn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
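One way to see the backtracking for yourself is to count how often the engine reads a character. The sketch below wraps the input in a CharSequence that counts charAt() calls. It assumes a standard desktop JDK, where Matcher reads its input through charAt(); Android's regex implementation may copy the input instead, in which case the counter would not see the reads. BacktrackingCounter and readsFor are invented names:

```java
import java.util.regex.Pattern;

final class BacktrackingCounter implements CharSequence {
    private final String text;
    long reads;  // number of charAt() calls made while matching

    BacktrackingCounter(String text) { this.text = text; }

    @Override public char charAt(int index) { reads++; return text.charAt(index); }
    @Override public int length() { return text.length(); }
    @Override public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }
    @Override public String toString() { return text; }

    // Runs find() once and reports how many characters were inspected.
    static long readsFor(String regex, String input) {
        BacktrackingCounter counted = new BacktrackingCounter(input);
        Pattern.compile(regex).matcher(counted).find();
        return counted.reads;
    }
}
```

Comparing readsFor("a*a*a*a*a*a*b", input) with readsFor("a*+b", input) on the 20-character input above shows how much work the possessive version saves; both fail to match, but the second cannot backtrack into the run of a's.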
How long is the string you are trying to parse?
How long and how complicated is the regex you are trying to match?
Have you tried breaking your regex down into simpler bits? Adding the bits back one after another will let you see when it breaks, and maybe why.
Build a simple regex such as [a-zA-Z]* and pass it as the argument to compile(); this example allows only lowercase and uppercase letters.
Read my blogpost on android validation for more info.
I had the same issue and solved it by replacing every wildcard . with [\s\S]. I really don't know why it worked for me, but it did. I come from the JavaScript world, where I know that expression is faster to evaluate.

Regular expression performance in Java -- better few complex or many simple?

I am doing some fairly extensive string manipulations using regular expressions in Java. Currently, I have many blocks of code that look something like:
Matcher m = Pattern.compile("some pattern").matcher(text);
StringBuilder b = new StringBuilder();
int prevMatchIx = 0;
while (m.find()) {
b.append(text.substring(prevMatchIx, m.start()));
String matchingText = m.group(); //sometimes group(n)
//manipulate the matching text
b.append(matchingText);
prevMatchIx = m.end();
}
text = b.toString()+text.substring(prevMatchIx);
My question is which of the two alternatives is more efficient (primarily time, but space to some extent):
1) Keep many existing blocks as above (assuming there isn't a better way to handle such blocks -- I can't use a simple replaceAll() because the groups must be operated on).
2) Consolidate the blocks into one big block. Use a "some pattern" that is the combination of all the old blocks' patterns using the |/alternation operator. Then, use if/else if within the loop to handle each of the matching patterns.
Thank you for your help!
If the order in which the replacements are made matters, you would have to be careful when using technique #1. Allow me to give an example: If I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (like < to &lt;). Using technique #2, you would not have to worry about this because you are making all the replacements in one pass.
In terms of performance, I think #2 would be quicker because you would be doing fewer String concatenations. As always, you could implement both techniques and record their speed and memory consumption to find out for certain. :)
I'd suggest caching the patterns and having a method that uses the cache.
Patterns are expensive to compile so at least you will only compile them once and there is code reuse in using the same method for each instance. Shame about the lack of closures though as that would make things a lot cleaner.
private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();
static Pattern findPattern(String patStr) {
if (! patterns.containsKey(patStr))
patterns.put(patStr, Pattern.compile(patStr));
return patterns.get(patStr);
}
public interface MatchProcessor {
public void process(String field);
}
public static void processMatches(String text, String pat, MatchProcessor processor) {
Matcher m = findPattern(pat).matcher(text);
int startInd = 0;
while (m.find(startInd)) {
processor.process(m.group());
startInd = m.end();
}
}
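The HashMap cache above is not safe if several threads call findPattern() at once. A hedged variant, assuming Java 8 or later, is to let ConcurrentHashMap do the check-and-insert atomically (PatternCache and get are made-up names):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles each distinct regex at most once, even under concurrent use.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }
}
```

Compiled Pattern objects are immutable and can be shared between threads; only the Matcher instances created from them must stay thread-local.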
Last time I was in your position I used a product called jflex.
Java's regex doesn't provide the traditional O(N log M) performance guarantees of true regular expression engines (for input strings of length N, and patterns of length M). Instead it inherits from its perl roots exponential time for some patterns. Unfortunately these pathological patterns, while rare in normal use, are all too common when combining regexes as you propose to do (I can attest to this from personal experience).
Consequently, my advice is to either:
a) pre-compile your patterns as "static final Pattern" constants, so they will be initialized once during class initialization (<clinit>); or
b) switch to a lexer package such as jflex, which will provide a more declarative, and far more readable, syntax to approach this sort of cascading/sequential regex processing; and
c) seriously consider using a parser generator package. My current favourite is Beaver, but CUP is also a good option. Both of these are excellent tools and I highly recommend both of them, and as they both sit on top of jflex you can add them as/when you need them.
That being said, if you haven't used a parser-generator before and you are in a hurry, it will be easier to get up to speed with JavaCC. Not as powerful as Beaver/CUP but its parsing model is easier to understand.
Whatever you do, please don't use Antlr. It is very fashionable, and has great cheerleaders, but its online documentation sucks, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc(v1).
Note: Yes I have used everything I have mentioned above, and more besides; so this advice comes from personal experience.
First, does this need to be efficient? If not, don't bother -- complexification won't help code maintainability.
Assuming it does, doing them separately is usually the most efficient. This is especially true if there are large blocks of literal text in the expressions: without alternation, those blocks can be used to speed up matching; with alternation, they can't help at all.
If performance is really critical, you can code it several ways and test with sample data.
Option #2 is almost certainly the better way to go, assuming it isn't too difficult to combine the regexes. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e., appendReplacement() and appendTail()), is also available for your use.
Taking the example that @mangst used, here's how you might process some text to be inserted into an XML document:
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
{
String test_in = "One < two & four > three.";
Pattern p = Pattern.compile("(&)|(<)|(>)");
Matcher m = p.matcher(test_in);
StringBuffer sb = new StringBuffer(); // (1)
while (m.find())
{
String repl = m.start(1) != -1 ? "&amp;" :
m.start(2) != -1 ? "&lt;" :
m.start(3) != -1 ? "&gt;" : "";
m.appendReplacement(sb, ""); // (2)
sb.append(repl);
}
m.appendTail(sb);
System.out.println(sb.toString());
}
}
In this very simple example, all I need to know about each match is which capture group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.
Note (1) As of JDK 1.6, we have to use a StringBuffer here because StringBuilder didn't exist yet when the Matcher class was written. StringBuilder overloads of appendReplacement() and appendTail() were eventually added in Java 9.
Note (2) appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the n'th capture group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.
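On Java 9 or later, the same single-pass replacement can be sketched more compactly with Matcher.replaceAll(Function). XmlEscaper, escape and replacementFor are hypothetical names. Matcher.quoteReplacement() plays the same role as the empty-string trick in note (2): it stops $ and \ in the replacement from being interpreted:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEscaper {
    private static final Pattern SPECIAL = Pattern.compile("[&<>]");

    // Maps one matched character to its XML entity.
    private static String replacementFor(String matched) {
        switch (matched) {
            case "&": return "&amp;";
            case "<": return "&lt;";
            default:  return "&gt;";
        }
    }

    // One pass over the text; requires Java 9+ for replaceAll(Function).
    static String escape(String text) {
        return SPECIAL.matcher(text).replaceAll(
                match -> Matcher.quoteReplacement(replacementFor(match.group())));
    }
}
```

Because every character is visited once, an ampersand produced by one replacement is never escaped a second time.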
