How to find multiple Pattern(s) (using Matcher) in Java

Suppose I have a String, say one two three one one two one.
Now I'm using a Pattern and a Matcher to find any specific Pattern in the String.
Like this:
Pattern findMyPattern = Pattern.compile("one");
Matcher foundAMatch = findMyPattern.matcher(myString);
while (foundAMatch.find()) {
    // do my stuff
}
But suppose I want to find multiple patterns. For the example String I took, I want to find both one and two. Now, it's quite a small String, so a probable solution is to use another Pattern and find my matches. But this is just a small example. Is there an efficient way to do it, instead of just trying it out in a loop over the whole set of Patterns?

Use the power of regular expressions: change the pattern to match one or two.
Pattern findMyPattern = Pattern.compile("one|two");
Matcher foundAMatch = findMyPattern.matcher(myString);
while (foundAMatch.find()) {
    // do my stuff
}

This won't solve my issue. This way we can only change (one|two) to one particular string,
but my requirement is to change
pattern one -> three and two -> four.
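For the per-match replacement asked about in the comment above, one possible approach is to keep the single one|two pattern and look up the replacement for each match while looping. This is only a sketch: the class name MultiReplace and the REPLACEMENTS map are made up for illustration, and Map.of assumes Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiReplace {
    // Hypothetical mapping taken from the comment: one -> three, two -> four.
    static final Map<String, String> REPLACEMENTS = Map.of("one", "three", "two", "four");
    static final Pattern ONE_OR_TWO = Pattern.compile("one|two");

    static String replaceEach(String input) {
        Matcher m = ONE_OR_TWO.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement guards against '$' or '\' in the replacement text.
            String replacement = REPLACEMENTS.get(m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "three four three three three four three"
        System.out.println(replaceEach("one two three one one two one"));
    }
}
```

The input is still scanned only once, so this stays close to the single-pattern answer while allowing a different replacement per alternative.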

Related

Java - Parsing strings - String.split() versus Pattern & Matcher

Given a String containing a comma delimited list representing a proper noun & category/description pair, what are the pros & cons of using String.split() versus Pattern & Matcher approach to find a particular proper noun and extract the associated category/description pair?
The haystack String format will not change. It will always contain comma delimited data in the form of
PROPER_NOUN|CATEGORY/DESCRIPTION
Common variables for both approaches:
String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;
Using String.split():
for (String current : haystack.split(",")) {
    if (current.contains(needle)) {
        result = current.split("\\|")[1];
        break; // *edit* Not part of original code - added in response to comment from Pshemo
    }
}
Using Pattern & Matcher:
Pattern pattern = Pattern.compile("(" + needle + "\\|)(\\w+/\\w+)");
Matcher matches = pattern.matcher(haystack);
if (matches.find())
    result = matches.group(2);
Both approaches provide the information I require.
I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project, so this approach will require imports from java.util.regex.
And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.
Thank you for your time!
Conclusion
I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.
For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:
.split()/.contains()/.split(): 304,212,973
Pattern/Matcher w/ Pattern.compile() invoked for each iteration: 230,511,000
Pattern/Matcher w/ Pattern.compile() invoked prior to iteration: 111,545,646
In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.
Most String methods that take a regular expression (such as matches(), split(), replaceAll(), etc.) use Pattern/Matcher internally. Each call therefore generally compiles the pattern and creates a new Matcher, which is inefficient when done inside a large loop.
Thus if you really want speed, you can use Matcher/Pattern directly and ideally only create a single Matcher object.
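To illustrate the reuse described above, here is a rough sketch that compiles the pattern once and resets a single Matcher for each input. The class name PlanetLookup and both method names are invented for this example; the regex follows the question's format, with the needle quoted so that it is matched literally.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlanetLookup {
    // Compiled once and reused for every lookup.
    private static final Pattern PLUTO =
            Pattern.compile(Pattern.quote("PLUTO") + "\\|(\\w+/\\w+)");

    // Returns the CATEGORY/DESCRIPTION part of the first PLUTO entry, or null.
    static String describe(String haystack) {
        Matcher m = PLUTO.matcher(haystack);
        return m.find() ? m.group(1) : null;
    }

    // Reuses one Matcher across many inputs via reset(CharSequence).
    static int countHits(List<String> haystacks) {
        Matcher m = PLUTO.matcher("");
        int hits = 0;
        for (String haystack : haystacks) {
            if (m.reset(haystack).find()) {
                hits++;
            }
        }
        return hits;
    }
}
```

A Matcher is not thread-safe, so the reset() trick only makes sense when the loop runs on a single thread; the compiled Pattern itself can be shared freely.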
There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.
You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.
When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
I would say that the split() version is much better here due to the following reasons:
The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
Regular expressions are more complex, and therefore the code becomes more error-prone.

How to get best match using java.util.regex.Pattern

Here is my use case. I have different file processing modules which are invoked based on the file name. So if the filename matches the pattern associated with a certain module, that module will pick up the file.
I have a catch all pattern defined which is used to do default processing, but this pattern should only kick in if I haven't got a better match.
Consider the following scenario
Pattern 1 - Sample_[0-9]*.xls
Pattern 2 - [a-zA-Z]*_[0-9]*.xls
Now given a file "Sample_11", I want Pattern 1 to be applied as it's a better match than Pattern 2; however, the method java.util.regex.Pattern.matcher().matches() just returns true or false.
Is there any way to identify what is the better match?
EDIT:
The patterns are defined outside the system (this is a weird use case), so I cannot order them as suggested by many. In a sense I am looking to infer from the results of matching whether that is the best match or not. Hope this clarifies my question.
Thanks,
Raam
Use the chain of responsibility design pattern. Loop (or iterate down a list) through each regex Pattern from most specific to least specific until you find one that matches. Then do the appropriate processing for that match.
Why is the Boolean not sufficient here? Your logic should be checking a more specific regex (or list of regex) first, going down the code path tied to whatever specific regex matches. It should only go on to the catch all if it found no match for the specific patterns. I think the Boolean should work fine for you unless there is more to your problem that I don't see.
Imagine a Map where the key is the pattern and the value is a custom interface for handling a match (let's call it MatchHandler). Iterate the map and if a pattern matches, invoke that MatchHandler. If no match, check the default pattern and if a match, invoke the default MatchHandler. If you needed ordered processing you could use a LinkedHashMap.
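The Map idea above could look roughly like this. It is a sketch under assumptions: the answer only names MatchHandler, so its single handle method, the FileDispatcher class, and the choice to treat the default handler as a plain fallback (rather than checking a default pattern) are all illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class FileDispatcher {
    /** Hypothetical callback; the answer only gives it the name MatchHandler. */
    interface MatchHandler {
        String handle(String fileName);
    }

    // LinkedHashMap keeps registration order, so earlier patterns win.
    private final Map<Pattern, MatchHandler> handlers = new LinkedHashMap<>();
    private final MatchHandler defaultHandler;

    FileDispatcher(MatchHandler defaultHandler) {
        this.defaultHandler = defaultHandler;
    }

    void register(String regex, MatchHandler handler) {
        handlers.put(Pattern.compile(regex), handler);
    }

    String dispatch(String fileName) {
        for (Map.Entry<Pattern, MatchHandler> entry : handlers.entrySet()) {
            if (entry.getKey().matcher(fileName).matches()) {
                return entry.getValue().handle(fileName);
            }
        }
        // Assumption: anything unmatched goes to the catch-all handler.
        return defaultHandler.handle(fileName);
    }
}
```

Registering the more specific pattern first is what makes it take precedence here; the map itself does nothing to rank patterns.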
Now if you won't know the patterns beforehand (and it sounds like that's the case for you) then things get a little more tricky. One possible answer would be to write another regex that evaluates the occurrences of general matching constructs in the pattern (things like [a-z], *, etc). Patterns with more occurrences of these general matching constructs will be less specific matches. It's not perfect but it could work for what you are doing. Just be sure to do a lot of escaping in this other pattern, since it is looking for regex-based constructs using regex itself.
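A very rough sketch of that scoring idea, assuming that "general constructs" means character classes, the dot, and the quantifiers *, + and ?. The Specificity class, the generality score and the bestMatch helper are all made up for illustration, and the heuristic ignores groups, alternation and {n,m} quantifiers.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Specificity {
    // Alternatives: an escaped character (skipped below), a character class,
    // or a single wildcard/quantifier character.
    private static final Pattern CONSTRUCT =
            Pattern.compile("\\\\.|\\[[^\\]]*\\]|[.*+?]");

    // Higher score means a more general (less specific) regex.
    static int generality(String regex) {
        Matcher m = CONSTRUCT.matcher(regex);
        int count = 0;
        while (m.find()) {
            if (m.group().charAt(0) != '\\') { // ignore escaped literals such as \.
                count++;
            }
        }
        return count;
    }

    // Among the regexes that match the input, returns the least general one.
    static String bestMatch(List<String> regexes, String input) {
        String best = null;
        for (String regex : regexes) {
            if (Pattern.matches(regex, input)
                    && (best == null || generality(regex) < generality(best))) {
                best = regex;
            }
        }
        return best;
    }
}
```

With the two patterns from the question, Sample_[0-9]*.xls scores 3 and [a-zA-Z]*_[0-9]*.xls scores 5, so the first one is picked for a file named Sample_11.xls regardless of the order in which the patterns arrive.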

How to retrieve portion of number that's within parenthesis in Java?

For part of my Java assignment I'm required to select all records that have a certain area code. I have custom objects within an ArrayList, like ArrayList<Foo>.
Each object has a String phoneNumber variable. They are formatted like "(555) 555-5555"
My goal is to search through each custom object in the ArrayList<Foo> (call it listOfFoos) and place the objects with area code "616" in a temporaryListOfFoos ArrayList<Foo>.
I have looked into tokenizers, but was unable to get the syntax correct. I feel like what I need to do is similar to the post "Ignore parentheses with string tokenizer?", but since I'm only trying to retrieve the first 3 digits (and I don't care about the remaining 7), that really didn't give me exactly what I was looking for.
What I did as a temporary work-around, was...
for (int i = 0; i < listOfFoos.size(); i++) {
    if (listOfFoos.get(i).getPhoneNumber().contains("616")) {
        tempListOfFoos.add(listOfFoos.get(i));
    }
}
This worked for our current dataset, however, if there was a 616 anywhere else in the phone numbers [like "(555) 616-5555"] it obviously wouldn't work properly.
If anyone could give me advice on how to retrieve only the first 3 digits, while ignoring the parentheses, I would greatly appreciate it.
You have two options:
Use value.startsWith("(616)") or,
Use regular expressions with this pattern "^\\(616\\).*"
The first option will be a lot quicker.
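Applied to the list from the question, the startsWith option might look like the sketch below. The Foo class here is only a minimal stand-in for the asker's class, and the AreaCodeFilter and withAreaCode names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class AreaCodeFilter {
    /** Minimal stand-in for the asker's Foo; only the phone number is modelled. */
    static class Foo {
        private final String phoneNumber;

        Foo(String phoneNumber) {
            this.phoneNumber = phoneNumber;
        }

        String getPhoneNumber() {
            return phoneNumber;
        }
    }

    // Keeps only the entries whose number starts with "(areaCode)".
    static List<Foo> withAreaCode(List<Foo> listOfFoos, String areaCode) {
        String prefix = "(" + areaCode + ")";
        List<Foo> tempListOfFoos = new ArrayList<>();
        for (Foo foo : listOfFoos) {
            if (foo.getPhoneNumber().startsWith(prefix)) {
                tempListOfFoos.add(foo);
            }
        }
        return tempListOfFoos;
    }
}
```

Because the check is anchored to the start of the string, a number like "(555) 616-5555" is no longer picked up.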
areaCode = number.substring(number.indexOf('(') + 1, number.indexOf(')')).trim() should do the job for you, given the formatting of phone numbers you have.
Or if you don't have any extraneous spaces, just use areaCode = number.substring(1, 4).
I think what you need is a capturing group. Have a look at the Groups and capturing section in this document.
Once you are done matching the input with a pattern (for example "\\((\\d+)\\) \\d+-\\d+"), you can get the number in the parentheses using a matcher (object of java.util.regex.Matcher) with matcher.group(1).
You could use a regular expression as shown below. The pattern ensures that the entire phone number conforms to your format ((XXX) XXX-XXXX) and also captures the number within the parentheses.
int areaCodeToSearch = 555;
String pattern = String.format("\\((%d)\\) \\d{3}-\\d{4}", areaCodeToSearch);
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(phoneNumber);
if (m.matches()) {
    String areaCode = m.group(1);
    // ...
}
Whether you choose to use a regular expression versus a simple String lookup (as mentioned in other answers) will depend on how bothered you are about the format of the entire string.

Java pattern matcher find multiple strings

I am using Pattern.compile() to find if a text string contains two other strings. But it needs to be in one regex pattern.
For example the string must have "StringOne" and "StringTwo" in it.
I could do Pattern.compile("(StringOne StringTwo|StringTwo StringOne)"), but both strings are quite long and I want to see if I can compress it.
If I do "(StringOne )?StringTwo( StringOne)?" it would match "StringTwo" and "StringOne StringTwo StringOne".
Use this regex:
^(?=.*\\bStringOne\\b)(?=.*\\bStringTwo\\b)
This uses two look-aheads anchored to the start of the input to assert that both strings appear somewhere in it.
Edit:
Added word boundaries \b to ends of strings to prevent matches of one string within another, although this was not a stated requirement of the question.
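For reference, a small sketch of how that regex could be used from Java. The ContainsBoth class and containsBoth method are illustrative names only.

```java
import java.util.regex.Pattern;

class ContainsBoth {
    private static final Pattern BOTH =
            Pattern.compile("^(?=.*\\bStringOne\\b)(?=.*\\bStringTwo\\b)");

    static boolean containsBoth(String text) {
        // Use find(), not matches(): the pattern consists only of zero-width
        // assertions, so it never consumes the whole input.
        // Note that '.' does not cross line breaks unless Pattern.DOTALL is set.
        return BOTH.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("StringTwo then StringOne")); // prints "true"
        System.out.println(containsBoth("only StringTwo here"));      // prints "false"
    }
}
```

The order of the two strings in the text does not matter, since each look-ahead scans from the start independently.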
There is a question of speed.
You could probably use lookaheads to accomplish this, but it's costly speed-wise. Lookaheads are really expensive on long strings.
If the strings are long, the faster approach would be to do two separate matches.
If you really need to do it in one pattern, use your original approach: String A String B|String B String A
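If the single-regex requirement can be relaxed, the two separate checks could be as simple as the sketch below. It assumes plain substring containment is enough (no word boundaries), and the class and method names are invented for the example.

```java
class ContainsBothSimple {
    // Two independent substring searches; no regex engine involved.
    static boolean containsBoth(String text, String first, String second) {
        return text.contains(first) && text.contains(second);
    }

    public static void main(String[] args) {
        // prints "true"
        System.out.println(containsBoth("StringTwo StringOne", "StringOne", "StringTwo"));
    }
}
```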

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List l_operators = Arrays.asList(Pattern.compile("(\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
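A quick way to see this behaviour for both the CSV case and the input from the question. The LeadingEmptyToken class and its tokens helper exist only for this demonstration.

```java
import java.util.Arrays;

class LeadingEmptyToken {
    // Thin wrapper so the two cases can be compared side by side.
    static String[] tokens(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // The CSV line starts with a delimiter, so the first token is empty.
        System.out.println(Arrays.toString(tokens(",jkl,mno", ","))); // prints "[, jkl, mno]"
        // Same effect: "12" is a delimiter sitting at the very start.
        System.out.println(Arrays.toString(tokens("12+22+3", "\\d+"))); // prints "[, +, +]"
    }
}
```

Trailing empty strings are dropped by split(), which is why the final "3" does not produce an empty token at the end.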
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
    List l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
    // ...
}
Ideally, however, you should be using a parser for these types of strings. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use another library to split the string. I personally try to avoid Java's split.
An example of one alternative is to look at Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
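If adding Guava is not an option, a similar effect can probably be had with the standard library alone (Java 8 or later) by filtering out empty tokens. This is a sketch and not Guava's API; the OperatorSplitter class and operators method are made-up names.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OperatorSplitter {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Splits on runs of digits and drops empty tokens, such as the one
    // produced when the input starts with a number.
    static List<String> operators(String expression) {
        return DIGITS.splitAsStream(expression.trim())
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(operators("12+22+3")); // prints "[+, +]"
    }
}
```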
