Find pattern in string with regex -> how to improve my solution - java

I would like to parse a string and get the "stringIAmLookingFor" part of it, which is surrounded by "_" at the beginning and the end. I'm using a regex to match that and then remove the "_" from the found string. This works, but I'm wondering if there is a more elegant approach to this problem?
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w)*_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
String match = m.group();
match = match.replaceAll("_", "");
System.out.println(match);
}

Solution (partial)
Please also check the next section. Don't just read the solution here.
Just modify your code a bit:
String test = "xyz_stringIAmLookingFor_zxy";
// Make the capturing group capture the text in between (\w*)
// A capturing group is enclosed in (pattern), denoting the part of the
// pattern whose text you want to get separately from the main match.
// Note that there is also non-capturing group (?:pattern), whose text
// you don't need to capture.
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// The text is in the capturing group numbered 1
// The numbering is by counting the number of opening
// parentheses that makes up a capturing group, until
// the group that you are interested in.
String match = m.group(1);
System.out.println(match);
}
Matcher.group(), without any argument will return the text matched by the whole regex pattern. Matcher.group(int group) will return the text matched by capturing group with the specified group number.
If you are using Java 7 or later, you can make use of a named capturing group, which makes the code slightly more readable. The string matched by the capturing group can be accessed with Matcher.group(String name).
String test = "xyz_stringIAmLookingFor_zxy";
// (?<name>pattern) is similar to (pattern), just that you attach
// a name to it
// specialText is not a really good name, please use a more meaningful
// name in your actual code
Pattern p = Pattern.compile("_(?<specialText>\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// Access the text captured by the named capturing group
// using Matcher.group(String name)
String match = m.group("specialText");
System.out.println(match);
}
Problem in pattern
Note that \w also matches _. The pattern you have is ambiguous, and I don't know what your expected output is for the cases where there are more than 2 _ in the string. And do you want to allow underscore _ to be part of the output?
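To make the ambiguity concrete, here is a small sketch comparing the \w-based pattern with one that excludes underscores from the captured text. The class name, the helper method and the sample input "a_b_c_d" are made up for illustration; which behaviour is right depends on what output you expect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreAmbiguityDemo {
    // \w includes '_', so the greedy \w* can run across inner underscores.
    static final Pattern WORD_CHARS = Pattern.compile("_(\\w*)_");
    // [^_]* stops at the next underscore.
    static final Pattern NO_UNDERSCORE = Pattern.compile("_([^_]*)_");

    // Collects the text of capturing group 1 for every match.
    static List<String> captures(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a_b_c_d"; // hypothetical input with three underscores
        System.out.println(captures(WORD_CHARS, input));    // [b_c]
        System.out.println(captures(NO_UNDERSCORE, input)); // [b]
    }
}
```

Note that the second pattern does not report "c": the underscore that closes the first match is consumed, so it cannot also open the next one.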

You can define the group you actually want, since you're already using parentheses. You just need to tweak your pattern a bit.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
System.out.println(m.group(1));
}

Use group(1) instead of group(), because group() returns the entire match rather than the text captured by the group.
Reference : http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)

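"xyz_stringIAmLookingFor_zxy".replaceAll("_(\\w*)_", "$1");
will replace the matched "_..._" section with the text captured by the group in parentheses. This strips the two underscores but keeps the surrounding text. Note that the quantifier has to be inside the group, (\\w*); with (\\w)* the group only holds the last character matched.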

a simpler regex, no group needed:
"(?<=_)[^_]*"
if you want it more strict:
"(?<=_)[^_]+(?=_)"
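As a rough illustration of how the two lookaround patterns differ, the sketch below runs both against the string from the question. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Only requires an underscore before the match.
    static final Pattern LOOSE = Pattern.compile("(?<=_)[^_]*");
    // Requires an underscore before and after, and at least one character.
    static final Pattern STRICT = Pattern.compile("(?<=_)[^_]+(?=_)");

    // Collects every whole match; no capturing group is involved.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "xyz_stringIAmLookingFor_zxy";
        System.out.println(findAll(LOOSE, test));  // [stringIAmLookingFor, zxy]
        System.out.println(findAll(STRICT, test)); // [stringIAmLookingFor]
    }
}
```

The loose version also returns the trailing "zxy" because that text is preceded by an underscore too; the strict version rejects it since nothing closes it.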

try
String s = "xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
System.out.println(s);
output
stringIAmLookingFor

Related

Java how to check multiple regex patterns against an input?

(If I'm taking the complete wrong direction let me know if there is a better way I should be approaching this)
I have a Java program that will have multiple patterns that I want to compare against an input. If one of the patterns matches then I want to save that value in a String. I can get it to work with a single pattern but I'd like to be able to check against many.
Right now I have this to check if an input matches one pattern:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Matcher match = pattern.matcher(input);
String ID = match.find()?match.group():null;
So, if the input was TST1234 or abcTST1234 then ID = "TST1234"
I want to have multiple patterns like:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Pattern pattern2 = Pattern.compile("TWT\\w{1,}");
...
and then add them to a collection and check each one against the input:
List<Pattern> rxs = new ArrayList<Pattern>();
rxs.add(pattern);
rxs.add(pattern2);
String ID = null;
for (Pattern rx : rxs) {
if (rx.matcher(requestEnt).matches()){
ID = //???
}
}
I'm not sure how to set ID to what I want. I've tried
ID = rx.matcher(requestEnt).group();
and
ID = rx.matcher(requestEnt).find()?rx.matcher(requestEnt).group():null;
Not really sure how to make this work or where to go from here though. Any help or suggestions are appreciated. Thanks.
EDIT: Yes, the patterns will change over time, so the pattern list will grow.
I just need to get the string of the match...ie if the input is abcTWT123 it will first check against "TST\w{1,}", then move on to "TWT\w{1,}" and since that matches the ID String will be set to "TWT123".
To collect the matched string in the result you may need to create a group in your regexp if you are matching less than the entire string:
List<Pattern> patterns = new ArrayList<>();
patterns.add(Pattern.compile("(TST\\w+)"));
...
Optional<String> result = Optional.empty();
for (Pattern pattern: patterns) {
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
result = Optional.of(matcher.group(1));
break;
}
}
Or, if you are familiar with streams:
Optional<String> result = patterns.stream()
.map(p -> p.matcher(input)).filter(Matcher::matches)
.map(m -> m.group(1)).findFirst();
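The alternative is to use find() (as in @Raffaele's answer), which searches for the pattern anywhere in the input, whereas matches() requires the entire input to match. After a successful find(), group() returns the matched text without needing an explicit group.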
Another alternative you may want to consider is to put all your matches into a single pattern.
Pattern pattern = Pattern.compile("(TST\\w+|TWT\\w+|...)");
Then you can match and group in a single operation. However, this might make it harder to change the matches over time.
Group 1 is the first matched group (i.e. the match inside the first set of parentheses). Group 0 is the entire match. So if you want the entire match (I wasn't sure from your question) then you could perhaps use group 0.
Use an alternation | (a regex OR):
Pattern pattern = Pattern.compile("TST\\w+|TWT\\w+|etc");
Then just check the pattern once.
Note also that {1,} can be replaced with +.
Maybe you just need to end the loop when the first pattern matches:
// TST\\w{1,}
// TWT\\w{1,}
private List<Pattern> patterns;
public String findIdOrNull(String input) {
for (Pattern p : patterns) {
Matcher m = p.matcher(input);
// First match. If the whole string must match use .matches()
if (m.find()) {
return m.group(0);
}
}
return null; // Or throw an Exception if this should never happen
}
If your patterns are all going to be simple prefixes like your examples TST and TWT, you can define all of those at once and use regex alternation | so you won't need to loop over the patterns.
An example:
String prefixes = "TWT|TST|WHW";
String regex = "(" + prefixes + ")\\w+";
Pattern pattern = Pattern.compile(regex);
String input = "abcTST123";
Matcher match = pattern.matcher(input);
String ID = match.find() ? match.group() : null;
// given this, ID will come out as "TST123"
Now prefixes could be read in from a java .properties file, or a simple text file; or passed as a parameter to the method that does this.
You could also define the prefixes as a comma-separated list or one-per-line in a file then process that to turn them into one|two|three|etc before passing it on.
You may be looping over several inputs, and then you would want to create the regex and pattern variables only once, creating only the Matcher for each separate input.
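A minimal sketch of the last two points, assuming the prefixes arrive as a comma-separated string: the alternation is built once in the constructor, and only a new Matcher is created per input. The class name PrefixIdFinder and its methods are hypothetical; Pattern.quote is used so that a prefix containing regex metacharacters is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixIdFinder {
    // Compiled once and reused for every input.
    private final Pattern pattern;

    // prefixesCsv is a hypothetical comma-separated list such as "TST, TWT".
    PrefixIdFinder(String prefixesCsv) {
        StringBuilder alternation = new StringBuilder();
        for (String prefix : prefixesCsv.split(",")) {
            String trimmed = prefix.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries such as "TST,,TWT"
            }
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(trimmed));
        }
        if (alternation.length() == 0) {
            throw new IllegalArgumentException("no prefixes given");
        }
        this.pattern = Pattern.compile("(?:" + alternation + ")\\w+");
    }

    // Returns the first ID found anywhere in the input, or null if there is none.
    String findIdOrNull(String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        PrefixIdFinder finder = new PrefixIdFinder("TST, TWT");
        System.out.println(finder.findIdOrNull("abcTWT123")); // TWT123
        System.out.println(finder.findIdOrNull("abc123"));    // null
    }
}
```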

how to exclude "<" in regex match

I have a String which looks like "<name><address> and <Phone_1>". I have to get a result like
1) <name>
2) <address>
3) <Phone_1>
I have tried using regex "<(.*)>" but it returns just one result.
The regex you want is
<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>
Which will then spit out the stuff you want in the 3 capture groups. The full code would then look something like this:
Matcher m = Pattern.compile("<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>").matcher(string);
if (m.find()) {
String name = m.group(1);
String address = m.group(2);
String phone = m.group(3);
}
The pattern .* in a regex is greedy. It will match as many characters as possible between the first < it finds and the last possible > it can find. In the case of your string it finds the first <, then looks for as much text as possible until a >, which it will find at the very end of the string.
You want a non-greedy or "lazy" pattern, which will match as few characters as possible. Simply <(.+?)>. The question mark is the syntax for non-greedy. See also this question.
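The difference between the greedy and the lazy quantifier can be checked with a short sketch like the following. The class and helper names are invented for the example; the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Returns the text of capturing group 1 for every match of the regex.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<name><address> and <Phone_1>";
        // Greedy: one match running from the first '<' to the last '>'.
        System.out.println(findAll("<(.*)>", input));
        // prints [name><address> and <Phone_1]
        // Lazy: stops at the first '>' after each '<'.
        System.out.println(findAll("<(.+?)>", input));
        // prints [name, address, Phone_1]
    }
}
```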
This will work if you have a dynamic number of groups.
Pattern p = Pattern.compile("(<\\w+>)");
Matcher m = p.matcher("<name><address> and <Phone_1>");
while (m.find()) {
System.out.println(m.group());
}

Regex isn't extracting specific part rather whole string upto the group

This is a follow-up to the question that I asked here.
The given regex is perfect, i.e., (?:[^\/]*\/){4}([A-Za-z]{3}[0-9]{3}). However, when I use it in Java, the matcher returns the string up to and including the matching group rather than just the group itself.
String defaultRegex = "(?:[^\\/]*\\/){4}([A-Za-z]{3}[0-9]{3})";
String stringToMatch = "unknown/relevant/nonrelevant:2.2.2/random/ABC123:random/morerandom";
Pattern p = Pattern.compile(defaultRegex);
Matcher m = p.matcher (stringToMatch);
if (m.find()){
System.out.println(m.group());
}
The above thing is printing unknown/relevant/nonrelevant:2.2.2/random/ABC123 when I want regex just to give me ABC123
matcher.group() as well as matcher.group(0) always return the whole matched string.
To get the first capturing group, use matcher.group(1).
The second capturing group goes with matcher.group(2), and so on.
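Applied to the regex and input from the question, the difference looks roughly like this. The class and method names are made up for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    static final Pattern P =
            Pattern.compile("(?:[^\\/]*\\/){4}([A-Za-z]{3}[0-9]{3})");

    // group() / group(0): everything the whole pattern matched.
    static String wholeMatch(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    // group(1): only the text inside the first capturing parentheses.
    static String firstGroup(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "unknown/relevant/nonrelevant:2.2.2/random/ABC123:random/morerandom";
        System.out.println(wholeMatch(s)); // unknown/relevant/nonrelevant:2.2.2/random/ABC123
        System.out.println(firstGroup(s)); // ABC123
    }
}
```

The (?:...) part is a non-capturing group, so it does not get a number and the ID is group 1.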

Java: Need to extract a number from a string

I have a string containing a number. Something like "Incident #492 - The Title Description".
I need to extract the number from this string.
Tried
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(theString);
String substring =m.group();
But I'm getting an error:
java.lang.IllegalStateException: No match found
What am I doing wrong?
What is the correct expression?
I'm sorry for such a simple question, but I searched a lot and still haven't found how to do this (maybe because it's too late here...)
You are getting this exception because you need to call find() on the matcher before accessing groups:
Matcher m = p.matcher(theString);
while (m.find()) {
String substring =m.group();
System.out.println(substring);
}
There are two things wrong here:
The pattern you're using is not ideal for your scenario: used with matches(), it only checks whether the string consists entirely of digits. Also, since it doesn't contain a capturing group, a call to group() is equivalent to calling group(0), which returns the entire match.
You need to be certain that the matcher has a match before you go calling a group.
Let's start with the regex. Right now it is \\d+.
Used with matches(), that will only ever match a string that consists entirely of digits. What you care about is specifically the number in that string, so you want an expression that:
Doesn't care about what's in front of it
Doesn't care about what's after it
Only matches on one occurrence of numbers, and captures it in a group
To do that, you'd use this expression:
.*?(\\d+).*
The last part is to ensure that the matcher can find a match, and that it gets the correct group. That's accomplished by this:
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
All together now:
Pattern p = Pattern.compile(".*?(\\d+).*");
final String theString = "Incident #492 - The Title Description";
Matcher m = p.matcher(theString);
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
You need to invoke one of the Matcher methods, like find, matches or lookingAt to actually run the match.
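The three methods differ in how much of the input has to match. A small sketch using the \\d+ pattern and the string from the question (the class and helper names are invented here):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchMethodsDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // find() searches anywhere in the input, so \d+ needs no wrapping .* parts.
    static String firstNumberOrNull(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Incident #492 - The Title Description";
        // matches(): the entire input must match the pattern.
        System.out.println(DIGITS.matcher(s).matches());   // false
        // lookingAt(): the match must start at the beginning of the input.
        System.out.println(DIGITS.matcher(s).lookingAt()); // false
        // find(): a match anywhere in the input is enough.
        System.out.println(DIGITS.matcher(s).find());      // true
        System.out.println(firstNumberOrNull(s));          // 492
    }
}
```

Calling group() before any of these has succeeded is what raises the IllegalStateException shown in the question.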

Java pattern to find two groups of two letters in `ABC`

I have a pattern defined like this:
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z]{2}");
And in my code I'm doing this:
Matcher matcher = PATTERN.matcher(myString);
and using a while loop to find all matches.
while (matcher.find()){
//do something here
}
If myString is 12345AB3CD45 the matcher finds those two groups of two letters (AB and CD). The problem is that sometimes myString is 12345ABC356, so I would like the matcher to find first AB and then BC (it is only finding AB).
Am I doing this wrong or the regex is wrong or the matcher doesn't work this way?
You can't match the same position several times with a regex, but you can use a trick.
To do that, enclose your pattern in a lookahead and a capture group:
(?=([A-Za-z]{2})). This works because a lookahead consumes no characters, so after each (empty) match the matcher advances by only one position.
The result you are looking for is in the capture group 1.
A fragment of text that was placed in group 0 (the entire match) can't be reused in the next match as part of group 0.
12345ABC356
     ^^     - AB was placed in standard match (group 0)
      ^^    - B can't be reused here as part of standard match
You can solve this problem with look-around mechanisms like look-ahead, which don't consume the matched part (they are zero-length), but you can place their content in a separate capturing group that you will be able to access.
So your code can look like
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z](?=([a-zA-Z]))");
//                                                      ^^^^^^^^   ^^^^^^^^^^
//                                                      group 0    group 1
//...
Matcher matcher = PATTERN.matcher(myString);
while (matcher.find()){
String match = matcher.group() + matcher.group(1);
//...
}
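For comparison, here is a self-contained sketch of the other variant, where the whole two-letter pattern sits inside the lookahead and the pair is read from group 1. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingPairsDemo {
    // The lookahead itself matches the empty string; the pair is captured in group 1.
    static final Pattern PAIR = Pattern.compile("(?=([a-zA-Z]{2}))");

    // Returns every two-letter pair, including overlapping ones.
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // group() would be empty here
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345ABC356"));  // [AB, BC]
        System.out.println(pairs("12345AB3CD45")); // [AB, CD]
    }
}
```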
