Java Regexp capturing group includes space, why? - java

I am trying to parse this string,
"斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /to chop the nail and slice the iron (idiom)/resolute and decisive/unhesitating/definitely/without any doubt/";
With this code
private static final Pattern TRADITIONAL = Pattern.compile("(.*?) ");
private String extractSinglePattern(String row, Pattern pattern) {
Matcher matcher = pattern.matcher(row);
if (matcher.find()) {
return matcher.group();
}
return null;
}
However, for some reason the string returned contains a space at the end
org.junit.ComparisonFailure: expected:<斬釘截鐵[]> but was:<斬釘截鐵[ ]>
Is there something wrong with my pattern?
I have also tried
private static final Pattern TRADITIONAL = Pattern.compile("(.*?)\\s");
but to no avail
I have also tried matching with two spaces at the end of the pattern, but it doesn't match (there is only one space).

You're using Matcher.group() which is documented as:
Returns the input subsequence matched by the previous match.
The match includes the space. The capturing group within the match doesn't, but you haven't asked for that.
If you change your return statement to:
return matcher.group(1);
then I believe it'll do what you want.
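To make the difference concrete, here is a minimal sketch that runs the pattern from the question and returns both the whole match and the first capturing group. The class and method names (GroupDemo, wholeMatch, firstGroup) are made up for illustration, and the input row is shortened.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the question.
class GroupDemo {
    private static final Pattern TRADITIONAL = Pattern.compile("(.*?) ");

    // group() is group(0): the whole match, including the trailing space.
    static String wholeMatch(String row) {
        Matcher matcher = TRADITIONAL.matcher(row);
        return matcher.find() ? matcher.group() : null;
    }

    // group(1) is only the text captured by the parentheses, without the space.
    static String firstGroup(String row) {
        Matcher matcher = TRADITIONAL.matcher(row);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /resolute and decisive/";
        // Brackets make the trailing space visible in the output.
        System.out.println("[" + wholeMatch(row) + "]");
        System.out.println("[" + firstGroup(row) + "]");
    }
}
```

The first line printed has a space before the closing bracket and the second does not.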

Use this regular expression: (.+?)(?=\s+). The lookahead (?=\s+) checks that whitespace follows but does not include it in the match.
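A small sketch of the lookahead variant, assuming you keep calling group() with no argument as in the question. The class name LookaheadDemo and the method firstToken are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookahead-based pattern.
class LookaheadDemo {
    // (?=\s+) asserts that whitespace follows without consuming it,
    // so the whitespace is not part of the match.
    private static final Pattern FIRST_TOKEN = Pattern.compile("(.+?)(?=\\s+)");

    static String firstToken(String row) {
        Matcher matcher = FIRST_TOKEN.matcher(row);
        // group() is safe here because the match itself excludes the space.
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("[" + firstToken("斬釘截鐵 斩钉截铁 [zhan3 ding1]") + "]");
    }
}
```

Note that this returns null when the input contains no whitespace at all, because the lookahead can never succeed.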


Java how to check multiple regex patterns against an input?

(If I'm taking the complete wrong direction let me know if there is a better way I should be approaching this)
I have a Java program that will have multiple patterns that I want to compare against an input. If one of the patterns matches then I want to save that value in a String. I can get it to work with a single pattern but I'd like to be able to check against many.
Right now I have this to check if an input matches one pattern:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Matcher match = pattern.matcher(input);
String ID = match.find()?match.group():null;
So, if the input was TST1234 or abcTST1234 then ID = "TST1234"
I want to have multiple patterns like:
Pattern pattern = Pattern.compile("TST\\w{1,}");
Pattern pattern2 = Pattern.compile("TWT\\w{1,}");
...
and then add them to a collection and check each one against the input:
List<Pattern> rxs = new ArrayList<Pattern>();
rxs.add(pattern);
rxs.add(pattern2);
String ID = null;
for (Pattern rx : rxs) {
if (rx.matcher(requestEnt).matches()){
ID = //???
}
}
I'm not sure how to set ID to what I want. I've tried
ID = rx.matcher(requestEnt).group();
and
ID = rx.matcher(requestEnt).find()?rx.matcher(requestEnt).group():null;
Not really sure how to make this work or where to go from here though. Any help or suggestions are appreciated. Thanks.
EDIT: Yes, the patterns will change over time, so the pattern list will grow.
I just need to get the string of the match...ie if the input is abcTWT123 it will first check against "TST\w{1,}", then move on to "TWT\w{1,}" and since that matches the ID String will be set to "TWT123".
To collect the matched string in the result you may need to create a group in your regexp if you are matching less than the entire string:
List<Pattern> patterns = new ArrayList<>();
patterns.add(Pattern.compile("(TST\\w+)"));
...
Optional<String> result = Optional.empty();
for (Pattern pattern: patterns) {
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
result = Optional.of(matcher.group(1));
break;
}
}
Or, if you are familiar with streams:
Optional<String> result = patterns.stream()
.map(p -> p.matcher(input)).filter(Matcher::matches)
.map(m -> m.group(1)).findFirst();
The alternative is to use find (as in @Raffaele's answer) and then call group(), which returns the whole match (group 0) without needing explicit parentheses.
Another alternative you may want to consider is to put all your matches into a single pattern.
Pattern pattern = Pattern.compile("(TST\\w+|TWT\\w+|...)");
Then you can match and group in a single operation. However, this might make it harder to change the patterns over time.
Group 1 is the first matched group (i.e. the match inside the first set of parentheses). Group 0 is the entire match. So if you want the entire match (I wasn't sure from your question) then you could perhaps use group 0.
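One thing worth checking against the inputs from the question: matches() only succeeds when the whole input matches, while find() searches inside it. A minimal sketch of the difference follows; the class name MatchesVsFind and its methods are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting matches() and find().
class MatchesVsFind {
    private static final Pattern ID = Pattern.compile("(TST\\w+)");

    // matches() requires the entire input to match the pattern.
    static boolean matchesWhole(String input) {
        return ID.matcher(input).matches();
    }

    // find() looks for the pattern anywhere in the input.
    static String findId(String input) {
        Matcher m = ID.matcher(input);
        // group(0) and group(1) are the same text here, because the
        // parentheses wrap the whole pattern.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("TST1234"));    // true
        System.out.println(matchesWhole("abcTST1234")); // false
        System.out.println(findId("abcTST1234"));       // TST1234
    }
}
```

So for an input like abcTST1234, a loop built on matches() would never set the ID unless the patterns also allow for the leading text.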
Use an alternation | (a regex OR):
Pattern pattern = Pattern.compile("TST\\w+|TWT\\w+|etc");
Then just check the pattern once.
Note also that {1,} can be replaced with +.
Maybe you just need to end the loop when the first pattern matches:
// TST\\w{1,}
// TWT\\w{1,}
private List<Pattern> patterns;
public String findIdOrNull(String input) {
for (Pattern p : patterns) {
Matcher m = p.matcher(input);
// First match. If the whole string must match use .matches()
if (m.find()) {
return m.group(0);
}
}
return null; // Or throw an Exception if this should never happen
}
If your patterns are all going to be simple prefixes like your examples TST and TWT, you can define all of them at once and use regex alternation | so you won't need to loop over the patterns.
An example:
String prefixes = "TWT|TST|WHW";
String regex = "(" + prefixes + ")\\w+";
Pattern pattern = Pattern.compile(regex);
String input = "abcTST123";
Matcher match = pattern.matcher(input);
String ID = match.find() ? match.group() : null;
// given this, ID will come out as "TST123"
Now prefixes could be read in from a java .properties file, or a simple text file; or passed as a parameter to the method that does this.
You could also define the prefixes as a comma-separated list or one-per-line in a file then process that to turn them into one|two|three|etc before passing it on.
You may be looping over several inputs, and then you would want to create the regex and pattern variables only once, creating only the Matcher for each separate input.
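As a rough sketch of that idea, the prefixes can be kept in a list and turned into one alternation, compiled once and reused for every input. The class name PrefixMatcher is hypothetical, and Pattern.quote is an addition of mine so that a prefix containing regex metacharacters is treated literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds one pattern from a list of literal prefixes.
class PrefixMatcher {
    private final Pattern pattern;

    PrefixMatcher(List<String> prefixes) {
        if (prefixes.isEmpty()) {
            throw new IllegalArgumentException("at least one prefix is required");
        }
        // Quote each prefix so it is matched literally, then join with |.
        String alternation = prefixes.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Compiled once; only a new Matcher is created per input.
        this.pattern = Pattern.compile("(?:" + alternation + ")\\w+");
    }

    // Returns the first ID found in the input, or null if there is none.
    String findId(String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        PrefixMatcher matcher = new PrefixMatcher(Arrays.asList("TST", "TWT", "WHW"));
        System.out.println(matcher.findId("abcTWT123")); // TWT123
        System.out.println(matcher.findId("nothing"));   // null
    }
}
```

Where the list comes from (a properties file, a text file, a parameter) is up to you; only the constructor argument changes.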

Java: Need to extract a number from a string

I have a string containing a number. Something like "Incident #492 - The Title Description".
I need to extract the number from this string.
Tried
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(theString);
String substring =m.group();
But I am getting an error
java.lang.IllegalStateException: No match found
What am I doing wrong?
What is the correct expression?
I'm sorry for such a simple question, but I searched a lot and still not found how to do this (maybe because it's too late here...)
You are getting this exception because you need to call find() on the matcher before accessing groups:
Matcher m = p.matcher(theString);
while (m.find()) {
String substring =m.group();
System.out.println(substring);
}
There are two things wrong here:
The pattern you're using is not ideal for your scenario: used with matches(), it only succeeds if the whole string consists of digits. Also, since it doesn't contain a capturing group, a call to group() is equivalent to calling group(0), which returns the entire match.
You need to be certain that the matcher has a match before you go calling a group.
Let's start with the regex. Here's what it looks like now.
With matches(), that will only ever match a string made up entirely of digits. What you care about is specifically the number in that string, so you want an expression that:
Doesn't care about what's in front of it
Doesn't care about what's after it
Only matches on one occurrence of numbers, and captures it in a group
To do that, you'd use this expression:
.*?(\\d+).*
The last part is to ensure that the matcher can find a match, and that it gets the correct group. That's accomplished by this:
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
All together now:
Pattern p = Pattern.compile(".*?(\\d+).*");
final String theString = "Incident #492 - The Title Description";
Matcher m = p.matcher(theString);
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
You need to invoke one of the Matcher methods, like find, matches or lookingAt to actually run the match.
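For this particular question, a minimal sketch with find() could look like the following. The class name IncidentNumber and the choice of OptionalInt as the return type are my own assumptions, not something from the question.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for pulling the first number out of a string.
class IncidentNumber {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits as an int, or empty if there is none.
    // Note: Integer.parseInt throws NumberFormatException if the run of
    // digits does not fit in an int.
    static OptionalInt firstNumber(String text) {
        Matcher m = DIGITS.matcher(text);
        // find() must be called (and return true) before group() is legal.
        if (m.find()) {
            return OptionalInt.of(Integer.parseInt(m.group()));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("Incident #492 - The Title Description"));
        System.out.println(firstNumber("no digits here"));
    }
}
```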

Regex to get the string after # sign

I have a string like follows:
#78517700-1f01-11e3-a6b7-3c970e02b4ec, #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec ....
I want to extract the string after #.
I have the current code like follows:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#[^\\s]+");
Matcher m = PATTERN_LOGIN.matcher("#78517700-1f01-11e3-a6b7-3c970e02b4ec , #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec");
while (m.find()) {
String mentionedLogin = m.group();
.......
}
... but m.group() gives me #78517700-1f01-11e3-a6b7-3c970e02b4ec but I wanted 78517700-1f01-11e3-a6b7-3c970e02b4ec
You should use the regex "#([^\\s]+)" and then m.group(1), which returns what was captured by the capturing parentheses ().
m.group() or m.group(0) returns the full string matched by your regex.
I would modify the pattern to capture only the part after the hash sign:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#([^\\s]+)");
So the first group will be the GUID only
Correct answers are mentioned in other responses. I will add some clarification. Your code is working correctly, as expected.
Your regex means: match a string which starts with # followed by one or more characters that aren't whitespace. So if you omit the parentheses, you get the full match, including the #, as expected.
The parentheses, as mentioned in other responses, mark capturing groups. In layman's terms, the regex engine remembers the text matched by each parenthesized group (numbered by opening parenthesis, working its way into nested groups), so you can retrieve it later with group(n).
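Putting that together, here is a sketch that collects every ID after a # into a list. The class name HashIds is hypothetical. I have also excluded commas from the character class, which is my own assumption: with [^\\s]+ alone, a comma that directly follows an ID (as in the sample string) would end up inside the captured text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts the text following each '#'.
class HashIds {
    // Group 1 captures everything after '#' up to whitespace or a comma.
    // Excluding the comma is an assumption based on the sample input.
    private static final Pattern PATTERN_LOGIN = Pattern.compile("#([^\\s,]+)");

    static List<String> extractIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = PATTERN_LOGIN.matcher(text);
        while (m.find()) {
            ids.add(m.group(1)); // group(1) omits the leading '#'
        }
        return ids;
    }

    public static void main(String[] args) {
        String input = "#78517700-1f01-11e3-a6b7-3c970e02b4ec , "
                + "#68517700-1f01-11e3-a6b7-3c970e02b4ec, "
                + "#98517700-1f01-11e3-a6b7-3c970e02b4ec";
        extractIds(input).forEach(System.out::println);
    }
}
```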

java regular expression

Can anyone please help me do the following in a java regular expression?
I need to read 3 characters from the 5th position from a given String ignoring whatever is found before and after.
Example : testXXXtest
Expected result : XXX
You don't need regex at all.
Just use substring: yourString.substring(4,7)
Since you do need to use regex, you can do it like this:
Pattern pattern = Pattern.compile(".{4}(.{3}).*");
Matcher matcher = pattern.matcher("testXXXtest");
matcher.matches();
String whatYouNeed = matcher.group(1);
What does it mean, step by step:
.{4} - any four characters
( - start capturing group, i.e. what you need
.{3} - any three characters
) - end capturing group, you got it now
.* followed by 0 or more arbitrary characters.
matcher.group(1) - get the 1st (only) capturing group.
You should be able to use the substring() method to accomplish this:
String example = "testXXXtest";
String result = example.substring(4,7);
This might help: Groups and capturing in java.util.regex.Pattern.
Here is an example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
public static void main(String[] args) {
String text = "This is a testWithSomeDataInBetweentest.";
Pattern p = Pattern.compile("test([A-Za-z0-9]*)test");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Matched: " + m.group(1));
} else {
System.out.println("No match.");
}
}
}
This prints:
Matched: WithSomeDataInBetween
If you want to match the pattern against the entire input string (rather than seek a substring that matches), you can use matches() instead of find(). With find(), you can continue searching for more matching substrings with subsequent calls.
Also, your question did not specify what are admissible characters and length of the string between two "test" strings. I assumed any length is OK including zero and that we seek a substring composed of small and capital letters as well as digits.
You can use substring for this, you don't need a regex.
yourString.substring(4,7);
I'm sure you could use a regex too, but why if you don't need it. Of course you should protect this code against null and strings that are too short.
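A guarded version might look like this sketch. The class name SafeSlice, the method name, and the decision to return null for unusable input are all my own choices; throwing an exception would be just as reasonable.

```java
// Hypothetical helper around substring(4, 7) with basic input checks.
class SafeSlice {
    // Returns the 3 characters starting at the 5th position (index 4),
    // or null if the input is null or shorter than 7 characters.
    static String threeFromFifth(String s) {
        if (s == null || s.length() < 7) {
            return null;
        }
        return s.substring(4, 7);
    }

    public static void main(String[] args) {
        System.out.println(threeFromFifth("testXXXtest")); // XXX
        System.out.println(threeFromFifth("short"));       // null
    }
}
```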
Use the String.replaceAll() Method
If you don't need to be performance optimized, you can try the String.replaceAll() method for a cleaner option:
String sDataLine = "testXXXtest";
String sWhatYouNeed = sDataLine.replaceAll( ".{4}(.{3}).*", "$1" );
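One caveat with the replaceAll() approach: when the pattern does not match (for example, the input is shorter than 7 characters), the input comes back unchanged rather than as null or an empty string. A small sketch, with ReplaceAllDemo and extract as made-up names:

```java
// Hypothetical demo of replaceAll() with a capturing group.
class ReplaceAllDemo {
    static String extract(String dataLine) {
        // Replaces the whole match with group 1; if nothing matches,
        // replaceAll returns the original string as-is.
        return dataLine.replaceAll(".{4}(.{3}).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("testXXXtest")); // XXX
        System.out.println(extract("short"));       // short (no match, unchanged)
    }
}
```

If you need to tell "no match" apart from a real result, a Matcher with matches() and group(1) is the safer choice.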
References
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#using-regular-expressions-with-string-methods

String Pattern Matching In Java

I want to search for a given string pattern in an input string.
For example:
String URL = "https://localhost:8080/sbs/01.00/sip/dreamworks/v/01.00/cui/print/$fwVer/{$fwVer}/$lang/en/$model/{$model}/$region/us/$imageBg/{$imageBg}/$imageH/{$imageH}/$imageSz/{$imageSz}/$imageW/{$imageW}/movie/Kung_Fu_Panda_two/categories/3D_Pix/item/{item}/_back/2?$uniqueID={$uniqueID}"
Now I need to search whether the string URL contains "/{item}/". Please help me.
This is an example. What I actually need is to check whether the URL contains a string matching "/{a-zA-Z0-9}/"
You can use the Pattern class for this. If you want to match only word characters inside the {} then you can use the following regex. \w is a shorthand for [a-zA-Z0-9_]. If you are ok with _ then use \w or else use [a-zA-Z0-9].
String URL = "https://localhost:8080/sbs/01.00/sip/dreamworks/v/01.00/cui/print/$fwVer/{$fwVer}/$lang/en/$model/{$model}/$region/us/$imageBg/{$imageBg}/$imageH/{$imageH}/$imageSz/{$imageSz}/$imageW/{$imageW}/movie/Kung_Fu_Panda_two/categories/3D_Pix/item/{item}/_back/2?$uniqueID={$uniqueID}";
Pattern pattern = Pattern.compile("/\\{\\w+\\}/");
Matcher matcher = pattern.matcher(URL);
if (matcher.find()) {
System.out.println(matcher.group(0)); //prints /{item}/
} else {
System.out.println("Match not found");
}
That's just a matter of String.contains:
if (input.contains("{item}"))
If you need to know where it occurs, you can use indexOf:
int index = input.indexOf("{item}");
if (index != -1) // -1 means "not found"
{
...
}
That's fine for matching exact strings - if you need real patterns (e.g. "three digits followed by at most 2 letters A-C") then you should look into regular expressions.
EDIT: Okay, it sounds like you do want regular expressions. You might want something like this:
private static final Pattern URL_PATTERN =
Pattern.compile("/\\{[a-zA-Z0-9]+\\}/");
...
if (URL_PATTERN.matcher(input).find())
If you want to check if some string is present in another string, use something like String.contains
If you want to check if some pattern is present in a string, append and prepend the pattern with '.*'. The result will accept strings that contain the pattern.
Example: Suppose you have some regex a(b|c) that checks if a string matches ab or ac
.*(a(b|c)).* will check if a string contains ab or ac.
A disadvantage of this method is that it will not give you the location of the match. You can use java.util.regex.Matcher.find() if you need the position of the match.
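The two approaches can be compared in a short sketch. The class name PlaceholderSearch and its method names are hypothetical; the pattern is the /{a-zA-Z0-9}/ one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: "contains" via wrapped matches() versus find().
class PlaceholderSearch {
    private static final Pattern PLACEHOLDER =
            Pattern.compile("/\\{[a-zA-Z0-9]+\\}/");
    // Same pattern wrapped in .* so that matches() acts like "contains".
    private static final Pattern WRAPPED =
            Pattern.compile(".*(/\\{[a-zA-Z0-9]+\\}/).*");

    // Only tells you whether the pattern occurs somewhere in the input.
    static boolean containsViaMatches(String url) {
        return WRAPPED.matcher(url).matches();
    }

    // Returns the index where the first match starts, or -1 if none.
    static int indexViaFind(String url) {
        Matcher m = PLACEHOLDER.matcher(url);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String url = "/item/{item}/_back/2";
        System.out.println(containsViaMatches(url)); // true
        System.out.println(indexViaFind(url));       // 5
    }
}
```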
You can do it using string.indexOf("{item}"). If the result is greater than -1 {item} is in the string
