Java regex matches only if matching is checked - java

Inside of a class I have a pattern private Pattern lossWer = Pattern.compile("^\\d+ \\d+ (\\d+).*"). One of the functions looks like this:
public double[] getWer(){
double[] wer = new double[someStrings.size()];
Matcher m;
for(int i = 0; i < wer.length; i++){
m = lossWer.matcher(someStrings.get(i));
wer[i] = Double.parseDouble(m.group(1));
}
return wer;
}
Calling this fails with java.lang.IllegalStateException: No match found. When I change it to this, though, it works:
public double[] getWer(){
double[] wer = new double[someStrings.size()];
Matcher m;
for(int i = 0; i < wer.length; i++){
m = lossWer.matcher(someStrings.get(i));
if(!m.matches())
;
wer[i] = Double.parseDouble(m.group(1));
}
return wer;
}
Of course my application doesn't just use a blank semi-colon for that line, but I'm illustrating that the line here does nothing but allows the program to proceed without error. Why are lines matched without errors in the second example but not in the first?

You can't use group() without first calling either find() or matches(). That's how regexes work. First you create a pattern, then a matcher, then you either find() instances of the regex or check if it matches().

Check this for the IllegalStateException
The explicit state of a matcher is initially undefined; attempting to
query any part of it before a successful match will cause an
IllegalStateException to be thrown. The explicit state of a matcher is
recomputed by every match operation.
This combined with Ryan's answer should give you what you need.

Until you call m.matches() you haven't tested the regex so there are no groups.
You say in that line, test against the regex. If there are no matches do nothing, then you continue on to check the group(1) of the match (and since there was a match with a group it works).
It would be best to change:
if(!m.matches())
;
wer[i] = Double.parseDouble(m.group(1));
To:
if(m.matches())
wer[i] = Double.parseDouble(m.group(1));
Or use !m.matches() to return an error or something. Your choice :)

Related

Extract string between a set of multiple limiters with groups

As title says, I've a string and I want to extract some data from It.
This is my String:
text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
and I want to extract all the data between the pipes: tab_PRO, 1, 1...and so on
.
I've tried:
Pattern p = Pattern.compile("\\|(.*?)\\|");
Matcher m = p.matcher(text);
while(m.find())
{
for(int i = 1; i< 10; i++) {
test = m.group(i);
System.out.println(test);
}
}
and with this i get the first group that's tab_PRO. But i also get an error
java.lang.IndexOutOfBoundsException: No group 2
Now, probably I didn't understand quite well how the groups works, but I thought that with this I could get the remaining data that I need. I'm not able to understand what I'm missing.
Thanks in advance
Use String.split(). Take into account it expects a regex as an argument, and | is a reserved regex operand, so you'll need to escape it with a \. So, make it two \ so \| won't be interpreted as if you're using an - invalid - escape sequence for the | character:
String[] parts = text.split("\\|");
See it working here:
https://ideone.com/WibjUm
If you want to go with your regex approach, you'll need to group and capture every repetition of characters after every | and restrict them to be anything except |, possibly using a regex like \\|([^\\|]*).
In your loop, you iterate over m.find() and just use capture group 1 because its the only group every match will have.
String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
Pattern p = Pattern.compile("\\|([^\\|]*)");
Matcher m = p.matcher(text);
while(m.find()){
System.out.println(m.group(1));
}
https://ideone.com/RNjZRQ
Try using .split() or .substring()
As mentioned in the comments, this is easier done with String.split.
As for your own code, you are unnecessarily using the inner loop, and that's leading to that exception. You only have one group, but the for loop will cause you to query more than one group. Your loop should be as simple as:
Pattern p = Pattern.compile("(?<=\\|)(.*?)\\|");
Matcher m = p.matcher(text);
while (m.find()) {
String test = m.group(1);
System.out.println(test);
}
And that prints
tab_PRO
1
1
#tRecordType#
0
tab_PRO
Note that I had to use a look-behind assertion in your regex.

Regex pattern in java fails but works fine otherwise

I've implemented quite a complicated pattern` to match all occurences of ship set number. It works perfectly fine with global case insensitive comparison.
I use the following code to implement the same thing in Java but it doesn't match. Should Java regex be implemented differently?
int i = 0;
while (i < elementsArray.size()) {
System.out.println("List element:"+elementsArray.get(i));
String theRegex = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
if (elementsArray.get(i).matches(theRegex)) {
System.out.println("RESULT:");
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
}
System.out.println("text==========" + shipsets);
}
i++;
}
Here is a simplification of your code which should work, assuming that your regex be working correctly in Java. From my preliminary investigations, it does seem to match many of the use cases in your link. You don't need to use String.matches() because you already are using a Matcher which will check whether or not you have a match.
List<String> elementsArray = new ArrayList<String>();
elementsArray.add("Shipset Number 323");
elementsArray.add("meh");
elementsArray.add("SS NO. : 34");
elementsArray.add("Mary had a little lamb");
elementsArray.add("Ship Set #2, #33 to #4.");
for (int i=0; i < elementsArray.size(); ++i) {
System.out.println("List element:"+elementsArray.get(i));
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
System.out.println("Found a match at element " + i + ": " + shipsets);
}
}
}
You can see in the output below, that the three ship test strings all matched, and the controls "meh" and "Mary had a little lamb" did not match.
Output:
List element:Shipset Number 323
Found a match at element 0: Shipset Number 323
List element:meh
List element:SS NO. : 34
Found a match at element 2: SS NO. : 34
List element:Mary had a little lamb
List element:Ship Set #2, #33 to #4.
Found a match at element 4: Ship Set #2, #33 to #4.
In my opinion your problems are coused by:
usage of matches() in if(elementsArray.get(i).matches(theRegex)) - matches() will return
true only if whole string match to regex, so it will succeed in
many cases from your example, but it will fail with:
SS#1,SS#5,SS#6, SS1, SS2, SS3, SS4, etc. You can simulate this
situation by adding ^ at beginning and $ at the end of regex.
Check how it match HERE. So it would be better solution, to use
matcher.find() instead of String.matches(), like in Tim
Biegeleisen answer.
usage of if(matcher.find()) instead of while(matcher.find()) - in
some of strings you want to retrieve more than one result, so you
should use matcher.find() multiple times, to get all of them.
However if will act only once, so you will get only first matched
fragment from given string. To retrieve all, use loop, as matcher.find() will return false when it will not find next match in given String, and will end loop
Check this out. This is Tim Biegeleisen solution with small change (while, instead of if).

Regex to match only letters and numbers

Can you help with this code?
It seems easy, but always fails.
#Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
//Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
assertEquals("P12345678P", ret.toString());
}
Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
#Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
while (matches.find()) {
ret.append(matches.group());
}
assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1 -- it inquires about the pattern, not about matches to it.
First of all you don't need to add any group because entire match can be always accessed by group 0, so instead of
(regex) and group(1)
you can use
regex and group(0)
Next thing is that \\w is already character class so you don't need to surround it with another [ ], because it will be similar to [[a-z]] which is same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you will iterate over all groups from 1 but you will exclude last group, because they are indexed from 1 so n so i<n will not include n. You would need to use i <= matches.groupCount() instead.
Also it looks like you are confusing something. This loop will not find all matches of regex in input. Such loop is used to iterate over groups in used regex after match for regex was found.
So if regex would be something like (\w(\w))c and your match would be like abc then
for (int i = 1; i < matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
first group contains two characters (\w(\w)) before c
second group is the one inside first one, right after first character.
But to print them you actually would need to first let regex engine iterate over your input and find() match, or check if entire input matches() regex, otherwise you would get IllegalStateException because regex engine can't know from which match you want to get your groups (there can be many matches of regex in input).
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Other way around (and probably simpler solution) would be actually removing all characters you don't want from your input. So you could just use replaceAll and negated character class [^...] like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which will produce new string in which all characters which are not A-Z0-9 will be removed (replaced with "").

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
have a look at the boundary matcher of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (=end of a line) might be interesting. I also recommend the following eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/ it gives you the possibilty to exactly see what will be the match and the group of your regex for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good in Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm

Java Stringparsing with Regexp

I try to parse a String with a Regexp to get parameters out of it.
As an example:
String: "TestStringpart1 with second test part2"
Result should be: String[] {"part1", "part2"}
Regexp: "TestString(.*?) with second test (.*?)"
My Testcode was:
String regexp = "TestString(.*?) with second test (.*?)";
String res = "TestStringpart1 with second test part2";
Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(res);
int i = 0;
while(matcher.find()) {
i++;
System.out.println(matcher.group(i));
}
But it only outputs the "part1"
Could someone give me hint?
Thanks
may be some fix regexp
String regexp = "TestString(.*?) with second test (.*)";
and change println code ..
if (matcher.find())
for (int i = 1; i <= matcher.groupCount(); ++i)
System.out.println(matcher.group(i));
Well, you only ever ask it to... In your original code, the find keeps shifting the matcher from one match of the entire regular expression to the next, while within the while's body you only ever pull out one group. Actually, if there would have been multiple matches of the regexp in your string, you would have found that for the first occurence, you would have got the "part1", for the second occurence you would have got the "part2", and for any other reference you would have got an error.
while(matcher.find()) {
System.out.print("Part 1: ");
System.out.println(matcher.group(1));
System.out.print("Part 2: ");
System.out.println(matcher.group(2));
System.out.print("Entire match: ");
System.out.println(matcher.group(0));
}

Categories