Regex pattern in java fails but works fine otherwise - java

I've implemented quite a complicated pattern to match all occurrences of a ship set number. It works perfectly fine with global, case-insensitive matching.
I use the following code to implement the same thing in Java but it doesn't match. Should Java regex be implemented differently?
int i = 0;
while (i < elementsArray.size()) {
System.out.println("List element:"+elementsArray.get(i));
String theRegex = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
if (elementsArray.get(i).matches(theRegex)) {
System.out.println("RESULT:");
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
}
System.out.println("text==========" + shipsets);
}
i++;
}

Here is a simplification of your code which should work, assuming that your regex works correctly in Java. From my preliminary investigation, it does seem to match many of the use cases in your link. You don't need String.matches() because you are already using a Matcher, whose find() tells you whether or not you have a match.
List<String> elementsArray = new ArrayList<String>();
elementsArray.add("Shipset Number 323");
elementsArray.add("meh");
elementsArray.add("SS NO. : 34");
elementsArray.add("Mary had a little lamb");
elementsArray.add("Ship Set #2, #33 to #4.");
for (int i=0; i < elementsArray.size(); ++i) {
System.out.println("List element:"+elementsArray.get(i));
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
System.out.println("Found a match at element " + i + ": " + shipsets);
}
}
You can see in the output below that the three ship set test strings all matched, and the controls "meh" and "Mary had a little lamb" did not.
Output:
List element:Shipset Number 323
Found a match at element 0: Shipset Number 323
List element:meh
List element:SS NO. : 34
Found a match at element 2: SS NO. : 34
List element:Mary had a little lamb
List element:Ship Set #2, #33 to #4.
Found a match at element 4: Ship Set #2, #33 to #4.

In my opinion your problems are caused by:
1) The use of matches() in if(elementsArray.get(i).matches(theRegex)). matches() returns true only if the whole string matches the regex, so it succeeds in many cases from your example but fails with: SS#1,SS#5,SS#6, SS1, SS2, SS3, SS4, etc. You can simulate this situation by adding ^ at the beginning and $ at the end of the regex. Check how it matches HERE. So a better solution is to use matcher.find() instead of String.matches(), as in Tim Biegeleisen's answer.
2) The use of if(matcher.find()) instead of while(matcher.find()). In some strings you want to retrieve more than one result, so you should call matcher.find() multiple times to get all of them. An if runs only once, so you get only the first matched fragment from the given string. To retrieve them all, use a loop: matcher.find() returns false when it cannot find another match in the given String, which ends the loop.
Check this out. This is Tim Biegeleisen's solution with a small change (while instead of if).
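The two points above can be sketched roughly as follows. This is only an illustration: the class name, the method name, and the deliberately simplified pattern are made up for this example and are not the pattern from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShipSetFinder {
    // Simplified, hypothetical pattern (NOT the one from the question):
    // "SS" or "Ship set", an optional "#", then a number.
    private static final Pattern SHIP_SET =
            Pattern.compile("(?:SS|Ship\\s*set)\\s*#?\\s*\\d+", Pattern.CASE_INSENSITIVE);

    // Collects every matching fragment, not just the first one.
    public static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SHIP_SET.matcher(text);
        while (matcher.find()) { // find() returns false once no further match exists
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "SS#1,SS#5, ship set 6";
        // matches() requires the whole string to fit the pattern, so this is false.
        System.out.println(SHIP_SET.matcher(text).matches());
        // find() in a loop picks up each fragment: [SS#1, SS#5, ship set 6]
        System.out.println(findAll(text));
    }
}
```

With if instead of while, only SS#1 would be collected from that input.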

Related

Check if a String satisfies a regex

I have a List of String and I want to filter out the String that doesn't match a regex pattern
Input List = Orthopedic,Orthopedic/Ortho,Length(in.)
My code
for(String s : keyList){
Pattern p = Pattern.compile("[a-zA-Z0-9-_]");
Matcher m = p.matcher(s);
if (!m.find()){
System.out.println(s);
}
}
I expect the 2nd and 3rd string to be printed as they do not match the regex. But it is not printing anything
Explanation
You are not matching the entire input. Instead, you are trying to find the next matching part in the input. From Matcher#find's documentation:
Attempts to find the next subsequence of the input sequence that matches the pattern.
So your code will match an input if at least one character is one of a-zA-Z0-9-_.
Solution
If you want to match the whole region you should use Matcher#matches (documentation):
Attempts to match the entire region against the pattern.
And you probably want to adjust your pattern to allow multiple characters, for example by a pattern like
[a-zA-Z0-9-_]+
The + allows one or more repetitions of the pattern (? is zero or one and * is zero or more).
Notes
You have an extra - at the end of your pattern. You probably want to remove that. Or, if you intended to match the character literally, you need to escape it (the backslash is doubled because this is a Java string literal):
[a-zA-Z0-9\\-_]+
You can test your regex on sites like regex101.com, here's your pattern: regex101.com/r/xvT8V0/1.
Note that there is also String#matches (documentation). So you could write more compact code by just using s.matches("[a-zA-Z0-9_]+").
Also note that you can shortcut character sets like [a-zA-Z0-9_] by using predefined sets. The set \w (word character) matches exactly your desired pattern.
Since the pattern doesn't change, you might want to compile it outside of the loop to slightly increase performance; the matcher still has to be created (or reset) for each string.
Code
All in all your code might then look like:
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for (String s : keyList) {
Matcher m = p.matcher(s);
if (!m.matches()) {
System.out.println(s);
}
}
Or compact:
for (String s : keyList) {
if (!s.matches("\\w+")) {
System.out.println(s);
}
}
Using streams:
keyList.stream()
.filter(s -> !s.matches("\\w+"))
.forEach(System.out::println);
You shouldn't construct a Pattern in a loop, you currently only match a single character, and you can use !String.matches(String) and a filter() operation. Like,
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
keyList.stream().filter(x -> !x.matches("[a-zA-Z0-9-_]+"))
.forEachOrdered(System.out::println);
Outputs (as requested)
Orthopedic/Ortho
Length(in.)
Or, using the Pattern, like
List<String> keyList = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
Pattern p = Pattern.compile("[a-zA-Z0-9-_]+");
keyList.stream().filter(x -> !p.matcher(x).matches()).forEachOrdered(System.out::println);
There are two problems:
1) the regular expression is wrong, it matches just one character.
2) you need to use m.matches() instead of m.find().
You can use matches instead of find:
//Added the + at the end and removed the extra -
Pattern p = Pattern.compile("[a-zA-Z0-9_]+");
for(String s : keyList){
Matcher m = p.matcher(s);
if (!m.matches()){
System.out.println(s);
}
}
Also note that the point of compiling a pattern is to reuse it, so put it outside the loop. Otherwise you may as well use:
for(String s : keyList){
if (!s.matches("[a-zA-Z0-9_]+")){
System.out.println(s);
}
}
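To see find() and matches() side by side on the list from the question, here is a small sketch. The class and method names are invented for the example, and the character class keeps the hyphen (placed last so it is literal), which is an assumption about what the asker intended.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeyFilter {
    // Compiled once and reused; '-' is placed last so it is a literal hyphen.
    private static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9_-]+");

    // Keeps the strings that are NOT made up entirely of allowed characters.
    public static List<String> rejected(List<String> keys) {
        return keys.stream()
                .filter(s -> !ALLOWED.matcher(s).matches())
                .collect(Collectors.toList());
    }

    // Same filter with find(): a string passes as soon as it contains
    // at least one allowed character, so nothing is rejected here.
    public static List<String> rejectedWithFind(List<String> keys) {
        return keys.stream()
                .filter(s -> !ALLOWED.matcher(s).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Orthopedic", "Orthopedic/Ortho", "Length(in.)");
        System.out.println(rejected(keys));         // [Orthopedic/Ortho, Length(in.)]
        System.out.println(rejectedWithFind(keys)); // []
    }
}
```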

Multiple matches with delimiter

this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match a given input, but it can be "chained". So my input should be valid if and only if the whole pattern repeats without anything between those occurrences except whitespace (only one match, or multiple matches next to each other with possible whitespace between them).
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method. This would do the trick, but it will accept input like this: +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes, but I'm searching for an elegant solution.
EDIT
The regex suggested in the comments below, ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$, will partially do the trick, but if I use it in the code below it returns a different result than the one I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.
This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that the sign can occur at most once.
Online Demo
You could use ^$ for that, to match the start/end of string
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.
You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
public class RegTest {
public static final Pattern PATTERN = Pattern.compile(
"(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");
@Test
public void testDays() throws Exception {
assertTrue(valid("1 day"));
assertTrue(valid("-1 day"));
assertTrue(valid("+1day-1month"));
assertTrue(valid("+1day -1month"));
assertTrue(valid(" +1day +1month "));
assertFalse(valid("+1day###+1month"));
assertFalse(valid(""));
assertFalse(valid("++1day-1month"));
}
private static boolean valid(String s) {
return PATTERN.matcher(s).matches();
}
}
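The two-step idea mentioned in the question (validate the whole input first, then extract each term with find()) might look roughly like this. The class name, the method name, and the choice of returning String arrays are assumptions made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetParser {
    // One term: optional sign, multiplier, time unit.
    private static final Pattern TERM =
            Pattern.compile("([+-]?)(\\d+)\\s*([a-zA-Z]+)");
    // The whole input: one or more terms separated only by optional whitespace.
    private static final Pattern WHOLE =
            Pattern.compile("\\s*(?:[+-]?\\d+\\s*[a-zA-Z]+\\s*)+");

    // Returns one {sign, multiplier, unit} array per term,
    // or an empty list if the input as a whole is not valid.
    public static List<String[]> parse(String input) {
        List<String[]> terms = new ArrayList<>();
        if (!WHOLE.matcher(input).matches()) { // step 1: validate everything
            return terms;
        }
        Matcher m = TERM.matcher(input);       // step 2: extract term by term
        while (m.find()) {
            terms.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] t : parse("+1day -1month")) {
            System.out.println(t[0] + " | " + t[1] + " | " + t[2]);
        }
        System.out.println(parse("+1day###+1month").size()); // 0
    }
}
```

It costs a second pass over the input, but it keeps the validation regex and the extraction regex simple and separate.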
You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// temporarily stored in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap<String, String> item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches the start of the string or the position after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, that proves the string is valid from start to end.

Regex to match only letters and numbers

Can you help with this code?
It seems easy, but always fails.
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
//Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
assertEquals("P12345678P", ret.toString());
}
Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
while (matches.find()) {
ret.append(matches.group());
}
assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me to suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1 -- it inquires about the pattern, not about matches to it.
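A quick way to convince yourself of the groupCount() behaviour is a sketch like the following; the class and helper method are hypothetical and exist only for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Returns the number of capturing groups declared in the pattern.
    // No find() or matches() call is needed, and the input is irrelevant.
    public static int groupsIn(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupsIn("([\\w])", "P-12345678-P"));   // 1
        System.out.println(groupsIn("([\\w])", "---"));            // 1, although nothing can match
        System.out.println(groupsIn("[A-Z0-9]+", "P-12345678-P")); // 0, no capturing group at all
    }
}
```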
First of all you don't need to add any group because entire match can be always accessed by group 0, so instead of
(regex) and group(1)
you can use
regex and group(0)
The next thing is that \\w is already a character class, so you don't need to surround it with another [ ]; that would be like writing [[a-z]], which is the same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you will iterate over all groups from 1, but you will exclude the last group: groups are indexed from 1 to n, so i < n will not include n. You would need to use i <= matches.groupCount() instead.
Also, it looks like you are confusing two things. This loop will not find all matches of the regex in the input. Such a loop is used to iterate over the groups of the regex after a match has been found.
So if the regex were something like (\w(\w))c and your match were abc, then
for (int i = 1; i <= matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
the first group contains the two characters (\w(\w)) before c
the second group is the one inside the first one, right after the first character.
But to print them you would first need to let the regex engine iterate over your input and find() a match, or check whether the entire input matches() the regex. Otherwise you would get an IllegalStateException, because the regex engine can't know which match you want to get your groups from (there can be many matches of the regex in the input).
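The group iteration just described can be sketched like this, using the (\w(\w))c example. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIterationDemo {
    // Returns groups 1..n of the first match, or an empty list if nothing matches.
    public static List<String> groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) { // a match must be found before groups can be read
            for (int i = 1; i <= m.groupCount(); i++) { // <= so the last group is included
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    // Shows that reading a group before any find()/matches() call fails.
    public static boolean failsWithoutFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group(1);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupsOf("(\\w(\\w))c", "abc"));         // [ab, b]
        System.out.println(failsWithoutFind("(\\w(\\w))c", "abc")); // true
    }
}
```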
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Another (and probably simpler) solution would be to remove all the characters you don't want from your input. So you could just use replaceAll and a negated character class [^...] like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which will produce new string in which all characters which are not A-Z0-9 will be removed (replaced with "").

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
Have a look at the boundary matchers of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (= end of a line) might be interesting. I also recommend the following Eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/. It lets you see exactly what the match and the groups of your regex will be for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good in Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm
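For a Java version with a similar effect, one option is sketched below. Note that it swaps in a different technique from the substitution above: instead of capturing the middle part, it deletes the runs of ' at either end with an alternation. The class and method names are made up for the example.

```java
public class QuoteStripper {
    // Removes a leading run of ' (^'+) and a trailing run of ' ('+$).
    // Quotes in the middle are not touched because they are not anchored
    // to the start or the end of the string.
    public static String strip(String s) {
        return s.replaceAll("^'+|'+$", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("''''aa''''"));         // aa
        System.out.println(strip("''''aa"));             // aa
        System.out.println(strip("aa''''"));             // aa
        System.out.println(strip("''''aa''bb''cc''''")); // aa''bb''cc
    }
}
```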

Find ASCII "arrows" in text

I'm trying to find all the occurrences of "Arrows" in text, so in
"<----=====><==->>"
the arrows are:
"<----", "=====>", "<==", "->", ">"
This works:
String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
for (String p : patterns) {
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
}
but this doesn't:
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
No idea why. It often reports "<" instead of "<====" or similar.
What is wrong?
Solution
The following program is one possible solution to the question:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class A {
public static void main( String args[] ) {
String p = "<=+|<-+|=+>|-+>|<|>";
Matcher m = Pattern.compile(p).matcher(args[0]);
while (m.find()) {
System.out.println(m.group());
}
}
}
Run #1:
$ java A "<----=====><<---<==->>==>"
<----
=====>
<
<---
<==
->
>
==>
Run #2:
$ java A "<----=====><=><---<==->>==>"
<----
=====>
<=
>
<---
<==
->
>
==>
Explanation
An asterisk (*) will match zero or more of the preceding character. A plus (+) will match one or more of the preceding character. Thus <-* can match a bare < whereas <-+ needs at least <- and also matches any extended version (such as <--------).
When you match "<=*|<-*|=*>|-*>" against the string "<---", it matches the first part of the pattern, "<=*", because * includes zero or more. Java matching is greedy, but it isn't smart enough to know that there is another possible longer match, it just found the first item that matches.
Your first solution will match everything that you are looking for because you send each pattern into matcher one at a time and they are then given the opportunity to work on the target string individually.
Your second attempt will not work in the same manner because you are putting in a single pattern with multiple expressions OR'ed together, and there are precedence rules for the OR'ed expression: the leftmost alternative is attempted first. If it matches, no matter how minimally, find() reports that match and the search continues from there.
See Thangalin's response for a solution that will make the second work like the first.
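The effect of alternation order can be checked with a tiny sketch like this one; the class and helper method are hypothetical and only serve the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Returns the first match of regex in input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // <=* is tried first and succeeds with zero '=' characters.
        System.out.println(firstMatch("<=*|<-*", "<--"));   // <
        // Swapping the alternatives lets <-* consume the dashes.
        System.out.println(firstMatch("<-*|<=*", "<--"));   // <--
        // With +, the first alternative fails here and the second one is tried.
        System.out.println(firstMatch("<=+|<-+|<", "<--")); // <--
    }
}
```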
For <======= you need <=+ as the regex. <=* will match zero or more ='s, which means it can always match the zero case, hence <. The same goes for the other cases you have. You should read up a bit on regexes. This book is FANTASTIC:
Mastering Regular Expressions
Your provided regex pattern String works for most of your example "<----=====><==->>", although it reports the leading "<----" as just "<":
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
It is broken in the same way for other examples pointed out in the answers: the input string "<-" yields "<", whereas "<=" yields "<=" as it should, because <=* is the first alternative and can consume the = characters.
