Is this Regex incorrect? No matches found - java

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy qualifier changes things so that it starts off trying to match nothing and then building from there.
Does "x,Key2=(.*)" match ,Key2=(.*), no so try
Does ",Key2=(.*)" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them which involves unmatching 5 which involves unmatching 4, .... It can do it's job in one forward pass over the input.

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to an non-capturing group comprised of word characters followed by the equal sign (the next value)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which that is a bunch of characters that comprise a value, the last comma in the string, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.

If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure that there is a better way to parse this thing down that goes through .find() rather than .matches() which I think I would recommend as it allows you to move down the string one key=value pair at a time. It moves you into the whole "greedy" evaluation discussion.

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code and strip off certain defined function and only strong the value of that function like in example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber is removed retaining on output as shown above
The functions CInt, Math.Truncate keep on changing to CStr or Math.Random etc etc so regex query can not be hardcoded.
I tried a lot of options on stackoverflow but most did not work.
Also it would be nice if the query is customizable like Cint returns everything function CInt refers to. ( find a text then everything between first ( and ) ignoring balanced parenthesis pairs in between.
I know it's not pretty, but it's your fault to use raw regex for this :)
#Test
void unwrapCIntCall() {
String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
assertEquals(expectedOutput, output);
}
Now some explanation; the \\s* parts allow any number of any whitespace character, where they are. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that particular section gets captured as capturing group $1 (because $0 is always the whole match). The interesting part being captured, I can refer them in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variation you expect, like replace CInt with (?CInt|CStr), Math\\.Truncate with Math\\.(?Truncate|Random) etc. (Using (? instead of ( makes that group non-capturing, so they won't take up $1, $2, etc. slots).
If that gets too complicated, than you should really think whether you really want to do it with regex, or whether it'd be easier to just write a relatively longer function with plain string methods, like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.

Regex extract string in java

I'm trying to extract a string from a String in Regex Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where am I going wrong but I don't get output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
RegEx Demo
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, unimportant how often. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You can match any character instead, it will include new lines: .*. You can even leave it away as it isn't actually part of the String you want to match - using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
If you are using Java 7 and above, you can use naming groups to make the Regex a little bit more readable (also see the backreference group \k for close tag to match the openning tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Using back reference the matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is correct
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.

Java Regex to check "=number", ex "=5455"?

I want to check a string that matches the format "=number", ex "=5455".
As long as the fist char is "=" & the subsequence is any number in [0-9] (dot is not allowed), then it will popup "correct" message.
if(str.matches("^[=][0-9]+")){
Window.alert("correct");
}
So, is this ^[=][0-9]+ the correct one?
if it is not correct, can u provide a correct solution?
if it is correct, then can u find a better solution?
I'm no big regex expert and more knowledgeable people than me might correct this answer, but:
I don't think there's a point in using [=] rather than simply = - the [...] block is used to declare multiple choices, why declare a multiple choice of one character?
I don't think you need to use ^ (if your input string contains any character before =, it won't match anyway). I'm unsure as to whether its presence makes your regex faster, slower or has no effect.
In conclusion, I'd use =[0-9]+
That should be correct it is looking for an anchored at the beginning = sign and then 1 or more digits between 0-9
Your regex will work, even though it can be simplified:
.matches() does not really do regex matching, since it tries and matches all the input against the regex; therefore the beginning of input anchor is not needed;
you don't need the character class around the =.
Therefore:
if (str.matches("=[0-9]+")) { ... }
If you want to match a string which only begins with that regex, you have to use a Pattern, a Matcher and .find():
final Pattern p = Pattern.compile("^=[0-9]+");
final Matcher m = p.matcher(str);
if (m.find()) { ... }
And finally, Matcher also has .lookingAt() which anchors the regex only at the beginning of the input.

How can I obtain what .* matched in a regular expression?

I have thousands of different regular expressions and they look like this:
^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\/([0-9\.]+)
How do I obtain those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different .*s. In addition, I don't know in advance how many .*s there are, even though I can possibly find out by doing some simple operation on the given regex string, but that would impose more complexity on the program. I process a fairly big amount of data, so really focus on the efficiency here.
Replace the .*s with (.*)s and use matcher.group(n). For instance:
Pattern p = Pattern.compile("1(.*)2(.*)3");
Matcher m = p.matcher("1abc2xyz3");
m.find();
System.out.println(m.group(2));
xyz
Notice how the match of the second (.*) was returned (since m.group(2) was used).
Also, since you mentioned you won't know how many .*s your regex will contain, there is a matcher.groupCount() method you can use, if the only capturing groups in your regex will indeed be (.*)s.
For your own enlightenment, try reading about capturing groups.
How do I get those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different DOT STAR.
Use groups: (.*)
I addition, I don't know in advance how many DOT STARs there are
Build your regex string, then replace .* with (.*):
String myRegex = "your regex here";
myRegex = myRegex.replace(".*","(.*)");
even though I can possible find out about that by doing some simple operation on the given regex string, but that would impose more complexity on the program
If you don't know how the regex is made and the regex is not built by your application, the only way is to process it after you have it. If you are building the regex, then append (.*) to the regex string instead of appending .*

regular expression to allow only 1 dash

I have a textbox where I get the last name of a user. How do I allow only one dash (-) in a regular expression? And it's not supposed to be in the beginning or at the end of the string.
I have this code:
Pattern p = Pattern.compile("[^a-z-']", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(name);
Try to rephrase the question in more regexy terms. Rather than "allow only one dash, and it can't be at the beginning" you could say, "the string's beginning, followed by at least one non-dash, followed by one dash, followed by at least one non-dash, followed by the string's end."
the string's beginning: `^
at least one non-dash: [^-]+
followed by one dash: -
followed by at least one non-dash: [^-]+
followed by the string's end: $
Put those all together, and there you go. If you're using this in a context that matches against the complete string (not just any substring within it), you don't need the anchors -- though it may be good to put them in anyway, in case you later use that regex in a substring-matching context and forget to add them back in.
Why not just use indexOf() in String?
String s = "last-name";
int first = s.indexOf('-');
int last = s.lastIndexOf('-');
if(first == 0 || last == s.length()-1) // Checks if a dash is at the beginning or end
System.out.println("BAD");
if(first != last) // Checks if there is more than one dash
System.out.println("BAD");
It is slower than using regex but with usually small size of last names it should not be noticeable in the least bit. Also, it will make debugging and future maintenance MUCH easier.
It looks like your regex represents a fragment of an invalid value, and you're presumably using Matcher.find() to find if any part of your value matches that regex. Is that correct? If so, you can change your pattern to:
Pattern p = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");
which will match a non-letter-non-hyphen-non-apostrophe character, or a sequence of characters that both starts and ends with hyphens (thereby detecting a value that contains two hyphens), or a leading hyphen, or a trailing hyphen.
This regex represents one or more non-hyphens, followed by a single hyphen, followed by one or more non-hyphens.
^[^\-]+\-[^\-]+$
I'm not sure if the hyphen in the middle needs to be escaped with a backslash... That probably depends on what platform you're using for regex.
Try pattern something like [a-z]-[a-z].
Pattern p = Pattern.compile("[a-z]-[a-z]");

Categories