Java Regexp clarification

I have a string like:
<RandomText>
executeRule(x, y, z)
<MoreRandomText>
What I would like to accomplish is the following: if this executeRule string exists in the bigger text block, I would like to get its second parameter.
How could I do this?

What do you mean by the bigger text block?
If you want to extract the second param from that expression, it would be something like
executeRule\(\w+,\s*(\w+),\s*\w+\)
The second param is held in capture group 1 ($1).
Keep in mind that to use this expression in a Java string literal, you need to escape each backslash (write \\ for \). Also, I'm just assuming \w is good enough to match your params; that depends on your particular rules.
If you need some help with actually using regexes in Java, there are many resources you can turn to. I found this tutorial to be fairly simple, and it explains the basic usage:
http://www.vogella.de/articles/JavaRegularExpressions/article.html

import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
Pattern p = Pattern.compile("executeRule\\(\\w+, (\\w+), \\w+\\)");
Matcher m = p.matcher(YOUR_TEXT_FROM_FILE);
while (m.find()) {
String secondArgument = m.group(1);
...process secondArgument...
}
Once this code executes, secondArgument will contain the value of y. The regular expression above assumes that the arguments are composed of word characters (i.e. lowercase and uppercase letters, digits and underscore).
The double backslashes are required by Java string literal syntax; the regexp engine sees single backslashes.
If you'd like to allow whitespace in the string, as most programming languages do, you may use the following regexp:
Pattern p = Pattern.compile("executeRule\\(\\s*\\w+\\s*,\\s*(\\w+)\\s*,\\s*\\w+\\s*\\)");
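For completeness, here is a self-contained sketch that wraps the whitespace-tolerant pattern above in a helper. The class and method names (SecondArgumentExtractor, secondArguments) are made up for illustration, and it still assumes the arguments consist of word characters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SecondArgumentExtractor {
    // Whitespace-tolerant pattern from the answer; group 1 is the second argument.
    private static final Pattern EXECUTE_RULE = Pattern.compile(
            "executeRule\\(\\s*\\w+\\s*,\\s*(\\w+)\\s*,\\s*\\w+\\s*\\)");

    /** Returns the second argument of every executeRule(...) call found in text. */
    static List<String> secondArguments(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = EXECUTE_RULE.matcher(text);
        while (m.find()) {          // find() scans the larger text block for each call
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<RandomText>\nexecuteRule(x, y, z)\n<MoreRandomText>";
        System.out.println(secondArguments(text));
    }
}
```

An empty list means the text block contains no executeRule call in the expected shape.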

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code to strip off certain defined functions and keep only the value passed to them, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed, leaving the output shown above.
The functions CInt and Math.Truncate keep changing to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow but most did not work.
It would also be nice if the query were customizable, e.g. given CInt it returns everything the CInt call wraps (find a name, then everything between its first ( and the matching ), allowing for balanced parenthesis pairs in between).

I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
assertEquals(expectedOutput, output);
}
Now some explanation: each \\s* allows any amount of whitespace at that position. In the pattern, I used (.*) in the middle, which matches anything there, but that's fine*. I used (.*) instead of .* so that this section gets captured as capturing group $1 ($0 is always the whole match). With the interesting part captured, I can refer to it in the replacement string.
*As long as you don't have several such assignments within one string. Otherwise, you should break the string up into parts that each contain only one such assignment and apply the replacement to each part. Or try (.*?) instead of (.*); it compiles for me, and AFAIK that makes the .* match as few characters as possible.
If the methods being called vary, replace their names in the regex with the alternatives you expect: for example, replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
If that gets too complicated, then you should really consider whether you want to do it with regex at all, or whether it'd be easier to write a somewhat longer function with plain string methods like indexOf and substring :)
Bonus: if absolutely everything varies except the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to stop the capture from spanning deeper nested expressions. Note that fine control of nesting depth is a real weakness of everyday regular expressions.
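The plain-string alternative mentioned above (indexOf and substring) could look roughly like the sketch below. It is not the answerer's code: the names CallUnwrapper and argumentsOf are hypothetical, and it only looks at the first occurrence of the function name and ignores parentheses inside string literals.

```java
final class CallUnwrapper {
    /**
     * Returns the text between the parentheses of the first call to functionName
     * in source, counting nested parentheses so the matching ')' is found.
     * Returns null if the name is missing, is not followed by '(', or the
     * parentheses are unbalanced.
     */
    static String argumentsOf(String source, String functionName) {
        int nameAt = source.indexOf(functionName);
        if (nameAt < 0) {
            return null;
        }
        // Skip optional whitespace between the name and its opening parenthesis.
        int open = nameAt + functionName.length();
        while (open < source.length() && Character.isWhitespace(source.charAt(open))) {
            open++;
        }
        if (open >= source.length() || source.charAt(open) != '(') {
            return null;
        }
        int depth = 0;
        for (int i = open; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return source.substring(open + 1, i); // matching close found
            }
        }
        return null; // ran out of input before the call was closed
    }

    public static void main(String[] args) {
        String input = "ArrayNew(1) = adjustalpha(shadowcolor, "
                + "CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
        System.out.println(argumentsOf(input, "ObjectToNumber"));
    }
}
```

Because the depth is tracked with a counter rather than a pattern, the function name can be swapped freely (CInt, CStr, ...) without touching the logic.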

Java regex to parse a particular semicolon delimited param from a URL?

I have a URL I'm expecting like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213
I want to parse out
session-id=1FSDSF2132FSADASD13213
Using a regular expression in Java, what would be the best approach to take for this?
Using a regex test website I've experimented with some different ways, but I'm wondering which approach is the most fail-safe and still works in case the URL is actually formed like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
or
www.somewebsite.com/misc-session/extra-path/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
I am always just looking for the value of "session-id".
EDIT:
The value of session-id is NOT limited to digits and is guaranteed to contain a combination of both letters and digits.

What is the best approach that is the most fail safe, and protected.
Well, I think matching a word boundary on both sides will be enough.
Regex: \bsession-id=\d+\b
Note: use \\d and \\b if the regex is written in a string literal that needs double escaping, as in Java.
In case session-id can contain characters in the range [A-Za-z0-9], use this regex instead.
Regex: \bsession-id=[A-Za-z0-9]+\b
Remember to include
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Try this one:
String str = "www.somewebsite.com/misc-session/;session-id=213213213";
Pattern p = Pattern.compile("(session-id=\\d+)");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(0));
}
Note that session-id= is always present and you are interested in the number that follows, which \d represents (write \\d in a Java string literal). The + means one or more digits. If the value can also contain letters, use [A-Za-z0-9]+ in place of \\d+.
For a detailed breakdown of the pattern, see the explanation on Regex101.
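Since the question's edit says the value mixes letters and digits, a sketch along these lines may be closer to what is needed. The SessionIdParser class and sessionId method are illustrative names only; the pattern is the [A-Za-z0-9]+ variant suggested in another answer, with the value captured in group 1.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SessionIdParser {
    // Group 0 is the whole "session-id=..." pair, group 1 is just the value.
    private static final Pattern SESSION_ID =
            Pattern.compile("\\bsession-id=([A-Za-z0-9]+)");

    /** Returns the session-id value, or an empty Optional if the URL has none. */
    static Optional<String> sessionId(String url) {
        Matcher m = SESSION_ID.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "www.somewebsite.com/misc-session/extra-path/"
                + ";session-id=1FSDSF2132FSADASD13213?someExtraParam=false";
        System.out.println(sessionId(url).orElse("<none>"));
    }
}
```

The character class stops at the first character that is not a letter or digit, so a trailing ?someExtraParam=false is not included in the value.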

Java regular expression for number starts with code

I am not a Java developer but I am interfacing with a Java system.
Please help me with a regular expression that would detect all numbers starting with 25678 or 25677.
For example, in Rails it would be:
^(25677|25678)
Sample inputs are 256776582036 and 256782405036

^(25678|25677)
or
^2567[78]
If you use ^(25678|25677)[0-9]*$ (note the trailing $), it guarantees that the remaining characters are all digits and nothing else.
That should do the trick for you: it looks for either prefix and then any digits after it.

In Java the regex would be the same, assuming that the number takes up the entire line. You could further simplify it to
^2567[78]
If you need to match a number anywhere in the string, use the \b anchor (double the backslash if you are writing a string literal in Java code).
\b2567[78]
What if there is a possibility of a + at the beginning of the number?
Add an optional +, like this [+]? or like this \+? (again, double the backslash for inclusion in a string literal).
Note that it is important to know what Java API is used with the regular expression, because some APIs will require the regex to cover the entire string in order to declare it a match.

Try something like:
String number = ...;
if (number.matches("^2567[78].*$")) {
//yes it starts with your number
}
The regex ^2567[78].*$ means:
The number starts with 2567, followed by either 7 or 8, followed by any characters.
If you need only digits after, say, 25677, then the regex should be ^2567[78]\\d*$, which means zero or more digits after the matching prefix at the beginning.

The regex syntax of Java is pretty close to that of Rails, especially for something this simple. The trick is in using the correct API calls. If you need to do more than one search, it's worthwhile to compile the pattern once and reuse it. Something like this should work (strings stands for whatever collection holds your inputs):
Pattern p = Pattern.compile("^2567[78]");
for (String s : strings) {
if (p.matcher(s).find()) {
// string starts with 25677 or 25678
} else {
// string starts with something else
}
}
If it's a one-shot deal, then you can simplify all this by changing the pattern to cover the entire string:
if (someString.matches("2567[78].*")) {
// string starts with 25677 or 25678
}
The matches() method tests whether the entire string matches the pattern; hence the leading ^ anchor is unnecessary but the trailing .* is needed.
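If you need to account for an optional leading + (as you indicated in a comment to another answer), include \\+? at the start of the pattern in the string literal (or right after the ^ if that's used). The + has to be escaped to match a literal plus sign, because a bare + is a quantifier.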
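To make the find() versus matches() difference concrete, here is a small sketch. PrefixCheck and its two method names are invented for this example; the optional plus sign is written escaped, as \\+?, because it has to match a literal character.

```java
import java.util.regex.Pattern;

final class PrefixCheck {
    // Optional literal '+', then 25677 or 25678, anchored at the start.
    private static final Pattern PREFIX = Pattern.compile("^\\+?2567[78]");

    /** find() only needs the pattern to match somewhere; ^ pins it to the start. */
    static boolean startsWithCode(String number) {
        return PREFIX.matcher(number).find();
    }

    /** matches() must cover the whole string, so the rest is spelled out as digits. */
    static boolean isNumberWithCode(String number) {
        return number.matches("\\+?2567[78]\\d*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithCode("+256782405036"));
        System.out.println(isNumberWithCode("25677abc"));
    }
}
```

The second method is stricter: a string such as 25677abc passes the prefix check but is rejected once the whole input has to consist of digits.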

Why the second argument is not being taken as regex?

I came across an interesting question on java regex
Is there a regular expression way to replace a set of characters with another set (like shell tr command)?
So I tried the following:
String a = "abc";
a = a.replaceAll("[a-z]", "[A-Z]");
Now if I print a, the output is
[A-Z][A-Z][A-Z]
Here I think the compiler is taking the first argument as a regex, but not the second argument.
So is there a problem with this code, or is something else the reason?

This is the way replaceAll works.
See API:
public String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular expression with the given replacement.

The answer to the linked question is a quite clear »No«, so this should come as no surprise.
As you can see from the documentation the second argument is indeed a regular string that is used as replacement:
Parameters:
regex – the regular expression to which this string is to be matched
replacement – the string to be substituted for each match

The second argument is a plain String that gets substituted for each match, as the API says.

If you want to turn lower case into upper case, there is a toUpperCase method available in the String class. As for functionality equivalent to the tr utility, I think there is no support in Java (up to Java 7).
The replacement string is usually taken literally, except for the sequence $n, where n denotes the number of a capturing group in the regex; it is replaced by the string captured by that group in the match.
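A short sketch of how the replacement string is treated; the ReplacementDemo class and its method names exist only for this illustration. Matcher.quoteReplacement is the standard way to get a literal $ into the output.

```java
import java.util.regex.Matcher;

final class ReplacementDemo {
    /** The replacement is inserted as text: "[A-Z]" is not a character class here. */
    static String literalReplacement(String s) {
        return s.replaceAll("[a-z]", "[A-Z]");
    }

    /** $1 in the replacement refers to what capturing group 1 matched. */
    static String wrapEachLetter(String s) {
        return s.replaceAll("([a-z])", "<$1>");
    }

    /** quoteReplacement turns "$1" into a literal dollar sign followed by 1. */
    static String literalDollar(String s) {
        return s.replaceAll("b", Matcher.quoteReplacement("$1"));
    }

    public static void main(String[] args) {
        System.out.println(literalReplacement("abc"));
        System.out.println(wrapEachLetter("abc"));
        System.out.println(literalDollar("abc"));
    }
}
```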

I consider a regex to be a way to express a condition (i.e. does a given string match this expression?). With that in mind, what you are asking would mean "please replace what matches in my string with ... another condition", which doesn't make much sense.
Trying to understand what you are looking for, it seems to me that you want some automatic mapping between classes of characters (e.g. [a-z] -> [A-Z]). As far as I know this does not exist and you would have to write it yourself (except for the aforementioned toUpperCase()).
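Writing such a mapping yourself is not much code. The sketch below is one possible shape, with made-up names (Tr, translate); unlike the shell tr command it takes explicit character lists and does not expand ranges such as a-z.

```java
final class Tr {
    /**
     * Replaces each character of input that occurs in from with the character
     * at the same index in to; all other characters are copied unchanged.
     */
    static String translate(String input, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int index = from.indexOf(c);   // -1 when c is not in the source set
            out.append(index >= 0 ? to.charAt(index) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("abc", "abc", "ABC"));
    }
}
```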

public String replaceAll(String regex, String replacement)
The first argument is a regular expression; every substring that matches that pattern is replaced by the second argument. If you want to convert lowercase to uppercase, use the toUpperCase() method.

You should look into jtr. Example of usage:
String hello = "abccdefgdhcij";
CharacterReplacer characterReplacer;
try {
characterReplacer = new CharacterReplacer("a-j", "Helo, Wrd!");
hello = characterReplacer.doReplacement(hello);
} catch(CharacterParseException e) {
}
System.out.println(hello);
Output:
Hello, World!

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy (reluctant) qualifier changes things so that the group starts off trying to match nothing and then builds from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, and so on. It can do its job in one forward pass over the input.
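The two quantifier styles can be compared directly with a sketch like this one. GreedyVsReluctant and groups are hypothetical names, and the second input in main is deliberately ambiguous (Key2 appears twice) to show that the two patterns can capture different text, not only backtrack differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    /** Returns all captured groups if regex matches the whole input, else an empty list. */
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String greedy = "Key1=(.*),Key2=(.*)";
        String reluctant = "Key1=(.*?),Key2=(.*?)";
        // Same captures for the simple input from the answer.
        System.out.println(groups(greedy, "Key1=x,Key2=y"));
        System.out.println(groups(reluctant, "Key1=x,Key2=y"));
        // Different captures once ",Key2=" occurs more than once.
        System.out.println(groups(greedy, "Key1=a,Key2=b,Key2=c"));
        System.out.println(groups(reluctant, "Key1=a,Key2=b,Key2=c"));
    }
}
```

Note that matches() forces the last reluctant group to stretch to the end of the input; with find() it would stop as early as possible and capture an empty string.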

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma before a non-capturing group made up of word characters followed by the equal sign (the next key)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which consists of the characters of a value, then the last comma in the substring, followed by the next key. You can easily find that last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
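The split-on-"=" idea described above could be sketched as follows. KeyValueParser and parse are illustrative names, and the code assumes, as the answer does, that values never contain an equal sign.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyValueParser {
    /** Parses "Key1=v1,Key2=v2,..." where values may contain commas but not '='. */
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        // Limit -1 keeps a trailing empty value instead of dropping it.
        String[] parts = data.split("=", -1);
        if (parts.length < 2) {
            return result; // no '=' at all, so no pairs
        }
        String key = parts[0]; // first part is only a key
        for (int i = 1; i < parts.length - 1; i++) {
            // Middle parts look like "value,NextKey": split at the last comma.
            int comma = parts[i].lastIndexOf(',');
            if (comma < 0) {
                throw new IllegalArgumentException("No ',' before next key in: " + parts[i]);
            }
            result.put(key, parts[i].substring(0, comma));
            key = parts[i].substring(comma + 1);
        }
        result.put(key, parts[parts.length - 1]); // last part is only a value
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=a,b,Key2=c"));
    }
}
```

A LinkedHashMap keeps the pairs in the order they appear in the input, which is convenient when the keys are known constants.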

If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(), which I would recommend because it lets you move down the string one key=value pair at a time. That, however, moves you into the whole "greedy" evaluation discussion.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}
