I am exploring Regular expressions.
Problem statement : Replace String between # and # with the values provided in replacements map.
import java.util.regex.*;
import java.util.*;
public class RegExTest {
public static void main(String args[]){
HashMap<String,String> replacements = new HashMap<String,String>();
replacements.put("OldString1","NewString1");
replacements.put("OldString2","NewString2");
replacements.put("OldString3","NewString3");
String source = "#OldString1##OldString2#_ABCDEF_#OldString3#";
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
//Pattern pattern = Pattern.compile("\\#\\#");
Matcher matcher = pattern.matcher(source);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, "");
buffer.append(replacements.get(matcher.group(1)));
}
matcher.appendTail(buffer);
System.out.println("OLD_String:"+source);
System.out.println("NEW_String:"+buffer.toString());
}
}
Output (this meets my requirement, but I do not understand how the group(1) call works):
OLD_String:#OldString1##OldString2#_ABCDEF_#OldString3#
NEW_String:NewString1NewString2_ABCDEF_NewString3
If I change the code as below
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
with
Pattern pattern = Pattern.compile("\\#\\#");
I am getting below error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
I did not understand difference between
"\\#(.+?)\\#" and `"\\#\\#"`
Can you explain the difference?
The difference is fairly straightforward - \\#(.+?)\\# will match two hashes with one or more chars between them, while \\#\\# will match two hashes next to each other.
A more interesting question, to my mind, is: what is the difference between \\#(.+?)\\# and \\#.+?\\# (the same pattern without the parentheses)?
In this case, what's different is what is (or isn't) getting captured. Parentheses in a regex indicate a capture group: basically, some substring you want to retrieve separately from the overall matched string. Here you're capturing the text in between the hashes. The first pattern will capture it and make it available separately, while the second will not. Try it yourself: asking for matcher.group(1) on the first will return that text, while on the second it will throw an exception, even though they both match the same text.
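A small sketch of the point above, using only java.util.regex. The class and method names are made up for illustration. It counts the capturing groups each pattern defines and shows that group(1) exists only when the pattern contains parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups the pattern defines (group 0, the whole match, is not counted).
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // group(1) of the first match, or null when nothing matches.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groups("#(.+?)#")); // 1
        System.out.println(groups("#.+?#"));   // 0
        System.out.println(groups("##"));      // 0
        System.out.println(firstCapture("#(.+?)#", "#OldString1##OldString2#")); // OldString1
        try {
            firstCapture("##", "#OldString1##OldString2#");
        } catch (IndexOutOfBoundsException e) {
            // "##" matches, but the pattern defines no group 1
            System.out.println("No group 1");
        }
    }
}
```

The # character has no special meaning in a Java regex (unless the COMMENTS flag is set), so the sketch leaves it unescaped.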
.+? tells it to match one or more of any character lazily, i.e. as few characters as possible. So it stops at the first # that follows instead of running on to the last one.
I think the \#\# would match ## so i think the error is because it only matches that one ## and then there's only a group 0, no group 1. But not 100% on that part.
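For reference, here is a hedged variant of the replacement loop from the question, not the asker's exact code. It assumes two things the question does not state: an unknown key should leave the placeholder untouched, and replacement values may contain $ or \ characters, which is why Matcher.quoteReplacement is used. The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderDemo {
    private static final Pattern PLACEHOLDER = Pattern.compile("#(.+?)#");

    static String replace(String source, Map<String, String> replacements) {
        Matcher m = PLACEHOLDER.matcher(source);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // group(1) is the text between the hashes; group() is the whole "#...#" match.
            String value = replacements.getOrDefault(m.group(1), m.group());
            // quoteReplacement stops "$" and "\" in the value from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("OldString1", "NewString1");
        map.put("OldString2", "NewString2");
        map.put("OldString3", "NewString3");
        // prints NewString1NewString2_ABCDEF_NewString3
        System.out.println(replace("#OldString1##OldString2#_ABCDEF_#OldString3#", map));
    }
}
```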
I have written a snippet, but it doesn't work correctly.
I have an input in this format:
Arg2+res=(s11_19,s11_20,s11_21,s11_22),Arg4-res=()
It can contain multiple Args (e.g. Arg1, Arg2, ...).
What I want is to return the +res instances. For example, in the above example, I need this part:
Arg2+res=(s11_19,s11_20,s11_21,s11_22)
My Regex is like the following:
Pattern p = Pattern.compile("Arg\\d+\\+res=\\(\\S+\\)");
Matcher m = p.matcher(ove_imp_roles);
while (m.find()) {
System.out.println(m.group());
}
The code has two problems:
1) It returns the whole string as a single match. For example, in the above sentence it returns Arg2+res=(s11_19,s11_20,s11_21,s11_22),Arg4-res=() as the matching instance.
Even if both instances include Arg1+res, it returns the whole string as a single match, while I expect it to be returned as two different matches.
2) The code counts instances with -res, too, while I don't need them.
Can anyone help me with this problem?
Update: I checked the code again and updated the above question correspondingly. The problem with -res occurs when it includes empty brackets (for example Arg1-res=()).
Thanks in advance,
You're calling m.find() inside while(m.find()), make it like this:
Pattern p = Pattern.compile("Arg\\d+\\+res=\\(\\S+\\)");
Matcher m = p.matcher(ove_imp_roles);
while (m.find()) {
System.out.println(m.group());
}
btw your regex is matching 2nd Arg correctly
Based on the edited question and new input OP can use this regex:
Pattern p = Pattern.compile("Arg\\d+\\+res=\\([^)]+\\)");
[^)]+ will match 1 or more characters that are not ).
The problem is (\\S+\\). If you have the following input:
String s = "Arg2+res=(s1355_19,s1355_20);Arg3-res=(s1355_19,s1355_20)";
Arg\\d+\\+res=\\( matches Arg2+res=( and then \\S+ will match (because the + is greedy):
s1355_19,s1355_20);Arg3-res=(s1355_19,s1355_20
So you can make it lazy, so that it stops as soon as it finds the first right parenthesis in the input:
Pattern p = Pattern.compile("Arg\\d+\\+res=\\(\\S+?\\)");
Alternatively, you can split the input by ';' and see if each String matches "^Arg\\d+\\+.*$"
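To compare the three variants discussed in these answers side by side, here is a minimal sketch. The input strings are invented for the demonstration, and the [^)]* form (star rather than plus) is an assumption, used so that an empty +res=() would also be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArgResDemo {
    // Collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Arg1+res=(s11_19,s11_20),Arg2-res=(),Arg3+res=(s11_21)";
        // Greedy \S+ runs to the last ")" and returns the whole input as one match.
        System.out.println(findAll("Arg\\d+\\+res=\\(\\S+\\)", input));
        // Lazy \S+? stops at the first ")": [Arg1+res=(s11_19,s11_20), Arg3+res=(s11_21)]
        System.out.println(findAll("Arg\\d+\\+res=\\(\\S+?\\)", input));
        // The negated class gives the same two matches and cannot cross a ")".
        System.out.println(findAll("Arg\\d+\\+res=\\([^)]*\\)", input));
    }
}
```

One difference to be aware of: because \S+? needs at least one character, it swallows the ")" of an empty +res=() and keeps going to the next one, whereas [^)]* stops correctly.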
Consider an input string like
Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5
and the regular expression
\b(TWO|FOUR)=([^ ]*)\b
Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).
public static void main(String[] args) throws Exception {
String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
}
}
More specifically, the main() method above prints
TWO = 2
FOUR = 4
but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.
Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes
\b(O.*?|T.*?)=([^ ]*)\b
the above method yields
ONE = 1
TWO = 2
THREE = 3
If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been
ONE = 1
TWO = 2
So, two questions:
Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?
Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings
With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b
You can use it like this:
Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
String theMatch = regexMatcher.group();
// group(1) is the key that was found (TWO, FOUR or SEVEN)
String FoundToken = regexMatcher.group(1);
// end() returns an int: the index just past the end of the match
int EndPosition = regexMatcher.end();
}
You could then:
Test the value contained by FoundToken
Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
Using EndPosition, apply that regex to the end of the string.
Discussion
This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
It also serves your goal of avoiding duplicates.
Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.
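The steps above could be sketched roughly as follows. This is one possible reading of the approach, not a tuned implementation: the class and method names are invented, keys are passed through Pattern.quote on the assumption that they are literal strings, and Matcher.find(int) is used to resume from the previous end position.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RemainingKeysDemo {
    // Returns the first value found for each key, in order of appearance.
    static Map<String, String> extract(String input, Collection<String> keys) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Map<String, String> found = new LinkedHashMap<>();
        int pos = 0;
        while (!remaining.isEmpty()) {
            // Rebuild the alternation from the keys that have not matched yet.
            StringJoiner alternatives = new StringJoiner("|");
            for (String key : remaining) {
                alternatives.add(Pattern.quote(key));
            }
            Matcher m = Pattern.compile("\\b(" + alternatives + ")=([^ ]*)\\b").matcher(input);
            if (!m.find(pos)) {
                break; // none of the remaining keys occurs after pos
            }
            found.put(m.group(1), m.group(2));
            remaining.remove(m.group(1));
            pos = m.end(); // continue after this match
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and TWO=9 comes before FOUR=4";
        System.out.println(extract(input, Arrays.asList("TWO", "FOUR"))); // {TWO=2, FOUR=4}
    }
}
```

Recompiling a pattern on every iteration has its own cost, so for a handful of keys a single pass with the original regex that skips keys already seen is likely simpler.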
Can anyone please help me do the following in a java regular expression?
I need to read 3 characters from the 5th position from a given String ignoring whatever is found before and after.
Example : testXXXtest
Expected result : XXX
You don't need regex at all.
Just use substring: yourString.substring(4,7)
Since you do need to use regex, you can do it like this:
Pattern pattern = Pattern.compile(".{4}(.{3}).*");
Matcher matcher = pattern.matcher("testXXXtest");
matcher.matches();
String whatYouNeed = matcher.group(1);
What does it mean, step by step:
.{4} - any four characters
( - start capturing group, i.e. what you need
.{3} - any three characters
) - end capturing group, you got it now
.* followed by 0 or more arbitrary characters.
matcher.group(1) - get the 1st (only) capturing group.
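One caveat with the snippet above: it ignores the result of matcher.matches(), and group(1) throws IllegalStateException when the match failed (for example on an input shorter than seven characters). A small sketch that checks the result first is shown below. The method name is illustrative, and returning null for "no match" is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedPositionDemo {
    private static final Pattern P = Pattern.compile(".{4}(.{3}).*");

    // Returns characters 5 to 7 of the input, or null if the input does not match.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("testXXXtest")); // XXX
        System.out.println(extract("short"));       // null
    }
}
```

Note that . does not match line terminators by default, so an input containing a newline would need Pattern.DOTALL.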
You should be able to use the substring() method to accomplish this:
String example = "testXXXtest";
String result = example.substring(4,7);
This might help: Groups and capturing in java.util.regex.Pattern.
Here is an example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
public static void main(String[] args) {
String text = "This is a testWithSomeDataInBetweentest.";
Pattern p = Pattern.compile("test([A-Za-z0-9]*)test");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Matched: " + m.group(1));
} else {
System.out.println("No match.");
}
}
}
This prints:
Matched: WithSomeDataInBetween
If you want to match the entire input string against the pattern (rather than seek a substring that matches), you can use matches() instead of find(). With find(), you can continue searching for more matching substrings with subsequent calls.
Also, your question did not specify what are admissible characters and length of the string between two "test" strings. I assumed any length is OK including zero and that we seek a substring composed of small and capital letters as well as digits.
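To make the matches() versus find() distinction concrete, here is a brief sketch that reuses the pattern from this answer. The helper names and the second input string are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    private static final Pattern P = Pattern.compile("test([A-Za-z0-9]*)test");

    // true only if the pattern covers the entire input
    static boolean wholeInputMatches(String s) {
        return P.matcher(s).matches();
    }

    // group(1) of every substring that matches, found by repeated find() calls
    static List<String> allCaptures(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("This is a testWithSomeDataInBetweentest.")); // false
        System.out.println(wholeInputMatches("testWithSomeDataInBetweentest"));            // true
        System.out.println(allCaptures("testAtest and testBtest"));                        // [A, B]
    }
}
```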
You can use substring for this, you don't need a regex.
yourString.substring(4,7);
I'm sure you could use a regex too, but why if you don't need it. Of course you should protect this code against null and strings that are too short.
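A guarded version of the substring call might look like the sketch below. The helper name and the choice of Optional for the "too short or null" case are assumptions, not something the question asks for.

```java
import java.util.Optional;

class SafeSliceDemo {
    // Returns length characters starting at index start, or empty if the input is null or too short.
    static Optional<String> slice(String s, int start, int length) {
        if (s == null || start < 0 || length < 0
                || start > s.length() || length > s.length() - start) {
            return Optional.empty();
        }
        return Optional.of(s.substring(start, start + length));
    }

    public static void main(String[] args) {
        System.out.println(slice("testXXXtest", 4, 3)); // Optional[XXX]
        System.out.println(slice("test", 4, 3));        // Optional.empty
        System.out.println(slice(null, 4, 3));          // Optional.empty
    }
}
```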
Use the String.replaceAll() Method
If performance is not critical, you can try the String.replaceAll() method for a cleaner option:
String sDataLine = "testXXXtest";
String sWhatYouNeed = sDataLine.replaceAll( ".{4}(.{3}).*", "$1" );
References
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#using-regular-expressions-with-string-methods
String s = "test";
Pattern pattern = Pattern.compile("\\n((\\w+\\s*[^\\n]){0,2})(\\b" + s + "\\b\\s)((\\w+\\s*){0,2})\\n?");
Matcher matcher = pattern.matcher(searchableText);
boolean topicTitleFound = matcher.find();
startIndex = 0;
while (topicTitleFound) {
int i = searchableText.indexOf(matcher.group(0));
if (i > startIndex) {
builder.append(documentText.substring(startIndex, i - 1));
...
This is the text that I tackle:
Some text comes here
topicTitle test :
test1 : testing123
test2 : testing456
test3 : testing789
test4 : testing9097
When I'm testing this regex on http://regexpal.com/ or http://www.regexplanet.com I clearly find the title that is saying: "topicTitle test". But in my java code topicTitleFound returns false.
Please help
It could be that you have carriage-return characters ('\r') before the newline characters ('\n') in your searchableText. This would cause the match to fail at line boundaries.
To make your multi-line pattern more robust, try using the MULTILINE option when compiling the regex. Then use ^ and $ as needed to match line boundaries.
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Update:
After actually testing out your code, I see that the pattern matches whether carriage-returns are present or not. In other words, your code "works" as-is, and topicTitleFound is true when it is first assigned (outside the while loop).
Are you sure that you are getting false for topicTitleFound? Or is the problem in the loop?
By the way, the use of indexOf() is wasteful and awkward, since the matcher already stores the index at which group 0 begins. Use this instead:
int i = matcher.start(0);
Your regex is a bit hard to decrypt - not really obvious what you're trying to do. One thing that springs to mind is that your regex expects the match to start with a newline, and your sample text doesn't.
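As an illustration of the MULTILINE suggestion and of matcher.start(), here is a simplified sketch. It does not reproduce the asker's full regex. It assumes the title line has the exact shape "<word> test :" as in the sample text, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleLineDemo {
    // Index at which the "<word> <keyword> :" line starts, or -1 if there is no such line.
    static int titleStart(String text, String keyword) {
        // With MULTILINE, ^ and $ match at line boundaries; Java treats \n, \r\n and \r as line terminators.
        Pattern p = Pattern.compile("^\\w+ " + Pattern.quote(keyword) + " :$", Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String unix = "Some text comes here\ntopicTitle test :\ntest1 : testing123";
        String windows = "Some text comes here\r\ntopicTitle test :\r\ntest1 : testing123";
        System.out.println(titleStart(unix, "test"));    // 21
        System.out.println(titleStart(windows, "test")); // 22
    }
}
```

Because the pattern is anchored with ^ rather than a literal \n, it also matches when the title is the very first line of the text.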
I'm trying to find all the occurrences of "Arrows" in text, so in
"<----=====><==->>"
the arrows are:
"<----", "=====>", "<==", "->", ">"
This works:
String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
for (String p : patterns) {
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
}
but this doesn't:
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
No idea why. It often reports "<" instead of "<====" or similar.
What is wrong?
Solution
The following program is one possible solution to the question:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class A {
public static void main( String args[] ) {
String p = "<=+|<-+|=+>|-+>|<|>";
Matcher m = Pattern.compile(p).matcher(args[0]);
while (m.find()) {
System.out.println(m.group());
}
}
}
Run #1:
$ java A "<----=====><<---<==->>==>"
<----
=====>
<
<---
<==
->
>
==>
Run #2:
$ java A "<----=====><=><---<==->>==>"
<----
=====>
<=
>
<---
<==
->
>
==>
Explanation
An asterisk will match zero or more of the preceding characters. A plus (+) will match one or more of the preceding characters. Thus <-* matches < whereas <-+ matches <- and any extended version (such as <--------).
When you match "<=*|<-*|=*>|-*>" against the string "<---", it matches the first part of the pattern, "<=*", because * includes zero or more. Java matching is greedy, but it isn't smart enough to know that there is another possible longer match, it just found the first item that matches.
Your first solution will match everything that you are looking for because you send each pattern into matcher one at a time and they are then given the opportunity to work on the target string individually.
Your second attempt will not work in the same manner because you are putting in a single pattern with multiple expressions OR'ed together, and the alternatives of an OR'ed pattern are tried in order, leftmost first. If an alternative matches, no matter how minimal the match, find() will report that match and continue on from there.
See Thangalin's response for a solution that will make the second work like the first.
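The ordering effect described above can be observed directly. The sketch below runs the question's pattern and the corrected pattern from the solution on the string "<---". The class and helper names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArrowDemo {
    // Collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "<=*" is tried first and succeeds with zero "=", so only "<" is reported.
        System.out.println(findAll("<=*|<-*|=*>|-*>", "<---"));     // [<]
        // With "+", "<=+" fails here, so the next alternative "<-+" gets its turn.
        System.out.println(findAll("<=+|<-+|=+>|-+>|<|>", "<---")); // [<---]
        // prints [<----, =====>, <==, ->, >]
        System.out.println(findAll("<=+|<-+|=+>|-+>|<|>", "<----=====><==->>"));
    }
}
```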
For <======= you need <=+ as the regex. <=* matches zero or more ='s, which means it also matches the zero case, i.e. a bare <. The same applies to the other cases you have. You should read up a bit on regexes. This book is FANTASTIC:
Mastering Regular Expressions
Your provided regex pattern String does work for your example: "<----=====><==->>"
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
However it is broken for some other examples pointed out in the answers such as input string "<-" yields "<", yet strangely "<=" yields "<=" as it should.