Java string parsing with regexp - java

I am trying to parse a String with a regexp to get parameters out of it.
As an example:
String: "TestStringpart1 with second test part2"
Result should be: String[] {"part1", "part2"}
Regexp: "TestString(.*?) with second test (.*?)"
My test code was:
String regexp = "TestString(.*?) with second test (.*?)";
String res = "TestStringpart1 with second test part2";
Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(res);
int i = 0;
while(matcher.find()) {
i++;
System.out.println(matcher.group(i));
}
But it only outputs "part1".
Could someone give me a hint?
Thanks

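Maybe fix the regexp first: a lazy (.*?) at the very end of a pattern is satisfied by the empty string, so make the last group greedy:
String regexp = "TestString(.*?) with second test (.*)";
Then change the println code so that it prints every group of the match:
if (matcher.find()) {
for (int i = 1; i <= matcher.groupCount(); ++i) {
System.out.println(matcher.group(i));
}
}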

Well, you only ever ask it to. In your original code, find() keeps moving the matcher from one match of the entire regular expression to the next, while inside the loop body you only ever pull out one group per match. If there had been several matches of the regexp in your string, you would have got group 1 ("part1") for the first occurrence, group 2 ("part2") for the second occurrence, and an IndexOutOfBoundsException for any further occurrence, because the pattern has no group 3. Read every group you need inside a single iteration instead:
while(matcher.find()) {
System.out.print("Part 1: ");
System.out.println(matcher.group(1));
System.out.print("Part 2: ");
System.out.println(matcher.group(2));
System.out.print("Entire match: ");
System.out.println(matcher.group(0));
}
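Putting the two answers together, a minimal self-contained sketch might look like the following. The class name GroupDemo and the helper extractParts are made up for illustration and are not part of the original question. The sketch runs both the greedy and the lazy variant of the last group so the difference is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns all capturing groups of the first match,
    // or an empty list if the pattern is not found.
    static List<String> extractParts(String regex, String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            // Group 0 is the entire match; groups 1..groupCount() are the parenthesised parts.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                parts.add(matcher.group(i));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "TestStringpart1 with second test part2";
        // Greedy last group: captures everything up to the end of the input.
        System.out.println(extractParts("TestString(.*?) with second test (.*)", input));
        // Lazy last group: it is satisfied by the empty string, so the second part is empty.
        System.out.println(extractParts("TestString(.*?) with second test (.*?)", input));
    }
}
```

If the lazy group has to stay, anchoring the pattern with $ at the end is another way to force it to consume the rest of the input.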

Related

Regex to get value between two colon excluding the colons

I have a string like this:
something:POST:/some/path
Now I want to take the POST alone from the string. I did this by using this regex
:([a-zA-Z]+):
But this gives me a value along with colons. ie I get this:
:POST:
but I need this
POST
My code to match the same and replace it is as follows:
String ss = "something:POST:/some/path/";
Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
Matcher matcher = pattern.matcher(ss);
if (matcher.find()) {
System.out.println(matcher.group());
ss = ss.replaceFirst(":([a-zA-Z]+):", "*");
}
System.out.println(ss);
EDIT:
I've decided to use the lookahead/lookbehind regex since I did not want to use replace with colons such as :*:. This is my final solution.
String s = "something:POST:/some/path/";
String regex = "(?<=:)[a-zA-Z]+(?=:)";
Matcher matcher = Pattern.compile(regex).matcher(s);
if (matcher.find()) {
s = s.replaceFirst(matcher.group(), "*");
System.out.println("replaced: " + s);
}
else {
System.out.println("not replaced: " + s);
}
There are two approaches:
Keep your Java code, and use lookahead/lookbehind (?<=:)[a-zA-Z]+(?=:), or
Change your Java code to replace the result with ":*:"
Note: You may want to define a String constant for your regex, since you use it in different calls.
As pointed out, the regex captured group can be used to replace.
The following code did it:
String ss = "something:POST:/some/path/";
Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
Matcher matcher = pattern.matcher(ss);
if (matcher.find()) {
ss = ss.replaceFirst(matcher.group(1), "*");
}
System.out.println(ss);
UPDATE
Looking at your update, you only need replaceFirst:
String result = s.replaceFirst(":[a-zA-Z]+:", ":*:");
When you use (?<=:)[a-zA-Z]+(?=:), the regex engine checks each location inside the string for a : before it and, once one is found, tries to match 1+ ASCII letters and then asserts that there is a : after them. With :[A-Za-z]+:, the checking only starts once the regex engine has found a : character. Then, after matching :POST:, the replacement pattern replaces the whole match. It is totally OK to hardcode colons in the replacement pattern since they are hardcoded in the regex pattern.
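To make the difference concrete, here is a small sketch that runs both patterns side by side and also reads the value through capturing group 1. The class and method names (ColonDemo, maskWithLookaround, maskWithLiteralColons, between) are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonDemo {
    // Lookbehind/lookahead only assert the colons; they are not part of the match,
    // so the replacement does not have to write them back.
    static String maskWithLookaround(String s) {
        return s.replaceFirst("(?<=:)[a-zA-Z]+(?=:)", "*");
    }

    // Here the colons are consumed by the match, so the replacement repeats them.
    static String maskWithLiteralColons(String s) {
        return s.replaceFirst(":[a-zA-Z]+:", ":*:");
    }

    // Reads the text between the colons via capturing group 1; null if nothing matches.
    static String between(String s) {
        Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "something:POST:/some/path/";
        System.out.println(maskWithLookaround(s));
        System.out.println(maskWithLiteralColons(s));
        System.out.println(between(s));
    }
}
```

Both masking variants give the same result for this input; which one reads better is mostly a matter of taste.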
Original answer
You just need to access Group 1:
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Your :([a-zA-Z]+): regex contains a capturing group (see (....) subpattern). These groups are numbered automatically: the first one has an index of 1, the second has the index of 2, etc.
To replace it, use Matcher#appendReplacement():
String s = "something:POST:/some/path/";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
while (m.find()) {
m.appendReplacement(result, ":*:");
}
m.appendTail(result);
System.out.println(result.toString());
This is your solution:
regex = (:)([a-zA-Z]+)(:)
And code is:
String ss = "something:POST:/some/path/";
ss = ss.replaceFirst("(:)([a-zA-Z]+)(:)", "$1*$3");
ss now contains:
something:*:/some/path/
Which I believe is what you are looking for...

How to figure out exact reason why Regex is failing in java

I have a regex pattern that I am using to match screen.
When I test it in Sublime Text, it works just fine,
but when executed in Java, the match fails:
System.out.println(Pattern.matches("(B+)?|(R+)?", "RRBRR"));//false
System.out.println(Pattern.matches("(B+)?|(R+)?", "RRRRR"));//true
The above code should print true in both cases, but in Java the first call prints false.
My basic requirement is to identify runs of the same character in sequence,
meaning that if the string is
RRRRBBBRRBBBRBBBRRR
then it should be identified as
RRRR BBB RR BBB R BBB RRR
Please help. Thanks in advance.
Try this:
String value = "RRRRBBBRRBBBRBBBRRR";
Pattern pattern = Pattern.compile("B+|R+");
Matcher matcher = pattern.matcher(value);
while (matcher.find()) {
System.out.println(matcher.group());
}
The first expression returns false because there is a B in the middle of several Rs: the whole string is not an exact match, since your regular expression expects only Rs or only Bs.
matches() behaves as if the pattern had an implicit ^ at the start and $ at the end, so it succeeds only when the entire input matches; matching a substring won't work. find() looks for the next matching substring instead.
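A Matcher with find() is best suited for this. Note that (B+)?|(R+)? can also match the empty string, so find() would report empty matches; B+|R+ avoids that:
public static void main(String[] args) {
String regex = "B+|R+";
Pattern pat = Pattern.compile(regex);
Matcher matcher = pat.matcher("RRBRR");
int count = 0;
// Each find() call advances to the next run. Do not call find() once before the loop,
// or the first match is consumed without being printed.
while (matcher.find()) {
System.out.println(matcher.group());
count++;
}
System.out.println("Count:" + count);
}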
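The matches() versus find() distinction can be checked with a short sketch. RunsDemo and runs are names chosen for this example only; the pattern B+|R+ is the one suggested in the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunsDemo {
    // Collects each maximal run of B's or R's, in order of appearance.
    static List<String> runs(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile("B+|R+").matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // matches() must consume the whole input with one match of the pattern.
        System.out.println(Pattern.matches("B+|R+", "RRBRR"));     // false: a single run cannot cover R and B
        System.out.println(Pattern.matches("(B+|R+)+", "RRBRR"));  // true: one or more runs cover the input
        // find() walks through the input run by run.
        System.out.println(String.join(" ", runs("RRRRBBBRRBBBRBBBRRR")));
    }
}
```

The last line prints the runs separated by spaces, which is the grouping the question asks for.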

Matcher Find nth Match Indexes

I'm trying to get the indexes for each pattern that I find in a document. So far I have:
String temp = "This is a test to see HelloWorld in a test that sees HelloWorld in a test";
Pattern pattern = Pattern.compile("HelloWorld");
Matcher matcher = pattern.matcher(temp);
int current = 0;
int start;
int end;
while (matcher.find()) {
start = matcher.start(current);
end = matcher.end(current);
System.out.println(temp.substring(start, end));
current++;
}
For some reason it keeps finding only the first instance of HelloWorld in temp though which results in an infinite loop. To be honest, I wasn't sure if you could use matcher.start(current) and matcher.end(current) - it was just a wild guess because matcher.group(current) worked before. This time I need the actual indexes though so matcher.group() wouldn't work for me.
Modify the loop to look like this:
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
System.out.println(temp.substring(start, end));
}
Don't pass the index to start(int) and end(int). The API states that the parameter is the group number. In your case, only zero is correct. Use start() and end() instead.
The matcher will move to the next match on each iteration because of your call to find():
This method starts at the beginning of the input sequence or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
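As a rough illustration of that behaviour, the sketch below collects the start and end offsets of every match into a list. IndexDemo and indexesOf are hypothetical names, and Pattern.quote is used on the assumption that the search text is meant literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexDemo {
    // Returns {start, end} pairs (end exclusive) for every occurrence of the literal needle.
    static List<int[]> indexesOf(String needle, String haystack) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(Pattern.quote(needle)).matcher(haystack);
        // Each successful find() resumes after the previous match,
        // so start() and end() without arguments always describe the current match.
        while (matcher.find()) {
            spans.add(new int[] {matcher.start(), matcher.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String temp = "This is a test to see HelloWorld in a test that sees HelloWorld in a test";
        for (int[] span : indexesOf("HelloWorld", temp)) {
            System.out.println(span[0] + "-" + span[1] + ": " + temp.substring(span[0], span[1]));
        }
    }
}
```

The nth match is then simply the nth element of the returned list.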
The problem is this line of code:
start = matcher.start(current);
current is 1 after the first iteration, so start(1) asks for capturing group 1, which this pattern does not have.
If you just need the start and end offsets of your matched text, you don't need a group index at all; this will be OK:
String temp = "This is a test to see HelloWorld in a test that sees HelloWorld in a test";
Pattern pattern = Pattern.compile("HelloWorld");
Matcher matcher = pattern.matcher(temp);
while (matcher.find()) {
System.out.println(temp.substring(matcher.start(), matcher.end()));
}
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
System.out.println(temp.substring(start, end));
}
Will do what you want.
String temp = "This is a test to see HelloWorld in a test that sees HelloWorld in a test";
Pattern pattern = Pattern.compile("HelloWorld");
Matcher m = pattern.matcher(temp);
while (m.find()) {
System.out.println(temp.substring(m.start(), m.end()));
}

Matcher Find Infinite Loop

I'm trying to replace certain words in a long string. What happens is some words stay the same and some change. The words that don't change seem to get the matcher stuck in an infinite loop as it keeps trying to do the same action on words that are meant to stay the same. Below is an example similar to mine - I couldn't put the exact code that I'm using because it's far more detailed and would take up too much space I'm afraid.
public String test() {
String temp = "<p><img src=\"logo.jpg\"/></p>\n<p>CANT TOUCH THIS!</p>";
Pattern pattern = Pattern.compile("(<p(\\s.+)?>(.+)?</p>)");
Matcher matcher = pattern.matcher(temp);
StringBuilder stringBuilder = new StringBuilder(temp);
int start;
int end;
String match;
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
match = temp.substring(start, end);
stringBuilder.replace(start, end, changeWords(match));
temp = stringBuilder.toString();
matcher = pattern.matcher(temp);
System.out.println("This is the word I'm getting stuck on: " + match);
}
return temp;
}
public String changeWords(String words) {
return "<p><img src=\"logo.jpg\"/></p>";
}
Any suggestions as to why this might be happening?
You reinitialize the matcher in the loop.
Remove the matcher = pattern.matcher(temp); instruction in your while loop and you should not be stuck any more.
You are using Matcher wrong. Your while loop reads:
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
match = temp.substring(start, end);
stringBuilder.replace(start, end, changeWords(match));
temp = stringBuilder.toString();
matcher = pattern.matcher(temp);
}
it should just be:
temp = matcher.replaceAll("new text");
No "while" loop, it is unnecessary. A matcher will not replace text it does not match and it will do the right job with regards to not matching twice at the same place etc -- no need to spoonfeed it.
What is more, your regex can do without the capturing parens. And if you only want to replace "words" (regexes have no notion of words), add word anchors around the text to be matched:
Pattern pattern = Pattern.compile("\\btext\\b");
You are looking to match the word "text" and then replacing it either with "text" again (the if branch in changeWord()) or with "new text" (the else branch in changeWord()). That is why it causes an infinite loop.
Why are you using Matcher at all? You don't need regex to replace words, just use replace():
input = input.replace("oldtext", "newtext"); // replaces all occurrences of old with new; Strings are immutable, so keep the result
I fixed it simply by adding this line:
if (!match.equals(changeWords(match))) {
matcher = pattern.matcher(temp);
}
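When each match needs its own computed replacement, another option is the appendReplacement/appendTail pair, which never resets the matcher and therefore cannot revisit a match. The sketch below is only an illustration: the pattern is simplified to plain <p>...</p> on one line, and changeWords here is a made-up stand-in that upper-cases the paragraph body.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // Hypothetical stand-in for the asker's changeWords(): upper-cases the paragraph body.
    static String changeWords(String body) {
        return body.toUpperCase();
    }

    static String rewriteParagraphs(String html) {
        // Simplified pattern (an assumption): lazy body, so each <p>...</p> is matched separately.
        Matcher matcher = Pattern.compile("<p>(.*?)</p>").matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = "<p>" + changeWords(matcher.group(1)) + "</p>";
            // quoteReplacement keeps '$' and '\' in the new text from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copies whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewriteParagraphs("<p>one</p>\n<p>CANT TOUCH THIS!</p>"));
    }
}
```

The second paragraph comes out unchanged because its replacement equals the original text, and the loop still terminates, since the matcher keeps moving forward through the original input.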

Discard the leading and trailing series of a character, but retain the same character otherwise

I have to process a string with the following rules:
It may or may not start with a series of '.
It may or may not end with a series of '.
Whatever is enclosed between the above should be extracted. However, the enclosed string also may or may not contain a series of '.
For example, I can get following strings as input:
''''aa''''
''''aa
aa''''
''''aa''bb''cc''''
For the above examples, I would like to extract the following from them (respectively):
aa
aa
aa
aa''bb''cc
I tried the following code in Java:
Pattern p = Pattern.compile("[^']+(.+'*.+)[^']*");
Matcher m = p.matcher("''''aa''bb''cc''''");
while (m.find()) {
int count = m.groupCount();
System.out.println("count = " + count);
for (int i = 0; i <= count; i++) {
System.out.println("-> " + m.group(i));
}
}
But I get the following output:
count = 1
-> aa''bb''cc''''
-> ''bb''cc''''
Any pointers?
EDIT: Never mind, I was using a * at the end of my regex, instead of +. Doing this change gives me the desired output. But I would still welcome any improvements for the regex.
This one works for me.
String str = "''''aa''bb''cc''''";
Pattern p = Pattern.compile("^'*(.*?)'*$");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(m.group(1));
}
Have a look at the boundary matchers of Java's Pattern class (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html). Especially $ (end of a line) might be interesting. I also recommend the following Eclipse plugin for regex testing: http://sourceforge.net/projects/quickrex/. It lets you see exactly what the match and the groups of your regex will be for a given test string.
E.g. try the following pattern: [^']+(.+'*.+)+[^'$]
I'm not that good in Java, so I hope the regex is sufficient. For your examples, it works well.
s/^'*(.+?)'*$/$1/gm
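In Java, a similar effect can be had without a capturing group by deleting the leading and trailing runs directly. This is a sketch of an alternative, not the substitution shown above; QuoteStripDemo and strip are names invented for the example.

```java
class QuoteStripDemo {
    // Removes every leading and trailing apostrophe; apostrophes inside the text are kept.
    static String strip(String s) {
        // ^'+ matches a run of apostrophes at the start, '+$ a run at the end.
        return s.replaceAll("^'+|'+$", "");
    }

    public static void main(String[] args) {
        String[] inputs = {"''''aa''''", "''''aa", "aa''''", "''''aa''bb''cc''''"};
        for (String input : inputs) {
            System.out.println(strip(input));
        }
    }
}
```

For the four sample inputs from the question this yields aa, aa, aa and aa''bb''cc.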
