How to match several capturing groups, but results not as expected

I'm trying to learn Java regular expressions. I want to match a pattern with several capturing groups (i.e. j(a(va))) against a string (i.e. this is java. this is ava, this is va). I was expecting the output to be:
I found the text "java" starting at index 8 and ending at index 12.
I found the text "ava" starting at index 22 and ending at index 25.
I found the text "va" starting at index 35 and ending at index 37.
Number of group: 2
However, the program only printed:
I found the text "java" starting at index 8 and ending at index 12.
Number of group: 2
Why is this the case? Is there something I am missing?
Original code:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("\nEnter your regex:");
Pattern pattern = Pattern.compile(br.readLine());
System.out.println("\nEnter input string to search:");
Matcher matcher = pattern.matcher(br.readLine());
boolean found = false;
while (matcher.find()) {
    System.out.format("I found the text"
            + " \"%s\" starting at "
            + "index %d and ending at index %d.%n",
            matcher.group(),
            matcher.start(),
            matcher.end());
    found = true;
    System.out.println("Number of group: " + matcher.groupCount());
}
if (!found) {
    System.out.println("No match found.");
}
After running the code above, I entered the following input:
Enter your regex:
j(a(va))
Enter input string to search:
this is java. this is ava, this is va
And the IDE outputs:
I found the text "java" starting at index 8 and ending at index 12.
Number of group: 2

Your regexp only matches the whole string java, it doesn't match ava or va. When it matches java, it will set capture group 1 to ava and capture group 2 to va, but it doesn't match those strings on their own. The regexp that would produce the result you want is:
j?(a?(va))
The ? makes the preceding item optional, so it will match the later items without these prefixes.
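To see the difference between "a group inside one match" and "a separate match", here is a minimal sketch. The class name NestedGroupsDemo and the helper findAll are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Hypothetical helper: returns every match of regex in input as "text@start-end".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "this is java. this is ava, this is va";

        // Original pattern: a single match; groups 1 and 2 are pieces of that match.
        Matcher m = Pattern.compile("j(a(va))").matcher(input);
        while (m.find()) {
            System.out.println("match=" + m.group()
                    + " group1=" + m.group(1)
                    + " group2=" + m.group(2));
        }

        // Optional prefixes: "java", "ava" and "va" can each match on their own.
        System.out.println(findAll("j?(a?(va))", input));
    }
}
```

With the original pattern the loop runs once and reports group1=ava and group2=va for the single match java. With the optional prefixes, three separate matches are found: java at 8-12, ava at 22-25 and va at 35-37.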

You need the regex (j?(a?(va))):
Pattern p = Pattern.compile("(j?(a?(va)))");
Matcher m = p.matcher("this is java. this is ava, this is va");
while (m.find()) {
    String group = m.group();
    int start = m.start();
    int end = m.end();
    System.out.format("I found the text"
            + " \"%s\" starting at "
            + "index %d and ending at index %d.%n",
            group,
            start,
            end);
}

Related

regex to capture the string between a word and first occurrence of a character

I want to capture the string after the last slash and before the first occurrence of a backslash (\).
Sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemSummary/amex","Accept":"application/json, application/javascript","sessionId":"ggh76734",
targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/basket","Accept":"application/json, application/javascript","sessionId":"ggh76734", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/","Accept":"application/json, application/javascript","sessionId":"ggh76734",targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;List item
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/basket?id=76734&para=jhjdfhj&type=new&ordertype=kjkf", "Accept":"application/json, application/javascript","sessionId":"ggh76734", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
Expecting the below output:
amex
basket
''(empty string)
basket
I have built the regex below to capture it, but it's not 100% accurate. It is capturing some additional text.
Regex:
\bmyreferer\\\":\\\"\S+\/(.*?)\\\",
Could you please help me to improve the regex to get desired output?

You could use a negated character class with a capture group:
\bmyreferer":"[^"]+/([^/"]*)"
\bmyreferer":" Match literally preceded by a word boundary
[^"]+/ Match 1+ times any char except ", followed by a /
( Capture group 1
[^/"]* Optionally match (to also match an empty string) any char except / and "
)" Close group 1 and match "
Example code
String regex = "\\bmyreferer\":\"[^\"]+/([^/\"]*)\"";
String string = "sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemSummary/amex\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\", targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;\n\n"
+ "sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemList/basket\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;\n\n"
+ "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemList/\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\",targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;List item";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Group 1 value: " + matcher.group(1));
}
Output
Group 1 value: amex
Group 1 value: basket
Group 1 value:
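The fourth sample line in the question has a query string after basket, and with the pattern above group 1 would contain basket together with that query string. A possible variation, as a sketch only: it assumes that a ? starts the query string and that the query itself contains no /. The class name RefererLastSegment, the method lastSegments and the shortened URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RefererLastSegment {
    // Assumption: "?" starts the query string and the query contains no "/".
    // Group 1 stops at "/", "\"" or "?"; the trailing [^"]* skips the query, if any.
    private static final Pattern LAST_SEGMENT =
            Pattern.compile("\\bmyreferer\":\"[^\"]+/([^/\"?]*)[^\"]*\"");

    // Returns the last path segment of every myreferer value found in the text.
    static List<String> lastSegments(String log) {
        List<String> out = new ArrayList<>();
        Matcher m = LAST_SEGMENT.matcher(log);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample lines in the same shape as the question's data.
        String log = "\"myreferer\":\"https://www.example.com/a/itemSummary/amex\",\n"
                + "\"myreferer\":\"https://www.example.com/a/itemList/\",\n"
                + "\"myreferer\":\"https://www.example.com/a/itemList/basket?id=76734&type=new\",";
        System.out.println(lastSegments(log));
    }
}
```

For the three made-up lines this yields amex, an empty string, and basket.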

How to find a number in text at specific location using regex in java

How can I create a method that finds a number in a String? I have a List of Strings containing text like:
Radius of Circle is 7 cm
Rectangle 8 Height is 10 cm
Rectangle Width is 100 cm, Some text
Now I have to find all the numbers in these lines that come directly before cm, so that I don't mistakenly pick up any other number.
How can this be done?

A matching regular expression would be:
(\d+) cm
In order to obtain the captured number before the cm, you can use the Pattern and Matcher classes:
String line = "Radius of Circle is 7 cm";
Pattern pattern = Pattern.compile("(\\d+) cm");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("Value: " + matcher.group(1));
}
This example only matches the first example line, but it can easily be repeated for each of the lines contained in your list. See Java Regex Capture Groups for more information.
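Repeating the match for every line of the list could look roughly like this. It is a sketch: the class name CmNumbers and the method numbersBeforeCm are hypothetical, and List.of needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CmNumbers {
    // Same regex as above: one or more digits, a single space, then "cm".
    private static final Pattern NUMBER_BEFORE_CM = Pattern.compile("(\\d+) cm");

    // Collects every number that is directly followed by " cm" in any of the lines.
    static List<Integer> numbersBeforeCm(List<String> lines) {
        List<Integer> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = NUMBER_BEFORE_CM.matcher(line);
            while (m.find()) {
                result.add(Integer.parseInt(m.group(1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Radius of Circle is 7 cm",
                "Rectangle 8 Height is 10 cm",
                "Rectangle Width is 100 cm, Some text");
        System.out.println(numbersBeforeCm(lines)); // [7, 10, 100]
    }
}
```

The 8 in the second line is skipped because it is not followed by " cm".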

You can find groups consisting only of digits in the string using the following regex:
(?:\d{1,})
\d matches a digit (equal to [0-9])
{1,} Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:) A non-capturing group
Note that this finds every number in the line, not only the ones before cm.
Main:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        while (true) {
            Pattern pattern = Pattern.compile("(?:\\d{1,})");
            System.out.println("Enter text:");
            Matcher matcher = pattern.matcher(sc.nextLine());
            boolean found = false;
            while (matcher.find()) {
                System.out.println("I found the text " + matcher.group() + " starting at index "
                        + matcher.start() + " and ending at index " + matcher.end());
                found = true;
            }
            if (!found) {
                System.out.println("No match found.");
            }
        }
    }
}
Example:
Enter text:
Radius of Circle is 7 cm
I found the text 7 starting at index 20 and ending at index 21
Enter text:
Rectangle 8 Height is 10 cm
I found the text 8 starting at index 10 and ending at index 11
I found the text 10 starting at index 22 and ending at index 24
Enter text:
Rectangle Width is 100 cm, Some text
I found the text 100 starting at index 19 and ending at index 22
Enter text:
Note: In a Java string literal, \ is an escape character. That's why each backslash of the regex has to be written twice (\\).

The correct pattern to use here (written as a Java string literal) is:
(\\d+)\\s+cm\\b
For a one liner, we can try using String#replaceAll:
String input = "Rectangle Width is 100 cm, Some text";
String output = input.replaceAll(".*?(\\d+)\\s+cm\\b.*", "$1");
System.out.println(output);
Or, to find all matches in a given text, we can try using a formal pattern matcher:
String input = "Rectangle Width is 100 cm, Some text";
String pattern = "(\\d+)\\s+cm\\b";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
while (m.find()) {
    System.out.println("Found measurement: " + m.group(1));
}

Why does regex="" (an empty pattern) match at every character position?

I have regex="" and a String str="stackoveflow".
I don't understand why it matches at every character position in the string. Can you explain it to me?
import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        // Note: System.console() returns null when no console is attached (e.g. in some IDEs).
        Console console = System.console();
        String str = "stackoveflow";
        Pattern pattern = Pattern.compile("");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
        }
    }
}
Output is:
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
I found the text "" starting at index 3 and ending at index 3.
I found the text "" starting at index 4 and ending at index 4.
I found the text "" starting at index 5 and ending at index 5.
I found the text "" starting at index 6 and ending at index 6.
I found the text "" starting at index 7 and ending at index 7.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
I found the text "" starting at index 10 and ending at index 10.
I found the text "" starting at index 11 and ending at index 11.
I found the text "" starting at index 12 and ending at index 12.

Pattern.compile("") matches a string consisting of zero characters. You can find one of those at every position in the string.
Note: if you changed find() to matches(), you would find that there is no match. (With matches() the pattern needs to match the entire input, and the entire input is not a sequence of zero characters.)
Before you edited the question, your pattern was Pattern.compile("e*"). That means zero or more repetitions of the character 'e'. By the logic above, you can "find" one of those at every character position in the input.
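The find() versus matches() difference can be checked with a few lines. This is a sketch only; the class name EmptyPatternDemo and the helper countFinds are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyPatternDemo {
    // Hypothetical helper: counts how many times find() succeeds for regex on input.
    static int countFinds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "stackoveflow"; // 12 characters, so 13 positions (0..12)

        // The empty pattern is found once at every position.
        System.out.println(countFinds("", str)); // 13

        // matches() must consume the whole input, so it fails on a non-empty string...
        System.out.println(Pattern.compile("").matcher(str).matches()); // false

        // ...and succeeds only when the input itself is empty.
        System.out.println(Pattern.compile("").matcher("").matches()); // true

        // "e*" also succeeds at every position; at the 'e' it matches "e" instead of "".
        System.out.println(countFinds("e*", str));
    }
}
```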

x? quantifier: Why does a non-x give a "zero-length" match?

The quantifier x? means one or zero occurrences of x.
I am posting a test harness for matching a regex against strings, for convenience.
I am confused about the regex a? when matched against the string ababaaaab.
The output of the program is:
Enter your regex: a?
Enter your input string to seacrh: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex:
I am confused about the b's.
"The regular expression a? is not specifically looking for the letter
"b"; it's merely looking for the presence (or lack thereof) of the
letter "a". If the quantifier allows for a match of "a" zero times,
anything in the input string that's not an "a" will show up as a
zero-length match."
Reference
Question:
The first line is understandable, and I do understand that the presence of b or any non-a is an absence of a, or 0 occurrences of a, so it should result in a match. But the absence of a (that is, the occurrence of b) is between the indices 1 and 2. So why is the match of the text "" between the index 1 and 1 (in other words, why are we getting a zero-length match here)? From my reasoning, it should be between the indices 1 and 2.
import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Enter your regex: foo
 * Enter input string to search: foo
 * I found the text foo starting at index 0 and ending at index 3.
 * */
public class RegexTestHarness {
    public static void main(String[] args) {
        /*Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }*/
        // Create the Scanner once; a new Scanner per iteration can swallow buffered input.
        Scanner scanner = new Scanner(new InputStreamReader(System.in));
        while (true) {
            /*Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: ", null));*/
            System.out.print("\nEnter your regex: ");
            // next() reads a single whitespace-delimited token, not a whole line.
            Pattern pattern = Pattern.compile(scanner.next());
            System.out.print("\nEnter your input string to seacrh: ");
            Matcher matcher =
                    pattern.matcher(scanner.next());
            boolean found = false;
            while (matcher.find()) {
                /*console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());*/
                System.out.println("I found the text \"" + matcher.group() + "\" starting at index " + matcher.start() + " and ending at index " + matcher.end() + ".");
                found = true;
            }
            if (!found) {
                //console.format("No match found.%n", null);
                System.out.println("No match found.");
            }
        }
    }
}

"But the absence of a (that is the occurrence of b) is between the indices 1 and 2. So why is the match of the text "" between the index 1 and 1 (in other words, why are we getting a zero-length match here)"
The length of the match is the length of the part of the input string that matched the pattern.
Since there was no "a", only an empty string was matched.
Again, the pattern does not match "a sequence of non-a characters"; it matches a (possibly empty) sequence of "a"s up to a total length of one. In this case, that matched sequence was empty.
"But the absence of a (that is the occurrence of b)"
The absence of a is not the occurrence of b. The absence of a takes place before the occurrence of b and ends at the occurrence of b.

The position reported is not the position of a character
The key thing to understand is that the regex engine is not giving you the position of a character where it found a match.
It is giving you the starting position where it started the match that was successful. That position is not a character. It is the space between characters. For instance,
Position 0 is the very beginning of the string. That is where the \A or ^ assertions match.
Position 1 is the position between the first and the second characters.
Position 9 is the position after the last b at the end of ababaaaab. That is where the \Z or $ assertions match.
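The positions listed above can be made visible with zero-width assertions, which match at a position without consuming a character. A minimal sketch; the class name PositionDemo and the helper firstSpan are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionDemo {
    // Hypothetical helper: returns "start-end" of the first match of regex in input, or "none".
    static String firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        String input = "ababaaaab";

        // ^ matches at position 0, before the first character.
        System.out.println(firstSpan("^", input)); // 0-0

        // The character b itself spans from position 1 to position 2.
        System.out.println(firstSpan("b", input)); // 1-2

        // $ matches at position 9, after the last character.
        System.out.println(firstSpan("$", input)); // 9-9
    }
}
```

The empty match of a? reported at "index 1 and ending at index 1" is the position just in front of the first b, while the b itself occupies 1-2.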

a? is greedy. In other words, the regex engine will proceed roughly as follows:
foreach index
    if next char is "a"
        return "a"
    else
        return ""
    end if
end foreach
If you apply this algorithm to your input string, you'll get the same output as the one you provided.
You could try its non-greedy (or lazy) equivalent: a??. The regex engine would then proceed as follows:
foreach index
    if "" matches here (it always does)
        return ""
    else if next char is "a"
        return "a"
    end if
end foreach
An empty string would thus be found at each index, and no a would be output at all.
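The greedy and lazy variants can be compared directly in Java. This is a small sketch; the class name GreedyVsLazy and the helper matchedTexts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns the text of every successive find() match.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            texts.add(m.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";

        // Greedy: takes an "a" when one is available, otherwise matches the empty string.
        System.out.println(matchedTexts("a?", input));

        // Lazy: prefers the empty match, which always succeeds, so no "a" is ever consumed.
        System.out.println(matchedTexts("a??", input));
    }
}
```

For ababaaaab the greedy version returns the same ten matches as the harness output in the question (six times "a" and four empty strings), and the lazy version returns ten empty strings, one for each position 0 to 9.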

Getting the "context" text of a matched group

I'm using Java's Matcher class to find some strings, and for each match I also get its begin index and end index. Now what I want to do is get the x characters preceding and following each match.
So what I did was call the substring method on the string from {begin index minus x} to {end index plus x}, but that seems a little heavy: for every match, I have to go through the string again for its context.
I wanted to know whether there's a better way to do that.
Here is what I've done so far:
The part that bothers me is text.substring. How expensive is it?
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("\\d{2}").matcher(text);
int x = 5;
while (matcher.find()) {
    String match = matcher.group();
    int start = matcher.start();
    int end = matcher.end();
    String pretext = text.substring(start - x, start);
    String postext = text.substring(end, end + x);
    System.out.println(pretext + " - " + match + " - " + postext);
}
Suggested answer, using grouping to solve this:
Use the regex (.{5})(\d{2})(.{5}).
First of all, this wouldn't be able to capture matches that have fewer than 5 characters before them. The fix for that is (.{0,5})(\d{2})(.{0,5}), which is very nice for a simple regex like (\d{2}), but for one like "c?at" and the given text "cat" this would match the groups
c
at

String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("(.{5})(\\d{2})(.{5})").matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " - " + matcher.group(2) + " - " + matcher.group(3));
}
Output (each context group keeps its own space, hence the double spaces):
Some  - 22 -  text
with  - 44 -  char
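As an alternative to wrapping the pattern in context groups, the substring approach from the question can be kept and the bounds clamped to the ends of the text, so that short contexts do not throw and the pattern itself stays untouched. A minimal sketch; the class name MatchContext and the method withContext are made up for illustration. As far as I know, substring in current JDKs only copies the requested range, so the cost per match should depend on x rather than on the length of the whole text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchContext {
    // Hypothetical helper: returns "before - match - after" for every match,
    // with at most x characters of context on each side.
    static List<String> withContext(String regex, String text, int x) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // Clamp so that matches near either end of the text do not go out of bounds.
            int from = Math.max(0, m.start() - x);
            int to = Math.min(text.length(), m.end() + x);
            out.add(text.substring(from, m.start())
                    + " - " + m.group() + " - "
                    + text.substring(m.end(), to));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withContext("\\d{2}", "Some 22 text with 44 characters", 5));
        // The pattern is not modified, so "c?at" still matches the whole "cat".
        System.out.println(withContext("c?at", "cat", 5));
    }
}
```

Because the context is not consumed by the regex, two matches that lie closer together than x characters are both still reported, which the grouped pattern would not guarantee.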
