I have a url which has this format:
https://address.com/somestring/somestring-2/c100.200.3.4/somestrigx3/somestring.4
I want to obtain the numbers from c100.200.3.4, which are delimited by c, / and dots. So in the end I want to have 100, 200, 3, 4.
I was wondering if there is a way to build a regex pattern for this instead of the classic string search and compute.
It is possible with a single regex, plus a bit of code.
String s = "https://address.com/somestring/somestring-2/c100.200.3.4/somestrigx3/somestring.4";
Pattern pattern = Pattern.compile("(?<=/c)(\\d+)|(?!^)\\G\\.(\\d+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (matcher.group(1) != null)
        System.out.println(matcher.group(1));
    if (matcher.group(2) != null)
        System.out.println(matcher.group(2));
}
The regex (?<=/c)(\d+)|(?!^)\G\.(\d+) contains two alternatives. (?<=/c)(\d+) matches any sequence of digits that follows /c and captures it into Group 1. (?!^)\G\.(\d+) matches a literal . followed by digits (captured into Group 2), but only at the point where the previous successful match ended (that is what (?!^)\G enforces). In any single match only one of the two groups participates, so the other one is null and each has to be checked before use.
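As a variation on the loop above, both alternatives can share one capturing group, which makes the null checks unnecessary. This is only a sketch: the class name CNumbers and the helper extract are invented for illustration, and it assumes every number fits into an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CNumbers {
    // Either start right after "/c", or continue from the end of the
    // previous match (\G) across exactly one dot. Both branches feed
    // the same capturing group.
    private static final Pattern NUMBERS =
            Pattern.compile("(?:(?<=/c)|(?!^)\\G\\.)(\\d+)");

    // Hypothetical helper: collects the numbers of the c<n>.<n>... segment.
    static List<Integer> extract(String url) {
        List<Integer> result = new ArrayList<>();
        Matcher m = NUMBERS.matcher(url);
        while (m.find()) {
            // Assumption: the digits fit into an int.
            result.add(Integer.parseInt(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "https://address.com/somestring/somestring-2/c100.200.3.4/somestrigx3/somestring.4";
        System.out.println(extract(s));
    }
}
```

The trailing .4 in somestring.4 is not picked up, because \G only matches where the previous match ended.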
UPDATE
Since, as it turns out, the number of digit groups is fixed (4), you can use a simpler regex with capturing groups:
String s = "https://address.com/somestring/somestring-2/c100.200.3.4/somestrigx3/somestring.4";
Pattern pattern = Pattern.compile("(?<=/c)(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
    System.out.println(matcher.group(4));
}
String splits[] = input_url.replaceAll(".*?/c([0-9.]+)/.*", "$1").split("[.]");
This first captures the text between /c and the next / into group $1 and replaces the whole string with that captured group. The result is then split on the dot.
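One caveat with the one-liner: when the URL has no /c.../ segment, replaceAll finds nothing to replace and returns the input unchanged, so split then cuts the whole URL at its dots. A hedged sketch that checks for a match first; the names SegmentSplit and parts are illustrative only, and returning an empty array on no match is an assumption about what the caller wants.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SegmentSplit {
    // Same idea as the one-liner: capture the run of digits and dots
    // between "/c" and the next "/".
    private static final Pattern SEGMENT = Pattern.compile("/c([0-9.]+)/");

    // Hypothetical helper: returns the dot-separated parts, or an empty
    // array when the URL contains no such segment.
    static String[] parts(String url) {
        Matcher m = SEGMENT.matcher(url);
        if (!m.find()) {
            return new String[0];
        }
        return m.group(1).split("[.]");
    }

    public static void main(String[] args) {
        String url = "https://address.com/somestring/somestring-2/c100.200.3.4/somestrigx3/somestring.4";
        System.out.println(Arrays.toString(parts(url)));
    }
}
```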
Let's say I have a string:
String sentence = "My nieces are Cara:8 Sarah:9 Tara:10";
And I would like to find all their respective names and ages with the following pattern matcher:
String regex = "My\\s+nieces\\s+are((\\s+(\\S+):(\\d+))*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(sentence);
I understand something like
matcher.find(0); // resets "pointer"
String niece = matcher.group(2);
String nieceName = matcher.group(3);
String nieceAge = matcher.group(4);
would give me my last niece (" Tara:10", "Tara", "10").
How would I collect all of my nieces instead of only the last, using only one regex/pattern?
I would like to avoid using split string.
Another idea is to use the \G anchor that matches where the previous match ended (or at start).
String regex = "(?:\\G(?!\\A)|My\\s+nieces\\s+are)\\s+(\\S+):(\\d+)";
If My\s+nieces\s+are matches
\G will chain matches from there
(?!\A) is a negative lookahead that prevents \G from matching at the very start of the input (\A)
\s+(\S+):(\d+) using two capturing groups for extraction
Matcher m = Pattern.compile(regex).matcher(sentence);
while (m.find()) {
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}
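If the names and ages are needed as data rather than printed, the same loop can fill a map. A minimal sketch, assuming names are unique and ages fit into an int; the class Nieces and the method parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Nieces {
    // The \G-chained pattern from above: start at the prefix, then keep
    // matching name:age pairs that directly follow the previous match.
    private static final Pattern NIECE =
            Pattern.compile("(?:\\G(?!\\A)|My\\s+nieces\\s+are)\\s+(\\S+):(\\d+)");

    // Hypothetical helper: maps each name to its age, in order of appearance.
    static Map<String, Integer> parse(String sentence) {
        Map<String, Integer> ages = new LinkedHashMap<>();
        Matcher m = NIECE.matcher(sentence);
        while (m.find()) {
            ages.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return ages;
    }

    public static void main(String[] args) {
        System.out.println(parse("My nieces are Cara:8 Sarah:9 Tara:10"));
    }
}
```

Because of \G, the chain stops at the first token that is not a name:age pair; later pairs in the sentence are not collected.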
You can't iterate over repeating groups, but you can match each group individually, calling find() in a loop to get the details of each one. If they need to be back-to-back, you can iteratively bound your matcher to the last index, like this:
Matcher matcher = Pattern.compile("My\\s+nieces\\s+are").matcher(sentence);
if (matcher.find()) {
    int boundary = matcher.end();
    matcher = Pattern.compile("^\\s+(\\S+):(\\d+)").matcher(sentence);
    while (matcher.region(boundary, sentence.length()).find()) {
        System.out.println(matcher.group());
        System.out.println(matcher.group(1));
        System.out.println(matcher.group(2));
        boundary = matcher.end();
    }
}
I have following input String:
abc.def.ghi.jkl.mno
The number of dot characters may vary in the input. I want to extract the word after the last . (i.e. mno in the above example). I am using the following regex and it works fine:
String input = "abc.def.ghi.jkl.mno";
Pattern pattern = Pattern.compile("([^.]+$)");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
However, I am using a third party library which does this matching (Kafka Connect to be precise) and I can just provide the regex pattern to it. The issue is, this library (whose code I can't change) uses matches() instead of find() to do the matching, and when I execute the same code with matches(), it doesn't work e.g.:
String input = "abc.def.ghi.jkl.mno";
Pattern pattern = Pattern.compile("([^.]+$)");
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
    System.out.println(matcher.group(1));
}
The above code doesn't print anything. As per the javadoc, matches() tries to match the whole String. Is there any way I can apply similar logic using matches() to extract mno from my input String?
You may use
".*\\.([^.]*)"
It matches
.*\. - any 0+ chars as many as possible up to the last . char
([^.]*) - Capturing group 1: any 0+ chars other than a dot.
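To see how this pattern behaves under matches(), here is a small self-contained sketch. The class LastSegment and the method afterLastDot are made-up names, and returning an Optional for input without a dot is an assumption for the example, not something the third-party library does.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastSegment {
    // ".*\." greedily consumes everything up to the last dot,
    // group 1 captures whatever follows it.
    private static final Pattern LAST = Pattern.compile(".*\\.([^.]*)");

    // Hypothetical helper: the text after the last dot, if there is a dot.
    static Optional<String> afterLastDot(String input) {
        Matcher m = LAST.matcher(input);
        // matches() succeeds only when the pattern covers the entire input.
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(afterLastDot("abc.def.ghi.jkl.mno"));
        // The original pattern cannot cover the whole input, so matches() is false.
        System.out.println(Pattern.compile("([^.]+$)").matcher("abc.def.ghi.jkl.mno").matches());
    }
}
```

Note that ([^.]*) also accepts an empty tail, so an input ending in a dot yields an empty string rather than no match.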
To extract the word after the last . as you describe, you could also do this without Pattern and Matcher, as follows:
String input = "abc.def.ghi.jkl.mno";
String getMe = input.substring(input.lastIndexOf(".")+1, input.length());
System.out.println(getMe);
This will work. Use .* at the beginning to enable it to match the entire input.
public static void main(String[] argv) {
    String input = "abc.def.ghi.jkl.mno";
    Pattern pattern = Pattern.compile(".*([^.]{3})$");
    Matcher matcher = pattern.matcher(input);
    if (matcher.matches()) {
        System.out.println(matcher.group(0));
        System.out.println(matcher.group(1));
    }
}
abc.def.ghi.jkl.mno
mno
This is a better pattern if the last dot can really be anywhere, since it does not assume a three-character suffix: ".*\\.([^.]+)$"
I have been looking through this: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
However, I still have difficulty writing the right regex to get all the expressions following this pattern:
<$FB $TWTR are getting plummetted>
(The <> just delimit the tweet, as I am parsing Twitter.) I want to extract FB and TWTR.
Any help much appreciated.
Here is a 2-step approach: we extract <...> groups with a regex and then split the chunks into words and see if they start with $.
String s = "<$FB $TWTR are getting plummetted>";
Pattern pattern = Pattern.compile("<([^>]+)>");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    String[] chks = matcher.group(1).split(" ");
    for (int i = 0; i < chks.length; i++) {
        if (chks[i].startsWith("$"))
            System.out.println(chks[i].substring(1));
    }
}
And here is a single-regex approach. Use it only if you feel confident with regex:
String s = "<$FB $TWTR are getting plummetted>";
Pattern pattern = Pattern.compile("(?:<|(?!^)\\G)[^>]*?\\$([A-Z]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
The regex used here is (?:<|(?!^)\G)[^>]*?\$([A-Z]+).
It matches:
(?:<|(?!^)\G) - either a literal <, or the position where the previous successful match ended
[^>]*? - 0 or more characters other than > (as few as possible)
\$ - literal $
([A-Z]+) - match and capture uppercase letters (replace with what best suits your purpose, perhaps \\w).
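The breakdown above implies that a $ symbol outside of <...> is never picked up, because \G cannot get past a > and a fresh match has to start at <. A small sketch to illustrate this; the class Tickers, the method extract and the second sample string are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Tickers {
    // Start at "<" or continue from the previous match, skip as few
    // non-">" characters as possible, then capture the letters after "$".
    private static final Pattern TICKER =
            Pattern.compile("(?:<|(?!^)\\G)[^>]*?\\$([A-Z]+)");

    // Hypothetical helper: all $SYMBOLs that appear inside <...>.
    static List<String> extract(String text) {
        List<String> symbols = new ArrayList<>();
        Matcher m = TICKER.matcher(text);
        while (m.find()) {
            symbols.add(m.group(1));
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println(extract("<$FB $TWTR are getting plummetted>"));
        // $NOPE sits between two tweets and is skipped.
        System.out.println(extract("<$FB up> and $NOPE then <$TWTR down>"));
    }
}
```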
I have a pattern defined like this:
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z]{2}");
And in my code I'm doing this:
Matcher matcher = PATTERN.matcher(myString);
and using a while loop to find all matches.
while (matcher.find()) {
    //do something here
}
If myString is 12345AB3CD45, the matcher finds those two groups of two letters (AB and CD). The problem is that sometimes myString is 12345ABC356, and then I would like the matcher to find first AB and then BC (it only finds AB).
Am I doing this wrong, is the regex wrong, or does the matcher not work this way?
You can't match the same position several times with a regex, but there is a trick.
Enclose your pattern in a lookahead with a capture group inside it:
(?=([A-Za-z]{2}))
A lookahead consumes no characters, so every match is empty and the next search starts just one position further on.
The result you are looking for is in the capture group 1.
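A minimal sketch of the lookahead trick, using the two sample strings from the question. The class OverlappingPairs and the method find are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingPairs {
    // The whole match is empty; the two letters are only captured in group 1.
    private static final Pattern PAIR = Pattern.compile("(?=([A-Za-z]{2}))");

    // Hypothetical helper: every pair of adjacent letters, overlaps included.
    static List<String> find(String s) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(s);
        while (m.find()) {
            // group() would be "", so read the capture group instead.
            pairs.add(m.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(find("12345ABC356"));
        System.out.println(find("12345AB3CD45"));
    }
}
```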
A fragment of text that was placed in group 0 (the entire match) can't be reused as part of group 0 in the next match.
12345ABC356
     ^^      - AB was placed in the standard match (group 0)
      ^^     - B can't be reused here as part of the standard match
You can solve this problem with look-around mechanisms such as look-ahead. They are zero-length and don't consume the part they match, but you can place their content in a separate capturing group and read it from there.
So your code can look like
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z](?=([a-zA-Z]))");
//                                                      ^^^^^^^^   ^^^^^^^^^^
//                                                      group 0    group 1
//...
Matcher matcher = PATTERN.matcher(myString);
while (matcher.find()) {
    String match = matcher.group() + matcher.group(1);
    //...
}
I have this regex and my output seems to be matching each single space but the capturing group is only alpha chars. I must be missing something.
String regexstring = new String("1234567 Mike Peloso ");
Pattern pattern = Pattern.compile("[A-Za-z]*");
Matcher matcher = pattern.matcher(regexstring);
while (matcher.find()) {
    System.out.println(Integer.toString(matcher.start()));
    String someNumberStr = matcher.group();
    System.out.println(someNumberStr);
}
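There is no capturing group here; the problem is the quantifier. The * quantifier matches the preceding element zero or more times, so the pattern also succeeds with an empty match at every position that holds no letter, which clutters the output. Use the + quantifier (one or more times) instead: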
Pattern pattern = Pattern.compile("[A-Za-z]+");
And then print the match result:
while (matcher.find()) {
    System.out.println(matcher.start());
    System.out.println(matcher.group());
}
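To compare the two quantifiers side by side, the matches can be collected into lists. This is only an illustrative sketch; the class StarVsPlus and the method allMatches are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StarVsPlus {
    // Hypothetical helper: every match of regex in input, empty ones included.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "1234567 Mike Peloso ";
        // With *, empty strings show up wherever there is no letter.
        System.out.println(allMatches("[A-Za-z]*", input));
        // With +, only the two words remain.
        System.out.println(allMatches("[A-Za-z]+", input));
    }
}
```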