Java regular expression and "greedy" or early search for slash - java

Text:
123/444_ab/alphanum/alphanum/alphanum.sss
256/333_123/alphanum/alphanum.fff
777/999_abcde/alphanum.ggg
I want two groups.
first group matches: 123, 256, and 777
second group matches: 444_ab, 333_123, and 999_abcde.
The problem is that any regexp I come up with includes extra slashes in the second group, e.g. 333_123/alphanum.
For example:
(\\d{3})/(\\d{3}_.+)/.+[.].+
It should give just the first two groups, each followed by a slash.

As an aside, a requirement like this can also easily be handled by any "split by string" function. Split on '/' to obtain an array of values and go from there ...
I find that this is often much easier to read, and to debug, than "regular-expression chicken scratches," when the data has a format such as what you show here. It will also "obviously" show what should happen when the data contains 5, 4, or 3 groups as you demonstrate in your post, and it will work for any number of groups.
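A minimal sketch of the split approach, using the three sample lines from the question. The class name SplitDemo and the helper firstTwoFields are made up for illustration; the limit argument 3 is a choice made here so that splitting stops after the second slash.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: returns the first two '/'-separated fields of a line.
    // The limit of 3 keeps everything after the second slash in one trailing element.
    static List<String> firstTwoFields(String line) {
        String[] parts = line.split("/", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected at least two slashes: " + line);
        }
        return Arrays.asList(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        String[] lines = {
            "123/444_ab/alphanum/alphanum/alphanum.sss",
            "256/333_123/alphanum/alphanum.fff",
            "777/999_abcde/alphanum.ggg"
        };
        for (String line : lines) {
            System.out.println(firstTwoFields(line));
        }
    }
}
```

The explicit length check makes it obvious what happens when a line has fewer fields than expected.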

^(.*?)\/(.*?)\/.*
This regular expression should do the trick.

Converting my comment to answer so that solution is easy to find for future visitors.
You may use this regex with MULTILINE mode:
(?m)^(\\d{3})/(\\d{3}_[^/]+)
RegEx Demo
RegEx Details:
(?m): Enable inline MULTILINE mode so that ^ matches start of each line
^: Start of line
(\\d{3}): First capture group to match 3 digits
/: Match a /
(\\d{3}_[^/]+): Second capture group to match 3 digits then _ then 1 or more of any character that is not a /
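As a rough illustration of how this regex might be applied, the sketch below runs it over the three sample lines joined into one multi-line string. The class name MultilineDemo and the helper extract are assumptions made for the example, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // (?m) makes ^ match at the start of every line, not just the start of the input.
    private static final Pattern P = Pattern.compile("(?m)^(\\d{3})/(\\d{3}_[^/]+)");

    // Hypothetical helper: collects {group 1, group 2} for every line that matches.
    static List<String[]> extract(String text) {
        List<String[]> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2) });
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "123/444_ab/alphanum/alphanum/alphanum.sss\n"
                    + "256/333_123/alphanum/alphanum.fff\n"
                    + "777/999_abcde/alphanum.ggg";
        for (String[] pair : extract(text)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

Because the second group uses [^/]+ instead of .+, it cannot run past the next slash, which is what removes the extra path segments.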

Use *? for a non-greedy match: ^(.*?)/(.*?)/.*.
.*? will match only as few characters as necessary for the whole expression to match.
import java.util.regex.*;
public class MyClass {
public static void main(String args[]) {
String a = "123/444_ab/alphanum/alphanum/alphanum.sss";
String b = "256/333_123/alphanum/alphanum.fff";
String c = "777/999_abcde/alphanum.ggg";
Pattern p = Pattern.compile("^(.*?)/(.*?)/.*");
Matcher m = p.matcher(a);
if (m.matches()) {
System.out.println("a:");
System.out.println(m.group(1));
System.out.println(m.group(2));
} else {
System.out.println("'a' doesn't match.");
}
m = p.matcher(b);
if (m.matches()) {
System.out.println("b:");
System.out.println(m.group(1));
System.out.println(m.group(2));
} else {
System.out.println("'b' doesn't match.");
}
m = p.matcher(c);
if (m.matches()) {
System.out.println("c:");
System.out.println(m.group(1));
System.out.println(m.group(2));
} else {
System.out.println("'c' doesn't match.");
}
}
}
Output:
a:
123
444_ab
b:
256
333_123
c:
777
999_abcde

Related

Regular Expression in Java. Splitting a string using pattern and matcher

I am trying to get all the matching groups in my string.
My regular expression is "(?<!')/|/(?!')". I am trying to split the string using a regular expression with Pattern and Matcher. The string needs to be split on /, but a / surrounded by single quotes ('/') must be skipped. For example, "One/Two/Three'/'3/Four" needs to be split as ["One", "Two", "Three'/'3", "Four"], but without using the .split method.
I am currently using the code below:
// String to be scanned to find the pattern.
String line = "Test1/Test2/Tt";
String pattern = "(?<!')/|/(?!')";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.matches()) {
System.out.println("Found value: " + m.group(0) );
} else {
System.out.println("NO MATCH");
}
But it always says "NO MATCH". What am I doing wrong, and how do I fix it?
Thanks in advance
To get the matches without using split, you might use
[^'/]+(?:'/'[^'/]*)*
Explanation
[^'/]+ Match 1+ times any char except ' or /
(?: Non capture group
'/'[^'/]* Match '/' followed by optionally matching any char except ' or /
)* Close group and optionally repeat it
Regex demo | Java demo
String regex = "[^'/]+(?:'/'[^'/]*)*";
String string = "One/Two/Three'/'3/Four";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Output
One
Two
Three'/'3
Four
Edit
If you do not want to split, you might also use a pattern that does not match a / unless it is surrounded by single quotes:
[^/]+(?:(?<=')/(?=')[^/]*)*
Regex demo
Try this.
String line = "One/Two/Three'/'3/Four";
Pattern pattern = Pattern.compile("('/'|[^/])+");
Matcher m = pattern.matcher(line);
while (m.find())
System.out.println(m.group());
output:
One
Two
Three'/'3
Four
Here is a simple pattern matching all the desired /, so you can split on them:
(?<=[^'])\/(?=')|(?<=')\/(?=[^'])|(?<=[^'])\/(?=[^'])
The logic is as follows. There are 4 cases:
1. / is surrounded by ', i.e. '/'
2. / is preceded by ', i.e. '/
3. / is followed by ', i.e. /'
4. / is surrounded by characters other than '
You only want to exclude case 1, so we need a regex for the other three cases. I have written three similar regexes and combined them with alternation.
Explanation of the first part (the other two are analogous):
(?<=[^']) - positive lookbehind, asserts that what precedes is different from ' (negated character class [^'])
\/ - match / literally
(?=') - positive lookahead, asserts that what follows is '
Demo with some more edge cases
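A sketch of how the three-way alternation above could be used from Java with String.split. The class name QuoteAwareSplit and the constant SEPARATOR are invented for this example; the backslash before / is dropped because Java does not need it.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // Matches a '/' unless it has a single quote on BOTH sides.
    static final String SEPARATOR =
        "(?<=[^'])/(?=')|(?<=')/(?=[^'])|(?<=[^'])/(?=[^'])";

    static String[] split(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("One/Two/Three'/'3/Four")));
    }
}
```

Note that each lookaround requires an actual character on that side, so a / at the very start or end of the input would not be treated as a separator by this pattern.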
Try something like this:
String line = "One/Two/Three'/'3/Four";
// "\\d" must be double-escaped in a Java string literal
String pattern = "([^/]+'/'\\d)|[^/]+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
boolean found = false;
while (m.find()) {
    System.out.println("Found value: " + m.group());
    found = true;
}
if (!found) {
    System.out.println("NO MATCH");
}
Output:
Found value: One
Found value: Two
Found value: Three'/'3
Found value: Four

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
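For readers who do want to try this from Java, here is a minimal sketch of the last pattern. The class name LastWord and the helper lastWord are hypothetical; the inner capturing group is dropped because only the whole match is read.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // Word characters followed by an optional punctuation mark and the end of input.
    // The lookahead checks the punctuation without including it in the match.
    private static final Pattern P = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Hypothetical helper: returns the last word, or null if there is none.
    static String lastWord(String sentence) {
        Matcher m = P.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));
        System.out.println(lastWord("a b cd efg hi."));
        System.out.println(lastWord("a b cd efg hi!"));
    }
}
```

By default Java's \w only covers ASCII letters, digits and underscore, so for accented words you would probably want to compile with Pattern.UNICODE_CHARACTER_CLASS.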
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One improvement concerns your use of $, which anchors the match to the end of the line. Use it only if your sentence fills the line exactly!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) holds the word with its punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Java regex - overlapping matches

In the following code:
public static void main(String[] args) {
    List<String> allMatches = new ArrayList<String>();
    Matcher m = Pattern.compile("\\d+\\D+\\d+").matcher("2abc3abc4abc5");
    while (m.find()) {
        allMatches.add(m.group());
    }
    String[] res = allMatches.toArray(new String[0]);
    System.out.println(Arrays.toString(res));
}
The result is:
[2abc3, 4abc5]
I'd like it to be
[2abc3, 3abc4, 4abc5]
How can it be achieved?
Make the matcher attempt to start its next scan from the latter \d+.
Matcher m = Pattern.compile("\\d+\\D+(\\d+)").matcher("2abc3abc4abc5");
if (m.find()) {
    do {
        allMatches.add(m.group());
    } while (m.find(m.start(1)));
}
Not sure if this is possible in Java, but in PCRE you could do the following:
(?=(\d+\D+\d+)).
Explanation
The technique is to use a matching group in a lookahead, and then "eat" one character to move forward.
(?= : start of positive lookahead
( : start matching group 1
\d+ : match a digit one or more times
\D+ : match a non-digit character one or more times
\d+ : match a digit one or more times
) : end of group 1
) : end of lookahead
. : match anything, this is to "move forward".
Online demo
Thanks to Casimir et Hippolyte it really seems to work in Java. You just need to add backslashes and display the first capturing group: (?=(\\d+\\D+\\d+))..
Tested on www.regexplanet.com.
The above solution of HamZa works perfectly in Java. If you want to find a specific pattern in a text all you have to do is:
String regex = "\\d+\\D+\\d+";
String updatedRegex = "(?=(" + regex + ")).";
Here regex is the pattern you are looking for. To make its matches overlap, wrap it with "(?=(" at the start and "))." at the end, then read each match from group 1.
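Putting the lookahead trick together, a self-contained sketch might look like this. The class name OverlapDemo and the helper overlapping are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Hypothetical helper: wraps the given regex in a capturing lookahead and
    // consumes one character per match, so the next search starts one position later.
    static List<String> overlapping(String regex, String input) {
        Matcher m = Pattern.compile("(?=(" + regex + ")).").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1)); // the lookahead's capture, not the single consumed character
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("\\d+\\D+\\d+", "2abc3abc4abc5"));
    }
}
```

One thing to watch: because a match is attempted at every position, multi-digit numbers also yield matches that start in the middle of a number (for "12abc3" you would get both 12abc3 and 2abc3).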

Regular expression for a string starting with some string

I have a string of this form: (notice)Any_other_string (note that the parentheses () are part of the string).
I want to separate this string into 2 parts: (notice) and the rest. I do it as follows:
private static final Pattern p1 = Pattern.compile("(^\\(notice\\))([a-z_A-Z1-9])+");
String content = "(notice)Stack Over_Flow 123";
Matcher m = p1.matcher(content);
System.out.println("Printing");
if (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
I expected the result to be (notice) and Stack Over_Flow 123, but instead the result is (notice)Stack and (notice).
I cannot explain this result. Which regex is suitable for my purpose?
Issue 1: group(0) will always return the entire match - this is specified in the javadoc - and the actual capturing groups start from index 1. Simply replace it with the following:
System.out.println(m.group(1));
System.out.println(m.group(2));
Issue 2: You do not take spaces and other characters, such as underscores, into account (not even the digit 0). I suggest using the dot, ., for matching unknown characters. Or include \\s (whitespace) and _ into your regex. Either of the following regexes should work:
(^\\(notice\\))(.+)
(^\\(notice\\))([A-Za-z0-9_\\s]+)
Note that you need the + inside the capturing group, or it will only find the last character of the second part.
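Both fixes combined might look like the sketch below, which uses the first suggested regex and the sample string from the question. The class name NoticeSplit and the helper parts are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoticeSplit {
    // Group 1 is the literal "(notice)" prefix; group 2 is everything after it.
    // The + sits inside the second group so the group captures the whole remainder.
    private static final Pattern P = Pattern.compile("(^\\(notice\\))(.+)");

    // Hypothetical helper: returns {prefix, rest}, or null if the input does not match.
    static String[] parts(String content) {
        Matcher m = P.matcher(content);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] p = parts("(notice)Stack Over_Flow 123");
        System.out.println(p[0]);
        System.out.println(p[1]);
    }
}
```

With the original ([a-z_A-Z1-9])+ the quantifier was outside the group, so the group was overwritten on each repetition and ended up holding only the last character it matched.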

Regular Expression in Java: How to refer to "matched patterns"?

I was reading the Java Regular Expression tutorial, and it seems only to teach to test whether a pattern matched or not, but does not tell me how to refer to a matched pattern.
For example, I have a string "My name is xxxxx". And I want to print xxxx. How would I do that with Java regular expressions?
Thanks.
What tutorial were you reading? Sun's one tackles that topic quite thoroughly, but you have to read it carefully :)
Capturing a part of a string is done with parentheses. If you want to capture a group in a string, you have to put that part of the regular expression in parentheses. The groups are numbered in the order their opening parentheses appear, and the group with index 0 represents the whole match.
For instance, the regexp "Day ([0-9]+) - Note ([0-9]+)" would define 3 groups:
group(0) : The whole match
group(1) : The first group in the regexp, that is to say the day number
group(2) : The second group in the regexp, that is to say the note number
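To make the group numbering concrete, here is a small sketch using that regexp. The input "Day 12 - Note 7", the class name DayNoteDemo and the helper groups are all invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DayNoteDemo {
    private static final Pattern P = Pattern.compile("Day ([0-9]+) - Note ([0-9]+)");

    // Hypothetical helper: returns {group 0, group 1, group 2}, or null if the input does not match.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groups("Day 12 - Note 7"); // sample input chosen for this example
        System.out.println("group(0): " + g[0]); // the whole match
        System.out.println("group(1): " + g[1]); // the day number
        System.out.println("group(2): " + g[2]); // the note number
    }
}
```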
As for the actual code and how to retrieve the groups you've defined in your regexp, have a look at the Java documentation, especially the Matcher class and its group method : http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
You can test your regexps with that very useful tool : http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
Hope this helped,
Cheers
Note the use of parentheses in the pattern and the group() method on Matcher
import java.util.regex.*;
public class Example {
static public void main(String[] args) {
Pattern regex = Pattern.compile("My name is (.*)");
String s = "My name is Michael";
Matcher matcher = regex.matcher(s);
if (matcher.matches()) {
System.out.println("original string: " + matcher.group(0));
System.out.println("first group: " + matcher.group(1));
}
}
}
Output is:
original string: My name is Michael
first group: Michael
You can use the Matcher group(int) method:
Pattern p = Pattern.compile("My name is (.*)");
Matcher m = p.matcher("My name is akf");
if (m.find()) { // group(1) throws IllegalStateException if no match was found
    String s = m.group(1); //grab the first group*
    System.out.println(s);
}
output:
akf
* look at matching groups
Matcher m = Pattern.compile("name is (.*)").matcher("My name is Ross");
if (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
The parens form a capturing group. Group 0 is the entire match and group 1 is the text captured by the first pair of parentheses.
The above program outputs:
name is Ross
Ross
