Regular expression for a string starting with some string - java

I have a string of this form: (notice)Any_other_string (note that the parentheses are part of the string).
I want to split this string into two parts: (notice) and the rest. This is what I do:
private static final Pattern p1 = Pattern.compile("(^\\(notice\\))([a-z_A-Z1-9])+");
String content = "(notice)Stack Over_Flow 123";
Matcher m = p1.matcher(content);
System.out.println("Printing");
if (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
I expected the result to be (notice) and Stack Over_Flow 123, but instead it is (notice)Stack and (notice).
I cannot explain this result. Which regex is suitable for my purpose?

Issue 1: group(0) will always return the entire match - this is specified in the javadoc - and the actual capturing groups start from index 1. Simply replace it with the following:
System.out.println(m.group(1));
System.out.println(m.group(2));
Issue 2: Your character class does not account for spaces (or for the digit 0, since it uses 1-9). I suggest using the dot, ., to match arbitrary characters, or adding \\s (whitespace) and 0 to the class. Either of the following regexes should work:
(^\\(notice\\))(.+)
(^\\(notice\\))([A-Za-z0-9_\\s]+)
Note that you need the + inside the capturing group, or it will only find the last character of the second part.
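Putting both fixes together, a minimal sketch could look like the following. The class name NoticeSplit and the helper split are made up for illustration; the pattern is the first of the two suggested above, with the ^ anchor moved in front of the group, which does not change what it matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class NoticeSplit {
    // Group 1 is the literal "(notice)" prefix; group 2 is everything after it.
    private static final Pattern P = Pattern.compile("^(\\(notice\\))(.+)");

    // Returns {prefix, rest}, or null when the input does not match.
    static String[] split(String content) {
        Matcher m = P.matcher(content);
        if (!m.find()) {
            return null;
        }
        // group(0) would be the whole match; the capturing groups start at 1.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("(notice)Stack Over_Flow 123");
        System.out.println(parts[0]); // (notice)
        System.out.println(parts[1]); // Stack Over_Flow 123
    }
}
```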


How to write a regex capture group which matches a character 3 or 4 times before a delimiter?

I'm trying to write a regex that splits elements out according to a delimiter. The regex also needs to ensure there are ideally 4, but at least 3 colons : in each match.
Here's an example string:
"Checkers, etc:Blue::C, Backgammon, I say:Green::Pepsi:P, Chess, misc:White:Coke:Florida:A, :::U"
From this, there should be 4 matches:
Checkers, etc:Blue::C
Backgammon, I say:Green::Pepsi:P
Chess, misc:White:Coke:Florida:A
:::U
Here's what I've tried so far:
([^:]*:[^:]*){3,4}(?:, )
Regex 101 at: https://regex101.com/r/O8iacP/8
I tried setting up a non-capturing group for ,
Then I tried matching a group of any character that's not a :, a :, and any character that's not a : 3 or 4 times.
The code I'm using to iterate over these groups is:
String line = "Checkers, etc:Blue::C, Backgammon, I say::Pepsi:P, Chess:White:Coke:Florida:A, :::U";
String pattern = "([^:]*:[^:]*){3,4}(?:, )";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher matcher = r.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Any help is appreciated!
Edit
Using @Casimir's regex, it's working. I had to change the above code to use group(0) like this:
String line = "Checkers, etc:Blue::C, Backgammon, I say::Pepsi:P, Chess:White:Coke:Florida:A, :::U";
String pattern = "(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher matcher = r.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Now prints:
Checkers, etc:Blue::C
Backgammon, I say::Pepsi:P
Chess:White:Coke:Florida:A
:::U
Thanks again!
I suggest this pattern:
(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])
The negative lookaheads keep leading and trailing delimiters out of the match. The second one in particular forces the match to be followed by the delimiter or by the end of the string (that is, not followed by a character that isn't a comma).
Note that the pattern doesn't have capture groups, so the result is the whole match (or group 0).
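As a rough sketch of how this pattern can be used, the snippet below collects every whole match into a list. The class name ColonRecords and the method records are invented for the example; the input line is the one from the question's edit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the suggested pattern.
class ColonRecords {
    private static final Pattern RECORD =
            Pattern.compile("(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])");

    // Collects the whole match (group 0) of every record found in the line.
    static List<String> records(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Checkers, etc:Blue::C, Backgammon, I say::Pepsi:P, "
                + "Chess:White:Coke:Florida:A, :::U";
        for (String record : records(line)) {
            System.out.println(record);
        }
    }
}
```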
You might use
(?:[^,:]+, )?[^:,]*(?::+[^:,]+)+
(?:[^,:]+, )? Optionally match 1+ any char except a , or : followed by , and space
[^:,]* Match 0+ any char except : or ,
(?: Non Capturing group
:+[^:,]+ Match 1+ : and 1+ times any char except : and ,
)+ Close group and repeat 1+ times
You seem to be making it harder than it needs to be with the lookahead (which won't be satisfied at end-of-line anyway).
([^:]*:){3}[^:,]*:?[^:,]*
Find the first 3 :'s, then start including , in the negative groupings, with an optional 4th :.

java regex char sequence

I have a string with multiple "message" inside it. "message" starts with certain char sequence. I've tried:
String str = "ab message1ab message2ab message3";
Pattern pattern = Pattern.compile("(?<record>ab\\p{ASCII}+(?!ab))");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
handleMessage(matcher.group("record"));
}
but \p{ASCII}+ is greedy and consumes everything.
The characters a and b may each appear inside a message; only the sequence ab marks the start of the next message.
\p{ASCII}+ matches one or more ASCII characters greedily, meaning that it takes the longest possible match. To get the shortest possible match, use the reluctant quantifier instead: \p{ASCII}+?. In that case you also need a positive lookahead assertion to tell the regex where a message has to stop.
The regex could become:
Pattern pattern = Pattern.compile("(?<record>ab\\p{ASCII}+?)(?=(ab)|\\z)");
Note the (ab)|\z alternative in the lookahead: \z (end of input) is what allows the last message to match.
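A self-contained sketch of the reluctant-quantifier approach follows. The class MessageSplitter and its messages method are hypothetical stand-ins for the question's handleMessage loop, and the inner group around ab in the lookahead is dropped because it is not needed for matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the original code calls handleMessage(...) instead.
class MessageSplitter {
    // Reluctant +? stops as soon as the lookahead sees the next "ab" or end of input.
    private static final Pattern RECORD =
            Pattern.compile("(?<record>ab\\p{ASCII}+?)(?=ab|\\z)");

    static List<String> messages(String str) {
        List<String> out = new ArrayList<>();
        Matcher matcher = RECORD.matcher(str);
        while (matcher.find()) {
            out.add(matcher.group("record"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ab message1, ab message2, ab message3]
        System.out.println(messages("ab message1ab message2ab message3"));
    }
}
```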

How to get all integers before hyphen from java String

I want to extract the integers that come before each hyphen, so the answer should be 0 0 1 (as integers). What would be the best way to parse this in Java?
public static String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
Please help me out.
Use the below regex with Pattern and matcher classes.
Pattern.compile("\\d+(?=-)");
\\d+ - Matches one or more digits. + repeats the previous token \\d (which matches a digit character) one or more times.
(?=-) - Only if it's followed by a hyphen. This is called a positive lookahead assertion; it asserts that the match must be followed by a - symbol without consuming it.
String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
Matcher m = Pattern.compile("\\d+(?=-)").matcher(str);
while(m.find())
{
System.out.println(m.group());
}
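If the values are needed as integers rather than printed, the same loop can feed Integer.parseInt. This is only a sketch; the class DigitsBeforeHyphen and the method parse are names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on the regex suggested above.
class DigitsBeforeHyphen {
    private static final Pattern BEFORE_HYPHEN = Pattern.compile("\\d+(?=-)");

    // Returns every run of digits that is immediately followed by a hyphen.
    static List<Integer> parse(String s) {
        List<Integer> out = new ArrayList<>();
        Matcher m = BEFORE_HYPHEN.matcher(s);
        while (m.find()) {
            out.add(Integer.parseInt(m.group()));
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
        // Prints [0, 0, 1, 1, 0, 0, 0, 1, 0]
        System.out.println(parse(str));
    }
}
```

The digits in S1, S2 and S3 are skipped because they are followed by | or , rather than by a hyphen.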
One lazy way: if you already know the layout of the string, use substring and indexOf to locate each value.
String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
// Skip the leading '[' (hence the + 1) and stop just before "-S1"
int int1 = Integer.parseInt(str.substring(str.indexOf("[") + 1, str.indexOf("-S1")));
and so on.

Find pattern in string with regex -> how to improve my solution

I would like to parse a string and get the "stringIAmLookingFor" part of it, which is surrounded by "_" at the beginning and the end. I'm using a regex to match that and then removing the "_" from the matched string. This works, but I'm wondering if there is a more elegant approach to this problem?
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w)*_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
String match = m.group();
match = match.replaceAll("_", "");
System.out.println(match);
}
Solution (partial)
Please also check the next section. Don't just read the solution here.
Just modify your code a bit:
String test = "xyz_stringIAmLookingFor_zxy";
// Make the capturing group capture the text in between (\w*)
// A capturing group is enclosed in (pattern), denoting the part of the
// pattern whose text you want to get separately from the main match.
// Note that there is also non-capturing group (?:pattern), whose text
// you don't need to capture.
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// The text is in the capturing group numbered 1
// The numbering is by counting the number of opening
// parentheses that makes up a capturing group, until
// the group that you are interested in.
String match = m.group(1);
System.out.println(match);
}
Matcher.group(), without any argument will return the text matched by the whole regex pattern. Matcher.group(int group) will return the text matched by capturing group with the specified group number.
If you are using Java 7 or later, you can make use of named capturing groups, which make the code slightly more readable. The string matched by the capturing group can be accessed with Matcher.group(String name).
String test = "xyz_stringIAmLookingFor_zxy";
// (?<name>pattern) is similar to (pattern), just that you attach
// a name to it
// specialText is not a really good name, please use a more meaningful
// name in your actual code
Pattern p = Pattern.compile("_(?<specialText>\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// Access the text captured by the named capturing group
// using Matcher.group(String name)
String match = m.group("specialText");
System.out.println(match);
}
Problem in pattern
Note that \w also matches _. The pattern you have is ambiguous, and I don't know what your expected output is for the cases where there are more than 2 _ in the string. And do you want to allow underscore _ to be part of the output?
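To make the ambiguity concrete, here is a small sketch that runs two patterns over a made-up input with three underscores, a_b_c_d. The class UnderscoreAmbiguity and the helper findAll are hypothetical; the second pattern uses lookarounds and a character class that excludes underscores from the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input "a_b_c_d" is an invented example.
class UnderscoreAmbiguity {
    // Collects the given group from every match of regex in input.
    static List<String> findAll(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a_b_c_d";
        // \w also matches '_', so the greedy group swallows the middle underscore.
        System.out.println(findAll("_(\\w*)_", 1, input));          // [b_c]
        // Excluding '_' from the match yields each piece between underscores.
        System.out.println(findAll("(?<=_)[^_]+(?=_)", 0, input));  // [b, c]
    }
}
```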
You can define the group you actually want, since you're already using parentheses. You just need to tweak your pattern a bit.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
System.out.println(m.group(1));
}
Use group(1) instead of group(), because group() returns the entire match and not the capturing group.
Reference : http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)
"xyz_stringIAmLookingFor_zxy".replaceAll("_(\\w)*_", "$1");
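replaces the matched text, underscores included, with the contents of the capturing group. As written, the * sits outside the parentheses, so the group holds only the last character it matched; use _(\\w*)_ to capture the whole run. The text outside the match (xyz and zxy) is left in place.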
a simpler regex, no group needed:
"(?<=_)[^_]*"
if you want it more strict:
"(?<=_)[^_]+(?=_)"
try
String s = "xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
System.out.println(s);
output
stringIAmLookingFor

Regexp grouping and replaceAll with .* in Java duplicates the replacement

I have a problem using regexps in Java. The example code writes out ABC_012_suffix_suffix, but I was expecting it to output ABC_012_suffix
Pattern regexp = Pattern.compile("(.*)");
Matcher matcher = regexp.matcher("ABC_012");
String result = matcher.replaceAll("$1_suffix");
System.out.println(result);
I understand that replaceAll replaces all matched groups; the question is why this regexp group (.*) matches twice on my string ABC_012 in Java.
Pattern regexp = Pattern.compile(".*");
Matcher matcher = regexp.matcher("ABC_012");
matcher.matches();
System.out.println(matcher.group(0));
System.out.println(matcher.replaceAll("$0_suffix"));
Same happens here, the output is:
ABC_012
ABC_012_suffix_suffix
The reason is hidden in the replaceAll method: it tries to find all subsequences that match the pattern:
while (matcher.find()) {
System.out.printf("Start: %s, End: %s%n", matcher.start(), matcher.end());
}
This will result in:
Start: 0, End: 7
Start: 7, End: 7
So, to our first surprise, the matcher finds two subsequences, "ABC_012" and another "". And it appends "_suffix" to both of them:
"ABC_012" + "_suffix" + "" + "_suffix"
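.* first matches the whole input. The matcher then searches again from the end of that match, and because * allows zero repetitions, .* matches the empty string there as well (empty, but still a match). Try (.+) or (^.*$) instead. Both work as expected.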
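The difference between the three patterns can be checked with a short sketch like this one. The class SuffixOnce and the method addSuffix are illustrative names only.

```java
// Hypothetical demo class comparing the patterns discussed above.
class SuffixOnce {
    // Replaces every match of regex with group 1 followed by "_suffix".
    static String addSuffix(String input, String regex) {
        return input.replaceAll(regex, "$1_suffix");
    }

    public static void main(String[] args) {
        String input = "ABC_012";
        // (.*) also matches the empty string at the end of the input.
        System.out.println(addSuffix(input, "(.*)"));   // ABC_012_suffix_suffix
        // (.+) needs at least one character, so the empty match is rejected.
        System.out.println(addSuffix(input, "(.+)"));   // ABC_012_suffix
        // ^ cannot match at the end of a non-empty input, so only one match is found.
        System.out.println(addSuffix(input, "(^.*$)")); // ABC_012_suffix
    }
}
```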
At regexinfo star is defined as follows:
*(star) - Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
If you just want to add "_suffix" to your input why don't you just do:
String result = "ABC_012" + "_suffix";
?
