Java regex extract capture group if it exists


I apparently don't understand Java's regex library, or regex in general for that matter.
For this string:
String text = "asdf 2013-05-12 asdf";
this regex explodes in my face:
String REGEX_FORMAT_1 = ".+?([0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2}).+";
Pattern PATTERN_FORMAT_1 = Pattern.compile(REGEX_FORMAT_1);
Matcher matcher_1 = PATTERN_FORMAT_1.matcher(text);
if (matcher_1.matches()) {
    String matchedGroup = matcher_1.group();
    ...
}
Semantically this makes sense to me but it seems I've totally misunderstood something. The regex works fine in some online regex editors like regex101 but not in others. Could someone please help me understand why I don't get the capture group containing 2013-05-12 ...

group() is equivalent to group(0) and returns the entire match. Use group(1) to get the text captured by the first capturing group.
String text = "asdf 2013-05-12 asdf";
String regex = ".+?([0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2}).+";
Matcher matcher = Pattern.compile(regex).matcher(text);
if (matcher.matches()) {
    String matchedGroup = matcher.group(1);
    System.out.println(matchedGroup);
}
Output:
2013-05-12
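For the "if it exists" part of the question, here is a minimal sketch. It assumes a hypothetical helper named extractDate and drops the .+? and .+ padding from the pattern, since find() searches anywhere in the input. An Optional is one way to signal that there may be no match; it is not the only way.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {
    // Same date shape as the question's pattern; the padding is dropped
    // because find() looks for the pattern anywhere in the input.
    private static final Pattern DATE =
            Pattern.compile("([0-9]{4}\\s?-\\s?[0-9]{2}\\s?-\\s?[0-9]{2})");

    // Hypothetical helper: the captured date, or empty if the input has none.
    static Optional<String> extractDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractDate("asdf 2013-05-12 asdf")); // Optional[2013-05-12]
        System.out.println(extractDate("no date here"));         // Optional.empty
    }
}
```

Calling group(1) without a successful matches() or find() throws IllegalStateException, which is why the helper checks find() first.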

Related

how to exclude "<" in regex match

I have a String which looks like "<name><address> and <Phone_1>". I need to get a result like
1) <name>
2) <address>
3) <Phone_1>
I have tried using regex "<(.*)>" but it returns just one result.

The regex you want is
<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>
This puts the three values you want into capture groups 1 to 3. The full code would then look something like this:
Matcher m = Pattern.compile("<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>").matcher(string);
if (m.find()) {
    String name = m.group(1);
    String address = m.group(2);
    String phone = m.group(3);
}

The pattern .* in a regex is greedy. It will match as many characters as possible between the first < it finds and the last possible > it can find. In the case of your string it finds the first <, then looks for as much text as possible until a >, which it will find at the very end of the string.
You want a non-greedy or "lazy" pattern, which will match as few characters as possible. Simply use <(.+?)>. The question mark after the quantifier is the syntax for non-greedy.
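To see the difference side by side, here is a small sketch. The class name and the captures helper are made up for illustration; the two patterns and the input come from the question and the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: collects group(1) of every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<name><address> and <Phone_1>";
        // Greedy: .* runs to the last > in the string, so there is one big match.
        System.out.println(captures("<(.*)>", input));  // [name><address> and <Phone_1]
        // Lazy: .+? stops at the first > it can, so each tag matches separately.
        System.out.println(captures("<(.+?)>", input)); // [name, address, Phone_1]
    }
}
```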

This will work if you have a dynamic number of groups.
Pattern p = Pattern.compile("(<\\w+>)");
Matcher m = p.matcher("<name><address> and <Phone_1>");
while (m.find()) {
    System.out.println(m.group());
}

java regular expression word without ending with dot

I need to print the simple bind variable names in a SQL query.
I need to print the words that start with a : character but are NOT followed by a dot (.) character.
In this sample I need to print pOrg and pBusinessId, but NOT parameter.
The regular expression "(:)(\\w+)^\\." is not working.
Could you help me correct the regular expression?
Thanks
Peddi
public void testMethod() {
    String regEx = "(:)(\\w+)([^\\.])";
    String input = "(origin_table like 'I%' or (origin_table like 'S%' and process_status =5))and header_id = NVL( :parameter.number1:NULL, header_id) and (orginization = :pOrg) and (businsess_unit = :pBusinessId";
    Pattern pattern;
    Matcher matcher;
    pattern = Pattern.compile(regEx);
    matcher = pattern.matcher(input);
    String grp = null;
    while (matcher.find()) {
        grp = matcher.group(2);
        System.out.println(grp);
    }
}

You can try with something like
String regEx = "(:)(\\w+)\\b(?![.])";
(:)(\\w+)\\b makes sure that you match only entire words starting with :
(?![.]) is a negative lookahead, which makes sure that the matched word is not followed by a .
This regex will also match :NULL, so if there is some reason why it shouldn't be matched, share it with us.
Anyway, to exclude NULL from the results you can use
String regEx = "(:)(\\w+)\\b(?![.])(?<!:NULL)";
To make the regex case insensitive, so that NULL also matches null, compile the pattern with the Pattern.CASE_INSENSITIVE flag:
Pattern pattern = Pattern.compile(regEx,Pattern.CASE_INSENSITIVE);
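Put together, the approach could look like the sketch below. The class and method names are hypothetical, the SQL sample in main is a shortened version of the one in the question, and the pattern drops the capture group around the colon since only the name is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BindVariables {
    // (?![.])     negative lookahead: the name must not be followed by a dot
    // (?<!:NULL)  negative lookbehind: the text just matched must not be :NULL
    private static final Pattern BIND = Pattern.compile(
            ":(\\w+)\\b(?![.])(?<!:NULL)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: names of the simple bind variables found in sql.
    static List<String> simpleBindNames(String sql) {
        List<String> names = new ArrayList<>();
        Matcher m = BIND.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String sql = "header_id = NVL( :parameter.number1:NULL, header_id)"
                + " and (organization = :pOrg) and (business_unit = :pBusinessId";
        System.out.println(simpleBindNames(sql)); // [pOrg, pBusinessId]
    }
}
```

The \\b matters here: without it the engine could backtrack to a shorter name such as :paramete, which is not followed by a dot, and report that instead.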

Since it looks like you're using camelCase, you can actually simplify things a bit when it comes to excluding :NULL, by requiring a lowercase first letter:
:([a-z][\\w]+)\\b(?!\\.)
group(1) will then return your variable names.
An alternative that doesn't rely on negative lookahead:
:([a-z][\\w]+)\\b(?:[^\\.]|$)

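You can try the following. Note that ^ and $ anchor the whole input, so it only works if you test each candidate token on its own:
Pattern regex = Pattern.compile("^:.*?[^.]$");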

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". This "Re:" can be of any combinations, for ex. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample sequence of string : Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while (m.find()) {
    m.appendReplacement(sb, REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere on this forum, with the claim that Java regex does not recognize \s).
The problem I am facing with this code is that the search stops at the first match and exits the while loop.

Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i)   make it case insensitive
^      anchor to start of string
(      start a group (this is the "re:")
\\s*   any amount of optional whitespace
re     "re"
\\s*   optional whitespace
:      ":"
\\s*   optional whitespace
)      end the group (the "re:" string)
+      one or more times
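As a runnable sketch of the same idea (SubjectCleaner and stripLeadingRe are names made up here), the pattern can be compiled once and reused. Because it is anchored with ^, it can match at most once per string, so replaceFirst does the same job as replaceAll.

```java
import java.util.regex.Pattern;

class SubjectCleaner {
    // One or more "re:" prefixes, in any case, with optional whitespace around them.
    private static final Pattern LEADING_RE =
            Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");

    // Hypothetical helper: strips every leading "Re:" but leaves later ones alone.
    static String stripLeadingRe(String subject) {
        return LEADING_RE.matcher(subject).replaceFirst("");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(stripLeadingRe(title)); // This is a Re: sample string.
        // No colon right after "Re", so nothing is stripped.
        System.out.println(stripLeadingRe("Regarding: budget")); // Regarding: budget
    }
}
```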

In your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
the quantifier * applies only to the preceding e and the colon is optional, so it matches strings like "Reee" or an "R" followed only by separators, which makes no sense for what you are trying to do.
You'd be better off using a regex like the following:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
    yourString = m.replaceAll("");
(which is, as the documentation says, exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case insensitivity part ;-) )

Remove part of String following regex match in Java

I want to remove a part of a string following what matches my regex.
I am trying to make a TV show organization program and I want to cut off anything in the name following the season and episode marker in the form SXXEXX where X is a digit.
I grasped the regex model fairly easily to create "[Ss]\d\d[Ee]\d\d" which should match properly.
I want to use the Matcher method end() to get the last index in the string of the match but it does not seem to be working as I think it should.
Pattern p = Pattern.compile("[Ss]\\d\\d[Ee]\\d\\d");
Matcher m = p.matcher(name);
if (m.matches())
    return name.substring(0, m.end());
If someone could tell me why this doesn't work and suggest a proper way to do it, that would be great. Thanks.

matches() tries to match the whole string against the pattern. If you want to find your pattern within a string, use find(), which searches for the next match in the string.
Your code can stay almost the same:
if (m.find())
    return name.substring(0, m.end());
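Here is a small self-contained sketch of the difference, using a made-up file name and a hypothetical helper called trimAfterMarker. It returns the name unchanged when there is no marker, which is an assumption about what the program should do in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EpisodeTrimmer {
    static final Pattern MARKER = Pattern.compile("[Ss]\\d\\d[Ee]\\d\\d");

    // Hypothetical helper: cuts off everything after the first SxxExx marker.
    static String trimAfterMarker(String name) {
        Matcher m = MARKER.matcher(name);
        // end() is the index just past the match, so the marker itself is kept.
        return m.find() ? name.substring(0, m.end()) : name;
    }

    public static void main(String[] args) {
        String name = "Some Show S01E02 720p HDTV";
        System.out.println(MARKER.matcher(name).matches());     // false: not the whole string
        System.out.println(MARKER.matcher("S01E02").matches()); // true
        System.out.println(trimAfterMarker(name));              // Some Show S01E02
    }
}
```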

matches() has to match the entire string; try find() instead.
You could capture the name as well:
String name = "a movie S01E02 with some stuff";
Pattern p = Pattern.compile("(.*[Ss]\\d\\d[Ee]\\d\\d)");
Matcher m = p.matcher(name);
if (m.find())
    System.out.println(m.group());
else
    System.out.println("No match");
Will capture and print:
a movie S01E02

This should work
.*[Ss]\d\d[Ee]\d\d
In Java (I'm rusty) this will be
String resultString = null;
Pattern regex = Pattern.compile(".*[Ss]\\d\\d[Ee]\\d\\d");
Matcher regexMatcher = regex.matcher("Title S11E11Blah");
if (regexMatcher.find()) {
    resultString = regexMatcher.group();
}
Hope this helps

Getting dialogue snippets from text using regular expressions

I'm trying to extract snippets of dialogue from a book text. For example, if I have the string
"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."
Then I want to extract "What's the matter with the flag?" and "Seems all right to me.".
I found a regular expression to use here, which is "[^"\\]*(\\.[^"\\]*)*". This works great in Eclipse when I'm doing a Ctrl+F find regex on my book .txt file, but when I run the following code:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
if (m.find())
    System.out.println(m.group(1));
The only thing that prints is null. So am I not converting the regex into a Java string properly? Do I need to take into account the fact that Java Strings have a \" for the double quotes?

In a natural language text, it's not likely that " is escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".
As a Java string literal, this is "\"([^\"]*)\"".
Here it is in Java:
String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
    System.out.println(m.group(1));
}
The above prints:
What's the matter with the flag?
Seems all right to me.
On escape sequences
Given this declaration:
String s = "\"";
System.out.println(s.length()); // prints "1"
The string s only has one character, ". The \" is an escape sequence present at the Java source code level; the string itself contains no backslash.
See also
JLS 3.10.6 Escape Sequences for Character and String Literals
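The two levels of escaping (Java string literal first, regex second) can be checked with a few lines. This is only an illustrative sketch; EscapeLevels and isSingleBackslash are made-up names.

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // Hypothetical helper: true if s is exactly one backslash character.
    static boolean isSingleBackslash(String s) {
        // The source literal "\\\\" is a two-character string (two backslashes);
        // read as a regex, those two characters mean "one literal backslash".
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println("\"".length());           // 1: just the quote character
        System.out.println("\\".length());           // 1: a single backslash
        System.out.println("\\\\".length());         // 2: two backslashes
        System.out.println(isSingleBackslash("\\")); // true
        // Printing a pattern shows what the regex engine actually receives.
        System.out.println("\"([^\"]*)\"");          // "([^"]*)"
    }
}
```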
The problem with the original code
There's actually nothing wrong with the pattern per se, but you're not capturing the right portion. Group 1 isn't capturing the quoted text. Here's the pattern with the correct capturing group:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
    System.out.println(m.group(1));
}
For visual comparison, here's the original pattern, as a Java string literal:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                            why capture this part?
And here's the modified pattern:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  we want to capture this part!
As mentioned before, though: this complicated pattern isn't necessary for natural language text, which isn't likely to contain escaped quotes.
See also
regular-expressions.info/Grouping and backreferences
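To see why the question's code printed null, here is a short sketch comparing the two patterns. QuoteGroups and firstGroup are hypothetical names, and the second input contains backslash-escaped quotes purely for illustration. In the original pattern the only group sits inside (...)*, and when the text has no escapes that group matches zero times, so group(1) is null even though the overall match succeeds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteGroups {
    // The question's pattern: group 1 only covers escaped characters, if any.
    static final Pattern ORIGINAL =
            Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");
    // The corrected pattern: group 1 wraps everything between the quotes.
    static final Pattern FIXED =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Hypothetical helper: group(1) of the first match. Returns null both when
    // nothing matches and when group 1 did not take part in the match.
    static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "\"Seems all right to me.\" he said.";
        System.out.println(firstGroup(ORIGINAL, plain)); // null
        System.out.println(firstGroup(FIXED, plain));    // Seems all right to me.

        String escaped = "\"a \\\"b\\\" c\" end";        // text: "a \"b\" c" end
        System.out.println(firstGroup(FIXED, escaped));  // a \"b\" c
    }
}
```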
