Regex to extract strings within delimiter - java

I am trying to extract string occurrences within delimiters (parentheses in this case), but not the ones that are within quotes (single or double). Here is what I have tried. This regex fetches all occurrences within parentheses, including the ones within quotes, which I don't want.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMain {
    static final String PATTERN = "\\(([^)]+)\\)";
    static final Pattern CONTENT = Pattern.compile(PATTERN);
    /**
     * @param args
     */
    public static void main(String[] args) {
        String testString = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        Matcher match = CONTENT.matcher(testString);
        while (match.find()) {
            // group() is the whole match, so this prints (Jack), (Jill) and (Peter's)
            System.out.println(match.group());
        }
    }
}

You could try
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMain {
    // Second alternative consumes double-quoted text without capturing anything
    static final String PATTERN = "\\(([^)]+)\\)|\"[^\"]*\"";
    static final Pattern CONTENT = Pattern.compile(PATTERN);
    /**
     * @param args
     */
    public static void main(String[] args) {
        String testString = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        Matcher match = CONTENT.matcher(testString);
        while (match.find()) {
            if (match.group(1) != null) {
                System.out.println(match.group(1)); // prints Jack, Jill
            }
        }
    }
}
This pattern matches quoted strings as well as parenthesized ones, but only the parenthesized ones put something in group(1). Because the regex engine scans from left to right, it reaches the opening quote before the parenthesis inside it, so it consumes "(Peter's)" as a quoted string and never gets the chance to match (Peter's) on its own.
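The same alternation trick can be extended to single-quoted text, which the question also wants skipped. The following is only a minimal sketch and not part of the answer above: the class name ParenExtractor and the method extract are made up for illustration, and it assumes quotes are balanced and never escaped. An apostrophe in ordinary text (as in it's) would be read as an opening quote, so the single-quote branch is an assumption to check against your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenExtractor {
    // Two quoted branches without a capturing group, then the parenthesized
    // branch, which is the only one that fills group 1.
    private static final Pattern CONTENT =
            Pattern.compile("\"[^\"]*\"|'[^']*'|\\(([^)]+)\\)");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = CONTENT.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) { // null means a quoted span was consumed
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        System.out.println(extract(s)); // prints [Jack, Jill]
    }
}
```

Returning a list instead of printing inside the loop keeps the matching logic easy to test.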

This is a case where you can make elegant use of look-behind and look-ahead operators. Here is a solution in Python (I always use it for trying things out quickly on the command line), but the regular expression should be the same in Java code.
This regex matches content that is preceded by an opening parenthesis (positive look-behind) and followed by a closing parenthesis (positive look-ahead). It rejects the match when the opening parenthesis is preceded by a single or double quote (negative look-behind) or when the closing parenthesis is followed by a single or double quote (negative look-ahead).
In [1]: import re
In [2]: s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request."
In [3]: re.findall(r"""
...: (?<= # start of positive look-behind
...: (?<! # start of negative look-behind
...: [\"\'] # avoids matching opening parenthesis preceded by single or double quote
...: ) # end of negative look-behind
...: \( # matches opening parenthesis
...: ) # end of positive look-behind
...: \w+ (?: \'\w* )? # matches whatever your content looks like (configure this yourself)
...: (?= # start of positive look-ahead
...: \) # matches closing parenthesis
...: (?! # start of negative look-ahead
...: [\"\'] # avoids matching closing parenthesis succeeded by single or double quote
...: ) # end of negative look-ahead
...: ) # end of positive look-ahead
...: """,
...: s,
...: flags=re.X)
Out[3]: ['Jack', 'Jill']
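For readers who want the Java form, here is a hedged translation of the Python expression above. It is a sketch rather than a tested port of the author's session: the class name LookaroundExtractor and the method extract are hypothetical, and the content part (\w+ with an optional apostrophe suffix) is copied from the Python version and should be adjusted to your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundExtractor {
    private static final Pattern CONTENT = Pattern.compile(
            "(?<=(?<![\"'])\\()"     // "(" just before the match, itself not preceded by a quote
            + "\\w+(?:'\\w*)?"       // the content: word characters, optional apostrophe part
            + "(?=\\)(?![\"']))");   // ")" just after the match, itself not followed by a quote

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = CONTENT.matcher(input);
        while (m.find()) {
            result.add(m.group()); // the lookarounds consume nothing, so group() is the content
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        System.out.println(extract(s)); // prints [Jack, Jill]
    }
}
```

Note that this only looks at the characters directly next to the parentheses, so a quoted span such as "see (x) here" would still yield x.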

Note: This is not a complete answer because I'm not familiar with Java, but I believe it can be converted to Java.
The easiest approach, as far as I'm concerned, is to replace the quoted parts in the string with an empty string, then look for the matches. Hoping you're somewhat familiar with PHP, here's the idea.
$str = "Rhyme (Jack) and (Jill) went up the hill on \" (Peter's)\" request.";
preg_match_all(
$pat = '~(?<=\().*?(?=\))~',
// anything inside parentheses
preg_replace('~([\'"]).*?\1~','',$str),
// this replaces quoted strings with ''
$matches
// and assigns the result into this variable
);
print_r($matches[0]);
// $matches[0] returns the matches in preg_match_all
// [0] => Jack
// [1] => Jill
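The PHP idea above (remove the quoted parts first, then match what is left) might look like this in Java. This is a sketch under assumptions: the class name StripQuotesThenMatch and the method extract are invented for illustration, and the two patterns are carried over from the PHP snippet unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripQuotesThenMatch {
    // A quote character, the shortest possible run, then the same quote character again.
    private static final Pattern QUOTED = Pattern.compile("(['\"]).*?\\1");
    // Anything between "(" and the next ")".
    private static final Pattern IN_PARENS = Pattern.compile("(?<=\\().*?(?=\\))");

    static List<String> extract(String input) {
        String unquoted = QUOTED.matcher(input).replaceAll(""); // step 1: drop quoted spans
        List<String> result = new ArrayList<>();
        Matcher m = IN_PARENS.matcher(unquoted);                // step 2: match what remains
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "Rhyme (Jack) and (Jill) went up the hill on \"(Peter's)\" request.";
        System.out.println(extract(s)); // prints [Jack, Jill]
    }
}
```

Two passes are easier to read than one combined expression, at the cost of building an intermediate string.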

Related

Java regex repeating capture groups

Considering the following string: "${test.one}${test.two}" I would like my regex to return two matches, namely "test.one" and "test.two". To do that I have the following snippet:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTester {
    private static final Pattern pattern = Pattern.compile("\\$\\{((?:(?:[A-z]+(?:\\.[A-z0-9()\\[\\]\"]+)*)+|(?:\"[\\w/?.&=_\\-]*\")+)+)}+$");
    public static void main(String[] args) {
        String testString = "${test.one}${test.two}";
        Matcher matcher = pattern.matcher(testString);
        while (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println(matcher.group(i));
            }
        }
    }
}
I have some other stuff in there as well, because I want this to also be a valid match ${test.one}${"hello"}.
So, basically, I just want it to match on anything inside of ${} as long as it either follows the format: something.somethingelse (alphanumeric only there) or something.somethingElse() or "something inside of quotations" (alphanumeric plus some other characters). I have the main regex working, or so I think, but when I run the code, it finds two groups,
${test.two}
test.two
I want the output to be
test.one
test.two
Basically, the main problem with your regex is that it matches only at the end of the string, and [A-z] matches many more characters than just letters. Your grouping also seems off.
If you load your regex at regex101, you will see it matches
\$\{
( - start of a capturing group
(?: - start of a non-capturing group
(?:[A-z]+ - start of a non-capturing group, and it matches 1+ chars between A and z (your first mistake)
(?:\.[A-z0-9()\[\]\"]+)* - 0 or more repetitions of a . and then 1+ letters, digits, (, ), [, ], ", \, ^, _, and a backtick
)+ - repeat the non-capturing group 1 or more times
| - or
(?:\"[\w/?.&=_\-]*\")+ - 1 or more occurrences of ", 0 or more word, /, ?, ., &, =, _, - chars and then a "
)+ - repeat the group pattern 1+ times
) - end of the capturing group
}+ - 1+ } chars
$ - end of string.
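The [A-z] point in the breakdown above is easy to verify. The range runs from A (code 65) to z (code 122) and therefore includes the six ASCII characters that sit between Z and a. A small sketch, with the class name AsciiRangeDemo and both helper methods made up for illustration:

```java
class AsciiRangeDemo {
    // True when the character falls inside the range A..z.
    static boolean inWideRange(char c) {
        return String.valueOf(c).matches("[A-z]");
    }

    // True only for ASCII letters.
    static boolean isLetter(char c) {
        return String.valueOf(c).matches("[A-Za-z]");
    }

    public static void main(String[] args) {
        // The six characters between 'Z' and 'a' in ASCII.
        for (char c : "[\\]^_`".toCharArray()) {
            System.out.println(c + " [A-z]=" + inWideRange(c) + " [A-Za-z]=" + isLetter(c));
        }
    }
}
```

Each line prints true for [A-z] and false for [A-Za-z], which is why [A-Za-z] is the class to use when only letters are meant.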
To match any occurrence of your pattern inside a string, you need to use
\$\{(\"[^\"]*\"|\w+(?:\(\))?(?:\.\w+(?:\(\))?)*)}
See the regex demo, get Group 1 value after a match is found. Details:
\$\{ - a ${ substring
(\"[^\"]*\"|\w+(?:\(\))?(?:\.\w+(?:\(\))?)*) - Capturing group 1:
\"[^\"]*\" - ", 0+ chars other than " and then a "
| - or
\w+(?:\(\))? - 1+ word chars and an optional () substring
(?:\.\w+(?:\(\))?)* - 0 or more repetitions of . and then 1+ word chars and an optional () substring
} - a } char.
See the Java demo:
String s = "${test.one}${test.two}\n${test.one}${test.two()}\n${test.one}${\"hello\"}";
Pattern pattern = Pattern.compile("\\$\\{(\"[^\"]*\"|\\w+(?:\\(\\))?(?:\\.\\w+(?:\\(\\))?)*)}");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
test.one
test.two
test.one
test.two()
test.one
"hello"
You could use the regular expression
(?<=\$\{")[a-z]+(?="\})|(?<=\$\{)[a-z]+\.[a-z]+(?:\(\))?(?=\})
which has no capture groups. The character classes [a-z] can be modified as required, provided they do not include a double quote, period or right brace.
Java's regex engine performs the following operations.
(?<=\$\{") # match '${"' in a positive lookbehind
[a-z]+ # match 1+ lowercase letters
(?="\}) # match '"}' in a positive lookahead
| # or
(?<=\$\{) # match '${' in a positive lookbehind
[a-z]+ # match 1+ lowercase letters
\.[a-z]+ # match '.' followed by 1+ lowercase letters
(?:\(\))? # optionally match `()`
(?=\}) # match '}' in a positive lookahead
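As a rough illustration of how the lookaround version could be used from Java, here is a sketch. The class name PlaceholderNames and the method names are hypothetical; the pattern is the one given above with Java string escaping applied. Because there are no capture groups, group() is already the wanted text, and for the quoted form the surrounding quotes are not part of the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderNames {
    private static final Pattern NAME = Pattern.compile(
            "(?<=\\$\\{\")[a-z]+(?=\"\\})"                           // ${"word"} gives word
            + "|(?<=\\$\\{)[a-z]+\\.[a-z]+(?:\\(\\))?(?=\\})");      // ${a.b} or ${a.b()}

    static List<String> names(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [test.one, test.two(), hello]
        System.out.println(names("${test.one}${test.two()}${\"hello\"}"));
    }
}
```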

How do I write a multi-regex line?

I'm trying to write a line of regex that performs the following:
A string variable that can contain only:
The letters a to z (upper and lowercase) (zero or many times)
The hyphen character (zero or many times)
The single quote character (zero or one time)
The space character (zero or one time)
Tried searching through many regex websites
.matches("([a-zA-Z_0-9']*(\\s)?)(-)?"))
This allows close to what I want; however, you can't type a-z anymore after you have typed a space character, so it's sequential in a way. I want the validation to allow any sequence of those characters.
Expected:
Allowed to type a string that has any amount of a-zA-Z, zero to one space, zero to one dash, anywhere throughout the string.
This is a validation for that
"^(?!.*\\s.*\\s)(?!.*'.*')[a-zA-Z'\\s-]*$"
Expanded
^ # Begin
(?! .* \s .* \s ) # Max single whitespace
(?! .* ' .* ' ) # Max single, single quote
[a-zA-Z'\s-]* # Optional a-z, A-Z, ', whitespace or - characters
$ # End
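A quick way to try the expression above is to wrap it in a small helper and feed it a few inputs. This is a sketch only: the class name NameValidator, the method isValid and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class NameValidator {
    // Two negative lookaheads cap whitespace and single quotes at one each;
    // the character class then limits which characters may appear at all.
    private static final Pattern VALID =
            Pattern.compile("^(?!.*\\s.*\\s)(?!.*'.*')[a-zA-Z'\\s-]*$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Mary-Jane O'Neil", "abc def", "a b c", "O'Neil's", "abc1", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

With these samples, the first two and the empty string are accepted, while the inputs with two spaces, two quotes or a digit are rejected. Letters after the space are fine, which was the problem with the original attempt.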
I guess
^(?!.*([ ']).*\1)[A-Za-z' -]*$
might work OK (in a Java string literal, write the backreference as \\1).
Here,
(?!.*([ ']).*\1)
says that if a space or a single quote (') appears twice in the string, the string is rejected, so only strings with zero or one occurrence of each are kept.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "^(?!.*([ ']).*\\1)[A-Za-z' -]*$";
        final String string = "abcAbc- ";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Output
Full match: abcAbc-
Group 1: null
If you wish to simplify, modify or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.

Regex that allows only single separators between words

I need to construct a regular expression such that it should not allow / at the start or end, and there should not be more than one / in sequence.
Valid Expression is: AB/CD
Valid Expression :AB
Invalid Expression: //AB//CD//
Invalid Expression: ///////
Invalid Expression: AB////////
The / character is just a separator between two words. Its length should not be more than one between words.
Assuming you only want to allow alphanumerics (including underscore) between slashes, it's pretty trivial:
boolean foundMatch = subject.matches("\\w+(?:/\\w+)*");
Explanation:
\w+ # Match one or more alnum characters
(?: # Start a non-capturing group
/ # Match a single slash
\w+ # Match one or more alnum characters
)* # Match that group any number of times
This regex does it:
^(?!/)(?!.*//).*[^/]$
So in java:
if (str.matches("(?!/)(?!.*//).*[^/]"))
Note that ^ and $ are implied by matches(), because matches must match the whole string to be true.
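To compare the two expressions above against the examples from the question, a sketch like the following can help. The class name SlashSeparated and the method names wordsOnly and anyChars are invented for illustration.

```java
class SlashSeparated {
    // Word characters only, separated by single slashes.
    static boolean wordsOnly(String s) {
        return s.matches("\\w+(?:/\\w+)*");
    }

    // Any characters, as long as there is no leading, trailing or doubled slash.
    static boolean anyChars(String s) {
        return s.matches("(?!/)(?!.*//).*[^/]");
    }

    public static void main(String[] args) {
        String[] samples = {"AB/CD", "AB", "//AB//CD//", "///////", "AB////////", "A-B/C D"};
        for (String s : samples) {
            System.out.println(s + " wordsOnly=" + wordsOnly(s) + " anyChars=" + anyChars(s));
        }
    }
}
```

Both accept AB/CD and AB and reject the three invalid examples. They differ on the last sample, A-B/C D, which only the second expression accepts because it does not restrict the characters between slashes.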
[a-zA-Z]+(/[a-zA-Z]+)+
It matches
a/b
a/b/c
aa/vv/cc
It does not match
a
/a/b
a//b
a/b/
Demo
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Reg {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[a-zA-Z]+(/[a-zA-Z]+)+");
        Matcher matcher = pattern.matcher("a/b/c");
        System.out.println(matcher.matches());
    }
}

Regex for numeric portion of Java string

I'm trying to write a Java method that will take a string as a parameter and return another string if it matches a pattern, and null otherwise. The pattern:
Starts with a number (1+ digits); then followed by
A colon (":"); then followed by
A single whitespace (" "); then followed by
Any Java string of 1+ characters
Hence, some valid strings that match this pattern:
50: hello
1: d
10938484: 394958558
And some strings that do not match this pattern:
korfed49
: e4949
6
6:
6:sdjjd4
The general skeleton of the method is this:
public String extractNumber(String toMatch) {
    // If toMatch matches the pattern, extract the first number
    // (everything prior to the colon).
    // Else, return null.
}
Here's my best attempt so far, but I know I'm wrong:
public String extractNumber(String toMatch) {
    // If toMatch matches the pattern, extract the first number
    // (everything prior to the colon).
    String regex = "???";
    if (toMatch.matches(regex))
        return toMatch.substring(0, toMatch.indexOf(":"));
    // Else, return null.
    return null;
}
Thanks in advance.
Your description is spot on, now it just needs to be translated to a regex:
^ # Starts
\d+ # with a number (1+ digits); then followed by
: # A colon (":"); then followed by
# A single whitespace (" "); then followed by
\w+ # Any word character, one or more times
$ # (followed by the end of input)
Giving, in a Java string:
"^\\d+: \\w+$"
You also want to capture the numbers: put parentheses around \d+, use a Matcher, and capture group 1 if there is a match:
private static final Pattern PATTERN = Pattern.compile("^(\\d+): \\w+$");
// ...
public String extractNumber(String toMatch) {
    Matcher m = PATTERN.matcher(toMatch);
    return m.find() ? m.group(1) : null;
}
Note: in Java, \w matches only ASCII letters, digits and the underscore (this is not the case for .NET languages, for instance). If you don't want the underscore, you can use (Java-specific syntax):
[\w&&[^_]]
instead of \w for the last part of the regex, giving:
"^(\\d+): [\\w&&[^_]]+$"
Try using the following: \d+: \w+
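The question asks for "any Java string of 1+ characters" after the space, which is broader than \w+. If that wording is meant literally, one possible variant uses .+ for the tail. This is a sketch under that assumption, not the method from the answers above: the class name NumberPrefix is made up, and note that . does not match line terminators unless Pattern.DOTALL is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberPrefix {
    // Digits, a colon, one space, then at least one more character of any kind.
    private static final Pattern PATTERN = Pattern.compile("(\\d+): .+");

    static String extractNumber(String toMatch) {
        Matcher m = PATTERN.matcher(toMatch);
        // matches() requires the whole input to fit, so no ^ or $ is needed
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractNumber("50: hello world")); // prints 50
        System.out.println(extractNumber("6:sdjjd4"));        // prints null
    }
}
```

With \w+ the first call would return null, because the space inside hello world is not a word character.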

Create a string-capable Guava Splitter

I would like to create a Guava Splitter for Java that can handle Java strings as one block. For instance, I would like the following assertion to be true:
@Test
public void testSplitter() {
    String toSplit = "a,b,\"c,d\\\"\",e";
    List<String> expected = ImmutableList.of("a", "b", "c,d\"","e");
    Splitter splitter = Splitter.onPattern(...);
    List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
    assertEquals(expected, actual);
}
I can write the regex to find all the elements and don't consider the ',' but I can't find the regex that would act as a separator to be used with a Splitter.
If it's impossible, please just say so, then I'll build the list from the findAll regex.
This seems like something you should use a CSV library such as opencsv for. Separating values and handling cases like quoted blocks are what they're all about.
This is a Guava feature request: http://code.google.com/p/guava-libraries/issues/detail?id=412
I have the same problem (except that I don't need to support escaping of the quote character). I don't like to include another library for such a simple thing, and then I came to the idea that I need a mutable CharMatcher. As with Bart Kiers's solution, it keeps the quote character.
public static Splitter quotableComma() {
    return Splitter.on(new CharMatcher() {
        // Toggled on every quote character seen so far
        private boolean inQuotes = false;
        @Override
        public boolean matches(char c) {
            if ('"' == c) {
                inQuotes = !inQuotes;
            }
            if (inQuotes) {
                return false;
            }
            return (',' == c);
        }
    });
}
@Test
public void testQuotableComma() throws Exception {
    String toSplit = "a,b,\"c,d\",e";
    List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
    Splitter splitter = Splitters.quotableComma();
    List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
    assertEquals(expected, actual);
}
You could split on the following pattern:
\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)
which might look (a bit) friendlier with the (?x) flag:
(?x) # enable comments, ignore space-literals
\s*,\s* # match a comma optionally surrounded by space-chars
(?= # start positive look ahead
( # start group 1
( # start group 2
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 2, and repeat it zero or more times
" # match a quote
( # start group 3
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 3, and repeat it zero or more times
" # match a quote
)* # end group 1, and repeat it zero or more times
( # open group 4
\\["\\] # match an escaped quote or backslash
| # OR
[^"\\] # match any char other than a quote or backslash
)* # end group 4, and repeat it zero or more times
$ # match the end-of-input
) # end positive look ahead
But even in this commented-version, it still is a monster. In plain English, this regex could be explained as follows:
Match a comma that is optionally surrounded by space-chars, only when looking ahead of that comma (all the way to the end of the string!), there are zero or an even number of quotes while ignoring escaped quotes or escaped backslashes.
So, after seeing this, you might agree with ColinD (I do!) that using some sort of a CSV parser is the way to go in this case.
Note that the regex above will leave the quotes around the tokens, i.e., the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows:
a
b
"c,d\""
e
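To see the pattern in action without pulling in Guava, it can be tried with the JDK's own Pattern.split; the same pattern string can then be passed to Splitter.onPattern. The sketch below is an illustration under assumptions: the class name QuoteAwareSplit and the UNIT constant are made up, and the capturing groups of the original are written as non-capturing ones, which does not change what is matched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QuoteAwareSplit {
    // One unit: an escaped quote or backslash, or any char that is neither.
    private static final String UNIT = "(?:\\\\[\"\\\\]|[^\"\\\\])";

    // A comma (with optional surrounding whitespace) that is followed by
    // an even number of unescaped quotes up to the end of the input.
    private static final Pattern COMMA_OUTSIDE_QUOTES = Pattern.compile(
            "\\s*,\\s*(?=(?:" + UNIT + "*\"" + UNIT + "*\")*" + UNIT + "*$)");

    static List<String> split(String input) {
        return Arrays.asList(COMMA_OUTSIDE_QUOTES.split(input));
    }

    public static void main(String[] args) {
        for (String token : split("a,b,\"c,d\\\"\",e")) {
            System.out.println(token);
        }
    }
}
```

Keep in mind that Pattern.split drops trailing empty strings, and that the lookahead rescans the rest of the input for every comma, so this is better suited to short lines than to large inputs.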
Improving on @Rage-Steel's answer a bit.
static final CharMatcher notQuoted = new CharMatcher() {
    // Toggled on every quote character seen so far
    private boolean inQuotes = false;
    @Override
    public boolean matches(char c) {
        if ('"' == c) {
            inQuotes = !inQuotes;
        }
        return !inQuotes;
    }
};
static final Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|"))).trimResults().omitEmptyStrings();
And then,
public static void main(String[] args) {
    final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";
    List<String> sputnik = SPLITTER.splitToList(toSplit);
    for (String s : sputnik)
        System.out.println(s);
}
Pay attention to thread safety (or, to put it simply, there isn't any): the matcher keeps mutable state in inQuotes.
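One way around the shared inQuotes flag is to keep the quote state in a local variable, so that every call starts fresh and concurrent calls cannot interfere. The following JDK-only sketch is an alternative technique, not the Guava approach from the answers above: the class name QuotedCommaSplitter is made up, it splits on commas only, keeps the quote characters like the answers do, and does not handle escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCommaSplitter {
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // local, so no state survives between calls
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ',' && !inQuotes) {
                parts.add(current.toString()); // comma outside quotes ends a token
                current.setLength(0);
            } else {
                current.append(c);             // everything else, quotes included, is kept
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,\"c,d\",e")); // prints [a, b, "c,d", e]
    }
}
```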
