I'm trying to write a Java regex to match a comma delimited list of interfaces. Something like:
Runnable, Serializable, List, Map
There can be zero or more entries in the list. A trailing comma is invalid. Whitespace is optional. I came up with the following, which handles one or more entries, and then I check for the empty case separately:
String validName = "[a-zA-Z_][a-zA-Z0-9_]*";
String regex = validName + "\\s*(,\\s*" + validName + ")*";
if (s.matches(regex) || s.trim().isEmpty())
...
But is there a way to include the "zero entries" condition into the regex?
To make a pattern optional, wrap it in a group and apply the ? quantifier to that group:
String regex = "(?:" + validName + "\\s*(?:,\\s*" + validName + ")*)?";
// The outer (?: ... )? group makes the whole list optional
if (s.matches(regex)) {
// ....
}
The greedy ? quantifier matches one or zero occurrences of the pattern it is applied to (greedy means it prefers one occurrence over zero).
The (?: character sequence opens a non-capturing group, i.e. this is how you use "normal parentheses" for logically grouping sections of a regular expression without capturing them.
You may add \\s* subpatterns at the start/end of the pattern to allow leading/trailing whitespace.
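As a minimal sketch of the optional-group approach with surrounding whitespace allowed: the class name InterfaceListDemo and the helper isValidList are made up for illustration, and allowing whitespace after every name (not only the first) is an assumption that goes slightly beyond the pattern above.

```java
import java.util.regex.Pattern;

class InterfaceListDemo {
    // Same identifier pattern as in the question.
    static final String VALID_NAME = "[a-zA-Z_][a-zA-Z0-9_]*";

    // The outer (?: ... )? makes the whole list optional, so an empty or
    // blank string matches. Assumption: whitespace may surround every name.
    static final Pattern LIST = Pattern.compile(
            "\\s*(?:" + VALID_NAME + "\\s*(?:,\\s*" + VALID_NAME + "\\s*)*)?");

    // Hypothetical helper: true if s is a valid (possibly empty) list.
    static boolean isValidList(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", "Runnable", "Runnable, Serializable, List, Map", "Runnable,", ", Map"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" valid: " + isValidList(s));
        }
    }
}
```

With this variant the separate s.trim().isEmpty() check is no longer needed, because the blank case is covered by the pattern itself.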
Try regex = "^$|" + regex; Here ^$ means "nothing between the beginning of the input and the end of the input", so ^$| means "either match nothing or match whatever matches the rest of the regex".
Input
example("This is tes't")
example('This is the tes\"t')
Output should be
This is tes't
This is the tes"t
Code
String text = "example(\"This is tes't\")";
//String text = "$.i18nMessage('This is the tes\"t\')";
final String quoteRegex = "example.*?(\".*?\")?('.*?')?";
Matcher matcher0 = Pattern.compile(quoteRegex).matcher(text);
while (matcher0.find()) {
System.out.println(matcher0.group(1));
System.out.println(matcher0.group(2));
}
I see output as
null
null
Though when I use the regex example.*?(\".*?\") it returns This is tes't, and when I use example.*?('.*?') it returns
This is the tes"t, but when I combine both as example.*?(\".*?\")?('.*?')? it returns null. Why?
The .*?(\".*?\")?('.*?')? subpattern sequence at the end of your regex can match an empty string (all three parts are optional: the lazy .*? can match zero characters, and both groups are quantified with ?). After matching example, the .*? is skipped at first, and is only expanded once the subsequent subpatterns do not match. However, they both match an empty string before (, so the groups never participate in the match and you only have example in matcher0.group(0).
Use either an alternation that makes group 1 obligatory (demo):
Pattern.compile("example.*?(\".*?\"|'.*?')")
Or a variant with a tempered greedy token (demo) that allows to get rid of the alternation:
Pattern.compile("example.*?(([\"'])(?:(?!\\2).)*\\2)")
Or, better, support escaped sequences (another demo):
Pattern.compile("example.*?(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')")
In all 3 examples, you only need to access Group 1. If there can only be ( between example and " or ', you should replace .*? with \( (written \\( in a Java string literal), since it will make matching safer. That said, it is never entirely safe to use a regex to match string literals (at least, with a single regex).
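A small sketch of the escape-aware variant, assuming that only ( may appear between example and the opening quote. The class QuotedArgDemo and the helper firstLiteral are hypothetical names; note that Group 1 still includes the surrounding quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedArgDemo {
    // example( followed by a double- or single-quoted literal. The \\\\. part
    // lets a backslash-escaped character appear inside the literal.
    static final Pattern QUOTED = Pattern.compile(
            "example\\((\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')");

    // Hypothetical helper: returns the first literal with its quotes, or null.
    static String firstLiteral(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstLiteral("example(\"This is tes't\")"));
        System.out.println(firstLiteral("example('This is the tes\"t')"));
    }
}
```

Because the alternation makes Group 1 mandatory, the match can no longer succeed with both quoted parts left empty.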
I found the following question in one Java test suite
Pattern p = Pattern.compile("[wow]*");
Matcher m = p.matcher("wow its cool");
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
where the output seems to be as follows
0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""
Up to the last match it is clear: the pattern [wow]* greedily matches 0 or more 'w' and 'o' characters, while for non-matching characters, including spaces, it produces empty strings. However, after the empty match at the final 'l' (11 ""), the following 12 "" is unclear to me. The test solution gives no details on this, and I was not able to work it out definitively from the Javadoc. My best guess is that it is some boundary character, but I would appreciate it if someone could provide an explanation.
The reason that you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you would see a single match at position zero:
Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
Matcher m = p.matcher(""); // Passing an empty string results in a valid match that is empty
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
this would print 0 "" because an empty string is as good a match as any other match for the expression.
Going back to your example: after each match, the engine continues from where that match ended, and after an empty match it first steps forward by a single character so that it does not report the same empty match again. "Stepping forward" means that the engine considers the "tail" of the string at the next position. This includes the moment right after the empty match at position 11, i.e. at the very last character: the next position is 12, where the "tail" consists of an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.
The engine considers an empty string a valid input, and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
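The behavior described above can be reproduced with a small sketch. The class EmptyMatchDemo and the helper allMatches are illustrative names; the redundant second 'w' is dropped from the character class, and the [wo]+ line is added only as a contrast, since + cannot match the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Hypothetical helper: lists every match as "start:text".
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // * allows empty matches, including one at index 12 (end of input).
        System.out.println(allMatches("[wo]*", "wow its cool"));
        // + requires at least one character, so only "wow" and "oo" remain.
        System.out.println(allMatches("[wo]+", "wow its cool"));
    }
}
```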
[wow]* matches the first wow string. Count = 1.
Because of the * (zero or more) next to the character class, [wow]* also matches the empty string that exists before any character the class does not match. So it matches the empty string just before the first space. Count = 2.
its is not matched by the regex, so it matches the empty string that exists before each of its characters. Count = 2+3 = 5.
The second space is not matched by the regex either, so we get another empty match. Count = 5+1 = 6.
c is not matched by the regex, so it matches the empty string just before the c. Count = 6+1 = 7.
oo is matched by [wow]*, and this is considered one match. Count = 7+1 = 8.
l is not matched, which gives an empty match before it. Count = 9.
Finally, it matches the empty string that exists after the last character. Count = 9+1 = 10.
Note that m.start() prints the starting index of each of these matches.
The regex is simply matching the pattern against the input, starting at a given offset.
For the last match, the offset of 12 is at the point after the last character of 'cool' - you might think this is the end of the string and therefore cannot be used for matching purposes - but you'd be wrong. For pattern-matching, this is a perfectly valid starting point.
As you state, your regex allows a match of zero characters, and indeed this is what happens after the last character but before the end of the string (the position usually represented by $ in a regex).
To put it another way, without testing past the end of the last character, it would mean no matches would ever occur relating to the end of the string - but there are many regex constructs that match the end of the string (and you've shown one of them here).
I'm trying to write a regular expression that checks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])"
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
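To see that lookarounds check the surrounding characters without including them in the match, here is a minimal sketch. The class LookaroundDemo, the helper wordStarts and the sample input are made up for illustration; the lookahead accepts any whitespace character or an opening parenthesis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: start indexes where word occurs, preceded by a
    // non-word character or the start of input, and followed by whitespace or (.
    static List<Integer> wordStarts(String word, String input) {
        Pattern p = Pattern.compile("(?<=\\W|^)" + Pattern.quote(word) + "(?=[\\s(])");
        Matcher m = p.matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // m.group() is exactly the word: lookarounds consume nothing.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "sprint" is skipped because the 's' before "print" is a word character.
        System.out.println(wordStarts("print", "print(x); sprint (y) print z"));
    }
}
```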
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
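The difference between the two leading conditions only shows up when the word starts with a non-word character. A rough sketch, with BoundaryDemo, its two helpers and the sample inputs all being illustrative assumptions:

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Leading condition expressed as a word boundary.
    static boolean foundWithBoundary(String word, String input) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "(?=[\\s(])")
                .matcher(input).find();
    }

    // Leading condition expressed as a negative lookbehind.
    static boolean foundWithLookbehind(String word, String input) {
        return Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])")
                .matcher(input).find();
    }

    public static void main(String[] args) {
        // Word starting with a word character: both variants agree.
        System.out.println(foundWithBoundary("foo", "x = foo(1)"));
        System.out.println(foundWithLookbehind("foo", "x = foo(1)"));
        // "$var" starts with a non-word character: there is no word boundary
        // between the space and '$', so only the lookbehind variant finds it.
        System.out.println(foundWithBoundary("$var", "x = $var(1)"));
        System.out.println(foundWithLookbehind("$var", "x = $var(1)"));
    }
}
```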
If you don't want a group to be captured by the matching, you can use the special construct (?:X)
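So, in your case (with the stray spaces around | removed, since they would be matched literally):
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will then have only two groups: group(0) for the whole match (which still includes the surrounding characters) and group(1) for the word you are looking for.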
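A short sketch of what the groups contain with non-capturing parentheses. NonCapturingDemo and wholeAndWord are hypothetical names, the word is quoted with Pattern.quote as an extra precaution, and the alternation is written without spaces around |.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Hypothetical helper: returns { group(0), group(1) } for the first
    // match, or null if the word is not found.
    static String[] wholeAndWord(String word, String input) {
        Pattern p = Pattern.compile("(?:\\W?)(" + Pattern.quote(word) + ")(?:\\s|\\()");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = wholeAndWord("foo", "x foo(1)");
        // The non-capturing groups still consume their characters, so the
        // whole match is wider than the captured word.
        System.out.println("[" + groups[0] + "] [" + groups[1] + "]");
    }
}
```

This is the main contrast with lookarounds: the surrounding characters are not captured, but they remain part of the overall match.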
I want to remove all the leading and trailing punctuation in a string. How can I do this?
Basically, I want to preserve punctuation in between words, and I need to remove all leading and trailing punctuation.
., #, _, &, /, - are allowed if surrounded by letters or digits
\' is allowed if preceded by a letter or digit
I tried
Pattern p = Pattern.compile("(^\\p{Punct})|(\\p{Punct}$)");
Matcher m = p.matcher(term);
boolean a = m.find();
if(a)
term=term.replaceAll("(^\\p{Punct})", "");
but it didn't work!!
OK. So basically you want to find some pattern in your string and act if the pattern is matched.
Doing this the naive way would be tedious. The naive solution could involve something like
while (myString.startsWith(".") || myString.startsWith(",") || myString.startsWith(";") /* || ... */)
    myString = myString.substring(1);
If you wanted to do a slightly more complex task, it could even be impossible to do it the way I mentioned.
That's why we use regular expressions. A regular expression is a "language" in which you can define a pattern; the computer can then tell whether a string matches that pattern. To learn about regular expressions, just search for them online. One of the first links: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
As for your problem, you could try this:
myString.replaceFirst("^[^a-zA-Z]+", "")
The meaning of the regex:
The first ^ means that what comes next has to be at the start of the string.
The [] define a character class. In this case, it contains everything that is NOT (the second ^) a letter (a-zA-Z); add 0-9 to the class if leading digits should be kept.
The + sign means that the thing before it must occur one or more times.
You can use a similar regex to remove trailing chars.
myString.replaceAll("[^a-zA-Z]+$", "");
The $ means "at the end of the string".
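Putting the leading and trailing replacements together might look like the sketch below. TrimEndsDemo and trimEnds are illustrative names, and keeping digits (0-9 in the class) is an assumption based on the question's mention of letters or digits.

```java
class TrimEndsDemo {
    // Strips every leading and trailing character that is not an ASCII
    // letter or digit; characters in the middle are left untouched.
    static String trimEnds(String s) {
        return s.replaceFirst("^[^a-zA-Z0-9]+", "")
                .replaceFirst("[^a-zA-Z0-9]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trimEnds("...hello, world!!"));
        System.out.println(trimEnds("#don't#"));
    }
}
```

Note that this removes leading and trailing whitespace as well, since whitespace is not a letter or digit.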
You could use a regular expression:
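import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern PATTERN =
        Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");

public static String trimPunctuation(String s) {
    Matcher m = PATTERN.matcher(s);
    // Return the input unchanged if nothing matches (e.g. it contains a line break)
    return m.find() ? m.group(1) : s;
}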
The boundary matchers ^ and $ ensure the whole input is matched.
A dot . matches any single character.
A star * means "match the preceding thing zero or more times".
The parentheses () define a capturing group whose value is retrieved by calling Matcher.group(1).
The ? in (.*?) means you want the match to be non-greedy, otherwise the trailing punctuation would be included in the group.
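The effect of the lazy quantifier can be checked by comparing it with the greedy version. In this sketch, LazyGroupDemo, capture and the sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyGroupDemo {
    // Hypothetical helper: returns capturing group 1 of the first match, or null.
    static String capture(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Lazy group: stops as soon as the rest can be trailing punctuation.
        System.out.println(capture("^\\p{Punct}*(.*?)\\p{Punct}*$", "!hi!!")); // prints hi
        // Greedy group: swallows the trailing punctuation as well.
        System.out.println(capture("^\\p{Punct}*(.*)\\p{Punct}*$", "!hi!!"));  // prints hi!!
    }
}
```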
Use this tutorial on patterns. You have to create a regex that matches a string starting and ending with a letter or digit, and then call inputString.matches("regex").
I've got string parts which match the following pattern.
abcd|(|a|ab|abc)e(fghi|(|f|fg|fgh)jklmn)
But the problem is that my whole string is a repeated combination of parts like these, and the whole string must contain more than 14 such parts.
Can anyone help me improve the above regex to do this?
Thanks
Update
Input examples:
Matched string parts : abcd, abefgjkln, efjkln, ejkln
But whole string is : abcdabefgjklnefjklnejkln (Combination of above 4 parts)
There must be more than 15 parts in the whole string. The one above has only 4 parts, so it is invalid.
This will try to match your "parts" at least 15 times in a string.
boolean foundMatch = false;
try {
foundMatch = subjectString.matches("(?:(?:ab(?:cd|efgjkln))|(?:(?:ef?jkln))){15,}");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
If the whole string consists of at least 15 repetitions of any of the above parts, foundMatch will be true; otherwise it will remain false.
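A rough way to exercise the {15,} quantifier is sketched below. The regex is the same as above with the redundant inner groups removed; RepeatedPartsDemo, hasEnoughParts and repeat are hypothetical names, and the test strings are built from the parts listed in the question.

```java
import java.util.regex.Pattern;

class RepeatedPartsDemo {
    // Either "ab" followed by "cd" or "efgjkln", or "e", optional "f", "jkln";
    // the whole alternative must repeat at least 15 times.
    static final Pattern PARTS =
            Pattern.compile("(?:ab(?:cd|efgjkln)|ef?jkln){15,}");

    // matches() requires the entire string to consist of such parts.
    static boolean hasEnoughParts(String s) {
        return PARTS.matcher(s).matches();
    }

    // Small helper so the sketch does not depend on String.repeat (Java 11+).
    static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fourParts = "abcdabefgjklnefjklnejkln";              // 4 parts
        System.out.println(hasEnoughParts(fourParts));              // too few parts
        System.out.println(hasEnoughParts(repeat(fourParts, 4)));   // 16 parts
    }
}
```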
Breakdown:
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"(?:" + // Match the regular expression below
"ab" + // Match the characters “ab” literally
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"cd" + // Match the characters “cd” literally
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"efgjkln" + // Match the characters “efgjkln” literally
")" +
")" +
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"(?:" + // Match the regular expression below
"(?:" + // Match the regular expression below
"e" + // Match the character “e” literally
"f" + // Match the character “f” literally
"?" + // Between zero and one times, as many times as possible, giving back as needed (greedy)
"jkln" + // Match the characters “jkln” literally
")" +
")" +
"){15,}" // Between 15 and unlimited times, as many times as possible, giving back as needed (greedy)
What about this:
(?:a(?:b(?:c(?:d)?)?)?ef(?:g(?:h(?:i)?)?)?jklmn){15,}
Explanation: you create a non-capturing group (with (?: ... )), and say that this should be repeated >=15 times, hence the curly braces in the end.
First, it seems that your pattern can be simplified. Pattern a is really a prefix of ab, which is a prefix of abc, so wherever abc matches, a matches too. Think about this and change your pattern appropriately. Right now it is probably not what you really want.
Second, to repeat something in a pattern use {N}. The quantifier applies only to the element immediately before it, so abc{5} means "ab followed by five c's", while (abc){5} means "abc repeated five times". You can also use {3,} and {3,5}, which mean repeat>=3 and 3<=repeat<=5. Java does not accept {,5}; write {0,5} for repeat<=5.
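A quick check of how far a counted quantifier reaches in Java: it binds only to the element immediately before it, so a group is needed to repeat a whole sequence. The class QuantifierScopeDemo and its helpers are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuantifierScopeDemo {
    // True if the whole string s matches regex.
    static boolean full(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    // True if Java accepts regex as a valid pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(full("abc{5}", "abccccc"));                 // {5} applies to c only
        System.out.println(full("(?:abc){5}", "abcabcabcabcabc"));     // group repeated 5 times
        System.out.println(compiles("a{0,5}"));                        // lower bound is required
        System.out.println(compiles("a{,5}"));                         // rejected by java.util.regex
    }
}
```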