How do you match more than one space character in Java regex?
I have a regex I am trying to match. The regex fails when I have two or more space characters.
public static void main(String[] args) {
String pattern = "\\b(fruit)\\s+([^a]+\\w+)\\b"; //Match 'fruit' not followed by a word that begins with 'a'
String str = "fruit apple"; //One space character will not be matched
String str_fail = "fruit  apple"; //Two space characters will be matched
System.out.println(preg_match(pattern,str)); //False (That's what I want)
System.out.println(preg_match(pattern,str_fail)); //True (Regex fail)
}
public static boolean preg_match(String pattern,String subject) {
Pattern regex = Pattern.compile(pattern);
Matcher regexMatcher = regex.matcher(subject);
return regexMatcher.find();
}
The problem is actually because of backtracking. Your regex:
"\\b(fruit)\\s+([^a]+\\w+)\\b"
Says "fruit, followed by one or more whitespace characters, followed by one or more non-'a' characters, followed by one or more 'word' characters". It fails with two spaces because \s+ first matches both spaces, then gives the second one back when [^a]+ cannot match the 'a' of "apple". The second space then satisfies [^a]+, while the first still satisfies \s+.
I think you can fix it by using the possessive quantifier instead, which would be \s++. This tells \s not to give back the second space character. You can find the documentation on Java's quantifiers here.
As an illustration, here are two examples at Rubular:
Using the possessive quantifier on \s (gives expected results, from what you describe)
Your current regex with separate groupings around [^a]+ and \w+. Notice that the second match group (representing the [^a]+) is capturing the second space character.
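If you would rather check this locally than in an online tester, here is a minimal sketch that runs the greedy pattern and the possessive variant against the same inputs. The class name PossessiveDemo, the helper found and the extra input "fruit  banana" are made up for this illustration.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy \s+ can give a space back to [^a]+ when the match would otherwise fail.
    static final Pattern GREEDY = Pattern.compile("\\b(fruit)\\s+([^a]+\\w+)\\b");
    // Possessive \s++ keeps every space it matched, so [^a]+ must start at the next word.
    static final Pattern POSSESSIVE = Pattern.compile("\\b(fruit)\\s++([^a]+\\w+)\\b");

    static boolean found(Pattern p, String subject) {
        return p.matcher(subject).find();
    }

    public static void main(String[] args) {
        String oneSpace = "fruit apple";
        String twoSpaces = "fruit  apple";
        System.out.println(found(GREEDY, oneSpace));             // false
        System.out.println(found(GREEDY, twoSpaces));            // true (the unwanted match)
        System.out.println(found(POSSESSIVE, oneSpace));         // false
        System.out.println(found(POSSESSIVE, twoSpaces));        // false
        System.out.println(found(POSSESSIVE, "fruit  banana"));  // true
    }
}
```

The last line is there to show that the possessive version still matches when the following word does not begin with 'a'.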
String s = #Section250342,Main,First/HS/12345/Jack/M,2000 10.00,
#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,
#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,
#Section251234,Main,First/HS/12345/Jack/M,2000 11.00
Wherever /Jack/M appears in the string, I want to pull the section numbers (250342, 251234) and the values (10.00, 11.00) associated with it using regex each time.
I tried something like this https://regex101.com/r/4te0Lg/1 but it still does not work.
.Section(\d+(?:\.\d+)?).*/Jack/M
If the only parts of each section that change are the section number, the name of the person and the last value (like in your example) then you can make a pattern very easily by using one of the sections where Jack appears and replacing the numbers you want by capturing groups.
Example:
#Section250342,Main,First/HS/12345/Jack/M,2000 10.00
becomes,
#Section(\d+),Main,First/HS/12345/Jack/M,2000 (\d+\.\d{2})
If the section substring keeps the format but the other parts of it may change then just replace the rest like this:
#Section(\d+),\w+,(?:\w+/)*Jack/M,\d+ (\d+\.\d{2})
I'm assuming that "Main" is a class, "First/HS/..." is a path and that the last value always has 2 and only 2 decimal places.
\d - A digit: [0-9]
\w - A word character: [a-zA-Z_0-9]
+ - one or more times
* - zero or more times
{2} - exactly 2 times
() - a capturing group
(?:) - a non-capturing group
For reference see: https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/regex/Pattern.html
Simple Java example on how to get the values from the capturing groups using java.util.regex.Pattern and java.util.regex.Matcher
import java.util.regex.*;
public class GetMatch {
public static void main(String[] args) {
String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
Pattern p = Pattern.compile("#Section(\\d+),\\w+,(?:\\w+/)*Jack/M,\\d+ (\\d+\\.\\d{2})"); //the dot is escaped so it only matches a literal '.'
Matcher m;
String[] tokens = s.split(",(?=#)"); //split the sections into different strings
for(String t : tokens) //checks every string that we got with the split
{
m = p.matcher(t);
if(m.matches()) //if the string matches the pattern then print the capturing groups
System.out.printf("Section: %s, Value: %s\n", m.group(1), m.group(2));
}
}
}
You could use 2 capture groups, and use a tempered greedy token approach to not cross #Section followed by a digit.
#Section(\d+)(?:(?!#Section\d).)*\bJack/M,\d+\h+(\d+(?:\.\d+)?)\b
Explanation
#Section(\d+) Match #Section and capture 1+ digits in group 1
(?:(?!#Section\d).)* Match any character, repeatedly, as long as #Section followed by a digit does not start at that character
\bJack/M, Match the word Jack and /M,
\d+\h+ Match 1+ digits and 1+ spaces
(\d+(?:\.\d+)?) Capture group 2, match 1+ digits and an optional decimal part
\b A word boundary
Regex demo
In Java:
String regex = "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b";
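For completeness, a minimal sketch of how that regex could be used with Matcher.find() to collect both capture groups. The class name JackSections, the method extract and the "section=value" output format are made up for this example; the input is the single-line string from the other answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JackSections {
    static final Pattern P = Pattern.compile(
        "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b");

    // Returns one "section=value" entry for every section that contains Jack/M.
    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + "=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,"
                 + "#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,"
                 + "#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,"
                 + "#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
        System.out.println(extract(s)); // [250342=10.00, 251234=11.00]
    }
}
```

Because the tempered token never crosses into the next #Section, no split() is needed before matching.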
I'm currently trying to split a string into a multi-line string.
The regex should select whitespace characters that have at least 13 characters before them.
The problem is that the 13-character count does not reset after the previously selected whitespace. So, after the first 13 characters, the regex selects every whitespace character.
I'm using the following regex with a positive look-behind of 13 characters:
(?<=.{13})
(there is a whitespace at the end)
You can test the regex here and the following code:
import java.util.ArrayList;
public class HelloWorld{
public static void main(String []args){
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
for (String string : str.split("(?<=.{13}) ")) {
System.out.println(string);
}
}
}
The output of this code is as follows:
This is a test.
The
app
should
break
this
string
in
substring
on
whitespaces
after
13
characters
But it should be:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You may actually use a lazy limiting quantifier to match the lines and then replace with $0\n:
.{13,}?[ ]
See the regex demo
IDEONE demo:
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
System.out.println(str.replaceAll(".{13,}?[ ]", "$0\n"));
Note that the pattern matches:
.{13,}? - any character that is not a newline (if you need to match any character, use DOTALL modifier, though I doubt you need it in the current scenario), 13 times at least, and it can match more characters but up to the first space encountered
[ ] - a literal space (a character class is redundant, but it helps visualize the pattern).
The replacement pattern - "$0\n" - is re-inserting the whole matched value (it is stored in Group 0) and appends a newline after it.
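You can just match and capture at least 13 characters up to the next run of spaces (or the end of the string) rather than splitting. The quantifier has to be the lazy .{13,}?, because a fixed .{13} would capture only the 13 characters right before each run of spaces.
Java code:
Pattern p = Pattern.compile( "(.{13,}?)(?: +|$)" );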
Matcher m = p.matcher( text );
List<String> matches = new ArrayList<>();
while(m.find()) {
matches.add(m.group(1));
}
It will produce:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
RegEx Demo
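Here is a self-contained sketch of the match-and-capture approach. It assumes a lazy .{13,}? quantifier plus an end-of-string alternative so that the last line is captured as well; the class name WrapDemo and the method wrap are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapDemo {
    // At least 13 characters (as few as possible), then a run of spaces or the end of the input.
    static final Pattern LINE = Pattern.compile("(.{13,}?)(?: +|$)");

    static List<String> wrap(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            lines.add(m.group(1)); // group 1 excludes the separating spaces
        }
        return lines;
    }

    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        for (String line : wrap(str)) {
            System.out.println(line);
        }
    }
}
```

One limitation of this sketch: a final fragment shorter than 13 characters is not matched at all, so it would be dropped. If that can happen with your input, the remainder after the last match has to be appended separately.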
You can do this with .split and a regular expression. It would look like this:
line.split("\\s+");
This splits the line into words wherever there is a run of one or more whitespace characters.
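A small sketch of that call in context. The class name SplitWords and the method words are made up here, and the trim() is my addition: without it, a line that starts with whitespace yields an empty first element.

```java
import java.util.Arrays;

class SplitWords {
    static String[] words(String line) {
        // trim() first, otherwise leading whitespace produces an empty string at index 0.
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  to be \t or   not "))); // [to, be, or, not]
    }
}
```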
I'm new to regular expressions...
I need a regular expression that matches a string containing only:
0-9, a-z, A-Z, space, comma, and single quote?
If the string contains any character that is not in the list above, it is invalid.
Is that something like:
Pattern p = Pattern.compile("\\s[a-zA-Z0-9,']");
Matcher m = p.matcher("to be or not");
boolean b = m.lookingAt();
Thank you!
Fix your expression by adding anchors:
Pattern p = Pattern.compile("^\\s[a-zA-Z0-9,']+$");
Now you can call m.find() and be sure that it returns true only if the string consists solely of the listed characters.
BTW, is it a mistake that you put \\s at the beginning? It means the string must start with a single whitespace character. If that is not a requirement, just remove it.
You need to include the space inside the character class and allow more than one character:
Pattern p = Pattern.compile("[\\sa-zA-Z0-9,']*");
Matcher m = p.matcher("to be or not");
boolean b = m.matches();
Note that \s will match any whitespace character (including newlines, tabs, carriage returns, etc.) and not only the space character.
You probably want something like this:
"^[a-zA-Z0-9,' ]+$"
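To see that last pattern in action, here is a minimal sketch. The class name AllowedChars, the method isValid and the sample inputs are made up. Since matches() already requires the whole string to match, the ^ and $ anchors are left out here.

```java
import java.util.regex.Pattern;

class AllowedChars {
    // Letters, digits, comma, single quote and the plain space character only.
    static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9,' ]+");

    static boolean isValid(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("to be or not"));  // true
        System.out.println(isValid("it's 5, ok"));    // true
        System.out.println(isValid("to be?"));        // false ('?' is not allowed)
        System.out.println(isValid("a\tb"));          // false (tab is not the space character)
        System.out.println(isValid(""));              // false (use * instead of + to accept empty input)
    }
}
```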
input1="caused/VBN by/IN thyroid disorder"
Requirement: find the word "caused" that is followed by a slash and any number of capital letters, and is not followed by a space + "by/IN".
In the example above "caused/VBN" is followed by " by/IN", so 'caused' should not match.
input2="caused/VBN thyroid disorder"
"by/IN" doesn't follow caused, so it should match
regex="caused/[A-Z]+(?![\\s]+by/IN)"
caused/[A-Z]+ -- word 'caused' + / + one or more capital letters
(?![\\s]+by/IN) -- negative lookahead - not followed by whitespace and by/IN
Below is a simple method that I used to test
public static void main(String[] args){
String input = "caused/VBN by/IN thyroid disorder";
String regex = "caused/[A-Z]+(?![\\s]+by/IN)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while(matcher.find()){
System.out.println(matcher.group());
}
}
Output: caused/VB
I don't understand why my negative lookahead regex is not working.
You need to include a word boundary in your regular expression:
String regex = "caused/[A-Z]+\\b(?![\\s]+by/IN)";
Without it you can get a match, but not what you were expecting:
"caused/VBN by/IN thyroid disorder";
^^^^^^^^^
this matches because "N by" doesn't match "[\\s]+by"
The [A-Z]+ match will be adjusted (via backtracking) so that the negative lookahead succeeds: the engine gives back the final N, and "N by/IN" does not match \s+by/IN.
What you have to do is stop the backtracking so that [A-Z]+ keeps everything it matched.
This can be done a couple of different ways.
A positive lookahead, followed by a consumption
"caused(?=(/[A-Z]+))\\1(?!\\s+by/IN)"
A standalone sub-expression
"caused(?>/[A-Z]+)(?!\\s+by/IN)"
A possessive quantifier
"caused/[A-Z]++(?!\\s+by/IN)"
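A quick way to compare the original regex with the fixes suggested in this thread is to run them all against both inputs. This is only a sketch; the class name LookaheadFixes, the constant names and the helper firstMatch are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadFixes {
    static final String ORIGINAL   = "caused/[A-Z]+(?![\\s]+by/IN)";
    static final String BOUNDARY   = "caused/[A-Z]+\\b(?![\\s]+by/IN)";
    static final String LOOKAHEAD  = "caused(?=(/[A-Z]+))\\1(?!\\s+by/IN)";
    static final String ATOMIC     = "caused(?>/[A-Z]+)(?!\\s+by/IN)";
    static final String POSSESSIVE = "caused/[A-Z]++(?!\\s+by/IN)";

    // Returns the first match, or null when the regex does not match at all.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String followedByBy = "caused/VBN by/IN thyroid disorder";
        String notFollowed  = "caused/VBN thyroid disorder";
        // The original backtracks into [A-Z]+ and reports a partial tag.
        System.out.println(firstMatch(ORIGINAL, followedByBy)); // caused/VB
        for (String regex : new String[] {BOUNDARY, LOOKAHEAD, ATOMIC, POSSESSIVE}) {
            System.out.println(firstMatch(regex, followedByBy)); // null
            System.out.println(firstMatch(regex, notFollowed));  // caused/VBN
        }
    }
}
```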
How do I write a Pattern (Java) to match any sequence of characters except a given list of words?
I need to find if a given code has any text surrounded by tags like <span> besides a given list of words.
For example, I want to check if there are any other words besides "one" and "two" surrounded by the tag <span>.
"This is the first tag <span>one</span> and this is the third <span>three</span>"
The pattern should match the above string because the word "three" is surrounded by the tag <span> and is not part of the list of given words ("one", "two").
Look-ahead can do this:
\b(?!your|given|list|of|exclusions)\w+\b
Matches
a word boundary (start-of-word)
not followed by any of "your", "given", "list", "of", "exclusions"
followed by multiple word characters
followed by a word boundary (end-of-word)
In effect, this matches any word that is not excluded.
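One thing to watch with this pattern: the lookahead only inspects the beginning of the word, so a word that merely starts with an excluded word is rejected too. The sketch below compares it with a variant that adds \b inside the lookahead (my addition, not part of the pattern above). The class name ExcludeWords, the method words and the sample text are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExcludeWords {
    // Rejects every word that starts with "one" or "two".
    static final Pattern PREFIX = Pattern.compile("\\b(?!one|two)\\w+\\b");
    // Rejects only the exact words "one" and "two".
    static final Pattern WHOLE = Pattern.compile("\\b(?!(?:one|two)\\b)\\w+\\b");

    static List<String> words(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "one two three oneself";
        System.out.println(words(PREFIX, text)); // [three]
        System.out.println(words(WHOLE, text));  // [three, oneself]
    }
}
```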
This should get you started.
import java.util.*;
import java.util.regex.*;
// >(?!one<|two<)(\w+)</
//
// Match the character “>” literally «>»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!one<|two<)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «one<»
// Match the characters “one<” literally «one<»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «two<»
// Match the characters “two<” literally «two<»
// Match the regular expression below and capture its match into backreference number 1 «(\w+)»
// Match a single character that is a “word character” (letters, digits, etc.) «\w+»
// Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters “</” literally «</»
List<String> matchList = new ArrayList<String>();
try {
Pattern regex = Pattern.compile(">(?!one<|two<)(\\w+)</");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group(1));
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
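If you want to tie the check to the tag itself, a variant that spells out <span> and </span> is sketched below. The tag name comes from the question; the class name SpanWords and the method unexpectedWords are made up, and it assumes the spans contain only word characters and no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanWords {
    // A <span> whose whole content is a word other than "one" or "two".
    static final Pattern P = Pattern.compile("<span>(?!(?:one|two)</span>)(\\w+)</span>");

    static List<String> unexpectedWords(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "This is the first tag <span>one</span> and this is the third <span>three</span>";
        System.out.println(unexpectedWords(s)); // [three]
    }
}
```

An empty result means every span holds one of the allowed words.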
Use this:
if (!Pattern.matches(".*(word1|word2|word3).*", "word1")) {
System.out.println("We're good.");
}
You're checking that the pattern does not match the string.