If I have this String:
String line = "This, is Stack; Overflow.";
And want to split it into the following array of strings:
String[] array = ...
so the array contains this output:
["This",",","is","Stack",";","Overflow","."]
What regular expression should I pass to the split() method?
Split the input on whitespace, or on the boundary between a word character and the non-word character that follows it.
String s = "This, is Stack; Overflow.";
String parts[] = s.split("\\s|(?<=\\w)(?=\\W)");
System.out.println(Arrays.toString(parts));
\s matches any whitespace character, \w matches a word character, and \W matches a non-word character.
(?<=\\w) is a positive look-behind, which asserts that the match must be preceded by a word character (a-z, A-Z, 0-9, _).
(?=\\W) is a positive look-ahead, which asserts that the match must be followed by a non-word character (any character that is not a word character). So (?<=\\w)(?=\\W) matches only a boundary, not a character.
Splitting the input on the matched whitespace and boundaries therefore gives the desired output.
Or, to also split where a punctuation character is immediately followed by a word character:
String s = "This, is Stack; Overflow.";
String parts[] = s.split("\\s|(?<=\\w)(?=\\W)|(?<=[^\\w\\s])(?=\\w)");
System.out.println(Arrays.toString(parts));
Output:
[This, ,, is, Stack, ;, Overflow, .]
You can do that with this pattern:
\\s+|(?<=\\S)(?=[^\\w\\s])|(?<=[^\\w\\s])\\b
It trims whitespace and handles consecutive special characters. For example, with
;This, is Stack; ;; Overflow.
you obtain: [";", "This", ",", "is", "Stack", ";", ";", ";", "Overflow", "."]
A more efficient way, however, is to skip the split method and use the find method with this pattern:
\\w+|[^\\w\\s]
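A minimal sketch of that find-based approach, assuming the tokens should be collected into a list (the class name TokenizerSketch and the method name tokenize are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // A token is either a run of word characters or a single
    // character that is neither a word character nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [;, This, ,, is, Stack, ;, ;, ;, Overflow, .]
        System.out.println(tokenize(";This, is Stack; ;; Overflow."));
    }
}
```

Because the pattern describes the tokens themselves rather than the gaps between them, whitespace is simply never matched and no empty strings can appear in the result.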
I'm currently trying to break a string into a multi-line string.
The regex should select whitespace that is preceded by 13 characters.
The problem is that the 13-character count does not reset after the previously selected whitespace, so after the first 13 characters the regex selects every whitespace character.
I'm using the following regex with a positive look-behind of 13 characters:
(?<=.{13})
(there is a space at the end)
You can test the regex with the following code:
public class HelloWorld {
    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        for (String string : str.split("(?<=.{13}) ")) {
            System.out.println(string);
        }
    }
}
The output of this code is as follows:
This is a test.
The
app
should
break
this
string
in
substring
on
whitespaces
after
13
characters
But it should be:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You may actually use a lazy limiting quantifier to match the lines and then replace with $0\n:
.{13,}?[ ]
In Java:
String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
System.out.println(str.replaceAll(".{13,}?[ ]", "$0\n"));
Note that the pattern matches:
.{13,}? - any character other than a line terminator (use the DOTALL modifier if you need to match those too, though that is unlikely in this scenario), at least 13 times, extended lazily only as far as the first space that follows
[ ] - a literal space (a character class is redundant, but it helps visualize the pattern).
The replacement pattern - "$0\n" - reinserts the whole matched value (it is stored in Group 0) and appends a newline after it.
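The same idea can be wrapped in a small helper with a configurable width. This is only a sketch: the names WrapSketch and wrapAfter are made up, and unlike the "$0\n" replacement above it captures the line in Group 1 and uses "$1\n", so the space that triggered the break is dropped instead of being left at the end of each line.

```java
class WrapSketch {
    // Breaks the text at the first space found after at least minWidth characters.
    static String wrapAfter(String text, int minWidth) {
        // Group 1 is the line without the space that follows it.
        return text.replaceAll("(.{" + minWidth + ",}?)[ ]", "$1\n");
    }

    public static void main(String[] args) {
        String str = "This is a test. The app should break this string in substring on whitespaces after 13 characters";
        System.out.println(wrapAfter(str, 13));
    }
}
```

The last chunk ("13 characters") is not followed by a space, so it is not matched and stays in place as the final line.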
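Rather than splitting, you can match and capture at least 13 characters up to the next run of spaces (or the end of the input).
Java code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// At least 13 characters (lazily), followed by spaces or the end of the input
Pattern p = Pattern.compile("(.{13,}?)(?: +|$)");
Matcher m = p.matcher(text); // text holds the input string
List<String> matches = new ArrayList<>();
while (m.find()) {
    matches.add(m.group(1));
}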
It will produce:
This is a test.
The app should
break this string
in substring on
whitespaces after
13 characters
You can do this with split() and a regular expression, like this:
line.split("\\s+");
This splits the string on every run of one or more whitespace characters. Punctuation stays attached to the neighbouring word.
I need to add spaces between all punctuation in a string.
// "Hello: World." -> "Hello : World ."
// "It's 9:00?" -> "It ' s 9 : 00 ?"
// "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go: match all non-punctuation with [a-zA-Z\\d]+, add a space before and/or after, then extract the remainder by matching all punctuation with [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
    StringBuilder builder = new StringBuilder();
    final String nonpunctuation = "[a-zA-Z\\d]+";
    final String punctuation = "[^a-zA-Z\\d]+";
    String found;
    while (!s.isEmpty()) {
        // regex stuff goes here
        found = ???; // found group from respective regex goes here
        builder.append(found);
        builder.append(" ");
        s = s.replaceFirst(found, "");
    }
    return builder.toString().trim();
}
However, this doesn't feel like the right way to go. I think I'm overcomplicating things.
You can use a lookaround-based regex with the punctuation property \p{Punct} in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) asserts that the previous character is not whitespace
(?<=\\p{Punct}) asserts that the previous character is a punctuation character
(?=\\p{Punct}) asserts that the next character is a punctuation character
(?=\\S) asserts that the next character is not whitespace
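To check the pattern against the three examples from the question, a small harness along these lines can be used (the class name PunctuationSpacingSketch and the method name spaceOut are made up for this sketch):

```java
class PunctuationSpacingSketch {
    // Inserts a space at every position that sits between two non-whitespace
    // characters and touches a punctuation character on either side.
    static String spaceOut(String s) {
        return s.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
    }

    public static void main(String[] args) {
        System.out.println(spaceOut("Hello: World.")); // Hello : World .
        System.out.println(spaceOut("It's 9:00?"));    // It ' s 9 : 00 ?
        System.out.println(spaceOut("1.B,3.D!"));      // 1 . B , 3 . D !
    }
}
```

The match is zero-width, so nothing is removed from the input; the replacement only adds spaces.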
When you see a punctuation mark, you have four possibilities:
1. Punctuation is surrounded by spaces
2. Punctuation is preceded by a space
3. Punctuation is followed by a space
4. Punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions: the first adds the missing space before the punctuation (case 3), and the second adds the missing space after it (case 2). Since the expressions are applied one after the other, together they take care of case 4 as well. Case 1 requires no change.
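As a quick check of the four cases, here is a sketch that runs the two replacements over one sample input per case. The inputs and the names TwoPassSpacingSketch and spaceOut are made up for illustration.

```java
class TwoPassSpacingSketch {
    static String spaceOut(String s) {
        return s
            .replaceAll("(?<=\\S)\\p{Punct}", " $0")  // add the missing space before
            .replaceAll("\\p{Punct}(?=\\S)", "$0 ");  // add the missing space after
    }

    public static void main(String[] args) {
        // One input per case: surrounded, preceded, followed, neither.
        for (String s : new String[] {"a , b", "a ,b", "a, b", "a,b"}) {
            System.out.println(spaceOut(s)); // each line prints: a , b
        }
    }
}
```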
What regular expression would match the string below?
String str = "<Element>\r\n <Sub>regular</Sub></Element>";
There is a carriage return "\r", a newline character "\n" and a space after <Element>.
My code is below:
if (str.matches("<Element>([\\s])<Sub>(.*)")) {
    System.out.println("Matches");
}
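Use \s* for the whitespace, together with the "dot matches newline" switch:
if (str.matches("(?s)<Element>\\s*<Sub>(.*)"))
\s matches newline characters with or without the switch. What (?s) changes is that . also matches line terminators, which matters if the text after <Sub> spans several lines.
I also fixed your regex, removing two sets of redundant brackets and adding the crucial * after the whitespace regex.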
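The difference the switch makes can be seen in a short sketch. The second input, with a line break inside the <Sub> element, is a made-up variation of the string from the question, and the class and method names are hypothetical.

```java
class DotAllSketch {
    static boolean matchesPlain(String s) {
        return s.matches("<Element>\\s*<Sub>(.*)");
    }

    static boolean matchesDotAll(String s) {
        return s.matches("(?s)<Element>\\s*<Sub>(.*)");
    }

    public static void main(String[] args) {
        String str = "<Element>\r\n <Sub>regular</Sub></Element>";
        // Hypothetical variation: a line break inside the <Sub> element.
        String multiLine = "<Element>\r\n <Sub>regular\r\n</Sub></Element>";

        System.out.println(matchesPlain(str));        // true: \s* consumes "\r\n "
        System.out.println(matchesPlain(multiLine));  // false: . stops at "\r"
        System.out.println(matchesDotAll(multiLine)); // true: . now matches "\r\n"
    }
}
```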
How do you match more than one space character in Java regex?
I have a regex I am trying to match. The regex fails when I have two or more space characters.
public static void main(String[] args) {
    String pattern = "\\b(fruit)\\s+([^a]+\\w+)\\b"; // Match 'fruit' not followed by a word that begins with 'a'
    String str = "fruit apple"; // One space character will not be matched
    String str_fail = "fruit  apple"; // Two space characters will be matched
    System.out.println(preg_match(pattern, str)); // False (that's what I want)
    System.out.println(preg_match(pattern, str_fail)); // True (regex fail)
}

public static boolean preg_match(String pattern, String subject) {
    Pattern regex = Pattern.compile(pattern);
    Matcher regexMatcher = regex.matcher(subject);
    return regexMatcher.find();
}
The problem is actually because of backtracking. Your regex:
"\\b(fruit)\\s+([^a]+\\w+)\\b"
says "fruit, followed by one or more whitespace characters, followed by one or more non-'a' characters, followed by one or more 'word' characters". It fails with two spaces because \s+ first matches both spaces, then backtracks and gives back the second one, which then satisfies [^a]+ while \s+ keeps the first.
I think you can fix it by using the possessive quantifier instead, which would be \s++. This tells \s not to give back the second space character. Possessive quantifiers are described in the documentation for java.util.regex.Pattern.
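A small sketch comparing the two quantifiers on the input from the question. The "fruit  banana" input and the class name PossessiveSketch are made up for illustration.

```java
import java.util.regex.Pattern;

class PossessiveSketch {
    static final Pattern GREEDY = Pattern.compile("\\b(fruit)\\s+([^a]+\\w+)\\b");
    static final Pattern POSSESSIVE = Pattern.compile("\\b(fruit)\\s++([^a]+\\w+)\\b");

    static boolean found(Pattern p, String subject) {
        return p.matcher(subject).find();
    }

    public static void main(String[] args) {
        String twoSpaces = "fruit  apple"; // two spaces
        // Greedy \s+ gives one space back, which [^a]+ then consumes.
        System.out.println(found(GREEDY, twoSpaces));           // true
        // Possessive \s++ keeps both spaces, so [^a]+ has to start at 'a' and fails.
        System.out.println(found(POSSESSIVE, twoSpaces));       // false
        // A word that does not begin with 'a' still matches.
        System.out.println(found(POSSESSIVE, "fruit  banana")); // true
    }
}
```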
As an illustration, here are two examples at Rubular:
Using the possessive quantifier on \s (gives expected results, from what you describe)
Your current regex with separate groups around [^a]+ and \w+. Notice that the second match group (representing [^a]+) captures the second space character.
How do I write a Pattern (Java) to match any sequence of characters except a given list of words?
I need to find out whether a given piece of code has any text surrounded by tags like <span>, other than a given list of words.
For example, I want to check if there are any other words besides "one" and "two" surrounded by the tag <span>.
"This is the first tag <span>one</span> and this is the third <span>three</span>"
The pattern should match the above string because the word "three" is surrounded by the tag and is not part of the list of given words ("one", "two").
Look-ahead can do this:
\b(?!your|given|list|of|exclusions)\w+\b
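Matches
a word boundary (start-of-word)
not followed by any of "your", "given", "list", "of", "exclusions"
followed by one or more word characters
followed by a word boundary (end-of-word)
In effect, this matches any word that does not begin with one of the excluded words. Because the lookahead only checks the start of the word, a longer word such as "often" is rejected too; to exclude whole words only, end the lookahead with a word boundary: \b(?!(?:your|given|list|of|exclusions)\b)\w+\b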
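A short sketch of the lookahead with and without a word boundary inside it, using "one" and "two" as the exclusion list. The input string, the class name ExclusionSketch and the helper findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExclusionSketch {
    // Rejects every word that merely starts with an excluded word.
    static final Pattern PREFIX = Pattern.compile("\\b(?!one|two)\\w+\\b");
    // Rejects only words that are exactly an excluded word.
    static final Pattern WHOLE = Pattern.compile("\\b(?!(?:one|two)\\b)\\w+\\b");

    static List<String> findAll(Pattern p, String input) {
        List<String> words = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "one two three twofold";
        System.out.println(findAll(PREFIX, input)); // [three]
        System.out.println(findAll(WHOLE, input));  // [three, twofold]
    }
}
```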
This should get you started.
import java.util.*;
import java.util.regex.*;

// >(?!one<|two<)(\w+)</
//
// Match the character “>” literally «>»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!one<|two<)»
//    Match either the regular expression below (attempting the next alternative only if this one fails) «one<»
//       Match the characters “one<” literally «one<»
//    Or match regular expression number 2 below (the entire group fails if this one fails to match) «two<»
//       Match the characters “two<” literally «two<»
// Match the regular expression below and capture its match into backreference number 1 «(\w+)»
//    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the characters “</” literally «</»

// subjectString holds the text to search
List<String> matchList = new ArrayList<String>();
try {
    Pattern regex = Pattern.compile(">(?!one<|two<)(\\w+)</");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        matchList.add(regexMatcher.group(1));
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
Use this:
if (!Pattern.matches(".*(word1|word2|word3).*", "word1")) {
    System.out.println("We're good.");
}
You're checking that the pattern does not match the string.