Having a word + whitespace as delimiter - java

I am trying to split the string "HI. HOW ARE YOU? I AM FINE!" into a string array using split function with the following syntax
String[] i = "HI. HOW ARE YOU? I AM FINE! ".split("[\\. |? |! ]+");
Expected output
HI
HOW ARE YOU
I AM FINE
But IntelliJ reports "Duplicate character literal", and the pattern treats a space as a separate delimiter.
How do I make sure that it takes full stop plus space, question mark plus space, and exclamation mark plus space, without treating the space as a separate delimiter?
Which would be the correct regex for it?
If it can be done without regex, even that is okay.
Thanks.

A better pattern would be
"[?!.]\\s*"
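As for the error you are getting, that is because the | operator does not work within a character class. Inside [...] it is just a literal | character, so your class matches any single one of ".", space, "|", "?" or "!", and the repeated space and | are what gets flagged as duplicates. If you want your pattern to work, change the square brackets [...] to parentheses (...) and escape the question mark as \\?, since ? is a quantifier outside a character class.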
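Both suggestions from this answer can be tried side by side. This is only a minimal sketch: the class and method names are made up for illustration, and the second pattern is the asker's original with parentheses and an escaped question mark.

```java
import java.util.Arrays;

public class SentenceSplitDemo {
    // Splits on '?', '!' or '.' followed by any amount of whitespace.
    static String[] splitWithCharClass(String text) {
        return text.split("[?!.]\\s*");
    }

    // The asker's pattern with parentheses instead of brackets and '?' escaped.
    static String[] splitWithAlternation(String text) {
        return text.split("(\\. |\\? |! )+");
    }

    public static void main(String[] args) {
        String text = "HI. HOW ARE YOU? I AM FINE! ";
        System.out.println(Arrays.toString(splitWithCharClass(text)));   // [HI, HOW ARE YOU, I AM FINE]
        System.out.println(Arrays.toString(splitWithAlternation(text))); // [HI, HOW ARE YOU, I AM FINE]
    }
}
```

The trailing empty string after the final "! " does not show up because String.split drops trailing empty strings by default.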

Related

Java - How to split a string on "space" with a fraction char like "1 ½"?

So I want to be able do split this string by spaces:
"1 ½ cups fat-free half-and-half, divided "
I wrote my code like this:
String trimmed;
String[] words = trimmed.split(" ");
But it doesn't work! The 1 and the ½ end up in the same position of the array.
I also tried the approach from "How to split a string with any whitespace chars as delimiters", but it does not split the string either. Looking in a text editor, there is clearly some sort of "space", but I don't get how to split on it. Is it because of "½"?
You've got a thin space there instead of a "regular" space character.
Matching this with a regex is not trivial, because \s does not cover the thin space and there are other non-standard whitespace characters you may need to match as well. At a minimum you would add it as an extra alternative...
System.out.println(Arrays.toString(s.split("(\\s|\\u2009)")));
...but you would also need to include all the other non-standard white space characters in this search just to be sure you don't miss any. The above works for your case.
The reason for this is that the space between 1 and ½ is not a regular space (U+0020) but instead a "thin space" (U+2009).
Since String.split(String) accepts a regex pattern, you could for example use the pattern \h instead which represents a "horizontal whitespace character", see Pattern documentation, and matches U+2009.
Or you could use the pattern " |\u2009".
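A small sketch of the \h variant, assuming Java 8 or later (where \h is available). The class and method names are made up for illustration, and the thin space and the fraction are written as Unicode escapes so the source file's encoding does not matter.

```java
import java.util.Arrays;

public class ThinSpaceSplitDemo {
    // Splits on runs of horizontal whitespace, which includes the thin space U+2009.
    static String[] splitOnHorizontalWhitespace(String text) {
        return text.split("\\h+");
    }

    public static void main(String[] args) {
        // Thin space (U+2009) between "1" and the one-half character (U+00BD).
        String ingredient = "1\u2009\u00BD cups fat-free half-and-half, divided ";

        // Splitting on a regular space leaves "1", thin space and the fraction together.
        System.out.println(ingredient.split(" ")[0].length()); // 3

        // Splitting on horizontal whitespace separates them.
        System.out.println(Arrays.toString(splitOnHorizontalWhitespace(ingredient)));
    }
}
```

Using \h+ rather than a single \h also collapses runs of mixed spaces into one delimiter, so no empty strings appear between words.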

Split regex; keep delimiter

I have a text that looks like this:
This is [!img|http://imageURL] text containing [!img|http://imageURL2] some images in it
So now I want to split this string in parts and keep the delimiters.
I already figured out that this works to split the string, but it doesn't keep the delimiters:
\[!img\|.*\]
And in some other posts I see that I need to add ?<= to keep the delimiter.
So I connected both, but I get the error message: Lookbehinds need to be zero-width, thus quantifiers are not allowed
Here's the full regex throwing this error:
(?<=\[!img\|.*\])
I expect as result:
[This is; [!img|http://imageURL]; text containing; [!img|http://imageURL2]; some images in it]
So what's the best way to fix it?
You can use a combination of lookaround assertions:
String[] splitArray = subject.split("(?<=\\])|(?=\\[!img)");
This splits a string if the preceding character is a ] or if the following characters are [!img.
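Applied to the sample text from the question, the lookaround split can be checked like this. It is a minimal sketch with made-up class and method names. Note that the pieces keep their surrounding spaces, and that the lookbehind splits after every ], not only after one that closes an image tag.

```java
public class KeepDelimiterSplitDemo {
    // Splits after every ']' and before every "[!img", consuming no characters.
    static String[] splitKeepingImages(String subject) {
        return subject.split("(?<=\\])|(?=\\[!img)");
    }

    public static void main(String[] args) {
        String subject = "This is [!img|http://imageURL] text containing "
                + "[!img|http://imageURL2] some images in it";
        for (String part : splitKeepingImages(subject)) {
            // Quotes make the leading and trailing spaces visible.
            System.out.println("'" + part + "'");
        }
    }
}
```

If the padding around the text pieces is unwanted, calling trim() on each element afterwards is one simple way to remove it.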

Why the space appears as sub string in this split instruction?

I have string with spaces and some non-informative characters and substrings required to be excluded and just to keep some important sections. I used the split as below:
String myString[]={"01: Hi you look tired today? Can I help you?"};
myString=myString[0].split("[\\s+]");// Split based on any white spaces
for(int ii=0;ii<myString.length;ii++)
System.out.println(myString[ii]);
The result is :
01:
Hi
you
look
tired
today?
Can
I
help
you?
Spaces appeared as substrings after the split when the regex was "[\s+]", but disappeared when the regex was "\s+". I am confused and was not able to find an answer in the related Stack Overflow pages. The linked regex Pattern documentation made me even more confused.
Please help, I am new to Java.
19/1/2015:Edit
After your valuable advice, I reached a point in my program where a conditional statement needs to be decomposed and processed. The case I have is:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\,]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is fine till now as:
01:IF
rd.h
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
with
0.4610;
My next step is to add string "with" to the regex and get rid of this word while doing the split.
I tried it this way:
String s1="01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
String [] s2=s1.split(("[\\s\\&\\, with]+"));
for(int ii=0;ii<s2.length;ii++)System.out.println(s2[ii]);
The result is not perfect, because I got an unwanted extra split at every "h" letter:
01:IF
rd.
dq.L
o.LL
v.L
THEN
la.VHB
av.VHR
0.4610;
Any advice on how to specify string with mixed white spaces and separation marks?
Many thanks.
Inside square brackets, [\s+] is a character class containing the whitespace characters plus the literal plus sign. It matches only one character at a time, so a sequence of several spaces produces empty strings between them, as Todd noted, and a + in the input also acts as a separator.
You should use \s+ (without brackets) as the separator. That means one or more whitespace characters.
myString=myString[0].split("\\s+");
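The difference is easiest to see on an input with a double space and a plus sign. This is a minimal sketch; the input string and the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {
    // Character class: every single whitespace character or '+' is its own delimiter.
    static String[] splitWithClass(String text) {
        return text.split("[\\s+]");
    }

    // Quantified \s: a whole run of whitespace counts as one delimiter.
    static String[] splitWithRun(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "a  b+c"; // two spaces between a and b
        System.out.println(Arrays.toString(splitWithClass(text))); // [a, , b, c]
        System.out.println(Arrays.toString(splitWithRun(text)));   // [a, b+c]
    }
}
```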
Your biggest problem is not understanding enough about regular expressions to write them properly. One key point you don't comprehend is that [...] is a character class, which is a list of characters any one of which can match. For example:
[abc] matches either a, b or c (it does not match "abc")
[\\s+] matches any whitespace or "+" character
[with] matches a single character that is either w, i, t or h
[.$&^?] matches those literal characters - most characters lose their special regex meaning when in a character class
To split on any number of whitespace, comma and ampersand and consume "with" (if it appears), do this:
String [] s2 = s1.split("[\\s,&]+(with[\\s,&]+)?");
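Run against the rule string from the question, that pattern can be checked as follows. This is a minimal sketch; the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class RuleSplitDemo {
    // Delimiter: a run of whitespace, ',' or '&', optionally followed by
    // the word "with" and another such run.
    static String[] splitRule(String rule) {
        return rule.split("[\\s,&]+(with[\\s,&]+)?");
    }

    public static void main(String[] args) {
        String s1 = "01:IF rd.h && dq.L && o.LL && v.L THEN la.VHB , av.VHR with 0.4610;";
        System.out.println(Arrays.toString(splitRule(s1)));
        // [01:IF, rd.h, dq.L, o.LL, v.L, THEN, la.VHB, av.VHR, 0.4610;]
    }
}
```

Because "with" must be followed by at least one separator character to be consumed, a token that merely starts with those letters is left intact, and the "h" in rd.h is no longer treated as a delimiter.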
You can try it easily here Online Regex and get useful comments.

Can newlines be replaced with spaces? (lexer)

I'm currently in the process of developing a parser for a subset of Java, and I was wondering:
Are there any cases in which newlines are more than token separators?
That is, where they couldn't just be replaced by a space.
Should I ignore newlines, in the same way that I ignore white-space?
That is, just use them to detect token separation.
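Yes, newline characters in Java source code can generally be replaced by a space. The exception outside string literals is a // line comment, which is terminated by the end of the line. Removing a newline outright is only safe where it is not the only separator between two tokens. Also, do not touch the escape sequence \n (backslash followed by n), because that is how a newline character is written inside a String literal.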
And yes, to the parser newlines are the same as spaces, as long as you are outside String literals. If you removed a newline inside a String literal, you would suppress a syntax error, because Java does not allow a raw newline character inside a String literal. So this is wrong:
String str = "first line
same line";
So, it depends on the fact if you want to detect syntax errors with your parser or not. Do you only parse valid code or not? That is the question you should ask yourself.
The only situation I can think of where it makes a difference is within String-literals.
If there is a linebreak between two "s it would cause a syntax error while a space would not.
Note that \n can also appear as an escape sequence inside a string literal. And if you do replace newlines, you still have to increment your line counter by one for each of them, because you will need the line numbers in the later phases of your project.
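One way to treat newlines as plain whitespace while still keeping line numbers is to count them as they are skipped. The following is only a rough sketch under simplifying assumptions: the class and method names are hypothetical, tokens are simply whitespace-separated chunks, and string literals and comments are not handled at all.

```java
import java.util.ArrayList;
import java.util.List;

public class LineTrackingLexer {
    // Returns each whitespace-separated token as "line:token" (lines are 1-based).
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int line = 1;      // line of the character being read
        int tokenLine = 1; // line on which the current token started
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                // Any whitespace, newline included, ends the current token.
                if (current.length() > 0) {
                    tokens.add(tokenLine + ":" + current);
                    current.setLength(0);
                }
                // A newline separates tokens like a space, but also advances the counter.
                if (c == '\n') {
                    line++;
                }
            } else {
                if (current.length() == 0) {
                    tokenLine = line;
                }
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(tokenLine + ":" + current);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x\n= 1;")); // [1:int, 1:x, 2:=, 2:1;]
    }
}
```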

Unescaped "." still matches when used in a negation group

I made, what I believed to be, an error in a regular expression in Java recently but when I test my code I don't get the error I expect.
The expression I created was meant to replace a password in a string that I received from another source. The pattern I used went along the lines of: "password: [^\\s.]*", the idea being that it would match the word "password" the colon, a space, then any characters except for a space or a full-stop (period). I would then replace the instance with "password: XXXXXX" and therefore mask it.
The obvious error should be that I have forgotten to escape the full-stop. In other words, the proper expression should have been "password: [^\\s\\.]*". The thing is, if I don't escape the full-stop, the code still works!
Here's some sample code:
import java.util.regex.*;

public class SimpleRegexTest {
    public static void main(String[] args) {
        Pattern simplePattern = Pattern.compile("password: [^\\s.]*");
        Matcher simpleMatcher = simplePattern.matcher("password: newpass. Enjoy.");
        String maskedString = simpleMatcher.replaceAll("password: XXXXXX");
        System.out.println(maskedString);
    }
}
When I run the above code I get the following output:
password: XXXXXX. Enjoy.
Is this a special case, or have I completely missed something?
(edit: changed to "escape the full-stop")
Michael Borgwardt: I couldn't think of another term to describe what I was doing apart from "negation group", sorry for the ambiguity.
Aviator: In this case, no, a space won't be in the password. I didn't make the rules ;-).
(edit: doubled up the slashes in the non-code text so it displays properly, added the ^ which was in the code, but not the text :-/)
Sundar: Fixed the double slashes, SO seems to have its own escape characters.
A period ('.' character) does not need to be escaped inside a character class [] in a regular expression.
From the API:
Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
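A quick way to confirm this is to run the escaped and unescaped patterns against the same input. This is a minimal sketch; the class name and the mask helper are made up for illustration.

```java
import java.util.regex.Pattern;

public class DotInClassDemo {
    // Replaces every match of the given regex with the masked password text.
    static String mask(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("password: XXXXXX");
    }

    public static void main(String[] args) {
        String input = "password: newpass. Enjoy.";
        // Escaped and unescaped dot behave identically inside a character class.
        System.out.println(mask("password: [^\\s.]*", input));   // password: XXXXXX. Enjoy.
        System.out.println(mask("password: [^\\s\\.]*", input)); // password: XXXXXX. Enjoy.

        // Inside [...] the dot is literal; outside it matches any character.
        System.out.println(Pattern.matches("[.]", "x")); // false
        System.out.println(Pattern.matches(".", "x"));   // true
    }
}
```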
It looks like you got the negation operator mixed up for regex ranges.
In particular, my understanding is that you used the snippet [\s.]* to mean "any characters except for a space or a full-stop (period)." This would in fact be expressed as [^ .]*, using the caret to negate the characters in the set.
I don't know if this was just a typo in your post or what was actually in your code, but the regex as it stands in your question will match the word "password", a colon, a space, then any sequence of backslash characters, "s" characters or periods.
