Split regex; keep delimiter - java

I have a text that looks like this:
This is [!img|http://imageURL] text containing [!img|http://imageURL2] some images in it
So now I want to split this string in parts and keep the delimiters.
I already figured out that the following regex splits the string, but it doesn't keep the delimiters:
\[!img\|.*\]
And in some other posts I see that I need to add ?<= to keep the delimiter.
So I connected both, but I get the error message: Lookbehinds need to be zero-width, thus quantifiers are not allowed
Here's the full regex throwing this error:
(?<=\[!img\|.*\])
I expect as result:
[This is; [!img|http://imageURL]; text containing; [!img|http://imageURL2]; some images in it]
So what's the best way to fix it?

You can use a combination of lookaround assertions:
String[] splitArray = subject.split("(?<=\\])|(?=\\[!img)");
This splits a string if the preceding character is a ] or if the following characters are [!img.
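As a minimal sketch of how this behaves on the string from the question (the class and method names are made up for illustration, and the trim() call is an addition to match the expected output, not part of the answer's one-liner):

```java
import java.util.Arrays;

final class SplitKeepDelimiters {
    // Split after every ']' or before every "[!img", so each tag
    // survives as its own token instead of being swallowed.
    static String[] splitKeepingTags(String subject) {
        String[] parts = subject.split("(?<=\\])|(?=\\[!img)");
        for (int i = 0; i < parts.length; i++) {
            // The split points are zero-width, so the spaces around
            // each tag stay attached to the text parts; trim them.
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        String subject = "This is [!img|http://imageURL] text containing "
                + "[!img|http://imageURL2] some images in it";
        System.out.println(Arrays.toString(splitKeepingTags(subject)));
    }
}
```

Note that the first alternative splits after any ], so a ] that appears in ordinary text would also produce a split point.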

Related

Java - How to split a string on "space" with a fraction char like "1 ½"?

So I want to be able to split this string by spaces:
"1 ½ cups fat-free half-and-half, divided "
I wrote my code like this:
String trimmed;
String[] words = trimmed.split(" ");
But it doesn't work! The 1 and the ½ end up in the same position of the array.
I also tried How to split a string with any whitespace chars as delimiters, but it does not split the string either. Looking in a text editor, there is clearly some sort of "space", but I don't get how to split on it. Is it because of "½"?
You've got a thin space there instead of a "regular" space character.
Matching this with a regex is not trivial, because by default \s does not cover every Unicode space character. At a minimum, you would want to add the thin space as an extra alternative...
System.out.println(Arrays.toString(s.split("(\\s|\\u2009)")));
...but you would also need to include all the other non-standard white space characters in this search just to be sure you don't miss any. The above works for your case.
The reason for this is that the space between 1 and ½ is not a regular space (U+0020) but instead a "thin space" (U+2009).
Since String.split(String) accepts a regex pattern, you could for example use the pattern \h instead which represents a "horizontal whitespace character", see Pattern documentation, and matches U+2009.
Or you could use the pattern " |\u2009".
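A small sketch of the \h approach, assuming Java 8 or later (where \h is available) and writing the thin space and the fraction as Unicode escapes so the source file stays ASCII; the class and method names are illustrative only:

```java
import java.util.Arrays;

final class ThinSpaceSplit {
    // "\\h+" matches one or more horizontal whitespace characters,
    // which includes both U+0020 (space) and U+2009 (thin space).
    static String[] splitOnHorizontalWhitespace(String input) {
        return input.trim().split("\\h+");
    }

    public static void main(String[] args) {
        // \u2009 is the thin space between "1" and "\u00BD" (the 1/2 sign).
        String s = "1\u2009\u00BD cups fat-free half-and-half, divided ";
        System.out.println(Arrays.toString(splitOnHorizontalWhitespace(s)));
    }
}
```

With this pattern the 1 and the ½ end up in separate array elements.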

"" as result into java split with Regexp

I have string content that I need to split into an array of tokens, but some of the resulting tokens are "", and I need to avoid them with a regexp.
I tried using the regexp shown in the screenshot, but it does not solve my problem.
Node content example:
Regexp and the result:
You are splitting your string on spaces (among various other characters).
You'd be better off checking if(node.equals("")){// ignore it or remove it}, because whatever you split your string on, you will always have to worry about empty results: your split character could be anywhere in the string. Calling trim on your string before you split it removes the extra leading and trailing space and, because it's spaces you're splitting on, gets rid of the empty values that space causes; which, from what I can see in your question, is exactly what's going on.
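The regexp and input from the question are not visible here, so the following is only a sketch with a made-up delimiter pattern and input; it shows the trim-then-filter idea described above, with hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class NonEmptyTokens {
    // Trim first so leading whitespace cannot create an empty first
    // token, then drop any empty tokens that adjacent delimiters leave.
    static List<String> tokens(String content, String delimiterRegex) {
        return Arrays.stream(content.trim().split(delimiterRegex))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "[ ,]" is an assumed delimiter set, not the asker's pattern.
        System.out.println(tokens("  a, b  c ", "[ ,]"));
    }
}
```

Adding a + quantifier to the delimiter class (for example "[ ,]+") avoids most of the empty tokens in the first place, since a run of delimiters is then treated as a single split point.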

Having a word + whitespace as delimiter

I am trying to split the string "HI. HOW ARE YOU? I AM FINE!" into a string array using split function with the following syntax
String[] i = "HI. HOW ARE YOU? I AM FINE! ".split("[\\. |? |! ]+");
Expected output
HI
HOW ARE YOU
I AM FINE
But IntelliJ reports "Duplicate character literal", and the pattern treats the space as a separate delimiter.
How do I make sure that it takes full stop plus space, question mark plus space, and exclamation mark plus space as delimiters, without treating the space as a separate delimiter?
Which would be the correct regex for it?
If it can be done without regex, even that is okay.
Thanks.
A better pattern would be
"[?!.]\\s*"
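As for the error you are getting: the | operator has no special meaning inside a character class, where it is just a literal | character. If you want your original pattern to work, change the square brackets [...] to parentheses (...) and escape the question mark as \\?, since an unescaped ? is a quantifier outside a character class.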
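A quick sketch of the suggested pattern applied to the string from the question (the class and method names are only illustrative):

```java
import java.util.Arrays;

final class SentenceSplit {
    // Split on one of ? ! . followed by any amount of whitespace,
    // so the space after the punctuation is consumed with it.
    static String[] sentences(String text) {
        return text.split("[?!.]\\s*");
    }

    public static void main(String[] args) {
        String[] parts = sentences("HI. HOW ARE YOU? I AM FINE! ");
        System.out.println(Arrays.toString(parts));
    }
}
```

Spaces inside a sentence are left alone because whitespace is only matched directly after a punctuation character, and split drops the trailing empty string that follows the final "! ".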

Using .split() for multiple characters in Java

I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
split() function accepts a regex.
This is not the regex you're looking for .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? ")
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which covers other whitespace such as \t.
If you're sure about the list of characters you have selected as splitters, this should be the correct split, written as a proper Java string literal as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
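To illustrate, here is a short sketch that applies this split to a made-up input containing "C#" (the input string and the class and method names are assumptions for the example):

```java
import java.util.Arrays;

final class PunctuationSplit {
    // One or more of the listed punctuation or whitespace characters
    // act as a single delimiter; '#' is not listed, so "C#" survives.
    static String[] words(String input) {
        return input.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello, world! C# (rocks): yes/no")));
    }
}
```

If the input begins with one of the delimiter characters, the first array element will still be an empty string, so that case may need a separate check.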
BTW: I've found a particularly useful eclipse editor option which escapes the string when you're pasting them into the quotes. Go to Window/Preferences, under Java/Editor/Typing/, check the box next to Escape text when pasting into a string literal

Java Regex Metacharacters

I found this thread and one of users on it posted the following line of code:
String[] digits2 = number.split("(?<=.)");
I have consulted a couple of sources, like 1 and 2, to decipher what this code means, but I can't figure it out. Can anybody explain what the argument in the split() method means?
Edit: To anyone who has the same question as I had, here's another helpful link
This is a positive lookbehind. The overall expression means "after any character, but without consuming anything". Essentially, if the string looks like
ABC
then the matches would occur at |, between the characters.
A|B|C|
.split("") (an empty pattern) also matches the empty string at the very start of the input. In Java 7 and earlier this produced an extra, undesirable empty string as the first element; since Java 8, a zero-width match at the beginning no longer produces one. (?<=.) is a zero-width assertion (it does not consume any characters) that matches the empty position preceded by any character (preceded by, because it is a lookbehind). This splits at the empty string between each pair of characters, but not at the position between the start of the string and the first character.
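A minimal sketch of the lookbehind split on the ABC example (the class and method names are illustrative only):

```java
import java.util.Arrays;

final class PerCharacterSplit {
    // "(?<=.)" matches every position that has a character directly
    // before it, so the string is cut after each character.
    static String[] chars(String text) {
        return text.split("(?<=.)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chars("ABC")));
    }
}
```

The match after the last character would yield a trailing empty string, but split removes trailing empty strings by default. Since . does not match line terminators unless DOTALL is enabled, no split occurs directly after a newline character.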
