"" as result into java split with Regexp - java

I have string content that I need to split into an array of tokens, but some of the resulting tokens are "" (empty strings), and I want to avoid them by changing the regexp.
I tried using the regexp shown in the screenshot, but it does not solve my problem.
Node content example:
Regexp and the result:

You are splitting your string on spaces (among various other characters).
You'd be better off checking each token, e.g. if (node.equals("")) { /* ignore it or remove it */ }, because whatever you split your string on, you will always have to worry about empty results: your split character could be anywhere in the string. Calling trim() on your string before you split it removes the extra leading and trailing space and, because it's spaces you're splitting on, gets rid of the empty values that space produces, which from what I can see in your question is exactly what's going on.
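A minimal sketch of the two suggestions above, combined. The question's actual content and delimiter set are not shown, so this assumes the tokens are separated by whitespace only; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitWithoutEmptyTokens {

    // Splits on runs of whitespace and drops any empty tokens that remain.
    // Assumption: whitespace is the only delimiter.
    public static List<String> tokens(String content) {
        List<String> result = new ArrayList<>();
        // trim() removes leading/trailing whitespace, so split() cannot
        // produce an empty first token; "\\s+" collapses repeated spaces.
        for (String node : content.trim().split("\\s+")) {
            if (!node.isEmpty()) { // still needed when the input is blank
                result.add(node);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("  alpha   beta gamma  ")); // [alpha, beta, gamma]
    }
}
```

The isEmpty() check looks redundant after trim(), but splitting an empty string still returns one empty token, so it covers blank input.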

Related

Regex expression that keeps upper/lower case characters AND whitespace?

I need to parse some text, I am doing this using a Regex expression within the replaceAll() method. This is the line where I use it:
String parsedValue = selectedValue.replaceAll("[^A-Za-z]", "");
This is nearly perfect, it removes the numbers from the string, however it also gets rid of the spaces and I need to keep the spaces? How can I modify it to do this?
For example, "Local Police 101" would become "Local Police".
You're so close! You just need to add a space to your list of "not", so you end up with "[^A-Za-z ]";
String parsedValue = selectedValue.replaceAll("[^A-Za-z ]", "");
Notice the space after the lowercase "z" in your regular expression.
Edit:
Looking at your example, you also want to remove the leftover spaces at the beginning and end of the string. To do this, simply add .trim() after replaceAll(). You'll end up with something like this:
String parsedValue = selectedValue.replaceAll("[^A-Za-z ]", "").trim();
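For anyone who wants to try it, here is the same line wrapped in a small runnable class. The class and method names are only for illustration.

```java
public class KeepLettersAndSpaces {

    // Removes every character that is not an ASCII letter or a space,
    // then strips the spaces left over at either end.
    public static String parse(String selectedValue) {
        return selectedValue.replaceAll("[^A-Za-z ]", "").trim();
    }

    public static void main(String[] args) {
        // Brackets make any leftover whitespace visible.
        System.out.println("[" + parse("Local Police 101") + "]"); // [Local Police]
    }
}
```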

Using .split() for multiple characters in Java

I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
The split() function accepts a regex.
This is not the regex you're looking for: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "). It matches that whole sequence of characters in that exact order, not any one of them.
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which matches all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be your correct split with the correct Java string literal, as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
BTW: I've found a particularly useful Eclipse editor option which escapes text when you paste it between quotes. Go to Window/Preferences and, under Java/Editor/Typing, check the box next to "Escape text when pasting into a string literal".
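A quick way to check the character-class split from this answer is a small test class like the one below. The sample sentence and the class name are made up for illustration.

```java
import java.util.Arrays;

public class SplitOnPunctuation {

    // One or more of the listed delimiter characters act as a single separator.
    public static String[] words(String input) {
        return input.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+");
    }

    public static void main(String[] args) {
        // '#' is not in the delimiter class, so "C#" stays in one piece.
        System.out.println(Arrays.toString(words("Hello, world! C# (is) fun?")));
        // [Hello, world, C#, is, fun]
    }
}
```

Note that if the input starts with one of the delimiters, the first element will still be an empty string, so you may want to skip it.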

Split regex; keep delimiter

I have a text that looks like this:
This is [!img|http://imageURL] text containing [!img|http://imageURL2] some images in it
So now I want to split this string in parts and keep the delimiters.
I already figured out that this works to split the string, but it doesn't keep the delimiters:
\[!img\|.*\]
And in some other posts I see that I need to add ?<= to keep the delimiter.
So I connected both, but I get the error message: Lookbehinds need to be zero-width, thus quantifiers are not allowed
Here's the full regex throwing this error:
(?<=\[!img\|.*\])
I expect as result:
[This is; [!img|http://imageURL]; text containing; [!img|http://imageURL2]; some images in it]
So what's the best way to fix it?
You can use a combination of lookaround assertions:
String[] splitArray = subject.split("(?<=\\])|(?=\\[!img)");
This splits a string if the preceding character is a ] or if the following characters are [!img.
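Here is a small runnable sketch of that split applied to the text from the question. Trimming each part is my addition (the raw parts keep the spaces next to the delimiters), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitKeepDelimiters {

    // Splits after every "]" and before every "[!img", keeping the tags as parts.
    public static List<String> parts(String subject) {
        List<String> result = new ArrayList<>();
        for (String part : subject.split("(?<=\\])|(?=\\[!img)")) {
            String trimmed = part.trim(); // drop the spaces around each part
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This is [!img|http://imageURL] text containing "
                + "[!img|http://imageURL2] some images in it";
        for (String part : parts(text)) {
            System.out.println(part);
        }
    }
}
```

Keep in mind that the lookbehind splits after any ], so a ] elsewhere in the text would also produce a split.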

Why does java String.split() leave behind empty strings?

When I use the String.split() method, how come sometimes I get empty strings? For example, if I do:
"(something)".split("\\W+")
Then the first element of the return value will be an empty string. Also, the example from the documentation (as seen here) doesn't make sense either.
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
How come the ":" is used as the delimiter, there are no empty strings, but with "o" there are?
With:
"(something)".split("\\W+")
it's assuming the delimiter comes between fields, so what you end up with is:
""   "something"   ""     <- fields
   (             )        <- delimiters
You could fix that by trimming the string first to remove any leading or trailing delimiters, something like:
"(something)".replaceAll("^\\W*","").replaceAll("\\W*$","").split("\\W+")
With something like:
"boo:and:foo".split("o", 0)
you'll get:
"b"   ""   ":and:f"     <- fields
    o    o              <- delimiters
because you have consecutive delimiters (which don't exist when the delimiter is ":"), and these are deemed to have an empty field between them.
The reason you don't get trailing blank fields from the oo at the end of foo has to do with that limit of zero: in that case, trailing (not leading) empty fields are removed.
If you also want to get rid of the empty fields in the middle, you can instead use "o+" as the delimiter, since that will greedily absorb consecutive o characters into a single delimiter. You can also use the replaceAll trick shown above to get rid of leading empty fields.
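To see these cases side by side, you can run something like the following. It only uses String.split and replaceAll from the JDK; the class name and the stripThenSplit helper are made up for illustration.

```java
import java.util.Arrays;

public class SplitEmptyFieldsDemo {

    // Strips leading and trailing non-word characters before splitting,
    // so no empty field appears at either end.
    public static String[] stripThenSplit(String s) {
        return s.replaceAll("^\\W*", "").replaceAll("\\W*$", "").split("\\W+");
    }

    public static void main(String[] args) {
        // Leading delimiter: the empty first field is kept.
        System.out.println(Arrays.toString("(something)".split("\\W+")));   // [, something]
        // Limit 0 (the default): trailing empty fields are dropped.
        System.out.println(Arrays.toString("boo:and:foo".split("o", 0)));   // [b, , :and:f]
        // Negative limit: trailing empty fields are kept.
        System.out.println(Arrays.toString("boo:and:foo".split("o", -1)));  // [b, , :and:f, , ]
        // "o+" treats a run of o's as a single delimiter.
        System.out.println(Arrays.toString("boo:and:foo".split("o+")));     // [b, :and:f]
        System.out.println(Arrays.toString(stripThenSplit("(something)"))); // [something]
    }
}
```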
Actually, the reason is not which delimiter you choose; in the latter case you have two o's, one directly after the other. And what is between them? The empty string.
Maybe it's counterintuitive at first, and you might think it would be better to skip empty strings. But there are two very popular formats for storing data in text files: tab-separated values and comma-separated values.
Let's imagine that you want to store information about people in the format name,surname,age. For example, Peter,Green,12. But what if you want to store information about a guy whose surname you don't know? It should look like Mike,,13. Then if you split by comma you get 'Mike', '', '13', and you know that the first element is the name, the second is an empty surname and the third is the age. But if you choose to skip empty strings then you'll get 'Mike', '13', and you cannot tell which field is missing.
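As a rough illustration of why the empty field matters, here is the Mike,,13 record split in Java. The class and method names are hypothetical, and passing -1 as the limit is my choice so that a missing last field is kept too.

```java
import java.util.Arrays;

public class CsvFieldsDemo {

    // Splits one name,surname,age record, keeping empty fields in place.
    // The limit of -1 also keeps trailing empty fields, e.g. a missing age.
    public static String[] parseRecord(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String[] mike = parseRecord("Mike,,13");
        System.out.println(Arrays.toString(mike)); // [Mike, , 13]
        System.out.println(mike[1].isEmpty());     // true: the surname is unknown
        // Without -1 the trailing empty field would be dropped.
        System.out.println(parseRecord("Mike,Green,").length); // 3
        System.out.println("Mike,Green,".split(",").length);   // 2
    }
}
```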

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts, which included information about the second parameter of split. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for (one of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't).
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class; otherwise it is treated as a range, like A-Z or 0-9.
You must use the quantifier + to match one or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (and escape the hyphen so that it is taken literally rather than as a range):
"[\\s,.\\-'\\[\\]()]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
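Wrapped in a runnable class (the class and method names are only for illustration), the split above gives the following.

```java
import java.util.Arrays;

public class SplitSentence {

    // A run of one or more delimiter characters counts as a single separator.
    public static String[] tokens(String s) {
        return s.split("[\\s,.\\-\\[\\]()']+");
    }

    public static void main(String[] args) {
        String s = "This is a very nice day, isn't it";
        System.out.println(Arrays.toString(tokens(s)));
        // [This, is, a, very, nice, day, isn, t, it]
    }
}
```

Because ' is in the delimiter list, "isn't" comes out as "isn" and "t". Drop the apostrophe from the character class if you want to keep such words whole.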
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter.on(',')
        .trimResults()
        .omitEmptyStrings()
        .split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter
        .onPattern("([,.|])")
        .trimResults()
        .omitEmptyStrings()
        .split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
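If you would rather not add Guava, a similar effect can be sketched with the JDK alone (Java 8+ streams). This is not the Splitter API, just an approximation of trimResults() plus omitEmptyStrings(); the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OmitEmptyTokens {

    // Splits on the given regex, trims each piece and drops the empty ones.
    public static List<String> split(String input, String regex) {
        return Arrays.stream(input.split(regex))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,, qux", ","));         // [foo, bar, qux]
        System.out.println(split("foo|bar,, qux.hi", "[,.|]"));  // [foo, bar, qux, hi]
    }
}
```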
