Why does java String.split() leave behind empty strings? - java

When I use the String.split() method, how come sometimes I get empty strings? For example, if I do:
"(something)".split("\\W+")
Then the first element of the return value will be an empty string. Also, the example from the documentation (as seen here) doesn't make sense either.
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
How come the ":" is used as the delimiter, there are no empty strings, but with "o" there are?

With:
"(something)".split("\\W+")
it's assuming the delimiter comes between fields, so what you end up with is:
"" "something" "" <- fields
( ) <- delimiters
You could fix that by trimming the string first to remove any leading or trailing delimiters, something like:
"(something)".replaceAll("^\\W*","").replaceAll("\\W*$","").split("\\W+")
With something like:
"boo:and:foo".split("o", 0)
you'll get:
"b" "" ":and:f" <- fields
o o <- delimiters
because you have consecutive delimiters (which don't exist when the delimiter is ":"), and these are deemed to have an empty field between them.
The reason you don't get trailing blank fields from the "oo" at the end of "foo" is the limit of zero: with that limit, trailing (but not leading) empty fields are removed.
If you also want to get rid of the empty fields in the middle, you can use "o+" as the delimiter instead, since that greedily absorbs consecutive o characters into a single delimiter. You can also use the replaceAll trick shown above to get rid of leading empty fields.
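The cases discussed above can be checked with a short sketch that uses only String.split from the JDK. The class name SplitLimitDemo and the helper trimDelimiters are illustrative names, not part of the original answer; the helper folds the two replaceAll calls into a single pattern.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: strips leading and trailing non-word characters in one pass.
    static String trimDelimiters(String s) {
        return s.replaceAll("^\\W+|\\W+$", "");
    }

    public static void main(String[] args) {
        // The leading "(" produces a leading empty field; the trailing one is dropped (limit 0).
        System.out.println(Arrays.toString("(something)".split("\\W+")));                 // [, something]
        // Trimming the delimiters first leaves only the real field.
        System.out.println(Arrays.toString(trimDelimiters("(something)").split("\\W+"))); // [something]
        // Limit 0 (the default): the empty field between the two o's stays, trailing ones go.
        System.out.println(Arrays.toString("boo:and:foo".split("o", 0)));                 // [b, , :and:f]
        // A negative limit keeps the trailing empty fields as well.
        System.out.println(Arrays.toString("boo:and:foo".split("o", -1)));                // [b, , :and:f, , ]
        // "o+" treats a run of o's as a single delimiter.
        System.out.println(Arrays.toString("boo:and:foo".split("o+")));                   // [b, :and:f]
    }
}
```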

Actually, the reason lies not in which delimiter you choose: in the latter case you have two o characters one right after the other. And what is between them? The empty string.
Maybe it's counterintuitive at first, and you might think it would be better to skip empty strings. But consider two very popular formats for storing data in text files: tab-separated values and comma-separated values.
Let's imagine that you want to store information about people in the format name,surname,age. For example, Peter,Green,12. But what if you want to store information about someone whose surname you don't know? It should look like Mike,,13. Then if you split by comma you get 'Mike', '', '13', and you know that the first element is the name, the second is an empty surname and the third is the age. But if you choose to skip empty strings, you'll get 'Mike', '13', and you cannot tell which field is missing.
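As a rough illustration of the name,surname,age example, the sketch below splits each record with a negative limit so that no empty field is lost, including a trailing one. The class CsvFieldsDemo and the method fields are made-up names, and this naive split ignores quoting, so it is not a real CSV parser.

```java
import java.util.Arrays;

class CsvFieldsDemo {
    // Splits one record on commas; the negative limit keeps trailing empty fields too.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("Peter,Green,12"))); // [Peter, Green, 12]
        // The empty surname keeps its position, so the age is still the third field.
        System.out.println(Arrays.toString(fields("Mike,,13")));       // [Mike, , 13]
        // A missing last field survives only because of the negative limit.
        System.out.println(Arrays.toString(fields("Mike,Green,")));    // [Mike, Green, ]
        System.out.println("Mike,Green,".split(",").length);           // 2
    }
}
```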

Related

"" as result into java split with Regexp

I have string content that I need to split into an array of tokens, but some of the resulting tokens are "", and I need to avoid them with the regexp.
I tried the regexp shown in the screenshot, but it does not solve my problem.
Node content example:
Regexp and the result:
You are splitting your string on spaces (among various other characters).
You'd be better off checking each token, e.g. if (node.equals("")) { /* ignore it or remove it */ }, because whatever you split your string on, you will always have to worry about empty results: your split character could be anywhere in the string. Calling trim on your string before you split it will get rid of all the extra leading and trailing space and, because it's spaces you're splitting on, get rid of those pesky empty values, which, from what I can see in your question, is exactly what's going on.
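Since the question's actual regexp and input are not visible, the following is only a sketch under the assumption that the tokens are separated by whitespace. It combines both suggestions: trim first, then skip any token that is still empty. NonEmptyTokens and tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class NonEmptyTokens {
    // Assumes whitespace-separated input; returns the tokens without any empty strings.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String node : text.trim().split("\\s+")) {
            // Still needed: splitting an empty (or all-blank) string yields one "" element.
            if (!node.isEmpty()) {
                result.add(node);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("  alpha   beta gamma ")); // [alpha, beta, gamma]
        System.out.println(tokens("   "));                   // []
    }
}
```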

Using .split() for multiple characters in Java

I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
split() function accepts a regex.
This is not the regex you're looking for .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? ")
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which covers all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be your correct split with the correct Java string literal, as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
BTW: I've found a particularly useful eclipse editor option which escapes the string when you're pasting them into the quotes. Go to Window/Preferences, under Java/Editor/Typing/, check the box next to Escape text when pasting into a string literal
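To see the character-class split from this answer in action, here is a small sketch. The sample sentence and the names PunctuationSplit, DELIMITERS and words are invented for illustration; the pattern is the same set of characters with the redundant escapes inside the class dropped.

```java
import java.util.Arrays;

class PunctuationSplit {
    // One or more of: . , : ; ( ) [ ] ' \ / ! ? whitespace "
    static final String DELIMITERS = "[.,:;()\\[\\]'\\\\/!?\\s\"]+";

    static String[] words(String input) {
        return input.split(DELIMITERS);
    }

    public static void main(String[] args) {
        // '#' is not in the delimiter class, so "C#" stays in one piece.
        System.out.println(Arrays.toString(words("Hello, world! I use C# (sometimes).")));
        // [Hello, world, I, use, C#, sometimes]
    }
}
```

If the input starts with one of the delimiters, the first element is still an empty string, so a leading-delimiter check or a trim of those characters may be needed as well.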

Replace with empty string replaces newChar around all the characters in original string

I was working on some Java code that uses the String.replace method. While testing it, I considered a situation in which a junk value would be passed, as in String.replace("","");
During that testing I tried replacing the empty value with some other value, i.e. String.replace("","p"), which inserted "p" around every character of the original String.
Example:
String strSample = "val";
strSample = strSample.replace("","p");
System.out.println(strSample);
Output:
pvpaplp
Can anyone please explain why it works like this?
replace looks for each position at which the string being replaced occurs, e.g. if you replace "a" in "banana" it finds "a" 3 times.
However, the empty string is found everywhere, including before the first letter and after the last letter.
Below is the definition from Java docs for the overloaded replace method of your case.
String java.lang.String.replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target
sequence with the specified literal replacement sequence. The
replacement proceeds from the beginning of the string to the end, for
example, replacing "aa" with "b" in the string "aaa" will result in
"ba" rather than "ab".
Parameters:
target The sequence of char values to be replaced
replacement The replacement sequence of char values
Now, since you are defining the target value as "", i.e. empty, it matches at each location in the string and inserts the value defined in replacement there.
It is worth noting that if you use strSample = strSample.replace(" ","p"); (a single whitespace character as the target value), nothing will be replaced, because in this case the replace method searches for a whitespace character and "val" does not contain one.
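The three cases mentioned in this answer (the documentation's "aaa" example, the empty target, and the single-space target) can be tried side by side. This is just a quick check with a made-up class name, LiteralReplaceDemo.

```java
class LiteralReplaceDemo {
    public static void main(String[] args) {
        // Replacement proceeds left to right, so the first "aa" is consumed first.
        System.out.println("aaa".replace("aa", "b")); // ba
        // The empty target matches at every position, including both ends.
        System.out.println("val".replace("", "p"));   // pvpaplp
        // A single space is an ordinary target; "val" contains none, so nothing changes.
        System.out.println("val".replace(" ", "p"));  // val
    }
}
```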
The native Java java.lang.String implementation (like Ruby and Python) considers the empty string "" a valid character sequence when performing string operations. Therefore the "" character sequence is effectively present between every two chars, as well as before the first and after the last character.
It works coherently with all java.lang.String operations. See:
String abc = "abc";
System.out.println(abc.replace("", "a")); // aaabaca instead of "abc"
System.out.println(abc.indexOf("")); // 0 instead of -1
System.out.println(abc.contains("")); // true instead of false
As a side note:
This behavior might be misleading because many other languages / implementations do not behave like this. For instance, SQL (MySQL, MSSQL, Oracle and PostgreSQL) and PHP do not consider "" a valid character sequence for string replacement. .NET goes further and throws System.ArgumentException: String cannot be of zero length. when calling, for instance, abc.Replace("", "a").
Even the popular Apache Commons Lang Java library works differently:
org.apache.commons.lang3.StringUtils.replace("abc", "", "a"); /* abc */
Take a look at this example:
"" + "abc" + ""
What is result of this code?
Answer: it is still "abc". So as you see we can say that all strings have some empty strings before and after it.
Same rule applies in-between characters like
"a"+""+"b"+""+"c"
will still create "abc"
So empty strings also exist between characters.
In your code
"val".replace("","p")
all these empty strings were replaced with p, which results in pvpaplp.
In the case of ""+""+..+""+"", assume that Java is smart enough to see it as one "".
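One way to picture the "empty strings between characters" idea is a hand-written loop that puts the replacement into every gap. This is only a sketch of the observable behaviour for plain characters, not how String.replace is implemented; EmptyMatchSketch and insertAtEveryGap are hypothetical names.

```java
class EmptyMatchSketch {
    // Puts filler before the first char, between all chars, and after the last char.
    static String insertAtEveryGap(String text, String filler) {
        StringBuilder out = new StringBuilder(filler); // the gap before the first char
        for (int i = 0; i < text.length(); i++) {
            out.append(text.charAt(i)).append(filler); // the char, then the gap after it
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertAtEveryGap("val", "p"));                                // pvpaplp
        System.out.println(insertAtEveryGap("val", "p").equals("val".replace("", "p"))); // true
    }
}
```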

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts, which included information about the second parameter of split. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class; otherwise it is treated as a range, like A-Z or 0-9.
You must use the quantifier + to match one or more delimiters.
I think your problem is just the regex itself. You should use a greedy quantifier (and escape the hyphen so that it is not read as a range):
"[\\s,.\\-'\\[\\]()]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
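To make the effect of the quantifier visible, the sketch below splits the question's sentence with and without +. It deliberately uses a reduced delimiter class (whitespace and comma only), which is an assumption made here so that "isn't" stays intact; the class name QuantifierDemo is invented.

```java
import java.util.Arrays;

class QuantifierDemo {
    public static void main(String[] args) {
        String s = "This is a very nice day, isn't it";

        // Without '+', the comma and the following space are two separate delimiters,
        // so an empty string appears between them.
        System.out.println(Arrays.toString(s.split("[\\s,]")));
        // [This, is, a, very, nice, day, , isn't, it]

        // With '+', the run ", " counts as a single delimiter.
        System.out.println(Arrays.toString(s.split("[\\s,]+")));
        // [This, is, a, very, nice, day, isn't, it]
    }
}
```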
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
Splitter
.onPattern("([,.|])")
.trimResults()
.omitEmptyStrings()
.split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
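If adding Guava is not an option, roughly the same result as the onPattern example can be had with the JDK alone (Java 8 or later). This is a sketch rather than a drop-in replacement for Splitter; JdkSplitter and its split method are made-up names.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class JdkSplitter {
    // Same separators as the onPattern example: comma, period, or pipe.
    private static final Pattern SEPARATOR = Pattern.compile("[,.|]");

    static List<String> split(String input) {
        return SEPARATOR.splitAsStream(input)
                .map(String::trim)               // like trimResults()
                .filter(part -> !part.isEmpty()) // like omitEmptyStrings()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("foo|bar,, qux.hi")); // [foo, bar, qux, hi]
    }
}
```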

Java: Do I have an effective regex to eliminate symbols & rename a file?

I have a series of link names from which I'm trying to eliminate special characters. From a brief filewalk, my biggest concerns appear to be brackets, parentheses and colons. After unsuccessfully wrestling with escape characters to SELECT : [ and (, I decided instead to exclude everything I wanted to KEEP in the filename.
Consider:
String foo = inputFilename; // SAMPLE DATA: [Phone]_Michigan_billing_(automatic).html
String scrubbedFoo = foo.replaceAll("[^a-zA-Z-._]", "");
Expected Result: Phone_Michigan_billing_automatic.html
My escape-character regex was approaching 60 characters when I ditched it. The last version I saved before changing strategies was [:.(\\[)|(\\()|(\\))|(\\])] where I thought I was asking for escape-character-[() and ].
The blanket exclude seems to work just fine. Is the Regex really that simple? Any input on how effective this strategy will be? I feel like I'm missing something and need a couple sets of eyes.
In my opinion, you're using the wrong tool for this job. StringUtils has a method named replaceChars that will replace all occurrences of a char with another one. Here's the documentation:
public static String replaceChars(String str,
String searchChars,
String replaceChars)
Replaces multiple characters in a String in one go. This method can also be used to delete characters.
For example:
replaceChars("hello", "ho", "jy") = jelly.
A null string input returns null. An empty ("") string input returns an empty string. A null or empty set of search characters returns the input string.
The length of the search characters should normally equal the length of the replace characters. If the search characters is longer, then the extra search characters are deleted. If the search characters is shorter, then the extra replace characters are ignored.
StringUtils.replaceChars(null, *, *) = null
StringUtils.replaceChars("", *, *) = ""
StringUtils.replaceChars("abc", null, *) = "abc"
StringUtils.replaceChars("abc", "", *) = "abc"
StringUtils.replaceChars("abc", "b", null) = "ac"
StringUtils.replaceChars("abc", "b", "") = "ac"
StringUtils.replaceChars("abcba", "bc", "yz") = "ayzya"
StringUtils.replaceChars("abcba", "bc", "y") = "ayya"
StringUtils.replaceChars("abcba", "bc", "yzx") = "ayzya"
So in your example:
String translated = StringUtils.replaceChars("[Phone]_Michigan_billing_(automatic).html", "[]():", null);
System.out.println(translated);
Will output:
Phone_Michigan_billing_automatic.html
This will be more straightforward and easier to understand than any regex you could write.
I think your regex is the way to go. In general, whitelisting values instead of blacklisting them is almost always better (only allowing characters you KNOW are good instead of eliminating all characters you think are bad). From a security standpoint this regex should be preferred. You will never end up with an inputFilename that has invalid characters.
suggested regex: [^a-zA-Z-._]
I think your regex can be as simple as \W, which matches everything that is not a word character (letters, digits, and underscores). This is the negation of \w.
So your code becomes:
foo.replaceAll("\\W", "");
As pointed out in the comments, the above also removes periods. This will keep periods as well:
foo.replaceAll("[^\\w.]", "");
Details: remove everything that is not (the ^ inside the character class) a digit, underscore, or letter (the \w) or a period (the .).
As noted above, there may be other chars you want to whitelist, like -. Just include them in your character class as you go along.
foo.replaceAll("[^\\w.\\-]", "");
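Applied to the sample file name from the question, the whitelist approach looks roughly like this. FilenameScrubber and scrub are illustrative names; the pattern keeps word characters, periods and hyphens and drops everything else.

```java
class FilenameScrubber {
    // Removes every character that is not a letter, digit, underscore, period, or hyphen.
    static String scrub(String name) {
        return name.replaceAll("[^\\w.\\-]", "");
    }

    public static void main(String[] args) {
        String sample = "[Phone]_Michigan_billing_(automatic).html";
        System.out.println(scrub(sample));                 // Phone_Michigan_billing_automatic.html
        // Plain \W also strips the period before the extension.
        System.out.println(sample.replaceAll("\\W", "")); // Phone_Michigan_billing_automatichtml
    }
}
```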
