foo.split(',').length != number of ',' found in 'foo'?

Maybe it's because it's end of day on a Friday, and I have already found a work-around, but this is killing me.
I am using Java but am a .NET developer.
I have a string and I need to split it on comma. Let's say it's a row in a CSV file that has 210 columns. line.split(',').length will sometimes be 199, while the count of ',' will be 208 or 209. I counted in two different ways to be sure (using a regex, then, after losing my sanity, manually looping through and checking each character).
What's the super-obvious-hit-face-on-desk thing I'm missing here? Why isn't foo.split(delim).length == CountOfOccurences(foo,delim) all the time, only sometimes?
thanks much

First, there's an obvious difference of one. If there are 200 columns, all with text, there are 199 commas. Second, Java drops trailing empty strings by default. You can change this by passing a negative number as the second argument.
"foo,,bar,baz,,".split(",")
is:
{foo,,bar,baz}
an array of 4 elements. But
"foo,,bar,baz,,".split(",", -1)
is:
{foo,,bar,baz,,}
with all 6.
Note that only trailing empty strings are dropped by default.
Finally, don't forget that the String argument is compiled into a regex. This is not applicable here, since , is not a special character in a regex, but you should keep it in mind.
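A minimal sketch of the points above, using a hypothetical helper class SplitCount (the class and method names are illustrative, not from the original post). It assumes a single-character delimiter and shows that with a limit of -1 the number of pieces is always the number of delimiters plus one, while the default call can return fewer pieces. The last line uses Pattern.quote to treat a regex metacharacter as a literal delimiter.

```java
import java.util.regex.Pattern;

public class SplitCount {
    // Counts occurrences of a single character by scanning the string.
    static int countChar(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String row = "foo,,bar,baz,,";
        System.out.println(countChar(row, ','));        // 5 commas
        System.out.println(row.split(",").length);      // 4: trailing empty strings dropped
        System.out.println(row.split(",", -1).length);  // 6: number of commas + 1

        // '|' is a regex metacharacter, so quote it to split on the literal character.
        System.out.println("a|b|c".split(Pattern.quote("|")).length); // 3
    }
}
```

With the -1 limit, comparing split(...).length against the delimiter count plus one is a quick sanity check for a CSV row that has no quoted fields.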

There are a couple things happening. First, if you have three items like a,b,c and split on comma, you'll have three entries, one more than the number of commas.
But what you're dealing with probably comes from consecutive delimiters: a,,,,b,c,,,,,
The ones at the end get dropped. Check the java documentation for the split function.
http://download.java.net/jdk7/docs/api/java/lang/String.html

As others have pointed out, String.split has some very non-intuitive behaviour.
If you're using Google's Guava open-source Java library, there's a Splitter class which gives a much nicer (in my opinion) API for this, with more flexibility:
String input = "foo, bar,";
Splitter.on(',').split(input);
// returns "foo", " bar", ""
Splitter.on(',').omitEmptyStrings().split(input);
// returns "foo", " bar"
Splitter.on(',').omitEmptyStrings().trimResults().split(input);
// returns "foo", "bar"

Is it omitting blanks?
Do you have something like "a,b,c,,d,e" or trailing delimiters like "a,b,c,,,,"?
Are there extra delimiters in the cell data?

Short example: foo = "1,2" and
foo.split(",").length = 2
count(foo, ",") = 1
Probably you have a mistake in your code. Here is an example in Java code:
// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String row = "1,2,3,4,,5"; // second example: 1,2,3,5,,
System.out.println(row.split(",").length); // prints 6 for the first example, 4 for the second (trailing empty strings are dropped)
// code to count how many , you have in your row
Pattern pattern = Pattern.compile(",");
Matcher m = pattern.matcher(row);
int nr = 0;
while (m.find()) {
nr++;
}
System.out.println(nr); // prints 5 for both examples

Related

Having problem when comparing two strings

I'm reading a CSV file using Java. Inside the file, each row is in this format:
operation, start, end.
I need to do a different operation for different input. But something weird happens when I try to compare two strings.
I use equals to compare the strings. One of the operations is "add", but the first element I fetch from the document always gives me the wrong answer. I know it's an "add", and when I print it out it looks like an "add", but operation.equals("add") is false. For all the other strings it's correct, only the first one fails. Is there anything special about the first row in a CSV file?
Here is my code:
while ((line = br.readLine()) != null) {
String[] data = line.split(",");
String operation = data[0];
int start = Integer.parseInt(data[1]);
int end = Integer.parseInt(data[2]);
System.out.println(operation + " " + start + " " + end);
System.out.println(operation.equals("add"));
}
For example, it printed out
add 1 3
false
add 4 6
true
And I really don't know why. These two adds look exactly the same.
And here is what my CSV file looks like:
(screenshot of the CSV file, not reproduced here)

There are (at least) 4 reasons why two strings that "look" the same when you display or print them could turn out to be non-equal:
1) If you compare Strings using == rather than equals(Object), then you will often get the wrong answer. (This is not the problem here, since you are using the equals method. However, this is a common problem.)
2) Unexpected leading or trailing whitespace characters on one string. These can be removed using trim().
3) Other leading, trailing or embedded control characters or "funky" Unicode characters. For example, stray Unicode BOM (byte order mark) characters.
4) Homoglyphs. There are a number of examples where two or more distinct Unicode code points are rendered on the screen using the same or virtually the same glyphs.
Cases 3 and 4 can only be reliably detected by using traceprints or a debugger to examine the lengths and the char values in the two strings.
(Screen shots of the CSV file won't help us to diagnose this! A cut-and-paste of the CSV file might help.)
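One way to examine the char values, sketched with a hypothetical StringInspector class (the names are illustrative). The example simulates a first field that starts with a byte order mark (U+FEFF), which is only one of the possible causes listed above. Note that trim() does not remove U+FEFF, because it only strips characters up to U+0020.

```java
public class StringInspector {
    // Returns the code points of s as hex values, e.g. "U+FEFF U+0061".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Removes one leading byte order mark (U+FEFF), if present.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String operation = "\uFEFFadd"; // simulated first field of a file that starts with a BOM
        System.out.println(operation.equals("add"));           // false
        System.out.println(operation.length());                // 4, not 3
        System.out.println(codePoints(operation));             // U+FEFF U+0061 U+0064 U+0064
        System.out.println(stripBom(operation).equals("add")); // true
    }
}
```

Printing the length and code points of the first field from the real file shows directly whether an extra character is present.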

You should remove the double quotes from the first element and then check with equals method.
Try this:
operation = operation.substring(1, operation.length() - 1);
operation.equals("add")
Hope it works for you.

The line in your image looks fine. I suspect you are reading the document with the wrong encoding. For example, if the file is UTF-encoded and you do not specify that when reading it, the special header at the beginning of the file (the byte order mark) ends up in the first word, which could be why the first word is read incorrectly.

HashMap .get() returns null for every value except the last one

I have a file with different names like this:
Thomas Danny
Jack Thomas
Danny Mike
Thomas Kate
Victor James
Edit: I have single spaces between names so it splits correctly, this isn't the problem.
Each pair represents who invited who to the party
My task is to get to the bottom of the cycle via HashMap.
For example, when given "Kate" as an argument, the program needs to print out "Jack", because Thomas invited Kate and Jack invited Thomas.
My code so far:
Map<String, String> whoInvited = new HashMap<>();
String[] pairs = text.split("\n");
for (String pair : pairs){
String[] invite = pair.split(" ");
whoInvited.put(invite[1], invite[0]);
}
System.out.println(whoInvited.size());
//Returns 5
System.out.println(whoInvited.get("Danny"));
//Returns null
System.out.println(whoInvited.get("James"));
//The only one that returns anything besides null(returns "Victor")
String lookingFor = "Kate";
while (whoInvited.containsKey(lookingFor)){
lookingFor = whoInvited.get(lookingFor);
}
System.out.println(lookingFor);
}
I don't understand how the HashMap gets messed up like this. If I use the .get() function inside the for-loop, it gives me the value perfectly, but right after the loop ends it is messed up and only has the last value.
Printing out whoInvited gives me just "=Victor}"
EDIT: FIXED!

It is not putting nulls; you have space-separated content. When you split on a space, you get many empty values because you have multiple consecutive spaces:
"Thomas    Danny".split(" ")
==> String[5] { "Thomas", "", "", "", "Danny" }
This explains why invite[1] resolves to a blank string. And because map keys are unique, each blank key overwrites the preceding one, and you're left with just the last one.
You can get around the problem by just splitting on any number of consecutive spaces:
"Thomas    Danny".split(" +")
==> String[2] { "Thomas", "Danny" }

The problem you have is caused by the fact that line endings differ on different Operating Systems. Windows uses \r\n while Unix uses \n. When splitting on \n, some weird stuff happens because of the \r. Replacing text.split("\n") with text.split("\r\n") will fix the problem for Windows.

You're splitting the lines on whitespace, but this will cause empty strings to be emitted between whitespaces within the string. For your example this means invite[1] will almost always be "".
This means the key will always be the same value, resulting in only the last value being retained.
To fix this you should split using a regex expression to handle multiple whitespaces like \s+, then the values in the split will be actual strings.

There is more than one space between the names. You should use a regex for splitting.
Try pair.split(" +")
Update:
Is your file format Unix or Windows?
Try this for splitting new line:
String lines[] = string.split("\r?\n");
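A minimal sketch that combines the suggestions from the answers above, using a hypothetical InviteChain class (names are illustrative). It assumes Java 8 or later, where the regex \R matches \n, \r\n and \r, and it assumes the invitations contain no cycle, since a cycle would make the lookup loop forever.

```java
import java.util.HashMap;
import java.util.Map;

public class InviteChain {
    // Builds invitee -> inviter from text with one "inviter invitee" pair per line.
    static Map<String, String> parse(String text) {
        Map<String, String> whoInvited = new HashMap<>();
        for (String line : text.split("\\R")) {          // any line ending, no stray \r left in keys
            String[] invite = line.trim().split("\\s+"); // one or more whitespace characters
            if (invite.length == 2) {                    // skip blank or malformed lines
                whoInvited.put(invite[1], invite[0]);
            }
        }
        return whoInvited;
    }

    // Follows the chain of inviters until reaching someone nobody invited.
    static String root(Map<String, String> whoInvited, String name) {
        while (whoInvited.containsKey(name)) {
            name = whoInvited.get(name);
        }
        return name;
    }

    public static void main(String[] args) {
        String text = "Thomas Danny\r\nJack   Thomas\r\nDanny Mike\r\nThomas Kate\r\nVictor James\r\n";
        Map<String, String> whoInvited = parse(text);
        System.out.println(whoInvited.get("Danny"));  // Thomas
        System.out.println(root(whoInvited, "Kate")); // Jack
    }
}
```

A key that silently ends in \r also explains the odd "=Victor}" output when printing the map: the carriage return moves the cursor back to the start of the line, so later text overwrites what was printed before.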

Regular Expression to Match Number of Lines and Characters per Line

I'm trying to make sure that a string contains between 0 and 3 lines, and that for a given line that is present that it contains 0 to 100 characters. It would need to be a valid expression for JavaScript and Java. Like many people doing RegEx I'm copying from various spots on the Internet.
Working backwards I think ^.{0,100}$ gets me the "line contains 0 to 100 characters", but trying to group that as (^.{0,100}$){0,3} doesn't work.
The new line character is probably part of my problem, so I ended up with something like .{0,100}(?:\n.{0,100}){0,2} trying to say "a line of 0 to 100 characters optionally followed by 0 to 2 instances of a new line and 0 to 100 more characters", but that also failed.
Up until now I got those expressions from other people. Using an online test tool I finally monkeyed this together: ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ which appears to work.
So, my question is, am I missing any pitfalls in ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ given what I'm after? Furthermore, even if that does work is it the best expression to use?

I think what you have will work fine. You can make the line break part a little more compact if you want, and you don't need ^ and $ if you are using matches():
String regex = ".{0,100}(?:[\r\n]+.{0,100}){0,2}";
EDIT
After some more thought I realized the newline suggestion above will match 4 (or more) lines as long as a couple of them are empty. So, we are back to your suggested example. Oh well, at least the start and end anchors can be omitted.
String regex = ".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}";
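A small check of the final expression, wrapped in a hypothetical LineLimitCheck class (the name and test strings are illustrative). It relies on matches() having to match the entire input, which is why no ^ or $ is needed, and on the Java default that . does not match line terminators.

```java
import java.util.regex.Pattern;

public class LineLimitCheck {
    // At most 3 lines of at most 100 characters each.
    static final Pattern LIMIT =
            Pattern.compile(".{0,100}(?:(?:\\r\\n|[\\r\\n]).{0,100}){0,2}");

    static boolean isValid(String s) {
        return LIMIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String longLine = new String(new char[101]).replace('\0', 'x'); // 101 characters
        System.out.println(isValid(""));            // true: zero lines
        System.out.println(isValid("a\r\nb\nc"));   // true: three lines, mixed line endings
        System.out.println(isValid("a\nb\nc\nd"));  // false: four lines
        System.out.println(isValid("a\n\n\nb"));    // false: four lines, two of them empty
        System.out.println(isValid(longLine));      // false: line longer than 100 characters
    }
}
```

The fourth case is the one the [\r\n]+ variant would wrongly accept, because one + can swallow several consecutive line breaks.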

I'm not very good at regular expressions but would this work?
^.{0,100}\n?(.{0,100}\n)?.{0,100}?$
Again I'm still new to reg exp, so if there is an error(which is likely) please tell me.

Why does "split" on an empty string return a non-empty array?

Split on an empty string returns an array of size 1 :
scala> "".split(',')
res1: Array[String] = Array("")
Consider that this returns empty array:
scala> ",,,,".split(',')
res2: Array[String] = Array()
Please explain :)

If you split an orange zero times, you have exactly one piece - the orange.

The Java and Scala split methods operate in two steps like this:
First, split the string by delimiter. The natural consequence is that if the string does not contain the delimiter, a singleton array containing just the input string is returned.
Second, remove all the rightmost empty strings. This is the reason ",,,".split(",") returns empty array.
According to this, the result of "".split(",") should be an empty array because of the second step, right?
It should. Unfortunately, this is an artificially introduced corner case. And that is bad, but at least it is documented in java.util.regex.Pattern, if you remember to take a look at the documentation:
For n == 0, the result is as for n < 0, except trailing empty strings
will not be returned. (Note that the case where the input is itself an
empty string is special, as described above, and the limit parameter
does not apply there.)
Solution 1: Always pass -1 as the second parameter
So, I advise you to always pass n == -1 as the second parameter (this will skip step two above), unless you specifically know what you want to achieve / you are sure that the empty string is not something that your program would get as an input.
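The corner case and the effect of the limit can be seen side by side in a short sketch. SplitCornerCases and its pieces helper are hypothetical names used only for illustration; a limit of 0 is what the one-argument split uses.

```java
public class SplitCornerCases {
    // Number of pieces produced by splitting s on "," with the given limit.
    static int pieces(String s, int limit) {
        return s.split(",", limit).length;
    }

    public static void main(String[] args) {
        System.out.println(pieces("", 0));     // 1: special case, the empty input itself is returned
        System.out.println(pieces(",,,", 0));  // 0: every piece is a trailing empty string
        System.out.println(pieces("a,,", 0));  // 1: only "a" survives

        System.out.println(pieces("", -1));    // 1
        System.out.println(pieces(",,,", -1)); // 4: three delimiters, four pieces
        System.out.println(pieces("a,,", -1)); // 3
    }
}
```

With -1 the result is consistent: an input with k delimiters always yields k + 1 pieces, including the empty input with zero delimiters.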
Solution 2: Use Guava Splitter class
If you are already using Guava in your project, you can try the Splitter (documentation) class. It has a very rich API, and makes your code very easy to understand.
Splitter.on(".").split(".a.b.c.") // "", "a", "b", "c", ""
Splitter.on(",").omitEmptyStrings().split("a,,b,,c") // "a", "b", "c"
Splitter.on(CharMatcher.anyOf(",.")).split("a,b.c") // "a", "b", "c"
Splitter.onPattern("=>?").split("a=b=>c") // "a", "b", "c"
Splitter.on(",").limit(2).split("a,b,c") // "a", "b,c"

Splitting an empty string returns the empty string as the first element. If no delimiter is found in the target string, you will get an array of size 1 that is holding the original string, even if it is empty.

For the same reason that
",test" split ','
and
",test," split ','
will return an array of size 2. Everything before the first match is returned as the first element.

"a".split(",") -> "a"
therefore
"".split(",") -> ""

In all programming languages I know a blank string is still a valid String. So doing a split using any delimiter will always return a single element array where that element is the blank String. If it was a null (not blank) String then that would be a different issue.

This split behavior is inherited from Java, for better or worse...
Scala does not override the definition from the String primitive.
Note, that you can use the limit argument to modify the behavior:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
i.e. you can set the limit=-1 to get the behavior of (all?) other languages:
# ",a,,b,,".split(",")
res1: Array[String] = Array("", "a", "", "b")
# ",a,,b,,".split(",", -1) // limit=-1
res2: Array[String] = Array("", "a", "", "b", "", "")
It seems to be well known that the Java behavior is quite confusing, but:
The behavior above can be observed from at least Java 5 to Java 8.
There was an attempt to change the behavior to return an empty array when splitting an empty string in JDK-6559590. However, it was soon reverted in JDK-8028321 when it caused regressions in various places. The change never made it into the initial Java 8 release.
Note: The split method wasn't in Java from the beginning (it's not in 1.0.2) but has been there since at least 1.4 (e.g. see JSR51 circa 2002). I am still investigating...
What's unclear is why Java chose this in the first place (my suspicion is that it was originally an oversight/bug in an "edge case"), but it is now irrevocably baked into the language and so it remains.

Empty strings have no special status while splitting a string. You may use:
Some(str)
.filter(_ != "")
.map(_.split(","))
.getOrElse(Array())

Use this function:
public static ArrayList<String> split(String body) {
return new ArrayList<>(Arrays.asList(Optional.ofNullable(body).filter(a->!a.isEmpty()).orElse(",").split(",")));
}

How do I split a concatenated string into multiple floating point values?

I'm a beginner in Java. I have
packet=090209153038020734.0090209153039020734.0
I want to split this string and store it in an array as two strings:
1) 090209153038020734.0
2) 090209153039020734.0
I have done like this:
String packetArray[] = packets.split(packets,Constants.SF);
Where:
Constants.SF=0x01.
But it won't work.
Please help me.

I'd think twice about using split since those are obviously fixed width fields.
I've seen them before on another question here (several in fact, so I'm guessing this may be homework or a popular data collection device :-)) and it's plain that the protocol is:
STX (0x01).
0x0f.
date (YYMMDD or DDMMYY).
time (HHMMSS).
0x02.
value (XXXXXX.X).
0x03.
0x04.
And, given that they're fixed width, you should probably just use substrings to get the information out.
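A sketch of the substring approach, assuming the layout listed above is accurate and that every record is exactly 25 characters including the control characters (1 + 1 + 6 + 6 + 1 + 8 + 1 + 1). PacketParser and the offsets are illustrative assumptions derived from that list, not a verified protocol specification.

```java
public class PacketParser {
    // Assumed record length, derived from the field widths in the list above.
    static final int RECORD_LENGTH = 25;

    // Returns {date, time, value} for a single fixed-width record.
    static String[] parseRecord(String record) {
        if (record.length() != RECORD_LENGTH) {
            throw new IllegalArgumentException("expected " + RECORD_LENGTH
                    + " characters, got " + record.length());
        }
        String date = record.substring(2, 8);    // after STX and 0x0f
        String time = record.substring(8, 14);
        String value = record.substring(15, 23); // after 0x02, before 0x03
        return new String[] {date, time, value};
    }

    public static void main(String[] args) {
        // Simulated record built from the first value in the question.
        String record = "\u0001\u000f" + "090209" + "153038" + "\u0002" + "020734.0" + "\u0003\u0004";
        String[] fields = parseRecord(record);
        System.out.println(fields[0]);                     // 090209
        System.out.println(fields[1]);                     // 153038
        System.out.println(Double.parseDouble(fields[2])); // 20734.0
    }
}
```

A stream of concatenated records could then be processed by cutting it into RECORD_LENGTH-sized chunks and calling parseRecord on each one.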

The JavaDoc of String is helpful here: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html
You have your String packet;
String.indexOf(String) gives you the position of a given substring. You're interested in the "." sign. So you write
int position = packet.indexOf(".") + 2;
+2 because you want the "." and the single decimal digit after it. For the example packet this gives 20, the index just past the end of the first number.
Then we use substring:
String first = packet.substring(0, position); will give you everything up to and including the first ".0".
String second = packet.substring(position); will give you everything after the first ".0", up to the end of the string.
Now if you want them explicitly in an array you can just put them there. The code as a whole (this assumes exactly one digit after each decimal point):
int position = packet.indexOf(".") + 2;
String first = packet.substring(0, position);
String second = packet.substring(position);
String[] packetArray = new String[2];
packetArray[0] = first;
packetArray[1] = second;

String packetArray[] = packets.split("\u0001");
should work. You are using
public String[] split(String regex, int limit)
which is doing something else: It makes sure that split() returns an array with at most limit members (1 in this case, so you get what you ask for).

You need to read the Javadocs for the String.split() methods...you are calling the version of String.split() that takes a regular expression and a limit, but you are passing the string itself as the first parameter, which doesn't really make sense.
As Aaron Digulla mentioned, use the other version.

You don't say how you want to do the split. It could be based on a fixed length (number of characters), or on each value having one decimal place.
If the former, you could do:
packetArray = new String[]{packet.substring(0, 20), packet.substring(20)};
If the latter:
int dotIndex = packet.indexOf('.');
packetArray = new String[]{packet.substring(0, dotIndex + 2), packet.substring(dotIndex + 2)};
Your solution confuses the regexp with the string.

split uses regular expressions, as documented in the String Javadoc. Your code seems to be trying to split on the whole string as the pattern, with Constants.SF = 0x01 as the limit, which doesn't make much sense. If you know what char the boxes are then you can use something like {[^c]+cc} where c is the character of the box (I guess this is 0x01), to match each "packet".
I think you are trying to use it like the .net String.Split(...) function?
