Java string replace and the NUL (NULL, ASCII 0) character? - java

Testing out someone else's code, I noticed a few JSP pages printing funky non-ASCII characters. Taking a dip into the source I found this tidbit:
// remove any periods from first name e.g. Mr. John --> Mr John
firstName = firstName.trim().replace('.','\0');
Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a C-string. Would this be the culprit to the funky characters?

Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a c-string.
That depends on how you define what is working. Does it replace all occurrences of the target character with '\0'? Absolutely!
String s = "food".replace('o', '\0');
System.out.println(s.indexOf('\0')); // "1"
System.out.println(s.indexOf('d')); // "3"
System.out.println(s.length()); // "4"
System.out.println(s.hashCode() == 'f'*31*31*31 + 'd'); // "true"
Everything seems to work fine to me! indexOf can find it, it counts as part of the length, and its value for hash code calculation is 0; everything is as specified by the JLS/API.
It DOESN'T work if you expect replacing a character with the null character would somehow remove that character from the string. Of course it doesn't work like that. A null character is still a character!
String s = Character.toString('\0');
System.out.println(s.length()); // "1"
assert s.charAt(0) == 0;
It also DOESN'T work if you expect the null character to terminate a string. It's evident from the snippets above, but it's also clearly specified in JLS (10.9. An Array of Characters is Not a String):
In the Java programming language, unlike C, an array of char is not a String, and neither a String nor an array of char is terminated by '\u0000' (the NUL character).
Would this be the culprit to the funky characters?
Now we're talking about an entirely different thing, i.e. how the string is rendered on screen. Truth is, even "Hello world!" will look funky if you use a dingbats font. A Unicode string may look funky in one locale but not the other. Even a properly rendered Unicode string containing, say, Chinese characters, may still look funky to someone from, say, Greenland.
That said, the null character will probably look funky regardless; usually it's not a character that you want to display. Still, since the null character is not the string terminator, Java is more than capable of handling it one way or another.
Now to address what we assume is the intended effect, i.e. removing all periods from a string, the simplest solution is to use the replace(CharSequence, CharSequence) overload.
System.out.println("A.E.I.O.U".replace(".", "")); // AEIOU
The replaceAll solution is mentioned here too, but that works with regular expressions, which is why you need to escape the dot metacharacter, and it is likely to be slower.
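To put the three variants side by side, here is a minimal sketch. The class and helper names (PeriodRemovalDemo, stripPeriods, stripPeriodsRegex) are made up for illustration and are not from the original code.

```java
final class PeriodRemovalDemo {
    // Hypothetical helper: removes every '.' using the literal (non-regex) overload.
    static String stripPeriods(String name) {
        return name.trim().replace(".", "");
    }

    // Same result via a regular expression; the dot is escaped because it is a metacharacter.
    static String stripPeriodsRegex(String name) {
        return name.trim().replaceAll("\\.", "");
    }

    public static void main(String[] args) {
        String original = " Mr. John ";

        // The char overload swaps '.' for NUL, so nothing is removed.
        String withNul = original.trim().replace('.', '\0');
        System.out.println(withNul.length());      // 8, same as "Mr. John"
        System.out.println(withNul.indexOf('\0')); // 2, the NUL sits where the '.' was

        System.out.println(stripPeriods(original));      // Mr John
        System.out.println(stripPeriodsRegex(original)); // Mr John
    }
}
```

The char overload cannot delete anything because there is no "empty" char to replace with; only the CharSequence overload or a regex replacement can shorten the string.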

Should be probably changed to
firstName = firstName.trim().replaceAll("\\.", "");

I think it should be the case. To erase the character, you should use replace(".", "") instead.

Does replacing a character in a String with a null character even work in Java?
No.
Would this be the culprit to the funky characters?
Quite likely.

This does cause "funky characters":
System.out.println( "Mr. Foo".trim().replace('.','\0'));
produces:
Mr[] Foo
in my Eclipse console, where the [] is shown as a square box. As others have posted, use the String overload, replace(".", ""), instead.

Related

Java String created with byte[] bug [duplicate]

I have a string that I am creating, and I need to add multiple "\0" (null) characters to the string. Between each null character, is other text data (Just ASCII alphanumeric characters).
My problem is that in J2SE when you add the first null (\0), java then seems to determine that it's a string terminator, (similar to C++), and ignores all other data being appended. No error is raised, the trailing data is just ignored. I need to force the additional trailing data after a null in the string. I have to do this for a legacy database that I am supporting.
I have tried to encode/decode the string in hoping that something like %00 would fool the interpretation of the string behaviour, but when I re-encode the string, Java sees the null character again, and removes all data after the first null.
Update: Here is the relevant code snippet. Yes, I am trying to use Strings. I intend to try chars, but I still have to save it into the database as a string, so I suspect that I will end up with the same problem.
Some background. I am receiving data via HTTP post that has "\n". I need to remove the newlines and replace them with "\0". The "debug" method is just a simple method that does System.out.println.
String[] arrLines = sValue.split("\n");
for(int k=0;k<arrLines.length;k++) {
if (0<k) {
sNewValue += "\0";
}
sNewValue+= arrLines[k];
debug("New value =" + sNewValue);
}
sNewValue, a String, is committed to the database and needs to be done as a String. What I am observing when I display the current value of sNewValue after each iteration in the console is something like this:
input is value1\nValue2\nValue3
Output in the console is giving me from this code
value1
value1
value1
I am expecting
value1
value1 value2
value1 value2 value3
with non-printable null between value1, value2 and value3 respectively. Note that the value actually getting saved back into the database is also just "value1". So, it's not just a console display problem. The data after \0 is getting ignored.
I strongly suspect this is nothing to do with the text in the string itself - I suspect it's just how it's being displayed. For example, try this:
public class Test {
public static void main(String[] args) {
String first = "first";
String second = "second";
String third = "third";
String text = first + "\0" + second + "\0" + third;
System.out.println(text.length()); // Prints 18
}
}
This prints 18, showing that all the characters are present. However, if you try to display text in a UI label, I wouldn't be surprised to see only first. (The same may be true in fairly weak debuggers.)
Likewise you should be able to use:
char c = text.charAt(7);
And now c should be 'e' which is the second letter of "second".
Basically, I'd expect the core of Java not to care at all about the fact that it contains U+0000. It's just another character as far as Java is concerned. It's only at boundaries with native code (e.g. display) that it's likely to cause a problem.
If this doesn't help, please explain exactly what you've observed - what it is that makes you think the rest of the data isn't being appended.
EDIT: Another diagnostic approach is to print out the Unicode value of each character in the string:
for (int i = 0; i < text.length(); i++) {
System.out.println((int) text.charAt(i));
}
I suggest you use a char[] or List<Character> instead since it sounds like you're not really using a String as such (a real String doesn't normally contain nulls or other unprintable characters).
Same behavior for the StringBuffer class?
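A quick way to check this is to build the value with StringBuilder and StringBuffer and look at the lengths. This is only a sketch; NulBuilderDemo and joinWithNul are hypothetical names, and it says nothing about what a particular console or database driver does with the result.

```java
final class NulBuilderDemo {
    // Hypothetical helper: joins the parts with U+0000 between them.
    static String joinWithNul(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append('\0'); // an ordinary char as far as the builder is concerned
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = joinWithNul("value1", "value2", "value3");
        System.out.println(text.length());             // 20: 18 letters/digits plus 2 NULs
        System.out.println(text.endsWith("value3"));   // true, the data after the NUL is kept

        StringBuffer buf = new StringBuffer("a").append('\0').append("b");
        System.out.println(buf.length());              // 3
    }
}
```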
Since "\0" causes some trouble, I would recommend not using it inside the program.
I would use some better delimiter while processing and replace it with "\0" only when actually writing the string to your DB.
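One way to read this suggestion: keep the newline as the in-memory delimiter and convert to NUL only at the boundary, with a separate helper that makes NULs visible in logs. The sketch below assumes that reading; NulBoundaryDemo, toNulDelimited and visible are invented names.

```java
import java.nio.charset.StandardCharsets;

final class NulBoundaryDemo {
    // Hypothetical helper: converts newline-delimited text to NUL-delimited text
    // just before it is handed to the storage layer.
    static String toNulDelimited(String newlineDelimited) {
        return newlineDelimited.replace('\n', '\0');
    }

    // Hypothetical helper: makes NULs visible for logging by printing each one
    // as the two characters backslash and zero.
    static String visible(String s) {
        return s.replace("\0", "\\0");
    }

    public static void main(String[] args) {
        String stored = toNulDelimited("value1\nValue2\nValue3");
        System.out.println(visible(stored)); // value1\0Value2\0Value3
        // All 20 characters survive encoding to bytes as well.
        System.out.println(stored.getBytes(StandardCharsets.UTF_8).length); // 20
    }
}
```

Logging through something like visible() helps separate "the data is missing" from "the console stops drawing at the NUL".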
This is because \ is an escape character in Java (as in many C-related languages) and you need to escape it using additional \ as follows.
String str="\\0Java language";
System.out.println(str);
and you should be able to display \0Java language on the console.

Escape character '\' doesn't show in System.out.println() but in return value

In Java, when I replace characters in a String with escaped-characters, the characters show up in the return value, although they were not there according to System.out.println.
String[][][] proCategorization(String[] pros, String[][] preferences) {
String str = "wehnquflkwe,wefwefw,wefwefw,wefwef";
String strReplaced = str.replace(",","\",\""); //replace , with ","
System.out.println(strReplaced);
The console output is: wehnquflkwe","wefwefw","wefwefw","wefwef
String[][][] array3d = new String[1][1][1]; // initialize 3d array
array3d[0][0][0] = strReplaced;
System.out.println(array3d[0][0][0]);
return array3d;
}
The console output is:
wehnquflkwe","wefwefw","wefwefw","wefwef
Now the return value is:
[[["wehnquflkwe\",\"wefwefw\",\"wefwefw\",\"wefwef"]]]
I don't understand why the \ show up in the return value but not in the System.out.println.
Characters in memory can be represented in different ways.
Your integrated development environment (IDE) has a debugger that chooses to represent a String[][][] with a single element that contains the characters
wehnquflkwe","wefwefw","wefwefw","wefwef
as a java-quoted string
"wehnquflkwe\",\"wefwefw\",\"wefwefw\",\"wefwef"
this makes a lot of sense, because you can then copy and paste this string into java code without any loss.
On the other hand, your system's console, and the IDE's built-in terminal emulator, will output the characters in their normal representation, that is, without any java string-escape-characters:
wehnquflkwe","wefwefw","wefwefw","wefwef
As an experiment, you may want to check what happens with other "special" characters, such as \t (a tab) or \b (backspace). This is just the tip of the iceberg - characters in Java generally translate into Unicode code points, which may or may not be supported by the fonts available in your system or terminal. The IDE's way of representing characters as java-quoted strings allows it to losslessly represent pretty much anything; System.out.println's output is a lot more variable.
System.out.println writes the characters of the String as they are, without adding any Java escape sequences.
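To see the two representations next to each other, here is a rough sketch of a helper that renders a string as a Java-style quoted literal. It only approximates what a debugger might show; JavaQuoteDemo and quote are hypothetical names, and the set of escapes handled is deliberately small.

```java
final class JavaQuoteDemo {
    // Hypothetical helper: renders a string as a Java-style quoted literal,
    // escaping quotes, backslashes and a few control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                default:
                    if (c < 0x20 || c == 0x7f) {
                        // Other control characters become a four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        String strReplaced = "a,b".replace(",", "\",\"");
        System.out.println(strReplaced);        // a","b
        System.out.println(quote(strReplaced)); // "a\",\"b"
        System.out.println(strReplaced.length()); // 5, no backslashes are stored
    }
}
```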
On the other hand, when you stop the application flow using a breakpoint you are able to look up the values.
Most of the IDEs display escape characters with \ to indicate that it's just one String, not String[] in this case, or not to split the String into two lines if it contains \n in the middle.
Just in case, you still have doubts, I suggest printing strReplaced.length(). This should allow you to count characters one by one.
Possible experiments:
String s = "my cute \n two line String";
System.out.println(s + " length is: " + s.length());

Replacing Unicode character codes with characters in String in Java

I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).
Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week, use org.apache.commons.lang.StringEscapeUtils.
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings into source, remember to also choose a suitable file encoding such as UTF-8.
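If the string really does contain the six literal characters of the escape at run time (for example because it was read from a file rather than written in source), the escapes have to be decoded explicitly. Besides the commons-lang class mentioned above, a minimal standard-library sketch could look like this. UnicodeUnescapeDemo and unescape are hypothetical names, and the helper handles only four-digit escapes, not the other Java escape sequences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescapeDemo {
    // Matches a backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each spelled-out escape with the char it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or a backslash in the decoded char.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "peque\\u00f1o";               // the escape is spelled out as six chars
        System.out.println(raw.length());            // 12
        System.out.println(unescape(raw).length());  // 7
    }
}
```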

Given a string in Java, just take the first X letters

Is there something like a C# Substring for Java? I'm creating a mobile application for a BlackBerry device and due to screen constraints I can only afford to show 13 letters plus three dots for an ellipsis.
Any suggestion on how to accomplish this?
I need bare bones Java and not some fancy trick because I doubt a mobile device has access to a complete framework. At least in my experience working with Java ME a year ago.
You can do exactly what you want with String.substring().
String str = "please truncate me after 13 characters!";
if (str.length() > 16)
str = str.substring(0, 13) + "...";
String foo = someString.substring(0, Math.min(13, someString.length()));
Edit: Just for general reference, as of Guava 16.0 you can do:
String truncated = Ascii.truncate(string, 16, "...");
to truncate at a max length of 16 characters with an ellipsis.
Aside
Note, though, that truncating a string for display by character isn't a good system for anything where i18n might need to be considered. There are (at least) a few different issues with it:
You may want to take word boundaries and/or whitespace into account to avoid truncating at an awkward place.
Splitting surrogate pairs (though this can be avoided just by checking if the character you want to truncate at is the first of a surrogate pair).
Splitting a character and a combining character that follows it (e.g. an e followed by a combining character that puts an accent on that e.)
The appearance of a character may change depending on the character that follows it in certain languages, so just truncating at that character will produce something that doesn't even look like the original.
For these reasons (and others), my understanding is that best practice for truncation for display in a UI is to actually fade out the rendering of the text at the correct point on the screen rather than truncating the underlying string.
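When the underlying string does have to be cut, the surrogate-pair issue from the list above can at least be handled with plain String and Character methods. The following is a sketch under the assumption that keep is non-negative; TruncateDemo and truncate are hypothetical names, and combining characters and word boundaries are not addressed.

```java
final class TruncateDemo {
    // Hypothetical helper: keeps at most `keep` chars and appends "..." when the
    // text would not fit in keep + 3 chars, without splitting a surrogate pair.
    static String truncate(String text, int keep) {
        final String ellipsis = "...";
        if (text.length() <= keep + ellipsis.length()) {
            return text; // already fits, nothing to cut
        }
        int end = keep;
        // If the cut falls between a high and a low surrogate, back off by one char.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1))
                && Character.isLowSurrogate(text.charAt(end))) {
            end--;
        }
        return text.substring(0, end) + ellipsis;
    }

    public static void main(String[] args) {
        System.out.println(truncate("please truncate me after 13 characters!", 13)); // please trunca...
        System.out.println(truncate("short", 13)); // short
    }
}
```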
Whenever there is some operation that you would think is a very common thing to do, yet the Java API requires you to check bounds, catch exceptions, use Math.min(), etc. (i.e. requires more work than you would expect), check Apache's commons-lang. It's almost always there in a more concise format. In this case, you would use StringUtils#substring which does the error case handling for you. Here's what its Javadoc says:
Gets a substring from the specified String avoiding exceptions.
A negative start position can be used to start n characters from the end of the String.
A null String will return null. An empty ("") String will return "".
StringUtils.substring(null, *) = null
StringUtils.substring("", *) = ""
StringUtils.substring("abc", 0) = "abc"
StringUtils.substring("abc", 2) = "c"
StringUtils.substring("abc", 4) = ""
StringUtils.substring("abc", -2) = "bc"
StringUtils.substring("abc", -4) = "abc"
String str = "This is Mobile application.";
System.out.println(str.subSequence(0, 13)+"...");
