Can String contain <0x00> along with assigned values in java - java

If I declare one string, is there any possibility that the string can contain <0x00> along with assigned data ?
For instance :
String s = "Stack";
Can the string result come as :
Stack<0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00><0x00>

Yes, as:
String s = "Stack\u0000\u0000";
This in contrast to C/C++ where strings are terminated by a '\0' char.
If a String must be passed to native code as a byte array, Java has a trick available for UTF-8:
a modified UTF-8 that also turns '\u0000' into a multi-byte sequence, as written by DataOutputStream.writeUTF(String).
Note that \u0000 (as some other control chars) is not allowed in XML.
By the way, the 0 string terminator has been called one of the most expensive design mistakes in C. It also influenced processor instruction sets.
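To make the two points above concrete, here is a minimal sketch (the class and method names are made up for illustration, not taken from the answer). It shows that an embedded NUL counts toward the String's length, that standard UTF-8 encodes it as a single 0x00 byte, and that DataOutputStream.writeUTF writes a 2-byte length prefix followed by modified UTF-8, where the NUL becomes 0xC0 0x80.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class NulInStringDemo {

    // Formats bytes as space-separated uppercase hex, e.g. "41 00 42".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes written by writeUTF: 2-byte length prefix, then modified UTF-8.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        System.out.println(s.length());                               // 3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));  // 41 00 42
        System.out.println(hex(modifiedUtf8(s)));                     // 00 04 41 C0 80 42
    }
}
```

Native code that treats the payload as a 0-terminated string will not be cut short by the modified form, because no 0x00 byte appears inside it.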

Related

How do I convert a single character code to a `char` given a character set?

I want to convert a decimal value to its ASCII character, but this code returns an unexpected result. Here is the code I am using.
public static void main(String[] args) {
    char ret = (char) 146;
    System.out.println(ret); // prints nothing visible
}
I expect to get the single quote character "'" as per http://www.ascii-code.com/
Has anyone come across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146. So that's already asking for trouble. You can see this also if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that the page is confusing.
But the big mistake is that you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code unit: a single char corresponds to one BMP code point, while a supplementary character (> 0xFFFF) needs a pair of chars, or an int holding the full code point.
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
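The String-based approach can be wrapped in a small helper that returns an int code point, which sidesteps the charAt versus codePointAt question. This is only a sketch: the class and method names are invented here, and it assumes the runtime ships the windows-1252 charset (common JDKs do, but only a handful of charsets are guaranteed by the specification).

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of the JDK or of the original answer.
final class SingleByteDecoder {

    // Decodes one byte value (0-255) in the named charset and
    // returns the first Unicode code point of the result.
    static int decode(int code, String charsetName) {
        byte[] raw = {(byte) code};
        String s = new String(raw, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", decode(146, "windows-1252")); // U+2019
        System.out.printf("U+%04X%n", decode(146, "ISO-8859-1"));   // U+0092
    }
}
```

Using the Charset overload of the String constructor also avoids the checked UnsupportedEncodingException thrown by the overload that takes a charset name.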
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
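For what it's worth, a quick check suggests that Charset#decode emits a surrogate pair for a supplementary character, so the CharBuffer holds two chars for one code point. The sketch below uses the UTF-8 bytes of U+1F701 as an arbitrary test input; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class DecodeSupplementaryDemo {

    // Decodes UTF-8 bytes with Charset.decode and returns the chars as a String.
    static String decodeUtf8(byte[] bytes) {
        CharBuffer cb = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
        return cb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9C, (byte) 0x81};
        String s = decodeUtf8(utf8);
        System.out.println(s.length());                            // 2
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.printf("U+%X%n", s.codePointAt(0));             // U+1F701
    }
}
```

So cb.get(0) alone would return only the high surrogate in that case; Character.codePointAt(cb, 0) is the safer read.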
As an aside: In your case, 146 (0x92) cast directly to char corresponds to the Unicode character U+0092 "PRIVATE USE TWO", and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.
Also, fyi, the official Unicode code point for the right single quote is U+2019 (0x2019). You could reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin 1) table; the values in the range 128 to 159 come from the Windows-specific variant of Latin 1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself based on the Unicode table. If you want to refer specifically to the right quote character, you can write it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).

Declaration of characters and Strings

Declaration of a character:
char ch = '';
When I do this, I get the error 'empty character literal'.
Declaration of a String:
String str = "";
I see no error in doing that to a String.
The question is: why doesn't a similar error show up for the declaration of a String, or why does declaring an empty character generate an error when an empty string is accepted?
A String is a sequence of chars, and String str = ""; contains no chars (read: empty string),
but a char variable must hold some value, and '' means no value.
String is a class in Java with its own syntax and methods. It accepts strings in double quotes. A string is backed by an array of characters, so it can legitimately be empty.
char, on the other hand, is a primitive data type and cannot be left without a value. The closest thing to "empty" is the null character '\0'.
I would recommend you to read through the Java tutorial documentation hosted on Oracle's website whenever you are in doubt about anything related to Java.
Basically, a char is a thing you put in a box, and a string is a box to hold all those things. You can have an empty box but not a non-existent thing.
A string is an array of characters. By passing it nothing, i.e. making it equal to "" you basically make an empty array which is fine. But char is a primitive type hence it cannot be "empty". The closest you can get is setting it equal to '\0' which is the null character.
Here char represents the 16-bit integer value of the character in quotes. Refer to a Unicode code chart for the values.
There is no representation number for "empty/no character".
In the case of String, refer to its source code. You can see that an empty string is represented internally by a zero-length character array. So String internally does not have a magical representation of empty/no character. For "", the String class does not allocate any space for characters per se.
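The difference the answers describe can be seen in a few lines. This is a small sketch with invented names: an empty String has length 0, while the closest thing to an "empty" char, the null character, is still one real char with the value 0 (and is also the default value of a char field).

```java
// Hypothetical demo class; the names are illustrative only.
final class EmptyCharDemo {

    // Uninitialized char fields default to the null character (value 0).
    static char defaultChar;

    public static void main(String[] args) {
        String empty = "";
        System.out.println(empty.length());   // 0
        System.out.println(empty.isEmpty());  // true

        char nul = '\0';                      // a real char whose value is 0
        System.out.println((int) nul);        // 0
        System.out.println(nul == defaultChar);           // true
        System.out.println(String.valueOf(nul).length()); // 1
    }
}
```

A String holding just the null character is therefore not empty: it has length 1.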

Null termination in strings

Yes, I did check other threads and I have come to a conclusion. I just want you to confirm it so that I don't have any misconceptions.
Java String objects are not null terminated.
C++ std::string objects are also not null terminated
C strings or C-style strings if you will (array of characters), are the only strings that are null-terminated.
Correct or Incorrect?
C-strings are 0-terminated strings. You aren't forced to use them in C though.
Both C++ std::string and Java strings are counted strings, which means they store their length.
But C++ std::strings are also followed by a 0 since C++11, making them 0-terminated if (as is often the case) they don't contain any embedded 0, for better interoperability with 0-terminated-string APIs.
All of those are in themselves correct, but some petty pedantry: C-style strings are not unique to C; there are other places where such things occur (most commonly in various forms of assembler code, and C being a language originally designed to be "slightly above assembler" makes this "no surprise").
And in C++11, std::string is guaranteed to have a NUL terminator after the last actual string character [but it's valid to store NUL characters inside the string if you wish] (at least if you call c_str(), but in the implementations I've looked at, it's stored there on creation/update).
All the statements are not wrong, but need to clarify more of the specifics in each of the mentioned languages.
That is correct: C++ std::string and Java String both hold private fields indicating the length of the string. A NUL terminator is not needed.
The std::string method c_str returns the string as a NUL-terminated char array for use when a NUL terminator is required, e.g. by C string functions such as strlen.
I don't know about the Java part, but in C++11 std::strings are NUL-terminated (besides storing the chars count), i.e. &s[0] returns the same string as s.c_str() (which is NUL-terminated, as a raw C-style string).
See this answer for more details.
The question you need to be asking is why C-String should be null terminated.
The answer is that the string manipulation functions need to know the exact length of the string. As strings in C are just arrays of characters, nothing records the size of the array, so something is needed to help determine it: the null character standing at the end.
Whereas in Java, strings are instances of the String class, which has a length field, so there is no need for null termination.
The same thing applies to strings in C++.
Almost correct.
C-strings are not just arrays of characters. They are null-terminated arrays of characters.
So if you have an array of characters, it's not a C-string yet, it's just an ordinary array of characters. It has to have a terminating null character to be a valid C-style string.
Additionally, an std::string must also be null-terminated (since C++11). (But it still has a private variable holding the length of the string.)
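The counted versus terminated distinction can be illustrated from the Java side. The sketch below is only an analogy: cStyleLength is a made-up method that imitates how a C function like strlen scans for the first 0, while String.length() simply reports the stored count, embedded nulls included.

```java
// Hypothetical demo class; cStyleLength only imitates C's strlen.
final class CountedVsTerminated {

    // Scans for the first 0 value, the way a 0-terminated string is measured.
    // Stops at the end of the array if no terminator is present.
    static int cStyleLength(char[] buffer) {
        int n = 0;
        while (n < buffer.length && buffer[n] != '\0') {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        char[] buffer = {'a', 'b', '\0', 'c', 'd', '\0'};
        System.out.println(cStyleLength(buffer));        // 2
        System.out.println(new String(buffer).length()); // 6
    }
}
```

The terminated view loses everything after the first 0, which is why counted strings can hold embedded null characters and C-strings cannot.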

4 byte unicode character in Java

I am writing unit tests for my custom StringDatatype, and I need to write down 4 byte unicode character.
"\U" does not work (illegal escape character error).
For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
You need to do this:
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
When Java was created, Unicode did not define code points outside the BMP (ie, U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... And code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively). There is no character literal in Java allowing to enter code points outside the BMP directly.
Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" -- or directly as the symbol if your computing environment has support for it.
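As a sanity check on the surrogate pair quoted above, the following sketch (class and method names invented for illustration) derives the high and low surrogates for U+1F701 with the Character API, compares the result with the escaped literal, and prints the UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class SurrogatePairDemo {

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // UTF-8 encoding of s as space-separated uppercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F701;
        String s = fromCodePoint(cp);
        System.out.printf("high=%04X low=%04X%n",
                (int) Character.highSurrogate(cp),
                (int) Character.lowSurrogate(cp));      // high=D83D low=DF01
        System.out.println(s.equals("\uD83D\uDF01"));   // true
        System.out.println(s.length());                 // 2
        System.out.println(utf8Hex(s));                 // F0 9F 9C 81
    }
}
```

For a unit test, either form works; Character.toChars avoids having to compute the surrogates by hand.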
See also the CharsetDecoder and CharsetEncoder classes.
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
String s = "𩸽";
Technically this is one character. But be careful: s.length() will return 2. Also, Java won't compile char c = '𩸽'. Java doesn't promise that String.length() returns the exact number of characters; it returns just the number of Java chars required to store the string.
The number of code points can be obtained from s.codePointCount(0, s.length()).
jshell> String s = "🏳";
s ==> "🏳️"
jshell> s.codePointCount(0, s.length());
$5 ==> 2
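A result of 2 for something that renders as one symbol is consistent with the string holding two code points. The sketch below assumes the emoji is the white flag U+1F3F3 followed by the variation selector U+FE0F (an assumption; the session above does not show the underlying code points) and lists them with String.codePoints(). The class and method names are made up.

```java
// Hypothetical demo class; the names are illustrative only.
final class CodePointListing {

    // Formats each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed content: white flag (surrogate pair) + variation selector-16.
        String flag = "\uD83C\uDFF3\uFE0F";
        System.out.println(flag.length());                         // 3
        System.out.println(flag.codePointCount(0, flag.length())); // 2
        System.out.println(describe(flag));                        // U+1F3F3 U+FE0F
    }
}
```

So codePointCount counts code points, which can still be more than the number of symbols a user sees.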

java unicode value of char

When I do Collections.sort(List), it sorts based on String's compareTo() logic, where it compares both strings char by char.
List<String> file1 = new ArrayList<String>();
file1.add("1,7,zz");
file1.add("11,2,xx");
file1.add("331,5,yy");
Collections.sort(file1);
My understanding is char means it specifies the unicode value, I want to know the unicode values of char like ,(comma) etc. How can I do it? Any url contains the numeric value of these?
My understanding is char means it specifies the unicode value, I want to know the unicode values of char like ,(comma) etc
Well there's an implicit conversion from char to int, which you can easily print out:
int value = ',';
System.out.println(value); // Prints 44
This is the UTF-16 code unit for the char. (As fge notes, a char in Java is a UTF-16 code unit, not a Unicode character. There are Unicode code points greater than 65535, which are represented as two UTF-16 code units.)
Any url contains the numeric value of these?
Yes - for more information about Unicode, go to the Unicode web site.
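Applied to the list from the question, the numeric values explain the sort order: the comma (44) is smaller than the digit '1' (49), so "1,7,zz" sorts before "11,2,xx". A minimal sketch, with a made-up class name and helper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class SortByCodeUnitDemo {

    // Returns a sorted copy, using String's natural (compareTo) ordering.
    static List<String> sorted(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println((int) ',');  // 44
        System.out.println((int) '1');  // 49
        // First difference is at index 1: ',' (44) minus '1' (49).
        System.out.println("1,7,zz".compareTo("11,2,xx")); // -5
        System.out.println(sorted(Arrays.asList("331,5,yy", "11,2,xx", "1,7,zz")));
        // [1,7,zz, 11,2,xx, 331,5,yy]
    }
}
```

If numeric ordering of the first field is wanted instead, the strings would need to be parsed and compared with a custom Comparator.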
Uhm no, char is not a "unicode value" (and the word to use is Unicode code point).
A char is a code unit in the UTF-16 encoding. And it so happens that in Unicode's Basic Multilingual Plane (ie, Unicode code points ranging from U+0000 to U+FFFF, for code points defined in this range), yes, there is a 1-to-1 mapping between char and Unicode.
In order to know the numeric value of a code point you can just do:
System.out.println((int) myString.charAt(0));
But this IS NOT THE CASE for code points outside the BMP. For these, one code point translates to two chars. See Character.toChars(). And more generally, all static methods in Character relating to code points. There are quite a few!
This also means that String's .length() is actually misleading, since it returns the number of chars, not the number of graphemes.
Demonstration with one Unicode emoticon (the first in that page):
System.out.println(new String(Character.toChars(0x1f600)).length());
prints 2. Whereas:
final String s = new String(Character.toChars(0x1f600));
System.out.println(s.codePointCount(0, s.length()));
prints 1.
