This question already has answers here:
How many characters can a Java String have?
(7 answers)
Size of Initialisation string in java
(3 answers)
Closed 9 years ago.
I am trying to solve a CodeChef problem in Java and found out that I could not create a String with length > one million chars (with my compiler at least). I pasted the first one million decimal digits of Pi in a string (e.g. String PI = "3.1415926535...151") in the Java file and it fails to compile. When I take out the Pi and replace it with a shorter string like "dog", the code compiles. Can anyone confirm if this is indeed a limitation of Java?
Thanks.
Can anyone confirm if this is indeed a limitation of Java?
Yes. There is an implementation limit of 65535 on the length of a string literal (see note 1). It is not stated in the JLS, but it is implied by the structure of the class file; see JVM Spec 4.4.7 and note that the string length field is 'u2', which means a 16-bit unsigned integer.
Note that a String object can have up to 2^31 - 1 characters. The 2^16 -1 limit is actually for string-valued constant expressions; e.g. string literals or concatenations of literals that are embedded in the source code of a Java program.
If you want a String that represents the first million digits of Pi, then it would be better to read the characters from a file in the filesystem, or a resource on the classpath.
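A minimal sketch of the file-based approach, assuming the digits are stored as plain ASCII text. The class name PiDigits and the path pi.txt are placeholders, not anything from the question:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PiDigits {
    // Reads the whole file into one String at run time. The text never becomes
    // a constant in the class file, so the 65535-byte literal limit does not apply.
    static String load(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        return new String(raw, StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) throws IOException {
        // "pi.txt" is a hypothetical default; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "pi.txt");
        String pi = load(file);
        System.out.println("Loaded " + pi.length() + " characters");
    }
}
```

For a resource on the classpath, the same idea should work with getResourceAsStream in place of Files.readAllBytes.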
1 - This limit is actually on the number of bytes in the (modified) UTF-8 representation of the String. If the string consists of characters in the range 0x01 to 0x7f, then each byte represents a single character. Otherwise, a character can require up to 6 bytes: 2 or 3 bytes for a char outside that range, and 6 bytes for a supplementary code point, which is encoded as a surrogate pair.
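To make the byte counting in the note concrete, here is a rough sketch that computes how many bytes a string would need in the modified UTF-8 form used by the class file. The class and method names are made up for illustration:

```java
class ModifiedUtf8 {
    // Upper bound imposed by the u2 length field of CONSTANT_Utf8_info.
    static final int MAX_BYTES = 65535;

    // Counts the bytes of the modified UTF-8 encoding, one char at a time.
    static long encodedLength(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;          // plain ASCII except NUL
            } else if (c <= 0x07FF) {
                bytes += 2;          // includes NUL, which is written as two bytes
            } else {
                bytes += 3;          // rest of the BMP, including each surrogate half
            }
        }
        return bytes;
    }

    // True if the string could be stored as a single constant in a class file.
    static boolean fitsInConstantPool(String s) {
        return encodedLength(s) <= MAX_BYTES;
    }

    public static void main(String[] args) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) digits.append('1');
        System.out.println(fitsInConstantPool("dog"));
        System.out.println(fitsInConstantPool(digits.toString()));
    }
}
```

A supplementary character is stored as two surrogate chars in a String, so the loop counts it as 3 + 3 = 6 bytes.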
I think this problem is not related to string literals but to method size: http://chrononsystems.com/blog/method-size-limit-in-java. According to that, the size of a method cannot exceed 64k.
This question already has answers here:
Creating Unicode character from its number
(13 answers)
Closed 13 days ago.
I have to read a string from a file and display the corresponding unicode representation in a Text field on my application.
For example, I read the string "e13a" from the file and I'd like to display the corresponding "\ue13a" character in the Text field.
Is there a way to obtain the desired behaviour?
I already tried escaping the string directly in the file but I always obtain the raw string instead of the unicode representation
tl;dr
Character.toString( Integer.parseInt( "e13a" , 16 ) )
See this code run at Ideone.com.
Code point
Parse your input string as a hexadecimal number, base 16. The result is an ordinary int; no separate conversion to decimal is needed.
That number represents a code point, the number permanently assigned to each of the over 144,000 characters defined in Unicode. Code points range from zero to just over one million, with most of that range unassigned.
String input = "e13a" ;
int codePoint = Integer.parseInt( input , 16 ) ;
Instantiate a String object whose content is the character identified by that code point.
String output = Character.toString( codePoint ) ;
Avoid char
The char type has been essentially broken since Java 2, and legacy since Java 5. As a 16-bit value, char is physically incapable of representing most characters.
To work with individual characters, use code point integers as seen above.
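As a sketch of the same idea with some input checking added: the helper name fromHex is made up, and it uses Character.toChars rather than Character.toString(int) only because the latter needs Java 11 or later.

```java
class HexToCharacter {
    // Turns a hex string such as "e13a" into the String holding that code point.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex.trim(), 16); // throws NumberFormatException on bad input
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        // toChars yields one char for the BMP and a surrogate pair above it.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String output = fromHex("e13a");
        System.out.println(output.codePointAt(0) == 0xE13A);
    }
}
```

Because the result is a String rather than a char, the same call also handles code points above U+FFFF.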
I posted the question after a lot of trying and searching.
Shortly after posting I found a more trivial solution than I expected:
The converted string is:
String converted = String.valueOf((char) Integer.parseInt(unicodeString,16));
where "unicodeString" is the string I read from the file.
This question already has answers here:
Does java define string as null terminated?
(3 answers)
Closed 3 years ago.
I have searched but I found mixed answers. One site says string is terminated with a special character ‘\0’.
What is the purpose of the \0 special character, and is the source correct?
Source 1
Source 2
The \0 escape is one of Java's octal escape sequences, which specify a character by its octal value. For example, foo\377bar gives you fooÿbar, because octal 377 is the value of ÿ. Likewise, \0 is the character whose value is 0, the null character. It has no visible glyph, so it usually appears as nothing when printed, but it is still a character in the string.
This suggests in Java \0 is not used to null terminate a string, and further research shows Java doesn't use anything to null terminate strings, as the length of the string is enough to tell it that it has reached the end of the string.
This can be proven by running the following:
String test = "foo\0bar";
char[] list = test.toCharArray();
for (char ch : list) System.out.println(ch);
System.out.println(test);
EDIT: Java actually uses Unicode, not ASCII. This does not change the effect described above. Note that octal escapes such as \0nn are valid only inside character and string literals; it is Unicode escapes (a backslash, the letter u, and four hex digits) that the compiler translates before lexical analysis and that can therefore appear in identifiers. This source goes through special characters, and explains how they can be used in Java. \0 is the same as \00 or \000 and denotes the null character, U+0000. It is an ordinary character that occupies one position in the string, so it is not the same thing as the empty string (epsilon in regular-expression notation).
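A small sketch that checks the same thing without relying on how a console renders the null character. The class name is made up for illustration:

```java
class NulInString {
    // The same literal as in the snippet above: a NUL sits between "foo" and "bar".
    static String sample() {
        return "foo\0bar";
    }

    public static void main(String[] args) {
        String test = sample();
        // The length counts the NUL like any other char, so nothing is cut off at it.
        System.out.println(test.length());
        System.out.println(test.indexOf('\0'));
        System.out.println((int) test.charAt(3));
        System.out.println(test.substring(4));
    }
}
```

If strings were null terminated, everything after the NUL would be lost; here the text after it is still reachable with substring.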
I am writing unit tests for my custom StringDatatype, and I need to write down a 4-byte Unicode character.
"\U" - not working (illegal escape character error)
For example: U+1F701 (0xf0 0x9f 0x9c 0x81). How can it be written in a string?
A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
You need to do this:
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
When Java was created, Unicode did not define code points outside the BMP (i.e., U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... And code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively). There is no character literal in Java that lets you enter a code point outside the BMP directly.
Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "\uD83D\uDF01" -- or directly as the symbol if your computing environment has support for it.
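A quick way to double-check the surrogate pair and the UTF-8 bytes for U+1F701, as a sketch (the class and helper names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class SupplementaryLiteral {
    // Builds a String from any code point, BMP or supplementary.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F701);
        // Compare against the escaped surrogate pair written as a literal.
        System.out.println(s.equals("\uD83D\uDF01"));
        // Show the UTF-8 bytes to compare with the four bytes given in the question.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In a unit test, the escaped literal avoids any dependence on the source file encoding.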
See also the CharsetDecoder and CharsetEncoder classes.
See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).
String s = "𩸽";
Technically this is one character. But be careful: s.length() will return 2. Also, Java won't compile char c = '𩸽'. Java doesn't promise that String.length() returns the exact number of characters; it returns the number of Java chars (UTF-16 code units) required to store the string.
The number of code points can be obtained from s.codePointCount(0, s.length()).
jshell> String s = "🏳";
s ==> "🏳️"
jshell> s.codePointCount(0, s.length());
$5 ==> 2
This question already has answers here:
How does Java 16 bit chars support Unicode?
(3 answers)
Closed 8 years ago.
Someone asked a similar question, but I didn't really get the answer.
When I say
char myChar = 'k' in Java, is it going to reserve 16 bits for it (according to the Java docs below)?
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Now let's say I have a Unicode character '電' and assume that its code point is something like U+FFFF1. This code point could not be stored in 2 bytes, so would Java allocate extra bytes (UTF-16 based string) for it?
In short when I have something like this -
char myChar = '電'
Assuming that its code point representation is long and will require more than 2 bytes.
How many bits will myChar have - 16 or 32?
Thanks
Java uses UTF-16, and yes, every Java char is 16 bits. From the Java Tutorial - Primitive Data Types,
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Further, the Character Javadoc says (in part),
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
So, supplementary characters (like your second example) aren't represented as a single 16-bit character.
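A short sketch of what that means in practice. Here 0x96FB is the actual code point of 電, which lies inside the BMP, and 0x2F81A is the supplementary ideograph from the Javadoc quote; the class name is made up:

```java
class CharWidth {
    // Number of 16-bit char values needed to store one code point: 1 or 2.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int den = 0x96FB;   // 電, inside the BMP
        int cjk = 0x2F81A;  // supplementary CJK ideograph from the Javadoc quote

        System.out.println(charsNeeded(den) * 16 + " bits");
        System.out.println(charsNeeded(cjk) * 16 + " bits");

        // A supplementary code point cannot be held in a char variable,
        // but it can be held in an int or stored in a String as two chars.
        String s = new String(Character.toChars(cjk));
        System.out.println(s.length());
        System.out.println(Character.isLetter(cjk));
    }
}
```

So a char variable is always 16 bits; anything that does not fit has to live in an int or a String.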
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Max name length of variable or method in Java
I was reading the Java docs and they say “A variable's name can be any legal identifier — an unlimited-length sequence of Unicode letters and digits … “. In C++ the maximum variable name length is around 255 characters, depending on the compiler. How is this handled in Java? Does the compiler truncate the variable name after x characters and, if so, what is x?
According to the class file format spec (under section 4.11):
The length of field and method names, field and method descriptors, and other constant string values is limited to 65535 characters by the 16-bit unsigned length item of the CONSTANT_Utf8_info structure (§4.4.7). Note that the limit is on the number of bytes in the encoding and not on the number of encoded characters. UTF-8 encodes some characters using two or three bytes. Thus, strings incorporating multibyte characters are further constrained.
This applies to local variables as well because of the LocalVariableTable pointing to CONSTANT_Utf8_info values for the variable names.
No one in their right mind should ever come within miles of the limit. You reach a point where it defeats the purpose. You want to choose names that clarify your intent, but that doesn't mean a variable name should rival "Ulysses" in length. The limit has more to do with good taste and readability.
Given that a java.lang.String has a field
private final int count;
to specify the number of characters in it, the maximum identifier length must be no more than
Integer.MAX_VALUE