Length limit for a variable name in Java [duplicate] - java

Duplicate of: Max name length of variable or method in Java
I was reading the Java docs, which say: “A variable's name can be any legal identifier — an unlimited-length sequence of Unicode letters and digits … “ In C++, the limit on variable name length is around 255 characters, depending on the compiler. So how is this handled in Java? Does the compiler truncate the variable name after x characters, and if so, what is x?

According to the class file format spec (under section 4.11):
The length of field and method names, field and method descriptors, and other constant string values is limited to 65535 characters by the 16-bit unsigned length item of the CONSTANT_Utf8_info structure (§4.4.7). Note that the limit is on the number of bytes in the encoding and not on the number of encoded characters. UTF-8 encodes some characters using two or three bytes. Thus, strings incorporating multibyte characters are further constrained.
This applies to local variables as well because of the LocalVariableTable pointing to CONSTANT_Utf8_info values for the variable names.
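As an aside, here is a hedged sketch (not from the spec) of how you might check whether a candidate identifier fits within the CONSTANT_Utf8_info limit. It uses standard UTF-8 as an approximation of the class file's modified UTF-8 encoding, and the class and method names are made up for illustration:

import java.nio.charset.StandardCharsets;

public class IdentifierLimit {
    // The class file format caps CONSTANT_Utf8_info entries at 65535 bytes.
    static final int MAX_UTF8_BYTES = 65535;

    static boolean fitsInClassFile(String identifier) {
        // The limit counts encoded bytes, not characters, so multibyte
        // characters shorten the usable name length.
        return identifier.getBytes(StandardCharsets.UTF_8).length <= MAX_UTF8_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsInClassFile("x".repeat(65535))); // true
        System.out.println(fitsInClassFile("x".repeat(65536))); // false
        System.out.println(fitsInClassFile("é".repeat(40000))); // false: 'é' takes 2 bytes in UTF-8
    }
}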

No one in their right mind should ever come within miles of the limit; past a certain point, a long name defeats its own purpose. You want to choose names that clarify your intent, but that doesn't mean a variable name should rival "Ulysses" in length. In practice, the limit that matters is one of good taste and readability.

Given that a java.lang.String has (at least in older JDKs) a field
private final int count;
to record the number of characters in it, the maximum identifier length must be no more than
Integer.MAX_VALUE

Related

I can't see the default value of an instance "char" in Java [duplicate]

Duplicate of: What's the default value of char? (14 answers)
I have found that the default value of a char instance variable is '\u0000' (the Unicode null character). But when I tried the piece of code below, I could only see an empty print line. Please give me clarification.
public class Basics {
    char c;
    int x;

    public static void main(String[] args) {
        Basics s = new Basics();
        System.out.println(s.c);
        System.out.println(s.x);
    }
}
The console output is as follows:
(empty line)
0
'\u0000' (char c = 0;) is a Unicode control character, so you are not supposed to see it. You can make that explicit:
System.out.println(Character.isISOControl(s.c) ? "<control>" : s.c);
Try
System.out.println((int) s.c);
if you want to see the numeric value of the default char (which is 0).
Otherwise, it just prints a blank (not an empty line).
You can see that it's not an empty line if you add visible characters before and after s.c:
System.out.print ("--->");
System.out.print (s.c);
System.out.println ("<---");
will print:
---> <---
Could you please give me more information about why Unicode null was selected as the default value for the char data type? Is there any specific reason behind this?
It was recognized that the language that was to become Java needed to support multilingual character sets by default. At that time, Unicode was the new standard way of doing it¹. When Java first adopted Unicode, Unicode used 16-bit codes exclusively. That caused the Java designers to specify char as an unsigned 16-bit integral type. Unfortunately, Unicode rapidly expanded beyond 16 bits, and Java had to adapt ... by switching to UTF-16 as its native in-memory text encoding scheme.
For more background:
Why Java char uses UTF-16?
Why does Java use UTF-16 for the internal text representation
But note that:
In the latest versions of Java, you have the option of enabling a more compact representation for text data.
The width of char is so hard-wired that it would be impossible to change. In fact, if you want to represent a Unicode code point, you should use an int rather than a char.
1 - It is still the standard way. AFAIK there are no credible alternatives to Unicode at this time.
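To illustrate the point above that an int, not a char, is needed for arbitrary code points, here is a minimal sketch; the code point U+1F600 (an emoji) is just an arbitrary example of a supplementary character:

public class CodePointDemo {
    public static void main(String[] args) {
        // A code point outside the Basic Multilingual Plane cannot fit in one char.
        int grinningFace = 0x1F600;
        System.out.println(Character.charCount(grinningFace));            // 2: needs two chars
        System.out.println(new String(Character.toChars(grinningFace)));  // prints the emoji (console permitting)
        System.out.println(Character.isLetter(grinningFace));             // false: it is a symbol
    }
}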
The specific reason that \u0000 was chosen as the default initial value for char is that it is zero. Objects are default-initialized by writing all-zero bytes to all fields, irrespective of their types. This maps to zero for integral and floating point types, false for boolean, and null for reference types.
It so happens that the \u0000 character maps to the ASCII NUL control character, which is a non-printing character.
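A small illustrative sketch of that zero-initialization rule; the class and field names are made up:

public class Defaults {
    byte b; short s; int i; long l;   // all default to 0
    float f; double d;                // default to 0.0
    char c;                           // defaults to '\u0000'
    boolean flag;                     // defaults to false
    Object ref;                       // defaults to null

    public static void main(String[] args) {
        Defaults x = new Defaults();
        System.out.println((int) x.c); // 0
        System.out.println(x.flag);    // false
        System.out.println(x.ref);     // null
    }
}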

Why is char actually a NumericType in Java, but not a SymbolicType or String?

Regarding Java syntax, there is a NumericType, which consists of IntegralType and FloatingPointType. The IntegralTypes are byte, short, int, long, and char.
At the same time, I can assign a single character to a char variable.
char c1 = 10;
char c2 = 'c';
So here is my question: why is char a numeric type, and how does the JVM convert 'c' to a number?
Why is char a numeric type...
Using numbers to represent characters as indexes into a table is the standard way text is handled in computers. It's called character encoding and has a long history, going back at least to telegraphs. For a long time personal computers used ASCII (a 7-bit encoding = 127 characters plus NUL), and then "extended ASCII" (an 8-bit encoding of various forms where the "upper" 128 characters had a variety of interpretations), but these are now obsolete and suitable only for niche purposes thanks to their limited character sets. Before personal computers, popular encodings included EBCDIC and its precursor BCD. Modern systems use Unicode (usually by storing one or more of its transformations such as UTF-8 or UTF-16) or various standardized "code pages" such as Windows-1252 or ISO-8859-1.
...and how does the JVM convert 'c' to a number?
Java's numeric char values map to and from characters via Unicode (which is how the JVM knows that 'c' is the value 0x0063, or that 'é' is 0x00E9). Specifically, a char value holds a UTF-16 code unit, which for characters in the Basic Multilingual Plane is the same as the code point, and strings are sequences of such code units.
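A minimal sketch of that mapping in action (the class name is made up for illustration):

public class CharAsNumber {
    public static void main(String[] args) {
        char c = 'c';
        // Widening a char to int exposes its numeric (UTF-16) value.
        System.out.printf("'%c' = U+%04X = %d%n", c, (int) c, (int) c); // 'c' = U+0063 = 99
        char e = 'é';
        System.out.printf("'%c' = U+%04X%n", e, (int) e);               // 'é' = U+00E9

        // Arithmetic on chars also works, because char is an integral type.
        for (char ch = 'a'; ch <= 'e'; ch++) {
            System.out.print(ch); // prints "abcde"
        }
        System.out.println();
    }
}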
There's quite a lot about the char data type, including why the value is 16 bits wide, in the JavaDoc of the Character class:
Unicode Character Representations
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
Because underneath, Java represents chars as Unicode values. There is some convenience to this: for example, you can run a loop from 'A' to 'Z' and do something with each letter. It's important to realize, however, that in Java Strings aren't strictly arrays of characters like they are in some other languages.
Internally, a char is stored as a Unicode (UTF-16) code, which is an integer; the difference lies in how it is processed after it is read from memory.
In C/C++, char and int are very close and are implicitly converted to one another. The similar behavior in Java reflects the relationship between C/C++ and Java, as the JVM is written in C/C++.
Besides making arithmetic operations on chars possible, which sometimes comes in handy (like c >= 'a' && c <= 'z'), I would say it is a design decision driven by the similar approach taken in other languages when Java was invented (primarily C and C++).
The fact that Character does not extend Number (as other numeric primitive wrappers do) somehow indicates that Java designers tried to find some kind of a compromise between numeric and non-numeric nature of characters.
DISCLAIMER: I was not able to find any official docs about this.

Unicode code points and java char [duplicate]

Duplicate of: How does Java 16 bit chars support Unicode? (3 answers)
Someone asked a similar question, but I didn't really get the answer. When I say
char myChar = 'k';
in Java, it is going to reserve 16 bits for it (according to the Java docs below):
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Now let's say I have a Unicode character '電' and assume that its code point is something like U+FFFF1. This code point could not be stored in 2 bytes, so would Java allocate extra bytes (a UTF-16 based string) for it? In short, when I have something like this:
char myChar = '電';
and assuming that its code point representation requires more than 2 bytes, how many bits will myChar have: 16 or 32?
Thanks
Java uses UTF-16, and yes, every Java char is 16 bits. From the Java Tutorial - Primitive Data Types:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Further, the Character Javadoc says (in part),
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
So, supplementary characters (like your second example) aren't represented as a single 16-bit character.
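A short sketch demonstrating this; the code point U+2F81A (the CJK ideograph mentioned in the Javadoc above) is used only as an example of a supplementary character:

public class Supplementary {
    public static void main(String[] args) {
        // U+2F81A is outside the BMP, so it cannot fit in a single char.
        String s = new String(Character.toChars(0x2F81A));

        System.out.println(s.length());                       // 2: two UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1: one code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}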

Java with String length limit [duplicate]

Duplicates of: How many characters can a Java String have? (7 answers) and Size of Initialisation string in java (3 answers)
I am trying to solve a CodeChef problem in Java and found out that I could not create a String longer than about one million characters (with my compiler, at least). I pasted the first one million decimal digits of Pi into a string in the Java file (e.g. String PI = "3.1415926535...151") and it fails to compile. When I take out the Pi and replace it with a shorter string like "dog", the code compiles. Can anyone confirm whether this is indeed a limitation of Java?
Thanks.
Can anyone confirm if this is indeed a limitation of Java?
Yes. There is an implementation limit of 65535 on the length of a string literal¹. It is not stated in the JLS, but it is implied by the structure of the class file; see JVM Spec 4.4.7 and note that the string length field is 'u2' ... which means a 16-bit unsigned integer.
Note that a String object can have up to 2^31 - 1 characters. The 2^16 - 1 limit is actually for string-valued constant expressions; e.g. string literals or concatenations of literals that are embedded in the source code of a Java program.
If you want a String that represents the first million digits of Pi, it would be better to read the characters from a file in the filesystem, or from a resource on the classpath.
1 - This limit is actually on the number of bytes in the (modified) UTF-8 representation of the String. If the string consists of characters in the range 0x01 to 0x7f, then each byte represents a single character. Otherwise, a character can require up to 6 bytes.
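A minimal sketch of the file-based approach suggested above; the file name pi-digits.txt is an assumption:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PiDigits {
    public static void main(String[] args) throws IOException {
        // Load the digits at runtime instead of embedding them as a literal,
        // which sidesteps the 65535-byte limit on string constants.
        String pi = Files.readString(Path.of("pi-digits.txt")).trim();
        System.out.println(pi.length());
        System.out.println(pi.substring(0, 10)); // e.g. "3.14159265"
    }
}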
I think this problem is not related to string literals but to method size: http://chrononsystems.com/blog/method-size-limit-in-java. According to that, the size of a method cannot exceed 64k.

In what encoding is a Java char stored in?

Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask was: are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: For the char type specifically, see the Character docs: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
Your question about how a Java char is stored cannot be answered as posed. Simply put, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
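To make the distinction concrete, here is a small sketch contrasting a char's in-memory UTF-16 value with the bytes produced by different on-disk encodings (the charsets chosen are just examples):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharEncoding {
    public static void main(String[] args) {
        char c = 'é';
        // In memory, a char is a UTF-16 code unit: U+00E9.
        System.out.printf("U+%04X%n", (int) c);

        // On disk, the byte representation depends entirely on the chosen encoding.
        String s = String.valueOf(c);
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));   // [0, -23]
    }
}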
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for U+D800 to U+DFFF, which are used in UTF-16 as part of the extended representation for larger characters.
