Why is the 'char' primitive in java necessary?

It has occurred to me that the char type in java can be entirely replaced with integer types (and leaving the character literals for programmers' convenience). This would allow for flexibility of storage size, as ASCII only takes one byte and Unicode beyond the Basic Multilingual Plane requires more than two bytes. If a character is just a two-byte number like the short type, why is there a separate type for it?

Nothing is 100% necessary in a programming language; we could all use BCPL if we really wanted to. (BTW, I learned that well enough to write fizzbuzz a few years ago and recommend doing that. It's a language with an interesting viewpoint and/or historical perspective.)
The question is, does char simplify or improve programming? Put in your terms, is it worth one byte per character to save the if/then complexity inherent in using byte for some characters and short for others? I think the answer is that the extra byte is cheap: even for a novel-length string, that's only half a megabyte, or about one cent's worth of RAM. Or compared to using short: does having a separate unsigned 16-bit type improve or simplify anything over using a signed 16-bit type for holding Unicode code points, so that the character "ꙁ" would be a negative number in Java and positive in reality? A matter of judgment.
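The sign point can be checked directly. This is only a small sketch; the class and method names are made up for illustration.

```java
// Illustrative only: CharVsShort and asShort are hypothetical names for this sketch.
final class CharVsShort {
    // Reinterprets the 16 bits of a char as a signed short.
    static short asShort(char c) {
        return (short) c;
    }

    public static void main(String[] args) {
        char c = '\uA641'; // the character "ꙁ", code point U+A641
        System.out.println((int) c);    // 42561: char is unsigned, so the value stays positive
        System.out.println(asShort(c)); // -22975: the same 16 bits read as a signed short
    }
}
```

So with short, every code unit from U+8000 upward would show up as a negative number.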

Related

Why are char[] the only arrays not supported by Arrays.stream()?

While going through the ways of converting primitive arrays to Streams, I found that char[] are not supported while other primitive array types are supported. Any particular reason to leave them out in the stream?

Of course, the answer is "because that's what the designers decided". There is no technical reason why CharStream could not exist.
If you want justification, you usually need to turn to the OpenJDK mailing list*. The JDK's documentation is not in the habit of justifying why anything is the way it is.
Someone asked
Using IntStream to represent char/byte stream is a little
inconvenient. Should we add CharStream and ByteStream as well?
The reply from Brian Goetz (Java Language Architect) says
Short answer: no.
It is not worth another 100K+ of JDK footprint each for these forms
which are used almost never. And if we added those, someone would
demand short, float, or boolean.
Put another way, if people insisted we had all the primitive
specializations, we would have no primitive specializations. Which
would be worse than the status quo.
Source
He also says the same elsewhere
If you want to deal with them as chars, you can downcast them to
chars easily enough. Doesn't seem like an important enough use case
to have a whole 'nother set of streams. (Same with Short, Byte,
Float).
Source
TL;DR: Not worth the maintenance cost.
*In case you're curious, the google query I used was
site:http://mail.openjdk.java.net/ charstream
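To make the "downcast them to chars" advice concrete, here is a rough sketch of working with characters through an IntStream. The class and method names are assumptions for the example, not anything from the JDK.

```java
import java.util.stream.Collectors;

// Illustrative only: CharsViaIntStream and shout are hypothetical names.
final class CharsViaIntStream {
    // Streams the UTF-16 code units of a string as ints, then casts each back to char.
    static String shout(String s) {
        return s.chars()                                  // IntStream of code units
                .map(c -> Character.toUpperCase(c))       // Character.toUpperCase(int)
                .mapToObj(c -> String.valueOf((char) c))  // the downcast to char
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(shout("char stream")); // CHAR STREAM
    }
}
```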

As Eran said, it's not the only one missing.
A BooleanStream would be useless, a ByteStream (if it existed) can be handled as an InputStream or converted to IntStream (as can short), and float can be handled as a DoubleStream.
As char is not able to represent all characters anyway (see linked), it would be a bit of a legacy stream. Although most people don't have to deal with codepoints anyway, so it can seem strange. I mean you use String.charAt() without thinking "this doesn't actually work in all cases".
So some things were left out because they weren't deemed that important. As said by JB Nizet in the linked question:
The designers explicitly chose to avoid the explosion of classes and
methods by limiting the primitive streams to 3 types, since the other
types (char, short, float) can be represented by their larger
equivalent (int, double) without any significant performance penalty.
The reason BooleanStream would be useless, is because you only have 2 values and that limits the operations a lot. There's no mathematical operations to do, and how often are you working with lots of boolean values anyway?
As can be seen from the comments, a BooleanStream is not needed. If it were, there would be a lot of actual use cases instead of theoretical situations, a use case going back to Java 1.4, and a fallacious comparison to while loop.

It's not only char arrays that are not supported.
There are only 3 types of primitive streams - IntStream, LongStream and DoubleStream.
As a result, Arrays has methods that convert int[], long[] and double[] to the corresponding primitive streams.
There are no corresponding methods for boolean[], byte[], short[], char[] and float[], since these primitive types have no corresponding primitive streams.
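For the unsupported array types, one possible workaround is to go through an IntStream. The sketch below assumes you are fine with int elements; the class and method names are invented for the example.

```java
import java.nio.CharBuffer;
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative only: PrimitiveArrayStreams and its methods are hypothetical names.
final class PrimitiveArrayStreams {
    // int[] is supported directly by Arrays.stream.
    static int sum(int[] values) {
        return Arrays.stream(values).sum();
    }

    // char[] is not, but CharBuffer is a CharSequence, and chars() yields an IntStream.
    static long countVowels(char[] chars) {
        return CharBuffer.wrap(chars).chars()
                .filter(c -> "aeiou".indexOf(c) >= 0)
                .count();
    }

    // For byte[] (or short[]), streaming over the indices works as well.
    static int sumBytes(byte[] bytes) {
        return IntStream.range(0, bytes.length).map(i -> bytes[i]).sum();
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3}));
        System.out.println(countVowels("stream api".toCharArray()));
        System.out.println(sumBytes(new byte[] {1, 2, -3}));
    }
}
```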

char is a dependent part of String - storing UTF-16 values. A Unicode symbol, a code point, is sometimes a surrogate pair of chars. So any simple solution with chars only covers part of the Unicode domain.
There was a time that char had its own right to be a public type. But nowadays it is better to use code points, an IntStream. A stream of char could not straightforwardly handle surrogate pairs.
The other, more prosaic reason is that the JVM "processor" model uses an int as its smallest "register", keeping booleans, bytes, shorts and also chars in such an int-sized storage location. To avoid bloating the Java classes unnecessarily, the designers refrained from providing every possible copy variant.
In the far future one might expect primitive types allowed to function as generic type parameters, providing a List<int>. Then we might see a Stream<char>.
For the moment better avoid char, and maybe use java.text.Normalizer for a unique canonical form of code points / Unicode strings.
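A short sketch of the difference between streaming chars and streaming code points, using a string that contains one surrogate pair. The class and method names are made up for illustration.

```java
// Illustrative only: CodePointsDemo and its methods are hypothetical names.
final class CodePointsDemo {
    // Number of UTF-16 code units (what a stream of char would see).
    static long codeUnits(String s) {
        return s.chars().count();
    }

    // Number of Unicode code points (surrogate pairs are merged).
    static long codePoints(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```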

Java: can java.lang.Character be used for characters outside Basic Multilingual Plane?

As I understand it, Java keeps strings in UTF-16, which uses either 16 bits (for the BMP) or 32 bits for every code point. But I am not sure whether the Character class can be used to hold a code point that needs 32 bits. Reading http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html didn't help. So can it?

No, char and Character can't represent a code point outside the BMP. There's no specific type for this, but all the Java APIs just use int to refer to code points specifically as opposed to UTF-16 code units.
If you look at all the codePoint* methods in java.lang.Character, such as codePointAt(char[], int, int) you'll see they use int.
In my experience, very little code (including my own) correctly takes account of this, instead assuming that it's reasonable to talk about the length of a string as being the number of UTF-16 code units in it. Having said that, "length" is a pretty hard-to-pin-down concept for strings, in that it doesn't mean the number of displayed glyphs, and different normalization forms of logically-equivalent text can consist of different numbers of code points...
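As a rough illustration of the int-based code point API, here is a sketch using a few java.lang.Character and String methods. The wrapper class and method names are hypothetical.

```java
// Illustrative only: SupplementaryDemo and its methods are hypothetical names.
final class SupplementaryDemo {
    // Builds a String from a single code point; supplementary ones need two chars.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Length measured in code points rather than UTF-16 code units.
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // outside the BMP, so it does not fit in a char
        String s = fromCodePoint(cp);
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(s.length());                             // 2 code units
        System.out.println(lengthInCodePoints(s));                  // 1 code point
        System.out.println(s.codePointAt(0) == cp);                 // true
    }
}
```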

Why would you need unsigned types in Java?

I have often heard complaints against Java for not having unsigned data types. See for example this comment. I would like to know how is this a problem? I have been programming in Java for 10 years more or less and never had issues with it. Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Since unsigned and signed numbers are represented with the same bit values, the only places I can think of where signedness matters are:
When converting the numbers to other bit representation. Between 8, 16 and 32 bit integer types you can use bitmasks if needed.
When converting numbers to decimal format, usually to Strings.
Interoperating with non-Java systems through API's or protocols. Again the data is just bits, so I don't see the problem here.
Using the numbers as memory or other offsets. With 32 bit ints this might be problem for very huge offsets.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those. What am I missing? What are the actual benefits of having unsigned types in a programming language and how would having those make Java better?

Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Why not? Is "applying a bitwise AND with 0xFF" actually part of what your code is trying to represent? If not, why should it have to be part of how you write it? I actually find that almost anything I want to do with bytes beyond just copying them from one place to another ends up requiring a mask. I want my code to be cruft-free; the lack of unsigned bytes hampers this :(
Additionally, consider an API which will always return a non-negative value, or only accepts non-negative values. Using an unsigned type allows you to express that clearly, without any need for validation. Personally I think it's a shame that unsigned types aren't used more in .NET, e.g. for things like String.Length, ICollection.Count etc. It's very common for a value to naturally only be non-negative.
Is the lack of unsigned types in Java a fatal flaw? Clearly not. Is it an annoyance? Absolutely.
The comment that you quote hits the nail on the head:
Java's lack of unsigned data types also stands against it. Yes, you can work around it, but it's not ideal and you'll be using code that doesn't really reflect the underlying data correctly.
Suppose you are interoperating with another system, which wants an unsigned 16 bit integer, and you want to represent the number 65535. You claim "the data is just bits, so I don't see the problem here" - but having to pass -1 to mean 65535 is a problem. Any impedance mismatch between the representation of your data and its underlying meaning introduces an extra speedbump when writing, reading and testing the code.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those.
The only times you would need to consider those operations is when you were naturally working with values of two different types - one signed and one unsigned. At that point, you absolutely want to have that difference pointed out. With signed types being used to represent naturally unsigned values, you should still be considering the differences, but the fact that you should is hidden from you. Consider:
// This should be considered unsigned - so a value of -1 is "really" 65535
short length = /* some value */;
// This is really signed
short foo = /* some value */;
boolean result = foo < length;
Suppose foo is 100 and length is -1. What's the logical result? The value of length represents 65535, so logically foo is smaller than it. But you'd probably go along with the code above and get the wrong result.
Of course they don't even need to represent different types here. They could both be naturally unsigned values, represented as signed values with negative numbers being logically greater than positive ones. The same error applies, and wouldn't be a problem if you had unsigned types in the language.
You might also want to read this interview with Joshua Bloch (Google cache, as I believe it's gone from java.sun.com now), including:
Ooh, good question... I'm going to say that the strangest thing about the Java platform is that the byte type is signed. I've never heard an explanation for this. It's quite counterintuitive and causes all sorts of errors.
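The foo/length comparison above can be reproduced in a few lines. This sketch also shows the unsigned helper methods that the wrapper classes have had since Java 8; the class and method names here are invented for the example.

```java
// Illustrative only: UnsignedCompare and its methods are hypothetical names.
final class UnsignedCompare {
    // The plain comparison treats both values as signed.
    static boolean lessThanSigned(short a, short b) {
        return a < b;
    }

    // Widening both values as unsigned 16-bit numbers gives the "logical" answer.
    static boolean lessThanUnsigned(short a, short b) {
        return Short.toUnsignedInt(a) < Short.toUnsignedInt(b);
    }

    // The classic mask for reading a byte as 0..255.
    static int unsignedByte(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        short foo = 100;
        short length = -1; // "really" 65535
        System.out.println(lessThanSigned(foo, length));   // false, the wrong result
        System.out.println(lessThanUnsigned(foo, length)); // true
        System.out.println(unsignedByte((byte) 0xFF));     // 255
    }
}
```

Nothing in the type system forces the second form, which is the point being made: the programmer has to remember which values are "really" unsigned.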

If you like, yes, everything is ones and zeroes. However, your hardware arithmetic and logic unit doesn't work that way. If you want to store your bits in a signed integer value but perform operations that are not natural to signed integers, you will usually waste both storage space and processing time.
An unsigned integer type stores twice as many non-negative values in the same space as the corresponding signed integer type. So if you want to take into Java any data commonly used in a language with unsigned values, such as a POSIX date value (unsigned number of seconds) that is normally used with C, then in general you will need to use a wider integer type than C would use. If you are processing many such values, again you will waste both storage space and fetch-execute time.
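As a sketch of the trade-off: an unsigned 32-bit value coming from elsewhere can either be widened into a long, or kept in an int and handled with the unsigned helpers added in Java 8. The class and method names below are assumptions for illustration.

```java
// Illustrative only: UnsignedInt32 and its methods are hypothetical names.
final class UnsignedInt32 {
    // Widening: costs 8 bytes per value, but ordinary arithmetic then just works.
    static long widen(int bits) {
        return Integer.toUnsignedLong(bits);
    }

    // Keeping the 4-byte int and only interpreting it as unsigned where needed.
    static String print(int bits) {
        return Integer.toUnsignedString(bits);
    }

    public static void main(String[] args) {
        int raw = -1; // bit pattern 0xFFFFFFFF, i.e. 4294967295 when read as unsigned
        System.out.println(widen(raw));                       // 4294967295
        System.out.println(print(raw));                       // 4294967295
        System.out.println(Integer.divideUnsigned(raw, 2));   // 2147483647
        System.out.println(Integer.compareUnsigned(raw, 1) > 0); // true
    }
}
```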

The times I have used unsigned data types have been when I read in large blocks of data that correspond to images, or worked with openGL. I personally prefer unsigned if I know something will never be negative, as a "safety feature" of sorts.
Unsigned types are useful for bit-by-bit comparisons, and I'm pretty sure they are used extensively in graphics.

What is the purpose of long, double, byte, char in Java?

So I'm learning java, and I have a question. It seems that the types int, boolean and string will be good for just about everything I'll ever need in terms of variables, except perhaps float could be used when decimal numbers are needed in a number.
My question is, are the other types such as long, double, byte, char etc ever used in normal, everyday programming? What are some practical things these could be used for? What do they exist for?

With the possible exception of "short", which arguably is a bit of a waste of space-- sometimes literally, they're all horses for courses:
Use an int when you don't need fractional numbers and you've no reason to use anything else; on most processors/OS configurations, this is the size of number that the machine can deal with most efficiently;
Use a double when you need fractional numbers and you've no reason to use anything else;
Use a char when you want to represent a character (or possibly rare cases where you need two-byte unsigned arithmetic);
Use a byte if either you specifically need to manipulate a signed byte (rare!), or when you need to move around a block of bytes;
Use a boolean when you need a simple "yes/no" flag;
Use a long for those occasions where you need a whole number, but where the magnitude could exceed 2 billion (file sizes, time measurements in milliseconds/nanoseconds, in advanced uses for compacting several pieces of data into a single number);
Use a float for those rare cases where you either (a) are storing a huge number of them and the memory saving is worthwhile, or (b) are performing a massive number of calculations, and can afford the loss in accuracy. For most applications, "float" offers very poor precision, but operations can be twice as fast -- it's worth testing this on your processor, though, to find that it's actually the case! [*]
Use a short if you really need 2-byte signed arithmetic. There aren't so many cases...
[*] For example, in Hotspot on Pentium architectures, float and double operations generally take exactly the same time, except for division.
Don't get too bogged down in the memory usage of these types unless you really understand it. For example:
every object size is rounded to 16 bytes in Hotspot, so an object with a single byte field will take up precisely the same space as a single object with a long or double field;
when passing parameters to a method, every type takes up 4 or 8 bytes on the stack: you won't save anything by changing a method parameter from, say, an int to a short! (I've seen people do this...)
Obviously, there are certain API calls (e.g. various calls for non-CPU intensive tasks that for some reason take floats) where you just have to pass it the type that it asks for...!
Note that String isn't a primitive type, so it doesn't really belong in this list.
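Two of the points in the list above (long for magnitudes beyond about 2 billion, and the limited precision of float) are easy to demonstrate. A minimal sketch, with made-up class and method names:

```java
// Illustrative only: RangeAndPrecision and its methods are hypothetical names.
final class RangeAndPrecision {
    // All-int arithmetic: silently overflows once the result passes 2^31 - 1.
    static int millisInDaysInt(int days) {
        return days * 24 * 60 * 60 * 1000;
    }

    // The 24L literal forces the whole multiplication to be done in long.
    static long millisInDaysLong(int days) {
        return days * 24L * 60 * 60 * 1000;
    }

    // True if two ints are still distinguishable after conversion to float.
    static boolean floatCanTellApart(int a, int b) {
        return (float) a != (float) b;
    }

    public static void main(String[] args) {
        System.out.println(millisInDaysInt(30));  // -1702967296 (overflowed)
        System.out.println(millisInDaysLong(30)); // 2592000000
        // float has a 24-bit significand, so 2^24 and 2^24 + 1 collapse to the same value
        System.out.println(floatCanTellApart(16_777_216, 16_777_217)); // false
    }
}
```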

A java int is 32 bits, while a long is 64 bits, so when you need to represent integers larger than 2^31, long is your friend. For a typical example of the use of long, see System.currentTimeMillis()
A byte is 8 bits, and the smallest addressable entity on most modern hardware, so it is needed when reading binary data from a file.
A double has twice the size of a float, so you would usually use a double rather than a float, unless you have some restrictions on size or speed and a float has sufficient capacity.
A short is two bytes, 16 bits. In my opinion, this is the least necessary datatype, and I haven't really seen that in actual code, but again, it might be useful for reading binary file formats or doing low level network protocols. For example ip port numbers are 16 bit.
Char represents a single character, which is 16 bits. This is the same size as a short, but a short is signed (-32768 to 32767) while a char is unsigned (0 to 65535). (This means that an ip port number probably is more correctly represented as a char than a short, but this seems to be outside the intended scope for chars...)
For the really authoritative source on these details, see the Java Language Specification.

You can have a look here about the primitive types in Java.
The main difference between these types is their memory usage. For example, an int uses 32 bits while a byte only uses 8 bits.
Imagine that you work on large structure (arrays, matrices...), then you will better take care of the type you are using in order to reduce the memory usage.

I guess there are several purposes to types of that kind:
1) They enforce restrictions on the size (and sign) of variables that can be stored in them.
2) They can add a bit of clarity to code (e.g. if you use a char, then anyone reading the code knows what you plan to store in it).
3) They can save memory. if you have a large array of numbers, all of which will be unsigned and below 256, you can declare it as an array of bytes, saving some memory compared with if you declared an array of ints.
4) You need a long if the numbers you need to store are larger than 2^31 - 1 (the largest int), and a double for very large floating point numbers.
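Point 3 has a catch in Java: byte is signed, so values from 128 to 255 come back negative unless they are masked on the way out. A small sketch, with hypothetical class and method names:

```java
// Illustrative only: CompactBytes and its methods are hypothetical names.
final class CompactBytes {
    // Stores values in the range 0..255 in one byte each instead of four.
    static byte[] pack(int[] values) {
        byte[] packed = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0 || values[i] > 255) {
                throw new IllegalArgumentException("out of range: " + values[i]);
            }
            packed[i] = (byte) values[i];
        }
        return packed;
    }

    // Reads a value back as 0..255; the mask undoes the sign extension.
    static int get(byte[] packed, int index) {
        return packed[index] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new int[] {0, 200, 255});
        System.out.println(packed[1]);      // -56: the raw signed byte
        System.out.println(get(packed, 1)); // 200: the intended value
    }
}
```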

The primitive data types are required because they are the basis of every complex collection.
long, double, byte etc. are used if you need only a small integer (or whatever), that does not waste your heap space.
I know, there's enough of RAM in our times, but you should not waste it.
I need the "small ones" for database and stream operations.

Integers should be used for numbers in general.
Doubles are the basic data type used to represent decimals.
Strings can hold essentially any data type, but it is easier to use ints and is confusing to use string except for text.
Chars are used when you only wish to hold one letter, although they are essentially only for clarity.
Shorts, longs, and floats may not be necessary, but if you are, for instance, creating an array of size 100,000 which only needs to hold numbers less than 1,000, then you would want to use shorts, simply to save space.

It's relative to the data you're dealing with. There's no point using a data type which reserves a large portion of memory when you're only dealing with a small amount of data. For example, a lot of data types reserve memory before they've even been used. Take arrays for example, they'll reserve a default amount (say, 256 bytes <-- an example!) even if you're only using 4 bytes of that.

Why are there no byte or short literals in Java?

I can create a literal long by appending an L to the value; why can't I create a literal short or byte in some similar way? Why do I need to use an int literal with a cast?
And if the answer is "Because there was no short literal in C", then why are there no short literals in C?
This doesn't actually affect my life in any meaningful way; it's easy enough to write (short) 0 instead of 0S or something. But the inconsistency makes me curious; it's one of those things that bother you when you're up late at night. Someone at some point made a design decision to make it possible to enter literals for some of the primitive types, but not for all of them. Why?

In C, int at least was meant to have the "natural" word size of the CPU and long was probably meant to be the "larger natural" word size (not sure about that last part, but it would also explain why int and long have the same size on x86).
Now, my guess is: for int and long, there's a natural representation that fits exactly into the machine's registers. On most CPUs however, the smaller types byte and short would have to be padded to an int anyway before being used. If that's the case, you can as well have a cast.

I suspect it's a case of "don't add anything to the language unless it really adds value" - and it was seen as adding sufficiently little value to not be worth it. As you've said, it's easy to get round, and frankly it's rarely necessary anyway (only for disambiguation).
The same is true in C#, and I've never particularly missed it in either language. What I do miss in Java is an unsigned byte type :)

Another reason might be that the JVM doesn't know about short and byte. All calculations and storing is done with ints, longs, floats and doubles inside the JVM.
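At the language level this shows up as arithmetic on byte and short operands producing an int. A brief sketch, with hypothetical class and method names:

```java
// Illustrative only: PromotionDemo and its methods are hypothetical names.
final class PromotionDemo {
    // a + b has type int, so narrowing back to byte needs an explicit cast.
    static byte addBytes(byte a, byte b) {
        return (byte) (a + b);
    }

    // Boxing the sum reveals its static type: the result is an Integer, not a Byte.
    static String typeOfSum(byte a, byte b) {
        Object sum = a + b;
        return sum.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(typeOfSum((byte) 1, (byte) 2));    // Integer
        System.out.println(addBytes((byte) 100, (byte) 100)); // -56 after narrowing 200
    }
}
```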

There are several things to consider.
1) As discussed above the JVM has no notion of byte or short types. Generally these types are not used in computation at the JVM level; so one can think there would be less use of these literals.
2) For initialization of byte and short variables, if the int expression is constant and in the allowed range of the type it is implicitly cast to the target type.
3) One can always cast the literal, ex (short)10
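Points 2 and 3 can be sketched as follows. The implicit narrowing only applies to assignments of in-range constants, so a cast is still needed when picking between overloads. The class and method names are made up for the example.

```java
// Illustrative only: LiteralNarrowing and which(...) are hypothetical names.
final class LiteralNarrowing {
    static String which(int x) {
        return "int";
    }

    static String which(short x) {
        return "short";
    }

    public static void main(String[] args) {
        byte b = 10;    // int constant in byte range: narrowed implicitly
        short s = 1000; // likewise for short
        // byte tooBig = 200; // would not compile: 200 is outside -128..127
        System.out.println(b + " " + s);

        System.out.println(which(10));         // int: a bare literal is always an int
        System.out.println(which((short) 10)); // short: the cast is the disambiguation
    }
}
```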
