Why are char[] the only arrays not supported by Arrays.stream()? - java

While going through the ways of converting primitive arrays to Streams, I found that char[] are not supported while other primitive array types are supported. Any particular reason to leave them out in the stream?

Of course, the answer is "because that's what the designers decided". There is no technical reason why CharStream could not exist.
If you want justification, you usually need to turn to the OpenJDK mailing list*. The JDK's documentation is not in the habit of justifying why anything is the way it is.
Someone asked
Using IntStream to represent char/byte stream is a little
inconvenient. Should we add CharStream and ByteStream as well?
The reply from Brian Goetz (Java Language Architect) says
Short answer: no.
It is not worth another 100K+ of JDK footprint each for these forms
which are used almost never. And if we added those, someone would
demand short, float, or boolean.
Put another way, if people insisted we had all the primitive
specializations, we would have no primitive specializations. Which
would be worse than the status quo.
Source
He also says the same elsewhere
If you want to deal with them as chars, you can downcast them to
chars easily enough. Doesn't seem like an important enough use case
to have a whole 'nother set of streams. (Same with Short, Byte,
Float).
Source
TL;DR: Not worth the maintenance cost.
*In case you're curious, the google query I used was
site:http://mail.openjdk.java.net/ charstream
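In the meantime, if you do want to process a char[] with the stream API, the usual workaround (just a sketch; the variable names are mine) is to go through an IntStream and downcast where needed:
import java.nio.CharBuffer;
import java.util.stream.Collectors;

char[] letters = {'a', 'b', 'c'};
// CharBuffer implements CharSequence, so chars() yields an IntStream of char values
String upper = CharBuffer.wrap(letters).chars()
        .map(Character::toUpperCase)
        .mapToObj(c -> String.valueOf((char) c))   // downcast back to char
        .collect(Collectors.joining());            // "ABC"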

As Eran said, it's not the only one missing.
A BooleanStream would be useless, a ByteStream (if it existed) can be handled as an InputStream or converted to IntStream (as can short), and float can be handled as a DoubleStream.
As char cannot represent all characters anyway (see the linked question), a char stream would be a bit of a legacy stream. Admittedly, most people never have to deal with code points, so the omission can seem strange; you use String.charAt() without thinking "this doesn't actually work in all cases".
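To make the charAt()/code point issue concrete, here's a tiny sketch (the emoji is just any character outside the Basic Multilingual Plane):
String s = "a\uD83D\uDE00b";                 // "a😀b": the emoji is one code point but two chars

System.out.println(s.length());              // 4 char units
System.out.println(s.chars().count());       // 4, same as length()
System.out.println(s.codePoints().count());  // 3 actual characters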
So some things were left out because they weren't deemed that important. As said by JB Nizet in the linked question:
The designers explicitly chose to avoid the explosion of classes and
methods by limiting the primitive streams to 3 types, since the other
types (char, short, float) can be represented by their larger
equivalent (int, double) without any significant performance penalty.
The reason a BooleanStream would be useless is that you only have 2 values, and that limits the operations a lot. There are no mathematical operations to do, and how often are you working with lots of boolean values anyway?
As can be seen from the comments, a BooleanStream is not needed. If it were, there would be plenty of actual use cases instead of theoretical situations, a use case going back to Java 1.4, and a fallacious comparison to a while loop.

It's not only char arrays that are not supported.
There are only 3 types of primitive streams - IntStream, LongStream and DoubleStream.
As a result, Arrays has methods that convert int[], long[] and double[] to the corresponding primitive streams.
There are no corresponding methods for boolean[], byte[], short[], char[] and float[], since these primitive types have no corresponding primitive streams.
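If you still need to stream one of the unsupported array types, a common workaround (a sketch, not an official API) is to index into the array with an IntStream:
import java.util.Arrays;
import java.util.stream.DoubleStream;
import java.util.stream.IntStream;

int[] ints = {1, 2, 3};
char[] chars = {'x', 'y', 'z'};
float[] floats = {1.5f, 2.5f};

IntStream a = Arrays.stream(ints);                                               // supported directly
IntStream b = IntStream.range(0, chars.length).map(i -> chars[i]);               // char[] via indexing
DoubleStream c = IntStream.range(0, floats.length).mapToDouble(i -> floats[i]);  // float[] widened to double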

char is a dependent part of String, storing UTF-16 code units. A Unicode symbol, a code point, is sometimes a surrogate pair of chars, so any simple solution based on chars covers only part of the Unicode domain.
There was a time when char deserved to be a public type in its own right, but nowadays it is better to use code points, i.e. an IntStream. A stream of chars could not straightforwardly handle surrogate pairs.
The other, more prosaic reason is that the JVM's "processor" model uses an int as its smallest "register", keeping booleans, bytes, shorts and also chars in such an int-sized storage location. To avoid bloating the JDK unnecessarily, the designers refrained from adding every possible specialized variant.
In the future one might expect primitive types to be allowed as generic type parameters, providing a List<int>. Then we might also see a Stream<char>.
For the moment, it is better to avoid char, and maybe use java.text.Normalizer to get a unique canonical form of code points / Unicode strings.
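For example, a minimal sketch of what that normalization buys you (the strings are two spellings of "é"):
import java.text.Normalizer;

String composed = "\u00E9";      // "é" as a single code point
String decomposed = "e\u0301";   // "e" followed by a combining acute accent

System.out.println(composed.equals(decomposed));  // false
System.out.println(
        Normalizer.normalize(composed, Normalizer.Form.NFC)
                  .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC)));  // true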

Related

Why isn't BigInteger a primitive

If you use BigInteger (or BigDecimal) and want to perform arithmetic on them, you have to use the methods add or subtract, for example. This may sound fine until you realize that this
i += d + p + y;
would be written like this for a BigInteger:
i = i.add(d.add(p.add(y)));
As you can see, the first line is a little easier to read. This could be solved if Java allowed operator overloading, but it doesn't, which raises the question:
Why isn't BigInteger a primitive type so it can take advantage of the same operators as other primitive types?
That's because BigInteger is not, in fact, anything close to being a primitive. It is implemented using an array and some additional fields, and its operations are genuinely complex. For example, here is the implementation of add:
public BigInteger add(BigInteger val) {
    if (val.signum == 0)
        return this;
    if (signum == 0)
        return val;
    if (val.signum == signum)
        return new BigInteger(add(mag, val.mag), signum);

    int cmp = compareMagnitude(val);
    if (cmp == 0)
        return ZERO;
    int[] resultMag = (cmp > 0 ? subtract(mag, val.mag)
                               : subtract(val.mag, mag));
    resultMag = trustedStripLeadingZeroInts(resultMag);
    return new BigInteger(resultMag, cmp == signum ? 1 : -1);
}
Primitives in Java are types that are usually implemented directly by the CPU of the host machine. For example, every modern computer has a machine-language instruction for integer addition. Therefore it can also have very simple byte code in the JVM.
A complex type like BigInteger cannot usually be handled that way, and it cannot be translated into simple byte code. It cannot be a primitive.
So your question might be "Why no operator overloading in Java". Well, that's part of the language philosophy.
And why not make an exception, like for String? Because it's not just one operator that would be the exception: you would need exceptions for *, /, +, -, <<, ^ and so on. And you'd still have some operations on the object itself (like pow, which is not represented by an operator in Java) that for primitives are handled by utility classes (like Math).
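To make the contrast concrete, here's a small sketch (the values are arbitrary):
import java.math.BigInteger;

int a = 2, b = 3;
int sum = a + b;                      // operators work directly on primitives
double power = Math.pow(a, 10);       // pow for primitives lives in the Math utility class

BigInteger x = BigInteger.valueOf(2);
BigInteger y = BigInteger.valueOf(3);
BigInteger bigSum = x.add(y);         // no '+' operator: everything is a method call
BigInteger bigPower = x.pow(10);      // pow is a method on the object itself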
Fundamentally, because the informal meaning of "primitive" is data that can be handled directly with a single CPU instruction. In other words, they are primitives because they fit in a 32- or 64-bit word, which is the data architecture your CPU works with, so they can be stored directly in the registers.
And thus your CPU can make the following operation:
ADD REGISTER_3 REGISTER_2 REGISTER_1 ;;; REGISTER_3 = REGISTER_1 + REGISTER_2
A BigInteger, which can occupy an arbitrarily large amount of memory, can't be stored in a single register, and even a simple sum requires multiple instructions.
This is why it couldn't possibly be a primitive type; instead BigIntegers are objects with methods and fields, a much more complex structure than a simple primitive type.
Note: The reason I called this informal is that ultimately the Java designers could define a "Java primitive type" as anything they wanted (they own the term), but this is roughly the agreed use of the word.
int, boolean and char aren't primitives so that you can take advantage of operators like + and /; they are primitives for historical reasons, the biggest of which is performance.
In Java, primitives are defined as just those things that are not full-fledged Objects. Why create these unusual structures (and then re-implement them as proper objects, like Integer, later on)? Primarily for performance: operations on Objects were (and are) slower than operations on primitive types. (As other answers mention, hardware support made these operations faster, but I'd disagree that hardware support is an "essential property" of primitives.)
So some types received "special treatment" (and were implemented as primitives), and others didn't. Think of it this way: if even the wildly-popular String is not a primitive type, why would BigInteger be?
It's because primitive types have a size limit. For instance int is 32 bits and long is 64 bits. So if you create a variable of type int the JVM allocates 32 bits of memory on the stack for it. But as for BigInteger, it "theoretically" has no size limit. Meaning it can grow arbitrarily in size. Because of this, there is no way to know its size and allocate a fixed block of memory on the stack for it. Therefore it is allocated on the heap where the JVM can always increase the size if needed.
Primitive types are normally historic types defined by processor architecture, which is why byte is 8-bit, short is 16-bit, int is 32-bit and long is 64-bit. Maybe when there are more 128-bit architectures an extra primitive will be created... but I can't see there being enough drive for this.

Why would you need unsigned types in Java?

I have often heard complaints against Java for not having unsigned data types. See for example this comment. I would like to know how is this a problem? I have been programming in Java for 10 years more or less and never had issues with it. Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Since unsigned and signed numbers are represented with the same bit values, the only places I can think of where signedness matters are:
When converting the numbers to other bit representations. Between 8, 16 and 32 bit integer types you can use bitmasks if needed.
When converting numbers to decimal format, usually to Strings.
Interoperating with non-Java systems through APIs or protocols. Again the data is just bits, so I don't see the problem here.
Using the numbers as memory or other offsets. With 32-bit ints this might be a problem for very large offsets.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those. What am I missing? What are the actual benefits of having unsigned types in a programming language and how would having those make Java better?
Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Why not? Is "applying a bitwise AND with 0xFF" actually part of what your code is trying to represent? If not, why should it have to be part of how you write it? I actually find that almost anything I want to do with bytes beyond just copying them from one place to another ends up requiring a mask. I want my code to be cruft-free; the lack of unsigned bytes hampers this :(
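For example, the cruft in question (a minimal sketch):
byte b = (byte) 0xF0;   // logically the unsigned value 240

int wrong = b;          // -16: sign extension kicks in
int right = b & 0xFF;   // 240: the mask restores the intended unsigned value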
Additionally, consider an API which will always return a non-negative value, or only accepts non-negative values. Using an unsigned type allows you to express that clearly, without any need for validation. Personally I think it's a shame that unsigned types aren't used more in .NET, e.g. for things like String.Length, ICollection.Count etc. It's very common for a value to naturally only be non-negative.
Is the lack of unsigned types in Java a fatal flaw? Clearly not. Is it an annoyance? Absolutely.
The comment that you quote hits the nail on the head:
Java's lack of unsigned data types also stands against it. Yes, you can work around it, but it's not ideal and you'll be using code that doesn't really reflect the underlying data correctly.
Suppose you are interoperating with another system, which wants an unsigned 16 bit integer, and you want to represent the number 65535. You claim "the data is just bits, so I don't see the problem here" - but having to pass -1 to mean 65535 is a problem. Any impedance mismatch between the representation of your data and its underlying meaning introduces an extra speedbump when writing, reading and testing the code.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those.
The only times you would need to consider those operations is when you were naturally working with values of two different types - one signed and one unsigned. At that point, you absolutely want to have that difference pointed out. With signed types being used to represent naturally unsigned values, you should still be considering the differences, but the fact that you should is hidden from you. Consider:
// This should be considered unsigned - so a value of -1 is "really" 65535
short length = /* some value */;
// This is really signed
short foo = /* some value */;
boolean result = foo < length;
Suppose foo is 100 and length is -1. What's the logical result? The value of length represents 65535, so logically foo is smaller than it. But you'd probably go along with the code above and get the wrong result.
Of course they don't even need to represent different types here. They could both be naturally unsigned values, represented as signed values with negative numbers being logically greater than positive ones. The same error applies, and wouldn't be a problem if you had unsigned types in the language.
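Since Java 8 you can at least make the intent explicit with the unsigned helper methods; a sketch reusing the example above:
short length = (short) 65535;   // stored as -1
short foo = 100;

System.out.println(foo < length);                                            // false: signed comparison
System.out.println(Short.toUnsignedInt(foo) < Short.toUnsignedInt(length));  // true: compared as unsigned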
You might also want to read this interview with Joshua Bloch (Google cache, as I believe it's gone from java.sun.com now), including:
Ooh, good question... I'm going to say that the strangest thing about the Java platform is that the byte type is signed. I've never heard an explanation for this. It's quite counterintuitive and causes all sorts of errors.
If you like, yes, everything is ones and zeroes. However, your hardware arithmetic and logic unit doesn't work that way. If you want to store your bits in a signed integer value but perform operations that are not natural to signed integers, you will usually waste both storage space and processing time.
An unsigned integer type stores twice as many non-negative values in the same space as the corresponding signed integer type. So if you want to take into Java any data commonly used in a language with unsigned values, such as a POSIX date value (unsigned number of seconds) that is normally used with C, then in general you will need to use a wider integer type than C would use. If you are processing many such values, again you will waste both storage space and fetch-execute time.
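For instance, an unsigned 32-bit seconds value coming from C has to be widened before Java can treat it numerically (a sketch; the Java 8+ helper is one option, masking is the older one):
int raw = 0xFFFFFFFE;                        // bits of an unsigned 32-bit value from C (-2 as a signed int)

long seconds = Integer.toUnsignedLong(raw);  // 4294967294, now held in a wider type
long sameThing = raw & 0xFFFFFFFFL;          // pre-Java-8 equivalent using a mask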
The times I have used unsigned data types have been when reading in large blocks of data that correspond to images, or when working with OpenGL. I personally prefer unsigned when I know something will never be negative, as a "safety feature" of sorts.
Unsigned types are useful for bit-by-bit comparisons, and I'm pretty sure they are used extensively in graphics.

Objective-C and Java Primitive Data Types

I need to convert a piece of code from Objective-C to Java, but I have a problem understanding the primitive types in Objective-C. I have these data types in my Objective-C code:
UInt64, UInt32, UInt8,
which are unsigned integers (as I understand from the internet). So my question is: can I use Java primitive types like byte (8-bit) instead of UInt8, int (32-bit) instead of UInt32, and long (64-bit) instead of UInt64?
Unfortunately, it isn't a straight translation, and without knowing more about your program it's hard to suggest what the "right" approach is.
If your UInt8 values really range from 0-255, you may have to use a Java signed int to be able to hold the entire range.
If you are dealing with byte streams or memory layouts and really need to use just a single byte of memory, then you could try byte, but you may have to test for and handle the cases where the high bit is set (value > 127). Ditto with the other unsigned types.
Ideally, if your code just kind of "defaulted" to the unsigned types, but really the signed versions would have worked fine too (i.e. the ranges of your values never equal or exceed 2^7, 2^15, or 2^31 respectively), then you may be fine with the "straight" translation to byte, int, and long.
Yes, those are the correctly sized data types to use in Java. Make sure you take into account that Java does not have unsigned types; the trick is to use the next larger size when you need the full unsigned range. 64-bit unsigned arithmetic requires special consideration.
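A rough sketch of what that can look like in practice (the variable names are mine; the values are assumed to arrive bit-for-bit from the Objective-C side):
byte rawU8 = (byte) 0xFF;    // bits of a UInt8
int rawU32 = 0xFFFFFFFF;     // bits of a UInt32
long rawU64 = -1L;           // bits of a UInt64

int u8Value = rawU8 & 0xFF;                      // 255
long u32Value = rawU32 & 0xFFFFFFFFL;            // 4294967295
// UInt64 has no wider primitive; use the Java 8+ unsigned helpers instead
String u64Text = Long.toUnsignedString(rawU64);  // "18446744073709551615"
int cmp = Long.compareUnsigned(rawU64, 1L);      // positive: -1L is treated as the maximum unsigned value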

What is the purpose of long, double, byte, char in Java?

So I'm learning Java, and I have a question. It seems that the types int, boolean and String will be good for just about everything I'll ever need in terms of variables, except that perhaps float could be used when decimal numbers are needed.
My question is, are the other types such as long, double, byte, char etc ever used in normal, everyday programming? What are some practical things these could be used for? What do they exist for?
With the possible exception of short, which is arguably a bit of a waste of space (sometimes literally), they're all horses for courses:
Use an int when you don't need fractional numbers and you've no reason to use anything else; on most processors/OS configurations, this is the size of number that the machine can deal with most efficiently;
Use a double when you need fractional numbers and you've no reason to use anything else;
Use a char when you want to represent a character (or possibly rare cases where you need two-byte unsigned arithmetic);
Use a byte if either you specifically need to manipulate a signed byte (rare!), or when you need to move around a block of bytes;
Use a boolean when you need a simple "yes/no" flag;
Use a long for those occasions where you need a whole number, but where the magnitude could exceed 2 billion (file sizes, time measurements in milliseconds/nanoseconds, in advanced uses for compacting several pieces of data into a single number);
Use a float for those rare cases where you either (a) are storing a huge number of them and the memory saving is worthwhile, or (b) are performing a massive number of calculations and can afford the loss in accuracy. For most applications, float offers very poor precision, but operations can be twice as fast; it's worth testing this on your processor, though, to confirm that it's actually the case! [*]
Use a short if you really need 2-byte signed arithmetic. There aren't so many cases...
[*] For example, in Hotspot on Pentium architectures, float and double operations generally take exactly the same time, except for division.
Don't get too bogged down in the memory usage of these types unless you really understand it. For example:
every object size is rounded to 16 bytes in Hotspot, so an object with a single byte field will take up precisely the same space as a single object with a long or double field;
when passing parameters to a method, every type takes up 4 or 8 bytes on the stack: you won't save anything by changing a method parameter from, say, an int to a short! (I've seen people do this...)
Obviously, there are certain API calls (e.g. various calls for non-CPU intensive tasks that for some reason take floats) where you just have to pass it the type that it asks for...!
Note that String isn't a primitive type, so it doesn't really belong in this list.
A java int is 32 bits, while a long is 64 bits, so when you need to represent integers larger than 2^31, long is your friend. For a typical example of the use of long, see System.currentTimeMillis()
A byte is 8 bits, and the smallest addressable entity on most modern hardware, so it is needed when reading binary data from a file.
A double has twice the size of a float, so you would usually use a double rather than a float, unless you have some restrictions on size or speed and a float has sufficient capacity.
A short is two bytes, 16 bits. In my opinion, this is the least necessary datatype, and I haven't really seen it in actual code, but again, it might be useful for reading binary file formats or implementing low-level network protocols. For example, IP port numbers are 16 bits.
Char represents a single character, which is 16 bits. This is the same size as a short, but a short is signed (-32768 to 32767) while a char is unsigned (0 to 65535). (This means that a port number is arguably more correctly represented as a char than a short, but this seems to be outside the intended scope for chars...)
For the truly authoritative source on these details, see the Java Language Specification.
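A minimal sketch of the int-overflow point above:
long now = System.currentTimeMillis();  // milliseconds since 1970: far too big for an int
int maxInt = Integer.MAX_VALUE;         // 2147483647, roughly 2.1 billion

System.out.println(maxInt + 1);         // -2147483648: int arithmetic silently wraps around
System.out.println(maxInt + 1L);        // 2147483648: promoting to long avoids the overflow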
You can have a look at the documentation of the primitive types in Java.
The main difference between these types is their memory usage. For example, int uses 32 bits while byte only uses 8 bits.
If you work with large structures (arrays, matrices, ...), you will want to pay attention to the types you use in order to reduce memory usage.
I guess there are several purposes to types of that kind:
1) They enforce restrictions on the size (and sign) of variables that can be stored in them.
2) They can add a bit of clarity to code (e.g. if you use a char, then anyone reading the code knows what you plan to store in it).
3) They can save memory. If you have a large array of numbers, all of which will be unsigned and below 256, you can declare it as an array of bytes, saving memory compared with declaring an array of ints (see the sketch after this list).
4) You need long if the numbers you need to store are larger than 2^32 and a double for very large floating point numbers.
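A rough sketch of point 3 (exact overheads vary by JVM, but the per-element sizes don't):
int n = 100_000;

int[] asInts = new int[n];      // about 400 KB of element data (4 bytes each)
byte[] asBytes = new byte[n];   // about 100 KB of element data (1 byte each)

asBytes[0] = (byte) 200;            // fine as a bit pattern, but reads back as -56 (byte is signed)
int restored = asBytes[0] & 0xFF;   // 200 again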
The primitive data types are required because they are the basis of every complex collection.
long, double, byte, etc. exist so that you can pick the size you actually need and not waste heap space.
I know there's plenty of RAM these days, but you still should not waste it.
I need the "small ones" for database and stream operations.
Integers should be used for numbers in general.
Doubles are the basic data type used to represent decimals.
Strings can hold essentially any data as text, but it is easier to use ints for numbers, and it is confusing to use Strings for anything except text.
Chars are used when you only wish to hold one character, although they are essentially only for clarity.
Shorts, longs and floats may not seem necessary, but if you are, for instance, creating an array of size 100,000 which only needs to hold numbers less than 1,000, then you would want to use shorts, simply to save space.
It's relative to the data you're dealing with. There's no point using a data type which reserves a large portion of memory when you're only dealing with a small amount of data. For example, a lot of data types reserve memory before they've even been used. Take arrays, for example: they'll reserve a default amount (say, 256 bytes, just as an example!) even if you're only using 4 bytes of that.

Why are there no byte or short literals in Java?

I can create a literal long by appending an L to the value; why can't I create a literal short or byte in some similar way? Why do I need to use an int literal with a cast?
And if the answer is "Because there was no short literal in C", then why are there no short literals in C?
This doesn't actually affect my life in any meaningful way; it's easy enough to write (short) 0 instead of 0S or something. But the inconsistency makes me curious; it's one of those things that bother you when you're up late at night. Someone at some point made a design decision to make it possible to enter literals for some of the primitive types, but not for all of them. Why?
In C, int at least was meant to have the "natural" word size of the CPU, and long was probably meant to be the "larger natural" word size (not sure about that last part, but it would also explain why int and long have the same size on 32-bit x86).
Now, my guess is: for int and long, there's a natural representation that fits exactly into the machine's registers. On most CPUs, however, the smaller types byte and short would have to be widened to an int anyway before being used. If that's the case, you might as well have a cast.
I suspect it's a case of "don't add anything to the language unless it really adds value" - and it was seen as adding sufficiently little value to not be worth it. As you've said, it's easy to get round, and frankly it's rarely necessary anyway (only for disambiguation).
The same is true in C#, and I've never particularly missed it in either language. What I do miss in Java is an unsigned byte type :)
Another reason might be that the JVM doesn't know about short and byte. All calculation and storage inside the JVM is done with ints, longs, floats and doubles.
There are several things to consider.
1) As discussed above, the JVM has no notion of byte or short types. Generally these types are not used in computation at the JVM level, so one can expect there to be little use for such literals.
2) For initialization of byte and short variables, if the int expression is constant and within the allowed range of the type, it is implicitly narrowed to the target type (see the sketch after this list).
3) One can always cast the literal, e.g. (short) 10.
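A small sketch of points 2 and 3:
byte b = 100;             // OK: the constant int literal fits in a byte, so it is narrowed implicitly
short s = 1000;           // OK for the same reason
// byte tooBig = 200;     // does not compile: 200 is outside byte's range (-128..127)
byte forced = (byte) 200; // an explicit cast compiles, but the value wraps to -56

long big = 10_000_000_000L;  // long is the one integer type that does get its own literal suffix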
