Why does StructuredArray need to be non-constructible?

This talk at 34:00 describes the design of StructuredArrays for Java. Everything's rather clear, except for one thing:
It shouldn't be constructible, i.e., an instance may only be obtainable via some static factory method like newInstance. At the same time, it should be subclassable, which means that there must be a public constructor and the non-constructibility will be assured at runtime. This sounds very hacky, so I wonder why?
I'm aware of the advantages of factories in general and static factory methods in particular. But what do we get here that makes the hack acceptable?

The point of the StructuredArray class is that someday it can be replaced with an intrinsic implementation that allocates the whole array, including the component objects, as one long block of memory. When this happens, the size of the object will depend on the number of elements and the element class.
If StructuredArray had a public constructor, then you could write x = new StructuredArray<>(StructuredArray.class, MyElement.class, length). This doesn't seem to present any problem, except that in bytecode, this turns into a new instruction that allocates the object, and then a separate invokespecial instruction to call the object's constructor.
You see the problem -- the new instruction has to allocate the object, but it cannot, because the size of the object depends on constructor parameters (the element class and length) that it doesn't have! Those aren't passed until the constructor call that follows sometime later.
There are ways to work around problems like this, but they're all kinda gross. It makes a lot more sense to encapsulate construction in a static factory method, because then you just can't write new StructuredArray..., and the JVM doesn't have to use any "magic" to figure out how much memory to allocate in the new instruction for StructuredArray, because there just can't be any such instructions*.
If some later JVM wants to provide an intrinsic implementation of the static factory that allocates a contiguous array, then it's no problem -- it gets all the information it needs in the factory method invocation.
NB* - yes, OK, technically you can write new StructuredArray..., but it doesn't make a useful object for you.
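The question notes that subclassability requires a public constructor, with non-constructibility enforced at runtime. A minimal sketch of one way such a runtime check can work (this is not necessarily ObjectLayout's actual mechanism; the class and flag names here are hypothetical):

// Subclassable but not directly constructible: the public constructor
// rejects any call that does not come through the static factory.
public class PseudoStructuredArray<T> {
    private static final ThreadLocal<Boolean> IN_FACTORY =
            ThreadLocal.withInitial(() -> false);

    public static <T> PseudoStructuredArray<T> newInstance(Class<T> elementClass, long length) {
        IN_FACTORY.set(true);
        try {
            return new PseudoStructuredArray<>(elementClass, length);
        } finally {
            IN_FACTORY.set(false);
        }
    }

    public PseudoStructuredArray(Class<T> elementClass, long length) {
        if (!IN_FACTORY.get()) {
            throw new IllegalStateException("use newInstance(...) instead of new");
        }
        // ... element storage would be set up here ...
    }
}

With this shape, new PseudoStructuredArray<>(...) compiles but fails at runtime; a real factory would also construct subclasses reflectively inside the flagged region.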

Semantics
Going through the API documentation, my understanding is that it is mostly a question of semantics, and of providing a fluent API. Also, if you go to the conclusion slide of the presentation, you should notice that the Semantics bullet comes first (if we don't count the source code URL).
If we pick normal arrays, they present clear semantics:
- the type of the array
- the length of the array
- the type of the elements
As a result, we have a unified model of working with arrays, and the API is crystal clear. There are no 10 different ways of working with arrays. I believe that for the Java language developers, this cleanness of the API is of extreme importance. By forcing non-constructibility, they are implicitly forcing us to use the API the way they want us to use it.
Construction
Since the StructuredArray essentially is an array as well, presenting a constructor would immediately force us to use a concrete implementation of the StructuredArray, which automatically creates problems for this unified model of "What exactly is an Array?".
This is why, going through the Javadoc, we can see the way the StructuredArray is actually constructed:
static <S extends StructuredArray<T>,T> S newInstance(java.lang.invoke.MethodHandles.Lookup lookup,
java.lang.Class<S> arrayClass,
java.lang.Class<T> elementClass,
java.util.Collection<T> sourceCollection)
What is visible here is that the StructuredArray is forcing several things:
It is forcing all client classes to work with StructuredArray and not with the concrete implementation.
StructuredArray is essentially immutable.
The immutability means that there is a strict notion of length.
The StructuredArray has a source of elements, which, once consumed, may be disposed.
And similarly to a regular array, the StructuredArray has a concept of the TYPE of its elements.
I believe that there is a very strong notion of semantics here, and the authors are giving us an excellent hint as to how the coding is supposed to happen.
Another interesting feature of structured arrays is the ability to pass a constructor. Again, we are talking about a strong decoupling of the interface and the API from the actual implementation.
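To make the shape of this API concrete, here is a compilable mock that mirrors the quoted factory signature and the constraints listed above. These are hypothetical stand-ins, not the real org.ObjectLayout classes:

import java.lang.invoke.MethodHandles;
import java.util.Collection;

class MockStructuredArray<T> {
    private final Object[] elements;      // stand-in for contiguous storage
    private final Class<T> elementClass;  // the TYPE of the elements

    private MockStructuredArray(Class<T> elementClass, Collection<T> source) {
        this.elementClass = elementClass;
        this.elements = source.toArray(); // source consumed; length fixed
    }

    static <S extends MockStructuredArray<T>, T> S newInstance(
            MethodHandles.Lookup lookup,
            Class<S> arrayClass,
            Class<T> elementClass,
            Collection<T> sourceCollection) {
        // A real implementation would use lookup and arrayClass to build the
        // requested subclass; this mock only ever builds the base class.
        @SuppressWarnings("unchecked")
        S result = (S) new MockStructuredArray<>(elementClass, sourceCollection);
        return result;
    }

    long getLength() { return elements.length; } // immutable, so length is strict
}

Note how a client only ever holds the type returned by newInstance; the constructor is private in the mock, and effectively unusable in the real class.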
Array Model
My words are further confirmed by examining the StructuredArrayModel:
http://objectlayout.github.io/ObjectLayout/JavaDoc/index.html?org/ObjectLayout/StructuredArray.html
StructuredArrayModel(java.lang.Class<S> arrayClass, java.lang.Class<T> elementClass, long length)
Three things are visible from the constructor:
- Array class
- Type of the elements
- length
Observing further, these are the constructs that the StructuredArray supports:
An array of structs:
struct foo[];
A struct with a struct inside:
struct foo { int a; bar b; int c; };
A struct with an array at the end:
struct foo { int len; char[] payload; };
All of these are fully supported by the StructuredArrayModel.
In contrast to the StructuredArray, we have the ability to easily instantiate concrete implementations of the model.
StructuredArray also presents us with the ability to pass pseudo-constructors: http://objectlayout.github.io/ObjectLayout/JavaDoc/org/ObjectLayout/CtorAndArgs.html
newInstance(CtorAndArgs<S> arrayCtorAndArgs, java.lang.Class<T> elementClass, long length)

Related

In Java, are all members of a class stored in contiguous memory?

While searching whether this was already answered, I found Are class members guaranteed to be contiguous in memory?, but that deals with C++, not Java.
To provide context, I have a background in Go and I'm learning Java. I know that with Go, I can write a struct using pointers like this:
type myStruct struct {
    firstMember  *string
    secondMember *int
}
But when studying Go in detail, I often read about this being a bad idea unless you really need them to be pointers, because it means the values for each member can be spread anywhere across dynamic memory, hurting performance because it's less able to take advantage of spatial locality in the CPU.
Instead, it's often recommended to write the struct this way, without using pointers:
type myStruct struct {
    firstMember  string
    secondMember int
}
As I learn how to effectively write Java, I'm curious if I have this same tool in my toolset when working with Java. Since I don't have the ability to use pointers (because every variable whose type is a class is a reference to that class, effectively a pointer), I can only write the class using String and int:
class MyClass {
    String firstMember;
    int secondMember;
}
Realizing that this was the only way to write a class for my data structure led me to the question posed.
But when studying Go in detail, I often read about this being a bad idea unless you really need them to be pointers, because it means the values for each member can be spread anywhere across dynamic memory, hurting performance because it's less able to take advantage of spatial locality in the CPU.
You have no choice in Java.
class MyClass {
    String firstMember;
    int secondMember;
}
The String-valued member is, and can only be, a reference (i.e., effectively a pointer). The int-valued member is a primitive value (i.e., not a pointer).
The Java world is divided into primitive values and objects (of some class, or arrays, and so on). Variables of the former types are not references; variables of the latter types are.
The Java Language Specification does not talk about object layout at all; that's not a concept that appears in the language.
The JVM Specification specifically says:
"The Java Virtual Machine does not mandate any particular internal structure for objects."
Pragmatically, you might guess that the body of a class instance is a single piece of memory, but that still leaves open the questions of alignment, padding, and ordering of members (no reason to keep source-code order that I can see, and some reasons to reorder).
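If you want to see what a particular JVM actually chose, the OpenJDK JOL ("Java Object Layout") tool can print a best-effort view. A sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath:

import org.openjdk.jol.info.ClassLayout;

public class LayoutProbe {
    static class MyClass {
        String firstMember;
        int secondMember;
    }

    public static void main(String[] args) {
        // Prints field offsets, alignment, and padding as this JVM laid them out.
        System.out.println(ClassLayout.parseClass(MyClass.class).toPrintable());
    }
}

The output is implementation-specific, which is exactly the point: the language itself gives no guarantees.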

Why are the 'Arrays' class' methods all static in Java?

I was going through the Java documentation, and I learned that the methods in the Arrays class in Java are all static. I don't really understand why they made them static.
For example, the following code violates the OO approach, because if I have a type X, then all the methods which act on it should be inside it:
int[] a = {34, 23, 12};
Arrays.sort(a);
It would be better if they have implemented the following way:
int[] a = {34, 23, 12};
a.sort();
Can anyone explain this to me?
In Java there is no way to extend the functionality of an array. Arrays all inherit from Object, but this gives very little. IMHO this is a deficiency of Java.
Instead, to add functionality for arrays, static utility methods are added to classes like Array and Arrays. These methods are static because there is no array type they could be added to as instance methods.
Good observation. Observe also that not every array can be sorted. Only arrays of primitives and Objects which implement the Comparable interface can be sorted. So a general sort() method that applies to all arrays is not possible. And so we have several overloaded static methods for each of the supported types that are actually sortable.
Update:
@Holger correctly points out in the comments below that one of the overloaded static methods is indeed Arrays.sort(Object[]), but the docs explicitly state:
All elements in the array must implement the Comparable interface.
So it doesn't work for Objects that don't implement Comparable or one of its subinterfaces.
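A short demonstration of that caveat (Widget is a hypothetical class):

import java.util.Arrays;

public class SortNonComparable {
    static class Widget { }  // does not implement Comparable

    public static void main(String[] args) {
        Widget[] widgets = {new Widget(), new Widget()};
        // Compiles fine, but throws ClassCastException at run time because
        // Widget cannot be cast to Comparable.
        Arrays.sort(widgets);
    }
}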
First of all, Arrays is a utility class, which does exactly that: it exposes static methods. It is separate from any arr[] instance and has no OO relation to it. There are several classes like that, like Collections or the various StringUtils classes.
Arrays are collections; they are used to store data. Arrays.sort() is an algorithm which sorts the collection. There may be many other algorithms which sort data in different ways, and all of them would be used in the same way: MyAlgorithm.doSthWithArray(array). Even if there were a sort() method on an array (it would then have to be a SortableArray, because not all Objects can be sorted automatically), all other algorithms would still have to be called the old way anyway. Unless a visitor pattern were introduced... But that makes things too complicated, hence there is no point.
For a Java Collection there's Collections.sort(); even in C++ there is std::sort, which works similarly, as does qsort in C. I don't see a problem here, I see consistency.
Static methods are sometimes used for utility purposes.
So Arrays is a utility class for general-purpose array operations.
Similarly, Collections is also a utility class where utility methods are provided.
Arrays are kind of like second-class generics. When you make an array it makes a custom class for the array type, but it's not full featured because they decided how arrays would work before they really fleshed out the language.
That, combined with maintaining backwards compatibility, means that Arrays are stuck with an archaic interface.
It's just an old part of the API.
An array is not an object which stores state, beyond the actual values in the array. In other words, it's just a "dumb container". It doesn't "know" any behaviour.
A utility class is a class which has just public static methods which are stateless functions. Sorting is stateless because nothing is remembered between calls to the method. It runs "standalone", applying its formula to whatever object is passed in, as long as that object is "sortable". A second instance of an Arrays class would behave no differently, so one static utility class is all that's needed.
As Dariusz pointed out, there are different ways of sorting. So you could have MyArrays.betterSort(array) as well as Arrays.sort(array).
If you wanted to have the array "know" how best to sort its own members, you'd have to have your own array class which extends an array.
But what if you had a situation where you wanted different sorting at different times on the same array? A contrived example, maybe, but there are plenty of similar real-world examples.
And now you're getting complicated. Maybe an array of type T sorts differently than one of type S....
It's made simple with a static utility and the Comparator<T> interface.
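For instance, a short example of sorting the same array two different ways with Arrays.sort and a Comparator:

import java.util.Arrays;

public class ComparatorSortDemo {
    public static void main(String[] args) {
        String[] names = {"Carol", "alice", "Bob"};

        // Natural (case-sensitive) order: uppercase sorts before lowercase.
        Arrays.sort(names);
        System.out.println(Arrays.toString(names));  // [Bob, Carol, alice]

        // Case-insensitive order, with no change to any array class.
        Arrays.sort(names, String.CASE_INSENSITIVE_ORDER);
        System.out.println(Arrays.toString(names));  // [alice, Bob, Carol]
    }
}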
For me this is the perfect solution: I have an array, and I have a class, Arrays, which operates on the data in the array. For example, you may want to hold some random numbers and never sort them; if the array itself carried sorting and every other utility method, you would receive behavior which you don't want. That's why in code design it is good to separate data from behavior.
You can read about the single responsibility principle.
The Arrays class contains methods that are independent of state, so they should be static. It's essentially a utility class.
While OOP principles don't apply, the current way is clearer, more concise, and more readable, since you don't have to worry about polymorphism and inheritance. This all reduces scope, which ultimately reduces the chance that you screw something up.
Now, you may ask yourself "Why can't I extend the functionality of an array in Java?". A nice answer is that this introduces potential security holes, which could break system code.

Java's typing system: prefer interface types to class types as method parameters/return values

I'm making an effort to understand the power of interfaces and how to use them to best advantage.
So far, I understood that interfaces:
enable us to have another layer of abstraction, separating the what (defined by the interface) from the how (any valid implementation).
Given just one single implementation, I would just build a house (in one particular way) and say "here, it's done", instead of coming round with a building plan (the interface) and asking you, the other developers, to build it as I expect.
So far, so good.
What still puzzles me is why to favor interface types over class types when it comes to method parameters and return values. Why is that so? What are the benefits (drawbacks of the class approach)?
What interests me the most is how this actually translates into code.
Say we have a sort of pseudo math interface:
import java.util.List;

public interface pseudoMathInterface {
    double getValue();
    double getSquareRoot();
    List<Double> getFirstHundredPrimes();
}
//...
public class mathImp implements pseudoMathInterface { }
//.. actual implementation
So in the case of the getPrimes() method, I would bind it to List, meaning any concrete implementation of the List interface rather than a concrete implementation such as ArrayList!?
And in terms of the method parameters, would I once again broaden my opportunities while ensuring that I can do with the type whatever I would like to do, given that it is part of the contract of the interface which the type finally implements?
Say you are the creator of a Maven dependency, a JAR with a well-known, well-specified API.
If your method requests an ArrayList<Thing>, treating it as a collection of Things, but all I have got is a HashSet<Thing>, your method will twist my arm into copying everything into an ArrayList for no benefit;
if your method declares to return an ArrayList<Thing>, which (semantically) contains just a collection of Things and the index of an element within it carries no meaning, then you are forever binding yourself to returning an actual ArrayList, even though e.g. the future course of the project makes it obvious that a custom collection implementation, specifically tailored to the optimization of the typical use case of this method, is desperately needed to improve a key performance bottleneck.
You are forced to make an API breaking change, again for no benefit to your client, but just to fix an internal issue. In the meantime you've got people writing code which assumes an ArrayList, such as iterating through it by index (there is an extremely slight performance gain to do so, but there are early optimizers out there to whom that's plenty).
I propose you judiciously generalize from the above two statements into general principles which capture the "why" of your question.
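A sketch of the two signatures contrasted above (Thing, Api, and the method names are hypothetical):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

class Thing { }

class Api {
    // Needlessly narrow: a caller holding a HashSet<Thing> must copy.
    static void processNarrow(ArrayList<Thing> things) { /* ... */ }

    // General: any Collection implementation is accepted as-is.
    static void processGeneral(Collection<Thing> things) { /* ... */ }
}

class Client {
    public static void main(String[] args) {
        HashSet<Thing> mine = new HashSet<>();
        Api.processNarrow(new ArrayList<>(mine));  // forced copy, no benefit
        Api.processGeneral(mine);                  // no copy needed
    }
}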
An important reason to prefer interfaces for formal argument types is that it does not bind you to a particular class hierarchy. Java supports only single inheritance of implementation (class inheritance), but it supports unlimited inheritance of interface (implements).
Return types are a different question. A good rule of thumb is to prefer the most general possible argument types, and the most specific possible return types. The "most general possible" is pretty easy, and it clearly lines up with preferring interface types for formal arguments. The "most specific possible" return types is trickier, however, because it depends on just what you mean by "possible".
One reason for using interface types as your methods' declared return types is to allow you to return instances of non-public classes. Another is to preserve the flexibility to change what specific type you return without breaking dependent code. Yet another is to allow different implementations to return different types. That's just off the top of my head.
So in the case of the getPrimes() method, I would bind it to List, meaning any concrete implementation of the List interface rather than a concrete implementation such as ArrayList!?
Yes, this allows the method to later then change what List type it returns without breaking client code that uses the method.
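A small sketch of that flexibility, reusing the pseudoMathInterface example from the question (the class name is hypothetical):

import java.util.ArrayList;
import java.util.List;

class PrimeSource {
    // Declared as List<Double>, so the concrete type below can change
    // (say, to a LinkedList or an unmodifiable list) without breaking callers.
    public List<Double> getFirstHundredPrimes() {
        List<Double> primes = new ArrayList<>();
        // ... fill with the first hundred primes ...
        return primes;
    }
}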
Besides having the ability to change what object is really passed to/returned from a method without breaking code, sometimes it may be better to use an interface type as a parameter/return type to lower the visibility of fields and methods available. This would reduce overall complexity of the code that then uses that interface type object.

Compiling Java Generics with Wildcards to C++ Templates

I am trying to build a Java to C++ trans-compiler (i.e. Java code goes in, semantically "equivalent" (more or less) C++ code comes out).
Not considering garbage collection, the languages are quite similar, so the overall process works quite well already. One issue, however, is generics, which do not exist in C++. Of course, the easiest way would be to perform erasure as done by the Java compiler. However, the resulting C++ code should be nice to handle, so it would be good if I did not lose generic type information, i.e., it would be good if the C++ code would still work with List<X> instead of List. Otherwise, the C++ code would need explicit casting everywhere such generics are used. This is bug-prone and inconvenient.
So, I am trying to find a way to somehow get a better representation for generics. Of course, templates seem to be a good candidate. Although they are something completely different (metaprogramming vs. compile-time only type enhancement), they could still be useful. As long as no wildcards are used, just compiling a generic class to a template works reasonably well. However, as soon as wildcards come into play, things get really messy.
For example, consider the following java constructor of a list:
class List<T> {
    List(Collection<? extends T> c) {
        this.addAll(c);
    }
}
//Usage
Collection<String> c = ...;
List<Object> l = new List<Object>(c);
How should this be compiled? I had the idea of using a chainsaw reinterpret cast between templates. Then, the example above could be compiled like this:
template<class T>
class List {
    List(Collection<T*> c) {
        this->addAll(c);
    }
};
//Usage
Collection<String*> c = ...;
List<Object*>* l = new List<Object*>(reinterpret_cast<Collection<Object*>&>(c));
However, the question is whether this reinterpret_cast produces the expected behaviour. Of course, it is dirty. But will it work? Usually, List<Object*> and List<String*> should have the same memory layout, as their template parameter is only a pointer. But is this guaranteed?
Another solution I thought of would be replacing methods using wildcards with template methods which instantiate each wildcard parameter, i.e., compiling the constructor to:
template<class T>
class List {
    template<class S>
    List(Collection<S*> c) {
        this->addAll(c);
    }
};
Of course, all other methods involving wildcards, like addAll, would then also need template parameters. Another problem with this approach would be handling wildcards in class fields, for example; I cannot use templates there.
A third approach would be a hybrid one: A generic class is compiled to a template class (call it T<X>) and an erased class (call it E). The template class T<X> inherits from the erased class E so it is always possible to drop genericity by upcasting to E. Then, all methods containing wildcards would be compiled using the erased type while others could retain the full template type.
What do you think about these methods? Where do you see the dis-/advantages of them?
Do you have any other thoughts of how wildcards could be implemented as clean as possible while keeping as much generic information in the code as possible?
Not considering garbage collection, the languages are quite similar, so the overall process works quite well already.
No. While the two languages actually look rather similar, they are significantly different as to "how things are done". Such 1:1 trans-compilations as you are attempting will result in terrible, underperforming, and most likely faulty C++ code, especially if you are looking not at a stand-alone application, but at something that might interface with "normal", manually-written C++.
C++ requires a completely different programming style from Java. This begins with not having all types derive from Object, touches on avoiding new unless absolutely necessary (and then restricting it to constructors as much as possible, with the corresponding delete in the destructor - or better yet, follow Potatoswatter's advice below), and doesn't end at "patterns" like making your containers STL-compliant and passing begin- and end-iterators to another container's constructor instead of the whole container. I also didn't see const-correctness or pass-by-reference semantics in your code.
Note how many of the early Java "benchmarks" claimed that Java was faster than C++, because Java evangelists took Java code and translated it to C++ 1:1, just like you are planning to do. There is nothing to be won by such transcompilation.
An approach you haven't discussed is to handle generic wildcards with a wrapper class template. So, when you see Collection<? extends T>, you replace it with an instantiation of your template that exposes a read-only[*] interface like Collection<T> but wraps an instance of Collection<?>. Then you do your type erasure in this wrapper (and others like it), which means the resulting C++ is reasonably nice to handle.
Your chainsaw reinterpret_cast is not guaranteed to work. For instance if there's multiple inheritance in String, then it's not even possible in general to type-pun a String* as an Object*, because the conversion from String* to Object* might involve applying an offset to the address (more than that, with virtual base classes)[**]. I expect you'll use multiple inheritance in your C++-from-Java code, for interfaces. OK, so they'll have no data members, but they will have virtual functions, and C++ makes no special allowance for what you want. I think with standard-layout classes you could probably reinterpret the pointers themselves, but (a) that's too strong a condition for you, and (b) it still doesn't mean you can reinterpret the collection.
[*] Or whatever. I forget the details of how the wildcards work in Java, but whatever's supposed to happen when you try to add a T to a List<? extends T>, and the T turns out not to be an instance of ?, do that :-) The tricky part is auto-generating the wrapper for any given generic class or interface.
[**] And because strict aliasing forbids it.
If the goal is to represent Java semantics in C++, then do so in the most direct way. Do not use reinterpret_cast as its purpose is to defeat the native semantics of C++. (And doing so between high-level types almost always results in a program that is allowed to crash.)
You should be using reference counting, or a similar mechanism such as a custom garbage collector (although that sounds unlikely under the circumstances). So these objects will all go to the heap anyway.
Put the generic List object on the heap, and use a separate class to access that as a List<String> or whatever. This way, the persistent object has the generic type that can handle any ill-formed means of accessing it that Java can express. The accessor class contains just a pointer, which you already have for reference counting (i.e. it subclasses the "native" reference, not an Object for the heap), and exposes the appropriately downcasted interface. You might even be able to generate the template for the accessor using the generics source code. If you really want to try.

Why is String.length() a method?

If a String object is immutable (and thus obviously cannot change its length), why is length() a method, as opposed to simply being a public final int length field, as with an array?
Is it simply a getter method, or does it make some sort of calculation?
Just trying to see the logic behind this.
Java is a standard, not just an implementation. Different vendors can license and implement Java differently, as long as they adhere to the standard. By making the standard call for a field, that limits the implementation quite severely, for no good reason.
Also, a method is much more flexible with regard to the future of a class. Except in some very early Java classes, it is almost never done to expose a per-instance constant as a field rather than as a method.
The length() method well predates the CharSequence interface, probably from its first version. Look how well that worked out. Years later, without any loss of backwards compatibility, the CharSequence interface was introduced and fit in nicely. This would not have been possible with a field.
So let's really invert the question (which is what you should do when you design a class intended to remain unchanged for decades): what would a field gain here? Why not simply make it a method?
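To make the CharSequence point above concrete: because length() is a method, String could retrofit the interface years later, which a public field could never satisfy.

public class LengthDemo {
    public static void main(String[] args) {
        CharSequence fromString  = "hello";
        CharSequence fromBuilder = new StringBuilder("hi");
        System.out.println(fromString.length());   // 5, via String
        System.out.println(fromBuilder.length());  // 2, via StringBuilder
    }
}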
This is a fundamental tenet of encapsulation.
Part of encapsulation is that the class should hide its implementation from its interface (in the "design by contract" sense of an interface, not in the Java keyword sense).
What you want is the String's length -- you shouldn't care if this is cached, calculated, delegates to some other field, etc. If the JDK people want to change the implementation down the road, they should be able to do so without you having to recompile.
Perhaps a .length() method was considered more consistent with the corresponding method for a StringBuffer, which would obviously need more than a final member variable.
The String class was probably one of the very first classes defined for Java, ever. It's possible (and this is just speculation) that the implementation used a .length() method before final member variables even existed. It wouldn't take very long before the use of the method was well-embedded into the body of Java code existing at the time.
Perhaps because length() comes from the CharSequence interface. A method is a more sensible abstraction than a variable if it's going to have multiple implementations.
You should always use accessor methods in public classes rather than public fields, regardless of whether they are final or not (see Item 14 in Effective Java).
When you allow a field to be accessed directly (i.e. is public) you lose the benefit of encapsulation, which means you can't change the representation without changing the API (you break peoples code if you do) and you can't perform any action when the field is accessed.
Effective Java provides a really good rule of thumb:
If a class is accessible outside its package, provide accessor methods, to preserve the flexibility to change the class's internal representation. If a public class exposes its data fields, all hope of changing its representation is lost, as client code can be distributed far and wide.
Basically, it is done this way because it is good design practice to do so. It leaves room to change the implementation of String at a later stage without breaking code for everyone.
String is using encapsulation to hide its internal details from you. An immutable object is still free to have mutable internal values as long as its externally visible state doesn't change. Length could be lazily computed. I encourage you to take a look at String's source code.
Checking the source code of String in OpenJDK, it's only a getter.
But as @SteveKuo points out, this could differ depending on the implementation.
In most current JVM implementations, a substring references the char array of the original String for its content, and it needs its own start and length fields to define that content, so the length() method is used as a getter. However, this is not the only possible way to implement String.
In a different possible implementation, each String could have its own char array, and since char arrays already carry a length field with the correct value, it would be redundant to store another one in the String object. Because String.length() is a method, we don't have to: we can just return the internal array.length.
These are two possible implementations of String, both with their own good and bad parts and they can replace each other because the length() method hides where the length is stored (internal array or in own field).
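Sketches of those two strategies (simplified stand-ins, not the real JDK code); note that length() hides which one is in use:

// Strategy 1: a substring shares the parent's char[], so it needs its own
// offset and count fields.
final class SharedString {
    private final char[] value;  // shared with the parent string
    private final int offset;
    private final int count;
    SharedString(char[] value, int offset, int count) {
        this.value = value; this.offset = offset; this.count = count;
    }
    public int length() { return count; }         // getter over a field
}

// Strategy 2: each string owns an exactly-sized array; its length suffices.
final class OwnedString {
    private final char[] value;
    OwnedString(char[] source) { this.value = source.clone(); }
    public int length() { return value.length; }  // delegates to the array
}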
