Is there a use case for the size() method on the java.util.BitSet class?
I mean, the JavaDoc clearly says it's implementation dependent: it returns the size of the internal long[] storage in bits. From what it says, one might conclude that you can't set a bit with an index higher than size(), but that's not true; the BitSet can grow automatically:
BitSet myBitSet = new BitSet();
System.out.println(myBitSet.size()); // prints "64"
myBitSet.set(768);
System.out.println(myBitSet.size()); // prints "832"
In every single encounter with BitSet I have had in my life, I always wanted to use length() since that one returns the logical size of the BitSet:
BitSet myBitSet = new BitSet();
System.out.println(myBitSet.length()); // prints "0"
myBitSet.set(768);
System.out.println(myBitSet.length()); // prints "769"
Even though I have been programming Java for the last 6 years, the two methods are always highly confusing for me. I often mix them up and accidentally use the wrong one, because in my head I think of BitSet as a clever Set<Boolean>, where I'd use size().
It's like if ArrayList had length() returning the number of elements and size() returning the size of the underlying array.
Now, is there any use case for the size() method I am missing? Is it useful in any way? Has anyone ever used it for anything? Might it be important for some manual bit twiddling or something similar?
EDIT (after some more research)
I realized BitSet was introduced in Java 1.0 while the Collections framework with most of the classes we use was introduced in Java 1.2. So basically it seems to me that size() is kept because of legacy reasons and there's no real use for it. The new Collection classes don't have such methods, while some of the old ones (Vector, for example) do.
I realized BitSet was introduced in Java 1.0 while the Collections framework with most of the classes we use was introduced in Java 1.2.
Correct.
So basically it seems to me that size() is kept because of legacy reasons and there's no real use for it.
Yes, pretty much.
The other "size" method is length() which gives you the largest index at which a bit is set. From a logical perspective, length() is more useful than size() ... but length() was only introduced in Java 1.2.
The only (hypothetical) use-case I can think of where size() might be better than length() is when:
you are trying to establish a "fence post" for an iteration of the bits in the set, and
it is highly likely that you will stop iterating well before the end, and
it doesn't matter if you go a little bit beyond the last bit that is set.
In that case, size() is arguably better than length() because it is a cheaper call. (Look at the source code ...) But that's pretty marginal.
(I guess, another use-case along similar lines is when you are creating a new BitSet and preallocating it based on the size() of an existing BitSet. Again, the difference is marginal.)
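A minimal sketch of those two hypothetical uses (the variable names are made up for illustration):

BitSet bits = new BitSet();
bits.set(3);
bits.set(700);

// Fence post: size() is a trivial computation on the backing array's length,
// so it is a cheaper call than length(), which must locate the highest set
// bit. Reading a little past the last set bit simply yields false.
for (int i = 0; i < bits.size(); i++) {
    if (bits.get(i)) {
        // ... process bit i; in this scenario we would usually
        // break out of the loop well before the end
    }
}

// Preallocation: give the copy the same capacity as the original,
// so it never has to reallocate while bits are mirrored into it.
BitSet copy = new BitSet(bits.size());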
But you are right about compatibility. It is clear that they could not either get rid of size() or change its semantics without creating compatibility problems. So they presumably decided to leave it alone. (Indeed, they didn't even see the need to deprecate it. The "harm" in having a not-particularly-useful method in the API is minimal.)
If the size method hadn't been designed as public by Java's creators, it would still undoubtedly exist as a private method/field. So we are really discussing its accessibility, and perhaps its naming.
Java 1.0 took a lot of inspiration, not just the procedural syntax, from C/C++. In the C++ standard library, the counterparts to BitSet's length and size also exist: they are called size and capacity, respectively. There is rarely any hard reason to use capacity in C++, and even less so in a garbage-collected language such as Java, but having the method accessible is still arguably useful. I will explain in Java terms.
Tell me, what is the maximum number of machine instructions ever needed for executing a BitSet operation such as set? One would like to answer "just a handful", but this is only true if that particular operation does not result in reallocation of the whole underlying array. Theoretically, the reallocations turn a constant time algorithm into a linear time one.
Does this theoretical difference have much practical impact? Rarely. The array usually doesn't grow too often. However, whenever you have an algorithm operating over a gradually growing BitSet with an approximately known final size, you will save on reallocations if you pass the final size to the BitSet's constructor up front. In some very special circumstances this may even have a noticeable effect; in most circumstances it does not hurt:
set then has constant time complexity - calling it cannot ever block the application for too long.
if just one extremely large BitSet instance is using up all your available memory (by design), swapping may start noticeably later depending on how your JVM implements the growth operation (with or without an extra copy).
Now imagine that you operate on many BitSets, all of which have been allocated with a target size. You are constructing one BitSet instance from another, and you want the new one to share the old one's target size, as you know you will be using them side by side. Having the size method public makes this easier to implement cleanly.
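A minimal sketch of that capacity sharing (the names and sizes are made up):

// Preallocate to the approximately known final size, so set() never has
// to reallocate the underlying long[] while the set gradually fills up.
BitSet flags = new BitSet(1_000_000);

// Derive a second set that shares the first one's target size; size() is
// the only public way to read that capacity back out of an existing set.
BitSet shadow = new BitSet(flags.size());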
It is the number of 0s and 1s stored, which has to be a multiple of 64. You could use cardinality() for the number of 1s.
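A quick illustration of the three methods (the printed values assume the stock JDK implementation with 64-bit words):

BitSet b = new BitSet();
b.set(3);
b.set(64);
System.out.println(b.size());        // 128: storage capacity, a multiple of 64
System.out.println(b.length());      // 65:  highest set bit index, plus one
System.out.println(b.cardinality()); // 2:   how many bits are set to 1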
One of the main reasons I think it may be useful is when we need to extend the BitSet class and override the length method; in that case, size is useful. Below is how length can return a value that depends on the size method (here for a Set-backed bit set):
protected Set<Integer> bitset;

public int length() {
    int returnValue = 0;
    // Make sure the set is not empty,
    // then return the maximum value + 1
    if (bitset.size() > 0) {
        returnValue = Collections.max(bitset) + 1;
    }
    return returnValue;
}
int temp = name.get(0);
name.set(0, name.get(1));
name.set(1, temp);

versus

Collections.swap(name, 0, 1);

I want to swap two elements and don't know which is more efficient. It seems like the runtime of both swaps is the same, but I'm not too sure. Thanks!
Collections.swap is:
public static void swap(List<?> list, int i, int j) {
// instead of using a raw type here, it's possible to capture
// the wildcard but it will require a call to a supplementary
// private method
final List l = list;
l.set(i, l.set(j, l.get(i)));
}
So 2 gets and 2 sets vs. 1 get and 2 sets. Also Collections.swap nicely uses the return values from set to bypass the use of the temp variable.
I want to swap two elements and don't know which is more efficient. It seems like the runtime of both swaps are the same but I'm not too sure.
The only way to be sure is to write a proper micro-benchmark, run it (on a number of hardware platforms / Java versions) and interpret the results.
We can look at the source code, and make some informed guesses, but we cannot deduce micro-level efficiency from first principles1.
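If you do want to measure it, a minimal JMH sketch might look like this (JMH is OpenJDK's micro-benchmarking harness; the class and method names here are made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SwapBench {
    private List<Integer> name;

    @Setup(Level.Iteration)
    public void setUp() {
        name = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
    }

    @Benchmark
    public List<Integer> manualSwap() {
        int temp = name.get(0);
        name.set(0, name.get(1));
        name.set(1, temp);
        return name; // return the list so the JIT cannot eliminate the work
    }

    @Benchmark
    public List<Integer> collectionsSwap() {
        Collections.swap(name, 0, 1);
        return name;
    }
}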
My advice:
Write the code in the way that you think is most readable, and let the compilers do the optimization. They can typically do a better job than most programmers.
If performance of your application is a concern, then write an application benchmark and use a profiler to find out where the real performance hotspots are.
Use the hotspot information to decide where it is worthwhile expending effort in hand-tuning the application ... not your intuition / guesswork.
1 - ... unless there is someone here with an unhealthily detailed amount of knowledge in their heads about how real world Java JIT compilers actually work across multiple platforms. And if there is someone here like that, we should probably just let them rest quietly, rather than bugging them with questions like this :-)
While I was thinking over the memory usage of various types, I became a bit confused about how Java uses memory for integers when they are passed to a method.
Say, I had the following code:
public static void main (String[] args){
int i = 4;
addUp(i);
}
public static int addUp(int i){
if(i == 0) return 0;
else return addUp(i - 1);
}
In the following example, I am wondering if my logic is correct:
I have initially allocated memory for the integer i = 4. Then I pass it to a method. However, since primitives are not passed by reference in Java, in addUp(i == 4) I create another integer i = 4. Then afterwards there is another addUp(i == 3), addUp(i == 2), addUp(i == 1), addUp(i == 0), and each time, since the value is not passed by reference, a new i value is allocated in memory.
Then for a single "int i" value, I have used memory for 6 integer values.
However, if I were to always pass it through an array:
public static void main (String[] args){
int[] i = {4};
// int tempI = i[0];
addUp(i);
}
public static int addUp(int[] i){
if (i[0] == 0) return 0;
i[0] = i[0] - 1; // decrement in place
return addUp(i); // recurse with the same array, no new allocation
}
Since I create an integer array of size 1 and then pass that to addUp, which will again be passed along for addUp(i[0] == 3), addUp(i[0] == 2), addUp(i[0] == 1), addUp(i[0] == 0), I have only had to use one integer array's worth of memory, and hence this is far more cost-efficient. In addition, if I make an int value beforehand to store the initial value of i[0], I still have my "original" value.
Then this leads me to the question: why do people pass primitives like int to Java methods? Isn't it far more memory-efficient to just pass arrays holding those primitives? Or is the first example somehow still just O(1) memory?
And on top of this question, I also wonder about the memory differences between using int[] and int, especially for a size of 1. Thank you in advance. I was simply wondering about being more memory-efficient with Java, and this came to my head.
Thanks for all the answers! I'm just now quickly wondering if I were to "analyze" big-oh memory of each code, would they both be considered O(1) or would that be wrong to assume?
What you are missing here: the int values in your example go on the stack, not on the heap.
And it is much less overhead to deal with fixed size primitive values existing on the stack - compared to objects on the heap!
In other words: using a "pointer" means that you have to create a new object on the heap. All objects live on the heap; there is no stack for arrays! And objects become subject to garbage collection immediately after you stop using them. Stack frames, on the other hand, come and go as you invoke methods!
Beyond that: keep in mind that the abstractions that programming languages provide to us are created to help us write code that is easy to read, understand and maintain. Your approach is basically to do some sort of fine-tuning that leads to more complicated code. And that is not how Java solves such problems.
Meaning: with Java, the real "performance magic" happens at runtime, when the just-in-time compiler kicks in! You see, the JIT can inline calls to small methods when the method is invoked "often enough". And then it becomes even more important to keep data "close" together. As in: when data lives on the heap, you might have to go out to main memory to get a value, whereas items living on the stack might still be "close" (as in: in the processor cache). So your little idea to optimize memory usage could actually slow down program execution by orders of magnitude. Because even today, there are orders of magnitude between accessing the processor cache and reading main memory.
Long story short: avoid getting into such "micro-tuning" for either performance or memory usage: the JVM is optimized for the "normal, typical" use cases. Your attempts to introduce clever work-arounds can therefore easily result in "less good" results.
So, when you worry about performance: do what everybody else is doing. And if you really care, then learn how the JVM works. As it turns out, even my knowledge is slightly outdated, as the comments imply that a JIT can allocate objects on the stack. In that sense: focus on writing clean, elegant code that solves the problem in a straightforward way!
Finally: this is subject to change at some point. There are ideas to introduce true value objects to Java, which would basically live on the stack, not the heap. But don't expect that to happen before Java 10. Or 11. Or ...
Several things:
First thing, and this will be splitting hairs: when you pass an int in Java you are allocating 4 bytes onto the stack, whereas when you pass an array (because it is a reference) you are actually allocating 8 bytes (assuming an x64 architecture) onto the stack, plus the additional 4 bytes that store the int in the heap.
More importantly, the data that lives in the array is allocated on the heap, whereas the reference to the array itself is allocated onto the stack; when passing an integer there is no heap allocation required, as the primitive is only allocated onto the stack. Over time, reducing heap allocations means the garbage collector has fewer things to clean up, whereas the cleanup of stack frames is trivial and doesn't require additional processing.
However, this is all moot (imho) because in practice when you have complicated collections of variables and objects you are likely going to end up grouping them together into a class. In general, you should be writing to promote readability and maintainability rather than trying to squeeze every last drop of performance out of the JVM. The JVM is pretty quick as it is, and there is always Moore's Law as a backstop.
It would be difficult to analyze the Big-O for each, because in order to get a true picture you would have to factor in the behavior of the garbage collector, and that behavior is highly dependent on both the JVM itself and any runtime (JIT) optimizations the JVM has made to your code.
Please remember Donald Knuth's wise words: "premature optimization is the root of all evil".
Write code that avoids micro-tuning; code that promotes readability and maintainability will fare better over the long run.
If your assumption is that arguments passed to functions necessarily consume memory (which is false, by the way), then note that in your second example, which passes an array, a copy of the reference to the array is made. That reference may actually be larger than an int; it's unlikely to be smaller.
Whether these methods take O(1) or O(N) depends on the compiler. (Here N is the value of i or i[0], depending.) If the compiler uses tail-recursion optimization then the stack space for the parameters, local variables, and return address can be reused and the implementation will then be O(1) for space. Absent tail-recursion optimization the space complexity is the same as the time complexity, O(N).
Basically tail-recursion optimization amounts (in this case) to the compiler rewriting your code as
public static int addUp(int i) {
    while (i != 0) i = i - 1;
    return 0;
}
or
public static int addUp(int[] i) {
    while (i[0] != 0) i[0] = i[0] - 1;
    return 0;
}
A good optimizer might further optimize away the loops.
As far as I know, no Java compilers implement tail-recursion optimization at present, but there is no technical reason that it can't be done in many cases.
Actually, when you pass an array as a parameter to a method - a reference to this array is passed under the hood. The array itself is stored on the heap. And the reference can be 4 or 8 bytes in size (depending on CPU architecture, JVM implementation, etc.; even more, JLS doesn't say anything about how big a reference is in memory).
On the other hand, primitive int value always consumes only 4 bytes and resides on the stack.
When you pass an array, the content of the array may be modified by the method that receives the array. When you pass int primitives, those primitives may not be modified by the method that receives them. That's why sometimes you may use primitives and sometimes arrays.
Also in general, in Java programming you tend to favor readability and let this kind of memory optimizations be done by the JIT compiler.
The int array reference actually takes up more space in the stack frames than an int primitive (8 bytes vs 4). You're actually using more space.
But I think the primary reason people prefer the first way is because it's clearer and more legible.
People actually do do things a lot closer to the second when more ints are involved.
I have a bottleneck method which attempts to add points (as x-y pairs) to a HashSet. The common case is that the set already contains the point in which case nothing happens. Should I use a separate point for adding from the one I use for checking if the set already contains it? It seems this would allow the JVM to allocate the checking-point on stack. Thus in the common case, this will require no heap allocation.
Ex. I'm considering changing
HashSet<Point> set;
public void addPoint(int x, int y) {
if(set.add(new Point(x,y))) {
//Do some stuff
}
}
to
HashSet<Point> set;
public void addPoint(int x, int y){
if(!set.contains(new Point(x,y))) {
set.add(new Point(x,y));
//Do some stuff
}
}
Is there a profiler which will tell me whether objects are allocated on heap or stack?
EDIT: To clarify why I think the second might be faster, in the first case the object may or may not be added to the collection, so it's not non-escaping and cannot be optimized. In the second case, the first object allocated is clearly non-escaping so it can be optimized by the JVM and put on stack. The second allocation only occurs in the rare case where it's not already contained.
Marko Topolnik properly answered your question; the space allocated for the first new Point may or may not be immediately freed and it is probably foolish to bank on it happening. But I want to expand on why you're currently in a deep state of sin:
You're trying to optimise this the wrong way.
You've identified object creation to be the bottleneck here. I'm going to assume that you're right about this. You're hoping that, if you create fewer objects, the code will run faster. That might be true, but it will never run very fast as you've designed it.
Every object in Java has a pretty fat header (16 bytes; an 8-byte "mark word" full of bit fields and an 8-byte pointer to the class type) and, depending on what's happened in your program thus far, possibly another pretty fat trailer. Your HashSet isn't storing just the contents of your objects; it's storing pointers to those fat-headers-followed-by-contents. (Actually, it's storing pointers to Entry classes that themselves store pointers to Points. Two levels of indirection there.)
A HashSet lookup, then, figures out which bucket it needs to look at and then chases one pointer per thing in the bucket to do the comparison. (As one great big chain in series.) There probably aren't very many of these objects, but they almost certainly aren't stored close together, making your cache angry. Note that object allocation in Java is extremely cheap (you just increment a pointer), and that this pointer chasing is quite probably a bigger source of slowness than the allocation.
Java doesn't provide any abstraction like C++'s templates, so the only real way to make this fast and still provide the Set abstraction is to copy HashSet's code, change all of the data structures to represent your objects inline, modify the methods to work with the new data structures, and, if you're still worried, make copies of the relevant methods that take a list of parameters corresponding to object contents (i.e. contains(int, int)) that do the right thing without constructing a new object.
This approach is error-prone and time-consuming, but it's unfortunately often necessary when working on Java projects where performance matters. Take a look at the Trove library Marko mentioned and see if you can use it instead; Trove did exactly this for the primitive types.
With that out of the way, a monomorphic call site is one where only one method is called. Hotspot aggressively inlines calls from monomorphic call sites. You'll notice that HashSet.contains punts to HashMap.containsKey. You'd better pray for HashMap.containsKey to be inlined since you need the hashCode call and equals calls inside to be monomorphic. You can verify that your code is being compiled nicely by using the -XX:+PrintAssembly option and poring over the output, but it's probably not---and even if it is, it's probably still slow because of what a HashSet is.
As soon as you have written new Point(x,y), you are creating a new object. It may happen not to be placed on the heap, but that's just a bet you can lose. For example, the contains call should be inlined for the escape analysis to work, or at least it should be a monomorphic call site. All this means that you are optimizing against a quite erratic performance model.
If you want to avoid allocation the solid way, you can use Trove library's TLongHashSet and have your (int,int) pairs encoded as single long values.
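A minimal sketch of that encoding, assuming Trove 3's gnu.trove.set.hash.TLongHashSet (the PointSet wrapper is made up for illustration):

import gnu.trove.set.hash.TLongHashSet;

class PointSet {
    private final TLongHashSet set = new TLongHashSet();

    // Pack the pair into one long: x in the high 32 bits, y in the low 32.
    // The mask keeps a negative y from smearing sign bits over x.
    private static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    // true only if the point was not already present; no Point
    // object is ever allocated, on the common path or otherwise
    boolean addPoint(int x, int y) {
        return set.add(key(x, y));
    }
}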
There are cases when one needs a memory-efficient way to store lots of objects. To do that in Java you are forced to use several primitive arrays (see below why) or one big byte array, which costs a bit of CPU overhead for the conversions.
Example: you have a class Point { float x; float y; }. Now you want to store N points in an array, which would take at least N * 8 bytes for the floats and N * 4 bytes for the references on a 32-bit JVM. So at least 1/3 is garbage (not counting the normal object overhead here). But if you stored this in two float arrays, all would be fine.
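A minimal sketch of that two-float-array layout (the PointStore name is made up):

// Structure-of-arrays layout: N points cost exactly 2 * N floats,
// with no per-point object header and no array of references.
class PointStore {
    final float[] xs;
    final float[] ys;

    PointStore(int n) {
        xs = new float[n];
        ys = new float[n];
    }

    void set(int i, float x, float y) {
        xs[i] = x;
        ys[i] = y;
    }

    float x(int i) { return xs[i]; }
    float y(int i) { return ys[i]; }
}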
My question: Why does Java not optimize the memory usage for arrays of references? I mean why not directly embed the object in the array like it is done in C++?
E.g. marking the class Point final should be sufficient for the JVM to know the maximum length of the data for the Point class. Or where would this be against the specification? Also, this would save a lot of memory when handling large n-dimensional matrices etc.
Update:
I would like to know whether the JVM could theoretically optimize it (e.g. behind the scenes) and under which conditions, not whether I can force the JVM to do it somehow. I think the second point of the conclusions is the reason it cannot be done easily, if at all.
Conclusions about what the JVM would need to know:
The class needs to be final to let the JVM guess the length of one array entry
The array needs to be read only. Of course you can change the values like Point p = arr[i]; p.setX(i) but you cannot write to the array via inlineArr[i] = new Point(). Or the JVM would have to introduce copy semantics which would be against the "Java way". See aroth's answer
How to initialize the array (calling the default constructor or leaving the members initialized to their default values)
Java doesn't provide a way to do this because it's not a language-level choice to make. C, C++, and the like expose ways to do this because they are system-level programming languages where you are expected to know system-level features and make decisions based on the specific architecture that you are using.
In Java, you are targeting the JVM. The JVM doesn't specify whether or not this is permissible (I'm making an assumption that this is true; I haven't combed the JLS thoroughly to prove that I'm right here). The idea is that when you write Java code, you trust the JIT to make intelligent decisions. That is where the reference types could be folded into an array or the like. So the "Java way" here would be that you cannot specify if it happens or not, but if the JIT can make that optimization and improve performance it could and should.
I am not sure whether this optimization in particular is implemented, but I do know that similar ones are: for example, objects allocated with new are conceptually on the "heap", but if the JVM notices (through a technique called escape analysis) that the object is method-local it can allocate the fields of the object on the stack or even directly in CPU registers, removing the "heap allocation" overhead entirely with no language change.
Update for updated question
If the question is "can this be done at all", I think the answer is yes. There are a few corner cases (such as null pointers) but you should be able to work around them. For null references, the JVM could convince itself that there will never be null elements, or keep a bit vector as mentioned previously. Both of these techniques would likely be predicated on escape analysis showing that the array reference never leaves the method, as I can see the bookkeeping becoming tricky if you try to e.g. store it in an object field.
The scenario you describe might save on memory (though in practice I'm not sure it would even do that), but it probably would add a fair bit of computational overhead when actually placing an object into an array. Consider that when you do new Point() the object you create is dynamically allocated on the heap. So if you allocate 100 Point instances by calling new Point() there is no guarantee that their locations will be contiguous in memory (and in fact they will most likely not be allocated to a contiguous block of memory).
So how would a Point instance actually make it into the "compressed" array? It seems to me that Java would have to explicitly copy every field in Point into the contiguous block of memory that was allocated for the array. That could become costly for object types that have many fields. Not only that, but the original Point instance is still taking up space on the heap, as well as inside of the array. So unless it gets immediately garbage-collected (I suppose any references could be rewritten to point at the copy that was placed in the array, thereby theoretically allowing immediate garbage-collection of the original instance) you're actually using more storage than you would be if you had just stored the reference in the array.
Moreover, what if you have multiple "compressed" arrays and a mutable object type? Inserting an object into an array necessarily copies that object's fields into the array. So if you do something like:
Point p = new Point(0, 0);
Point[] compressedA = {p}; //assuming 'p' is "optimally" stored as {0,0}
Point[] compressedB = {p}; //assuming 'p' is "optimally" stored as {0,0}
compressedA[0].setX(5);
compressedB[0].setX(1);
System.out.println(p.x);
System.out.println(compressedA[0].x);
System.out.println(compressedB[0].x);
...you would get:
0
5
1
...even though logically there should only be a single instance of Point. Storing references avoids this kind of problem, and also means that in any case where a nontrivial object is being shared between multiple arrays your total storage usage is probably lower than it would be if each array stored a copy of all of that object's fields.
Isn't this tantamount to providing trivial classes such as the following?
class Fixed {
    float[] hiddenArr;

    Point pointArray(int position) {
        return new Point(hiddenArr[position * 2], hiddenArr[position * 2 + 1]);
    }
}
Also, it's possible to implement this without making the programmer explicitly state that they'd like it; the JVM is already aware of "value types" (POD types in C++): ones with only other plain-old-data types inside them. I believe HotSpot uses this information during stack elision; no reason it couldn't do it for arrays too?
Is there any justifiable reason in Java to write something like
Long l = new Long(SOME_CONSTANT)
This creates an extra object and is flagged by FindBugs, and is obviously bad practice. My question is whether there is ever a good reason to do so.
I previously asked this about String constructors and got a good answer, but that answer doesn't seem to apply to numbers.
Only if you want to make sure you get a unique instance, so practically never.
Some numbers can be cached when autoboxed (although Longs aren't guaranteed to be), which might cause problems. But any code that would break because of caching probably has deeper issues. Right now, I can't think of a single valid case for it.
My question is whether there is ever a good reason to do so?
You might still use it if you want to write code compatible with older JREs. valueOf(long) was only introduced in Java 1.5, so in Java 1.4 and before the constructor was the only way to go directly from a long to a Long. I expect it isn't deprecated because the constructor is still used internally.
The only thing I can think of is to make the boxing explicit, although the equivalent autoboxed code is actually compiled into Long.valueOf(SOME_CONSTANT), which can cache small values (from the JDK source):
public static Long valueOf(long l) {
final int offset = 128;
if (l >= -128 && l <= 127) { // will cache
return LongCache.cache[(int)l + offset];
}
return new Long(l);
}
Not a big deal, but I dislike seeing code that continually boxes and unboxes without regard for type, which can get sloppy.
Functionally, though, I can't see a difference one way or the other. The new Long will still be equal (by equals) to the autoboxed one and have the same hashCode, so I can't see how you could even make a functional distinction if you wanted to.
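A small illustration of the only observable difference, instance identity (this assumes a standard JDK, where valueOf caches values in -128..127):

Long a = new Long(127L);     // always a fresh instance
Long b = Long.valueOf(127L); // what autoboxing compiles to
Long c = Long.valueOf(127L);

System.out.println(a.equals(b)); // true:  same value
System.out.println(a == b);      // false: distinct instances
System.out.println(b == c);      // true:  both came from LongCache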