How should I choose where to store an object in C++?


Possible duplicate
Proper stack and heap usage in C++?
I'm beginning to learn C++ from a Java background, and one big difference is the fact that I'm no longer forced to:
dynamically allocate memory for objects
always use pointers to handle objects
as is the case in Java. But I'm confused as to when I should be doing what - can you advise?
Currently I'm tempted to start out doing everything Java-style like
Thing *thing = new Thing();
thing->whatever();
// etc etc

Don't use pointers unless you know why you need them. If you only need an object for a while, allocate it on the stack:
Object object;
object.Method();
If you need to pass an object to a function use references:
int doStuff( Object& object )
{
object.Method();
return 0;
}
Only use pointers if you need
graph-like complex data structures or
arrays of different object types or
returning a newly created object from a function or
in situations when you sometimes need to specify that "there's no object" - then you use a null pointer.
If you use pointers, you need to deallocate objects when they are no longer needed and before the last pointer to the object becomes unreachable, since C++ has no built-in garbage collection. To simplify this, use smart pointers like std::unique_ptr or std::shared_ptr (std::auto_ptr was deprecated in C++11 and removed in C++17; boost::shared_ptr is the pre-C++11 counterpart of std::shared_ptr).

That's bad. You're bound to forget to free it, and even if you're determined not to, you'd have to handle exceptions, because the object won't be freed automatically during stack unwinding. Use shared_ptr at the very least.
shared_ptr<Thing> thing( new Thing() );
thing->whatever();
But it actually depends on the object size and the scope. If you're going to use it in one function and the object is not oversized, I'd suggest allocating it on the stack.
Thing thing;
thing.whatever();
But the good thing is that you can decide whenever you want to allocate a new object ;-)

Do not use the new operator if you can avoid it; that way lie memory leaks and headaches remembering your object lifetimes.
The C++ way is to use stack-based objects that clean up after themselves when they leave scope (copies you made of them live on independently). This technique (called RAII) is a very powerful one where each object looks after itself, somewhat like how the GC looks after your memory for you in Java, but with the huge advantage of cleaning up as it goes along in a deterministic way (i.e. you know exactly when it will get cleaned up).
However, if you prefer your way of handling objects, use a shared_ptr, which can give you the same semantics. Typically you'd use a shared_ptr only for very expensive objects or ones that are copied a lot.
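To make the deterministic cleanup concrete, here is a minimal sketch. Tracer and runScope are hypothetical names made up for this illustration; the point is only that the destructors of stack objects run in reverse order of construction, even when the scope is left through an exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: records when it is created and destroyed.
class Tracer {
public:
    Tracer(std::string name, std::vector<std::string>& log)
        : name_(std::move(name)), log_(log) {
        log_.push_back("ctor " + name_);
    }
    ~Tracer() { log_.push_back("dtor " + name_); }

    // Copying is disabled so each Tracer logs exactly one ctor/dtor pair.
    Tracer(const Tracer&) = delete;
    Tracer& operator=(const Tracer&) = delete;

private:
    std::string name_;
    std::vector<std::string>& log_;
};

// Leaves a scope through an exception and returns the recorded events.
std::vector<std::string> runScope() {
    std::vector<std::string> log;
    try {
        Tracer a("a", log);
        Tracer b("b", log);
        throw std::runtime_error("leaving the scope early");
    } catch (const std::runtime_error&) {
        // Both Tracers have already been destroyed by stack unwinding.
        log.push_back("caught");
    }
    return log;
}
```

Calling runScope() should yield the events "ctor a", "ctor b", "dtor b", "dtor a", "caught": no explicit cleanup code was needed on the error path.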

One situation where you might need to allocate an instance on the heap is when it is only known at run-time which instance will be created in the first place (common with OOP):
Animal* animal = 0;
if (rand() % 2 == 0)
animal = new Dog("Lassie");
else
animal = new Monkey("Cheetah");
Another situation where you might need that is when you have a non-copyable class whose instances you have to store in a standard container (which, before C++11, required that its contents be copyable). A variation of that is where you might want to store pointers to objects that are expensive to copy (this decision shouldn't be made offhand, though).
In all cases, using smart pointers like shared_ptr and unique_ptr (both in the standard library since C++11) is preferable, as they manage the object's lifetime for you.
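As a sketch of the run-time choice above combined with a smart pointer: the Animal, Dog and Monkey definitions and the makeAnimal function below are assumptions made for illustration, since the original snippet does not show them.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical class hierarchy for the example.
struct Animal {
    explicit Animal(std::string name) : name(std::move(name)) {}
    virtual ~Animal() = default;  // virtual, because objects are deleted via Animal*
    virtual std::string sound() const = 0;
    std::string name;
};

struct Dog : Animal {
    using Animal::Animal;
    std::string sound() const override { return "Woof"; }
};

struct Monkey : Animal {
    using Animal::Animal;
    std::string sound() const override { return "Ook"; }
};

// The concrete type is only known at run time, so the object lives on the
// heap; the unique_ptr deletes it when the owner goes out of scope.
std::unique_ptr<Animal> makeAnimal(bool wantDog) {
    if (wantDog) {
        return std::unique_ptr<Animal>(new Dog("Lassie"));
    }
    return std::unique_ptr<Animal>(new Monkey("Cheetah"));
}
```

A caller would write something like `std::unique_ptr<Animal> animal = makeAnimal(rand() % 2 == 0);` and never call delete itself.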

Related

Java software design - Looping, object creation VS modifying variables. Memory, performance & reliability comparison

Let's say we are trying to build a document scanner class in Java that takes 1 input argument, the log path (e.g. C:\document\text1.txt). Which of the following implementations would you prefer based on performance/memory/modularity?
ArrayList<String> fileListArray = new ArrayList<String>();
fileListArray.add("C:\\document\\text1.txt");
fileListArray.add("C:\\document\\text2.txt");
.
.
.
//Implementation A
for(int i =0, j = fileListArray.size(); i < j; i++){
MyDocumentScanner ds = new MyDocumentScanner(fileListArray.get(i));
ds.scanDocument();
ds.resultOutput();
}
//Implementation B
MyDocumentScanner ds = new MyDocumentScanner();
for(int i=0, j=fileListArray.size(); i < j; i++){
ds.setDocPath(fileListArray.get(i));
ds.scanDocument();
ds.resultOutput();
}
Personally I would prefer A due to its encapsulation, but it seems like more memory usage due to creation of multiple instances. I'm curious if there is an answer to this, or it is another "that depends on the situation/circumstances" dilemma?

Although this is obviously opinion-based, I will try to give my opinion.
Your approach A is far better. Your document scanner obviously handles a file. That should be set at construction time and saved in an instance field, so every method can refer to this field. Moreover, the constructor can do some checks on the file reference (null check, existence, ...).
Your approach B has two very serious disadvantages:
After constructing a document scanner, clients could easily call all of the methods. If no file was set before, you must handle that "illegal state" with maybe an IllegalStateException. Thus, this approach increases code and complexity of that class.
There seems to be a series of method calls that a client should or can perform. It's easy to call the file setting method again in the middle of such a series with a completely different file, breaking the whole scan facility. To avoid this, your setter (for the file) should remember whether a file was already set. And that nearly automatically leads to approach A.
Regarding the creation of objects: Modern JVMs are really very fast at creating objects. Usually, there is no measurable performance overhead for that. The processing time (here: the scan) usually is much higher.

If you don't need multiple instances of DocumentScanner to co-exist, I see no point in creating a new instance in each iteration of the loop. It just creates work for the garbage collector, which has to free each of those instances.
If the length of the array is small, it doesn't make much difference which implementation you choose, but for large arrays, implementation B is more efficient, both in terms of memory (fewer instances created that the GC hasn't freed yet) and CPU (less work for the GC).

Are you implementing DocumentScanner or using an existing class?
If the latter, and it was designed for being able to parse multiple documents in a row, you can just reuse the object as in variant B.
However, if you are designing DocumentScanner, I would recommend designing it so that it handles a single document and does not even have a setDocPath method. This leads to less mutable state in that class and thus makes its design much easier. Using an instance of the class also becomes less error-prone.
As for performance, there won't be a measurable difference unless instantiating a DocumentScanner is doing a lot of work (like instantiating many other objects, too). Instantiating and freeing objects in Java is pretty cheap if they are used only for a short time due to the generational garbage collector.

Helping the JVM with stack allocation by using separate objects

I have a bottleneck method which attempts to add points (as x-y pairs) to a HashSet. The common case is that the set already contains the point in which case nothing happens. Should I use a separate point for adding from the one I use for checking if the set already contains it? It seems this would allow the JVM to allocate the checking-point on stack. Thus in the common case, this will require no heap allocation.
Ex. I'm considering changing
HashSet<Point> set;
public void addPoint(int x, int y) {
if(set.add(new Point(x,y))) {
//Do some stuff
}
}
to
HashSet<Point> set;
public void addPoint(int x, int y){
if(!set.contains(new Point(x,y))) {
set.add(new Point(x,y));
//Do some stuff
}
}
Is there a profiler which will tell me whether objects are allocated on heap or stack?
EDIT: To clarify why I think the second might be faster: in the first case the object may or may not be added to the collection, so it may escape and the allocation cannot be optimized away. In the second case, the first object allocated is clearly non-escaping, so it can be optimized by the JVM and put on the stack. The second allocation only occurs in the rare case where the point is not already contained.

Marko Topolnik properly answered your question; the space allocated for the first new Point may or may not be immediately freed and it is probably foolish to bank on it happening. But I want to expand on why you're currently in a deep state of sin:
You're trying to optimise this the wrong way.
You've identified object creation to be the bottleneck here. I'm going to assume that you're right about this. You're hoping that, if you create fewer objects, the code will run faster. That might be true, but it will never run very fast as you've designed it.
Every object in Java has a pretty fat header (typically 16 bytes on a 64-bit HotSpot JVM without compressed class pointers: an 8-byte "mark word" full of bit fields and an 8-byte pointer to the class type) and, depending on what's happened in your program thus far, possibly another pretty fat trailer. Your HashSet isn't storing just the contents of your objects; it's storing pointers to those fat-headers-followed-by-contents. (Actually, it's storing pointers to Entry objects that themselves store pointers to Points. Two levels of indirection there.)
A HashSet lookup, then, figures out which bucket it needs to look at and then chases one pointer per thing in the bucket to do the comparison. (As one great big chain in series.) There probably aren't very many of these objects, but they almost certainly aren't stored close together, making your cache angry. Note that object allocation in Java is extremely cheap---you just increment a pointer---and that the pointer chasing is quite probably a bigger source of slowness.
Java doesn't provide any abstraction like C++'s templates, so the only real way to make this fast and still provide the Set abstraction is to copy HashSet's code, change all of the data structures to represent your objects inline, modify the methods to work with the new data structures, and, if you're still worried, make copies of the relevant methods that take a list of parameters corresponding to object contents (i.e. contains(int, int)) that do the right thing without constructing a new object.
This approach is error-prone and time-consuming, but it's necessary unfortunately often when working on Java projects where performance matters. Take a look at the Trove library Marko mentioned and see if you can use it instead; Trove did exactly this for the primitive types.
With that out of the way, a monomorphic call site is one where only one method is called. Hotspot aggressively inlines calls from monomorphic call sites. You'll notice that HashSet.contains punts to HashMap.containsKey. You'd better pray for HashMap.containsKey to be inlined since you need the hashCode call and equals calls inside to be monomorphic. You can verify that your code is being compiled nicely by using the -XX:+PrintAssembly option and poring over the output, but it's probably not---and even if it is, it's probably still slow because of what a HashSet is.

As soon as you have written new Point(x,y), you are creating a new object. It may happen not to be placed on the heap, but that's just a bet you can lose. For example, the contains call should be inlined for the escape analysis to work, or at least it should be a monomorphic call site. All this means that you are optimizing against a quite erratic performance model.
If you want to avoid allocation the solid way, you can use Trove library's TLongHashSet and have your (int,int) pairs encoded as single long values.

Difference between new operator in C++ and new operator in java

As far as I know, the new operator does the following things: (please correct me if I am wrong.)
Allocates memory, and then returns the reference of the first block of the allocated memory. (The memory is allocated from heap, obviously.)
Initialize the object (calling constructor.)
Also the operator new[] works in similar fashion except it does this for each and every element in the array.
Can anybody tell me how both of these operators are different in C++ and Java:
In terms of their life cycle.
What if they fail to allocate memory.

In C++, T * p = new T;...
allocates enough memory for an object of type T,
constructs an object of type T in that memory, possibly initializing it, and
returns a pointer to the object. (The pointer has the same value as the address of the allocated memory for the standard new, but this needn't be the case for the array form new[].)
In case the memory allocation fails, an exception of type std::bad_alloc is thrown, no object is constructed and no memory is allocated.
In case the object constructor throws an exception, no object is (obviously) constructed, the memory is automatically released immediately, and the exception is propagated.
Otherwise a dynamically allocated object has been constructed, and the user must manually destroy the object and release the memory, typically by saying delete p;.
The actual allocation and deallocation function can be controlled in C++. If there is nothing else, a global, predefined function ::operator new() is used, but this may be replaced by the user; and if there exists a static member function T::operator new, that one will be used instead.
In Java it's fairly similar, only that the return value of new is something that can bind to a Java variable of type T (or a base thereof, such as Object), and you must always have an initializer (so you'd say T x = new T();). The object's lifetime is indeterminate, but guaranteed to be at least as long as any variables still refer to the object, and there is no way to (nor any need to) destroy the object manually. Java has no explicit notion of memory, and you cannot control the internals of the allocation.
Furthermore, C++ allows lots of different forms of new expressions (so-called placement forms). They all create dynamic-storage objects which must be destroyed manually, but they can be fairly arbitrary. To my knowledge Java has no such facilities.
The biggest difference is probably in use: In Java, you use new all the time for everything, and you have to, since it's the one and only way to create (class-type) objects. By contrast, in C++ you should almost never have naked news in user code. C++ has unconstrained variables, and so variables themselves can be objects, and that is how objects are usually used in C++.
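Two of the points above, the class-specific T::operator new and the automatic release of memory when a constructor throws, can be observed with a small sketch. Tracked and its counters are hypothetical and exist only to make the calls visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

// Hypothetical class that counts calls to its own allocation functions.
struct Tracked {
    static int allocations;
    static int deallocations;

    explicit Tracked(bool failInConstructor) {
        if (failInConstructor) {
            throw std::runtime_error("constructor failed");
        }
    }

    // Class-specific allocation function: used by `new Tracked(...)`
    // instead of the global ::operator new.
    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);
        if (p == nullptr) throw std::bad_alloc();
        ++allocations;
        return p;
    }

    // Matching deallocation function: used by `delete`, and also called
    // automatically when the constructor throws.
    static void operator delete(void* p) noexcept {
        if (p != nullptr) ++deallocations;
        std::free(p);
    }
};

int Tracked::allocations = 0;
int Tracked::deallocations = 0;

// Returns true if construction threw. No delete is needed in that case:
// the memory has already been handed back to Tracked::operator delete.
bool constructionThrows() {
    try {
        Tracked* t = new Tracked(true);
        delete t;  // not reached
        return false;
    } catch (const std::runtime_error&) {
        return true;
    }
}
```

After one call to constructionThrows(), both counters should be 1, even though the code never executed a delete.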

In your "statement", I don't think "returns a reference to the first block of allocated memory" is quite right. new returns a pointer (to the type of the object allocated). This is subtly different from a reference, although conceptually similar.
Answers to your questions:
In C++ an object stays around in memory (see note) until it is explicitly deleted with delete or delete[]. You must use the form matching the allocation: memory from new int[1], although it is the same amount of memory as new int, cannot be released with delete, and vice versa, delete[] cannot be used for a new int. In Java, the memory gets freed by the garbage collector at some point in the future once there is "no reference to the memory".
Both throw an exception (C++ throws std::bad_alloc, Java something like OutOfMemoryError), but in C++ you can use new(std::nothrow) ..., in which case new returns NULL if there isn't enough memory available to satisfy the call.
Note: It is, as per comment, technically possible to "destroy" the object without freeing its memory. This is a rather unusual case, and not something you should do unless you are REALLY experienced with C++ and you have a VERY good reason to do so. The typical use-case for this is inside the delete operator corresponding to a placement new (where new is called with an already existing memory address to just perform the construction of the object(s)). Again, placement new is pretty much a special use of new, and not something you can expect to see much of in normal C++ code.
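A sketch of the new(std::nothrow) form follows. Real memory exhaustion is hard to trigger on demand, so this example simulates it: Scarce and its simulateOutOfMemory flag are hypothetical, and the class-specific nothrow allocation function simply returns a null pointer when the flag is set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical class whose nothrow allocation can be made to fail on demand.
struct Scarce {
    static bool simulateOutOfMemory;
    int value = 42;

    // Non-throwing allocation function: reports failure by returning null.
    static void* operator new(std::size_t size, const std::nothrow_t&) noexcept {
        if (simulateOutOfMemory) return nullptr;
        return std::malloc(size);
    }

    // Used by an ordinary `delete` expression.
    static void operator delete(void* p) noexcept { std::free(p); }

    // Matching form, used only if a constructor throws after allocation.
    static void operator delete(void* p, const std::nothrow_t&) noexcept {
        std::free(p);
    }
};

bool Scarce::simulateOutOfMemory = false;

// Returns the stored value, or -1 if the allocation failed.
int valueOrMinusOne() {
    Scarce* s = new (std::nothrow) Scarce;
    if (s == nullptr) {
        return -1;  // no exception was thrown and no object was constructed
    }
    int v = s->value;
    delete s;
    return v;
}
```

With the flag left at false, valueOrMinusOne() returns 42; after setting Scarce::simulateOutOfMemory to true it returns -1 instead of throwing.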

I don't know about details in Java, but here is what new and new[] do in C++:
Allocate memory
When you have an expression new T or new T(args), the compiler determines which function to call for getting memory
If the type T has an appropriate member operator new that one is called
Otherwise, if the user provided an appropriate global operator new that one is called.
If operator new cannot allocate the requested memory, it calls the new handler function, if one has been installed with std::set_new_handler. That function may free some space so the allocation can succeed, it may terminate the program, or it may throw an exception of type std::bad_alloc or derived from that. By default no handler is installed, and operator new itself throws std::bad_alloc.
The same happens for new T[n] except that operator new[] is called for memory allocation.
Construct the object or objects in the newly allocated memory.
For new T(args) the corresponding constructor of the object is called. If the constructor throws an exception, the memory is deallocated by calling the corresponding operator delete (which can be found in the same places as operator new)
For new T it depends if T is POD (i.e. a built-in type or basically a C struct/union) or not. If T is POD, nothing happens, otherwise it is treated like new T().
For new T[n] it also depends on whether T is POD. Again, PODs are not initialized. For non-PODs the default constructor is in turn called for each of the objects in order. If one object's default constructor throws, no further constructors are called, but the already constructed objects (which doesn't include the one whose constructor just threw) are destructed (i.e. have the destructor called) in reverse order. Then the memory is deallocated with the appropriate operator delete[].
Returns a pointer to the newly created object(s). Note that for new[] the pointer will likely not point to the beginning of the allocated memory because there will likely be some information about the number of allocated objects preceding the constructed objects, which is used by delete[] to figure out how many objects to destruct.
In all cases, the objects live until they are destroyed with delete ptr (for objects allocated with normal new) or delete[] ptr (for objects created with array new T[n]). Unless added with a third-party library, there's no garbage collection in C++.
Note that you also can call operator new and operator delete directly to allocate raw memory. The same is true for operator new[] and operator delete[]. However note that even for those low-level functions you may not mix the calls, e.g. by deallocating memory with operator delete that you allocated with operator new[].
You can also construct an object in allocated memory (no matter how you got that) with the so-called placement new. This is done by giving the pointer to the raw memory as argument to new, like this: new(pMem) T(args). To destruct such an explicitly constructed object, you can call the object's destructor directly, p->~T().
Placement new works by calling an operator new which takes the pointer as additional argument and just returns it. This same mechanism can also be used to provide other information to operator new overloads which take corresponding additional arguments. However while you can define corresponding operator delete, those are only used for cleaning up when an object throws an exception during construction. There's no "placement delete" syntax.
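A minimal sketch of placement new, assuming a hypothetical Widget type and a local byte buffer as the raw memory (the text above leaves open where the memory comes from):

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical type that counts how many instances are currently alive.
struct Widget {
    explicit Widget(std::string label) : label(std::move(label)) { ++alive; }
    ~Widget() { --alive; }
    std::string label;
    static int alive;
};

int Widget::alive = 0;

std::string placementDemo() {
    // Raw, suitably aligned storage; no Widget exists in it yet.
    alignas(Widget) unsigned char buffer[sizeof(Widget)];

    // Placement new: construct a Widget in the existing storage.
    Widget* w = new (buffer) Widget("built in place");
    std::string copy = w->label;

    // No delete here: the memory was not obtained from operator new.
    // Only the destructor is run, explicitly.
    w->~Widget();
    return copy;
}
```

After placementDemo() returns, Widget::alive is back to 0, and the buffer itself is released like any other automatic variable.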
One other use of the placement new syntax which is already provided by C++ is nothrow new. That one takes an additional parameter std::nothrow and differs from normal new only in that it returns a null pointer if allocation fails.
Also note that new is not the only memory management mechanism in C++. On the one hand, there are the C functions malloc and free. While usually operator new and operator new[] just call malloc, this is not guaranteed. Therefore you may not mix those forms (e.g. by calling free on a pointer pointing to memory allocated with operator new). On the other hand, STL containers handle their allocations through allocators, which are objects which manage the allocation/deallocation of objects as well as construction/destruction of objects in containers.
And finally, there are those objects whose lifetime is controlled directly by the language, namely those of static and automatic lifetime. Automatic lifetime objects are allocated by simply defining a variable of the type at local scope. They are automatically created when execution passes that line, and automatically destroyed when execution leaves the scope (including if the scope is left through an exception). Static lifetime objects are defined at global/namespace scope or at local scope using the keyword static. They are created at program startup (global/namespace scope) or when their definition line is first executed (local scope), and they live until the end of the program, when they are automatically destroyed in reverse order of construction.
Generally, automatic or static variables are to be preferred to dynamic allocation (i.e. everything you allocate with new or allocators), because there the compiler takes care of proper destruction, unlike dynamic allocation where you have to do that on your own. If you have dynamically allocated objects, it's desirable to have their lifetime managed by automatic/static objects (containers, smart pointers) for the same reason.
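The difference between a local static object and an automatic one can be sketched like this; Noisy, events and useLocalStatic are hypothetical names used only to record what happens.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Event log. Itself a local static, so it is created on first use.
std::vector<std::string>& events() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical type that logs its construction ("+") and destruction ("-").
struct Noisy {
    explicit Noisy(const char* n) : name(n) {
        events().push_back(std::string("+") + name);
    }
    ~Noisy() { events().push_back(std::string("-") + name); }
    const char* name;
};

void useLocalStatic() {
    // Static lifetime: constructed the first time execution reaches this
    // line, destroyed at the end of the program.
    static Noisy once("static");

    // Automatic lifetime: constructed on every call, destroyed when the
    // function's scope is left.
    Noisy each("automatic");
}
```

Calling useLocalStatic() twice should record "+static", "+automatic", "-automatic", "+automatic", "-automatic": the static object is constructed once, while the automatic one is created and destroyed on each call.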

You seem to have the operation of new correct in that it allocates and initializes memory.
Once the new completes successfully, you, the programmer, are responsible for deleting that memory. The best way to make sure that this happens is to never use new directly yourself, instead preferring standard containers and algorithms, and stack-based objects. But if you do need to allocate memory, the C++ idiom is to use a smart pointer like unique_ptr from C++11 or shared_ptr from boost or C++11. That makes sure that the memory is reclaimed properly.
If an allocation fails, the new call will throw an exception after cleaning up any portion of the object that has been constructed prior to the failure. You can use the (nothrow) version of new to return a null pointer instead of throwing an exception, but that places even more burden of cleanup onto the client code.

The new keyword
The new operator is somewhat similar in the two languages. The main difference is that every object and array must be allocated via new in Java. (And indeed arrays are actually objects in Java.) So whilst the following is legal in C/C++ and would allocate the array from the stack...
// C/C++ : allocate array from the stack
void myFunction() {
int x[2];
x[0] = 1;
...
}
...in Java, we would have to write the following:
// Java : have to use 'new'; JVM allocates
// memory where it chooses.
void myFunction() {
int[] x = new int[2];
...
}
ref:https://www.javamex.com/java_equivalents/new.shtml

Does passing big object references instead of small objects to methods make any difference in processing or memory consumption?

I have a coding dilemma, and I don't know if there's a pattern or practice that deals with it. Whenever I have to pass some values to a method, most times I try to pass only the needed objects, instead of passing the objects which are being composed by them.
I was discussing with a friend about how Java manages heap and memory stuff and we didn't get anywhere.
Let me give two examples:
//Example 1:
private void doSomething(String s, Car car, boolean isReal){...}
...
String s = myBigObject.getLabels().getMainName();
Car car = myBigObject.getCar();
boolean isReal = myBigObject.isRealCar();
doSomething(s, car, isReal);
//Example 2 - keeping in mind that BigObject is a really big object and I'll only use those 3 attributes:
private void doSomething(BigObject bigObject){...}
...
doSomething(myBigObject);
In the 2nd example, it seems to me memory will be kind of wasted, passing a big object without really needing it.

Since Java passes only references to objects (and copies them, making it technically pass-by-value), there is no memory overhead for passing "big objects". Your Example 1 actually uses a little more memory.
However, there may still be good reason to do it that way: it removes a dependency and allows you to call doSomething on data that is not part of a BigObject. This may or may not be an advantage. If it gets called a lot with BigObject parameters, you'd have a lot of duplicate code extracting those values, which would not be good.
Note also that you don't have to assign return values to a local variable to pass them. You can also do it like this:
doSomething(myBigObject.getLabels().getMainName(),
myBigObject.getCar(),
myBigObject.isRealCar());

You're already only passing a reference to BigObject, not a full copy of BigObject. Java passes references by value.
Arguably, you're spending more memory the first way, not less, since you're now passing two references and a boolean instead of a single reference.

Java uses pass by value. Whenever we pass an object to a method, keep in mind that we do not pass all the values stored inside the object; we just pass the bits (something like ab06789c) that are the value of the address at which the object is stored in heap memory. So you are using more memory in the first case than in the second. Refer to JAVA pass-by-reference or pass-by-memory

All references are the same size, so how could it use more memory? It doesn't.

Are arrays of 'structs' theoretically possible in Java?

There are cases when one needs a memory-efficient way to store lots of objects. To do that in Java you are forced to use several primitive arrays (see below why) or a big byte array, which adds some CPU overhead for converting.
Example: you have a class Point { float x; float y;}. Now you want to store N points in an array, which would take at least N * 8 bytes for the floats and N * 4 bytes for the references on a 32-bit JVM. So at least 1/3 is garbage (not counting the normal object overhead here). But if you stored this in two float arrays, all would be fine.
My question: Why does Java not optimize the memory usage for arrays of references? I mean why not directly embed the object in the array like it is done in C++?
E.g. marking the class Point final should be sufficient for the JVM to see the maximum length of the data for the Point class. Or where would this be against the specification? Also this would save a lot of memory when handling large n-dimensional matrices etc
Update:
I would like to know whether the JVM could theoretically optimize it (e.g. behind the scenes) and under which conditions, not whether I can force the JVM somehow. I think the second point of the conclusion is the reason it cannot be done easily, if at all.
Conclusions what the JVM would need to know:
The class needs to be final to let the JVM guess the length of one array entry
The array needs to be read only. Of course you can change the values like Point p = arr[i]; p.setX(i) but you cannot write to the array via inlineArr[i] = new Point(). Or the JVM would have to introduce copy semantics which would be against the "Java way". See aroth's answer
How to initialize the array (calling the default constructor or leaving the members initialized to their default values)

Java doesn't provide a way to do this because it's not a language-level choice to make. C, C++, and the like expose ways to do this because they are system-level programming languages where you are expected to know system-level features and make decisions based on the specific architecture that you are using.
In Java, you are targeting the JVM. The JVM doesn't specify whether or not this is permissible (I'm making an assumption that this is true; I haven't combed the JLS thoroughly to prove that I'm right here). The idea is that when you write Java code, you trust the JIT to make intelligent decisions. That is where the reference types could be folded into an array or the like. So the "Java way" here would be that you cannot specify if it happens or not, but if the JIT can make that optimization and improve performance it could and should.
I am not sure whether this optimization in particular is implemented, but I do know that similar ones are: for example, objects allocated with new are conceptually on the "heap", but if the JVM notices (through a technique called escape analysis) that the object is method-local it can allocate the fields of the object on the stack or even directly in CPU registers, removing the "heap allocation" overhead entirely with no language change.
Update for updated question
If the question is "can this be done at all", I think the answer is yes. There are a few corner cases (such as null pointers) but you should be able to work around them. For null references, the JVM could convince itself that there will never be null elements, or keep a bit vector as mentioned previously. Both of these techniques would likely be predicated on escape analysis showing that the array reference never leaves the method, as I can see the bookkeeping becoming tricky if you try to e.g. store it in an object field.

The scenario you describe might save on memory (though in practice I'm not sure it would even do that), but it probably would add a fair bit of computational overhead when actually placing an object into an array. Consider that when you do new Point() the object you create is dynamically allocated on the heap. So if you allocate 100 Point instances by calling new Point() there is no guarantee that their locations will be contiguous in memory (and in fact they will most likely not be allocated to a contiguous block of memory).
So how would a Point instance actually make it into the "compressed" array? It seems to me that Java would have to explicitly copy every field in Point into the contiguous block of memory that was allocated for the array. That could become costly for object types that have many fields. Not only that, but the original Point instance is still taking up space on the heap, as well as inside of the array. So unless it gets immediately garbage-collected (I suppose any references could be rewritten to point at the copy that was placed in the array, thereby theoretically allowing immediate garbage-collection of the original instance) you're actually using more storage than you would be if you had just stored the reference in the array.
Moreover, what if you have multiple "compressed" arrays and a mutable object type? Inserting an object into an array necessarily copies that object's fields into the array. So if you do something like:
Point p = new Point(0, 0);
Point[] compressedA = {p}; //assuming 'p' is "optimally" stored as {0,0}
Point[] compressedB = {p}; //assuming 'p' is "optimally" stored as {0,0}
compressedA[0].setX(5);
compressedB[0].setX(1);
System.out.println(p.x);
System.out.println(compressedA[0].x);
System.out.println(compressedB[0].x);
...you would get:
0
5
1
...even though logically there should only be a single instance of Point. Storing references avoids this kind of problem, and also means that in any case where a nontrivial object is being shared between multiple arrays your total storage usage is probably lower than it would be if each array stored a copy of all of that object's fields.
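For comparison, C++ arrays do embed their elements, and they show exactly the copy behaviour described above. The following is a sketch with a hypothetical Point struct, not a statement about what a JVM would do:

```cpp
#include <cassert>
#include <vector>

// Hypothetical plain value type.
struct Point {
    int x;
    int y;
};

// Elements are stored inline: the array is exactly 100 Points long,
// with no per-element reference or header.
static_assert(sizeof(Point[100]) == 100 * sizeof(Point),
              "array elements are stored contiguously");

// Returns {p.x, a[0].x, b[0].x} after modifying the array elements.
std::vector<int> copyDemo() {
    Point p{0, 0};
    Point a[1] = {p};  // a[0] is a copy of p, not a reference to it
    Point b[1] = {p};  // b[0] is another, independent copy
    a[0].x = 5;
    b[0].x = 1;
    return {p.x, a[0].x, b[0].x};
}
```

copyDemo() returns 0, 5 and 1: three independent objects, which is the price of storing values instead of references.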

Isn't this tantamount to providing trivial classes such as the following?
class Fixed {
float hiddenArr[];
Point pointArray(int position) {
return new Point(hiddenArr[position*2], hiddenArr[position*2+1]);
}
}
Also, it's possible to implement this without making the programmer explicitly state that they'd like it; the JVM is already aware of "value types" (POD types in C++); ones with only other plain-old-data types inside them. I believe HotSpot uses this information during stack elision, no reason it couldn't do it for arrays too?
