I am reading some material about garbage collection in Java to understand more deeply what really happens during the GC process.
I came across a mechanism called the "card table". I've Googled it and haven't found comprehensive information. Most explanations are rather shallow and describe it as if it were magic.
My question is: how do the card table and write barrier work? What is marked in the card table? And how does the garbage collector then know that a particular object is referenced by another object residing in the older generation?
I would like a practical picture of the mechanism, as if I had to write a simulation of it.
I don't know whether you found some exceptionally bad description or whether you expect too many details; I've been quite satisfied with the explanations I've seen. If the descriptions are brief and sound simplistic, that's because it really is a rather simple mechanism.
As you apparently already know, a generational garbage collector needs to be able to enumerate old objects that refer to young objects. It would be correct to scan all old objects, but that destroys the advantages of the generational approach, so you have to narrow it down. Regardless of how you do that, you need a write barrier - a piece of code executed whenever a member variable (of a reference type) is assigned/written to. If the new reference points to a young object and it's stored in an old object, the write barrier records that fact for the garbage collector. The difference lies in how it's recorded. There are exact schemes using so-called remembered sets, a collection of every old object that has (or had at some point) a reference to a young object. As you can imagine, this takes quite a bit of space.
The card table is a trade-off: instead of telling you exactly which objects contain young pointers (or at least did at some point), it groups objects into fixed-size buckets and tracks which buckets contain objects with young pointers. This, of course, reduces space usage. For correctness, it doesn't really matter how you bucket the objects, as long as you're consistent about it. For efficiency, you just group them by their memory address (because you have that available for free), divided by some larger power of two, say 2^K (to make the division a cheap bitwise operation).
Also, instead of maintaining an explicit list of buckets, you reserve some space for each possible bucket up front. Specifically, there is an array of N bits or bytes, where N is the number of buckets, so that the ith value is 0 if the ith bucket contains no young pointers, or 1 if it does contain young pointers. This is the card table proper. Typically this space is allocated and freed along with a large block of memory used as (part of) the heap. It may even be embedded in the start of the memory block, if it doesn't need to grow. Unless the entire address space is used as heap (which is very rare), the formula above (address >> K) gives numbers starting from start_of_memory_region >> K instead of 0, so to get an index into the card table you have to subtract the start address of the heap.
In summary, when the write barrier finds that the statement some_obj.field = other_obj; stores a young pointer in an old object, it does this:
card_table[(&old_obj - start_of_heap) >> K] = 1;
Where &old_obj is the address of the object that now has a young pointer (which is already in a register because it was just determined to refer to an old object).
During minor GC, the garbage collector looks at the card table to determine which heap regions to scan for young pointers:
for i from 0 to (heap_size >> K):
    if card_table[i]:
        scan heap[i << K .. (i + 1) << K] for young pointers
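Since the question asks for something close to a simulation, here is a minimal Java sketch of the two snippets above. It is only a toy model under assumptions that are not part of the explanation: addresses are plain long values rather than real heap pointers, K is fixed at 9 (512-byte cards), and the names CardTableSketch, writeBarrier and dirtyRanges are made up for illustration. It also clears each card as soon as it is reported, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a card table. Addresses are plain longs; nothing here touches the real JVM heap.
class CardTableSketch {
    static final int CARD_SHIFT = 9;   // K: 2^9 = 512-byte cards (an assumed size)

    final long heapStart;
    final long heapSize;
    final byte[] cardTable;            // one byte per card, 0 = clean, 1 = dirty

    CardTableSketch(long heapStart, long heapSize) {
        this.heapStart = heapStart;
        this.heapSize = heapSize;
        // Round up so that a partial last card still gets an entry.
        int cards = (int) ((heapSize + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT);
        this.cardTable = new byte[cards];
    }

    // Subtract the heap start first, then shift, so the first card has index 0.
    int cardIndex(long address) {
        return (int) ((address - heapStart) >>> CARD_SHIFT);
    }

    // Write barrier: called when a young reference was stored into the old object at oldObjAddress.
    void writeBarrier(long oldObjAddress) {
        cardTable[cardIndex(oldObjAddress)] = 1;
    }

    // Minor GC side: collect the [start, end) address ranges to scan, clearing cards as we go.
    List<long[]> dirtyRanges() {
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < cardTable.length; i++) {
            if (cardTable[i] != 0) {
                long start = heapStart + ((long) i << CARD_SHIFT);
                long end = Math.min(start + (1L << CARD_SHIFT), heapStart + heapSize);
                ranges.add(new long[] {start, end});
                cardTable[i] = 0;
            }
        }
        return ranges;
    }
}
```

Two stores into objects that sit on the same card dirty only one entry, which is exactly the loss of precision the card table trades for its small size.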
Some time ago I wrote an article explaining the mechanics of young collection in the HotSpot JVM:
Understanding GC pauses in JVM, HotSpot's minor GC
The principle of the dirty-card write barrier is very simple. Each time the program modifies a reference in memory, it marks the modified memory page as dirty. The JVM keeps a special card table, and each 512-byte page of memory has a one-byte entry in it.
Normally, collecting all references from old space into young space would require scanning every object in old space. That is why we need the write barrier. All objects in young space have been created (or relocated) since the last reset of the write barrier, so non-dirty pages cannot contain references into young space. This means we only need to scan objects in dirty pages.
For anyone who is looking for a simple answer:
In JVM, the memory space of objects is broken down into two spaces:
Young generation (space): All new allocations (objects) are created inside this space.
Old generation (space): This is where long lived objects exist (and probably die)
The idea is that once an object survives a few garbage collections, it is more likely to survive for a long time. So objects that survive more than a threshold number of collections are promoted to the old generation. The garbage collector runs more frequently in the young generation and less frequently in the old generation. This is because most objects live for a very short time.
We use generational garbage collection to avoid scanning the whole memory space (as a plain mark-and-sweep approach would). In the JVM, a minor garbage collection is one where the GC runs inside the young generation, and a major garbage collection (or full GC) covers both the young and old generations.
When doing minor garbage collection, JVM follows every reference from the live roots to the objects in the young generation, and marks those objects as live, which excludes them from the garbage collection process. The problem is that there may be some references from the objects in the old generation to the objects in young generation, which should be considered by GC, meaning those objects in young generation that are referenced by objects in old generation should also be marked as live and excluded from the garbage collection process.
One approach to solving this problem is to scan all of the objects in the old generation and find their references to young objects. But this approach contradicts the idea of generational garbage collectors. (Why did we break our memory space down into multiple generations in the first place?)
Another approach is to use write barriers and a card table. When an object in the old generation writes or updates a reference to an object in the young generation, the store goes through something called a write barrier. When the JVM executes this write barrier, it updates the corresponding entry in the card table. The card table is a table in which each entry corresponds to 512 bytes of memory. You can think of it as an array of 0 and 1 entries. A 1 entry means there is an object in the corresponding area of memory that contains references to objects in the young generation.
Now, when a minor garbage collection happens, first every reference from the live roots to young objects is followed, and the referenced objects in the young generation are marked as live. Then, instead of scanning all of the old objects to find references to young objects, the GC scans the card table. If it finds a marked entry, it scans the objects in the corresponding area of memory, follows their references to young objects, and marks those as live as well.
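The whole flow (roots first, then dirty cards) can be sketched in Java roughly as follows. This is a toy model, not how HotSpot is implemented: the class MinorGcSketch, its Obj type, and the idea of giving each old object an explicit offset inside the old generation are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of a minor GC that uses a card table. All names here are hypothetical.
class MinorGcSketch {
    static final int CARD_SIZE = 512;   // bytes of old generation covered by one card entry

    static final class Obj {
        final String name;
        final boolean old;      // true if the object lives in the old generation
        final long address;     // offset inside the old generation (unused for young objects)
        final List<Obj> refs = new ArrayList<>();

        Obj(String name, boolean old, long address) {
            this.name = name;
            this.old = old;
            this.address = address;
        }
    }

    final List<Obj> oldGen = new ArrayList<>();
    final byte[] cards;

    MinorGcSketch(long oldGenSize) {
        cards = new byte[(int) ((oldGenSize + CARD_SIZE - 1) / CARD_SIZE)];
    }

    Obj newOld(String name, long address) {
        Obj o = new Obj(name, true, address);
        oldGen.add(o);
        return o;
    }

    // Reference store with a write barrier: an old -> young store dirties the holder's card.
    void store(Obj holder, Obj target) {
        holder.refs.add(target);
        if (holder.old && !target.old) {
            cards[(int) (holder.address / CARD_SIZE)] = 1;
        }
    }

    // Minor GC marking: start from the roots and from old objects on dirty cards,
    // then trace through young objects only. Old objects on clean cards are never visited.
    Set<Obj> liveYoung(List<Obj> roots) {
        Deque<Obj> pending = new ArrayDeque<>();
        for (Obj r : roots) {
            if (!r.old) pending.push(r);
        }
        for (Obj o : oldGen) {
            if (cards[(int) (o.address / CARD_SIZE)] != 0) {
                for (Obj ref : o.refs) {
                    if (!ref.old) pending.push(ref);
                }
            }
        }
        Set<Obj> live = new LinkedHashSet<>();
        while (!pending.isEmpty()) {
            Obj y = pending.pop();
            if (live.add(y)) {   // add() returns false if y was already marked live
                for (Obj ref : y.refs) {
                    if (!ref.old) pending.push(ref);
                }
            }
        }
        return live;
    }
}
```

In a real collector the scan walks the memory of each dirty card rather than a list of objects, but the decision it makes is the same: only dirty areas are inspected.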
Related
While reading a course on the garbage collector, I understood that it removes unreferenced objects from memory. I tried to answer some quiz questions on the subject and found this one:
The role of the garbage collector is to ensure that there is enough memory to run a Java program?
Is it true or false? I know it manages memory, but does it guarantee that there is enough memory to run a program?
It does exactly what its name says:
"Garbage collection"
The garbage collector is not responsible for memory allocation or for ensuring there is enough memory to run a program.
So take a look at Wikipedia's article on garbage collection: https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
Also, according to Oracle:
What is Automatic Garbage Collection? Automatic garbage collection is
the process of looking at heap memory, identifying which objects are
in use and which are not, and deleting the unused objects. An in use
object, or a referenced object, means that some part of your program
still maintains a pointer to that object. An unused object, or
unreferenced object, is no longer referenced by any part of your
program. So the memory used by an unreferenced object can be
reclaimed.
https://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html
Also check this Stack Overflow question: What's the difference between memory allocation and garbage collection, please?
Currently I am learning about GC in Java, but I need some clarification. Let's say we have a situation like the one in this picture:
According to this website, DefNew runs first and Tenured GC after that. In that case:
In DefNew, Object A has a reference from the old generation, so it won't be collected.
In Tenured (if I get it right), Object B won't be deleted because it has a reference from the young generation (Object A).
How does it work after all? I was thinking about dirty cards, but only Object C would be marked, because it was changed (its reference to Object B was deleted).
From an article about the "nepotism" problem:
This is 'pathological' because ANY promoted node will result in the promotion of ALL following nodes until a GC resolves the issue.
So I guess it will be promoted to Old gen first, then collected.
I know the CMS garbage collector uses the mark-sweep algorithm, and I am curious about how it marks objects.
CMS initial mark: why does it mark the reachable objects rather than mark unreachable objects?
The garbage collector's task is to find objects that your program can no longer reach by following whatever chain of accessor steps, and to reclaim the memory occupied by them.
The mark-sweep GC does that the other way round: it first finds all objects that can still be reached, and then reclaims the memory of all the other objects.
Simplified Mark-Sweep algorithm (of course the real one is far more complex):
Start with all directly-reachable references, e.g. local variables on the stack (arguments and locals from all not-yet-finished method calls), static fields etc.
Mark the objects they point to.
Recursively inspect the newly-marked objects. Mark the objects referenced by their fields.
Repeat until you get no more new marks.
Loop over your memory object-by-object and reclaim the memory of every object without a mark.
Finally remove all marks.
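The steps above can be sketched in Java on a toy object graph. This is a minimal illustration, not CMS itself: the Obj class with an explicit mark flag and the heap represented as a list are assumptions made for the example, and nothing here runs concurrently with the program.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Simplified mark-sweep over a hypothetical object graph.
class MarkSweepSketch {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();   // fields that reference other objects
        boolean marked;

        Obj(String name) { this.name = name; }
    }

    // Mark phase: start from the roots and keep following references until no new marks appear.
    static void mark(List<Obj> roots) {
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (o.marked) continue;      // already visited, which also makes cycles terminate
            o.marked = true;
            pending.addAll(o.refs);
        }
    }

    // Sweep phase: walk the heap object by object, reclaim everything unmarked,
    // and clear the marks on survivors so the next cycle starts clean.
    static List<Obj> sweep(List<Obj> heap) {
        List<Obj> reclaimed = new ArrayList<>();
        Iterator<Obj> it = heap.iterator();
        while (it.hasNext()) {
            Obj o = it.next();
            if (o.marked) {
                o.marked = false;
            } else {
                reclaimed.add(o);
                it.remove();
            }
        }
        return reclaimed;
    }
}
```

Marking reachable objects is the natural direction because reachability is what can be computed by walking from the roots; "unreachable" is simply whatever is left without a mark afterwards.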
Could anyone please tell me what happens to objects that refer to each other? How does Java's GC resolve that issue? Thanks in advance!
If you have objects A and B, and the following conditions hold:
A references B
B references A
No other object references either of them
They are not root objects (e.g. objects in the constant pool)
then these two objects will be garbage collected. This is called a "circular reference".
This is because the mark-and-sweep GC scans for all objects that are reachable from the root objects. If A and B reference each other without any external reference, the mark-and-sweep GC won't be able to mark them as reachable, so they will be selected as candidates for GC.
There are a number of different mark-and-sweep implementations (naive mark-and-sweep, tri-colour etc). But the fundamental idea is the same. If an object cannot be reached from the root by direct/indirect references, it will be garbage collected.
There are a number of GCs. In the young generation, there is a copying collector.
What this does is find all the objects which are reachable from "root" objects such as thread stacks and copy them: e.g. live objects in the eden space are copied to a survivor space, and the survivor spaces are copied to each other. Anything left uncopied is cleared away.
This means if you have a bundle of objects which refer to each other and there is no strong reference to any of them, they will be discarded on the next collection. (The exception being soft references where the GC can choose to keep them or not)
On the slides I am revising from it says the following:
Live objects can be identified either by maintaining a count of the number of references to each object, or by tracing chains of references from the roots.
Reference counting is expensive – it needs action every time a reference changes and it doesn’t spot cyclical structures, but it can reclaim space incrementally.
Tracing involves identifying live objects only when you need to reclaim space – moving the cost from general access to the time at which the GC runs, typically only when you are out of memory.
I understand the principles of why reference counting is expensive but do not understand what
"doesn’t spot cyclical structures, but it can reclaim space incrementally." means. Could anyone help me out a little bit please?
Thanks
Reference counting doesn’t spot cyclical structures...
Let's say you have two objects O1 and O2. They reference each other: O1 -> O2 and O2 -> O1, and no other objects reference them. They will both have reference count 1 (one referrer).
If neither O1 nor O2 is reachable from a GC root, they can be safely garbage collected. This is not detected by counting references though, since they both have reference count > 0.
0 references is a sufficient but not necessary requirement for an object to be eligible for garbage collection.
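The O1/O2 case can be played through with a small hand-rolled reference counter. Java itself does not use reference counting, so this is purely a simulation; the Node class and the addRef, retain and release helpers are hypothetical names invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled reference counting, only to illustrate why cycles are never freed.
class RefCountCycleSketch {
    static final class Node {
        final String name;
        int refCount;
        boolean freed;
        final List<Node> refs = new ArrayList<>();

        Node(String name) { this.name = name; }
    }

    // Store a reference from one node to another and bump the target's count.
    static void addRef(Node from, Node to) {
        from.refs.add(to);
        to.refCount++;
    }

    // A reference held from outside the graph, e.g. by a local variable.
    static void retain(Node n) {
        n.refCount++;
    }

    // Drop one reference; when the count hits zero, free the node and release what it pointed to.
    static void release(Node n) {
        if (--n.refCount == 0) {
            n.freed = true;
            for (Node child : n.refs) {
                release(child);
            }
            n.refs.clear();
        }
    }

    public static void main(String[] args) {
        Node o1 = new Node("O1");
        Node o2 = new Node("O2");
        retain(o1);          // the program holds O1
        addRef(o1, o2);      // O1 -> O2
        addRef(o2, o1);      // O2 -> O1
        release(o1);         // the program drops O1; nothing else can reach either node
        System.out.println(o1.refCount + " " + o2.refCount + " " + o1.freed + " " + o2.freed);
    }
}
```

After the program drops its only reference, both counts are still 1, so neither node is ever freed, even though no root can reach them. Without the O2 -> O1 edge, the same release call would free both nodes in turn.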
...but it can reclaim space incrementally.
The incremental part refers to the fact that you can garbage collect some of the 0-referenced objects quickly, get interrupted and continue at another time without problems.
If a simple (non-incremental) tracing algorithm gets interrupted, it will need to start over from scratch the next time it's scheduled. (The graph of references may have changed since it started!)
Simple reference counting cannot handle the case where A refers to B and B refers to A. In this case, both A and B will have reference count 1 and will not be collected even if there are no other references.
Reference counting can reclaim space immediately when some object's reference counter becomes 0. There is no need to wait for GC cycle, scan other objects, etc. So, in a sense, it works incrementally as it reclaims space from objects one by one.
"doesn't spot cyclical structures"
Let's say I've got an object A. A needs another object called B to do its work. But B needs another object called C to do its work. But C needs a pointer to A for some reason or other. So the dependency graph looks like:
A -> B -> C -> A
The reference count for an object should be the number of arrows pointing at it. In this example, every object has a reference count of one.
Let's say our main program created a structure like this during its execution, and the main program had a pointer to A, making A's count equal to two. What happens when this structure falls out of scope? A's reference count is decremented to one.
But notice! Now A, B, and C all have reference counts of one even though they're not reachable from the main program. So this is a memory leak. Wikipedia has details on how to solve this problem.
"it can reclaim space incrementally"
Most garbage collectors have a collection period during which they pause execution and free up objects no longer in use. In a mark-and-sweep system, this is the sweep step. The downside is that during the periods between sweeps, memory keeps growing and growing. An object may stop being used almost immediately after its creation, but it will not be reclaimed until the next sweep.
In a reference-counting system, objects are freed as soon as their reference count hits zero. There is no big pause or big sweep step, and objects no longer in use do not just sit around waiting for collection; they are freed immediately. Hence the collection is incremental in that it collects each object as it falls out of use, rather than bulk collecting all the objects that have fallen out of use since the last collection.
Of course, this incrementalism may come with its own pitfalls, namely that it might be less expensive to do a bulk GC rather than a lot of little ones, but that is determined by the exact implementation.
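The difference in timing can be shown with a deliberately crude counting model in Java. It does not manage any real memory; it just counts how many dead-or-alive objects are sitting unreclaimed at the worst moment, for a program that allocates short-lived objects one after another. The class and method names, and the idea of sweeping every fixed number of allocations, are assumptions made for this sketch.

```java
// Crude model: every object becomes garbage right after it is allocated.
class ReclaimTimingSketch {
    // Reference counting: the object is freed the moment its last reference is dropped.
    static int peakWithRefCounting(int allocations) {
        int unreclaimed = 0;
        int peak = 0;
        for (int i = 0; i < allocations; i++) {
            unreclaimed++;                        // allocate, count goes 0 -> 1
            peak = Math.max(peak, unreclaimed);
            unreclaimed--;                        // reference dropped, count hits 0, freed at once
        }
        return peak;
    }

    // Periodic sweep: garbage piles up until the next sweep, modelled here as
    // one sweep after every sweepInterval allocations.
    static int peakWithPeriodicSweep(int allocations, int sweepInterval) {
        int unreclaimed = 0;
        int peak = 0;
        for (int i = 1; i <= allocations; i++) {
            unreclaimed++;                        // allocate; the object dies but stays in memory
            peak = Math.max(peak, unreclaimed);
            if (i % sweepInterval == 0) {
                unreclaimed = 0;                  // the sweep reclaims everything that died
            }
        }
        return peak;
    }
}
```

With 1000 allocations, the reference-counting model never holds more than one object, while the sweep-every-100 model peaks at 100. What the model leaves out is the per-assignment cost of maintaining the counts, which is the price paid for that immediacy.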
Imagine you have three objects (A, B, C): A has a reference to B, B has a reference to C and C has a reference to A. But no other object has a reference to any of them. It's an independent cyclical structure. Traditional reference counting would prevent the garbage collector from removing the cycle, because every object is still referenced. But as long as nothing else has a reference to any of the three, they could and should be removed. I guess "reclaiming space incrementally" refers to the way reference counting frees unreferenced instances one at a time when there are no cycles involved.
An object can be released for garbage collection if its reference count reaches 0.
For a circular reference this will never happen, as each object in the circle keeps a reference to another, so they are all at least 1.
To handle that, some graph theory is needed to detect references which are no longer attached to anything, like little islands in the heap sea. In order to stay in memory, objects must have some "attachment" to a GC root, such as a static variable somewhere.
That is what tracing does. It determines which parts of the heap are islands and can be freed, and which are still attached to the mainland, i.e. to roots such as static variables somewhere.