Java Heaps and GC on Multi-Processor machines?

How does Java deal with GC and Heap Allocation on multi-processor machines?
In the reading I've done, there doesn't seem to be any difference in the algorithms used between single and multi-processor systems. The art & science of GC tuning in Java seems fairly mature, yet I can't find anything related to this in any of the common JVM implementations.
As a data point, in .Net, the algorithm changes significantly: There's a heap affinitized to each processor, and each processor is responsible for that heap. This is documented in a number of places such as MSDN:
Scalable Collections: On a multiprocessor system running the server version of the execution engine (MSCorSvr.dll), the managed heap is split into several sections, one per CPU. When a collection is initiated, the collector has one thread per CPU; all threads collect their own sections simultaneously. The workstation version of the execution engine (MSCorWks.dll) doesn't support this feature.
Any insight that can be offered into Java GC tuning specifically for multi-processor systems is also of interest to me.

Indeed, in the HotSpot JVM, the way the heap is used does not depend on the heap size or the number of cores.
Each thread (not processor) has a Thread Local Allocation Buffer (TLAB), so that object allocation is cheap and thread-safe. This memory space plays much the same role as the heap-processor affinity you are mentioning.
You can also activate Non-Uniform Memory Access (NUMA) awareness. The idea behind NUMA is to prefer RAM that is close to a CPU chip for storing objects, instead of treating the entire heap as a uniform space.
Finally, the GCs are multi-threaded and scale with your number of cores, so they take advantage of your hardware.
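For reference, on HotSpot these mechanisms are controlled by startup flags. A minimal sketch, assuming a HotSpot JVM (TLABs are already on by default; the TLAB size and the class name MyApp are illustrative):

    java -XX:+UseTLAB -XX:TLABSize=256k -XX:+UseNUMA -XX:+UseParallelGC MyApp

-XX:+UseNUMA tells the parallel collector to prefer node-local memory for allocations, along the lines described above.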

Garbage collection is an implementation-specific concept. Different JVMs (IBM, Oracle, OpenJDK, etc.) have different implementations, and different mechanisms are available in different versions too. Often you can select which mechanism you want to use when you start your Java program.
These details are often given in the documentation for the command-line options of your JRE:
IBM JDK options: here
Oracle JRE options: here
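For example, on the HotSpot/Oracle JVM the collector is selected with a startup flag. A sketch (which flags exist, and which collector is the default, varies by version; MyApp is a placeholder):

    java -XX:+UseSerialGC MyApp          # single-threaded collector
    java -XX:+UseParallelGC MyApp        # multi-threaded throughput collector
    java -XX:+UseConcMarkSweepGC MyApp   # concurrent low-pause collector (CMS)
    java -XX:+UseG1GC MyApp              # region-based G1 collector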

Related

In Java 1.8, how do I choose a garbage collector?

Suppose my environment is Java 1.8 and my application is a batch application with no latency requirement; I don't know whether I should choose Parallel GC or G1 GC.
I understand that Parallel GC is optimized for throughput and is more suitable for batch applications like mine, but I find that all the Java applications around me use the G1 garbage collector. So I am not sure whether having G1 means I don't need Parallel GC at all, or whether Parallel GC is still the better choice when I am after throughput.
I went through the first chapter of the Java® Performance Companion book and there is a passage in it that describes this:
As of this writing, G1 primarily targets the use case of large Java heaps with reasonably low pauses, and also those applications that are using CMS GC. There are plans to use G1 to also target the throughput use case, but for applications looking for high throughput that can tolerate longer GC pauses, Parallel GC is currently the better choice.
This exactly answers my question: if a project purely runs scheduled tasks or consumes messages from an MQ, it usually has no pause-time requirement, so Parallel GC is the more appropriate choice.
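Based on that, a launch line for such a batch job might look something like this (the heap sizes and the class name BatchJob are illustrative):

    java -XX:+UseParallelGC -Xms4g -Xmx4g BatchJob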

VisualVM : Comparing performance of two version of same application

I have refactored and tuned my java application.
I now want to compare the performance of the newer and older versions of the application, in terms of their individual CPU and heap memory usage.
I am using VisualVM and JDK 1.7. I run them individually and monitor them using VisualVM. At the end, all I have is two sets of graphs, which makes deciding which one is better difficult.
Is there a metric that VisualVM provides which can make it easier to decide which version performs better?
Something like average CPU used or average heap used (if that is an accurate way of measuring performance). Thanks!
The Platform MBeans provide access to the CPU, memory and garbage-collection data, so it's possible to collect data from there and run statistical analysis against it:
CPU and Memory: http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html
Garbage collection: http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/GarbageCollectorMXBean.html
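As a sketch of that approach, one could sample those beans from inside each version of the application and log the numbers for offline comparison. This assumes a HotSpot/Oracle JVM (the cast to the com.sun.management interface is vendor-specific, JDK 7+); the class name and sampling interval are illustrative:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStatsSampler {
        public static void main(String[] args) throws InterruptedException {
            // Vendor-specific subinterface that adds process CPU metrics.
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean)
                            ManagementFactory.getOperatingSystemMXBean();

            for (int i = 0; i < 10; i++) {
                long heapUsed = ManagementFactory.getMemoryMXBean()
                        .getHeapMemoryUsage().getUsed();
                System.out.printf("cpu=%.2f heapUsed=%dB%n",
                        os.getProcessCpuLoad(), heapUsed);
                // One bean per collector, e.g. the young- and old-generation collectors.
                for (GarbageCollectorMXBean gc :
                        ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("  %s: collections=%d timeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(),
                            gc.getCollectionTime());
                }
                Thread.sleep(1000); // one sample per second
            }
        }
    }

Averaging the logged values over identical workloads gives the single comparable number the question asks for.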

Is garbage collection detrimental to the performance of this type of program?

I'm building a program that will live on an AWS EC2 instance and (probably) be invoked periodically via a cron job. The program will 'crawl'/'poll' specific websites that we've partnered with, index/aggregate their content, and update our database. I'm thinking Java is a perfect fit as a language to program this application in. Some members of our engineering team are concerned about the performance detriment of Java's garbage collection feature and are suggesting using C++.
Are these valid concerns? This is an application that will be invoked possibly once every 30 minutes via a cron job, and as long as it finishes its task within that time frame, the performance is acceptable, I would assume. I'm not sure whether garbage collection would be a performance issue, since I would assume the server will have plenty of memory, and the actual act of tracking how many references point to an area of memory and then declaring that memory free when the count reaches 0 doesn't seem too detrimental to me.
No, your concerns are most likely unfounded.
GC can be a concern when dealing with large heaps and fragmented memory (which requires a stop-the-world compacting collection), or with medium-lived objects that are promoted to the old generation but then quickly dereferenced (which causes excessive GC, but can be fixed by resizing the ratio of new:old space).
A web crawler is very unlikely to fit either of the above two profiles: you probably don't need a massive old generation, and you should have relatively short-lived objects (the in-memory representation of a page while you parse out its data), which will be dealt with efficiently by the young-generation collector.
We have an in-house crawler (Java) that can happily handle 2 million pages per day, including some additional post-processing per page, on commodity hardware (2 GB RAM); the main constraint is bandwidth. GC is a non-issue.
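For what it's worth, if you did hit the promotion problem described above, resizing the generations is a one-flag change on HotSpot (the values and class name are illustrative):

    java -XX:NewRatio=1 -Xmx2g Crawler   # young gen = half the heap, instead of the default third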
As others have mentioned, GC is rarely an issue for throughput sensitive applications (such as a crawler) but it can (if one is not careful) be an issue for latency sensitive apps (such as a trading platform).
The typical concern C++ programmers have for GC is one of latency. That is, as you run a program, periodic GCs interrupt the mutator and cause spikes in latency. Back when I used to run Java web applications for a living, I had a couple customers who would see latency spikes in the logs and complain about it — and my job was to tune the GC to minimize the impact of those spikes. There are some relatively complicated advances in GC over the years to make monstrous Java applications run with consistently low latency, and I'm impressed with the work of the engineers at Sun (now Oracle) who made that possible.
However, GC has always been very good at handling tasks with high throughput, where latency is not a concern. This includes cron jobs. Your engineers have unfounded concerns.
Note: A simple experimental GC reduced the cost of memory allocation / freeing to less than two instructions on average, which improved throughput, but this design is fairly esoteric and requires a lot of memory, which you don't have on EC2.
The simplest GCs around offer a tradeoff between large heap (high latency, high throughput) and small heap (lower latency, lower throughput). It takes some profiling to get it right for a particular application and workload, but these simple GCs are very forgiving in a large heap / high throughput / high latency configuration.
Fetching and parsing websites will take far more time than the garbage collector; its impact will probably be negligible. Moreover, automatic memory management is often more efficient than manual memory management via new/delete when dealing with lots of small objects (such as strings). Not to mention that garbage-collected memory is easier to use.
I don't have any hard numbers to back this up, but code that does a lot of small string manipulations (lots of small allocations and deallocations in a short period of time) should be much faster in a garbage-collected environment.
The reason is that modern GCs "re-pack" the heap on a regular basis, by moving objects from an "eden" space to survivor spaces and then to a tenured-object heap, and modern GCs are heavily optimized for the case where many small objects are allocated and then deallocated quickly.
For example, constructing a new string in Java (on any modern JVM) is as fast as a stack allocation in C++. By contrast, unless you're doing fancy string-pooling stuff in C++, you'll be really taxing your allocator with lots of small and quick allocations.
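To make that allocation pattern concrete, here is a toy loop that churns through millions of short-lived strings; running it with -verbose:gc should show only cheap minor collections (the class name and loop count are arbitrary):

    public class StringChurn {
        public static void main(String[] args) {
            long sink = 0;
            // Millions of short-lived strings: each one dies almost immediately,
            // so a generational GC reclaims them in bulk from the eden space.
            for (int i = 0; i < 10_000_000; i++) {
                String s = "item-" + i;
                sink += s.length();
            }
            System.out.println(sink); // keep the loop from being optimized away
        }
    }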
Plus, there are several other good reasons to consider Java for this sort of app: it has better out-of-the-box support for network protocols, which you'll need for fetching website data, and it is much more robust against the possibility of buffer overflows in the face of malicious content.
Garbage collection (GC) is fundamentally a space-time tradeoff. The more memory you have, the less time your program will need to spend performing garbage collection. As long as you have a lot of memory available relative to the maximum live size (total memory in use), the main performance hit of GC -- whole-heap collections -- should be a rare event. Java's other advantages (notably robustness, security, portability, and an excellent networking library) make this a no-brainer.
For some hard data to share with your colleagues showing that GC performs as well as malloc/free with plenty of available RAM, see:
"Quantifying the Performance of Garbage Collection vs. Explicit Memory Management", Matthew Hertz and Emery D. Berger, OOPSLA 2005.
This paper provides empirical answers to an age-old question: is garbage collection faster/slower/the same speed as malloc/free? We introduce oracular memory management, an approach that lets us measure unaltered Java programs as if they used malloc and free. The result: a good GC can match the performance of a good allocator, but it takes 5X more space. If physical memory is tight, however, conventional garbage collectors suffer an order-of-magnitude performance penalty.
Paper: PDF
Presentation slides: PPT, PDF

Experience of moving to 64 bit JVM

Our company is planning to move to a 64-bit JVM in order to get away from the 2 GB maximum heap size limit. Google gave me very mixed results about 64-bit JVM performance.
Has anyone tried moving to 64-bit Java? If so, please share your experience.
In a nutshell: 64-bit JVMs consume more memory for object references and a few other types (generally not significant), consume more memory per thread (often significant on high-volume sites), and enable you to have larger heaps (generally only important if you have many long-lived objects).
Longer Answers/Comments:
The comment that Java is 32-bit by design is misleading. Java memory addressing is either 32- or 64-bit, but the VM spec ensures that most fields (e.g. int, long, double, etc.) are the same size regardless.
Also, the GC tuning comments, while pertinent to the number of objects, may not be relevant: GC can be quick on JVMs with large heaps (I've worked with heaps up to 15 GB, with very quick GC). It depends more on how you play with the generational collector schemes, and on what your object-usage pattern is. While in the past people have spent lots of energy tuning parameters, it's very workload-dependent, and modern (Java 5+) JVMs are very good at self-tuning; unless you have lots of data, you're more likely to harm yourself than help with aggressive JVM tuning.
As mentioned, on x86 architectures the 64-bit EM64T or x64 processors also include new instructions for doing things like atomic writes, and other options which may also impact high-performance applications.
If you need the larger heap, then questions of performance are rather moot, aren't they? Or do you have a plan for horizontal scaling?
The main problem that I've heard of with 64-bit apps is that a full garbage collection can take a very long time (because it's based on the number of live objects). So you want to carefully tune the GC parameters to avoid full collections (I've heard one anecdote about a company that had 64 GB of heap and tuned their GC so that they'd never do a full GC; they'd simply shut down once a week).
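For what it's worth, that kind of tuning typically meant steering the concurrent collector so the old generation never fills up. A hedged sketch using HotSpot's CMS flags of that era (the heap size, threshold, and class name are illustrative):

    java -Xmx16g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly MyApp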
Other than that, recognize that Java is 32-bit by design, so you're not likely to see any huge performance increase from moving data 64 bits at a time. And you're still limited to 32-bit array indices.
Works fine for us. Why don't you simply try setting it up and running your load-test suite under a profiler like jvisualvm?
We've written directly for 64-bit, and I can see no adverse behavior...
Naively taking 32-bit JVM workloads and putting them on 64-bit produces a performance and space hit, in my experience.
However, most of the major JVM vendors have now implemented a good technology that essentially compresses some of the heap: it's called compressed references or compressed oops, for 64-bit JVMs that aren't "big" (i.e. in the 4-30 GB range).
This makes a big difference and should make a 32->64 transition much lower impact.
Reference for the IBM JVM: link text
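On HotSpot, the equivalent flag is -XX:+UseCompressedOops, and recent versions enable it by default for heaps small enough to benefit (the heap size and class name here are illustrative):

    java -Xmx8g -XX:+UseCompressedOops MyApp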

Why is memory management so visible in Java VM?

I'm playing around with writing some simple Spring-based web apps and deploying them to Tomcat. Almost immediately, I run into the need to customize Tomcat's JVM settings with -XX:MaxPermSize (and -Xmx and -Xms); without this, the server easily runs out of PermGen space.
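For context, the settings I end up adding look something like this (the values are just what happened to work for my apps, not recommendations):

    CATALINA_OPTS="-Xms256m -Xmx512m -XX:MaxPermSize=256m"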
Why is this such an issue for Java VMs compared to other garbage-collected languages? Comparing counts of "tune X memory usage" for X in Java, Ruby, Perl and Python shows that Java easily has an order of magnitude more hits in Google than the other languages combined.
I'd also be interested in references to technical papers/blog-posts/etc explaining design choices behind JVM GC implementations, across different JVMs or compared to other interpreted language VMs (e.g. comparing Sun or IBM JVM to Parrot). Are there technical reasons why JVM users still have to deal with non-auto-tuning heap/permgen sizes?
The title of your question is misleading (not on purpose, I know): PermSize issues (and there are a lot of them; I was one of the first to diagnose a Tomcat/Sun PermGen issue years ago, when there wasn't any knowledge of the issue yet) are not specific to Java but to the Sun VM.
If you use a VM that doesn't use permanent generation (like, say, an IBM VM if I'm not mistaken) you cannot have permgen issues.
So it is not a "Java" problem, but a Sun VM implementation problem.
Java gives you a bit more control over memory, a plus for people wanting to apply that control, versus Ruby, Perl, and Python, which give you less control on that front. Java's typical implementation is also very memory-hungry (because it has a more advanced garbage-collection approach) compared with the typical implementations of the dynamic languages... but if you look at JRuby or Jython, you'll find it's not a language issue (when these different languages use the same underlying VM, memory issues are pretty much equalized). I don't know of a widespread "Perl on JVM" implementation, but if there's one, I'm willing to bet it wouldn't be measurably different in terms of footprint from JRuby or Jython!
Python/Perl/Ruby allocate their memory with malloc() or an optimization thereof. The limit to the heap space is determined by the operating system rather than the VM, so there's no need for options like -Xmxn. Also, the garbage collection is simpler, based mostly on reference counting. So there's a lot less to fine-tune.
Furthermore, dynamic languages tend to be implemented with bytecode interpreters rather than JIT compilers, so they aren't used for performance-critical code anyway.
The essence of @WizardOfOdds's and @Alex Martelli's answers appears to be correct: Java has an advanced set of GC options, and sometimes you need to tune them. However, I'm still not entirely clear on why you might design a JVM with or without a permanent generation. I have found a bunch of useful links about garbage collection in Java, though not necessarily in comparison to other languages with GC. Briefly:
The Sun GC evolves very slowly due to the fact that it is deployed everywhere and people may rely on quirks in its implementation.
Sun has detailed white papers on GC design and options, such as Tuning Garbage Collection with the 5.0 Java[tm] Virtual Machine.
There is a new GC in the wings, called the G1 GC. Alex Miller has a good summary of relevant blog posts and a link to the technical paper. But it still has a permanent generation (and doesn't necessarily do a great job with it); see the flags below for trying it out.
Jon Masamitsu has (had?) an interesting blog at Sun covering various details of garbage collection.
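For anyone wanting to experiment, as of the early JDK 6 update releases G1 has to be unlocked explicitly (MyApp is a placeholder):

    java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC MyApp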
Happy to update this answer with more details if anyone has them.
This is because Tomcat is running in the Java Virtual Machine, while other languages are either compiled or interpreted and run against your actual machine. When you set -Xmx and -Xms, you are saying that you want the JVM to run like a computer with an amount of RAM somewhere in the set range.
I think the reason so many people run into this is that the default values are relatively low, and people end up hitting the default ceiling pretty quickly (instead of waiting until you run out of actual RAM, as you would with other languages).
