Use PermGen space or roll-my-own intern method?

I am writing a Codec to process messages sent over TCP using a bespoke wire protocol. During the decode process I create a number of Strings, BigDecimals and dates. The client-server access patterns mean that it is common for the client to issue a request and then decode thousands of response messages, which results in a large number of duplicate Strings, BigDecimals, etc.
Therefore I have created an InternPool<T> class allowing me to intern each class of object. Internally, the pool uses a WeakHashMap<T, WeakReference<T>>. For example:
InternPool<BigDecimal> pool = new InternPool<BigDecimal>();
...
// Read BigDecimal from in buffer and then intern.
BigDecimal quantity = pool.intern(readBigDecimal(in));
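A minimal sketch of such a pool, assuming the WeakHashMap-based design described above (the synchronization is my own addition; adjust it to your threading model):
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

public class InternPool<T> {
    private final WeakHashMap<T, WeakReference<T>> pool =
            new WeakHashMap<T, WeakReference<T>>();

    public synchronized T intern(T object) {
        WeakReference<T> ref = pool.get(object);
        T cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached;                            // reuse the canonical instance
        }
        pool.put(object, new WeakReference<T>(object));
        return object;                                // first sighting becomes canonical
    }
}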
My question: I am using InternPool for BigDecimal but should I consider also using it for String instead of String's intern() method, which I believe uses PermGen space? What is the advantage of using PermGen space?

If you already have such an InternPool class, I think it is better to use it than to choose a different interning method for Strings. Especially since String.intern() seems to give a much stronger guarantee than you actually need: your goal is to reduce memory usage, so perfect interning for the lifetime of the JVM is not actually necessary.
Also, I'd use the Google Collections MapMaker to create the InternPool rather than reinventing the wheel:
Map<BigDecimal, BigDecimal> bigDecimalPool = new MapMaker()
    .weakKeys()
    .weakValues()
    .expiration(1, TimeUnit.MINUTES)
    .makeComputingMap(
        new Function<BigDecimal, BigDecimal>() {
            public BigDecimal apply(BigDecimal value) {
                return value;
            }
        });
This would give you (correctly implemented) weak keys and values, thread safety, automatic purging of old entries, and a very simple interface (a plain, well-known Map). To be safe you could also wrap it with Collections.unmodifiableMap() to keep bad code from messing with it.
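Interning then reduces to a single map lookup; for example, reusing the readBigDecimal(in) call from the question:
// The computing map returns the canonical instance, creating it on first sight.
BigDecimal quantity = bigDecimalPool.get(readBigDecimal(in));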

It is likely that the JVM's String.intern() pool will be faster. AFAIK, it is implemented in native code, so it should be faster and use less space than a pool implemented using WeakHashMap and WeakReference. You would need to do some careful benchmarking to confirm this.
However, unless you have huge numbers of long-lived duplicate objects, I doubt that interning (either in PermGen or with your own pools) will make much difference. And if the proportion of duplicate objects is too low, then interning will just increase the number of live objects (making the GC take longer) and reduce performance due to the overheads of interning, and so on. So I would also advocate benchmarking the "intern" versus "no intern" approaches.
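A rough harness for such a comparison might look like this (a sketch only: JIT warm-up and GC noise easily dominate such measurements, and makeDuplicateHeavyInput() is a hypothetical stand-in for your decode workload; for the memory side, compare heap histograms from jmap -histo rather than timings):
List<String> raw = makeDuplicateHeavyInput();    // hypothetical duplicate-heavy input

List<String> noIntern = new ArrayList<String>(raw.size());
long t0 = System.nanoTime();
for (String s : raw) {
    noIntern.add(s);                             // keep every duplicate as-is
}
long t1 = System.nanoTime();

List<String> interned = new ArrayList<String>(raw.size());
for (String s : raw) {
    interned.add(s.intern());                    // canonicalize via the JVM pool
}
long t2 = System.nanoTime();

System.out.println("no intern: " + (t1 - t0) / 1000000 + " ms, intern: "
        + (t2 - t1) / 1000000 + " ms");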

How to implement our own string constant pool through a program in java?

How is the string constant pool implemented in Java? And how can we make a similar pool that is local to our own program?
Here's a very simple implementation of an object pool:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ObjectPool<T> {
    private final ConcurrentMap<T, T> map = new ConcurrentHashMap<>();

    // Returns the canonical instance for the given object, storing it on first use.
    public T get(T object) {
        T old = map.putIfAbsent(object, object);
        return old == null ? object : old;
    }
}
Now, to create a pool of strings, use:
final ObjectPool<String> stringPool = new ObjectPool<>();
You can use it to deduplicate the strings in your program:
String deduplicatedStr = stringPool.get(str);
The String constant pool is a well-defined term in Java and is implemented by the JVM. You can't replace it with something you create in your Java program; you'd have to write your own JVM.
If you mean you want some sort of String pool inside your application for storing Strings that your application uses over and over again (say, a centralized place for texts to display on a user interface), a ResourceBundle is a good way to go; it is essentially a wrapper around a Map.
One can also call String.intern(), which uses a string pool that is already implemented for us.
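For instance, interned strings compare equal by reference:
String a = new String("hello").intern();
String b = "hello";                  // string literals are interned automatically
System.out.println(a == b);          // true: both refer to the pooled instance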
However, be warned that object pools in Java are often the wrong thing to do, whether based on String.intern or a ConcurrentHashMap. Double-check your use case by measuring the impact. Resource pools make sense when the pooled objects are both expensive to create and limited in number; network and database connections are the classic examples.
The hidden cost that most people forget is the cost to GC. GC cost is related to how many live objects one has on the heap, and the JVM is not very good with objects that live for a fair while and then die. It is much better with objects that die young, or never die.

Java software design - Looping, object creation VS modifying variables. Memory, performance & reliability comparison

Let's say we are trying to build a document scanner class in Java that takes one input argument, the log path (e.g. C:\document\text1.txt). Which of the following implementations would you prefer, based on performance, memory, and modularity?
ArrayList<String> fileListArray = new ArrayList<String>();
fileListArray.add("C:\\document\\text1.txt");
fileListArray.add("C:\\document\\text2.txt");
// ...
// Implementation A
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    MyDocumentScanner ds = new MyDocumentScanner(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}
// Implementation B
MyDocumentScanner ds = new MyDocumentScanner();
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    ds.setDocPath(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}
Personally I would prefer A because of its encapsulation, but it seems to use more memory because it creates multiple instances. I'm curious whether there is a definitive answer, or whether this is another "it depends on the situation" dilemma.
Although this is obviously opinion-based, I will try to give my opinion as an answer.
Your approach A is far better. Your document scanner obviously handles one file; that file should be set at construction time and saved in an instance field, so every method can refer to it. Moreover, the constructor can perform some checks on the file reference (null check, existence, ...).
Your approach B has two very serious disadvantages:
After constructing a document scanner, clients can call any of its methods. If no file has been set yet, you must handle that illegal state, perhaps with an IllegalStateException. This approach therefore increases the code size and complexity of the class.
There seems to be a series of method calls that a client should or can perform. It would be easy to call the file-setting method again in the middle of such a series with a completely different file, breaking the whole scan. To avoid this, your setter (for the file) would have to remember whether a file was already set, and that leads almost automatically back to approach A.
Regarding the creation of objects: Modern JVMs are really very fast at creating objects. Usually, there is no measurable performance overhead for that. The processing time (here: the scan) usually is much higher.
If you don't need multiple instances of DocumentScanner to co-exist, I see no point in creating a new instance in each iteration of the loop. It just creates work for the garbage collector, which has to free each of those instances.
If the length of the list is small, it doesn't make much difference which implementation you choose, but for large lists, implementation B is more efficient, both in terms of memory (fewer instances created that the GC hasn't freed yet) and CPU (less work for the GC).
Are you implementing DocumentScanner or using an existing class?
If the latter, and it was designed for being able to parse multiple documents in a row, you can just reuse the object as in variant B.
However, if you are designing DocumentScanner, I would recommend designing it so that it handles a single document and does not even have a setDocPath method. This leads to less mutable state in that class and thus makes its design much simpler. Using an instance of the class also becomes less error-prone.
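A sketch of that design (hypothetical; the scan and output bodies are elided):
// Immutable: the document is fixed at construction, so no illegal states exist.
public final class MyDocumentScanner {
    private final String docPath;

    public MyDocumentScanner(String docPath) {
        if (docPath == null) {
            throw new IllegalArgumentException("docPath must not be null");
        }
        this.docPath = docPath;
    }

    public void scanDocument() { /* ... scan the file at docPath ... */ }
    public void resultOutput() { /* ... report the results ... */ }
}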
As for performance, there won't be a measurable difference unless instantiating a DocumentScanner is doing a lot of work (like instantiating many other objects, too). Instantiating and freeing objects in Java is pretty cheap if they are used only for a short time due to the generational garbage collector.

Helping the JVM with stack allocation by using separate objects

I have a bottleneck method which attempts to add points (as x-y pairs) to a HashSet. The common case is that the set already contains the point, in which case nothing happens. Should I use a separate Point instance for the containment check, distinct from the one I use for adding? It seems this would allow the JVM to allocate the checking point on the stack, so that the common case would require no heap allocation.
Ex. I'm considering changing
HashSet<Point> set;

public void addPoint(int x, int y) {
    if (set.add(new Point(x, y))) {
        // Do some stuff
    }
}
to
HashSet<Point> set;

public void addPoint(int x, int y) {
    if (!set.contains(new Point(x, y))) {
        set.add(new Point(x, y));
        // Do some stuff
    }
}
Is there a profiler which will tell me whether objects are allocated on heap or stack?
EDIT: To clarify why I think the second might be faster: in the first case the object may or may not be added to the collection, so it is not non-escaping and cannot be optimized. In the second case, the first object allocated is clearly non-escaping, so it can be optimized by the JVM and put on the stack. The second allocation only occurs in the rare case where the point is not already contained.
Marko Topolnik properly answered your question; the space allocated for the first new Point may or may not be immediately freed and it is probably foolish to bank on it happening. But I want to expand on why you're currently in a deep state of sin:
You're trying to optimise this the wrong way.
You've identified object creation to be the bottleneck here. I'm going to assume that you're right about this. You're hoping that, if you create fewer objects, the code will run faster. That might be true, but it will never run very fast as you've designed it.
Every object in Java has a pretty fat header (16 bytes; an 8-byte "mark word" full of bit fields and an 8-byte pointer to the class type) and, depending on what's happened in your program thus far, possibly another pretty fat trailer. Your HashSet isn't storing just the contents of your objects; it's storing pointers to those fat-headers-followed-by-contents. (Actually, it's storing pointers to Entry classes that themselves store pointers to Points. Two levels of indirection there.)
A HashSet lookup, then, figures out which bucket it needs to look at and then chases one pointer per entry in the bucket to do the comparison (as one great big chain in series). There probably aren't very many of these objects, but they almost certainly aren't stored close together, making your cache angry. Note that object allocation in Java is extremely cheap (you just increment a pointer), so this pointer chasing is quite probably a bigger source of slowness.
Java doesn't provide any abstraction like C++'s templates, so the only real way to make this fast and still provide the Set abstraction is to copy HashSet's code, change all of the data structures to represent your objects inline, modify the methods to work with the new data structures, and, if you're still worried, make copies of the relevant methods that take a list of parameters corresponding to object contents (i.e. contains(int, int)) that do the right thing without constructing a new object.
This approach is error-prone and time-consuming, but it is unfortunately often necessary when working on Java projects where performance matters. Take a look at the Trove library Marko mentioned and see if you can use it instead; Trove did exactly this for the primitive types.
With that out of the way, a monomorphic call site is one where only one method is called. Hotspot aggressively inlines calls from monomorphic call sites. You'll notice that HashSet.contains punts to HashMap.containsKey. You'd better pray for HashMap.containsKey to be inlined since you need the hashCode call and equals calls inside to be monomorphic. You can verify that your code is being compiled nicely by using the -XX:+PrintAssembly option and poring over the output, but it's probably not---and even if it is, it's probably still slow because of what a HashSet is.
As soon as you have written new Point(x,y), you are creating a new object. It may happen not to be placed on the heap, but that's just a bet you can lose. For example, the contains call should be inlined for the escape analysis to work, or at least it should be a monomorphic call site. All this means that you are optimizing against a quite erratic performance model.
If you want to avoid allocation the solid way, you can use Trove library's TLongHashSet and have your (int,int) pairs encoded as single long values.
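A sketch of that encoding, assuming Trove 3's gnu.trove.set.hash.TLongHashSet (the bit-packing is the standard way to pair two ints in a long):
import gnu.trove.set.hash.TLongHashSet;

public class PointSet {
    private final TLongHashSet set = new TLongHashSet();

    // Pack both 32-bit coordinates into one primitive long: no Point object at all.
    private static long encode(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    // Returns true if the point was newly added.
    public boolean addPoint(int x, int y) {
        return set.add(encode(x, y));
    }
}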

Can String assignment cause memory leak?

In my J2ME code, I have a loop which looks like this:
Enumeration jsonEnumerator = someJSONObject.keys();
while (jsonEnumerator.hasMoreElements()) {
    String key = (String) jsonEnumerator.nextElement();
    String value = someJSONObject.getString(key);
    someOtherJson.put(value, key);
}
Considering the String assignments in the above code, such as
String key = (String) jsonEnumerator.nextElement();
is it the right approach to use a pool of Strings instead of instantiating new objects, or are there other ways to assign the strings that will avoid memory leaks?
The String assignments won't cause a memory leak.
Whether the strings leak elsewhere in that code depends on a couple of things that can't be discerned from this code:
How the JSON implementation creates the key and value strings. (If it uses String.substring() on a much larger String, you may leak storage via a shared string backing array; see the snippet after this list.)
Whether the someOtherJson is being leaked.
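For the substring case, the classic defensive copy on VMs where substring shares the parent's backing array (typical of J2ME, and of Java SE before 7u6) looks like this; readWholeDocument() is a hypothetical source of a large string:
String document = readWholeDocument();    // large backing char[]
String key = document.substring(10, 20);  // may share document's backing array
String detached = new String(key);        // copies just those 10 chars, so the
                                          // large array can be collected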
The normal approach (in Java SE) is to not worry about it ... until you've got evidence from memory profiling that there is a leak. In Java ME implementations, memory is typically more constrained, and GC implementations can be relatively slow. So it can be necessary to reduce the number and size of objects (including strings). But that's not a memory leak issue ... and I'd still advise profiling first instead of leaping into a memory efficiency campaign that could be a waste of effort.
Is it the right approach to use a pool of Strings instead of instantiating new objects, or are there other ways to assign the strings that will avoid memory leaks?
As I said, there is no leak in the above code.
String pools don't eliminate leaks, and they don't necessarily reduce the rate of garbage object creation. They can reduce the number of live String objects at any given time, but this comes at a cost.
If you want to try this approach, it is simplest to use String.intern() to manage your String pool. But it won't necessarily help. And can actually make things worse. (If there isn't enough potential for sharing, the space overheads of the interned string pool can exceed the saving. In addition, the interned string pool creates more work for the GC - more tracing, and more effectively weak references to deal with.)
No, String assignment, by itself, does not create anything. The only thing resembling a "leak" in Java is when you put a whole bunch of references into some array or other structure and then forget about it, leaving the structure "live" (accessible) but unused.
If you're talking about interning strings, then it doesn't happen here. It only happens automatically for constant strings which are found in your source code.
Any other strings will be garbage collected, just like any other object.
My suggestion is:
Enumeration jsonEnumerator = someJSONObject.keys();
while (jsonEnumerator.hasMoreElements()) {
    String key = (String) jsonEnumerator.nextElement();
    someOtherJson.put(someJSONObject.getString(key), key);
}
String instantiation can cause memory problems in J2ME, because J2ME VMs use relatively weak garbage collection methods in order to reduce resource usage.
When you are developing a J2ME application, be careful about memory and CPU usage.

Using SoftReference for static data to prevent memory shortage in Java

I have a class with a static member like this:
class C {
    static Map m = new HashMap();
    static {
        // ... initialize the map with some values ...
    }
}
AFAIK, this would consume memory practically to the end of the program. I was wondering, if I could solve it with soft references, like this:
class C {
    static volatile SoftReference<Map> m = null;

    static Map getM() {
        Map ret;
        if (m == null || (ret = m.get()) == null) {
            ret = new HashMap();
            // ... initialize the map ...
            m = new SoftReference(ret);
        }
        return ret;
    }
}
The question is:
1. Is this approach (and the implementation) right?
2. If it is, does it pay off in real situations?
First, the code above is not threadsafe.
Second, while it works in theory, I doubt there is a realistic scenario where it pays off. Think about it: In order for this to be useful, the map's contents would have to be:
1. Big enough that their memory usage is relevant.
2. Able to be recreated on the fly without unacceptable delays.
3. Used only at times when other parts of the program require less memory; otherwise the maximum memory required would be the same, only the average would be lower, and you probably wouldn't even see the difference outside the JVM, since it gives heap memory back to the OS very reluctantly.
Here, 1. and 2. are somewhat contradictory: large objects also take longer to create.
This is okay if access to getM() is single-threaded and it only acts as a cache.
A better alternative is a fixed-size cache, as this provides a consistent benefit.
getM() should be synchronized, to avoid m being initialized at the same time by different threads.
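A minimal thread-safe variant of getM(), keeping the raw types of the original:
static SoftReference<Map> m = null;   // volatile is unnecessary once synchronized

static synchronized Map getM() {
    Map ret = (m == null) ? null : m.get();
    if (ret == null) {
        ret = new HashMap();
        // ... initialize the map ...
        m = new SoftReference(ret);
    }
    return ret;
}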
How big is this map going to be? Is it worth the effort to handle it this way? Have you measured the memory consumption? (For what it's worth, I believe the above is generally OK, but my first question with optimisations is "what does it really save me?".)
You're returning the reference to the map, so you need to ensure that your clients don't hold onto this reference (and prevent garbage collection). Perhaps your class can hold the reference and provide a getKey() method to access the map's content on behalf of clients? That way you maintain control of the reference to the map in one place.
I would synchronise the above, in case the map gets garbage collected and two threads hit getM() at the same time. Otherwise you're going to create two maps simultaneously!
Maybe you are looking for WeakHashMap? Then entries in the map can be garbage collected separately.
Though in my experience it didn't help much, so I instead built an LRU cache using LinkedHashMap. The advantage is that I can control the size, so that it isn't too big yet is still useful.
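A minimal LRU cache along those lines, using LinkedHashMap's access order and eviction hook:
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);   // true = iterate in access order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after every put; evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}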
I was wondering, if I could solve it with soft references
What is it that you are trying to solve? Are you running into memory problems, or are you prematurely optimizing?
In any case,
The implementation would need to be altered a bit before you use it. As has been noted, it isn't thread-safe: multiple threads could enter the method at the same time, allowing multiple copies of your collection to be created. If those collections were then strongly referenced for the remainder of your program, you would end up with more memory consumption, not less.
A reason to use SoftReferences is to avoid running out of memory, as there is no contract other than that they will be cleared before the VM throws an OutOfMemoryError. Therefore there is no guaranteed benefit of this approach, other than not creating the cache until it is first used.
The first thing I notice about the code is that it mixes generic with raw types. That is just going to lead to a mess. javac in JDK7 has -Xlint:rawtypes to quickly spot that kind of mistake before trouble starts.
The code is not thread-safe, but it uses statics, so it is published across all threads. You probably don't want it to be synchronized either, because synchronization can cause problems if contended on multithreaded machines.
A problem with using a SoftReference for the entire cache is that you will cause spikes when the reference is cleared. In some circumstances it might work out better to have a ThreadLocal<SoftReference<Map<K,V>>>, which would spread out the spikes and help thread safety, at the expense of not sharing between threads.
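A sketch of that per-thread variant (a hypothetical String-to-Object cache; whether the reduced contention outweighs the loss of sharing depends on the workload):
// Each thread gets its own soft-referenced cache: clearing one thread's
// reference spikes only that thread, at the cost of duplicated contents.
static final ThreadLocal<SoftReference<Map<String, Object>>> CACHE =
        new ThreadLocal<SoftReference<Map<String, Object>>>();

static Map<String, Object> getCache() {
    SoftReference<Map<String, Object>> ref = CACHE.get();
    Map<String, Object> map = (ref == null) ? null : ref.get();
    if (map == null) {
        map = new HashMap<String, Object>();
        // ... populate the cache ...
        CACHE.set(new SoftReference<Map<String, Object>>(map));
    }
    return map;
}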
However, creating a smarter cache is more difficult. Often you end up with values referencing keys. There are ways around this, but it is a mess. I don't think ephemerons (essentially a pair of linked References) are going to make JDK7. You might find the Google Collections library worth looking at (although I haven't).
java.util.LinkedHashMap gives an easy way to limit the number of cached entries, but it is not much use if you can't be sure how big the entries are, and it can cause problems if it prevents collection of large object graphs such as ClassLoaders. Some people have said you shouldn't leave cache eviction up to the whims of the garbage collector, but then some people say you shouldn't use GC at all.
