Capacity adapting Java collections - java

Are there any Java libraries for maps and sets that alter their representation strategy based on capacity? I have an application where we have many, many maps and sets, but most of the time they are small, usually 6 elements or fewer.
As such we've been able to extract some good memory improvements by writing some specialized maps and sets that just use arrays for small sizes and then default to standard Java Sets and Maps for larger capacities.
However, rolling our own specialized versions of sets and maps seems kind of silly if there is already something off the shelf. I've looked at Guava and Apache Commons Collections and they do not seem to offer anything like this. Trove sounds like it is more memory efficient than the JDK's collections in general, but it isn't clear whether it attempts to minimize memory usage like this.
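For concreteness, here is a minimal sketch of that kind of adapting set: array-backed while small, promoted to a HashSet once it passes a threshold. The class name and the threshold of 6 are illustrative only, not taken from our code or from any library.

    import java.util.AbstractSet;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    // Array-backed while small; switches to a HashSet once it passes the threshold.
    @SuppressWarnings("unchecked")
    class AdaptiveSet<E> extends AbstractSet<E> {
        private static final int THRESHOLD = 6;   // illustrative cut-over point
        private Object[] small = new Object[THRESHOLD];
        private int size;
        private Set<E> big;                        // non-null once promoted

        @Override
        public boolean add(E e) {
            if (big != null) return big.add(e);
            for (int i = 0; i < size; i++) {
                if (small[i].equals(e)) return false;   // already present (nulls not handled)
            }
            if (size == THRESHOLD) {                    // promote to a HashSet
                big = new HashSet<>();
                for (int i = 0; i < size; i++) big.add((E) small[i]);
                small = null;
                return big.add(e);
            }
            small[size++] = e;
            return true;
        }

        @Override
        public Iterator<E> iterator() {
            if (big != null) return big.iterator();
            List<E> copy = new ArrayList<>(size);
            for (int i = 0; i < size; i++) copy.add((E) small[i]);
            return copy.iterator();                     // snapshot iterator; fine for a sketch
        }

        @Override
        public int size() {
            return big != null ? big.size() : size;
        }
    }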

You may want to look at Clojure's persistent data structures. Although the "persistent" part may be overkill for you, it does exactly what you are looking for and is still really fast. There is a PersistentArrayMap that is promoted to a PersistentHashMap once the collection exceeds 16 entries.

I'm not aware of any such library.
The problem is that the representations that use the least amount of memory tend to:
be incompatible with the Java Collections APIs which makes integration hard, and
break down the abstraction boundaries; e.g. by adding link fields to element types.
These make it difficult to create a general purpose library along these lines. Then we add the problem that a representation that adapts to minimize heap space usage as the collection grows and shrinks will inevitably create a lot more garbage ... and that will have CPU performance implications.
Your approach is kind of interesting, though it doesn't give you anywhere near minimal memory usage. I assume that your classes are effectively wrappers for the standard implementation classes when the collections get big. If it works for you, I suggest that you stick with it.

Related

Direct operations on off-heap arrays

Recently I've been looking for a way to store large chunks of data in memory for scientific computing. I've looked at scala-offheap and LArray. One thing I noticed is that if I have an existing function operating on a native Java array, I cannot apply it directly to an off-heap array; both libraries require a copy from the off-heap array to a normal one.
I don't know if this is a real limitation of the memory model, or simply a limitation imposed by the library APIs. Is it possible to get a Java array "view" of an off-heap array?
jillegal claims to be able to do that, but it's basically one big hack, because it violates assumptions of the garbage collectors and relies on particular collectors not going up in flames when they encounter those violations. It's probably not a good idea for production use.
If you only need to access primitive types, then ByteBuffers are currently the abstraction that provides the same API for on-heap and off-heap access, but you have to extract fields one by one.
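For illustration, a minimal sketch of that trade-off with a direct ByteBuffer: the data lives off-heap, primitives are read back one at a time, and reusing an existing double[]-based function still requires a copy.

    import java.nio.ByteBuffer;

    public class OffHeapDemo {
        public static void main(String[] args) {
            int n = 1_000_000;
            // Off-heap (direct) buffer holding n doubles; it is not a Java double[],
            // so array-based code cannot operate on it without copying.
            ByteBuffer buf = ByteBuffer.allocateDirect(n * Double.BYTES);

            for (int i = 0; i < n; i++) {
                buf.putDouble(i * Double.BYTES, i * 0.5);   // absolute put, one value at a time
            }

            double sum = 0;
            for (int i = 0; i < n; i++) {
                sum += buf.getDouble(i * Double.BYTES);     // absolute get, one value at a time
            }
            System.out.println(sum);

            // To call an existing double[]-based function, a copy back on-heap is needed:
            double[] copy = new double[n];
            buf.asDoubleBuffer().get(copy);
        }
    }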

Lightweight collection w/ Set contract

I'm developing an application that requires lots of objects in memory. One of the largest structures is of the type
Map<String,Set<OwnObject>> (with Set as HashSet)
with OwnObject being a heavyweight object representing records in a database. The application works, but has a rather large memory footprint. Reading this Java Specialists newsletter from 2001, I've analyzed the memory usage of my large structure above. The HashSet uses a HashMap behind the scenes, which is itself quite a heavyweight object, and I guess this is where most of my additional memory goes.
Trying to optimize the memory usage of the structure, I tried around with multiple versions:
Map<String,List<OwnObject>> (with List as ArrayList)
Map<String,OwnObject[]>
Both work, and both are much leaner than the version using the Set<>. However, I'd like to keep the Set contract in place (uniqueness of entries).
One way would be to implement the logic myself. I could extend ArrayList and ensure the contract in add().
Are there frameworks implementing lightweight collections that honor the Set contract? Or am I missing something in the Java collections that I could use without ensuring uniqueness myself?
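For concreteness, a minimal sketch of the "extend ArrayList and enforce the contract in add()" idea (UniqueList is a made-up name, and the result is still a List, not a java.util.Set):

    import java.util.ArrayList;
    import java.util.Collection;

    // ArrayList that silently drops duplicates via equals(); add() becomes O(n).
    class UniqueList<E> extends ArrayList<E> {
        @Override
        public boolean add(E e) {
            if (contains(e)) return false;   // linear scan: cheap on memory, costly on time
            return super.add(e);
        }

        @Override
        public boolean addAll(Collection<? extends E> c) {
            boolean changed = false;
            for (E e : c) changed |= add(e);
            return changed;
        }
    }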
The solution I implemented is the following:
Map<String,OwnObject[]>
Adding to and removing from the array is done using Arrays.binarySearch() and two System.arraycopy() slices, which gives sorting and uniqueness as a side effect.
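A minimal sketch of those two operations, assuming OwnObject implements Comparable (the helper class is illustrative, not the actual code):

    import java.util.Arrays;

    final class SortedArrays {
        // Insert 'value' into the sorted array 'a', keeping it sorted and duplicate-free.
        static <T extends Comparable<T>> T[] insert(T[] a, T value) {
            int pos = Arrays.binarySearch(a, value);
            if (pos >= 0) return a;                    // already present: uniqueness for free
            int insertAt = -(pos + 1);                 // binarySearch encodes the insertion point
            T[] grown = Arrays.copyOf(a, a.length + 1);
            System.arraycopy(a, insertAt, grown, insertAt + 1, a.length - insertAt);
            grown[insertAt] = value;
            return grown;
        }

        // Remove 'value' from the sorted array 'a', if present.
        static <T extends Comparable<T>> T[] remove(T[] a, T value) {
            int pos = Arrays.binarySearch(a, value);
            if (pos < 0) return a;                     // not present
            T[] shrunk = Arrays.copyOf(a, a.length - 1);
            System.arraycopy(a, pos + 1, shrunk, pos, a.length - pos - 1);
            return shrunk;
        }
    }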

A better-performing way of retrieving select attributes from Collections of large objects in Java

Is there a way to iterate over a Collection and retrieve only a subset of attributes without loading/unloading each full object into the cache? 'Cos it seems like a waste to load/unload the WHOLE (possibly big) object when I need only some attribute(s), especially if the objects are big. It might cause unnecessary cache conflicts when loading such unnecessary data, right?
By 'load to cache' I mean 'process' that object on the processor. So there would be objects with, say, 10 attributes, and in the iterating loop I only use 1 of them. In such a scenario, I think it's a waste to load all the other 9 attributes from memory to the processor. Isn't there a way to extract only the attributes I need without loading the full object?
Also, does something like Google's Guava solve the problem internally?
THANK YOU!
It's not usually the first place to look, but it's certainly not impossible that you're running into cache sharing problems. If you're really convinced (from realistic profiling or analysis of hardware counters) that this is a bottleneck worth addressing, you might consider altering your data structures to use parallel arrays of primitives (akin to column-based storage in some DB architectures), e.g. one 'column' as a float[], another as a short[], a third as a String[], all indexed by the same identifier. This structure allows you to 'query' individual columns without loading into cache any columns that aren't currently needed.
I have some low-level algorithmic code that would really benefit from C's struct. I ran some microbenchmarks on various alternatives and found that parallel arrays were the most effective option for my algorithms (that may or may not apply to your own).
Note that a parallel-array structure will be considerably more complex to maintain and mutate than using Objects in java.util collections. So I'll reiterate - I'd only take this approach after you've convinced yourself that the benefit will be worth the pain.
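For illustration, a minimal sketch of that parallel-array (column-oriented) layout; the field names are invented for the example:

    // Structure-of-arrays layout: each "column" is its own primitive array,
    // so iterating over one attribute never pulls the others into cache.
    final class Records {
        final float[] price;
        final short[] quantity;
        final String[] name;

        Records(int capacity) {
            price = new float[capacity];
            quantity = new short[capacity];
            name = new String[capacity];
        }

        // A "query" over a single column touches only that array.
        double totalPrice() {
            double sum = 0;
            for (float p : price) sum += p;
            return sum;
        }
    }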
There is no way in Java to manage loading to processor caches, and there is no way to change how the JVM works with objects, so the answer is no.
Java is not a low-level language and hides such details from the programmer.
The JVM will decide how much of the object it loads. It might load the whole object as some kind of read-ahead optimization, or load only the fields you actually access, or analyze the code during JIT compilation and do a combination of both.
Also, how large are these objects you're worried about? I have rarely seen classes with more than a few fields, so I would not consider that big.

How can I know how much memory my cached objects are using?

We are trying to cache the results of database selects (in a HashMap) so we don't have to execute them multiple times. Whenever the database changes, we refresh the cached lists so the application picks up the changes.
Now we have a large number of lists to fetch, so loading the pick lists from the database takes too much time.
So I have some questions regarding this issue:
How can I find out how much memory a list is using? (I have tried forcing garbage collection and taking the difference in free memory, but there are many lists, so it takes too much time.)
How can I optimize the list refresh?
Thanks for the help.
How can I find out how much memory the list is using?
JProfiler
VisualVM
How can I optimize the list refresh?
Make sure you're using the correct collection type for your data.
Have a look here.
Also have a look at the Guava collections.
One last thing: ignis is right to advise you not to use System.gc(); this might be the very reason you're having performance problems. This is why.
First, while not wanting to generalize when it comes to performance problems, the issues you're seeing are unlikely to be purely down to memory use, though if the lists are large this could come into play when they're refreshed and a large number of objects become eligible for collection.
To solve issues relating to garbage collection there are a few rules of thumb, but it always comes down to breaking out a profiler and tuning the garbage collector - there's more on that here.
But before that, any loading from a database is going to involve iterating over a result set, so the biggest optimization you can make is to reduce the size of the result sets. There are a couple of ways to do that:
if you're using a map, try to use keys that don't require loading, and do the load when you get a miss (see the sketch after this list).
once loaded, only refresh the rows that have changed since you last loaded the data, though this obviously doesn't solve the start-up problem.
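For the first point, a minimal sketch of a load-on-miss cache built on Map.computeIfAbsent; the loader function here is a stand-in for whatever per-key query you actually run.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Cache that loads a value from the database only when its key is first requested.
    final class LazyCache<K, V> {
        private final Map<K, V> cache = new ConcurrentHashMap<>();
        private final Function<K, V> loader;   // e.g. a single-row SELECT by key

        LazyCache(Function<K, V> loader) {
            this.loader = loader;
        }

        V get(K key) {
            return cache.computeIfAbsent(key, loader);   // miss -> load, hit -> no DB access
        }

        void invalidate(K key) {
            cache.remove(key);   // the next get() reloads just this entry
        }
    }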
Now all that said, I would not recommend you write your own caching code in the first place. The reasons I say this are:
all modern RDBMSs cache, so provided your queries are performant, getting the actual result set should not be a bottleneck.
Hibernate provides not only ORM but a robust and well understood caching solution.
if you really need to cache massive datasets, use Coherence or similar - the cache can be started in a separate JVM and your application doesn't need to take the load hit.
You have two problems here: discovering how much memory is in use, and managing a cache. I'm not sure that the two are really closely related, although they may be.
Discovering how much memory an object uses isn't extremely difficult: one excellent article you can use for reference is "Sizeof for Java" from JavaWorld. It escapes the whole garbage collection fiasco, which has a ton of holes in it (it's slow, it doesn't count the object but the heap - meaning that other objects factor into your results that you may not want, etc.)
Managing the time to initialize the cache is another problem. I work for a company that has a data grid as a product, and thus I'm biased; be aware.
One option is not using a cache at all, but using a data grid. I work for GigaSpaces Technologies, and I feel ours is the best; we can load data from a database on startup, and hold your data there as a distributed, transactional data store in memory (so your greatest cost is network access.) We have a community edition as well as full-featured platforms, depending on your need and budget. (The community edition is free.) We support various protocols, including JDBC, JPA, JMS, Memcached, a Map API (similar to JCache), and a native API.
Other similar options include Coherence, which is itself a data grid, and Terracotta DSO, which can distribute an object graph on a JVM heap.
You can also look at the cache projects themselves: Two include Ehcache and OSCache. (Again: bias. I was one of the people who started OpenSymphony, so I've a soft spot for OSCache.) In your case, what would happen is not a preload of cache - note that I don't know your application, so I'm guessing and might be wrong - but a cache on demand. When you acquire data, you'd check the cache for data first and fetch from the DB only if the data is not in cache, and load the cache on read.
Of course, you can also look at memcached, although I obviously prefer my employer's offering here.
Be aware that invoking
System.gc()
or
Runtime.getRuntime().gc()
is a bad idea unless you really need to do that. You should leave it to the VM to decide when to free objects, unless profiling has shown that it's the only way to make the application go faster on your client's VM.
I tend to use YourKit for this sort of thing. It costs money but IMO is worth every penny (no connection other than as a customer).

When to use List<Long> instead of long[]?

There's something I really don't understand: a lot (see my comment) of people complain that Java isn't really OO because you still have access to primitive and primitive arrays. Some people go as far as saying that these should go away...
However I don't get it... Could you do efficiently things like signal processing (say write an FFT, for starters), writing efficient encryption algorithms, writing fast image manipulation libraries, etc. in Java if you hadn't access to, say, int[] and long[]?
Should I start writing my Java software by using List<Long> instead of long[]?
If the answer is "simply use higher-level libraries doing what you need" (for example, say, signal processing), then how are these libraries supposed to be written?
I personally use List most of the time, because it gives you a lot of convenience. You can also have concurrent collections, but not concurrent raw arrays.
Almost the only situation in which I use raw arrays is when I'm reading a large chunk of binary data, like in image processing. I'm concerned about instantiating e.g. Byte objects 100M times, though I have to confess I've never tried working with a Byte list that huge. I've noticed that for something like a 100KB file, List<Byte> works OK.
Most of the image processing examples etc. use arrays as well, so in this field it's actually more convenient to use raw arrays.
So in conclusion, my practical answer to this is: use wrappers unless
you are working with a very large array, say length > 10M (I'm too lazy to write a benchmark!),
you are working in a field where many examples or people prefer raw arrays (e.g. network programming, image processing),
you have found through experiments that there is a significant performance gain from changing to raw arrays, or
for whatever reason it's easier for you to work with raw arrays on that problem.
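For the experiments point, a deliberately naive sketch of such a comparison (no JMH, single run, so treat the numbers with suspicion):

    import java.util.ArrayList;
    import java.util.List;

    public class RawVsBoxed {
        public static void main(String[] args) {
            int n = 10_000_000;

            long[] raw = new long[n];
            for (int i = 0; i < n; i++) raw[i] = i;

            List<Long> boxed = new ArrayList<>(n);
            for (int i = 0; i < n; i++) boxed.add((long) i);   // every element is a separate Long object

            long t0 = System.nanoTime();
            long sumRaw = 0;
            for (int i = 0; i < n; i++) sumRaw += raw[i];
            long t1 = System.nanoTime();

            long sumBoxed = 0;
            for (long v : boxed) sumBoxed += v;                // unboxing on every access
            long t2 = System.nanoTime();

            System.out.printf("raw: %d ms, boxed: %d ms (sums %d / %d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumRaw, sumBoxed);
        }
    }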
In high performance computing, arrays of objects (as well as primitives) are essential as they map more robustly onto the underlying CPU architecture and behave more predictably for things such as cache access and garbage collection. With such techniques, Java is being used very successfully in areas where the received wisdom is that the language is not suitable.
However, if your goal is solely to write code that is highly maintainable and provably self consistent, then the higher level constructs are the obvious way to go. In your direct comparison, the List object hides the issue of memory allocation, growing your list and so on, as well as providing (in various implementations) additional facilities such as particular access patterns like stacks or queues. The use of generics also allows you to carry out refactoring with a far greater level of confidence and the full support of your IDE and toolchain.
An enlightened developer should make the appropriate choice for the use case they are approaching. Suggesting that a language is not "OO enough" because it allows such choices would lead me to suspect that the person either doesn't trust that other developers are as smart as they are or has not had a particularly wide experience of different application domains.
It's a judgment call, really. Lists tend to play better with generic libraries and have stuff like add, contains, etc, while arrays generally are faster and have built-in language support and can be used as varargs. Select whatever you find serves your purpose better.
Okay.
You need to know the size of an array at the time it is created, and you cannot change its size afterwards. A list, on the other hand, can grow dynamically after it has been created, and it has the add() method to do that.
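A quick illustration of that difference, using only the standard library:

    import java.util.ArrayList;
    import java.util.List;

    public class ArrayVsList {
        public static void main(String[] args) {
            long[] fixed = new long[3];              // length chosen at creation, cannot change
            fixed[0] = 42L;                          // only indices 0..2 are valid

            List<Long> growable = new ArrayList<>(); // starts empty
            growable.add(42L);                       // grows on demand via add()
            growable.add(7L);

            System.out.println(fixed.length + " vs " + growable.size());
        }
    }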
Have you gone through this link?
A nice comparison of Arrays vs List.
Array or List in Java. Which is faster?
List v/s Array
