Lightweight collection w/ Set contract

Lightweight collection w/ Set contract - java

I'm developing in an application requiring lots of objects in memory. One of the largest structures is of the type
Map<String,Set<OwnObject>> (with Set as HashSet)
with OwnObject being a heavyweight object representing records in a database. The application works, but has a rather large memory footprint. Reading this Java Specialists newsletter from 2001, I've analyzed the memory usage of my large structure above. The HashSet uses a HashMap in the back, which in turn is quite a heavyweight object, and I guess this is where most of my additional memory goes.
Trying to optimize the memory usage of the structure, I tried around with multiple versions:
Map<String,List<OwnObject>> (with List as ArrayList)
Map<String,OwnObject[]>
Both work, and both are much more lean than the version using the Set<>. However, I'd like to keep the Set contract in place (uniqueness of entries).
One way would be to implement the logic myself. I could extend ArrayList and ensure the contract in add().
Are there frameworks implementing lightweight collections that honor the Set contract? Or do I miss something from the Java collections that I could use without ensuring uniqueness by myself?

The solution I implemented is the following:
Map<String,OwnObject[]>
Adding and removing to the array was done using Arrays.binarySearch() and 2 slice System.arraysCopy()s, by which sorting and uniqueness happen on the side.

Related

A better performing way retrieving select attributes Collections of Large Objects in Java

Is there a method where I can iterate a Collection and only retrieve just a subset of attributes without loading/unloading the each of the full object to cache? 'Cos it seems like a waste to load/unload the WHOLE (possibly big) object when I need only some attribute(s), especially if the objects are big. It might cause unnecessary cache conflicts when loading such unnecessary data, right?
When I meant to 'load to cache' I mean to 'process' that object via the processor. So there would be objects of ex: 10 attributes. In the iterating loop I only use 1 of those. In such a scenario, I think its a waste to load all the other 9 attributes to the processor from the memory. Isn't there a solution to only extract the attributes without loading the full object?
Also, does something like Google's Guava solve the problem internally?
THANK YOU!

It's not usually the first place to look, but it's not certainly impossible that you're running into cache sharing problems. If you're really convinced (from realistic profiling or analysis of hardware counters) that this is a bottleneck worth addressing, you might consider altering your data structures to use parallel arrays of primitives (akin to column-based database storage in some DB architectures). e.g. one 'column' as a float[], another as a short[], a third as a String[], all indexed by the same identifier. This structure allows you to 'query' individual columns without loading into cache any columns that aren't currently needed.
I have some low-level algorithmic code that would really benefit from C's struct. I ran some microbenchmarks on various alternatives and found that parallel arrays was the most effective option for my algorithms (that may or may not apply to your own).
Note that a parallel-array structure will be considerably more complex to maintain and mutate than using Objects in java.util collections. So I'll reiterate - I'd only take this approach after you've convinced yourself that the benefit will be worth the pain.

There is no way in Java to manage loading to processor caches, and there is no way to change how the JVM works with objects, so the answer is no.
Java is not a low-level language and hides such details from the programmer.
The JVM will decide how much of the object it loads. It might load the whole object as some kind of read-ahead optimization, or load only the fields you actually access, or analyze the code during JIT compilation and do a combination of both.
Also, how large do you worry your objects are? I have rarely seen classes with more than a few fields, so I would not consider that big.

Capacity adapting Java collections

Are there any Java libraries for maps and sets that alter their representation strategy based upon the capacity? I have an application where we have many many maps and sets, but most of the time they are small, usually 6 elements or less.
As such we've been able to extract some good memory improvements by writing some specialized maps and sets that just use arrays for small sizes and then default to standard Java Sets and Maps for larger capacities.
However, rolling our own specialized versions of set and maps seems kind of silly if there is already something off the shelf. I've looked at guava and the Apache collections and they do not seem to offer anything like this. Trove sounds like it is more memory efficient than the JDK's collections in general, but it isn't clear if it will attempt to minimize memory usage like this.

You may want to look at Clojure's persistent data structures. Although the "persistent" part may be overkill for you, it does exactly what you are looking for and is still really fast. There is a PersistentArrayMap that is promoted to a PersistentHashMap once the collection exceeds 16 entires.

I'm not aware of any such library.
The problem is that the representations that use the least amount of memory tend to:
be incompatible with the Java Collections APIs which makes integration hard, and
break down the abstraction boundaries; e.g. by adding link fields to element types.
These make it difficult to create a general purpose library along these lines. Then we add the problem that a representation that adapts to minimize heap space usage as the collection grows and shrinks will inevitably create a lot more garbage ... and that will have CPU performance implications.
Your approach is kind of interesting, though it doesn't give you anywhere like minimal memory usage. I assume that your classes are effectively wrappers for the standard implementation classes when the collections get big. If it works for you, I suggest that you stick with it.

When to use List<Long> instead of long[]?

There's something I really don't understand: a lot (see my comment) of people complain that Java isn't really OO because you still have access to primitive and primitive arrays. Some people go as far as saying that these should go away...
However I don't get it... Could you do efficiently things like signal processing (say write an FFT, for starters), writing efficient encryption algorithms, writing fast image manipulation libraries, etc. in Java if you hadn't access to, say, int[] and long[]?
Should I start writing my Java software by using List<Long> instead of long[]?
If the answer is "simply use higher-level libraries doing what you need" (for example, say, signal processing), then how are these libraries supposed to be written?

I personally use List most of the times, because it gives you a lot of convenience. You can also have concurrent collections, but not concurrent raw arrays.
Almost the only situation I use raw arrays is when I'm reading a large chunk of binary data, like image processing. I'm concerned instantiating e.g.Byte objects 100M times, though I have to confess I never tried working with that huge Byte list. I noticed when you have something like a 100KB file, List<Byte> works ok.
Most of the image processing examples etc. use array as well, so in this field it's actually more convenient to use raw arrays.
So in conclusion, my practical answer to this is
Use wrappers unless you are
Working with a very large array
like length > 10M (I'm too lazy to
write a benchmark!),
Working in a field
where many examples or people prefer
raw arrays (e.g. network
programming, image processing),
You found out there is a significant
performance gain by changing to raw arrays, by doing
experiments.
If for whatever
reason it's easier to work with raw
arrays on that problem for you.

In high performance computing, arrays of objects (as well as primitives) are essential as they map more robustly onto the underlying CPU architecture and behave more predictably for things such as cache access and garbage collection. With such techniques, Java is being used very successfully in areas where the received wisdom is that the language is not suitable.
However, if your goal is solely to write code that is highly maintainable and provably self consistent, then the higher level constructs are the obvious way to go. In your direct comparison, the List object hides the issue of memory allocation, growing your list and so on, as well as providing (in various implementations) additional facilities such as particular access patterns like stacks or queues. The use of generics also allows you to carry out refactoring with a far greater level of confidence and the full support of your IDE and toolchain.
An enlightened developer should make the appropriate choice for the use case they are approaching. Suggesting that a language is not "OO enough" because it allows such choices would lead me to suspect that the person either doesn't trust that other developers are as smart as they are or has not had a particularly wide experience of different application domains.

It's a judgment call, really. Lists tend to play better with generic libraries and have stuff like add, contains, etc, while arrays generally are faster and have built-in language support and can be used as varargs. Select whatever you find serves your purpose better.

Okay.
You need to know the size of an array at the time that it is created, but you cannot change its size after it has been created. But, a list can grow dynamically after it has been created, and it has the .Add() function to do that.
Have you gone through this link ?
A nice comparison of Arrays vs List.
Array or List in Java. Which is faster ?
List v/s Array

Collection as a metaphor for real world containers

I find modeling physical containers using collections very intuitive. I override/delegate add methods with added capacity constraints based on physical attributes such as volume of added elements, sort based on physical attributes, locate elements by using maps of position to element and so on.
However, when I read the documentation of collection classes, I get the impression that it's not the intended use, that it's just a mathematical construct and a bounded queue is just meant to be constrained by the number of elements and so forth.
Indeed I think that I unless I'm able to model this collection coherently, I should perhaps not expose this class as a collection but only delegate to it internally. Opinions?

Many structures in software development do not have a physical counterpart. In fact, some structures and algorithms are quite abstract, and do not model objects directly in the physical world. So just because an object does not serve as a suitable model for physical objects in the real world does not necessarily mean it cannot be used effectively to solve problems within a computer program.

Indeed I think that I unless I'm able to model this collection coherently, I should perhaps not expose this class as a collection but only delegate to it internally. Opinions?
Firstly, you don't want to get too hung up with the modeling side of software engineering. UML style models (usually) serve primarily as a way of organizing and expressing the developer's high level ideas about how an application should be implemented. There is no need to have a strict one-to-one relationship between the classes in the model and the implementation classes in the application code.
Second, you don't want to get too hung up about modeling "real world" (i.e. physical) objects and their behavior. Most of the "objects" that are used in a typical applications have no real connection with the real world. For example, a "folder" or "directory" is really little more than an analogy of the physical objects with the same names. There's typically no need for the computer concept to be constrained by the physical behavior of the real world objects.
Finally, there are a number of software engineering reasons why it is a bad idea to have your Java domain classes extend the standard collection types. For example:
The collections have a generic behavior that it is typically not appropriate to expose in a domain object. For instance, you typically don't want components of a domain object to be added and removed willy-nilly.
By extending a collection type, you are implicitly giving permission for some part of your application to treat domain objects as just lists or sets or whatever.
By extending collection classes, you would be hard-wiring implementation details into your domain APIs. For example, you would need to decide between extending ArrayList or LinkedList, and changing your mind would result (at least) in a binary API incompatibility ... and possibly worse.

Not entirely sure that I've understood you correctly. I gather that you want to know if you should expose the collection (subclassing) or wrap it (have a private field).
As Robert says, it really depends on the case. It's pretty much your choice. Nonetheless I'd say that in many cases the better choice is to not expose the collection because the constraints define the object you are modelling and are not fully congruent with the underlying collection. In short: users of your object shouldn't need to know that they are dealing with a collection unless it is really a collection with some speciality e.g. has all properties of a collection but allows only a certain number of objects.

When is a ConcurrentSkipListSet useful?

I just saw this data-structure on the Java 6 API and I'm curious about when it would be an useful resource. I'm studying for the scjp exam and I don't see it covered on Kathy Sierra's book, even though I've seen mock exam questions that mention it.

ConcurrentSkipListSet and ConcurrentSkipListMap are useful when you need a sorted container that will be accessed by multiple threads. These are essentially the equivalents of TreeMap and TreeSet for concurrent code.
The implementation for JDK 6 is based on High Performance Dynamic Lock-Free Hash Tables and List-Based Sets by Maged Michael at IBM, which shows that you can implement a lot of operations on skip lists atomically using compare and swap (CAS) operations. These are lock-free, so you don't have to worry about the overhead of synchronized (for most operations) when you use these classes.
There's currently no Red-Black tree based concurrent Map/Set implementation in Java. I looked through the literature a bit and found a couple papers that showed concurrent RB trees outperforming skip lists, but a lot of these tests were done with transactional memory, which isn't supported in hardware on any major architectures at the moment.
I'm assuming the JDK guys went with a skip list here because the implementation was well-known and because making it lock-free was simple and portable (using CAS). If anyone cares to clarify, please do. I'm curious.

skip lists are sorted lists, and efficient to modify with log(n) performance. in that regard it's like TreeSet. however there is no ConcurrentTreeSet. what I heard is that skip list is very easy to implement, that's probably why.
Anyway, when you need a concurrent, sorted and efficient set, you can use ConcurrentSkipListSet

These are useful when you need a set that can safely be accessed by multiple threads simultaneously. It also provides decent performance by being weakly consistent -- inserts can be made safely while you're iterating through the Set, but there's no guarantee that your Iterator will see that insert.

ConcurrentSkipListMap was a fantastic find when I needed to implement a replication layer for a home-grown cache. The Map aspects implemented the cache, and the underlying List aspects let me keep track of the order in which objects appeared in the cache. The "skip" aspect of that list made it efficient to remove an object from one spot in the list and bump it to the end when it was replaced in the cache.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.