I would like to find and reuse (if possible) a map implementation which has the following attributes:
While the number of entries is small, say < 32, the underlying storage should be a single array laid out as [key0, val0, key1, val1, ...]. This storage scheme avoids many small Entry objects and provides extremely fast lookups (even though they are sequential scans!) on modern CPUs, thanks to cache-friendly access and the lack of pointer indirection into the heap.
The map should maintain insertion order for key/value pairs regardless of the number of entries, similar to LinkedHashMap.
We are working on an in-memory representation of huge (millions of nodes/edges) graphs in Scala, and having such a Map would allow us to store Node/Edge attributes, as well as Edges per node, in a much more efficient way for the 99%+ of Nodes and Edges which have few attributes or neighbors, while preserving chronological insertion order for both attributes and edges.
If anyone knows of a Scala or Java map with such characteristics I would be much obliged.
Thanx
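For concreteness, here is a minimal sketch of the interleaved layout described above. The class name and initial size are illustrative, and removal/iteration are glossed over; this is not an existing library.

```java
import java.util.Arrays;

// Minimal sketch of a small, insertion-ordered map backed by a single
// interleaved array [key0, val0, key1, val1, ...].
final class TinyArrayMap<K, V> {
    private Object[] slots = new Object[8]; // room for 4 entries to start
    private int size;                       // number of entries

    @SuppressWarnings("unchecked")
    V get(K key) {
        // Sequential scan over a flat array: cache-friendly for small maps.
        for (int i = 0; i < size * 2; i += 2) {
            if (key.equals(slots[i])) {
                return (V) slots[i + 1];
            }
        }
        return null;
    }

    void put(K key, V value) {
        for (int i = 0; i < size * 2; i += 2) {
            if (key.equals(slots[i])) {
                slots[i + 1] = value; // overwrite preserves insertion order
                return;
            }
        }
        if (size * 2 == slots.length) {
            slots = Arrays.copyOf(slots, slots.length * 2);
        }
        slots[size * 2] = key;
        slots[size * 2 + 1] = value;
        size++;
    }

    int size() { return size; }
}
```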
While I'm not aware of any implementations that exactly fit your requirements, you may be interested in peeking at Flat3Map (source) in the Jakarta Commons library.
Unfortunately, the Jakarta libraries are rather outdated (e.g., no support for generics in the latest stable release, although it is promising to see that this is changing in trunk) and I usually prefer the Google Collections, but it might be worth your time to see how Apache implemented things.
Flat3Map doesn't preserve the order of the keys, unfortunately, but I do have a suggestion regarding your original post. Instead of storing the keys and values in a single array like [key0, val0, key1, val1, ...], I recommend using parallel arrays; that is, one array with [key0, key1, ...] and another with [val0, val1, ...]. Normally I am not a proponent of parallel arrays, but at least this way you can have one array of type K, your key type, and another of type V, your value type. At the Java level this has its own set of warts: you cannot write K[] keys = new K[32]; instead you'll need a bit of typecasting.
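A sketch of the parallel-array layout with the typecasting workaround mentioned above (ParallelArrayMap is an illustrative name; resizing is omitted for brevity):

```java
// Parallel arrays: one for keys, one for values. The unchecked cast from
// Object[] is safe only because the arrays never escape this class.
final class ParallelArrayMap<K, V> {
    @SuppressWarnings("unchecked")
    private final K[] keys = (K[]) new Object[32]; // new K[32] does not compile
    @SuppressWarnings("unchecked")
    private final V[] values = (V[]) new Object[32];
    private int size;

    V get(K key) {
        for (int i = 0; i < size; i++) {
            if (key.equals(keys[i])) {
                return values[i];
            }
        }
        return null;
    }

    void put(K key, V value) {
        for (int i = 0; i < size; i++) {
            if (key.equals(keys[i])) {
                values[i] = value; // overwrite keeps insertion order
                return;
            }
        }
        keys[size] = key;          // resizing omitted for brevity
        values[size] = value;
        size++;
    }

    int size() { return size; }
}
```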
Have you measured with a profiler whether LinkedHashMap is too slow for you? Maybe you don't need that new map - premature optimization is the root of all evil.
Anyway, for processing millions or more pieces of data per second, even the best-optimized map can be too slow, because in such cases every method call decreases performance as well. Then all you can do is rewrite your algorithms from Java collections to arrays (i.e. int -> object maps).
Under Java you can maintain a 2-D array (spreadsheet). I wrote a program which basically defines a 2-D array with 3 columns of data and 3 columns for looking up the data. The three columns are testID, SubtestID and Mode. This allows me to look up a value by testID and Mode, or any combination, or I can also reference by static placement. The table is loaded into memory at startup and referenced by the program. It is infinitely expandable and new values can be added as needed.
If you are interested, I can post a code source example tonight.
Another idea may be to maintain a database in your program. Databases are designed to organize large amounts of data.
Related
I have a big HashMap in Java storing mappings from String to Integer. It has 400K records. It runs OK, but I am wondering if there is a better way to optimize it in terms of memory usage. After the Map is initialized, it will only be searched; there are no other update operations.
I vaguely remember that I came across some suggestions to convert the string keys as int, but not sure about that. Please help or share your ideas on this.
Thanks.
I vaguely remember that I came across some suggestions to convert the string keys as int, but not sure about that.
If the strings keys are actually the string representations of integers, then it could make sense to convert them to Integer or Long objects ... using Integer.valueOf(String). You will save some memory, since the primitive wrapper classes use less memory than the corresponding String objects. The space saving is likely to be significant (maybe ~16 bytes versus ~40 bytes per key ... depending on your platform.)
The flip-side of this is that you would need to convert candidate keys from String to the real key type before doing a hashmap lookup. That conversion takes a bit of time, and typically generates a bit of garbage.
But if the String keys do not represent integers, then this is simply not going to work. (Or at least ... I don't know what "conversion" you are referring to ...)
Note also that the key type has to be Integer / Long rather than int / long. Generic type parameters must be reference types.
There may be 3rd-party collection implementations that would help as well ... depending on precisely how your data structure works; e.g. Trove, Guava, Fastutil. Try combining them with the String -> Integer preconversion ...
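As a sketch of the String -> Integer preconversion (assuming all keys really are the decimal representations of ints; the class and method names here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

final class KeyCompaction {
    // Convert String keys to Integer once, at load time. Non-numeric keys
    // would throw NumberFormatException, so this only works if the keys
    // really are integers.
    static Map<Integer, Integer> withIntegerKeys(Map<String, Integer> source) {
        Map<Integer, Integer> compact = new HashMap<>(source.size() * 2);
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            compact.put(Integer.valueOf(e.getKey()), e.getValue());
        }
        return compact;
    }

    // Each lookup must convert the candidate key before probing,
    // which costs a little time and garbage per call.
    static Integer lookup(Map<Integer, Integer> compact, String key) {
        return compact.get(Integer.valueOf(key));
    }
}
```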
On the suggestion of using a database: if
you don't need the query / update / transactional capabilities of a database, AND
you can afford the memory to hold the data in memory, AND
you can afford the startup cost of loading the data into memory,
then using a database is just a big, unnecessary performance hit on each lookup.
You might want to tune initialCapacity and loadFactor, and improve hashCode() to avoid collisions, if you want to read at a higher rate; if you have a lot of writes you might want to benchmark hashCode() as well.
Even if this is too big for your app, you might want to consider moving it outside of the JVM to some cache (Redis), or maybe a database, if you can afford the small read/write delay.
Writing the data to a database is ultimately the best solution if the data gets too big, but 400k is still doable in memory.
However, Java's built-in HashMap implementation uses separate chaining, and every key-value pair is stored in a separate Entry object. I've gotten great (30%) speed improvements and awesome (50%) memory improvements by building a quadratic probing implementation of Map.
I suggest you search around on the internet. There are plenty of good implementations around!
You could use Guava's ImmutableMap -- it's optimized for write-once data, and takes ~15% less memory than HashMap.
I'm looking for a high performing data structure that behaves like a set and where the elements will always be an array of ints. The data structure only needs to fulfill this interface:
trait SetX {
  def size: Int
  def add(element: Array[Int])
  def toArray: Array[Array[Int]]
}
The set should not contain duplicates, and this could be achieved using Arrays.equals(int[] a, int[] a2) - i.e. two arrays with the same values count as the same element.
Before creating it I have a rough idea of how many elements there will be but need resizing behaviour in case there are more than initially thought. The elements will always be the same length and I know what that is at the time of creation.
Of course I could use a Java HashSet (wrapping the arrays of course) but this is being used in a tight loop and it is too slow. I've looked at Trove and that works nicely (by using arrays but providing a TObjectHashingStrategy) but I was hoping that since my requirements are so specific there might be a quicker/more efficient way to do this.
Has anyone ever come across this or have an idea how I could accomplish this?
The trait above is Scala but I'm very happy with Java libs or code.
I should really say what I am doing. I am basically generating a large number of int arrays in a tight loop and at the end of it I just want to see the unique ones. I never have to remove elements from the set or anything else. Just add lots of int arrays to the set and at the end get out the unique ones.
Look at prefix trees. You can follow the tree structure during array generation, and at the end of generation you will know whether the generated array is already present in the set. A prefix tree would consume much less memory than an ordinary hash set.
If you are generating arrays with a not-very-small chance of their being equivalent, I suspect you are only taking numbers from a very limited range. That would simplify the prefix tree implementation, too.
I'm sure that a proper implementation would be faster than using any set implementation to keep solid arrays.
The downside of this solution is that you need to implement the data structure yourself, because it would be deeply integrated with the logic of your code.
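A rough sketch of the prefix-tree idea for fixed-length int arrays (illustrative only; children are HashMaps here for simplicity, where a limited value range would allow plain arrays as suggested above):

```java
import java.util.HashMap;
import java.util.Map;

// Prefix tree (trie) that deduplicates int arrays. Assumes all arrays
// have the same length, as stated in the question: an array is new if
// and only if at least one node on its path had to be created.
final class IntArrayTrie {
    private final Map<Integer, IntArrayTrie> children = new HashMap<>();
    private int size; // counts distinct arrays inserted at the root

    /** Returns true if the array was new, false if it was already present. */
    boolean add(int[] element) {
        IntArrayTrie node = this;
        boolean added = false;
        for (int v : element) {
            IntArrayTrie next = node.children.get(v);
            if (next == null) {
                next = new IntArrayTrie();
                node.children.put(v, next);
                added = true;
            }
            node = next;
        }
        if (added) size++;
        return added;
    }

    int size() { return size; }
}
```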
If you want high performance then write your own:
Call it ArraySetInt.
Sets are usually either implemented as trees or hashtable.
If you want an array-based set, this will slow down adding, and maybe deleting, but will speed up iterating, with low memory usage, etc.
First, look at how ArrayList is implemented.
Remove the Object storage and replace it with primitive int.
Then rename add() to put() and change it to insert in sorted order. Use System.arraycopy() to insert. Use Arrays.binarySearch() to find the insert position and to check whether the element already exists, in one step.
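The steps above can be sketched as follows (binarySearch for position and duplicate detection in one call, arraycopy for the shift, doubling on resize):

```java
import java.util.Arrays;

// Sorted-array int set: binarySearch returns either the index of an
// existing element (>= 0) or the encoded insertion point (-(pos) - 1).
final class ArraySetInt {
    private int[] data = new int[16];
    private int size;

    /** Returns true if the value was added, false if already present. */
    boolean put(int value) {
        int pos = Arrays.binarySearch(data, 0, size, value);
        if (pos >= 0) return false;              // duplicate: already in the set
        int insert = -pos - 1;                   // decode the insertion point
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        System.arraycopy(data, insert, data, insert + 1, size - insert);
        data[insert] = value;
        size++;
        return true;
    }

    boolean contains(int value) {
        return Arrays.binarySearch(data, 0, size, value) >= 0;
    }

    int size() { return size; }
}
```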
Without knowing how much data you have, or whether you do more reads than writes:
You should probably try (i.e. benchmark) the naive case of an array of arrays, or an array of specially wrapped arrays (i.e. a composite object holding the array and its cached hashcode). Generally, on small data sets not much beats looping through an array (e.g. a HashMap keyed by an Enum can actually be slower than looping).
If you have a really large amount of data and you're willing to make some compromises, you might consider a Bloom filter, but it sounded like you don't have much data.
I'd go for some classic solution wrapping the array by a class providing faster equals and hashCode. The hashCode can be simply cached and equals can make use of it for quickly saying no in case of differing arrays.
I'd avoid Arrays.hashCode as it uses a weak multiplier (31), which might lead to unneeded collisions. For a really fast equals you could make use of cryptography and say that two arrays are equal if and only if their SHA-1 hashes are equal (you'd be the first to find a collision :D).
The ArrayWrapper is rather simple and should be faster than using a TObjectHashingStrategy, as the map never has to look at the data itself (fewer cache misses), and it has the fastest and best possible hashCode and equals.
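A sketch of such a wrapper (Arrays.hashCode is used here for brevity, even though a stronger mixing function is suggested above; the class name is illustrative):

```java
import java.util.Arrays;

// Wraps an int[] so it can live in a HashSet/HashMap: the hash is computed
// once and cached, and equals checks the cached hashes before comparing
// contents, so differing arrays are rejected quickly.
final class ArrayWrapper {
    private final int[] data;
    private final int hash;

    ArrayWrapper(int[] data) {
        this.data = data;
        this.hash = Arrays.hashCode(data);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ArrayWrapper)) return false;
        ArrayWrapper other = (ArrayWrapper) o;
        // Cheap hash comparison first, full content comparison only on match.
        return hash == other.hash && Arrays.equals(data, other.data);
    }
}
```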
You could also look for some CompactHashSet implementation as it can be faster due to better memory locality.
Input : Let's say I have an object as Person. It has 2 properties namely
ssnNo - Social Security Number
name.
In one hand I have a List of Person objects (with unique ssnNo) and in the other hand I have a Map containing Person's ssnNo as the key and Person's name as the value.
Output : I need Person names using its ssnNo.
Questions :
Which approach to follow out of the 2 I have mentioned above, i.e. using a list or a map? (I think the obvious answer would be the map.)
If it is the map, is it always recommended to use a map whether the data set is large or small? I mean, are there any performance issues that come with a map?
Map is the way to go. Maps perform very well, and their advantages over lists for lookups get bigger the bigger your data set gets.
Of course, there are some important performance considerations:
Make sure you have a good hashcode (and corresponding equals) implementation, so that you data will be evenly spread across the buckets of the Map.
Make sure you pre-size your Map when you allocate it (if at all possible). The map will automatically resize, but the resize operation essentially requires re-inserting each prior element into the new, bigger Map.
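The pre-sizing arithmetic can be sketched as follows (0.75 is HashMap's default load factor; the helper name is illustrative, and Java 19+ also offers HashMap.newHashMap(int) for the same purpose):

```java
import java.util.HashMap;
import java.util.Map;

final class Maps {
    // Capacity must exceed expectedSize / loadFactor (0.75 by default) so
    // that expectedSize entries fit without triggering a resize.
    static <K, V> Map<K, V> presized(int expectedSize) {
        return new HashMap<>((int) (expectedSize / 0.75f) + 1);
    }
}
```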
You're right, you should use a map in this case. There are no performance issues with using a map compared to a list; performance is significantly better than that of a list when the data set is large. A map uses the key's hashcode to retrieve entries, in a similar way as arrays use indexes to retrieve values, which gives good performance.
This looks like a situation appropriate for a Map<Long, Person> that maps a social security number to the relevant Person. You might want to consider removing the ssnNo field from Person so as to avoid any redundancies (since you would be storing those values as keys in your map).
In general, Maps and Lists are very different structures, each suited for different circumstances. You would use the former whenever you want to maintain a set of key-value pairs that allows you to easily and quickly (i.e. in constant time) look up values based on the keys (this is what you want to do). You would use the latter when you simply want to store an ordered, linear collection of elements.
I think it makes sense to have a Person object, but it also makes sense to use a Map over a List, since the look up time will be faster. I would probably use a Map with SSNs as keys and Person objects as values:
Map<SSN,Person> ssnToPersonMap;
It's all pointers. It actually makes no sense to have a Map<ssn,PersonName> instead of a Map<ssn,Person>. The latter is the best choice most of the time.
Using map especially one that implement using a hash table will be faster than the list since this will allow you to get the name in constant time O(1). However using the list you need to do a linear search or may be a binary search which is slower.
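A sketch of the map-based approach discussed above (this Person class is a minimal stand-in for the one in the question; String keys are used here, though a Long key as suggested earlier would also work):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for the Person object from the question.
final class Person {
    final String ssnNo;
    final String name;

    Person(String ssnNo, String name) {
        this.ssnNo = ssnNo;
        this.name = name;
    }
}

final class SsnLookup {
    // Build the ssn -> name map once from the list; lookups afterwards
    // are constant time instead of a linear scan over the list.
    static Map<String, String> index(List<Person> people) {
        Map<String, String> bySsn = new HashMap<>(people.size() * 2);
        for (Person p : people) {
            bySsn.put(p.ssnNo, p.name);
        }
        return bySsn;
    }
}
```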
I am currently writing some code in Java, meant to be a little framework for a project which revolves around a database with some billions of entries. I want to keep it high-level, and the data retrieved from the database should be easily usable for statistical inference. I am resolved to use the Map interface in this project.
A core concept is mapping the attributes ("columns" in the database) to values ("cells") when handling single datasets (by which I mean a row in the database), for readable code: I use enum objects (named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all Strings (also not very large, maximum 40 characters or so).
There are 15 columns, so there are 15 enums, and the maps will have only that many entries, or fewer.
So it appears I will at times have a very large number of Map objects floating around, but with comparatively little payload (15 entries or fewer). My goal is to not make memory explode due to the implementation's memory overhead, compared to the actual payload. (Stretch goal: do the same with CPU usage ;] )
I was not really familiar with all the different implementations in the Java Collections to date, and when the problem dawned on me today, I looked into my to-date all-time favorite HashMap and was not happy with how much memory overhead it declares. I am sure that in addition to the standard implementations there are a number of implementations not shipped with Java. Googling my case did not bring up much of a result, so I am asking you:
Do you know a good implementation of Map for my use case (low entry count, low value size, enumerable keys, ...)?
I hope I made my use case clear, and am anxious for your input =)
Thanks a lot!
Stretch answer goal, absolutely optional and only if you got the time and knowledge:
What other implementations of collections are suitable for:
handling attribute vectors (the String things), and matrices for inference data (counts/probabilities) (matrices: here I am really clueless for now; I have done no serious math work with Java to date)
math libraries for statistical inference, see above
Use EnumMap, this is the best map implementation if you have enums as key, for both performance and memory usage.
The trick is that this map implementation is the only one that does not store the keys; it only needs a single array with the values (similar to an ArrayList of the values). There is only a little bit of overhead if some keys are not mapped to a value, but in most cases this won't be a problem because enums usually do not have too many instances.
Compared to HashMap, you additionally get a predictable iteration order for free.
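A sketch of the EnumMap approach (the enum constants here are illustrative stand-ins for the 15 columns in the question):

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative stand-in for the 15-column Attribute enum in the question.
enum Attribute { NAME, CITY, ZIP }

final class EnumMapDemo {
    // One small map per dataset (row), keyed by the enum. EnumMap stores
    // only a values array indexed by the enum ordinal, so no keys and no
    // Entry objects are allocated per entry.
    static Map<Attribute, String> row(String name, String city, String zip) {
        Map<Attribute, String> row = new EnumMap<>(Attribute.class);
        row.put(Attribute.NAME, name);
        row.put(Attribute.CITY, city);
        row.put(Attribute.ZIP, zip);
        return row;
    }
}
```

Iteration order over an EnumMap follows the declaration order of the enum constants, which gives the predictable ordering mentioned above.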
Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.
Look at
Trove4j : https://bitbucket.org/robeden/trove/
HPPC: http://labs.carrotsearch.com/hppc.html
FastUtil: http://fastutil.di.unimi.it/
When you find a bottleneck, you can switch to using a lower level API (more efficient)
You'll find many more choices if you look a bit more: What is the most efficient Java Collections library?
EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?
You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps, that means insertion and lookup become much slower because you have to scan the entire list. However, for small maps it is actually faster this way, since you don't have to calculate any hashcodes and only have to look at a small number of entries.
If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java
I habitually use HashMap in my programs, since I know it is usually the most efficient (if properly used) and can cope with large maps easily. I know about EnumMap which is very useful for enumeration keys, but often I am generating a small map which will never get very big, is likely to be discarded pretty soon, and has no concurrency issues.
Is HashMap<K,V> too complicated for these small, local and temporary uses? Is there another, simple, implementation which I can use in these cases?
I think I'm looking for a Map implementation which is analogous to ArrayList for List. Does it exist?
Added later after responses:
Here is a scenario where a slow but very simple implementation might be better -- when I have many, many of these Maps. Suppose, for example, I have a million or so of these tiny little maps, each with a handful (often less than three) of entries. I have a low reference rate -- perhaps I don't actually reference them before they are discarded most of the time. Is it still the case that HashMap is the best choice for them?
Resource utilisation is more than just speed -- I would like something that doesn't fragment the heap a lot and make GCs take a long time, for example.
It may be that HashMap is the right answer, but this is not a case of premature optimisation (or at least it may not be).
Added much later after some thought:
I decided to hand-code my own SmallMap. It is easy to make one with AbstractMap. I have also added a couple of constructors so that a SmallMap can be constructed from an existing Map.
Along the way I had to decide how to represent Entry objects and how to implement SmallSet for the entrySet method.
I learned a lot by coding (and unit-testing this) and want to share this, in case anyone else wants one. It is on github here.
There is no standard small implementation of Map in Java. HashMap is one of the best and most flexible Map implementations around, and is hard to beat. However, in the very small requirement area -- where heap usage and speed of construction is paramount -- it is possible to do better.
I have implemented SmallCollections on GitHub to demonstrate how this might be done. I would love some comments on whether I have succeeded. It is by no means certain that I have.
Although the answers offered here were sometimes helpful, they tended, in general, to misunderstand the point. In any case, answering my own question was, in the end, much more useful to me than being given one.
The question here has served its purpose, and that is why I have 'answered it myself'.
I think this is premature optimization. Are you having memory problems? Performance problems from creating too many maps? If not I think HashMap is fine.
Besides, looking at the API, I'm not seeing anything simpler than a HashMap.
If you are having issues, you could roll your own Map implementation that has very simple internals. But I doubt you would do better than the default Map implementations, plus you have the overhead of making sure your new class works. In that case there might be a problem with your design.
A HashMap is possibly the most lightweight and simple collection.
Sometimes the more efficient solution is to use a POJO. e.g. if your keys are field names and/or your values are primitives.
HashMap is a good choice because it offers average-case O(1) puts and gets. It does not guarantee ordering, though, unlike SortedMap implementations (e.g. TreeMap, with O(log n) puts and gets), but if you have no requirement for ordering then HashMap is better.
IdentityHashMap
If you are truly concerned about performance, consider IdentityHashMap.
This class tracks keys by their reference in memory rather than by the content of the key. This avoids having to call the hashCode method on your key, which may or may not be lengthy if not cached internally within its class.
This class provides constant-time performance for the basic operations (get and put).
Read the Javadoc carefully when choosing this implementation of Map.
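A small demonstration of the reference-equality behavior to watch out for:

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityDemo {
    public static void main(String[] args) {
        // IdentityHashMap compares keys with ==, not equals(), so two
        // distinct-but-equal String objects occupy two separate entries.
        Map<String, Integer> byIdentity = new IdentityHashMap<>();
        String a = new String("key");
        String b = new String("key"); // equal to a, but a different object
        byIdentity.put(a, 1);
        byIdentity.put(b, 2);
        System.out.println(byIdentity.size()); // prints 2; a HashMap would hold 1 entry
    }
}
```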
Map.of
Java 9 and JEP 269: Convenience Factory Methods for Collections brought us the convenience of literals syntax for concisely defining a map with Map.of (and list with List.of).
If your situation can use an unmodifiable map, use this simple syntax.
Note that the Map.of methods return a Map interface object. The underlying concrete Map class is unknown and may vary depending on your particular code and the version of Java. The Java implementation is free to optimize in its choice of a concrete class. For example, if your Map.of uses an enum for keys, the Map.of command might choose an EnumMap, or it might choose some yet-to-be-devised class optimized for very small maps.
Map<DayOfWeek, Employee> dailyWorker =
        Map.of(
                DayOfWeek.MONDAY, alice,
                DayOfWeek.TUESDAY, bob,
                DayOfWeek.WEDNESDAY, bob,
                DayOfWeek.THURSDAY, alice,
                DayOfWeek.FRIDAY, carol,
                DayOfWeek.SATURDAY, carol,
                DayOfWeek.SUNDAY, carol
        );
Premature Optimization
As others noted, you are likely falling into the trap of premature optimization. Your app's performance is not likely to be impacted significantly by your choice of small map implementation.
Other considerations
There are other considerations you should likely be focusing on, such as concurrency, tolerance for NULLs, and iteration-order.
Here is a graphic chart I made listing those aspects of the ten implementations of Map bundled with Java 11.
I agree with @hvgotcodes that it is premature optimization, but it is still good to know all the tools in the toolbox.
If you do a lot of iteration over what is in a map, a LinkedHashMap is usually quite a lot faster than a HashMap. If you have a lot of threads working with the map at the same time, a ConcurrentHashMap is often a better choice. I wouldn't worry about any Map implementation being inefficient for small sets of data. It is typically the other way around: an incorrectly constructed map easily gets inefficient with large amounts of data, if you have bad hash values or if something causes it to have too few buckets for its load.
Then of course there are cases when a HashMap makes no sense at all, like if you have three values which you will always index with the keys 0, 1 and 2 but I assume you understand that :-)
HashMap uses more or less memory (when created) depending on how you initialize it: more buckets mean more memory usage but faster access for large numbers of items; if you need only a small number of items, you can initialize it with a small value, which will produce fewer buckets that will still be fast (since each will receive only a few items). There is no waste of memory if you size it correctly (the tradeoff is basically memory usage vs. speed).
As for heap fragmentation, wasted GC cycles and whatnot, there is not much a Map implementation can do about them; it all falls back to how you configure it. Understand that this is not about Java's implementation: generic hashtables (i.e. ones that, unlike EnumMap, cannot assume anything about the key values) are the best possible general-purpose implementations of a map structure.
Android has an ArrayMap with the intent of minimizing memory. In addition to being in the core, it's in the v4 support library, which, theoretically, should be able to compile for the Oracle or OpenJDK JREs as well. Here is a link to the source of ArrayMap in a fork of the v4 support library on github.
There is an alternative called AirConcurrentMap that is more memory-efficient above 1K entries than any other Map I have found. It is faster than ConcurrentSkipListMap for key-based operations, faster than any Map for iteration, and has an internal thread pool for parallel scans. It is an ordered Map, i.e. a NavigableMap, and a ConcurrentMap. It is free for non-commercial no-source use, and commercially licensed with or without source. See boilerbay.com for graphs. Full disclosure: I am the author.
AirConcurrentMap conforms to the standards so it is plug-compatible everywhere, even for a regular Map.
Iterators are already very fast especially over 1K Entries. The higher-speed scans use a 'visitor' model with a single visit(k, v) callback that reaches the speed of Java 8 parallel streams. The AirConcurrentMap parallel scan exceeds Java 8 parallel streams by about 4x. The threaded visitor adds split() and merge() methods to the single-thread visitor that remind one of map/reduce:
static class ThreadedSummingVisitor<K> extends ThreadedMapVisitor<K, Long> {
    private long sum;

    // This is idiomatic
    long getSum(VisitableMap<K, Long> map) {
        sum = 0;
        map.getVisitable().visit(this);
        return sum;
    }

    @Override
    public void visit(Object k, Long v) {
        sum += v.longValue();
    }

    @Override
    public ThreadedMapVisitor<K, Long> split() {
        return new ThreadedSummingVisitor<K>();
    }

    @Override
    public void merge(ThreadedMapVisitor<K, Long> visitor) {
        sum += ((ThreadedSummingVisitor<K>) visitor).sum;
    }
}
...
// The threaded summer can be re-used in one line now.
long sum = new ThreadedSummingVisitor().getSum((VisitableMap) map);
I was also interested, and just as an experiment I created a map which stores keys and values directly in fields and allows up to 5 entries. It consumes about 4 times less memory and works 16 times faster than HashMap:
https://github.com/stokito/jsmallmap