I have a 5-dimensional array where all indices range from 2 to 14. It contains all the possible permutations of a 5-number sequence.
This array holds 525,720 permutations, which takes quite a while to compute (5-7 seconds on my MacBook Pro). It should be used as a lookup table, to access a value in constant time - or more specifically, the value of a certain poker hand:
array[2][3][4][5][7] // 1
array[5][5][5][5][14] // 2000
Is there a faster way to create this array? I was thinking about persisting the array in some way and then loading it each time my program starts - but are there any efficient ways to do this?
I'm not very familiar with persistence. I don't really know whether it's worth it for me to load it from disk instead of creating it each time. I know about Hibernate, but that seems like a bit of overkill just to persist a single array.
Write it out via MappedByteBuffer. Create a big enough file, map it, get an asIntBuffer(), put in your numbers.
Then you can map it later and access it via IntBuffer.get(obvious-math-on-indices).
This is much faster than serialization.
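A minimal sketch of that approach, assuming the table has already been flattened to a plain int[] (the file name and sizes are just examples):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class HandTableFile {
    static final int SIZE = 13 * 13 * 13 * 13 * 13;   // one int slot per collapsed 5-card index

    // Write the precomputed values once (assumes values.length == SIZE).
    static void write(int[] values) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("handtable.bin", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, SIZE * 4L);
            buf.asIntBuffer().put(values);
            buf.force();                               // flush the mapped region to disk
        }
    }

    // Map the file read-only at startup; lookups are then plain IntBuffer.get() calls.
    static IntBuffer open() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("handtable.bin", "r");
             FileChannel ch = raf.getChannel()) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, SIZE * 4L).asIntBuffer();
        }
    }
}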
Not a direct answer to your original question, but...
If you are trying to do fast poker-hand evaluations, you want to make sure you read through The Great Poker Hand Evaluator Roundup.
Particularly: Cactus Kev's Poker Hand Evaluator.
I was involved in a long-running discussion about running the fastest possible 5- and 7-card hand poker evaluations, where most of this stuff comes from. Frankly, I don't see how these evaluations are going to get any faster until you can hold all C(52,5) = 2,598,960 hand values in a look-up table.
I would start by collapsing your dimensions for indexing:
assuming you have a set of indexes (from your first example, allowed values are 2 to 14):
i1 = 2
i2 = 3
i3 = 5
i4 = 6
i5 = 7
and created your array with
short array[] = new short[13 * 13 * 13 * 13 * 13];
...
then accessing each element becomes
array[(i1 - 2) * 13 * 13 * 13 * 13 + (i2 - 2) * 13 * 13 * 13 + (i3 - 2)
* 13 * 13 + (i4 - 2) * 13 + (i5 - 2)]
This array will take much less memory since you don't need to create an additional layer of objects along each dimension, and you can easily store the entire contents in a file and load it in a single read.
It will also be faster to traverse this array because you will be doing 1/5 the array lookups.
Tightening up the number of elements in each dimension also saves significant memory.
To keep your code clean this array should be hidden inside an object with a get and set method which takes the five indexes.
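For example, the wrapper could look roughly like this sketch (class and method names are illustrative):

/** Hides the flattened 13^5 lookup table behind get/set taking the five card ranks. */
class HandTable {
    private static final int BASE = 13;
    private final short[] values = new short[BASE * BASE * BASE * BASE * BASE];

    // Cards are given as ranks 2..14; subtract 2 and collapse into one flat index.
    private static int index(int c1, int c2, int c3, int c4, int c5) {
        return ((((c1 - 2) * BASE + (c2 - 2)) * BASE + (c3 - 2)) * BASE + (c4 - 2)) * BASE + (c5 - 2);
    }

    short get(int c1, int c2, int c3, int c4, int c5) {
        return values[index(c1, c2, c3, c4, c5)];
    }

    void set(int c1, int c2, int c3, int c4, int c5, short value) {
        values[index(c1, c2, c3, c4, c5)] = value;
    }
}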
What you probably want to do, if the computation of the array is too expensive, is serialize it. That basically places a binary copy of the data onto a storage medium (e.g. your hard disk) that you can very quickly load.
Serialization is pretty straightforward. Here's a tutorial that specifically addresses serializing arrays.
Since these values will presumably only change if your algorithm for evaluating a poker hand changes, it should be fine to just ship the serialized file. The file size should be reasonable if the data you are storing in each array element is not too large (if it's a 16-bit integer for example, the file will be around 1MB in size).
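For reference, a minimal sketch of that (the file name and the short[] element type are just assumptions):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

final class TablePersistence {
    // Serialize the computed table once.
    static void save(short[] values, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(values);
        }
    }

    // On later runs, deserialize instead of recomputing.
    static short[] load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (short[]) in.readObject();
        }
    }
}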
I'm not convinced that your number-of-poker-hand permutations is correct but, in any case...
You can make your array initialization approximately 120 times faster by storing every permutation of a given poker hand at once. That works because the "value" of a poker hand is not affected by the order of the cards.
First calculate the value for a hand. Say you have five cards (c1, c2, c3, c4, c5):
handValue = EvaluateHand(c1, c2, c3, c4, c5);
// Store the pre-calculated hand value in a table for faster lookup
hand[c1][c2][c3][c4][c5] = handValue;
Then assign the handValue to all permutations of that hand (i.e. the order of the cards doesn't change the handValue).
hand[c1][c2][c3][c5][c4] = handValue;
hand[c1][c2][c4][c3][c5] = handValue;
hand[c1][c2][c4][c5][c3] = handValue;
hand[c1][c2][c5][c3][c4] = handValue;
hand[c1][c2][c5][c4][c3] = handValue;
:
etc.
:
hand[c5][c4][c3][c2][c1] = handValue;
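One way to do that fan-out without writing all 120 assignments by hand is a small recursive swap-permutation helper, sketched here under the assumption that hand is the int[][][][][] table from above:

// Usage: storeAllOrderings(hand, handValue, c1, c2, c3, c4, c5);
static void storeAllOrderings(int[][][][][] hand, int handValue, int... cards) {
    permute(hand, cards, 0, handValue);
}

// Standard recursive swap permutation: fixes position k, recurses on the rest.
static void permute(int[][][][][] hand, int[] c, int k, int handValue) {
    if (k == c.length) {
        hand[c[0]][c[1]][c[2]][c[3]][c[4]] = handValue;
        return;
    }
    for (int i = k; i < c.length; i++) {
        int tmp = c[k]; c[k] = c[i]; c[i] = tmp;   // move card i into position k
        permute(hand, c, k + 1, handValue);
        tmp = c[k]; c[k] = c[i]; c[i] = tmp;       // undo the swap
    }
}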
A few things:
If this is for poker hands, you can't just store 2-14. You also need to store the suit as well. This really means you need to store 0-51. Otherwise you have no way of knowing if array[2][3][4][5][6] is a straight or a straight flush.
If you don't actually need to store the suits for your application, and you really want to do it in an array, use indexes of 0-12, not 2-14. This would allow you to use a 13×13×13×13×13 (371,293 member) array, instead of a 15×15×15×15×15 (759,375 member) array. Whenever you access the array, you'd just need to subtract 2 from each index. (I'm not sure where you got your 525,720 count...)
First of all, thanks for your enthusiasm!
So the straightforward approach seems to be to just serialize it. I'll try this first, to test the performance and see if it's sufficient (which I guess it is).
About the MappedByteBuffer... Is it correctly understood, that this makes it possible to load a fraction of the serialized array? So I load the values I need at run-time, instead of loading the whole array at startup?
@Jennie
The suits are stored in a different array. I'm not sure this is the best way to go, since there's a lot to consider about this particular problem. A flush is basically a high-card hand with a different value, so there's no real reason for me to store the same permutations (high cards) twice, but this is how it's done for now. I think the way to go is a hash function, so I can convert high-card values to flush values easily, but I haven't given this much thought.
About the indices, you're of course right. This is just for now. It's easier for me to test the value for "2 3 4 5 6" by just putting in the card values... Later, I'll trim the array!
Edit: fixed typos and tried to resolve the ambiguity.
I have a list of five-digit integers in a text file. The number of entries can only be as large as what a 5-digit integer can store. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so, you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set the bits to active whenever a number is entered. This has extremely fast lookups and extremely fast setting, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with by BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
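As a rough illustration of the hash-table option, reading the count from the first line and then the values (the file name is just an example):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class NumberLookup {
    public static void main(String[] args) throws FileNotFoundException {
        Set<Integer> numbers = new HashSet<>();
        try (Scanner in = new Scanner(new File("numbers.txt"))) {
            int count = in.nextInt();              // first line: how many integers follow
            for (int i = 0; i < count; i++) {
                numbers.add(in.nextInt());
            }
        }
        System.out.println(numbers.contains(11111));   // expected O(1) lookup
    }
}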
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that is 100,000 bits, i.e. 100,000 booleans to record the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudocode for the insertion logic is:
bitmap[number / 8] |= (1 << (number % 8));
and searching looks like:
(bitmap[number / 8] & (1 << (number % 8))) != 0;
If you understand the rationale, then there's even better news for you: Java already has BitSet, which does exactly what I was describing above.
So code looks like this:
BitSet bitset = new BitSet(100000);   // the constructor argument is the number of bits, not bytes
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you'll only ever have a maximum of 100,000 values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then, instead of adding a number like 11111, 22222 or 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure to "find", and here's a tiny amount of code you need to make it happen:
// Needs java.io.FileInputStream, java.util.Scanner, java.util.Set, java.util.stream.*.
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)   // supplier reads the remaining ints lazily
        .limit(scanner.nextInt())                          // the first int read here is the count
        .collect(Collectors.toSet());
Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not counting the values themselves, the space needed to hold the references alone would be around 80 megabytes, so that can be considered a lower bound for memory consumption.
We can further assume that the structure won't be completely filled, but will be quite dense.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep, and we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that, though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
[x,null,x,null]
/ \
[A1/B1,null,A2/B2,null] [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys (100 + 5 + 10,000 + 2), so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
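A rough Java sketch of that layout (class and field names are just placeholders), using one flat array plus per-dimension index maps:

import java.util.HashMap;
import java.util.Map;

class FourDimStore<V> {
    private static final int A = 100, B = 5, C = 10000, D = 2;   // dimension sizes from the question
    private final Object[] values = new Object[A * B * C * D];   // 10,000,000 slots
    private final Map<Object, Integer> aIdx = new HashMap<>();
    private final Map<Object, Integer> bIdx = new HashMap<>();
    private final Map<Object, Integer> cIdx = new HashMap<>();
    private final Map<Object, Integer> dIdx = new HashMap<>();

    // Assign indices in insertion order, as with the dictionaries above.
    private static int indexOf(Map<Object, Integer> map, Object key) {
        return map.computeIfAbsent(key, k -> map.size());
    }

    void put(Object a, Object b, Object c, Object d, V value) {
        int i = indexOf(dIdx, d)
              + D * (indexOf(cIdx, c)
              + C * (indexOf(bIdx, b)
              + B * indexOf(aIdx, a)));
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(Object a, Object b, Object c, Object d) {
        Integer ai = aIdx.get(a), bi = bIdx.get(b), ci = cIdx.get(c), di = dIdx.get(d);
        if (ai == null || bi == null || ci == null || di == null) return null;   // unseen key
        return (V) values[di + D * (ci + C * (bi + B * ai))];
    }
}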
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on if your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*-trees (like R-trees, but much faster for updates) or the PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient; it scales better with dimensionality, and its depth is limited by the maximum bit width of the values, i.e. a maximum value of '10000' requires 14 bits, so the depth will not exceed 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because such a query will on average return 33%-50% of the whole tree anyway.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the differing selectivity of your dimensions, which may result in something close to a full scan of the tree. For example, when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (the one with the least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).
Any ideas on how to get the 5 minimum numbers from a 2D array? I would like to know their indices as well. I'm using Processing, but I'm interested in the correct way to do this in general.
For example: I have a 4x4 array with the following values:
 3 72 64  4
12 45  9  7
86 34 81 55
31 19 18 21
I want to get the five lowest numbers in my array, which are 3, 4, 7, 9, 12. The problem is that I want to know their original indices as well.
Example:
Array[0,0] = 3
Array[0,3] = 4
Array[1,3] = 7
Array[1,2] = 9
Is there any formula or good programming way to do that?
There is actually a very good practice that is suited for your case. It's called the 'merge sort algorithm'. It will sort your values and then you just need to output the first 5 values. Here's a link specifically for java. Have fun coding and testing it! I did :D
Well, obviously you can just cycle through it and brute-force it with two for loops. Wanting the original indices makes it harder, since then you can't simply sort, which would be faster. If the data were sorted, or followed some kind of pattern, you could use a binary search, but from what you've given the data looks random, so there isn't much more you can do.
If you don't care about indexes, you can try sorts such as the merge sort mentioned by ERed, or other sorts (I prefer quicksort). Basically you treat the 2D array as a 1D array and assume each subsequent row is just a continuation of the previous one (it's all one giant row broken into pieces).
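One way to keep the original indices while still sorting is to carry (value, row, col) triples along with each value. A minimal sketch of that flatten-and-sort idea in plain Java (array contents taken from the question):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FiveSmallest {
    public static void main(String[] args) {
        int[][] grid = {
            { 3, 72, 64,  4},
            {12, 45,  9,  7},
            {86, 34, 81, 55},
            {31, 19, 18, 21}
        };

        // Collect every cell as (value, row, col), sort by value, keep the first five.
        List<int[]> cells = new ArrayList<>();
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[r].length; c++)
                cells.add(new int[]{grid[r][c], r, c});
        cells.sort(Comparator.comparingInt(cell -> cell[0]));

        for (int[] cell : cells.subList(0, 5))
            System.out.println("Array[" + cell[1] + "," + cell[2] + "] = " + cell[0]);
    }
}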
I am facing a problem where for a number of words, I make a call to a HashMultimap (Guava) to retrieve a set of integers. The resulting sets have, say, 10, 200 and 600 items respectively. I need to compute the intersection of these three (or four, or five...) sets, and I need to repeat this whole process many times (I have many sets of words). However, what I am experiencing is that on average these set intersections take so long to compute (from 0 to 300 ms) that my program takes a very long time to complete if I look at hundreds of thousands of sets of words.
Is there any substantially quicker method to achieve this, especially given I'm dealing with (sortable) integers?
Thanks a lot!
If you are able to represent your sets as arrays of bits (bitmaps), you can intersect them with AND operations. You could even implement this to run in parallel.
As an example (using jlordo's question): if set1 is {1,2,4} and set2 is {1,2,5}
Then your first set would be represented as: 00010110 (bits set for 1, 2, and 4).
Your second set would be represented as: 00100110 (bits set for 1, 2, and 5).
If you AND them together, you get: 00000110 (bits set for 1 and 2)
Of course, if you had a larger range of integers, then you will need more bytes. The beauty of bitmap indexes is that they take just one bit per possible element, thus occupying a relatively small space.
In Java, for example, you could use the BitSet data structure (not sure if it can do operations in parallel, though).
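A small sketch of that with java.util.BitSet, using the two example sets above:

import java.util.BitSet;

public class BitSetIntersection {
    public static void main(String[] args) {
        BitSet set1 = new BitSet();
        for (int i : new int[]{1, 2, 4}) set1.set(i);

        BitSet set2 = new BitSet();
        for (int i : new int[]{1, 2, 5}) set2.set(i);

        BitSet intersection = (BitSet) set1.clone();
        intersection.and(set2);               // in-place AND; only bits 1 and 2 survive
        System.out.println(intersection);     // prints {1, 2}
    }
}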
One problem with a bitmap-based solution is that even if the sets themselves are very small, if they contain very large (or even unbounded) numbers the bitmaps become very wasteful.
A different approach would be, for example, sorting the two sets, merging them and checking for duplicates. This can be done in O(n log n) time and O(n) extra space, given that the set sizes are O(n).
You should choose the solution that matches your problem description (input range, expected set sizes, etc.).
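For the sorted-merge approach, here is a minimal sketch of a two-pointer intersection over two sorted int arrays (it assumes each array has no duplicates, as befits sets):

import java.util.Arrays;

// Single merge-style pass over two sorted arrays: O(n + m) after the O(n log n) sorts.
static int[] intersectSorted(int[] a, int[] b) {
    int[] out = new int[Math.min(a.length, b.length)];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }   // element present in both sets
    }
    return Arrays.copyOf(out, k);
}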
The post http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset describes an implementation of an ordered primitive long set with set operations (union, minus and intersection). In my experience it's quite efficient for both dense and sparse value populations.
I'm about to program a copy of Mario in Java. I'm considering two representations/data structures for the levels, but I'm not sure which one I should choose:
A 2D integer array.
A quadtree to divide the level in pieces.
What are their advantages and disadvantages?
Definitely a 2D array of some type. Integers would be a good idea; characters would be an even better one.
Consider making a text file which is basically a "map". It could be 10 rows by 10 columns of text. A very simple map, in this case. Maybe you could use an "a" to signify grass, and a "b" to signify brick. In this way, you could even visually understand your map before you put it into action. But you could basically program your application to depend on this text file for the map.
For instance, consider this map:
bbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
So, that would look like a long brick road above 3 long rows of grass. Imagine making this scheme much more complicated, using all kinds of characters a-z and A-Z, even punctuation like $ or % or &.
Then you could create maps on the fly just by altering the text file, as in the sketch below. If you use integers, you are limited to only 10 symbols (and therefore only 10 kinds of map object). For instance, if you had more than 10 objects, how could you make a text file where the digits sit right next to each other? You wouldn't know how to separate the digits.
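A minimal sketch of loading such a text map into a 2D char array (the file path and tile letters are just examples):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Load a text-file map into a 2D char array; e.g. 'a' = grass, 'b' = brick.
static char[][] loadMap(String path) throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(path));
    char[][] map = new char[lines.size()][];
    for (int row = 0; row < lines.size(); row++) {
        map[row] = lines.get(row).toCharArray();
    }
    return map;
}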
The downside of a quadtree is that it's overly complicated and won't allow for quick access to specific elements.
Really, it's up to you. One way I can think of doing it is to have a 2D array of map tiles, each either clear or holding an object. Tiles would have a certain size. Of course, then you have the problem of what to do when something spans half a tile. Another option is an array of objects with the object data and location inside. This is less efficient, unless you sort the data.
If you think your maps are going to be very heavy on objects, I'd suggest a 2d tile map. Otherwise, I'd go with an array of objects or parallel arrays (your choice) specifying location and type of an item.