I'm about to program a copy of Mario in Java. I'm considering two representations/data structures for the levels, but I'm not sure which one I should choose:
A 2D integer array.
A quadtree to divide the level in pieces.
What are their advantages and disadvantages?
Definitely a 2D array of some type. Integers would be a good idea, but characters would be an even better one.
Consider making a text file which is basically a "map". It could be 10 rows by 10 columns of text, a very simple map in this case. Maybe you could use an "a" to signify grass and a "b" to signify brick. That way you can visually understand your map before you put it into action, and you can program your application to load the map from this text file.
For instance, consider this map:
bbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
So that would look like a long brick road above 3 long rows of grass. Imagine making this scheme much more complicated, using all kinds of characters a-z and A-Z, and even punctuation like $ or % or &.
Then you could create maps on the fly just by altering the text file. If you use integers, you are limited to only 10 distinct values (the digits 0-9), i.e. 10 map objects. What if you have more than 10 objects? You couldn't make a text file where the digits sit next to each other, because you wouldn't know how to separate the digits.
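For instance, a loader in Java might look like this (a minimal sketch; the file name and tile characters are illustrative, not part of any fixed scheme):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class MapLoader {
    // Read the text file into a 2D char array: map[row][col] is one tile.
    public static char[][] load(String path) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path));
        char[][] map = new char[lines.size()][];
        for (int row = 0; row < lines.size(); row++) {
            map[row] = lines.get(row).toCharArray();
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        char[][] map = load("level1.txt");
        // 'a' = grass, 'b' = brick, as in the example map above
    }
}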
The downside of a quadtree is that it's overly complicated for a tile-based platformer, and it won't give you the direct indexed access to specific tiles that a 2D array does.
Really, it's up to you. One way I can think of doing it is to have a 2D array of map tiles, each either clear or holding an object. Tiles would have a certain size. Of course, then you have the problem of what to do if something occupies half a tile. Another option is an array of objects, each carrying its data and location. This is less efficient, unless you sort the data.
If you think your maps are going to be very heavy on objects, I'd suggest a 2D tile map. Otherwise, I'd go with an array of objects or parallel arrays (your choice) specifying the location and type of each item.
Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense and help achieve the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement this in Java, any answers should take that into account (e.g. no pointer magic; assume 8-byte references, etc.).
The situation
We have a somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering the size of the values themselves, the space needed to hold the references alone would be about 80 megabytes, so that can be considered a lower bound for memory consumption.
We can further assume that the structure won't be completely filled, but will be quite dense.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep, so we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that, though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
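As a rough sketch of the idea (assuming Integer keys and Object values for brevity):

import java.util.*;

// one map level per dimension: A -> B -> C -> D -> value
Map<Integer, Map<Integer, Map<Integer, Map<Integer, Object>>>> tree = new HashMap<>();
// one sorted set of seen keys per dimension, consulted while traversing for range queries
NavigableSet<Integer> aKeys = new TreeSet<>();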
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the per-dimension key sets but store everything in a single large map keyed by the complete 4-dimensional key.
Construction and queries were fast as well, but memory consumption was even higher, since each map node is now larger (it has to hold all dimensions of a key).
What we're thinking of at the moment
Building insertion-order index maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (each flattened to a 1-D array, of course), i.e. an n-ary tree. That tree would grow on demand: if we need a new array, then instead of creating a larger one and copying all the data, we'd just create a new block. Any needed non-leaf nodes would be created on demand, replacing the root if necessary.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
            [x,null,x,null]
           /               \
[A1/B1,null,A2/B2,null]   [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low, depending on the size of each array (with 4 dimensions and 4 values per dimension we'd have arrays of length 256, and thus get a maximum tree depth of 2-4 in most cases).
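A minimal sketch of such an insertion-order index map (the class name is an invention for illustration):

import java.util.HashMap;
import java.util.Map;

final class IndexMap<K> {
    private final Map<K, Integer> indices = new HashMap<>();

    // Returns a dense index: 0 for the first key seen, 1 for the second, ...
    int indexOf(K key) {
        // size() is evaluated before the new entry is added, so indices grow without gaps
        return indices.computeIfAbsent(key, k -> indices.size());
    }
}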
Does this make sense?
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys (100 + 5 + 10,000 + 2), so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
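Translated to Java, the whole idea might look roughly like this (a sketch under the example's dimension sizes; class and field names are made up):

import java.util.HashMap;
import java.util.Map;

final class FourDimStore {
    // 100 * 5 * 10000 * 2 = 10,000,000 slots, as in the example
    private final Object[] data = new Object[100 * 5 * 10000 * 2];
    private final Map<Object, Integer> aIdx = new HashMap<>();
    private final Map<Object, Integer> bIdx = new HashMap<>();
    private final Map<Object, Integer> cIdx = new HashMap<>();
    private final Map<Object, Integer> dIdx = new HashMap<>();

    // assigns the next dense index to a key the first time it is seen
    private static int index(Map<Object, Integer> m, Object key) {
        return m.computeIfAbsent(key, k -> m.size());
    }

    // Note: this sketch also assigns fresh indices on get(); a real
    // implementation would want a lookup that fails for unknown keys.
    void put(Object a, Object b, Object c, Object d, Object value) {
        data[flatIndex(a, b, c, d)] = value;
    }

    Object get(Object a, Object b, Object c, Object d) {
        return data[flatIndex(a, b, c, d)];
    }

    private int flatIndex(Object a, Object b, Object c, Object d) {
        return index(dIdx, d)
             + 2 * (index(cIdx, c) + 10000 * (index(bIdx, b) + 5 * index(aIdx, a)));
    }
}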
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on if your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*-trees (like the R-tree, but much faster for updates) or the PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient; it scales better with dimensionality, and its depth is limited by the maximum bit width of the values, i.e. the maximum value '10000' requires 14 bits, so the depth will not be more than 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because such a query will on average return 33%-50% of the whole tree anyway.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the different selectivity of your dimensions, which may result in something close to a full scan of the tree, for example when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (the one with the least selectivity).
To avoid this, especially for the PH-Tree, I would try multiplying your values by fixed constants: for example, multiply A by 100, B by 2000, C by 1 and D by 5000. This makes all values range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).
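As a tiny illustration, with the constants from the example (purely a sketch):

// raw key components: a in 0..100, b in 0..5, c in 0..10000, d in 0..2
long a = 42, b = 3, c = 7210, d = 1;
// after scaling, every component ranges over roughly 0..10000
long[] scaledKey = { a * 100, b * 2000, c, d * 5000 };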
I am building a distributional model (count-based) from text. Basically, for each n-gram (a sequence of words) I have to store a count, and I need reasonably quick access to the counts. For n=5, the number of possible 5-grams is (10^4)^5 = 10^20 even under a conservative estimate of 10k distinct words, which is far too high. But many combinations of these n-grams never occur in text, so a 5-dimensional array kind of structure is out of consideration.
I built a trie where each word is a node, so the trie is really wide, with a maximum depth of 5. That gave me considerable memory savings, but I still run out of memory (64GB) after training on enough files. To be fair, I'm not using any super-efficient Java practices here. Each node has a count and the index of its word as an int, plus a HashMap to store its children. I initially started with a list and tried to keep it sorted each time I added a child, but I was losing a lot of time there, so I moved to a HashMap. Even with a list, I would run out of memory after reading some more files.
So I guess I need to divide my task into parts and store each part to disk. But ultimately, when accessing the counts, I would need to merge these data structures. So I think the way forward is a disk-based solution where I know which file to access for n-grams that start with something (some sort of ordering). As I see it, the problem with a trie is that it's not very efficient to merge: I would need to load two parts into memory to merge them, and that wouldn't really work.
What approach would you recommend? I looked into a HashMap-encoding-based structure for language models (like the one berkeleylm uses). But in their use case they don't need to reconstruct the n-gram, so they just hash it and store the hash value as the context. I need to be able to access the context later.
Any suggestions? Is there any value in using a database? Can they do it without being in-memory?
I wouldn't use a HashMap; it's quite memory intensive. A simple sorted array should be better: you can then use binary search on it.
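For example, each trie node's children could be kept as parallel sorted arrays (a sketch; the field names are made up):

import java.util.Arrays;

// children stored as parallel arrays instead of a HashMap
int[] childWordIds = { 3, 17, 42, 99 };  // sorted word indices
Object[] childNodes = new Object[4];     // childNodes[i] belongs to childWordIds[i]

int pos = Arrays.binarySearch(childWordIds, 42);
if (pos >= 0) {
    Object child = childNodes[pos];      // found the child for word 42
}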
Maybe you could also try a binary prefix trie. First you create a single string, for example by interleaving the letters of the words into one string (I suppose you could also concatenate them, separated by a blank). This long string can then be stored in a binary trie. See CritBit1D for an example.
You could also use a multi-dimensional tree. Many trees are limited to 64-bit numbers, but you could turn the eight leading ASCII characters of every word into a 64-bit integer and then store that as a 5D key. That should be much more efficient than a 5D array. Multi-dimensional indexes include kd-trees, R-trees and quadtrees. The 5-gram count and the full 5-gram (including the remaining characters) can be stored separately in the VALUE associated with each 5D key.
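One possible way to pack the leading characters (a sketch, not taken from any particular library):

// Pack up to eight leading ASCII characters of a word into one 64-bit key.
static long packWord(String word) {
    long key = 0;
    for (int i = 0; i < 8; i++) {
        key <<= 8;
        if (i < word.length()) {
            key |= word.charAt(i) & 0x7F; // assumes ASCII input
        }
    }
    return key;
}
// a 5-gram then becomes a 5D key:
// long[] key = { packWord(w1), packWord(w2), packWord(w3), packWord(w4), packWord(w5) };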
If you are using Java you could try my very own tree. It's a prefix-sharing bitwise quadtree. It is very memory efficient, very well suited to larger datasets (1M entries upwards) and works natively with 'integer' rather than 'float'. It also has very good nearest neighbour search.
I am a software intern designing a program which parses data files outputted by an industrial simulator in order to do calculations on them.
The basic structure of the files is like this:
Property1
Timestep 1
0.000 3.141 5.131 etc...
Timestep 2
3.323 0.000 etc...
etc...
The data needs to be collected in some sort of data structure in order to allow for efficient calculations. There can be several million data points, though many are the same value.
My solution (nested HashMaps):
The main object, DataContainer, has a HashMap whose keys are property names. Each property name maps to another HashMap whose keys are timestep numbers. Each timestep number maps to yet another HashMap whose keys are data values, each paired with the number of times that value occurs within the timestep.
Quick Illustration:
DataContainer
    properties:
        property 1:
            time 1 - 0.000, 4 | 3.313, 10 etc...
            time 2
Looking forward to people's input.
If you are interested in efficiency, you would be better off creating custom classes with attributes / getters / setters for the properties.
HashMaps containing HashMaps etc:
take more space,
are slower,
are more tricky to use ... especially if you want to iterate the elements in a predictable order, and
negate the benefit of Java's intrinsic compile-time type safety.
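For instance, the nesting could be replaced by small typed classes along these lines (a sketch; all names are invented):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Timestep {
    final int number;
    final Map<Double, Integer> valueCounts = new HashMap<>(); // value -> occurrences
    Timestep(int number) { this.number = number; }
}

class Property {
    final String name;
    final List<Timestep> timesteps = new ArrayList<>();
    Property(String name) { this.name = name; }
}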
My idea:
class DataContainer {
    // Java has no SortedList; a TreeMap keyed by timestamp holding a
    // list of values (kept sorted) is one way to express the idea
    TreeMap<String, List<Double>> timestamps = new TreeMap<>();
}
I'd go for two arrays of the same length, like double[] value; int[] count;. This surely takes much less space than Map.Entry objects filled with boxed values. I'd make a simple class around them and put that into your Map.
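Something like this, for instance (a sketch; the class and field names are made up):

final class ValueCounts {
    final double[] values; // distinct values, sorted ascending
    final int[] counts;    // counts[i] = occurrences of values[i]

    ValueCounts(double[] values, int[] counts) {
        this.values = values;
        this.counts = counts;
    }

    int countOf(double v) {
        int pos = java.util.Arrays.binarySearch(values, v);
        return pos >= 0 ? counts[pos] : 0;
    }
}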
I have a 5-dimensional array where all indices range from 2 to 14. It contains all the possible permutations of a 5-number sequence.
This array holds 525720 permutations, which takes quite a while to compute (5-7 seconds on my MacBook Pro). It should be used as a lookup table to access a value in constant time, or more specifically, the value of a certain poker hand:
array[2][3][4][5][7] // 1
array[5][5][5][5][14] // 2000
Is there a faster way to create this array? I was thinking about persisting the array in some way and then loading it each time my program starts, but are there any efficient ways to do this?
I'm not very familiar with persistence. I don't really know if it's worth it for me to load it from disk instead of creating it each time. I know about Hibernate, but that seems like overkill just to persist a single array?
Write it out via MappedByteBuffer. Create a big enough file, map it, get an asIntBuffer(), put in your numbers.
Then you can map it later and access it via IntBuffer.get(obvious-math-on-indices).
This is much faster than serialization.
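A sketch of that approach (the file name and the flat-index math are placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class HandTableFile {
    static final int N = 525_720; // number of entries, as in the question

    // write the table once
    static void write(int[] table) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("hands.bin", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) N * Integer.BYTES);
            buf.asIntBuffer().put(table);
        }
    }

    // map it again on the next start; get(i) reads straight from the mapped file
    static IntBuffer open() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("hands.bin", "r");
             FileChannel ch = raf.getChannel()) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, (long) N * Integer.BYTES).asIntBuffer();
        }
    }
}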
Not a direct answer to your original question, but...
If you are trying to do fast poker-hand evaluations, you want to make sure you read through The Great Poker Hand Evaluator Roundup.
Particularly: Cactus Kev's Poker Hand Evaluator.
I was involved in a long-running discussion about the fastest possible 5- and 7-card poker evaluations, where most of this stuff comes from. Frankly, I don't see how these evaluations are going to get any faster until you can hold all C(52,5) or 2,598,960 hand values in a look-up table.
I would start by collapsing your dimensions for indexing:
assuming you have a set of indexes (from your first example, allowed values are 2 to 14):
i1 = 2
i2 = 3
i3 = 5
i4 = 6
i5 = 7
and created your array with
short[] array = new short[13 * 13 * 13 * 13 * 13];
...
then accessing each element becomes
array[(i1 - 2) * 13 * 13 * 13 * 13 + (i2 - 2) * 13 * 13 * 13 + (i3 - 2)
* 13 * 13 + (i4 - 2) * 13 + (i5 - 2)]
This array will take much less memory, since you don't need to create an additional layer of objects along each dimension, and you can easily store the entire contents in a file and load it in one read.
It will also be faster to traverse this array because you will be doing 1/5 the array lookups.
Also, tightening up the number of elements in each dimension saves significant memory.
To keep your code clean, this array should be hidden inside an object with get and set methods that take the five indexes.
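For example (a sketch, with the 2..14-to-0..12 shift folded into one helper):

final class HandTable {
    private final short[] table = new short[13 * 13 * 13 * 13 * 13];

    // collapse five card values (2..14 each) into one flat index
    private static int idx(int i1, int i2, int i3, int i4, int i5) {
        return (i1 - 2) * 13 * 13 * 13 * 13
             + (i2 - 2) * 13 * 13 * 13
             + (i3 - 2) * 13 * 13
             + (i4 - 2) * 13
             + (i5 - 2);
    }

    short get(int i1, int i2, int i3, int i4, int i5) {
        return table[idx(i1, i2, i3, i4, i5)];
    }

    void set(int i1, int i2, int i3, int i4, int i5, short value) {
        table[idx(i1, i2, i3, i4, i5)] = value;
    }
}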
What you probably want to do, if the computation of the array is too expensive, is serialize it. That basically places a binary copy of the data onto a storage medium (e.g. your hard disk) that you can very quickly load.
Serialization is pretty straightforward. Here's a tutorial that specifically addresses serializing arrays.
Since these values will presumably only change if your algorithm for evaluating a poker hand changes, it should be fine to just ship the serialized file. The file size should be reasonable if the data you are storing in each array element is not too large (if it's a 16-bit integer, for example, the file will be around 1MB in size).
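A minimal sketch ("table.ser" is an illustrative file name):

import java.io.*;

// write the precomputed table once
static void save(short[] table) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("table.ser"))) {
        out.writeObject(table);
    }
}

// load it at startup instead of recomputing
static short[] load() throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("table.ser"))) {
        return (short[]) in.readObject();
    }
}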
I'm not convinced that your number-of-poker-hand permutations is correct but, in any case...
You can make your array initialization approximately 120 times faster by storing the value for every permutation of a given poker hand at once. That works because the "value" of a poker hand is not affected by the order of the cards.
First calculate the value for a hand. Say you have five cards (c1, c2, c3, c4, c5):
handValue = EvaluateHand(c1, c2, c3, c4, c5);
// Store the pre-calculated hand value in a table for faster lookup
hand[c1][c2][c3][c4][c5] = handValue;
Then assign the handValue to all permutations of that hand (i.e. the order of the cards doesn't change the handValue).
hand[c1][c2][c3][c5][c4] = handValue;
hand[c1][c2][c4][c3][c5] = handValue;
hand[c1][c2][c4][c5][c3] = handValue;
hand[c1][c2][c5][c3][c4] = handValue;
hand[c1][c2][c5][c4][c3] = handValue;
:
etc.
:
hand[c5][c4][c3][c2][c1] = handValue;
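Rather than writing out all 120 assignments by hand, a small recursive helper can enumerate the orderings (a sketch; it assumes the 5D table holds shorts and that c holds the five card values):

// assign handValue to every ordering of the five cards, via in-place swaps
static void storeAllOrders(short[][][][][] hand, int[] c, short handValue) {
    permute(hand, c, 0, handValue);
}

static void permute(short[][][][][] hand, int[] c, int k, short v) {
    if (k == c.length) {
        hand[c[0]][c[1]][c[2]][c[3]][c[4]] = v;
        return;
    }
    for (int i = k; i < c.length; i++) {
        int t = c[k]; c[k] = c[i]; c[i] = t; // swap positions k and i
        permute(hand, c, k + 1, v);
        t = c[k]; c[k] = c[i]; c[i] = t;     // swap back
    }
}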
A few things:
If this is for poker hands, you can't just store 2-14; you also need to store the suit. This really means you need to store 0-51. Otherwise you have no way of knowing whether array[2][3][4][5][6] is a straight or a straight flush.
If you don't actually need to store the suits for your application, and you really want to do it in an array, use indexes of 0-12, not 2-14. This would allow you to use a 13×13×13×13×13 (371,293 member) array, instead of a 15×15×15×15×15 (759,375 member) array. Whenever you access the array, you'd just need to subtract 2 from each index. (I'm not sure where you got your 525,720 count...)
First of all, thanks for your enthusiasm!
So the straightforward approach seems to be to just serialize it. I think I'll try this first to test the performance and see if it's sufficient (which I guess it is).
About the MappedByteBuffer... Do I understand correctly that this makes it possible to load a fraction of the serialized array, so I can load the values I need at run-time instead of loading the whole array at startup?
@Jennie
The suits are stored in a different array. I'm not sure this is the best way to go, since there's lots of stuff to consider about this particular problem. A flush is basically a high-card hand with a different value, so there's no real reason for me to store the same permutations (high cards) twice, but this is the way to do it for now. I think the way to go is a hash function, so I can convert high-card values to flush values easily, but I haven't given this much thought.
About the indices, you're of course right. This is just for now; it's easier for me to test the value for "2 3 4 5 6" by just putting in the card values. Later, I'll cut the array down!
We were just assigned a new project in my data structures class: generating text with Markov chains.
Overview
Given an input text file, we create an initial seed of length n characters. We add that to our output string and choose our next character based on frequency analysis.
This is the cat and there are two dogs.
Initial seed: "Th"
Possible next letters -- i, e, e
Therefore, probability of choosing i is 1/3, e is 2/3.
Now, say we choose i. We add "i" to the output string, our seed becomes "hi", and the process continues.
My solution
I have 3 classes: Node, ConcreteTrie, and Driver.
Of course, the ConcreteTrie class isn't a trie in the traditional sense. Here is how it works:
Given the sentence with k=2:
This is the cat and there are two dogs.
I generate nodes Th, hi, is, ..., gs, s.
Each of these nodes has children that are the letters that follow it. For example, node Th would have children i and e. I maintain counts in each of those nodes so that I can later generate the probabilities for choosing the next letter.
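The core counting step might look like this (a sketch of the idea, not of your exact classes):

import java.util.HashMap;
import java.util.Map;

// for every k-character seed, count how often each character follows it
String text = "This is the cat and there are two dogs.";
int k = 2;
Map<String, Map<Character, Integer>> counts = new HashMap<>();
for (int i = 0; i + k < text.length(); i++) {
    String seed = text.substring(i, i + k);
    char next = text.charAt(i + k);
    counts.computeIfAbsent(seed, s -> new HashMap<>())
          .merge(next, 1, Integer::sum);
}
// counts.get(seed).get(c) = number of times c followed seed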
My question:
First of all, what is the most efficient way to complete this project? My solution seems to be very fast, but I really want to knock my professor's socks off. (On my last project, a variation of the edit-distance problem, I did an A* search, a genetic algorithm, a BFS, and simulated annealing, and I know that the problem is NP-hard.)
Second, what's the point of this assignment? It doesn't really seem to relate to much of what we've covered in class. What are we supposed to learn?
On the relevance of this assignment to what you covered in class (your second question): the idea of a 'data structures' class is to expose students to the very many structures frequently encountered in CS: lists, stacks, queues, hashes, trees of various types, graphs at large, matrices of various creed and greed, etc., and to provide some insight into their common implementations, their strengths and weaknesses, and generally their various fields of application.
Since almost any game/puzzle/problem can be mapped to some set of these structures, there is no lack of subjects upon which to base lectures and assignments. Your class seems interesting because, while keeping some focus on these structures, you are also given a chance to discover real applications.
For example, in a thinly disguised fashion, the "cat and two dogs" thing is an introduction to statistical models applied to linguistics. Your curiosity and motivation prompted you to make the connection with Markov models, and that's a good thing, because chances are you'll meet "Markov" a few more times before graduation ;-) and certainly in a professional life in CS or a related domain. So, yes! It may seem that you're butterflying around many applications, but so long as you get a feel for which structures and algorithms to select in particular situations, you're not wasting your time!
Now, a few hints on possible approaches to the assignment
The trie seems like a natural fit for this type of problem. Maybe you can ask yourself, however, how this approach would scale if you had to index, say, a whole book rather than this short sentence. It seems to scale mostly linearly, although this depends on the cost of each choice at the three hops in the trie (for this 2nd-order Markov chain): as the number of choices increases, picking a path may become less efficient.
A possible alternative storage for building the index is a stochastic matrix (actually a 'plain', if sparse, matrix during the statistics-gathering process, turned stochastic at the end when you normalize each row, or column, depending on how you set it up, to sum to one (100%)). Such a matrix would be roughly 729 x 28 and would allow indexing, in one single operation, a two-letter tuple and its associated following letter. (I got 28 by including the "start" and "stop" signals; details...)
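As a rough sketch of that matrix variant (sizes follow the 729 x 28 figure above; the tuple-to-row encoding is left out):

// counts[row][col]: row encodes a two-letter tuple, col the following letter
// (27 symbols per position including a start/stop marker: 27 * 27 = 729 rows, 28 columns)
int[][] counts = new int[27 * 27][28];

// ... gather counts while scanning the text, then normalize each row:
double[][] stochastic = new double[counts.length][28];
for (int r = 0; r < counts.length; r++) {
    int sum = 0;
    for (int c = 0; c < 28; c++) sum += counts[r][c];
    if (sum > 0) {
        for (int c = 0; c < 28; c++) {
            stochastic[r][c] = counts[r][c] / (double) sum; // each row now sums to 1
        }
    }
}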
The cost of this more efficient indexing is extra space. Space-wise, the trie is very efficient, storing only the letter triplets that actually exist; the matrix, however, wastes some space (you can bet that in the end it will be very sparsely populated, even after indexing much more text than the "dog/cat" sentence).
This size vs. CPU compromise is very common, although some algorithms/structures are sometimes better than others on both counts... Furthermore, the matrix approach wouldn't scale nicely, size-wise, if the problem were changed to base the choice of letters on the preceding, say, three characters.
Nonetheless, maybe look into the matrix as an alternate implementation. It is very much in the spirit of this class to try various structures and see why/where they are better than others (in the context of a specific task).
A small side trip you can take is to create a tag cloud based on the probabilities of the letter pairs (or triplets): both the trie and the matrix contain all the data necessary for that; the matrix, with all its interesting properties, may be more suited to this.
Have fun!
You're using a bigram approach with characters, but it's usually applied to words, because the output is more meaningful that way when using just a simple generator, as in your case.
1) From my point of view, you're doing it right. But maybe you should try slightly randomizing the selection of the next node, e.g. select a random node from the 5 highest. If you always select the node with the highest probability, your output string will be too uniform.
2) I've done exactly the same homework at my university. I think the point is to show students that Markov chains are powerful, but that without extensive study of the application domain, the generator's output will be ridiculous.