I am considering using a HashMap as the backing structure for a QuadTree. I believe I can use Morton sequencing to uniquely identify each square of my area of interest. I know that my QuadTree will have a height of at most 16. From my calculations, that would lead to a matrix of 65,536 x 65,536, which should give me at most 4,294,967,296 cells. Does anyone know if that is too many elements for a HashMap? I could always write up a QuadTree using a Tree but I thought that I could get better performance with a HashMap.
Morton sequence of height 1 == (2x2) == 4
Morton sequence of height 2 == (4x4) == 16
Morton sequence of height 3 == (8x8) == 64
Morton Sequencing example for a tree of max height 3.
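For reference, here is a minimal sketch (names are illustrative) of how a Morton / Z-order code can be computed by interleaving the bits of the x and y cell indices. For a 65,536 x 65,536 grid, x and y each use 16 bits, so the code ranges over exactly the 4,294,967,296 values mentioned above.

static long mortonEncode(int x, int y) {
    // Interleave the lower 16 bits: bit i of x goes to bit 2*i of the code,
    // bit i of y goes to bit 2*i + 1.
    long code = 0;
    for (int i = 0; i < 16; i++) {
        code |= (long) ((x >> i) & 1) << (2 * i);
        code |= (long) ((y >> i) & 1) << (2 * i + 1);
    }
    return code;
}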
Here is what I know:
I will get data in lat/lon over a known rectangular area.
The data will not completely cover the whole area and will likely be consolidated into chunks somewhere in that area (worst case is data in all 4,294,967,296 cells).
The resolution of the data ends up breaking the area down into a 65k-by-65k grid of rectangles.
I also know that I will likely get about 10 queries for every insert/update of the data.
A HashMap is not a good idea.
There is a better solution, used in navigation systems:
Assign each quadtree cell a letter: A (left, upper), B (right, upper), C and D.
Now you can address each quad cell via a String:
ABACD: this identifies a cell at level 5 (A -> B -> A -> C -> D).
Search the internet for details on that specific quadtree coding.
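Here is a sketch of how such a letter code could be derived for a point. The answer only specifies A and B, so treating C as the lower-left and D as the lower-right quadrant is an assumption, and the method name is illustrative:

static String quadCode(double x, double y,
                       double minX, double minY, double maxX, double maxY,
                       int level) {
    StringBuilder sb = new StringBuilder(level);
    for (int i = 0; i < level; i++) {
        double midX = (minX + maxX) / 2;
        double midY = (minY + maxY) / 2;
        boolean right = x >= midX;
        boolean upper = y >= midY;
        // A = left/upper, B = right/upper, C = left/lower, D = right/lower (assumed)
        sb.append(upper ? (right ? 'B' : 'A') : (right ? 'D' : 'C'));
        // Shrink the bounding box to the chosen quadrant and descend one level.
        if (right) { minX = midX; } else { maxX = midX; }
        if (upper) { minY = midY; } else { maxY = midY; }
    }
    return sb.toString();
}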
Don't forget: you decide the subdivision rule (when to subdivide a cell into smaller ones), and that decides how many cells you get. The number you give is far too high.
It is only a theoretical calculation, which reminds me exactly of the Google Maps quadtree.
Further, it is important to know which type of quadtree you need for your application:
Point quadtree, region quadtree (bounding box), or line quadtree.
If you know of any existing quadtree implementation in Java, please post a comment, or edit this answer.
Further, you cannot implement a one-size-fits-all solution.
You have to know approximately how many elements you will support.
The theoretical maximum, which is not equal to the expected maximum, is not a good basis.
You have to know that because you must decide whether to store the data in main memory or on disk, which also influences the structure of the quadtree. The "ABCD" solution is suitable for dynamic loading from disk.
The Google approach stores images in the quadtree; this is different from the points you want to store, so I doubt that your calculation is realistic.
If you want to store all streets of all countries in the world, you can estimate that number, because the number of points is known (from OpenStreetMap, TomTom (Teleatlas), or Navteq (Nokia Maps)).
If you realize that you have to store the quadtree on disk, then the size is probably open-ended and limited only by the disk space.
I think that implementing a quad tree as an actual tree will give you better results. Implementing such a big database in a HashMap is a bad idea anyway, because if you have a lot of collisions, the performance of a HashMap degrades badly.
And apparently you know exactly how much data you have. In that case a HashMap is redundant: a HashMap is meant for when you do not know how much data there is. But here you know that every node of the tree has four children, so why even bother using a HashMap?
Also, your table is apparently at least 4 GB large. On most systems, that just barely fits in memory, and there is also Java VM overhead, so why store this in memory at all? It would be better to find a data structure that works well on disk. One such data structure for spatial data (which I assume you have, since you are using a quad tree) is an R-Tree.
Whoa, we're getting a number of concepts here all at once. First of all, what are you trying to achieve? Store a quad tree? A matrix of cells? Hash lookups?
If you want a quad tree, why use a hash map? You know there can be at most 4 child nodes per node. A hash map is useful for an arbitrary number of key-value mappings where quick lookup is necessary. If you're only ever going to have 4, a hash might not even be important. Also, while you can nest maps, it's a bit unwieldy. You're better off using some data structure or writing your own.
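To illustrate the point, a quadtree node with at most four children needs nothing more than four direct references; a minimal sketch with assumed names:

class QuadNode {
    QuadNode nw, ne, sw, se;   // at most four children, so no map is needed
    Object value;              // whatever payload the cell carries
}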
Also, what are you trying to achieve with the quad tree? Quickly looking up a cell in the matrix? Some coordinate mapping function might serve you much better there.
Finally, I'm not so much worried about that number of nodes in a hash map as I am by the amount purely on its own: 65536² cells would end up being 4 GiB of memory even at one byte per cell.
I think it would be best to pedal all the way back to the question "what is my goal with this data", then find out which data structures could help you with that (keeping requirements such as lookups in mind) while managing to fit it in memory.
Definitely use directly linked nodes for both space and speed reasons.
With data this big I'd avoid Java altogether. You'll be constantly at the mercy of the garbage collector. Go for a language closer to the metal: C or C++, Pascal/Delphi, Ada, etc.
Put the four child pointers in an array so that you can refer to leaves as packed arrays of 2-bit indices (a nice reason to use Ada, which will let you define such things with no bit fiddling at all). I guess this is Morton sequencing. I did not know that term.
This method of indexing children in itself is a reason to avoid Java. Including a child array in a node class instance will cost you a pointer plus an array size field: 8 or 16 bytes per node that aren't needed in some other languages. With 4 billion cells, that's a lot.
In fact you should do the math. If you use implicit leaf cells, you still have about 1 billion nodes to represent. If you use 32-bit indices to reference them (to save memory versus 64-bit pointers), the minimum is 16 bytes per node. Say node attributes are a mere 4 bytes. Then you have 20 gigabytes just for a full tree, even with none of the Java overhead.
Better have a good budget for RAM.
It is true that most typical quad-trees will simply use nodes with four child node pointers and traverse that, without any mention of hashmaps. However, it is also possible to write an efficient quadtree-like spatial indexing method that stores all its nodes in a big hashmap.
The benefit is that by using the Morton sequence (or another similarly generated value) as the key, you become able to retrieve nodes at any level with only one pointer dereference.
In "traditional" quadtree implementations we get cache misses due to repeated pointer dereferencing while looking up nodes, and this becomes the main bottleneck. So provided that the cost of encoding the coordinate space and getting a hash is lower than the cost of dereferencing the node pointers along the search path, such an implementation could be faster. Particularly if the map is very deep (having sparse locations requiring high precision).
You don't really need the Morton sequence, and you hardly need to think of it as a quadtree when doing this. A very simple example implementation:
In order to retrieve a quad of some level, use { x, y, level } as the hashmap key, where x and y are quantized to that level. You only need to include the level in the key if you are storing several levels in the same map.
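A minimal sketch of that idea (all names assumed): quads are stored in a single HashMap keyed by the quantized x, quantized y and the level, so any cell at any level is retrieved with one hash lookup.

import java.util.HashMap;
import java.util.Map;

public class HashQuadIndex<T> {

    // Key = cell coordinates quantized to a level, plus the level itself.
    private static final class Key {
        final int x, y, level;
        Key(int x, int y, int level) { this.x = x; this.y = y; this.level = level; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return x == k.x && y == k.y && level == k.level;
        }
        @Override public int hashCode() { return 31 * (31 * x + y) + level; }
    }

    private final Map<Key, T> cells = new HashMap<>();
    private final double minX, minY, maxX, maxY;

    public HashQuadIndex(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    // Quantize a coordinate to the grid of a given level (2^level cells per axis).
    private static int quantize(double v, double min, double max, int level) {
        int n = 1 << level;
        int i = (int) (((v - min) / (max - min)) * n);
        return Math.max(0, Math.min(n - 1, i));
    }

    public void put(double x, double y, int level, T value) {
        cells.put(new Key(quantize(x, minX, maxX, level),
                          quantize(y, minY, maxY, level), level), value);
    }

    // A single hash lookup retrieves the cell at any level: no pointer chasing.
    public T get(double x, double y, int level) {
        return cells.get(new Key(quantize(x, minX, maxX, level),
                                 quantize(y, minY, maxY, level), level));
    }
}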
Whether this is still a quadtree is up for discussion, but the functionality is the same.
Related
When I have an array and I want to remove one value from it, I need to shift the following elements to the left; but the idea is to do the shifting only once, after a number of values in the array have been set to null.
Of course this is micro-optimisation, and ArrayList (maybe LinkedList) would be a production-quality data structure for dynamic arrays.
Here you might keep an extra list of nulled entries. At a certain threshold one could do **System.arraycopy**s to remove the gaps. If there are many index-based inserts too, you might opt for keeping gaps, maybe collecting small gaps together.
This is a traditional technique in editors for text.
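A rough sketch of that deferred compaction (illustrative, not production code): once the number of nulled slots passes some threshold, sweep the array once and close all the gaps with System.arraycopy.

static int compact(Object[] a, int size) {
    int write = 0;
    int read = 0;
    while (read < size) {
        if (a[read] != null) {
            // Find the next run of non-null entries and move it in one copy.
            int runStart = read;
            while (read < size && a[read] != null) read++;
            int runLen = read - runStart;
            if (runStart != write) {
                System.arraycopy(a, runStart, a, write, runLen);
            }
            write += runLen;
        } else {
            read++;
        }
    }
    // Clear the tail so stale references can be garbage collected.
    java.util.Arrays.fill(a, write, size, null);
    return write;   // the new logical size
}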
For several data structures one might look through the Guava classes.
For instance copy-on-write data structures.
Or concurrency utilities, compacting in the background.
For a specific data structure & algorithm maybe someone else can give pointers.
I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. However, I have found: http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and will need to do so at inner nodes. But a good implementation should store points at the leaf level, and will roughly have double the data capacity there.
You can trivially store points, and expose them as rectangles with min=max to the tree management code.
Your data isn't small. Small would be something like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consist of a single leaf only. For good performance, an R-tree needs a good fan-out; k-d-trees always have a fan-out of 2, since they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming that you have a fan-out of 100 (for dynamic R-trees, you should then allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
I've used the ELKI R*-tree, and it is really fast. But it's not commercial friendly, unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use them, the best way is to work with the full ELKI framework, instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches. In particular for in-memory operations. In many cases, I wouldn't even go to an Octree, but just define the optimal grid for my use case, and then implement it using object lists. Keep sorted by one coordinate within each grid cell to further accelerate performance.
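As an illustration of that grid approach (all names assumed, shown in 2D for brevity): each cell holds a plain list of points, kept sorted by x so scans inside a cell can stop early.

import java.util.ArrayList;
import java.util.List;

public class GridIndex {
    public static class Pt {
        final double x, y;
        public Pt(double x, double y) { this.x = x; this.y = y; }
    }

    private final double minX, minY, cellSize;
    private final int nx, ny;
    private final List<Pt>[] cells;

    @SuppressWarnings("unchecked")
    public GridIndex(double minX, double minY, double maxX, double maxY, double cellSize) {
        this.minX = minX; this.minY = minY; this.cellSize = cellSize;
        this.nx = (int) Math.ceil((maxX - minX) / cellSize);
        this.ny = (int) Math.ceil((maxY - minY) / cellSize);
        this.cells = new List[nx * ny];
    }

    public void insert(Pt p) {
        int idx = cellIndex(p.x, p.y);
        if (cells[idx] == null) cells[idx] = new ArrayList<>();
        List<Pt> cell = cells[idx];
        // Keep each cell sorted by x so range scans within the cell can stop early.
        int pos = 0;
        while (pos < cell.size() && cell.get(pos).x < p.x) pos++;
        cell.add(pos, p);
    }

    private int cellIndex(double x, double y) {
        int ix = Math.min(nx - 1, Math.max(0, (int) ((x - minX) / cellSize)));
        int iy = Math.min(ny - 1, Math.max(0, (int) ((y - minY) / cellSize)));
        return iy * nx + ix;
    }
}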
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. There is an open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient by using binary hypercubes and prefix-sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.
I know this would be quite easy to do with PostGIS in a database, but I only expect my points to live for a couple of minutes max.
I would like to have a list of GPS points that live for 1-5min. When I add a new point, I would like to calculate a list of points from the aged list that are in a 1-10km radius of the new point.
Is it still recommended to perform these operations in a database like PostGIS or Mongo?
If not, how would one go about calculating this in memory?
You can compute a quadkey; it is similar to a quadtree, and Bing Maps uses it for its tile server. A quadkey can be computed with a Morton curve. You can download my PHP class hilbert-curve at phpclasses.org.
Databases are very slow, because they access slow disk memory.
For your solution, you need an indexing technique that geospatial DBs (PostGIS or Oracle Spatial) use, too. In your case the index stays in memory:
It is a (geo-)spatial index.
Quadtrees and R-trees are the state of the art (sometimes a k-d-tree is also used).
Quadtrees are easier to implement; R-trees can be twice as fast as quadtrees.
I would use the quad tree.
The most demanding part of your work is the delete operation for a point.
This is tricky and much slower than insert.
One solution, recommended by the original author of the quadtree, and which I would try: mark the points as deleted, but keep them in the quadtree.
Every x minutes, rebuild the whole quadtree by creating a new one from the data of the old one, keeping only the points that still exist.
Once the points are in the quadtree, a range search is reduced to only the points "nearby".
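A sketch of that mark-as-deleted plus periodic-rebuild idea. The QuadTree interface and all other names here are hypothetical stand-ins for whatever quadtree implementation you use:

// Hypothetical minimal quadtree interface, just enough for the sketch.
interface QuadTree {
    Iterable<TimedPoint> allPoints();
    void insert(TimedPoint p);
}

class TimedPoint {
    final double lat, lon;
    final long createdAtMillis;
    boolean deleted;   // tombstone: the point is removed logically, not structurally

    TimedPoint(double lat, double lon, long createdAtMillis) {
        this.lat = lat; this.lon = lon; this.createdAtMillis = createdAtMillis;
    }
}

class QuadTreeMaintenance {
    // Every few minutes, copy the still-live points into a fresh, empty quadtree
    // and swap it in; expired and tombstoned points are simply left behind.
    static void rebuild(QuadTree old, QuadTree fresh, long now, long maxAgeMillis) {
        for (TimedPoint p : old.allPoints()) {
            if (!p.deleted && now - p.createdAtMillis <= maxAgeMillis) {
                fresh.insert(p);
            }
        }
    }
}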
I was wondering whether the data layout Struct of Arrays (SoA) is always faster than an Array of Structs (AoS) or an Array of Pointers (AoP) for problems with inputs that only fit in RAM, programmed in C/Java.
Some days ago I was improving the performance of a molecular dynamics algorithm (in C); in short, the algorithm calculates the force interaction among particles based on their forces and positions.
Originally, the particles were represented by a struct containing 9 doubles: 3 for the particle forces (Fx, Fy, Fz), 3 for positions and 3 for velocity. The algorithm had an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve cache use.
So, now I have a struct with 9 arrays, where each array stores the forces, velocities and positions (x, y, z) of the particles. Each particle is accessed by its own array index.
I got a performance gain (for an input that only fits in RAM) of about 1.9x, so I was wondering whether changing from AoP or AoS to SoA will always perform better, and if not, in which types of algorithms this does not occur.
Much depends on how useful all the fields are. If you have a data structure where using one field means you are likely to use all of them, then an array of structs is more efficient, as it keeps together all the things you are likely to need.
Say you have time-series data where you only need a small selection of the possible fields. You might have all sorts of data about an event or point in time, but you only need, say, 3-5 of them. In this case a structure of arrays is more efficient because a) you don't cache the fields you don't use and b) you often access values in order, i.e. caching a field, its next value and the one after that is useful.
For this reason, time-series information is often stored as a collection of columns.
This will depend on how exactly you access the data.
Try to imagine, what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider the following things:
If the cache is absent, the performance should be the same, assuming that memory access latency is equal for all the elements of the data.
Now with the cache, if you access consecutive address locations, you will definitely get a performance improvement. This is exactly what happens in your case: with AoS, the locations are not consecutive in memory, so you must lose some performance there.
You must be accessing your data in for loops, like for(int i=0;i<1000000;i++) Fx[i] = 0. So if the data is large, you will easily see the performance benefit; if your data were small, it would not matter much.
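To make the contrast concrete, here is a small illustrative Java example (names assumed). Note that in Java an array of objects is really an array of references, so the second loop is closer to the AoP layout the question starts from:

public class SoaVsAos {
    static class Particle { double x, y, z, vx, vy, vz, fx, fy, fz; }

    public static void main(String[] args) {
        int n = 1_000_000;

        // SoA: all x-forces sit in one contiguous array, so this loop
        // streams through consecutive memory addresses.
        double[] fx = new double[n];
        for (int i = 0; i < n; i++) {
            fx[i] = 0.0;
        }

        // AoS/AoP: the same update chases one reference per particle and
        // drags the unused fields (position, velocity) through the cache too.
        Particle[] particles = new Particle[n];
        for (int i = 0; i < n; i++) {
            particles[i] = new Particle();
        }
        for (int i = 0; i < n; i++) {
            particles[i].fx = 0.0;
        }
    }
}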
Finally, there is the DRAM you are using: it also benefits when you access consecutive data. To understand why that is, you can refer to the wiki.
I am making a game for Android. I am kind of trying to keep the game a secret so I cannot reveal too much about it, so bear with me. Basically it runs in real-time (so I will be implementing a thread to update objects' coordinates) and its style is kind of like Duck Hunt; you have to hit objects that move around on the screen. However, the game has been lagging slightly from time to time (running on a Samsung Galaxy S, which is one of the higher-end devices), so I believe I need to use a better data structure.
Basically I am storing these objects in a doubly linked list instead of in an array because I wanted the maximum number of objects on the screen to be dynamic. In other words, I have a reference to the head and tail objects and all the game objects are connected just as a typical linked list would. This provides the following problems:
Searching if an object intersects with a given coordinate (by touching the screen) takes O(n) time (worst case scenario) because I have to traverse from head to tail and check the hitbox of each object.
Collision checking between two objects takes O(n^2) time (again, worst case scenario) because for every object I have to check if its hitbox intersects with that of all other objects in the linked list.
Another reason why I chose linked list is because I basically have two linked lists: one for active objects (i.e. objects that are on the screen) and one for inactive objects. So let's say that there will be at most 8 objects on the game screen, if the active linked lists would have 5 objects, the inactive list will have 3 objects. Using two linked list makes it so that whenever an object becomes inactive, I can simply append it to the inactive linked list instead of dereferencing it and waiting for the garbage collector to reclaim memory. Also, if I need a new object, I can simply take an object from the inactive linked list and use it in the active linked list instead of having to allocate more memory to create a new object.
I have considered using a multi-dimensional array. This method involves dividing up the screen into "cells" in which an object can lie. For example, on a 480x800 screen, if the object's height and width are both 80 pixels, I would divide the screens into a 6x10 grid, or in the context of Java code, I would make GameObject[6][10]. I could simply divide the coordinates (on the screen) of an object by 80 to get its index (on the grid), which could also provide an O(1) insertion. This could also make searching by coordinate O(1) as I could do the same thing with the touch coordinates to check the appropriate index.
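A sketch of that cell lookup, assuming a 480x800 screen, 80-pixel cells, and a stub GameObject standing in for my real class:

class GridExample {
    static final int CELL = 80;                  // assumed object width/height in pixels
    static class GameObject { /* game object fields go here */ }

    GameObject[][] grid = new GameObject[480 / CELL][800 / CELL];   // 6 x 10 cells

    void place(GameObject obj, int screenX, int screenY) {
        grid[screenX / CELL][screenY / CELL] = obj;   // O(1) insertion by coordinate
    }

    GameObject lookupByTouch(int touchX, int touchY) {
        return grid[touchX / CELL][touchY / CELL];    // O(1) search by coordinate
    }
}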
Collision checking might still take O(n^2) time because I still have to go through every cell in the grid (although this way I only have to compare with at most 8 cells that are adjacent to the cell I am examining currently).
However, the grid idea poses its own problems:
What if the screen has a resolution other than 480x800? Also, what if the grid cannot be evenly divided into 6x10? This might make the program more error prone.
Is using a grid of game objects expensive in terms of memory? Considering that I am developing for a mobile device, I need to take this into consideration.
So the ultimate question is: should I use linked list, multi-dimensional array, or something else that I haven't considered?
Below is an idea of how I am implementing collision:
private void checkForAllCollisions() {
    GameObject obj1 = mHead;
    while (obj1 != null) {
        GameObject obj2 = obj1.getNext();
        while (obj2 != null) {
            // getHitBox() returns a Rect object that encompasses the object
            if (Rect.intersects(obj1.getHitBox(), obj2.getHitBox())) {
                // Some collision logic
            }
            obj2 = obj2.getNext();   // advance, otherwise this inner loop never ends
        }
        obj1 = obj1.getNext();
    }
}
Common data structures for better collision-detection performance are quadtrees (in 2 dimensions) and octrees (3 dimensions).
If you want to stick with a list - is there a reason you're using LinkedList and not ArrayList? I hope you are currently using the built-in linked list implementation...
A Samsung Galaxy S has 512 MB of RAM - don't worry about taking up too much memory with a single data structure (yet). You'd have to have a relatively high number of relatively heavy objects before this data structure starts sucking significant amounts of RAM. Even if you have 1000 GameObject instances, and each instance in the data structure (including the associated weight of storing an instance in said data structure) was 10,000 bytes (which is pretty honking big) that's still only 9.5 megabytes of memory.
So there are definitely a few things we need to consider first.
1.) LL vs. Array
You said that you used a LL instead of an array because of the dynamic sizing the LL gives you. However, you can make an array dynamic enough by doubling its size whenever it fills up, which gives you O(1) amortized inserts (given that the array is kept unsorted). There will still be the occasional insert that takes O(n), since you must copy the data over. So, for a live game such as this, I believe the LL was a good choice.
2.) Your consideration of memory in terms of game objects.
This depends heavily on the structure of each GameObject, which you have not provided enough detailed information about. If each GameObject only contains a couple of ints or primitive types, chances are that you can afford the memory. If each GameObject is very memory costly and you are using a very large grid, then it will definitely cost tons of memory, but it will depend on both the memory usage of each GameObject as well as the size of the grid.
3.) If the resolution is different than 480x800, don't worry. If you are using a 6x10 grid for everything, consider using a density multiplier like so:
float scale = getContext().getResources().getDisplayMetrics().density;
and then multiply getWidth() and getHeight() by this scale to get the exact width and height of the device that's using your app. You can then divide these numbers by 6 and 10. However, note that some grids may look ugly on some devices, even if they are scaled correctly this way.
4.) Collision Handling
You mentioned that you were doing collision handling. Although I cannot say which is the best way to handle this in your situation (I'm still kind of confused how you're actually doing this based on your question), please keep in mind these alternatives:
a.) Linear Probing
b.) Separate Chaining
c.) Double Hashing
which are all admittedly hash-collision handling strategies, but may be implementable given your game (which, again, you know more about since you have kept some info back). However, I will say that if you use a grid, you may want to do something along the lines of linear probing, or create 2 hash tables altogether (if appropriate, but this seems to take away the grid layout you're trying to achieve, so, again, your discretion). Also note that you may use a tree of some sort (like a quadtree) to get faster collision detection. If you don't know about quadtrees, check them out here. A good way to think of one is dividing a square picture into separate pixels at the leaves and storing the average colors at the parent nodes for pruning. Just a way to help you think of quadtrees meaningfully.
Hope that helped. Again, maybe I'm misunderstanding something because of the vague nature of your question, so let me know. I'd be happy to help.