Optimize removing process complexity

When I have an array and want to remove one value from it, I need to shift the following elements to the left. The idea is to do the shifting only once, after n values in the array have been set to null.

Of course this is a micro-optimisation, and ArrayList (or maybe LinkedList) would be the production-quality data structure for dynamic arrays.
Here you might keep an extra list of the nulled entries. At a certain threshold you could do **System.arraycopy** calls to close the gaps. If there are many index-based inserts too, you might opt for keeping the gaps, maybe merging small gaps together.
This is a traditional technique in text editors.
For several ready-made data structures, one might search through the Guava classes.
For instance, copy-on-write data structures.
Or concurrency: compacting in the background.
For a specific data structure and algorithm, maybe someone else can give pointers.
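As a rough sketch of the "null now, shift once later" idea: the class name, the threshold parameter and the method names below are made up for illustration, and the compaction uses a single read/write pass rather than the System.arraycopy calls mentioned above.

```java
import java.util.Arrays;

/** Hypothetical sketch: removal only nulls a slot; the gaps are closed later in one pass. */
class LazyRemovalArray {
    private Object[] items;
    private int size;             // slots in use, including nulled ones
    private int holes;            // number of nulled slots
    private final int threshold;  // assumed tuning knob: compact after this many removals

    LazyRemovalArray(int capacity, int threshold) {
        this.items = new Object[capacity];
        this.threshold = threshold;
    }

    void add(Object value) {
        if (value == null) throw new IllegalArgumentException("null marks a removed slot");
        if (size == items.length) {
            if (holes > 0) compact();   // reclaim removed slots before growing
            if (size == items.length) items = Arrays.copyOf(items, Math.max(1, size * 2));
        }
        items[size++] = value;
    }

    /** Nulls the first matching slot; nothing is shifted until the threshold is reached. */
    boolean remove(Object value) {
        for (int i = 0; i < size; i++) {
            if (value.equals(items[i])) {
                items[i] = null;
                if (++holes >= threshold) compact();
                return true;
            }
        }
        return false;
    }

    /** Single left-to-right pass: every surviving element is moved at most once. */
    void compact() {
        int write = 0;
        for (int read = 0; read < size; read++) {
            if (items[read] != null) items[write++] = items[read];
        }
        Arrays.fill(items, write, size, null);
        size = write;
        holes = 0;
    }

    int liveCount() { return size - holes; }

    Object[] toArray() {
        compact();
        return Arrays.copyOf(items, size);
    }
}
```

With a threshold of n, removing n values costs one shifting pass instead of n of them, at the price of index positions being unstable until the next compaction.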

Related

What is the purpose of a priority queue when heaps exist?

I know priority queues tend to use heaps, but what is the point of a priority queue when they basically seem the same as heaps? I initially thought all priority queues use hash maps to keep track of all objects' locations in the heap, making it easier to find and update or delete a given object. However, I have used Java's priority queue, and you have to manually iterate over it to update or delete objects that are not at the root. It seems odd to have a priority queue that appears to literally just be a heap with nothing else special about it.

It might help to reason by analogy here:
List is to dynamic array as PriorityQueue is to binary heap.
That is, the abstract idea of a list (a sequence of things starting at position zero where items can be inserted and removed) is a nice, high-level concept, while a dynamic array (an array along with a capacity that doubles or 1.5x’s in size if extra space is needed) is one possible way of implementing a list. If you’re using a list, you can just think “oh, it’s a sequence, and I can put things places” without worrying how that sequence is actually represented. On the other hand, working with a dynamic array requires you to track which array elements are valid versus which ones don’t actually get used, you need to manually transfer things over when there’s no more space and think carefully about your growth strategy, etc. The distinction here is at what level you’re viewing things. If you just need “a sequence,” think “list.” If you need to build a type from scratch representing a sequence, think “dynamic array.”
This is basically what’s going on with priority queues versus binary heaps. A priority queue abstractly represents the idea of “I can put things in and they’ll come back in sorted order.” A binary heap is one specific possible way of implementing a priority queue. When working with an abstract priority queue, you can focus your thoughts purely on questions like “what elements do I want to add?” and “how do I rank them?” When working with a binary heap, you have to think about things like “do I use one-indexing or zero indexing?” and “what’s the formula for identifying the children of a node at index k?” If all you need is the ability to put things in a bag and take them out in sorted order, you can use a priority queue without worrying about how it works. If you need to build one from scratch, you can use a binary heap.
Going back to the list versus dynamic array analogy: there are many types you can use to represent lists. Dynamic arrays are one, but you could also use a circular buffer (good if you add or remove from the ends) or a linked list (good if items get moved around between lists). Similarly, there are many ways you can implement a priority queue. Binary heaps are one option, but there’s also pairing heaps, binomial heaps, etc. Keeping the relevant abstraction in focus - I just want a sequence of things, I just want a way to retrieve things in sorted order - means that you don’t need to worry as much about how things work when what you care about is what operations you want to do.
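To make the split concrete, here is a minimal sketch. The interface and class names are invented for this illustration and are not part of the JDK. The interface is the abstract priority queue, and the class is one zero-indexed binary heap that implements it.

```java
import java.util.Arrays;

/** The abstraction: put things in, take the smallest one out. */
interface IntPriorityQueue {
    void add(int value);
    int removeMin();
    boolean isEmpty();
}

/** One possible implementation: a zero-indexed binary min-heap stored in an array. */
class BinaryHeapQueue implements IntPriorityQueue {
    private int[] heap = new int[8];
    private int size;

    public void add(int value) {
        if (size == heap.length) heap = Arrays.copyOf(heap, size * 2);
        int k = size++;
        heap[k] = value;
        // Sift up: the parent of index k is (k - 1) / 2.
        while (k > 0 && heap[(k - 1) / 2] > heap[k]) {
            swap(k, (k - 1) / 2);
            k = (k - 1) / 2;
        }
    }

    public int removeMin() {
        if (size == 0) throw new IllegalStateException("queue is empty");
        int min = heap[0];
        heap[0] = heap[--size];
        int k = 0;
        // Sift down: the children of index k are 2k + 1 and 2k + 2.
        while (true) {
            int left = 2 * k + 1, right = left + 1, smallest = k;
            if (left < size && heap[left] < heap[smallest]) smallest = left;
            if (right < size && heap[right] < heap[smallest]) smallest = right;
            if (smallest == k) break;
            swap(k, smallest);
            k = smallest;
        }
        return min;
    }

    public boolean isEmpty() { return size == 0; }

    private void swap(int a, int b) {
        int t = heap[a];
        heap[a] = heap[b];
        heap[b] = t;
    }
}
```

Code that only needs "give me the smallest next" can be written against IntPriorityQueue and never see the index arithmetic; a pairing heap or binomial heap could be swapped in behind the same interface.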

Personal opinion: you are right, Java's PriorityQueue is a heap. Java is a high-level programming language, so it is reasonable for it to provide the common, standard algorithm implementations; most of the time we focus on business logic and on getting the job done faster. We don't want to spend much time building a priority queue from the ground up, and doing it yourself is tedious and error-prone.
If you want to update or delete an object without iterating over the queue manually, you can just do this:
priorityQueue.remove(oldObject);   // removes the stale entry (linear-time search)
priorityQueue.add(updatedObject);  // re-inserts it with the new priority
This is not efficient when updates occur frequently, because remove(Object) takes linear time. An alternative data structure, the Fibonacci heap, does that job better.
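A small runnable sketch of the remove-then-add update with java.util.PriorityQueue follows. The Task class and the helper names are hypothetical. The point it shows is that the element has to leave the heap before its priority changes, because the queue does not notice mutations of elements it already holds.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class UpdatePriorityDemo {
    /** Hypothetical element type with a mutable priority. */
    static final class Task {
        final String name;
        int priority;
        Task(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    static PriorityQueue<Task> newQueue() {
        return new PriorityQueue<>(Comparator.comparingInt((Task t) -> t.priority));
    }

    /** Takes the task out, changes its priority, and re-inserts it so the heap order stays valid. */
    static void updatePriority(PriorityQueue<Task> queue, Task task, int newPriority) {
        boolean wasQueued = queue.remove(task);  // linear search; Task uses identity equality
        task.priority = newPriority;             // mutate only while the task is outside the heap
        if (wasQueued) queue.add(task);          // logarithmic re-insert
    }
}
```

If updates dominate, the linear search in remove(Object) is the cost to watch; that is where an indexed heap or a Fibonacci heap becomes interesting.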

It seems odd to have a priority queue that appears to literally just be a heap with nothing else special about it.
Why?
Nothing about the name PriorityQueue promises anything more than the ability to put items in one end and get them out the other in sorted-by-priority order. That's also basically the definition of a heap, which is why a heap makes an ideal data structure to implement a priority queue.
So, essentially, the Java Collections Framework designers implemented a heap. Only instead of calling it a Heap, they called it a PriorityQueue. End of story. As the song lyric goes: "Who could ask for anything more?"

Java's PriorityQueue can be either a min-heap or a max-heap: depending on the ordering you construct it with, it will always give you the minimum or the maximum value first.
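For illustration, a minimal sketch of both constructions (the class and method names are made up; the queue calls are the standard java.util ones, and both helpers assume at least one value is passed):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class MinMaxQueueDemo {
    /** Default constructor: natural ordering, so the head is the smallest element. */
    static int smallest(int... values) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        for (int v : values) minHeap.add(v);
        return minHeap.peek();
    }

    /** Reversed comparator: the head is the largest element. */
    static int largest(int... values) {
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());
        for (int v : values) maxHeap.add(v);
        return maxHeap.peek();
    }
}
```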

Data structure for time and memory constraints

What is the best approach to store and search primitive data types? Is there a data structure that can handle both a time constraint and a memory constraint? Which websites or books give a clear understanding of these things?

Try this book:
Data Structures and Algorithms in Java by Adam Drozdek, Second Edition
It helped me a lot. It covers memory management and data compression, and it gives a deep knowledge of data structures and algorithms.
To meet a memory constraint, avoid dynamic memory allocation where you can (this is not the same thing as dynamic programming), because that saves memory.

If you are ready to sacrifice flexibility in adding and deleting elements, then a sorted int[] is the best bet you have as far as memory efficiency is concerned.
Since the array is sorted, you can perform binary search.
If the initial load of this array is going to be a few million ints and later only a few hundred ints may be added, you can supplement the int[] with another data structure such as an ArrayList to temporarily hold those additions, and then merge the ArrayList into the int[] once it grows significantly.
Deletion can be handled by overwriting that element with some unused negative number as a marker, but yes, that would not be a very clean solution, since the marker no longer fits the sort order. Or, if deletion is rare, you can again handle it with a supplementary data structure.
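A minimal sketch of the sorted int[] plus side buffer, assuming that additions are rare and that a plain re-sort on merge is acceptable. The class name, the merge threshold and the method names are invented for the example, and deletion is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a sorted int[] for the bulk load plus a small buffer for late additions. */
class SortedIntStore {
    private int[] sorted;
    private final List<Integer> pending = new ArrayList<>();
    private final int mergeThreshold;

    SortedIntStore(int[] initial, int mergeThreshold) {
        this.sorted = initial.clone();
        Arrays.sort(this.sorted);
        this.mergeThreshold = mergeThreshold;
    }

    void add(int value) {
        pending.add(value);
        if (pending.size() >= mergeThreshold) merge();
    }

    /** Binary search over the bulk data, linear scan over the (small) buffer. */
    boolean contains(int value) {
        return Arrays.binarySearch(sorted, value) >= 0 || pending.contains(value);
    }

    /** Folds the buffered additions into the sorted array. */
    void merge() {
        int[] merged = Arrays.copyOf(sorted, sorted.length + pending.size());
        for (int i = 0; i < pending.size(); i++) merged[sorted.length + i] = pending.get(i);
        Arrays.sort(merged);
        sorted = merged;
        pending.clear();
    }

    int pendingCount() { return pending.size(); }
}
```

The buffer holds boxed Integers, which is only tolerable because it stays small; the bulk of the data remains at 4 bytes per value.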
It's all a trade-off for a particular scenario; there is no single best choice for all situations.
Hope it helps!

Quadtree with HashMap

I am considering using a HashMap as the backing structure for a QuadTree. I believe I can use Morton sequencing to uniquely identify each square of my area of interest. I know that my QuadTree will have a height of at most 16. By my calculations, that would lead to a matrix of 65,536 x 65,536, which should give me at most 4,294,967,296 cells. Does anyone know if that is too many elements for a HashMap? I could always write a QuadTree using a tree, but I thought that I could get better performance with a HashMap.
Morton sequence of height 1 == (2x2) == 4
Morton sequence of height 2 == (4x4) == 16
Morton sequence of height 3 == (8x8) == 64
Morton Sequencing example for a tree of max height 3.
Here is what I know:
I will get data in lat/lon over a known rectangular area.
The data will not completely cover the whole area and will likely be consolidated into chunks somewhere in that area (worst case is data in all 4,294,967,296 cells).
The resolution of the data ends up breaking the area down into a 65k by 65k grid.
I also know that I will likely get a 10-to-1 ratio of queries to inserts/updates of the data.
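For readers unfamiliar with the term, a minimal sketch of a Morton (Z-order) key for such a 65,536 x 65,536 grid: the bits of x and y are interleaved into one number. The class and method names are illustrative, and putting x in the even bit positions is an arbitrary choice.

```java
/** Sketch: Morton (Z-order) key for a cell of a 65,536 x 65,536 grid. */
class MortonKey {
    /** Spreads the low 16 bits of v so that they occupy the even bit positions. */
    static long spread(int v) {
        long x = v & 0xFFFFL;
        x = (x | (x << 8)) & 0x00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0FL;
        x = (x | (x << 2)) & 0x33333333L;
        x = (x | (x << 1)) & 0x55555555L;
        return x;
    }

    /** Interleaves x (even bits) and y (odd bits) into one key in the range 0 to 2^32 - 1. */
    static long encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }
}
```

The full key range needs 32 unsigned bits, so the sketch returns a long rather than an int.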

A HashMap is not a good idea.
There is a better solution, used in navigation systems:
Assign each quadtree cell a letter: A (upper left), B (upper right), C and D.
Now you can address each quad cell via a String:
ABACD: this identifies a cell at level 5 (A->B->A->C->D).
Search the internet for details on this specific quadtree coding.
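A minimal sketch of that letter addressing, under stated assumptions: the answer only fixes A and B, so C = lower left and D = lower right are guesses here, y is taken to grow downward, and the class and method names are invented.

```java
/** Sketch of the letter addressing; C = lower left and D = lower right are assumptions. */
class QuadPath {
    /**
     * Builds the path from the root to cell (x, y) at the given level,
     * where x and y are in the range 0 to 2^level - 1 and y grows downward.
     */
    static String encode(int x, int y, int level) {
        StringBuilder path = new StringBuilder(level);
        for (int bit = level - 1; bit >= 0; bit--) {
            int right = (x >> bit) & 1;   // 0 = left half, 1 = right half
            int lower = (y >> bit) & 1;   // 0 = upper half, 1 = lower half
            path.append((char) ('A' + ((lower << 1) | right)));
        }
        return path.toString();
    }

    /** Parent cell: drop the last letter. */
    static String parent(String path) {
        return path.substring(0, path.length() - 1);
    }
}
```

Under these assumptions QuadPath.encode(9, 3, 5) yields "ABACD". Because every prefix of a path names an ancestor cell, such strings work well as file or directory names when cells are loaded from disk on demand.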
Don't forget: you decide the subdivision rule (when to subdivide a cell into smaller ones), and that decides how many cells you get. The number you give is far too high.
It is only a theoretical calculation, which reminds me exactly of the Google Maps quadtree.
Further, it is important to know which type of quadtree you need for your application:
point quadtree, region quadtree (bounding box), or line quadtree.
If you know of any existing quadtree implementation in Java, please post a comment or edit this answer.
Further, you cannot implement a one-size-fits-all solution.
You have to know approximately how many elements you will support.
Designing for the theoretical maximum, which is not equal to the expected maximum, is not a good approach.
You have to know this because you must decide whether to store the data in main memory or on disk, which also influences the structure of the quadtree. The "ABCD" solution is suitable for dynamic loading from disk.
The Google approach stores images in the quadtree, which is different from the points you want to store, so I doubt that your calculation is realistic.
If you want to store all streets of all countries in the world, you can estimate that number, because the number of points is known (from OpenStreetMap, TomTom (Tele Atlas), or Nokia Maps (Navteq)).
If you find that you have to store the quadtree on disk, then the size is probably open-ended and limited only by the disk space.

I think that implementing a quadtree as a tree will give you better results. Actually, implementing such a big database in a HashMap is a bad idea anyway, because if you have a lot of collisions, the performance of a HashMap degrades badly.
And apparently you know exactly how much data you have. In that case, a HashMap is redundant. A HashMap is meant for when you do not know how much data there is. But in this case, you know that every node of the tree has four children. So why even bother using a HashMap?
Also, your table is apparently at least 4 GB large. On most systems, that just barely fits in memory. And since there is also Java VM overhead, why store this in memory at all? It would be better to find a data structure that works well on disk. One such data structure for spatial data (which I assume you have, since you are using a quadtree) is the R-tree.

Whoa, we're getting a number of concepts here all at once. First of all, what are you trying to achieve? Store a quad tree? A matrix of cells? Hash lookups?
If you want a quad tree, why use a hash map? You know there can be at most 4 child nodes per node. A hash map is useful for an arbitrary number of key-value mappings where quick lookup is necessary. If you're only going to have 4, a hash might not even be important. Also, while you can nest maps, it's a bit unwieldy. You're better off using some existing data structure or writing your own.
Also, what are you trying to achieve with the quad tree? Quickly looking up a cell in the matrix? Some coordinate mapping function might serve you much better there.
Finally, I'm not so much worried about that number of nodes in a hash map as I am about the number purely on its own. 65536² cells would take 4 GiB of memory even at one byte per cell.
I think it would be best to backpedal all the way to the question "what is my goal with this data", then find out which data structures could help you with that (keeping requirements such as lookups in mind) while still fitting in memory.

Definitely use directly linked nodes for both space and speed reasons.
With data this big I'd avoid Java altogether. You'll be constantly at the mercy of the garbage collector. Go for a language closer to the metal: C or C++, Pascal/Delphi, Ada, etc.
Put the four child pointers in an array so that you can refer to leaves as packed arrays of 2-bit indices (a nice reason to use Ada, which will let you define such things with no bit fiddling at all). I guess this is Morton sequencing. I did not know that term.
This method of indexing children in itself is a reason to avoid Java. Including a child array in a node class instance will cost you a pointer plus an array size field: 8 or 16 bytes per node that aren't needed in some other languages. With 4 billion cells, that's a lot.
In fact you should do the math. If you use implicit leaf cells, you still have about 1 billion nodes to represent. If you use 32-bit indices to reference them (to save memory versus 64-bit pointers), the minimum is 16 bytes per node. Say node attributes are a mere 4 bytes. Then you need 20 gigabytes just for a full tree, even with none of the Java overhead.
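The arithmetic can be checked with a few lines. This is a sketch with invented names: the round figure of 1 billion nodes corresponds to the 4^15 nodes directly above the leaf cells, and counting every level above the leaves adds roughly a third more.

```java
/** Sketch: memory estimate for a full quadtree of height 16 with implicit leaf cells. */
class QuadTreeMemoryEstimate {
    /** Number of nodes above the leaf level: 4^0 + 4^1 + ... + 4^(height - 1). */
    static long internalNodes(int height) {
        long total = 0;
        long levelCount = 1;
        for (int level = 0; level < height; level++) {
            total += levelCount;
            levelCount *= 4;
        }
        return total;
    }

    /** Total bytes for all internal nodes at a fixed size per node. */
    static long bytesNeeded(int height, int bytesPerNode) {
        return internalNodes(height) * bytesPerNode;
    }
}
```

At 20 bytes per node (four 32-bit child indices plus 4 bytes of attributes), the 4^15 = 1,073,741,824 nodes of the lowest internal level alone take 21,474,836,480 bytes, which is exactly 20 GiB. bytesNeeded(16, 20) counts all 1,431,655,765 internal nodes and gives 28,633,115,300 bytes.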
Better have a good budget for RAM.

It is true that most typical quad-trees will simply use nodes with four child node pointers and traverse that, without any mention of hashmaps. However, it is also possible to write an efficient quadtree-like spatial indexing method that stores all its nodes in a big hashmap.
The benefit is that by using the Morton sequence (or another similarly generated value) as the key, you become able to retrieve nodes at any level with only one pointer dereference.
In "traditional" quadtree implementations we get cache misses due to repeated pointer dereferencing while looking up nodes, and this becomes the main bottleneck. So provided that the cost of encoding the coordinate space and getting a hash is lower than the cost of dereferencing the node pointers along the search path, such an implementation could be faster. Particularly if the map is very deep (having sparse locations requiring high precision).
You don't really need the Morton sequence, and you hardly need to think of it as a quadtree when doing this. A very simple example implementation:
In order to retrieve a quad of some level, use { x, y, level } as the hashmap key, where x and y are quantized to that level. You only need to include the level in the key if you are storing several levels in the same map.
Whether this is still a quadtree is up for discussion, but the functionality is the same.
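A minimal sketch of such a hashed lookup, assuming a maximum depth of 16 as in the question and packing { x, y, level } into a single long instead of using a key object. All names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: quadtree-like lookup where each cell is found by hashing { x, y, level }. */
class HashedQuadIndex<V> {
    static final int MAX_LEVEL = 16;   // assumed depth, taken from the question
    private final Map<Long, V> cells = new HashMap<>();

    /** x and y are full-resolution coordinates (0 to 2^MAX_LEVEL - 1); level is 0..MAX_LEVEL. */
    static long key(int x, int y, int level) {
        int shift = MAX_LEVEL - level;          // quantize the coordinates to that level
        long qx = x >>> shift;
        long qy = y >>> shift;
        return ((long) level << 32) | (qx << 16) | qy;
    }

    void put(int x, int y, int level, V value) {
        cells.put(key(x, y, level), value);
    }

    V get(int x, int y, int level) {
        return cells.get(key(x, y, level));
    }

    /** Probes from the finest level up to the root and returns the first stored cell. */
    V findDeepest(int x, int y) {
        for (int level = MAX_LEVEL; level >= 0; level--) {
            V value = get(x, y, level);
            if (value != null) return value;
        }
        return null;
    }
}
```

Each probe is one hash lookup regardless of depth. The boxed Long keys and the map entries still cost memory per stored cell, so this only pays off when the occupied cells are sparse.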

Is an Array of Structs always faster than a Struct of Arrays?

I was wondering if the Struct of Arrays (SoA) data layout is always faster than an Array of Structs (AoS) or Array of Pointers (AoP) for problems whose inputs only fit in RAM, programmed in C or Java.
Some days ago I was improving the performance of a molecular dynamics algorithm (in C). In summary, this algorithm calculates the force interactions among particles based on their forces and positions.
Originally the particles were represented by a struct containing 9 different doubles: 3 for the particle's forces (Fx, Fy, Fz), 3 for positions and 3 for velocities. The algorithm had an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve cache use.
So now I have a struct with 9 arrays, where each array stores one component (x, y or z) of the forces, velocities or positions of all particles. Each particle is accessed by its own array index.
I had a performance gain (for an input that only fits in RAM) of about 1.9x, so I was wondering whether changing from AoP or AoS to SoA will typically always perform better, and if not, for which types of algorithms it does not.
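To make the two layouts concrete, a sketch in Java with illustrative names (the original code is in C and is not shown in the question). Note that in Java an array of objects is an array of references, so it corresponds to the AoP case.

```java
/** Sketch of the two layouts; all names are illustrative. */
class ParticleLayouts {
    /** One object per particle: a Particle[] in Java is an array of references (AoP). */
    static final class Particle {
        double fx, fy, fz, px, py, pz, vx, vy, vz;
    }

    /** Struct-of-arrays layout: one contiguous double[] per field. */
    static final class Particles {
        final double[] fx, fy, fz, px, py, pz, vx, vy, vz;

        Particles(int n) {
            fx = new double[n]; fy = new double[n]; fz = new double[n];
            px = new double[n]; py = new double[n]; pz = new double[n];
            vx = new double[n]; vy = new double[n]; vz = new double[n];
        }
    }

    /** Reads one field per particle; every particle is reached through its own reference. */
    static double sumFx(Particle[] particles) {
        double sum = 0;
        for (Particle p : particles) sum += p.fx;
        return sum;
    }

    /** Reads one field per particle; the values sit next to each other in memory. */
    static double sumFx(Particles particles) {
        double sum = 0;
        for (double f : particles.fx) sum += f;
        return sum;
    }
}
```

A loop that touches only fx streams through one array in the SoA version, while the per-object version pulls the other eight fields of every particle into the cache as well.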

Much depends on how useful all the fields are. If you have a data structure where using one field means you are likely to use all of them, then an array of structs is more efficient, as it keeps together all the things you are likely to need.
Say you have time series data where you only need a small selection of the possible fields you have. You might have all sorts of data about an event or point in time, but you only need, say, 3-5 of them. In this case a structure of arrays is more efficient because a) you don't need to cache the fields you don't use, and b) you often access values in order, i.e. caching a field together with its next value and the one after that is useful.
For this reason, time-series information is often stored as a collection of columns.

This will depend on how exactly you access the data.
Try to imagine what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider the following things:
If there is no cache, the performance should be the same, assuming that memory access latency is equal for all elements of the data.
Now with the cache, if you access consecutive address locations, you will definitely get a performance improvement. This is exactly what applies in your case. With AoS, the values of one field are not consecutive in memory, so you lose some performance there.
You are presumably accessing your data in for loops like for(int i=0;i<1000000;i++) Fx[i] = 0. So if the amount of data is huge, you will easily see the small performance benefits. If your data were small, this would not matter much.
Finally, you also don't know about the DRAM that you are using. It will give some benefit when you access consecutive data. To understand why, you can refer to the wiki.

HashMap alternatives for memory-efficient data storage

I've currently got a spreadsheet type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.
This question asks about efficient collections libraries, and the answer was use Google Collections. My follow up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings.
Some columns will have a lot of repeated values.
Primary operations are to update or remove records based on the values of certain fields, and also adding/removing/combining columns.
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.

Some columns will have a lot of repeated values
This immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
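One simple way to apply the idea to repeated cell values is a canonicalizing pool, sketched below. The class and method names are made up, and the sketch assumes the values are immutable (as Strings are), since pooled instances end up shared between cells.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a canonicalizing pool so that equal cell values share one instance. */
class ValuePool<T> {
    private final Map<T, T> pool = new HashMap<>();

    /** Returns the pooled instance equal to value, adding value to the pool if it is new. */
    T canonical(T value) {
        if (value == null) return null;
        T existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int distinctValues() { return pool.size(); }
}
```

Passing every cell value through canonical(...) before storing it means a column with a handful of distinct Strings keeps a handful of String objects, however many rows there are.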

Trove collections should take particular care about the space occupied (I think they also have tailored data structures if you stick to primitive types). Take a look here.
Otherwise you can try Apache Commons Collections. Just do your benchmarks!
In any case, if you've got many references around to the same elements, try to design a suitable pattern (like flyweight).

Chronicle Map could have overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted as HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generation of flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.

So I'm assuming that you have a map of Map<ColumnName,Column>, where the column is actually something like ArrayList<Object>.
A few possibilities -
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load factor combinations that may help. I've messed around with this in the past, you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?

Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column refers to which cell. This way you have only 4-8 bytes of overhead per cell.
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing the memory consumed.
EDIT: You can structure your data as either row based or column based. If it's row based (one array of cells per row), adding or removing a row is just a matter of adding or removing that array. If it's column based, you can have an array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is int[] and another which is double[]; it's much more common for an entire column to have the same data type than for a whole row.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.
(Something I do is have row-based data and add columns to the end, assuming that if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
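A minimal sketch of the row-based variant, with String standing in for the column key type (the question uses Field objects) and null standing in for the default value. All names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: one Object[] per row, plus a single map from column name to array index. */
class RowTable {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    /** Adding a column only registers an index; existing rows are not rebuilt. */
    void addColumn(String name) {
        columnIndex.putIfAbsent(name, columnIndex.size());
    }

    int addRow(Object... cells) {
        rows.add(cells.clone());
        return rows.size() - 1;
    }

    /** Rows that are too short for the column read as null (the default value here). */
    Object get(int row, String column) {
        Object[] cells = rows.get(row);
        int index = indexOf(column);
        return index < cells.length ? cells[index] : null;
    }

    void set(int row, String column, Object value) {
        Object[] cells = rows.get(row);
        int index = indexOf(column);
        if (index >= cells.length) {             // grow only this row, lazily
            cells = Arrays.copyOf(cells, index + 1);
            rows.set(row, cells);
        }
        cells[index] = value;
    }

    private int indexOf(String column) {
        Integer index = columnIndex.get(column);
        if (index == null) throw new IllegalArgumentException("unknown column: " + column);
        return index;
    }
}
```

The per-cell overhead is one reference in the row array; the column map exists once per table rather than once per row.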

Guava does include a Table interface and a hash-based implementation. Seems like a natural fit to your problem. Note that this is still marked as beta.

keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * (size of a pointer) bytes (16 being the default initial capacity), plus some fields for the hash object itself (14 + psize). (Newer JDKs, Java 8 included, postpone allocating the table until the first insertion, but every row that holds data still pays this cost.) If you have a lot of sparsely filled rows, this could be a big problem.
One option would be to use a single large hash with a composite key (combining row and column), although that doesn't make operations on whole rows very efficient.
Also, since you don't mention adding cells as an operation, you can create the hashes with only the necessary inner storage (the initialCapacity parameter).
I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post it here! It would be interesting to know.

I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.

From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayList (each ArrayList would be a column).
I'd add a double map from field-name to column-number, and some clever getters/setters that never throw IndexOutOfBoundsException.
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
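A rough sketch of the column-oriented variant with forgiving accessors. String is used as a stand-in for the field type and null as the "no value" result; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: one ArrayList per column, keyed by field name in insertion order. */
class ColumnTable {
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();

    void addColumn(String name) {
        columns.putIfAbsent(name, new ArrayList<>());
    }

    void removeColumn(String name) {
        columns.remove(name);
    }

    /** Returns null instead of throwing when the column is unknown or the row is out of range. */
    Object get(String column, int row) {
        List<Object> cells = columns.get(column);
        if (cells == null || row < 0 || row >= cells.size()) return null;
        return cells.get(row);
    }

    /** Pads the column with nulls so that any non-negative row index can be written. */
    void set(String column, int row, Object value) {
        if (row < 0) throw new IllegalArgumentException("negative row: " + row);
        List<Object> cells = columns.computeIfAbsent(column, k -> new ArrayList<>());
        while (cells.size() <= row) cells.add(null);
        cells.set(row, value);
    }
}
```

Adding or removing a column is a single map operation here, whereas removing a row means touching every column list.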
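Some columns will have a lot of repeated values
I doubt this matters, especially if they are String literals: those are interned, and your collection would only store references to them. (Strings built at run time, for example read from a file, are not interned automatically; those need String.intern() or a pool of your own.)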

Why don't you try using a cache implementation like Ehcache?
This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the Ehcache implementation.
There are configuration options like:
Maximum bytes to be used from the local heap.
Once the bytes used by your application exceed the limit configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm.
You can be sure of avoiding out-of-memory errors by using this type of cache implementation.
It only increases the I/O operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are a lot of configuration options for tuning it to your requirements.

For me, Apache Commons Collections did not save any space. Here are two similar heap dumps, taken just before an OOME, comparing the Java 11 HashMap to the Apache Commons HashedMap:
The Apache Commons HashedMap doesn't appear to make any meaningful difference.
