I'm developing a new data structure that is theoretically more efficient than a hashmap. It achieves this by having an O(1) resize when there is a collision. The issue is that inserting data (the metric I care most about) is slightly slower than a hashmap, when it should be significantly faster.
here is all the code used in the insert:
private void insert(Pair data, SwitchArray currTable) {
    if (currTable.isExpanded == false && currTable.iValue == null) { // checks the very first iValue
        currTable.iValue = data;
        return;
    } else if (!currTable.isExpanded) { // if there is a new collision
        SwitchArray[] x = new SwitchArray[currTable.primeArray[currTable.depth]];
        currTable.sA = x;
        Integer index = Math.abs(data.key.hashCode()) % currTable.sA.length;
        currTable.sA[index] = new SwitchArray(data, currTable.depth + 1);
        currTable.isExpanded = true;
        insert(currTable.iValue, currTable);
        currTable.iValue = null;
    } else { // if expanded
        Integer index = Math.abs(data.key.hashCode()) % currTable.sA.length;
        if (currTable.sA[index] == null) {
            currTable.sA[index] = new SwitchArray(data, currTable.depth + 1); // this updates iValue in the constructor
        } else {
            currTable = currTable.sA[index];
            insert(data, currTable); // go one level deeper
        }
    }
}
these are the two helper classes I reference:
class SwitchArray {
    int depth;
    int length;
    SwitchArray[] sA;
    Pair iValue;
    int[] primeArray = new int[]{7, 11, 13, 17, 19, 23, 29, 31, 37};
    boolean isExpanded = false;

    public SwitchArray(Pair iValue, int depth) {
        this.iValue = iValue;
        this.depth = depth;
        length = primeArray[depth];
        if (iValue != null)
            iValue.myDepth = depth;
    }
}

class Pair {
    String key;
    Integer value;
    int myDepth;

    public Pair(String key, Integer value) {
        this.key = key;
        this.value = value;
        myDepth = -1;
    }

    public String toString() {
        return "( " + key + ", " + value + " | depth: " + myDepth + ")";
    }
}
here is the code in its entirety
I have tested the efficiency by adding varying amounts of data (from 1 pair all the way until I got a Java heap space error) to both hashmaps and my MDHT, and graphed the results using Excel. Consistently, MDHTs are slightly slower.
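For reference, a minimal sketch of the kind of timing loop described above (the MDHT class and its put method are assumptions here; the real benchmark presumably used the project's own API, and a more rigorous comparison would use JMH with warm-up runs):

import java.util.HashMap;

public class InsertBenchmark {
    public static void main(String[] args) {
        int n = 1_000_000;
        long start = System.nanoTime();
        HashMap<String, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put("key" + i, i);
        }
        long hashMapNanos = System.nanoTime() - start;
        System.out.println("HashMap inserts: " + hashMapNanos / 1_000_000 + " ms");
        // The same loop, with map.put swapped for the MDHT's insert call,
        // would produce the second data series for the graph.
    }
}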
(I would like to also add that this is just a fun project I am doing, not trying to overthrow hashmaps or anything.)
So the question I ask you is how do I fix it or slightly improve it at least?
new SwitchArray[currTable.primeArray[currTable.depth]];
This is relatively slow, as the new array needs to be zeroed out. You can't opt out of this; HotSpot does tend to recognize arrays whose slots are guaranteed to be filled almost immediately and omits the initial writing of zeroes into the heap for them, but that doesn't apply here and isn't an optimization you can add yourself.
insert
This method is recursive, and the number of times it recurses is related to the number of collisions you have; therefore, it isn't O(1).
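One small thing you can try, since each recursive call carries its own overhead: rewrite insert as a loop, hash the key once instead of at every level, and use a plain int index to avoid boxing. A rough sketch (it re-seats the displaced iValue before continuing with the new pair, so the resulting layout can differ slightly from your version):

private void insertIterative(Pair data, SwitchArray currTable) {
    int hash = data.key.hashCode() & 0x7fffffff; // hash once; mask avoids the Math.abs(Integer.MIN_VALUE) edge case
    while (true) {
        if (!currTable.isExpanded && currTable.iValue == null) { // empty leaf
            currTable.iValue = data;
            return;
        }
        if (!currTable.isExpanded) { // collision: expand and re-seat the old iValue first
            currTable.sA = new SwitchArray[currTable.primeArray[currTable.depth]];
            currTable.isExpanded = true;
            Pair displaced = currTable.iValue;
            currTable.iValue = null;
            int di = (displaced.key.hashCode() & 0x7fffffff) % currTable.sA.length;
            currTable.sA[di] = new SwitchArray(displaced, currTable.depth + 1);
            // fall through and place `data` in the now-expanded node
        }
        int index = hash % currTable.sA.length;
        if (currTable.sA[index] == null) {
            currTable.sA[index] = new SwitchArray(data, currTable.depth + 1);
            return;
        }
        currTable = currTable.sA[index]; // go one level deeper without a method call
    }
}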
So the question I ask you is how do I fix it or slightly improve it at least?
HashMap wasn't written by some random moron. It's possibly not perfect but a rote algorithmic complexity improvement is not available. You may be able to build a theoretical improvement in basic opcode count, but this is extremely unlikely to beat hashmap. The reason? Hotspot.
The hotspot engine is a gigantic pattern matcher. It finds patterns that it knows how to optimize and optimizes them. Whilst it does all sorts of magic in order to recognize as many patterns as it can, there is one simple fundamental truth: It recognizes idiomatic java. This library of patterns to optimize isn't built based on 'what sequence of opcodes can I optimize?'. It's built on a much simpler notion than that: 'Which sequence of opcodes is commonly observed in java code?'
In other words, commonly used patterns are better optimized. And HashMap is very commonly used. Hence:
O(1) insertion when there are collisions is certainly possible, but you can't guarantee O(1) lookup by fundamental definitions. However, as a general rule, as long as you aren't overloading on collisions, that isn't the controlling performance issue. At small n, the complexities of an O(n) algorithm and an O(n^2) algorithm tell you nothing about their relative speed. The algorithmically slower algorithm will beat the faster one, or not - the point is, the algorithmic complexity is completely meaningless until n is 'large enough'. When is 'large enough'? Depends on the hardware, the algorithm, the data, and the phase of the moon - the point of big-O notation isn't to predict when 'large enough' is reached, merely to posit that there is SOME n, which could be incredibly large, at which the algorithmic complexity 'takes over' and accurately predicts the faster algorithm. Point is, with hashmaps, most likely either:
[A] This is an academic case where you add thousands of objects with clashing hashcodes. Who gives a piddle what the performance of anything is at this point? The fix is to address the broken hash impl, not to futz about trying to shine the turd. Lookups are guaranteed to be O(n) in this case, and the primary point of a hashmap is to be faster than that. Just use ArrayList in this case; you can't beat its performance then. It has O(1) inserts and O(n) lookups (see the sketch after this list). Besides, your code will just crash if you try; your buckets are limited to at most 37 items. A map with 37 items in it is far to the left of that magical fulcrum point where 'n' becomes relevant.
[B] There aren't a ton of collisions. n is simply not large enough for algorithmic complexity to matter.
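For completeness, a minimal sketch of what [A] boils down to (the names here are illustrative, not from your code): with every key colliding, a plain list already gives the O(1) insert / O(n) lookup profile mentioned above.

import java.util.ArrayList;
import java.util.List;

class AssocList<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    void insert(K key, V value) { // O(1) amortized
        keys.add(key);
        values.add(value);
    }

    V lookup(K key) {             // O(n), which is unavoidable when every key collides
        for (int i = 0; i < keys.size(); i++) {
            if (keys.get(i).equals(key)) {
                return values.get(i);
            }
        }
        return null;
    }
}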
And also:
Trying to improve on things by just 'writing it slightly more optimized' is doomed to failure: The 'judge' (the hotspot VM) is biased because HashMap is so common, all hotspot implementations are designed to recognize the bytecode in j.u.HashMap and optimize it. You may be able to do some theoretic improvements but they will be small; too small to outweigh the penalty of this biased judge.
CONCLUSION: It's not possible to improve HashMap's performance without adding significant caveats to the data you intend to store in your BetterHashMap. In other words, any generalized hashmap that is significantly better than j.u.HM in some regards and not significantly worse in others is an extraordinary job and likely impossible.
Related
Majority element question:
Given an array of size n, find the majority element. The majority element is the element that appears more than ⌊ n/2 ⌋ times.
You may assume that the array is non-empty and the majority element always exists in the array.
// Solution 1 - Sorting ----------------------------------------------------------------
class Solution {
    public int majorityElement(int[] nums) {
        Arrays.sort(nums);
        return nums[nums.length / 2];
    }
}

// Solution 2 - HashMap ---------------------------------------------------------------
class Solution {
    public int majorityElement(int[] nums) {
        // int[] arr1 = new int[nums.length];
        HashMap<Integer, Integer> map = new HashMap<>(100);
        Integer k = new Integer(-1);
        try {
            for (int i : nums) {
                if (map.containsKey(i)) {
                    map.put(i, map.get(i) + 1);
                } else {
                    map.put(i, 1);
                }
            }
            for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
                if (entry.getValue() > (nums.length / 2)) {
                    k = entry.getKey();
                    break;
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Error");
        }
        return k;
    }
}
The Arrays.sort() function is implemented in Java using QuickSort and has O(n log n) time complexity.
On the other hand, using HashMap to find the majority element has only O(n) time complexity.
Hence, solution 1 (sorting) should take longer than solution 2 (HashMap), but when I was doing the question on LeetCode, the average time taken by solution 2 is much more (almost 8 times more) than solution 1.
Why is that the case? I'm really confused.....
Is the size of the test case the reason? Will solution 2 become more efficient when the number of elements in the test case increases dramatically?
Big O isn't a measure of actual performance. It's only going to give you an idea of how your performance will evolve in comparison to n.
Practically, an algorithm in O(n log n) will eventually be slower than an O(n) one for some n. But that n might be 1, 10, 10^6 or even 10^600 - at which point it's probably irrelevant because you'll never run into such a data set - or you won't have enough hardware for it.
Software engineers have to consider both actual performance and performance at the practical limit. For example, hash map lookup is in theory faster than an unsorted array lookup... but then most arrays are small (10-100 elements), negating any O(n) advantage due to the extra code complexity.
You could certainly optimize your code a bit, but in this case you're unlikely to change the outcome for small n unless you introduce another factor (e.g. artificially slow down the time per cycle with a constant).
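For example, here is a sketch of a slightly leaner version of the HashMap solution (the same LeetCode signature is assumed): it drops the pointless try/catch and explicit boxing, does a single map operation per element via merge, and exits early once a majority is confirmed.

import java.util.HashMap;
import java.util.Map;

class Solution {
    public int majorityElement(int[] nums) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i : nums) {
            int c = counts.merge(i, 1, Integer::sum); // one hash lookup per element
            if (c > nums.length / 2) {
                return i; // early exit once the majority is confirmed
            }
        }
        return -1; // unreachable when a majority element is guaranteed
    }
}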
(I wanted to find a good metaphor to illustrate, but it's harder than expected...)
It depends on the test cases: some test cases will be faster with the HashMap, others not.
Why is that? Solution 1 guarantees O(N log2 N) in the worst case, while the HashMap is O(N * (M + R)), where M is the cost of collisions and R the cost of resizing the array.
HashMap internally uses an array of nodes named table, and it resizes it several times as the input grows or shrinks. You assigned it an initial capacity of 100.
So let's see what happens. Java uses separate chaining to resolve collisions, and some test cases may have lots of collisions, which leads to a lot of time spent querying and updating the hashmap.
Conclusion: the implementation of HashMap is affected by two factors: 1. resizing the table array based on the input size, and 2. how many collisions appear in the input.
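As a side note, the resizing factor R can be removed entirely by pre-sizing the map for the worst case of nums.length distinct keys (0.75 is HashMap's default load factor); this is a sketch, not something the test cases necessarily need:

// Choose capacity so that capacity * 0.75 >= nums.length, avoiding all rehashes.
HashMap<Integer, Integer> map = new HashMap<>((int) (nums.length / 0.75f) + 1);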
So I'm presented with a problem that states: "Determine if a string contains all unique characters."
So I wrote up this solution that adds each character to a set, but if the character already exists it returns false.
private static boolean allUniqueCharacters(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);
        } else {
            return false;
        }
    }
    return true;
}
According to the book I am reading this is the "optimal solution"
public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;
    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}
My question is, is my implementation slower than the one presented? I assume it is, but if a Hash look up is O(1) wouldn't they be the same complexity?
Thank you.
As Amadan said in the comments, the two solutions have the same time complexity O(n) because you have a for loop looping through the string, and you do constant time operations in the for loop. This means that the time it takes to run your methods increases linearly with the length of the string.
Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.
For the same string, the "optimal" solution should be faster because sets have some overheads over arrays. Handling arrays is faster than handling sets. However, to actually make the "optimal" solution work, you would need an array of length 2^16. That is how many different char values there are. You would also need to remove the check for a string longer than 128.
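A sketch of what that extended version might look like, using a java.util.BitSet so the 2^16 flags stay compact (the method name is illustrative):

import java.util.BitSet;

static boolean isUniqueCharsFullRange(String str) {
    if (str.length() > (1 << 16)) { // more characters than distinct char values: a duplicate is certain
        return false;
    }
    BitSet seen = new BitSet(1 << 16);
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (seen.get(val)) {
            return false;
        }
        seen.set(val);
    }
    return true;
}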
This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.
Both algorithms have time complexity of O(N). The difference is in their space complexity.
The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).
The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.
The hashmap is in theory acceptable, but is a waste.
A hashmap is built over an array (so it is certainly more costly than an array), and collision resolution requires extra space (at least double the number of elements). In addition, any access requires computing the hash and possibly resolving collisions.
This adds a lot of overhead in terms of space and time, compared to a straight array.
Also note that it is kind of folklore that a hash table has an O(1) behavior. The worst case is much poorer, accesses can take up to O(N) time for a table of size N.
As a final remark, the time complexity of this algorithm is O(1), because you conclude false at worst when N > 128.
Your algorithm is also O(1). You can think of complexity as how an algorithm reacts to a change in the number of elements processed; by that measure, O(n) and O(2n) are effectively equal.
People here are talking about big-O notation as a growth rate.
Your solution could indeed be slower than the book's solution. Firstly, a hash lookup ideally takes constant time, but retrieval won't be constant if there are multiple hash collisions. Secondly, even with a constant-time lookup, there is usually significant overhead involved in executing the hashCode function compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start to deal with non-ASCII Unicode characters, then you might not want to go with the array approach due to the significant amount of space overhead.
The bottleneck of your implementation is that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity of O(1).
This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (otherwise the reference implementation would be wrong and produce an out-of-bounds error) and can be treated as a constant. This makes the set lookup O(1) as well, just with somewhat bigger constants than the array lookup.
* assuming a sane implementation such as a tree or a hashmap. The hashmap time complexity is in general not constant, as filling it up needs log(n) resize operations to avoid the increase in collisions that would lead to linear lookup time; see e.g. here and here for answers on Stack Overflow.
This article even explains that Java 8 converts an overly full HashMap bucket into a balanced tree (O(n log n) for the conversion, O(log n) for the lookup) before its lookup time degenerates to O(n) because of too many collisions.
Suppose I have a class:
public class Interval {
    int start;
    int end;
    Interval() { start = 0; end = 0; }
    Interval(int s, int e) { start = s; end = e; }
}
I would like to sort a list of intervals with Collections.sort() like this:
Collections.sort(intervals, new Comparator<Interval>() {
    @Override
    public int compare(Interval o1, Interval o2) {
        if (o1.start == o2.start) {
            return o1.end - o2.end;
        } else {
            return o1.start - o2.start;
        }
    }
});
I know that sorting an array with the built-in sorting function takes O(nlogn) time, and the question is if I am sorting a list of objects with two properties, what is the time complexity of sorting this list? Thanks!!
@PaulMcKenzie's brief answer in comments is on the right track, but the full answer to your question is more subtle.
Many people do what you've done and confuse time with other measures of efficiency. What's correct in nearly all cases when someone says a "sort is O(n log n)" is that the number of comparisons is O(n log n).
I'm not trying to be pedantic. Sloppy analysis can make big problems in practice. You can't claim that any sort runs in O(n log n) time without a raft of additional statements about the data and the machine where the algorithm is running. Research papers usually do this by giving a standard machine model used for their analysis. The model states the time required for low level operations - memory access, arithmetic, and comparisons, for example.
In your case, each object comparison requires a constant number (2) of value comparisons. So long as value comparison itself is constant time -- true in practice for fixed-width integers -- O(n log n) is an accurate way to express run time.
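As an aside, the same constant-per-comparison property holds for a sketch like the following, which also avoids the overflow that o1.end - o2.end can produce for extreme int values (java.util.Comparator and java.util.Collections assumed imported):

import java.util.Collections;
import java.util.Comparator;

Comparator<Interval> byStartThenEnd =
        Comparator.comparingInt((Interval o) -> o.start)   // primary key: start
                  .thenComparingInt(o -> o.end);           // tie-break on end
Collections.sort(intervals, byStartThenEnd);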
However, something as simple as string sorting changes this picture. String comparison itself has a variable cost. It depends on string length! So sorting strings with a "good" sorting algorithm is O(nk log n), where k is the length of strings.
Ditto if you're sorting variable-length numbers (java BigIntegers for example).
Sorting is also sensitive to copy costs. Even if you can compare objects in constant time, sort time will depend on how big they are. Algorithms differ in how many times objects need to be moved in memory. Some accept more comparisons in order to do less copying. An implementation detail: sorting pointers vs. objects can change asymptotic run time - a space for time trade.
But even this has complications. After you've sorted pointers, touching the sorted elements in order hops around memory in arbitrary order. This can cause terrible memory hierarchy (cache) performance. Analysis that incorporates memory characteristics is a big topic in itself.
Big-O notation neglects the least contributing factors; for example, if your complexity is n + 1, the n is kept and the 1 is neglected.
So the answer is the same: n * log(n).
Your comparator just adds one more statement, which translates into a constant number of instructions per comparison.
You should read the Collections.sort() documentation (Link here). This algorithm guarantees n log(n) performance.
Note: the Comparator doesn't change the complexity unless it does non-constant work itself, such as looping.
I'm taking an introductory course to Java and one of my latest projects involve making sure an array doesn't contain any duplicate elements (has distinct elements). I used a for loop with an inner for loop, and it works, but I've heard that you should try to avoid using many iterations in a program (and other methods in my classes have a fair number of iterations as well). Is there any efficient alternative to this code? I'm not asking for code of course, just "concepts." Would there potentially be a recursive way to do this? Thanks!
The array sizes are generally <= 10.
/** Iterates through a String array ARRAY to see if each element in ARRAY is
 *  distinct. Returns false if ARRAY contains duplicates. */
boolean distinctElements(String[] array) { // Efficient?
    for (int i = 0; i < array.length; i += 1) {
        for (int j = i + 1; j < array.length; j += 1) {
            if (array[i] == array[j]) {
                return false;
            }
        }
    }
    return true;
}
"Efficiency" is almost always a trade-off. Occasionally, there are algorithms that are simply better than others, but often they are only better in certain circumstances.
For example, this code above: it's got time complexity O(n^2).
One improvement might be to sort the strings: you can then compare the strings by comparing if an element is equal to its neighbours. The time complexity here is reduced to O(n log n), because of the sorting, which dominates the linear comparison of elements.
However - what if you don't want to change the elements of the array - for instance, some other bit of your code relies on them being in their original order - now you also have to copy the array and then sort it, and then look for duplicates. This doesn't increase the overall time or storage complexity, but it does increase the overall time and storage, since more work is being done and more memory is required.
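A sketch of that sort-then-scan idea, done on a copy so the caller's ordering is preserved as discussed above (the method name is illustrative):

import java.util.Arrays;

static boolean distinctElementsSorted(String[] array) {
    String[] copy = Arrays.copyOf(array, array.length);
    Arrays.sort(copy);                     // O(n log n) dominates
    for (int i = 1; i < copy.length; i++) {
        if (copy[i].equals(copy[i - 1])) { // value equality, not ==
            return false;
        }
    }
    return true;
}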
Big-oh notation only gives you a bound on the time ignoring multiplicative factors. Maybe you only have access to a really slow sorting algorithm: actually, it turns out to be faster just to use your O(n^2) loops, because then you don't have to invoke the very slow sort.
This could be the case when you have very small inputs. An oft-cited example of an algorithm that has poor time complexity but actually is useful in practice is Bubble Sort: it's O(n^2) in the worst case, but if you have a small and/or nearly-sorted array, it can actually be pretty darn fast, and pretty darn simple to implement - never forget the inefficiency of you having to write and debug the code, and to have to ask questions on SO when it doesn't work as you expect.
What if you know that the elements are already sorted, because you know something about their source. Now you can simply iterate through the array, comparing neighbours, and the time complexity is now O(n). I can't remember where I read it, but I once saw a blog post saying (I paraphrase):
A given computer can never be made to go quicker; it can only ever do less work.
If you can exploit some property to do less work, that improves your efficiency.
So, efficiency is a subjective criterion:
Whenever you ask "is this efficient", you have to be able to answer the question: "efficient with respect to what?". It might be space; it might be time; it might be how long it takes you to write the code.
You have to know the constraints of the hardware that you're going to run it on - memory, disk, network requirements etc may influence your choices.
You need to know the requirements of the user on whose behalf you are running it. One user might want the results as soon as possible; another user might want the results tomorrow. There is never a need to find a solution better than "good enough" (although that can be a moving goal once the user sees what is possible).
You also have to know what inputs you want it to be efficient for, and what properties of that input you can exploit to avoid unnecessary work.
First, array[i] == array[j] tests reference equality. That's not how you test String(s) for value equality.
I would add each element to a Set. If any element isn't successfully added (because it's a duplicate), Set.add(E) returns false. Something like,
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>();
    for (String str : array) {
        if (!set.add(str)) {
            return false;
        }
    }
    return true;
}
You could render the above without a short-circuit like
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>(Arrays.asList(array));
    return set.size() == array.length;
}
How can I store a 100K X 100K matrix in Java?
I can't do that with a normal array declaration as it is throwing a java.lang.OutofMemoryError.
The Colt library has a sparse matrix implementation for Java.
You could alternatively use Berkeley DB as your storage engine.
Now if your machine has enough actual RAM (at least 9 gigabytes free), you can increase the heap size in the Java command-line.
If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.
Sounds like you need a sparse matrix. Others have already suggested good 3rd party implementations that may suit your needs...
Depending on your applications, you could get away without a third-party matrix library by just using a Map as a backing-store for your matrix data. Kind of...
import java.util.Map;
import java.util.TreeMap;

public class SparseMatrix<T> {
    private T defaultValue;
    private int m;
    private int n;
    // long keys: for a 100K x 100K matrix, i * n + j would overflow int
    private Map<Long, T> data = new TreeMap<Long, T>();

    /// create a new matrix with m rows and n columns
    public SparseMatrix(int m, int n, T defaultValue) {
        this.m = m;
        this.n = n;
        this.defaultValue = defaultValue;
    }

    /// set value at [i,j] (row, col)
    public void setValueAt(int i, int j, T value) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        data.put((long) i * n + j, value);
    }

    /// retrieve value at [i,j] (row, col)
    public T getValueAt(int i, int j) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        T value = data.get((long) i * n + j);
        return value != null ? value : defaultValue;
    }
}
A simple test-case illustrating the SparseMatrix' use would be:
public class SparseMatrixTest extends TestCase {
    public void testMatrix() {
        SparseMatrix<Float> matrix =
                new SparseMatrix<Float>(100000, 100000, 0.0F);
        matrix.setValueAt(1000, 1001, 42.0F);
        assertTrue(matrix.getValueAt(1000, 1001) == 42.0);
        assertTrue(matrix.getValueAt(1001, 1000) == 0.0);
    }
}
This is not the most efficient way of doing it because every non-default entry in the matrix is stored as an Object. Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a 3rd-party solution (and possibly dealing with its License - again, depending on your situation).
Adding matrix-operations like multiplication to the above SparseMatrix implementation should be straight-forward (and is left as an exercise for the reader ;-)
100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single byte entries, that's still in the vicinity of 10 GB - does your machine even have that much physical memory, let alone have a will to allocate that much to a single process?
Chances are you're going to need to look into some kind of a way to only keep part of the matrix in memory at any given time, and the rest buffered on disk.
There are a number of possible solutions depending on how much memory you have, how sparse the array actually is, and what the access patterns are going to be.
If the calculation of 100K * 100K * 8 is less than the amount of physical memory on your machine available for use by the JVM, a simple non-sparse array is a viable solution.
If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases, you still need to work out if this is going to give you enough savings. Figure out how many non-zero elements there are going to be, multiply that by 8 (to give you doubles) and (say) 4 to account for the overheads of the sparse array. If that is less than the amount of physical memory that you can make available to the JVM, then sparse arrays are a viable solution.
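That estimate is simple enough to sketch directly (the 8 bytes per double and the ~4x overhead factor are the rough assumptions from the paragraph above):

// Rough in-memory feasibility check for a sparse representation.
static boolean sparseLikelyFits(long nonZeroElements, long heapBytesAvailable) {
    long estimatedBytes = nonZeroElements * 8L * 4L; // 8 bytes per double, ~4x structure overhead
    return estimatedBytes < heapBytesAvailable;
}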
If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.
One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer. Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard. So it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.
A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section (see the sketch after this list of approaches). This only works if the algorithm works on the array in sections.
A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)
A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.
A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.
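Returning to the second approach, here is a rough sketch of mapping one block of rows of a 100K x 100K double matrix at a time; the file path, block size, and row-major layout are assumptions for illustration, and each mapping stays well under the 2 GB per-buffer limit:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

class RowBlockMatrix implements AutoCloseable {
    static final int N = 100_000;          // columns per row
    static final int ROWS_PER_BLOCK = 64;  // 64 * 100_000 * 8 bytes = ~51 MB per mapping

    private final RandomAccessFile file;
    private final FileChannel channel;

    RowBlockMatrix(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        channel = file.getChannel();
    }

    /** Maps the block of ROWS_PER_BLOCK rows starting at firstRow and exposes it as doubles. */
    DoubleBuffer mapBlock(long firstRow) throws IOException {
        long offset = firstRow * N * 8L;
        long length = (long) ROWS_PER_BLOCK * N * 8L;
        return channel.map(FileChannel.MapMode.READ_WRITE, offset, length).asDoubleBuffer();
    }

    @Override
    public void close() throws IOException {
        channel.close();
        file.close();
    }
}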
You can upgrade to this machine:
http://www.azulsystems.com/products/compute_appliance.htm
864 processor cores and 768 GB of memory, only costs a single family house somewhere.
Well, I'd suggest that you increase the memory in your JVM, but you're going to need a lot of memory, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered JVM, but that's probably the wrong answer.
You're getting the OutOfMemoryError because when you declare, say, int[1000], the memory is allocated immediately (additionally, doubles take up more space than ints, so an int representation will also save you space). Maybe you can substitute a more efficient implementation of your array (if you have many empty entries, look up "sparse matrix" representations).
You could store pieces in an outside system, like memcached or memory-mapped buffers.
There are lots of good suggestions here, maybe if you posted a more detailed description of the problem you're trying to solve people could be more specific.
You should try an "external" package to handle matrices; I never did that myself, but maybe something like JAMA.
Unless you have 100K x 100K x 8 ~ 80GB of memory, you cannot create this matrix in memory. You can create this matrix on disk and access it using memory mapping. However, using this approach will be very slow.
What are you trying to do? You may find that representing your data in a different way will be much more efficient.