Edit: typos fixed and ambiguity cleared up.
I have a list of five-digit integers in a text file. The count can only be as large as the largest value a 5-digit integer can hold. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do is read the data, put it in the structure, and then be able to determine whether a specific integer is in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so, you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set the bits to active whenever a number is entered. This has extremely fast lookups and extremely fast setting, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with a BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
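For the 5-digit case in the question, a minimal sketch of all three options could look like this (the values, class name and variable names are just illustrative):
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class FiveDigitLookup {
    public static void main(String[] args) {
        int[] input = {11111, 22222, 33333};

        // Option 1: bit vector, one bit per possible value 0..99999
        BitSet bits = new BitSet(100000);
        for (int n : input) bits.set(n);
        System.out.println(bits.get(22222));          // true

        // Option 2: hash set, expected O(1) lookups
        Set<Integer> hashSet = new HashSet<>();
        for (int n : input) hashSet.add(n);
        System.out.println(hashSet.contains(22222));  // true

        // Option 3: balanced BST (TreeSet), guaranteed O(log n) lookups
        Set<Integer> treeSet = new TreeSet<>();
        for (int n : input) treeSet.add(n);
        System.out.println(treeSet.contains(44444));  // false
    }
}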
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that is 100,000 bits, which means 100,000 booleans to record the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudocode of the insertion logic is:
bitmap[number / 8] |= (1 << (number % 8));
Searching looks like:
bitmap[number / 8] & (1 << (number % 8));
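A runnable Java version of that byte[] bitmap might look like this (class and method names are mine):
public class ByteBitmap {
    // 12,500 bytes = 100,000 bits, enough for values 0..99,999
    private final byte[] bitmap = new byte[12500];

    // "Insert" a number by turning its bit on
    void insert(int number) {
        bitmap[number / 8] |= (1 << (number % 8));
    }

    // Search: check whether the bit for this number is on
    boolean contains(int number) {
        return (bitmap[number / 8] & (1 << (number % 8))) != 0;
    }
}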
If you understand the rationale, then even better news for you: in Java we already have BitSet, which does what I was describing above.
So the code looks like this:
BitSet bitset = new BitSet(100000); // the constructor argument is a size in bits, not bytes
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you have at most 100,000 possible values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then, instead of adding a number like 11111, 22222 or 33333, just do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
boolean numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
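Putting that together with the file format from the question, a sketch could look like this (the file name and variable names are assumptions):
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

public class BooleanArrayLookup {
    public static void main(String[] args) throws IOException {
        boolean[] numbers = new boolean[100000];
        try (Scanner scanner = new Scanner(new FileInputStream("numbers.txt"))) {
            int count = scanner.nextInt();          // first line: how many integers follow
            for (int i = 0; i < count; i++) {
                numbers[scanner.nextInt()] = true;  // mark each 5-digit integer as present
            }
        }
        System.out.println(numbers[11111]);         // true for the example file
    }
}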
A Set is the go-to data structure to "find", and here's a tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)
        .limit(scanner.nextInt())
        .collect(Collectors.toSet());
Related
I have been asked to come up with a solution where you have a file in which each line represents a 10-digit phone number, and we need to tell whether a given 10-digit phone number is present in the file or not.
I came up with a Trie data structure where each node's children are nothing but a Map with an Integer as key and a Trie as value.
class Trie {
    boolean isEnd;
    Map<Integer, Trie> map = new HashMap<>();
}
I could also take an int[] arr to store the children.
As we only have digits ranging from 0-9, we could store each of them in just 4 bits, so why take 'int' or Integer as the data type? How can I reduce memory here?
How can we store these numbers in a Map or array without using int, since we would end up wasting a lot of memory?
Moreover, is there any better solution than a Trie?
If you're going for memory efficiency, I would actually advise against using a trie and recommend a different data structure. As I understand it, you are only interested in answering queries of the form "have I seen this phone number before?" While you could do this by treating the phone numbers as strings and throwing all of them into a trie, you wouldn't be taking advantage of the operations that tries are designed to support (fast prefix searching, retrieving elements in sorted order, etc.), so you'd be paying for things you wouldn't be using.
Moreover, let's think about the space usage of the trie. Even if every phone number had a long common prefix, each node in the trie requires space to store its child pointers. If you store even one (64-bit) pointer per node, you're using the same amount of space that you'd be using to store a 10-digit phone number (which fits comfortably into a 64-bit integer). If the phone numbers don't have long shared prefixes, you're potentially storing ten pointers per number, a huge space blowup, regardless of how big the hash table keys are.
Instead of throwing things into a trie, I'd consider just using a simple, vanilla hash table. After all, hash tables are specifically optimized to support membership queries and membership queries alone. Hashing phone numbers shouldn't be too bad, as they can be packed into 64-bit integers and hashed using a variety of simple hashing techniques. This lets you control what kind of time/space tradeoff you want to make (larger table sizes increase memory and decrease time, smaller tables increase time and decrease memory).
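As a rough sketch of that idea, using a plain HashSet<Long> for brevity (a primitive long set would avoid the boxing; the numbers are made up):
import java.util.HashSet;
import java.util.Set;

public class PhoneNumberLookup {
    public static void main(String[] args) {
        // a 10-digit phone number fits comfortably into a long
        Set<Long> seen = new HashSet<>();

        seen.add(Long.parseLong("9876543210")); // lines read from the file
        seen.add(Long.parseLong("1234567890"));

        System.out.println(seen.contains(1234567890L)); // true
        System.out.println(seen.contains(5555555555L)); // false
    }
}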
I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that being the criteria, is there a difference between using the long integer as-is and doing modulo (n) [long integer], or is it better to generate a hashCode for the string version of the long integer (to improve the distribution) and then do modulo (n) [hash_code of string(long integer)]? Is the additional string conversion necessary to get a uniform spread via the hash code?
Since I got feedback that my question does not have enough background information, I am adding some more.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database representing an item id. The reason for pockets of missing identifiers is because of deletes.
The identifiers themselves are long integers.
The identifiers (items) themselves number in the order of tens to hundreds of millions in some cases, and in the order of thousands in others.
Only in the case where the identifiers are in the order of millions do I really want to spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should resort to (Long).toString().hashCode() to get a uniform spread instead of using the long numeric value directly. I had a feeling that doing toString().hashCode() is not going to fetch me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation seems to be documented and has stayed stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
Your question as-is cannot be answered. @slim's attempt is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly higher than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items [0-199], [400-599] and [800-999], while the other 200 buckets get items [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
If they are not uniformly distributed, then since the modulo operator doesn't change the distribution except by wrapping it, the output distribution is not uniform either.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution. The point of the hash function is to break up the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming the inputs are not too large). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
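As an illustration, that formula in Java might look something like this (the constant is the one above; the method names and bucket count are just for the example):
public class MultiplicativeHash {
    // Knuth's multiplicative method: multiply by a large constant, keep the low 32 bits
    static long hash(long input) {
        return (input * 2654435761L) & 0xFFFFFFFFL; // equivalent to % 2^32
    }

    static int bucket(long id, int buckets) {
        return (int) (hash(id) % buckets);
    }

    public static void main(String[] args) {
        // clustered, auto-increment style identifiers still spread across buckets
        for (long id = 1000; id < 1010; id++) {
            System.out.println(id + " -> bucket " + bucket(id, 16));
        }
    }
}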
I found that this gist has an interesting compilation of diverse hashing functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.
I'm writing a java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance as it is needed very often.
public static Object computeResult(long input) {
    Object result;
    // ... calculate
    return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm with a speed of 4,000,000 transformations per second. (using 4 threads)
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I ran into some difficulties creating such a cache:
public class Cache {
    private static long[] sortedInputs; // 150M length
    private static Object[] results;    // 150M length

    public static Object lookupCachedResult(long input) {
        int index = Arrays.binarySearch(sortedInputs, input);
        return results[index];
    }
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, and it is sorted numerically. The second array holds a reference to one of the 3000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get to the cached result, I do a binary search for the input number on the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results; not even half as fast, only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e. it must box a Long each time you insert a long, which can slow it down. Maybe this performance issue is not significant due to the JIT, but I would recommend at least trying the following and measuring performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like
index = key % n. Since you know your 3000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. So you circumvent rehashing etc. and get true O(1) performance.
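A toy sketch of that idea, i.e. probing for an array length n where key % n never collides and then using parallel key/value arrays (all names are mine; it assumes non-negative, distinct keys and a key set small enough to probe up front):
import java.util.Arrays;

public class ModuloPerfectHashCache {
    private final long[] keys;
    private final Object[] values;
    private final int n;

    // Finds, by trial, an array length n such that key % n never collides,
    // then fills parallel key/value arrays. Assumes non-negative, distinct keys.
    ModuloPerfectHashCache(long[] inputKeys, Object[] inputValues) {
        int candidate = inputKeys.length;
        search:
        while (true) {
            candidate++;
            boolean[] used = new boolean[candidate];
            for (long k : inputKeys) {
                int idx = (int) (k % candidate);
                if (used[idx]) continue search; // collision: try a larger n
                used[idx] = true;
            }
            break; // no collisions for this n
        }
        n = candidate;
        keys = new long[n];
        values = new Object[n];
        Arrays.fill(keys, -1L); // -1 marks an empty slot (keys are non-negative)
        for (int i = 0; i < inputKeys.length; i++) {
            int idx = (int) (inputKeys[i] % n);
            keys[idx] = inputKeys[i];
            values[idx] = inputValues[i];
        }
    }

    Object lookup(long key) {
        int idx = (int) (key % n);
        return keys[idx] == key ? values[idx] : null; // null if not cached
    }
}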
Secondly, I would recommend that you look at Java numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native Lapack and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector-algebra such that it computes the whole long-array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
With so few values, you should ensure that they get re-used (unless they're pretty small objects). For this an Interner is perfect (though you could roll your own).
I tried HashMap and TreeMap; both attempts ended in an OutOfMemoryError.
There's a huge memory overhead for both of them. And there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-Object map available; google for "primitive collections". This should use slightly more memory than your two arrays. With hashing being usually O(1) (let's ignore the worst case, as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of 20. Your binary search needs log2(150e6), i.e., about 27 steps, and hashing may need on average maybe two. This depends on how tightly you pack the hash table; that is usually a parameter given when it gets created.
In case you roll your own (which you most probably shouldn't), I'd suggest using an array of size 1 << 28, i.e., 268435456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.
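To make the "array of size 1 << 28 with bitwise indexing" idea concrete, here is a rough sketch of a hand-rolled open-addressing long-to-Object map (the mixing constant and all names are mine, and the fixed table costs several gigabytes up front):
import java.util.Arrays;

public class LongToObjectMap {
    private static final int SIZE = 1 << 28;          // 268,435,456 slots
    private static final int MASK = SIZE - 1;
    private static final long EMPTY = Long.MIN_VALUE; // assumes this never appears as a key

    private final long[] keys = new long[SIZE];
    private final Object[] values = new Object[SIZE];

    public LongToObjectMap() {
        Arrays.fill(keys, EMPTY);
    }

    private static int index(long key) {
        // mix the bits, then use a bitwise AND instead of modulo for indexing
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32)) & MASK;
    }

    public void put(long key, Object value) {
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & MASK; // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }

    public Object get(long key) {
        int i = index(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & MASK;
        }
        return null; // not cached
    }
}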
I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory and we use Java code to navigate and access the tables, one can easily (or not) write a program / method that expresses the same table.
The problem with that approach is that I don't know what the best strategy is. I know a lot of it is about what fits in the cache and about branch prediction. Also, how a switch case is implemented, meaning the actual assembly, is beyond me. If I have an in-memory lookup table (or a hierarchy of them) I can simply jump in and out, but I doubt that, for my purposes, that table would fit in the cache.
Since I actually walk a tree, one could implement it as if/else statements, requiring a certain number of comparisons, but each comparison would need additional binary operations.
So the following options exist:
General algorithm using in-memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit pattern length) (fewer if statements, might be more code).
Switch statement representation of the code
Writing and benchmarking is quite tricky so any initial thoughts would be great.
One additional problem that comes into play is the order of bits. The most significant bit always comes first, meaning it is stored in reverse order.
If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).
So actually the codes have to be written in reverse order. Using the if/else or switch approach for groups, this would not be a problem, since masking out the bits is simple and reversing the bits is possible, but it would lose the idea of getting the index within the group out of the mask, since in reverse bit order add and remove have different meanings, and a simple lookup is also not possible.
Reversing the bits is a costly operation (I use 4-bit lookup tables), not outweighing the performance penalty of the binary operations.
But reversing the bits on the go is better suited for this and requires four operations per bit (shifting up, masking out, adding, and also shifting the input down). Since I read bits ahead, all those operations will be done in registers, so they might take only a few cycles.
This way I can use switch, sub and if to find the right symbol group and also to return those.
So finally I need advice. Since my codes are global for language processing, they can be hardwired (i.e. be in source).
I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to switch or if/else based on the input symbol, it might give me a clue.
[Updates]
I found a simplification that avoids the reverse-bit problem but still adds cost per group. So I end up writing the bits in the order of the groups to traverse. So I will not need four modifications per bit but per group (different bit lengths).
For each group we have:
1. The value of the first element and the size (and therefore the value of the last element within that group).
Therefore for each group the algorithm looks like:
1. Read m bits and combine them with the value read so far.
2. Compare the value with the last value of that group: if it is smaller, it is within that group; if not, it is outside -> read the next group.
3. If it is inside the group, an array of values can be accessed, or a switch statement can be used.
This is totally generic and can be used without loops, making it efficient (see the sketch below). Also, once the group is detected, the bit length of the code is known and the bits can be consumed from the source, since the code looks far ahead (reading from the stream).
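A rough sketch of that per-group walk, assuming canonical codes (each group's code values are consecutive); the BitReader and the table layout are my own naming, not a fixed design:
// Minimal bit-reader abstraction assumed for this sketch.
interface BitReader {
    int readBits(int n); // next n bits of the stream, most significant bit first
}

class GroupDecoder {
    final int[] extraBits;   // additional bits to read when entering group g
    final int[] firstCode;   // first (smallest) code value in group g
    final int[] lastCode;    // last (largest) code value in group g
    final int[] firstSymbol; // index of the group's first symbol in symbols[]
    final int[] symbols;     // symbols laid out group by group

    GroupDecoder(int[] extraBits, int[] firstCode, int[] lastCode,
                 int[] firstSymbol, int[] symbols) {
        this.extraBits = extraBits;
        this.firstCode = firstCode;
        this.lastCode = lastCode;
        this.firstSymbol = firstSymbol;
        this.symbols = symbols;
    }

    int decodeSymbol(BitReader in) {
        int code = 0;
        for (int g = 0; g < extraBits.length; g++) {
            // 1. read the group's extra bits and combine them with the value so far
            code = (code << extraBits[g]) | in.readBits(extraBits[g]);
            // 2. not larger than the group's last value -> the code is inside this group
            if (code <= lastCode[g]) {
                // 3. index into the group's slice of the symbol array
                return symbols[firstSymbol[g] + (code - firstCode[g])];
            }
        }
        throw new IllegalStateException("invalid code");
    }
}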
[Update 2]
To access the actual value one could use a single big array of elements, grouped by group. Since the probability decreases from group to group, it is very likely that a significant part fits in the L2 or L1 cache, speeding up access here.
Or one uses switch statements.
[Update 3]
Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. The lookupswitch has a complexity of O(log n) and stores key / jump-offset pairs, which is not preferable. Therefore checking for groups is better suited to if/else.
The tableswitch itself uses only a table of jump offsets, and it only takes a subtract, compare, access, and jump to reach the destination; then it executes a return of a constant value.
Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return the group's symbol table. Storing everything in one big table is promising, since it might be an int or short per symbol, and my codes often only have 1,000 to 4,000 symbols at most, making short sufficient.
I will check whether a 1-pattern will give me the opportunity to store and access the masks in a better way, allowing binary searching for the correct group instead of advancing in O(n), and it might even avoid any shift operations at all during processing.
I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.
We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes of fewer than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 1001011 (7 bits) for the symbol 'C', then all of the table indexes xxxxxxxx1001011 (for every combination of the eight x bits) have the same thing. Those entries all have {7, 'C'}.
Then you get 15 bits from the stream, and look up the next code in the table. You remove the number of bits from that table entry, and use the resulting symbol. Now you get as many bits from the stream as you need to have 15, and repeat. So if you used 7 bits, then get 8 more to get back to 15 and look up the next code.
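A small sketch of that single-table construction and decode step (field and method names are mine):
class OneLevelTable {
    static final int MAX_BITS = 15;
    final byte[] lengths = new byte[1 << MAX_BITS]; // bits consumed by the code at this index
    final char[] symbols = new char[1 << MAX_BITS]; // symbol decoded at this index

    // Fill every index whose low 'len' bits equal 'code' with {len, symbol}.
    void addCode(int code, int len, char symbol) {
        for (int high = 0; high < (1 << (MAX_BITS - len)); high++) {
            int index = (high << len) | code;
            lengths[index] = (byte) len;
            symbols[index] = symbol;
        }
    }

    // 'window' holds the next 15 bits of the stream; afterwards the caller
    // consumes lengths[window] bits and refills the window to 15 bits.
    char decode(int window) {
        return symbols[window];
    }
}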
The next subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9 bits or less, then you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says that there are 10 or more bits in the code (and you don't know yet how much more), then you consume the first nine bits and go to a second-level table, pointed to by the entry in the first table for those initial nine bits, which has entries for the remaining six bits (64 entries). That resolves the remainder of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast since short codes are more common. This is the approach used by inflate in zlib.
In the end it was quite simple. I support almost all solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit (just tables of 10 bits, where symbol count + 1 is the reference to those tables)), and generate Java (and if needed JavaScript, but currently I use GWT to translate it).
I even use long reads and shift operations to reduce the accesses to binary information. This way the code gets more efficient, since I only support a maximum bit size (20 bits (so a table of tables), which makes 2^20 symbols and therefore at most a million).
For the ordering I use a generator for the bit masks, using just shift operations and with no need to reverse bit order or the like.
The table lookups can also be expressed in Java, storing the tables as arrays of arrays (it's interesting how big the Java files can get before the compiler complains).
Also I found it interesting that, since comparing expresses an ordering (a partial order, I guess), one can sort the symbols and, instead of mapping the symbols, map the comparison index. By comparing two indexes one can simply sort streams of codes without touching too much. By also storing the first one or two comparison indexes (16 or 32 bit), one can efficiently sort, and therefore binary-sort, compressed strings using the same Huffman code, which makes it ideal for compressing strings of a certain language.
I am facing a problem where for a number of words, I make a call to a HashMultimap (Guava) to retrieve a set of integers. The resulting sets have, say, 10, 200 and 600 items respectively. I need to compute the intersection of these three (or four, or five...) sets, and I need to repeat this whole process many times (I have many sets of words). However, what I am experiencing is that on average these set intersections take so long to compute (from 0 to 300 ms) that my program takes a very long time to complete if I look at hundreds of thousands of sets of words.
Is there any substantially quicker method to achieve this, especially given I'm dealing with (sortable) integers?
Thanks a lot!
If you are able to represent your sets as arrays of bits (bitmaps), you can intersect them with AND operations. You could even implement this to run in parallel.
As an example (using jlordo's question): if set1 is {1,2,4} and set2 is {1,2,5}
Then your first set would be represented as: 00010110 (bits set for 1, 2, and 4).
Your second set would be represented as: 00100110 (bits set for 1, 2, and 5).
If you AND them together, you get: 00000110 (bits set for 1 and 2)
Of course, if you had a larger range of integers, then you will need more bytes. The beauty of bitmap indexes is that they take just one bit per possible element, thus occupying a relatively small space.
In Java, for example, you could use the BitSet data structure (not sure if it can do operations in parallel, though).
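For example, BitSet's and() performs exactly that AND step; a tiny sketch with the two sets above:
import java.util.BitSet;

public class BitSetIntersection {
    public static void main(String[] args) {
        BitSet set1 = new BitSet();
        for (int i : new int[] {1, 2, 4}) set1.set(i);

        BitSet set2 = new BitSet();
        for (int i : new int[] {1, 2, 5}) set2.set(i);

        // and() modifies the receiver in place, so intersect into a copy
        BitSet intersection = (BitSet) set1.clone();
        intersection.and(set2);

        System.out.println(intersection); // {1, 2}
    }
}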
One problem with a bitmap-based solution is that even if the sets themselves are very small but contain very large numbers (or the range is even unbounded), checking bitmaps would be very wasteful.
A different approach would be, for example, sorting the two sets, merging them and checking for duplicates. This can be done in O(n log n) time and O(n) extra space, given the set sizes are O(n).
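A minimal sketch of that merge step for two already-sorted int arrays (names are illustrative):
import java.util.Arrays;

public class SortedIntersection {
    // One linear pass over two sorted arrays, keeping the common elements.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(intersect(new int[] {1, 2, 4}, new int[] {1, 2, 5}))); // [1, 2]
    }
}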
You should choose the solution that matches your problem description (input range, expected set sizes, etc.).
The post http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset describes an implementation of an ordered primitive long set with set operations (union, minus and intersection). In my experience it's quite efficient for dense or sparse value populations.