I have an application where I generate combinations of numbers, where each combination is unique. I want to store the combinations in a way that lets me retrieve a few random combinations efficiently later.
Here is an example of one combination
35 36 37 38 39 40 50
There are ~4 million combinations in all.
How should I store the data so that I can retrieve combinations later?
Since your combinations are unique and you don't actually have any query criteria on your numbers, it does not matter how you store them in the database. Just insert them into some table. To retrieve X random combinations, simply do:
SELECT * FROM table ORDER BY RANDOM() LIMIT X
See:
Select random row(s) in SQLite
On storing array of integers in SQLite:
Insert a table of integers - int[] - into SQLite database,
I think there might be a different solution; in the sense of: do you really have to store all those combinations?!
Assuming that those combinations are just "random", you could apply some (smart) maths in a function getCombinationFor(), like
public List<Integer> getCombinationFor(long whatever)
that uses a fixed algorithm to create a unique result for each incoming input.
Like:
getCombinationFor(0): gives 0 1 2 3 10 20 30
getCombinationFor(1): gives 1 2 3 4 10 20 30 40
The above is of course pretty simple; depending on your requirements for those sequences, you might need something much more complicated. But for sure, you can define such a function to return a permutation of a fixed set of numbers within a certain range!
The important thing is: this function returns a unique List for each and every input; and, also important: given a certain sequence, you can immediately determine the number that was used to create that sequence.
So instead of generating a huge set of data containing unique sequences, you simply define an algorithm that knows how to create unique sequences in a deterministic way. If that would work for you, it completely frees you from storing your sequences at all!
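For illustration, here is a minimal sketch of one such deterministic mapping, using lexicographic unranking of k-combinations; the class name and the parameters n and k are just placeholders for whatever scheme fits your real sequences:
import java.util.ArrayList;
import java.util.List;

public class CombinationUnranker {

    // n choose k, computed iteratively (exact for the sizes used here).
    static long binomial(int n, int k) {
        if (k < 0 || k > n) return 0;
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Returns the index-th k-element combination of {0, 1, ..., n-1} in
    // lexicographic order. Every index in [0, C(n, k)) maps to exactly one
    // combination, and the mapping can be inverted, so nothing has to be stored.
    public static List<Integer> getCombinationFor(long index, int n, int k) {
        List<Integer> combination = new ArrayList<>(k);
        int next = 0;
        for (int remaining = k; remaining > 0; remaining--) {
            // Skip candidates whose block of combinations lies before 'index'.
            while (true) {
                long block = binomial(n - next - 1, remaining - 1);
                if (index < block) break;
                index -= block;
                next++;
            }
            combination.add(next);
            next++;
        }
        return combination;
    }

    public static void main(String[] args) {
        // e.g. the 4,000,000-th 7-number combination drawn from 0..59
        System.out.println(getCombinationFor(4_000_000L, 60, 7));
    }
}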
Edit: just remembered that I was looking into something kinda "close" to this question/answer ... see here!
Sending data:
Use a UNIQUE index. On a unique-violation exception, just re-randomize and send the next record.
Retrieve:
The simplest approach to implement is: use a HashSet and feed it with random numbers (max is the number of combinations in your database - 1) while hashSet.size() is less than the desired number of records to retrieve. Use the numbers from the hashSet as IDs or rownums of the records (combinations) to select the data: WHERE ID IN (ids here).
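A rough sketch of that retrieval step, assuming the combinations are stored with consecutive integer IDs starting at 0 (totalRows and count are placeholders):
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Collect 'count' distinct random IDs in [0, totalRows), then use them in
// "SELECT ... WHERE id IN (...)".
static Set<Long> pickRandomIds(long totalRows, int count) {
    Set<Long> ids = new HashSet<>();
    while (ids.size() < count) {
        ids.add(ThreadLocalRandom.current().nextLong(totalRows));
    }
    return ids;
}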
Edit: fixed typos and tried to clear up the ambiguity.
I have a list of five digit integers in a text file. The expected amount can only be as large as what a 5-digit integer can store. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so - you could just use a bit vector for this. You'd store one bit per number, all initially zero, and then set a bit whenever the corresponding number is inserted. This has extremely fast lookups and extremely fast insertion, but is very memory-intensive and infeasible if the integers can be totally arbitrary. This would probably be modeled with a BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
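For concreteness, a tiny side-by-side sketch of the three options, reusing the numbers from the question (illustrative only):
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

static void demo() {
    BitSet bits = new BitSet(100_000);       // bounded range: O(1) operations, dense storage
    Set<Integer> hashed = new HashSet<>();   // arbitrary ints: expected O(1) add/contains
    Set<Integer> ordered = new TreeSet<>();  // arbitrary ints: guaranteed O(log n) worst case

    bits.set(11111);
    hashed.add(22222);
    ordered.add(33333);

    boolean found = bits.get(11111) && hashed.contains(22222) && ordered.contains(33333);
    System.out.println(found);   // true
}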
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that is 100,000 bits, which means 100,000 booleans to store the existence of 0-99,999)
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudo code of the insertion logic is:
bitmap[number / 8] |= (1 << (number % 8));
searching looks like:
(bitmap[number / 8] & (1 << (number % 8))) != 0;
If you understand the rationale, then there's even better news for you: in Java we already have BitSet, which does what I was describing above.
So the code looks like this:
BitSet bitset = new BitSet(100000); // size is in bits, not bytes
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you'll only have a maximum of 100,000 possible values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then instead of adding a number like 11111, 22222, 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure to "find", and here's a tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt)  // lazily supplies the remaining integers
        .limit(scanner.nextInt())                         // evaluated first: reads the count from line one
        .collect(Collectors.toSet());
As the title states, how do I calculate the optimal number to use, and how do I motivate it?
If we are going to build a hash table which uses the following hash function:
h(k) = k mod m, k = key
So some sources tell me:
to use the number of elements to be inserted as the value of m
to use a prime close to that number as m
that Java simply uses 31 as its value of m
And some people tell me to use the prime closest to 2^n as m.
I'm so confused at this point that I don't know what value to use for m. For instance, if we use the table size for m, then what happens if we want to expand the table size? Will I then have to rehash all the values with the new value of m? If so, why does Java simply use 31 as the prime value for m?
I've also heard that the table size should be two times bigger than the total number of elements in the hash table - that is, each time it rehashes. But how come we, for instance, use m=10 for a table of 10 elements when it should be m=20 to create that extra empty space?
Can someone please help me understand how to calculate the value of m to use based on different scenarios, like when we want a static hash table (where we know that we will only insert, say, 10 elements) or a dynamic one (rehash after a certain limit)?
Let me illustrate my problem with the following examples:
I got the values {1,2,...,n}
Question: What would be an optimized value of m if I must use division mod in my hash function?
Scenario 1: n = 100?
Scenario 2: n = 5043?
Additional question:
Would the value of m be different if we used an open or closed hash table?
Note that I'm not trying to understand Java's hash table here, but hash tables in general, where I must use a division-mod hash function.
Thank you for your time!
You have several issues here:
1) What should m equal?
2) How much free space should you have in your hash table?
3) Should you make the size of your table be a prime number?
1) As was mentioned in the comments, the h(k) you describe isn't the hash function, it gives you the index into your hash table. The idea is that every object produces some hash code, which is a positive integer. You use the hash code to figure out where to put the object in the hash table (so that you can find it again later). You clearly don't want a hash table of size MAX_INT, so you choose some size m. Then for any object, you take its hash code, compute k % m, and now you have an integer in the interval [0, m-1], which is a valid index into your hash table.
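For example, that index computation could be sketched like this; the sign-bit mask is only there because Java hash codes can be negative, and the method name is made up:
// Map an object's hash code to a valid slot in a table of size m.
static int indexFor(Object key, int m) {
    int h = key.hashCode();
    return (h & 0x7fffffff) % m;   // clear the sign bit so the result lies in [0, m-1]
}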
2) Because a hash table works by using a hash code to find the place in a table where an object should go, you get into trouble if multiple items are assigned to the same location. This is called a collision. Every hash table implementation must deal with collisions, either by putting items into nearby spots or by keeping a linked list of items in each location. No matter the solution, more collisions mean lower performance for your hash table. For that reason, it is recommended that you not let your hash table fill up; otherwise, collisions become more likely. Keeping your hash table at least twice as large as the number of items is a common recommendation to reduce the probability of collisions. Obviously, this means you will have to resize your table as it fills up. Yes, this means that you have to rehash each item, since it will go into a different location when you take the modulus by a different value. That is the hidden cost of a hash table: it runs in constant time (assuming few or no collisions), but it can have a large coefficient (amortized resizing, rehashing, etc.).
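A tiny illustration of why resizing forces a full rehash: the same hash code lands in a different slot once m changes (the numbers are arbitrary examples):
int hashCode = 123456;
int oldM = 11, newM = 23;          // table size before and after resizing
int oldIndex = hashCode % oldM;    // 3
int newIndex = hashCode % newM;    // 15 -- the entry has to move to a new slot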
3) It is also often recommended that you make the size of your hash table be a prime number. This is because it tends to produce a better distribution of items in your hash table in certain common use cases, thus avoiding collisions. Rather than giving a complete explanation here, I will refer you to this excellent answer: Why should hash functions use a prime number modulus?
I have a list of user interests marked with numbers.
Every user has several interests. How do I compose a number that represents a user's interests so that I'll be able to find other users with similar or close interests in a simple MongoDB query?
When there are n different interests, each user can be represented as a length-n vector of booleans where the i'th element is true iff the user has listed interest i. Two such vectors can be compared with cosine similarity, Jaccard similarity, L1 distance, L2 distance, etc.
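For instance, the Jaccard similarity of two such vectors could be sketched like this, using BitSet as the boolean vector (purely illustrative):
import java.util.BitSet;

// Jaccard similarity = |A ∩ B| / |A ∪ B| over the users' interest bits.
static double jaccard(BitSet a, BitSet b) {
    BitSet intersection = (BitSet) a.clone();
    intersection.and(b);
    BitSet union = (BitSet) a.clone();
    union.or(b);
    return union.isEmpty() ? 1.0 : (double) intersection.cardinality() / union.cardinality();
}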
No idea how to do it directly with MongoDB, but if you have a "biginteger" datatype, then reduce the interests to a bitfield. You can't then remove interests (without recalculating the bitfield for everybody), but you can add interests, since having them selected will just add more bits to the biginteger. Then to compare the interests of persons A and B, you have these operations, in C/C++-like syntax:
common=bitCount(A&B) how many common interests A and B have
onlyA=bitCount(A^(A&B)) how many interests A has, that B does not have
onlyB=bitCount(B^(A&B)) how many interests B has, that A does not have
different=bitCount(A^B) how many differing interests A and B have (same as onlyA+onlyB)
total=bitCount(A|B) how many interests A and B have combined (same as common+different)
From these numbers you can evaluate how closely the interests match; the exact formula depends on how you want to emphasize shared interests vs. different interests, and what scale you want to have.
At least Java's BigInteger class has a bit-counting method out of the box; otherwise it can be done with a brute-force loop using &1 and >>1 operations. I don't know if MongoDB supports such constructs or has an operator/function for the bit count of a big-int value, or even whether MongoDB has a big-int data type...
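In Java, the comparison above could be sketched with java.math.BigInteger roughly like this (the two sample bitfields are made up):
BigInteger a = new BigInteger("101101", 2);   // user A's interests as bits
BigInteger b = new BigInteger("100111", 2);   // user B's interests as bits

int common    = a.and(b).bitCount();          // interests both users share
int onlyA     = a.andNot(b).bitCount();       // A's interests that B lacks (same as A^(A&B))
int onlyB     = b.andNot(a).bitCount();       // B's interests that A lacks (same as B^(A&B))
int different = a.xor(b).bitCount();          // onlyA + onlyB
int total     = a.or(b).bitCount();           // common + different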
I could probably try to do this in the following way:
I will have all the interests as columns of a database table.
For every user, every column will have a value 0 or 1.
To find whether two users have close interests, I will retrieve the interest values from the DB and store them in a domain object (which has a field for each interest column). Then I will implement a comparator which will update an int field based on the number of matching columns. Based on this number I can decide on a logic, say for example if the total number of interests is 10 and matches > 7 then it is close, else not close, etc.
I have started working with a large dataset that is arriving in JSON format. Unfortunately, the service providing the data feed delivers a non-trivial number of duplicate records. On the up-side, each record has a unique Id number stored as a 64 bit positive integer (Java long).
The data arrives once a week and is about 10M records in each delivery. I need to exclude duplicates from within the current delivery as well as records that were in previous batches.
The brute force approach to attacking the de-dup problem is to push the Id number into a Java Set. Since the Set interface requires uniqueness, a failed insert will indicate a duplicate.
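For reference, the brute-force check looks roughly like this, where recordId stands for the incoming record's Id:
Set<Long> seenIds = new HashSet<>();
// Set.add() returns false when the element is already present, i.e. the record is a duplicate.
boolean duplicate = !seenIds.add(recordId);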
The question is: Is there a better way to look for a duplicate long as I import records?
I am using Hadoop to mine the data, so if there is a good way to use Hadoop to de-dup the records that would be a bonus.
Could you create a MapReduce task where the map output has a key of the unique ID number? That way, in your reduce task, you will be passed an iterator of all the values with that ID number. Output only the first value and your reduced output will be free of duplicates.
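A minimal sketch of such a reducer, assuming the mapper has already emitted (recordId, rawJsonRecord) pairs; the class name and types are illustrative:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable id, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        // All records sharing this ID arrive together; write only the first one.
        for (Text record : records) {
            context.write(id, record);
            break;
        }
    }
}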
Let me see. Each java.lang.Long takes 24 bytes. Each HashMap$Entry takes 24 bytes as well, and the array for the HashMap takes 4 bytes per entry. So you have 52 * 10M ≈ 520 MB of heap storage for the map. This is for the 10M records of one week, though.
If you are on a 64-bit system, you can just set the heap size to 5 GB and see how far you get.
There should be other implementations of java.util.Set that only consume about 16 bytes per entry, so you could handle roughly three times as much data as with a java.util.HashSet. I've written one myself, but I cannot share it. You may try GNU Trove instead.
You have to keep the list of unique ids in HDFS and rebuild it after every batch load.
Since the cardinality in your case is quite large (you can expect > 1B unique records in one year), your unique id list needs to be split into multiple parts, say N. The partitioning algorithm is domain specific. The general approach is to convert the ID into a long hash string (16 bytes is OK) and create 2^k buckets:
For k = 8, for example:
bucket #1 contains all IDs whose hash value starts with 0
bucket #2 contains all IDs whose hash value starts with 1
...
bucket #256 contains all IDs whose hash value starts with 255
On every new batch you receive, run the dedupe job first: the map reads the records, takes the record ID, hashes it, and outputs Key = bucket# (0..255 in our case) and Value = ID. Each reducer receives all IDs for a given bucket. The reducer loads ALL unique IDs already known in your system for that bucket into an internal Set and checks ALL incoming record IDs against this internal Set. If a record has an ID which is not known yet, you update the internal Set and output the record.
On reducer close you output the internal Set of unique IDs back to HDFS.
By splitting the whole set of IDs into a number of buckets you create a solution which scales well.
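The bucket-assignment step could be sketched roughly like this, with MD5 as the 16-byte hash and k = 8 as in the example above (names are illustrative):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hash the record ID and take the first byte of the digest as the bucket number (0..255).
static int bucketFor(long recordId) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(Long.toString(recordId).getBytes(StandardCharsets.UTF_8));
    return digest[0] & 0xFF;
}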
I have a table with over 100 thousand rows consisting of number pairs. A sample is shown below.
A B
0010 0010
0010 0011
0010 0019
0010 0056
0011 0010
0011 0011
0011 0019
0011 0040
0019 0010
0019 0058
Here the numbers in column A have possible pairs present in column B.
Explanation: the user will have several of these numbers, ranging from 10 - 100. Now, as we can see, for 0010 both 0011 and 0019 are present. So if the user has a list containing 0010 along with 0011, a warning will be shown that this pair is not allowed, and vice versa.
How to approach this in Java?
Loading a hash map with all the data does not seem to be a good option, although the search will be much faster.
Please suggest. Thanks.
Testcase:
String num = "0010"; // value from the list which the user will be passing
void test(String num) {
    if (num.equals("0019") || num.equals("0011")) // comparing with the database
        System.out.println("incompatible pair present");
}
The above example is a very simple piece of pseudo code. The actual problem will be much more complex.
Until the question is more clear...
Handling large amounts of data which are already stored in a database, let me give you a recommendation: whatever you want to do here, consider solving it with SQL instead of Java. Or at least write an SQL query whose resulting ResultSet is easy to evaluate in Java afterwards.
But as long as the question is not that clear ...
Are you trying to find entries where A is the same value but B is different?
SELECT t1.a, t1.b, t2.b
FROM MyTable t1, MyTable t2
WHERE t1.a = t2.a AND t1.b <> t2.b
If you're worried about running out of heap space, you could try using a persistent cache like Ehcache. I suggest you check the actual memory consumed before going for this solution, though.
It seems like your problem is limited to a very small domain - why can't you instantiate a two-dimensional array of boolean and set an entry to true whenever the indexes of two numbers create an unsupported combination?
Example for usage:
if (forbidden[10][11] || forbidden[11][10]) {
    throw new Exception("pairs of '10' and '11' are not allowed");
}
You can instantiate this array from the database by going over the data once and setting the entries. You just need to translate 0010 to 10. You will have junk in indexes 0-9, but you can eliminate that by "translating" the index (e.g. subtracting 10 from each number).
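A rough sketch of that loading step, assuming a JDBC Connection named conn, a table called pairs, and values that stay within 10-100 as the question states (all names are illustrative):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Build the lookup array once from the pairs table; numbers like "0010" become index 10.
static boolean[][] loadForbidden(Connection conn) throws SQLException {
    boolean[][] forbidden = new boolean[101][101];   // values 10..100 fit; indexes 0..9 stay unused
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT A, B FROM pairs")) {
        while (rs.next()) {
            int a = Integer.parseInt(rs.getString("A"));
            int b = Integer.parseInt(rs.getString("B"));
            forbidden[a][b] = true;
            forbidden[b][a] = true;   // mark both orders so one lookup is enough
        }
    }
    return forbidden;
}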
Does that hit your question?
If I have understood correctly what you want to do…
Create a unique index on t1(a, b). Put the user's new pair in an INSERT statement inside a try block. Catch key-violation exceptions (it will be a SQLException, possibly a subclass depending on your RDBMS) and explain to the user that it is a forbidden pair.
Simple - definitely not scalable - solution, if your ranges really are 0000 - 9999:
Simply have a byte table with 100,000,000 entries.
Each entry consists of a simple 0 for allowed or 1 for not allowed.
You find an entry in the table by logically concatenating the two pair numbers (key = first * 10000 + second).
The more scalable database solution is to create a table with a composite primary key (pair1 and pair2), the mere presence of an entry indicating a disallowed pair.
To clarify the question:
You have a table in which each record contains two numbers that are declared 'incompatible'.
You have a user's list of numbers and you want to check whether this list contains 'incompatible' numbers. Right?
Here you go with a simple SQL (took your example from comment):
SELECT *
FROM incompatible
WHERE A IN (1, 14, 67) AND B IN (1, 14, 67);
This SQL returns all incompatibilities. When the result set is empty, there are no incompatibilities and everything is fine. If you only want to retrieve this fact, then you can write SELECT 1 ... instead.
The SQL has to be built dynamically to contain the user's numbers in the IN clauses, of course.
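Building that SQL dynamically with bind parameters could look roughly like this; conn, the table and the column names are taken from the example above, the rest is illustrative:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.stream.Collectors;

// Returns true if the user's (non-empty) list of numbers contains an incompatible pair.
static boolean hasIncompatiblePair(Connection conn, List<Integer> userNumbers) throws SQLException {
    String placeholders = userNumbers.stream().map(n -> "?").collect(Collectors.joining(", "));
    String sql = "SELECT 1 FROM incompatible WHERE A IN (" + placeholders + ") AND B IN (" + placeholders + ")";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        int i = 1;
        for (int pass = 0; pass < 2; pass++) {     // bind the same numbers for both IN clauses
            for (Integer n : userNumbers) {
                ps.setInt(i++, n);
            }
        }
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next();                      // any row means an incompatible pair exists
        }
    }
}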
To speed up queries you can create a (unique) index over both columns. That way the database can do an index range scan (unique).
If this table does not yet contain a primary key then you should create a primary key over both columns.