I have a list of user interests marked with numbers.
Every user has several interests. How do I compose a number that represents a user's interests so that I can find other users with similar or close interests using a simple MongoDB query?
When there are n different interests, each user can be represented as a length-n vector of booleans where the i'th element is true iff the user has listed interest i. Two such vectors can be compared with cosine similarity, Jaccard similarity, L1 distance, L2 distance, etc.
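For illustration, here is a minimal Java sketch of that comparison, assuming the interests have already been mapped to indices 0..n-1 (the helper name and sample vectors are made up):

// Each user is a boolean vector of length n; Jaccard similarity = |A and B| / |A or B|.
static double jaccard(boolean[] a, boolean[] b) {
    int intersection = 0, union = 0;
    for (int i = 0; i < a.length; i++) {
        if (a[i] && b[i]) intersection++;
        if (a[i] || b[i]) union++;
    }
    return union == 0 ? 1.0 : (double) intersection / union; // 1.0 when neither user listed anything
}

// Example: users sharing interests 0 and 3 out of the union {0, 1, 3} -> 2/3
boolean[] userA = {true, true, false, true};
boolean[] userB = {true, false, false, true};
double similarity = jaccard(userA, userB);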
No idea how to do it directly in MongoDB, but if you have a "biginteger" data type, you can reduce the interests to a bitfield. You then can't remove interests (without recalculating the bitfield for everybody), but you can add interests, since selecting them just adds more bits to the biginteger. Then, to compare the interests of persons A and B, you have these operations, in C/C++-like syntax:
common=bitCount(A&B) how many common interests A and B have
onlyA=bitCount(A^(A&B)) how many interests A has, that B does not have
onlyB=bitCount(B^(A&B)) how many interests B has, that A does not have
different=bitCount(A^B) how many interests differ between A and B (same as onlyA+onlyB)
total=bitCount(A|B) how many distinct interests A and B have in total (same as common+different)
From these numbers you can evaluate how closely the interests match; the exact formula depends on how you want to weight shared interests versus differing interests and what scale you want to use.
At least Java's BigInteger class has a bit-counting method out of the box; otherwise it can be done with a brute-force loop using &1 and >>1 operations. I don't know whether MongoDB supports such constructs, has an operator/function for the bit count of a big-integer value, or even has a big-integer data type at all...
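For reference, a minimal sketch of those bit operations on the application side using java.math.BigInteger (the interest indices are made up; this does not answer whether MongoDB itself can evaluate them):

import java.math.BigInteger;

// Bit i is set iff the user selected interest i.
BigInteger a = BigInteger.ZERO.setBit(0).setBit(2).setBit(5); // interests 0, 2, 5
BigInteger b = BigInteger.ZERO.setBit(2).setBit(3).setBit(5); // interests 2, 3, 5

int common    = a.and(b).bitCount();    // interests both have: 2
int onlyA     = a.andNot(b).bitCount(); // A has, B doesn't: 1
int onlyB     = b.andNot(a).bitCount(); // B has, A doesn't: 1
int different = a.xor(b).bitCount();    // onlyA + onlyB: 2
int total     = a.or(b).bitCount();     // common + different: 4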
I could probably try to do this in the following way:
I will have all the interests as columns of a database table.
For every user, every column will have a value of 0 or 1.
To find whether two users have close interests, I will retrieve the interest values from the DB and store them in a domain object (which has a field for each interest column). Then I will implement a comparator that updates an int field based on the number of matching columns.
Based on this number I can decide on a rule, for example: if the total number of interests is 10 and matches > 7, then the users are close; otherwise they are not.
Suppose U is an ordered set of elements, S ⊆ U, and x ∈ U. S is being updated concurrently. I want to get an estimate of the number of elements in S that are less than x in O(log |S|) time.
S is maintained by another software component that I cannot change. However, whenever an element e is inserted into (or deleted from) S, I receive a message "e inserted" ("e deleted"). I don't want to maintain my own copy of S since memory is limited. I am looking for a structure, ES (perhaps using O(log |S|) space), from which I can get a reasonable estimate of the number of elements less than any given x. Assume that the entire set S can periodically be sampled to recreate or update ES.
Update: I think this problem statement needs more specific assumptions about U. One obvious case is where U is a set of numbers (int, double, etc.). Another case is where U consists of strings ordered lexicographically.
In the case of numbers one could use a probability distribution (but how can that be determined?).
I am wondering whether the set S can be scanned periodically: place the entire set into an array and sort it, then pick the log(n) values at positions n/log(n), 2n/log(n), ..., n, where n = |S|, and draw a histogram based on those values?
More generally, how can one find an appropriate probability distribution from S?
I'm not sure what the unit of measure would be for lexicographically ordered strings.
By concurrently, I'm assuming you mean thread-safe. In that case, I believe what you're looking for is a ConcurrentSkipListSet, which is essentially a concurrent TreeSet. You can use ConcurrentSkipListSet#headSet(x).size() or ConcurrentSkipListSet#tailSet(x).size() to get the number of elements less than or greater than (or equal to) a given element, and the set itself can be constructed with a custom Comparator.
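A small sketch of that idea (example values are made up; note that size() on the returned view walks its elements, so it is not an O(log n) operation):

import java.util.concurrent.ConcurrentSkipListSet;

ConcurrentSkipListSet<Integer> s = new ConcurrentSkipListSet<>();
s.add(10); s.add(20); s.add(30); s.add(40);

int x = 25;
int lessThanX = s.headSet(x).size(); // elements strictly less than x -> 2
int atLeastX  = s.tailSet(x).size(); // elements greater than or equal to x -> 2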
Is x constant? If so, it seems easy to track the number of elements less than x as they are inserted and deleted.
If x isn't constant you could still take a histogram approach. Divide up the range that values can take. As items are inserted / deleted, keep track of how many items are in each range bucket. When you get a query, sum up all the values from smaller buckets.
I accept your point that bucketing is tricky - especially if you know nothing about the underlying data. You could record the first 100 values of x, and use those to calculate a mean and a standard deviation. Then you could assume the values are normally distributed and calculate the bucket boundaries that way.
Obviously if you know more about the underlying data you can use a different distribution model. It would be easy enough to have a modular approach if you want it to be generic.
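To make the bucket idea concrete, here is a rough sketch assuming numeric values in a known range [min, max) split into fixed-width buckets (all names and parameters are made up; the bucket edges could instead come from the mean/standard-deviation approach above):

import java.util.concurrent.atomic.AtomicLongArray;

class RankHistogram {
    private final double min, width;
    private final AtomicLongArray counts;

    RankHistogram(double min, double max, int buckets) {
        this.min = min;
        this.width = (max - min) / buckets;
        this.counts = new AtomicLongArray(buckets);
    }

    private int bucketOf(double v) {
        int b = (int) ((v - min) / width);
        return Math.max(0, Math.min(counts.length() - 1, b)); // clamp out-of-range values
    }

    void onInserted(double v) { counts.incrementAndGet(bucketOf(v)); }
    void onDeleted(double v)  { counts.decrementAndGet(bucketOf(v)); }

    // Estimate of how many elements are less than x: sum the buckets entirely below x.
    long estimateLessThan(double x) {
        long sum = 0;
        for (int b = 0; b < bucketOf(x); b++) sum += counts.get(b);
        return sum;
    }
}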
I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that as the criteria, is there a difference between using the long integer as-is and taking it modulo n, or is it better to generate a hash code for the string version of the long integer (to improve the distribution) and then take that hash code modulo n? Is the additional string conversion necessary to get a uniform spread via the hash code?
Since I got feedback that my question does not have enough background information. I am adding some more information.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database representing an item id. The reason for pockets of missing identifiers is because of deletes.
The identifiers themselves are long integers.
The identifiers (items) number in the tens to hundreds of millions in some cases and only in the thousands in others.
Only in the case where the identifiers are in the order of millions do I want to really spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should resort to (Long).toString().hashCode() to get a uniform spread instead of using the long value directly. I had a feeling that toString().hashCode() would not buy me much, and I also did not like the fact that Java's hashCode() is not guaranteed to return the same value across Java revisions (although for String the hashCode implementation is documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
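Since the question is about long identifiers: Long.hashCode() is documented as (int)(value ^ (value >>> 32)), so for IDs that fit in 32 bits it is again just the value, and hashing before the modulo buys you nothing. A tiny illustration (the ID and bucket count are made up):

long id = 123_456_789L;
long n = 64; // number of buckets

long viaHash  = Math.floorMod((long) Long.hashCode(id), n); // 21
long directly = Math.floorMod(id, n);                       // 21 -- same bucket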
Your question, as it stands, cannot be answered. #slim's attempt is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly larger than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get the items in [0-199], [400-599] and [800-999], while the other 200 buckets get the items in [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
If they are not uniformly distributed then, since the modulo operator doesn't change the distribution except by wrapping it around, the output distribution will not be uniform either.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution. The whole point of the hash function is to break up the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming not too large inputs). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
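A rough Java rendering of that formula, taking the bucket index from the high bits of the 32-bit hash as in Knuth's method (the low bits of a multiplicative hash are the weak ones; the bucket count here is a made-up example and must be a power of two for this variant):

static int hash32(long input) {
    return (int) (input * 2654435761L); // input * 2654435761 mod 2^32
}

static int bucket(long input, int k) {  // n = 2^k buckets
    return hash32(input) >>> (32 - k);  // top k bits of the hash
}

// Example: spread sequential IDs over 16 buckets
for (long id = 1_000_000; id < 1_000_008; id++) {
    System.out.println(id + " -> " + bucket(id, 4));
}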
I found this gist has an interesting compilation of diverse hashing functions and their characteristics, you might pick one that best matches the characteristics of your dataset.
Edit: fixed typos and tried to clear up the ambiguity.
I have a list of five digit integers in a text file. The expected amount can only be as large as what a 5-digit integer can store. Regardless of how many there are, the FIRST line in this file tells me how many integers are present, so resizing will never be necessary. Example:
3
11111
22222
33333
There are 4 lines. The first says there are three 5-digit integers in the file. The next three lines hold these integers.
I want to read this file and store the integers (not the first line). I then want to be able to search this data structure A LOT, nothing else. All I want to do, is read the data, put it in the structure, and then be able to determine if there is a specific integer in there. Deletions will never occur. The only things done on this structure will be insertions and searching.
What would you suggest as an appropriate data structure? My initial thought was a binary tree of sorts; however, upon thinking, a HashTable may be the best implementation. Thoughts and help please?
It seems like the requirements you have are
store a bunch of integers,
where insertions are fast,
where lookups are fast, and
where absolutely nothing else matters.
If you are dealing with a "sufficiently small" range of integers - say, integers up to around 16,000,000 or so - you could just use a bitvector for this. You'd store one bit per number, all initially zero, and then set a bit whenever that number is entered. This has extremely fast lookups and extremely fast insertion, but is very memory-intensive and infeasible if the integers can be totally arbitrary. In Java this would probably be modeled with a BitSet.
If you are dealing with arbitrary integers, a hash table is probably the best option here. With a good hash function you'll get a great distribution across the table slots and very, very fast lookups. You'd want a HashSet for this.
If you absolutely must guarantee worst-case performance at all costs and you're dealing with arbitrary integers, use a balanced BST. The indirection costs in BSTs make them a bit slower than other data structures, but balanced BSTs can guarantee worst-case efficiency that hash tables can't. This would be represented by TreeSet.
Given that
All numbers are <= 99,999
You only want to check for existence of a number
You can simply use some form of bitmap.
e.g. create a byte[12500] (that's 100,000 bits, i.e. 100,000 booleans recording the existence of 0-99,999).
"Inserting" a number N means turning the N-th bit on. Searching a number N means checking if N-th bit is on.
Pseudocode for the insertion logic:
bitmap[number / 8] |= (1 << (number % 8));
and searching looks like:
(bitmap[number / 8] & (1 << (number % 8))) != 0;
If you understand the rationale, then there is even better news for you: Java already has BitSet, which does exactly what I described above.
So code looks like this:
BitSet bitset = new BitSet(100000); // size in bits (covers 0-99,999), not bytes
// inserting number
bitset.set(number);
// search if number exists
bitset.get(number); // true if exists
If the number of times each number occurs doesn't matter (as you said, only inserts and checking whether a number exists), then you'll only ever have a maximum of 100,000 values. Just create an array of booleans:
boolean[] numbers = new boolean[100000];
This should take only 100 kilobytes of memory.
Then, instead of adding a number like 11111, 22222 or 33333, do:
numbers[11111]=true;
numbers[22222]=true;
numbers[33333]=true;
To see if a number exists, just do:
int whichNumber = 11111;
numberExists = numbers[whichNumber];
There you are. Easy to read, easier to maintain.
A Set is the go-to data structure to "find", and here's a tiny amount of code you need to make it happen:
Scanner scanner = new Scanner(new FileInputStream("myfile.txt"));
Set<Integer> numbers = Stream.generate(scanner::nextInt) // lazily reads the 5-digit integers
        .limit(scanner.nextInt())  // the count on the first line is read here, before generation starts
        .collect(Collectors.toSet());
I am writing an algorithm to match students with different groups. Each group has a limited number of spots. Each student provides their top 5 choices of groups. The students are then placed into groups in a predetermined order (older students and students with perfect attendance are given higher priority). There is no requirement for groups to be filled entirely, but they cannot be filled past capacity.
I've looked into similar marriage problems such as the Gale-Shapley stable marriage algorithm, but the problem I am having is that there are far fewer groups than students and each group can accept multiple students.
What is the best way to implement such an algorithm so that the solution is fully optimized, i.e. there is no better arrangement of students in groups? In terms of scale, I'm placing roughly 600 students into 10-20 groups.
NB The close votes are terribly misplaced. Algorithm choice and design to solve an ambiguous problem is absolutely part of programming.
I think you'll get farther with Minimum Weight Bipartite Matching than with Stable Marriage. (This is also called the Hungarian method/algorithm, or Maximum Weight Matching, which gives you a minimum-weight matching just by negating the weights.)
You are out to match positions with students, so those are the two node types in the bipartite graph.
The simplest statement of the algorithm requires a complete weighted bipartite graph with equal numbers of nodes in each set. You can think of this as a square matrix: the weights are the elements, the rows are students, and the columns are positions.
The algorithm will pick a single element from each row/column such that the sum is minimized.
#nava's proposal is basically a greedy version of MWBM that's not optimal. The true Hungarian algorithm will give you an optimal answer.
Handling the fact that you have fewer positions than students is easy: to the "real" positions, add as many "dummy" positions as needed, and connect all of them to all the students with super-high-weight edges. The algorithm will only pick them after all the real positions are matched.
The trick is to pick the edge weights. Let O_i be the ordinal at which the i'th student would be considered for a position, let R_ip be the rank that the same student places on the p'th position, and let W_ip be the weight of the edge connecting the i'th student to the p'th position. You'll want something like:
W_ip = A * R_ip + B * O_i
You get to pick A and B to specify the relative importance of the students' preferences and the order they're ranked. It sounds like order is quite important. So in that case you want B to be big enough to completely override students' rankings.
For example, A = 1 and B = N^2, where N is the number of students.
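A rough sketch of building such a weight matrix, assuming one matrix column per available seat ("position") and that the result is fed into whatever Hungarian-algorithm / min-weight-matching solver you choose (the solver call and all names are placeholders, not a specific library):

/**
 * rank[i][p]  = R_ip, the rank student i gave the group owning seat p
 *               (1 = first choice, a large constant if not in the student's top 5)
 * order[i]    = O_i, the priority ordinal of student i (0 = highest priority)
 */
static long[][] buildCosts(int[][] rank, int[] order) {
    int students = rank.length;
    int realPositions = rank[0].length;
    long A = 1, B = (long) students * students; // B = N^2 so order dominates rankings
    long dummy = Long.MAX_VALUE / 4;            // dummy seats: picked only as a last resort

    long[][] cost = new long[students][students]; // square matrix: pad with dummy columns
    for (int i = 0; i < students; i++) {
        for (int p = 0; p < students; p++) {
            cost[i][p] = (p < realPositions)
                    ? A * rank[i][p] + B * order[i]  // W_ip = A * R_ip + B * O_i
                    : dummy;
        }
    }
    return cost;
}

// int[] seatOfStudent = someHungarianSolver(buildCosts(rank, order)); // placeholder solver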
Once you get an implementation working, it's actually fun to tweak the parameters to see how many students get what preference, etc. You might want to tweak the parameters a bit to give up a little on the order.
At the time I was working on this (late 90's), the only open source MWBM I could find was an ancient FORTRAN lib. It was O(N^3). It handled 1,000 students (selecting core academic program courses) in a few seconds. I spent a lot of time coding a fancy O(N^2 log N) version that turned out to be about 3x slower for N=1000. It only started "winning" at about 5,000.
These days there are probably better options.
I would modify the Knapsack problem (see the Wikipedia article on the knapsack problem) to work with K groups (knapsacks) instead of just one. You can assign a "value" to each student's preferences, and the number of spots would be the maximum "weight" of each knapsack. With this, you can backtrack to check what the optimal solution of the problem is.
I am not sure how efficient you need the problem to be, but I think this will work.
"The most mathematically perfect" is very opinion-based. Simplicity (almost) always wins here. Here is some pseudocode:
students <-- sorted by attendance
for i = 0 to n in students:
    studentAssigned = false
    groups <-- sorted by the i'th student's preference
    for j = 0 to m in groups:
        if group j has space then add student i to group j; studentAssigned = true; break
    if studentAssigned == false:
        add i to unallocated
for i = 0 to k in unallocated:
    allocate i to a random group that is not full
For each group:
create an ordered set and add all the students to it (you must design the heuristic that orders the students within the set; it could be, for example, the attendance level multiplied by 1 if the group is among the student's choices and by 0 otherwise).
Fill the group with the first n students, up to the group's capacity.
But there are some details that you didn't explain. For example, what happens to students who couldn't get into any of their 5 choices because those groups were filled by students with higher priority?
Hi, I am building a simple multilayer network which is trained using back-propagation. My problem at the moment is that some attributes in my dataset are nominal (non-numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting how many distinct values each attribute has and assigning each an equally spaced number between 0 and 1. For example, suppose one of my attributes had values A to E; would the following be suitable?
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part of my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above for each distinct output attribute value in the dataset in order to get a numerical representation? Also, after I get an output from the network, do I just see which number it is closest to? For example, if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the nearest value to the output (0.435) which is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as an output instead of an input, then when the network outputs the vector [0.01 -0.01 0.5 0.0001 -0.0002], you interpret that as size medium.
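A small sketch of the encoding and of reading the output back (the size labels and array layout are just an illustrative convention):

static final String[] SIZES = {"XS", "S", "M", "L", "XL"};

// "M" (index 2) -> [0, 0, 1, 0, 0]
static double[] encode(int categoryIndex) {
    double[] v = new double[SIZES.length];
    v[categoryIndex] = 1.0;
    return v;
}

// Network output -> the category with the largest activation (argmax).
static String decode(double[] output) {
    int best = 0;
    for (int i = 1; i < output.length; i++) {
        if (output[i] > output[best]) best = i;
    }
    return SIZES[best];
}

// decode(new double[]{0.01, -0.01, 0.5, 0.0001, -0.0002}) -> "M" (medium)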
In reply to your comment on #Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might also want to normalise the values of your 4 continuous inputs to the range [0, 1], because they may end up dominating your discrete variable.
It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to accurately perform its function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see the inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.