Algorithms/hash functions to generate many to one mappings - java

I am looking for hash functions that can be used to generate batches out of an integer stream. Specifically, I want to map integers xi from a set or stream (say X) to another set of integers or strings (say Y) such that many xi are mapped to one yj. While doing that, I want to ensure that at most n xi are mapped to a single yj. As with hashing, I need to be able to reliably find the y for a given x.
I would also like most of the yj to have close to n xi mapped to them (to avoid a very sparse mapping from X to Y).
One function I can think of is quotient:
int BATCH_SIZE = 3;
public int map(int x) {
    return x / BATCH_SIZE;
}
For a stream of sequential integers this can work fairly well, e.g. the stream 1..9 will be mapped to:
1 -> 0
2 -> 0
3 -> 1
4 -> 1
5 -> 1
6 -> 2
7 -> 2
8 -> 2
9 -> 3
and so on. However, for non-sequential large integers and a small batch size (my use case), this can generate a very sparse mapping (most batches end up with only 1 element).
Are there any standard ways to generate such a mapping (batching)?

There's no way to get this to work under these assumptions.
You need to know how many items are in the stream and their distribution, or you need to relax the requirement of mapping an item to its batch precisely.
Let's say you have items a and b from the stream.
Are you going to put them together in the same batch or not? You can't answer this unless you know whether you're going to get more items to fill the 2 or more batches (if you decide to put them in separate batches).
If you know how many items there will be (even approximately) you can take their distribution and build batches based on that. Say you have string hashes (uniformly distributed over 32 bits). If you know you are getting 1,000,000 items and you want batches of 100, you can split the 32-bit range into intervals of width 2^32 / (1,000,000 / 100) and use the interval index as the batch id (yj). This doesn't guarantee batches of exactly batch_size, but they should be approximately batch_size. If the distribution is not uniform things are more difficult, but it can still be done.
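As a rough illustration of the interval idea, here is a minimal sketch in Java (the class name is made up, the expected item count and batch size are estimates you would have to supply, and it assumes hashes that are roughly uniform over the full 32-bit range):

public class IntervalBatcher {
    private final long intervalWidth;

    // expectedItems and batchSize are caller-supplied estimates.
    public IntervalBatcher(long expectedItems, long batchSize) {
        long batches = Math.max(1, expectedItems / batchSize);
        // Split the 2^32 hash range into equally wide intervals, one per batch.
        this.intervalWidth = (1L << 32) / batches;
    }

    // The same x always yields the same batch id; nearby hash values share a batch.
    public long map(int x) {
        long unsigned = x & 0xFFFFFFFFL; // treat the hash as an unsigned 32-bit value
        return unsigned / intervalWidth;
    }
}

With 1,000,000 expected items and a batch size of 100 this gives about 10,000 intervals, so each interval should receive roughly 100 items if the hashes are close to uniform.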
If you relax the ability to map an item to its batch, then just group items into batches of batch_size as they come out of the stream. You could keep a map from stream item to batch if you have the space.

Related

Java Structure that is able to determine approximate number of elements less than x in an ordered set which is updated concurrently

Suppose U is an ordered set of elements, S ⊆ U, and x ∈ U. S is being updated concurrently. I want to get an estimate of the number of elements in S that are less than x in O(log(|S|)) time.
S is being maintained by another software component that I cannot change. However, whenever e is inserted into (or deleted from) S I get a message "e inserted" ("e deleted"). I don't want to maintain my own copy of S since memory is limited. I am looking for a structure, ES (perhaps using O(log(|S|)) space), from which I can get a reasonable estimate of the number of elements less than any given x. Assume that the entire set S can periodically be sampled to recreate or update ES.
Update: I think that this problem statement must include more specific values for U. One obvious case is where U are numbers (int, double, etc.). Another case is where U are strings ordered lexicographically.
In the case of numbers one could use a probability distribution (but how can that be determined?).
I am wondering if the set S can be scanned periodically: place the entire set into an array and sort it, then pick the log(n) values at positions n/log(n), 2n/log(n), ..., n where n = |S|, and draw a histogram based on those values?
More generally how can one find the appropriate probability distribution from S?
Not sure what the unit of measure would be for strings ordered lexicographically?
By concurrently, I'm assuming you mean thread-safe. In that case, I believe what you're looking for is a ConcurrentSkipListSet, which is essentially a concurrent TreeSet. You can use ConcurrentSkipListSet#headSet(x).size() or ConcurrentSkipListSet#tailSet(x).size() to get the number of elements less/greater than (or equal to) a single element, and you can pass in a custom Comparator.
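For example, a minimal sketch (the values are just for illustration; note that size() on the returned view walks its elements, so the count itself is not O(log n)):

import java.util.concurrent.ConcurrentSkipListSet;

public class HeadSetDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> s = new ConcurrentSkipListSet<>();
        s.add(5);
        s.add(10);
        s.add(20);
        // Elements strictly less than 15; size() traverses the view, so this is O(k), not O(log n).
        int lessThan15 = s.headSet(15).size();
        System.out.println(lessThan15); // prints 2
    }
}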
Is x constant? If so, it seems easy to track the number of elements less than x as they are inserted and deleted.
If x isn't constant you could still take a histogram approach. Divide up the range that values can take. As items are inserted / deleted, keep track of how many items are in each range bucket. When you get a query, sum up all the values from smaller buckets.
I accept your point that bucketing is tricky - especially if you know nothing about the underlying data. You could record the first 100 values of x, and use those to calculate a mean and a standard deviation. Then you could assume the values are normally distributed and calculate the buckets that way.
Obviously if you know more about the underlying data you can use a different distribution model. It would be easy enough to have a modular approach if you want it to be generic.
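A minimal sketch of the bucket-counting idea for numeric values (the uniform, fixed bucket boundaries here are purely for illustration; in practice you would derive them from the observed distribution as discussed above):

import java.util.concurrent.atomic.AtomicLongArray;

public class BucketEstimator {
    private final AtomicLongArray counts;
    private final double min, max;

    public BucketEstimator(double min, double max, int buckets) {
        this.min = min;
        this.max = max;
        this.counts = new AtomicLongArray(buckets);
    }

    private int bucketOf(double v) {
        int b = (int) ((v - min) / (max - min) * counts.length());
        return Math.min(Math.max(b, 0), counts.length() - 1); // clamp out-of-range values
    }

    public void onInsert(double v) { counts.incrementAndGet(bucketOf(v)); }
    public void onDelete(double v) { counts.decrementAndGet(bucketOf(v)); }

    // Approximate number of elements less than x: sum the counts of all buckets below x's bucket.
    public long estimateLessThan(double x) {
        long sum = 0;
        for (int i = 0; i < bucketOf(x); i++) sum += counts.get(i);
        return sum;
    }
}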

Is it possible to compare two DFEVar values on Kernel side

I am using Maxeler, MaxIDE.
I would like to use my input stream as output stream on the next cycle. I was hoping to decide this under an if condition. But the if condition won't allow me to compare two DFEVar(s). I was wondering is it possible?
Type mismatch: cannot convert from DFEVar to boolean
You cannot use a regular if statement to compare two DFEVars.
You should use the ternary operator instead. See point 2 below for more details.
You can find the detailed explanation in the Maxeler tutorials.
From the MaxCompiler Dataflow Programming Tutorial:
Conditionals in dataflow computing
There are three main methods of controlling conditionals that affect dataflow computation:
Global conditionals: These are typically large-scale modes of operation depending on input parameters with a relatively small number of options. If we need to select different computations based on input parameters, and these conditionals affect the dataflow portion of the design, we simply create multiple .max files for each case. Some applications may require certain transformations to get them into the optimal structure for supporting multiple .max files.
if (mode==1) p1(x); else p2(x);
where p1 and p2 are programs that use different .max files.
Local Conditionals: Conditionals depending on local state of a computation.
if (a > b) x = x + 1; else x = x - 1;
These can be transformed into dataflow computation as
x = (a > b) ? (x + 1) : (x - 1);
Conditional Loops: If we do not know how long we need to iterate around a loop, we need to know a bit about the loop's behavior and typical values for the number of loop iterations. Once we know the distribution of values we can expect, a dataflow implementation pipelines the optimal number of iterations and treats each block of iterations as an action for the SLiC interface, controlled by the CPU (or some other kernel).
The ternary-if operator ( ?: ) selects between two input streams. To select between more than two streams, the control.mux method is easier to use and read than nested ternary-if statements.
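For example, a kernel fragment along these lines should work (a rough sketch only; the stream names and types are assumptions, not taken from your code):

// Inside a Kernel subclass; "a" and "b" are hypothetical input streams.
DFEVar a = io.input("a", dfeUInt(32));
DFEVar b = io.input("b", dfeUInt(32));
// Instead of: if (a > b) ... else ...  (not allowed on DFEVars)
DFEVar out = (a > b) ? a : b;
io.output("out", out, dfeUInt(32));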

Faster methods for set intersection

I am facing a problem where for a number of words, I make a call to a HashMultimap (Guava) to retrieve a set of integers. The resulting sets have, say, 10, 200 and 600 items respectively. I need to compute the intersection of these three (or four, or five...) sets, and I need to repeat this whole process many times (I have many sets of words). However, what I am experiencing is that on average these set intersections take so long to compute (from 0 to 300 ms) that my program takes a very long time to complete if I look at hundreds of thousands of sets of words.
Is there any substantially quicker method to achieve this, especially given I'm dealing with (sortable) integers?
Thanks a lot!
If you are able to represent your sets as arrays of bits (bitmaps), you can intersect them with AND operations. You could even implement this to run in parallel.
As an example (using jlordo's question): if set1 is {1,2,4} and set2 is {1,2,5}
Then your first set would be represented as: 00010110 (bits set for 1, 2, and 4).
Your second set would be represented as: 00100110 (bits set for 1, 2, and 5).
If you AND them together, you get: 00000110 (bits set for 1 and 2)
Of course, if you had a larger range of integers, then you will need more bytes. The beauty of bitmap indexes is that they take just one bit per possible element, thus occupying a relatively small space.
In Java, for example, you could use the BitSet data structure (not sure if it can do operations in parallel, though).
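For example, using java.util.BitSet with the sets from above (a small sketch):

import java.util.BitSet;

public class BitSetIntersect {
    public static void main(String[] args) {
        BitSet set1 = new BitSet();
        for (int i : new int[] {1, 2, 4}) set1.set(i);
        BitSet set2 = new BitSet();
        for (int i : new int[] {1, 2, 5}) set2.set(i);
        set1.and(set2);           // set1 now holds the intersection
        System.out.println(set1); // prints {1, 2}
    }
}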
One problem with a bitmap-based solution is that even if the sets themselves are very small, if they contain very large numbers (or the range is unbounded) the bitmaps become very wasteful.
A different approach would be, for example, sorting the two sets, merging them and checking for duplicates. This can be done in O(n log n) time and O(n) extra space, given that the set sizes are O(n).
You should choose the solution that matches your problem description (input range, expected set sizes, etc.).
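A minimal sketch of the sorted-merge intersection for int arrays (assuming the inputs are already sorted; to intersect several sets, start with the two smallest):

import java.util.ArrayList;
import java.util.List;

public class SortedIntersection {
    // Two-pointer walk over two sorted arrays, O(n + m) once sorting is done.
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                result.add(a[i]);
                i++;
                j++;
            }
        }
        return result;
    }
}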
The post http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset describes an implementation of an ordered primitive long set with set operations (union, minus and intersection). In my experience it's quite efficient for dense or sparse value populations.

N-way merge sort a 2G file of strings

This is another question from Cracking the Coding Interview; I still have some doubts after reading it.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2 GB, it should tell you something - in this case, it suggests that they don't want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory at a time.
Algorithm:
How much memory do we have available? Let’s assume we have X MB of memory available.
1. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory and sort the lines as usual using any O(n log n) algorithm. Save the sorted lines back to the file.
2. Now bring the next chunk into memory and sort it.
3. Once we're done, merge them one by one.
The above algorithm is also known as external sort. Step 3 is known as an N-way merge.
The rationale behind using external sort is the size of the data. Since the data is too huge to bring it all into memory, we need to use a disk-based sorting algorithm.
Doubt:
When doing the merge in step 3, while comparing 2 arrays, do we need 2*X space each time we compare? The limit was X MB. Should we instead make each chunk X/2 MB, so that there are 2K chunks and (X/2) * 2K = 2 GB? Or am I just misunderstanding the merge sort?
Thanks!
http://en.wikipedia.org/wiki/External_sorting
A quick look on Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers but you will only hold one line from each file in memory at any given time. You will compare the lines you have in memory and then output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not in memory), then overwrite that line with the next line from that file (in our example, file 5) into memory and repeat until you reach the end of all the chunks.
First off, step 3 itself is not a merge sort, the whole thing is a merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B into a single group Y, then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
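A minimal sketch of such a two-way merge for sorted int arrays (illustrative only):

public class TwoWayMerge {
    // Merge two sorted arrays, holding just the "next" element from each at a time.
    public static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}

In the external-sort case the inputs are files rather than arrays, and only the current element from each is held in memory, but the comparison logic is the same.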
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M memory available to you and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase which means a proper N-way merge will need 100K by 20, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where, for example, each pass merges ten lists. Ten groups of 100K is only a meg, so it will fit within your memory constraint, and it will result in fewer merge passes.
The merging process is much simpler than that. You'll be outputting them to a new file, but basically you only need constant memory: you only need to read one element from each of the two input files at a time.
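To make the N-way merge concrete, here is a rough sketch using a priority queue over already-sorted chunk files (the class and method names are made up, natural string ordering is assumed, and only one line per chunk is ever held in memory):

import java.io.*;
import java.util.*;

public class NWayMerge {
    // Holds a chunk reader together with its current (smallest unmerged) line.
    private static class ChunkCursor {
        String line;
        BufferedReader reader;
        ChunkCursor(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    public static void merge(List<File> sortedChunks, File output) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.line));
        for (File chunk : sortedChunks) {
            BufferedReader r = new BufferedReader(new FileReader(chunk));
            String first = r.readLine();
            if (first != null) heap.add(new ChunkCursor(first, r));
            else r.close();
        }
        try (BufferedWriter w = new BufferedWriter(new FileWriter(output))) {
            while (!heap.isEmpty()) {
                ChunkCursor smallest = heap.poll();
                w.write(smallest.line);
                w.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {
                    smallest.line = next;
                    heap.add(smallest); // re-insert with its new current line
                } else {
                    smallest.reader.close();
                }
            }
        }
    }
}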

Distributedly "dumping"/"compressing" data samples

I'm not really sure what the right title for my question is, so here's the question.
Suppose I have N samples, e.g.: 1, 2, 3, 4, ..., N.
Now I want to "reduce" the size of the sample from N to M, by dumping (N-M) data from the N samples.
I want the dumping to be as "distributed" as possible,
so like if I have 100 samples and want to compress it to 50 samples, I would throw away every other sample. Another example, say the data is 100 samples and I want to compress it to 25 samples. I would throw away 1 sample in the each group of 100/25 samples, meaning I iterate through each sample and count, and every time my count reaches 4 I would throw away the sample and restart the count.
The problem is how do I do this if the 4 above was to be 2.333 for example. How do I treat the decimal point to throw away the sample distributively?
Thanks a lot..
The terms you are looking for are resampling, downsampling and decimation. Note that in the general case you can't just throw away a subset of your data without risking aliasing. You need to low pass filter your data first, prior to decimation, so that there is no information above your new Nyquist rate which would be aliased.
When you want to downsample by a non-integer value, e.g. 2.333 as per your example above you would normally do this by upsampling by an integer factor M and then downsampling by a different integer factor N, where the fraction M/N gives you the required resampling factor. In your example M = 3 and N = 7, so you would upsample by a factor of 3 and then downsample by a factor of 7.
You seem to be talking about sampling rates and digital signal processing.
Before you reduce, you normally filter the data to make sure high frequencies in your samples are not aliased to lower frequencies. For instance, in your example (take every fourth value), a frequency that repeats every four samples will alias to the "DC" or zero-cycle frequency: the sequence "234123412341", sampled at the first of every group of four, becomes "2,2,2,2", which might not be what you want. (A three-sample cycle such as "231231231231" would alias to a cycle similar to itself.) Filtering is a little beyond what I would like to discuss right now, as it's a fairly advanced topic.
If you can represent your "2.333" as a fraction, let's see, that's 7/3. You were talking about taking 1 out of every 4 samples (1/4), so here I would say you're taking 3 out of every 7 samples. So you might do (take, drop, take, drop, take, drop, drop), but there might be other methods.
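One way to spread the kept samples out evenly for any such fraction is a simple accumulator, sketched below (a rough illustration only; it keeps roughly "keep" out of every "groupSize" samples and does no filtering of any kind):

import java.util.ArrayList;
import java.util.List;

public class EvenDecimator {
    // Keep "keep" out of every "groupSize" samples, spaced as evenly as possible.
    // E.g. keep = 3, groupSize = 7 approximates dividing the sample count by 2.333.
    public static double[] decimate(double[] samples, int keep, int groupSize) {
        List<Double> out = new ArrayList<>();
        int acc = 0;
        for (double s : samples) {
            acc += keep;
            if (acc >= groupSize) { // time to take a sample
                acc -= groupSize;
                out.add(s);
            }
        }
        double[] result = new double[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}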
For audio data that you want to sound decent (as opposed to aliased and distorted in the frequency domain), see Paul R.'s answer involving resampling. One method of resampling is interpolation, such as using a windowed-Sinc interpolation kernel which will properly low-pass filter the data as well as allow creating interpolated intermediate values.
For non-sampled and non-audio data, where you just want to throw away some samples in a close-to-evenly distributed manner, and don't care about adding frequency domain noise and distortion, something like this might work:
float myRatio = (float)(N - 1) / (float)(M - 1); // check to make sure M > 1 beforehand
for (int i = 0; i < M; i++) {
    int j = (int)roundf(myRatio * (float)i); // nearest bin decimation
    myNewArrayLengthM[i] = myOldArrayLengthN[j];
}
