Matching Numbers Between two sets [closed] - java

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have two large data sets of numeric keys (millions of entries in each) and need to set up a data structure where I can quickly identify key matches between the two sets, allowing for some fixed variation.
So for instance, if there's a value of 356 in one set, I'd like to find any instances of 355, 356 or 357 in the other set. My initial idea was to set up two HashMaps, iterate over the one with fewer keys, and then query the larger one over the range (so querying for 355, 356, or 357 in the larger map).
Is there a particular data structure/matching algorithm for numeric values that I should be looking into?

Maybe a Java BitSet could be useful in that case. Here's a code sample that uses BitSets of size 1000000 and a range of 5 to check, for each value in the first set, how many values in the second set fall within that range:
import java.util.BitSet;

class CheckRange
{
    public static void main (String[] args)
    {
        int range = 5;
        int maxSize = 1000000;
        // Prepare the main BitSet (bs)
        BitSet bs = new BitSet(maxSize);
        bs.set(357);
        bs.set(599001);
        bs.set(123456);
        // ...
        // Prepare the BitSet to check in
        BitSet bs2 = new BitSet(maxSize);
        bs2.set(5688);
        bs2.set(566685);
        bs2.set(988562);
        // ...
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i+1)) {
            // Compute the range [minIndex, maxIndex), clamped to the valid indices
            int minIndex = Math.max(i - range, 0);
            // BitSet.get(from, to) excludes 'to', hence the + 1
            int maxIndex = Math.min(i + range + 1, maxSize);
            // Extract the matching subset
            BitSet subset = bs2.get(minIndex, maxIndex);
            // Print the number of bits set
            System.out.println("Number of bits set in bs2 around index " + i + " from bs is " + subset.cardinality());
        }
    }
}

I'd suggest you start with the Java Set. The "matches between the two sets" that you are seeking sounds a lot like a set intersection.
See API for set operations in Java? and take a look at the description of retainAll.

I will try to summarize a little bit.
Option one - sorted arrays. With binary search, you will be able to find an exact value with O(log N) complexity (here and below N is the number of elements in the structure). So, for your operation: log N (search in the first set) + log N (search in the second) + constant (check what you called variation), which is 2 * log N + constant, i.e. O(log N). If the data in the collections is changing, a similar binary search finds the insertion position in O(log N), but shifting the array elements to make room costs O(N).
Option two - use a Java Set. With a TreeSet, .contains() is O(log N) (a HashSet gives expected O(1)), and you'll have to call .contains() for each element of the variation, so we have O(|V| * log N), where |V| is the variation size. Adding an element to a TreeSet is also O(log N).
Decision: I'd choose the Java Set, because there is much less code to write and you do not need to debug code that searches for or adds elements.
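As a rough sketch of the Set-based option, assuming the keys fit in a long and the variation is a small fixed integer: a TreeSet can answer the whole range query in one call to subSet, so there is no need to call .contains() once per offset. The names ToleranceMatcher and matches are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class ToleranceMatcher {
    /**
     * For every key in 'smaller', collects the keys of 'larger' that lie
     * within +/- tolerance of it. Each hit is returned as {key, match}.
     * Assumption: key +/- tolerance does not overflow a long.
     */
    static List<long[]> matches(Iterable<Long> smaller, NavigableSet<Long> larger, long tolerance) {
        List<long[]> result = new ArrayList<>();
        for (long key : smaller) {
            // both bounds inclusive: [key - tolerance, key + tolerance]
            for (long match : larger.subSet(key - tolerance, true, key + tolerance, true)) {
                result.add(new long[] {key, match});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<Long> larger = new TreeSet<>(Arrays.asList(100L, 355L, 357L, 400L));
        for (long[] pair : matches(Arrays.asList(356L, 500L), larger, 1)) {
            System.out.println(pair[0] + " ~ " + pair[1]);
        }
    }
}
```

Each lookup costs O(log N) plus the number of matches it returns, so iterating over the smaller set keeps the total work down.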

Related

Finding mode for every window of size k in an array

Given an array of size n and k, how do you find the mode for every contiguous subarray of size k?
For example
arr = 1 2 2 6 6 1 1 7
k = 3
ans = 2 2 6 6 1 1
I was thinking of having a hashmap where the key is the number and the value is its frequency, a treemap where the key is the frequency and the value is the number, and a queue to remove the first element when the size > k. Here the time complexity is O(n log(n)). Can we do this in O(1)?
This can be done in O(n) time
I was intrigued by this problem in part because, as I indicated in the comments, I felt certain that it could be done in O(n) time. I had some time over this past weekend, so I wrote up my solution to this problem.
Approach: Mode Frequencies
The basic concept is this: the mode of a collection of numbers is the number(s) which occur with the highest frequency within that set.
This means that whenever you add a number to the collection, if the number added was not already one of the mode-values then the frequency of the mode would not change. So with the collection (8 9 9) the mode-values are {9} and the mode-frequency is 2. If you add say a 5 to this collection (8 9 9 5) neither the mode-frequency nor the mode-values change. If instead you add an 8 to the collection (8 9 9 8) then the mode-values change to {9, 8} but the mode-frequency is still unchanged at 2. Finally, if you instead added a 9 to the collection (8 9 9 9), now the mode-frequency goes up by one.
Thus in all cases when you add a single number to the collection, the mode-frequency is either unchanged or goes up by only one. Likewise, when you remove a single number from the collection, the mode-frequency is either unchanged or goes down by at most one. So all incremental changes to the collection result in only two possible new mode-frequencies. This means that if we had all of the distinct numbers of the collection indexed by their frequencies, then we could always find the new Mode in a constant amount of time (i.e., O(1)).
To accomplish this I use a custom data structure ("ModeTracker") that has a multiset ("numFreqs") to store the distinct numbers of the collection along with their current frequency in the collection. This is implemented with a Dictionary<int, int> (I think that this is a Map in Java). Thus given a number, we can use this to find its current frequency within the collection in O(1).
This data structure also has an array of sets ("freqNums") that given a specific frequency will return all of the numbers that have that frequency in the current collection.
I have included the code for this data structure class below. Note that this is implemented in C# as I do not know Java well enough to implement it there, but I believe that a Java programmer should have no trouble translating it.
(pseudo)Code:
// requires: using System.Collections.Generic; using System.Linq;
class ModeTracker
{
    HashSet<int>[] freqNums; //numbers at each frequency
    Dictionary<int, int> numFreqs; //frequencies for each number
    int modeFreq_ = 0; //frequency of the current mode

    public ModeTracker(int maxFrequency)
    {
        // +2: slot 0, plus one spare slot because the window briefly holds
        // maxFrequency + 1 numbers (a number is added before one is removed)
        freqNums = new HashSet<int>[maxFrequency + 2];
        // populate every slot, so we don't have to check for null later
        for (int i = 0; i < freqNums.Length; i++)
        {
            freqNums[i] = new HashSet<int>();
        }
        numFreqs = new Dictionary<int, int>();
    }

    public int Mode { get { return freqNums[modeFreq_].First(); } }

    public void addNumber(int n)
    {
        adjustNumberCount(n, 1);
        // new mode-frequency is one greater or the same
        if (freqNums[modeFreq_+1].Count > 0) modeFreq_++;
    }

    public void removeNumber(int n)
    {
        adjustNumberCount(n, -1);
        // new mode-frequency is the same or one less
        if (freqNums[modeFreq_].Count == 0) modeFreq_--;
    }

    int adjustNumberCount(int num, int adjust)
    {
        // make sure we already have this number
        if (!numFreqs.ContainsKey(num))
        {
            // add entries for it
            numFreqs.Add(num, 0);
            freqNums[0].Add(num);
        }
        // now adjust this number's frequency
        int oldFreq = numFreqs[num];
        int newFreq = oldFreq + adjust;
        numFreqs[num] = newFreq;
        // remove the old freq for this number and add the new one
        freqNums[oldFreq].Remove(num);
        freqNums[newFreq].Add(num);
        return newFreq;
    }
}
Also, below is a small C# function that demonstrates how to use this data structure to solve the problem originally posed in the question.
int[] ModesOfSubarrays(int[] arr, int subLen)
{
    ModeTracker tracker = new ModeTracker(subLen);
    int[] modes = new int[arr.Length - subLen + 1];
    for (int i=0; i < arr.Length; i++)
    {
        //add every number into the tracker
        tracker.addNumber(arr[i]);
        if (i >= subLen)
        {
            // remove the number that just rotated out of the window
            tracker.removeNumber(arr[i-subLen]);
        }
        if (i >= subLen - 1)
        {
            // add the new Mode to the output
            modes[i - subLen + 1] = tracker.Mode;
        }
    }
    return modes;
}
I have tested this and it does appear to work correctly for all of my tests.
Complexity Analysis
Going through the individual steps of the `ModesOfSubarrays()` function:
1. The new ModeTracker object is created in O(n) time or less.
2. The modes[] array is created in O(n) time.
3. The for(..) loop runs n times:
   3a: the addNumber() function takes O(1) time
   3b: the removeNumber() function takes O(1) time
   3c: getting the new Mode takes O(1) time
So the total time is O(n) + O(n) + n*(O(1) + O(1) + O(1)) = O(n)
Please let me know of any questions that you might have about this code.
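Since the question is tagged Java, here is one way the C# above might be translated. Treat it as a sketch rather than the answer's own code: the names (ModeTrackerSketch, modesOfSubarrays) are mine, numbers whose frequency drops to zero are simply forgotten instead of being kept in slot 0, and remove() is assumed to be called only for numbers currently in the window.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModeTrackerSketch {
    private final List<Set<Integer>> freqNums = new ArrayList<>();  // numbers at each frequency
    private final Map<Integer, Integer> numFreqs = new HashMap<>(); // frequency of each number
    private int modeFreq = 0;                                       // frequency of the current mode

    ModeTrackerSketch(int maxFrequency) {
        // slots 0..maxFrequency+1: the window briefly holds one extra number
        for (int i = 0; i <= maxFrequency + 1; i++) {
            freqNums.add(new LinkedHashSet<>());
        }
    }

    /** Any one of the current mode values; the tracker must not be empty. */
    int mode() {
        return freqNums.get(modeFreq).iterator().next();
    }

    void add(int n) {
        adjust(n, 1);
        // the mode frequency either stays the same or grows by one
        if (modeFreq + 1 < freqNums.size() && !freqNums.get(modeFreq + 1).isEmpty()) modeFreq++;
    }

    void remove(int n) {
        adjust(n, -1);
        // the mode frequency either stays the same or shrinks by one
        if (modeFreq > 0 && freqNums.get(modeFreq).isEmpty()) modeFreq--;
    }

    private void adjust(int num, int delta) {
        int oldFreq = numFreqs.getOrDefault(num, 0);
        int newFreq = oldFreq + delta;
        freqNums.get(oldFreq).remove(num);
        if (newFreq == 0) {
            numFreqs.remove(num);
        } else {
            numFreqs.put(num, newFreq);
            freqNums.get(newFreq).add(num);
        }
    }

    /** Mode of every contiguous window of length k, in O(n) expected time. */
    static int[] modesOfSubarrays(int[] arr, int k) {
        ModeTrackerSketch tracker = new ModeTrackerSketch(k);
        int[] modes = new int[arr.length - k + 1];
        for (int i = 0; i < arr.length; i++) {
            tracker.add(arr[i]);
            if (i >= k) tracker.remove(arr[i - k]); // drop the number that left the window
            if (i >= k - 1) modes[i - k + 1] = tracker.mode();
        }
        return modes;
    }
}
```

When several numbers tie for the mode, which one mode() returns depends on the set's iteration order, just as First() does in the C# version.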

Java performance : Java logic to find the 3 highest number from huge list [duplicate]

This question already has answers here:
Find the k largest elements in order
(7 answers)
Closed 4 years ago.
This was the question I was asked in an interview for checking the performance knowledge.
Question - I have a list (Arraylist by default and if you wanna change the list then justify) of integers.
There are millions of entries with random int values.
Values can repeat.
From this list I need to find 3 highest unique numbers in below cases.
1) when time is limited (time effective)
2) when memory is limited (memory effective)
I attempted the questions but couldn't get the effective solution.
My solution was to use stream API,
then distinct() to get unique numbers
sorted() to sort the list
And then display top 3 after collecting.
However, they said you don't need to sort.
I thought of using 3 variables to hold top 3 values.
Then I iterate over the list and check whether the current value is higher than any of the top 3. If it is, I swap it in.
However, this requires several comparisons at every iteration.
Can anyone suggest better ways to solve this problem?
Also, I'd be very thankful if anyone could provide links, descriptions, or approaches for solving such performance-related problems.
Edit : output required is top 3 unique values
If you don't have any space limitations, then one option might be to add your list collection to a Java TreeSet:
List<Integer> list = new ArrayList<>();
// populate the above list
// descendingSet() returns a NavigableSet view, so no cast to TreeSet is needed
NavigableSet<Integer> set = new TreeSet<>(list).descendingSet();
Each entry in your list will be placed into a red black tree behind TreeSet, and duplicates will be automatically removed. I make a call to TreeSet#descendingSet above, to give us a sorted set which will iterate in descending order by default.
Now all that is needed is to iterate the first three entries:
int count = 0;
Iterator<Integer> iterator = set.iterator();
while(iterator.hasNext() && count < 3) {
    System.out.println("Value #" + count + " = " + iterator.next());
    ++count;
}
As for the memory limited approach, you would likely have to resort to some sort of in place sorting algorithm, which uses either only the original data structure, or perhaps just a bit extra. I have little expertise with such solutions, so I won't attempt anything other than what I just mentioned.
The following method will return the n largest elements of the given list. We iterate over the list and add the elements to a TreeSet, which stores its elements in sorted order (with O(log n) insertion, since the set never holds more than n + 1 elements). When the number of elements in the set exceeds n, the first element (i.e. the smallest) is removed. Additionally, the set does not allow duplicate entries.
public static <T extends Comparable<T>> List<T> highest(List<T> list, int n) {
    final TreeSet<T> set = new TreeSet<T>();
    for (final T t : list) {
        set.add(t);
        if (set.size() > n)
            set.pollFirst();
    }
    return new ArrayList<T>(set);
}
Example (for the list [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]):
public static void main(String[] args) {
    final List<Integer> list = new ArrayList<Integer>();
    for (int i = 0; i < 10; ++i)
        list.add(i);
    System.out.println(highest(list, 3));
}
Output:
[7, 8, 9]
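For the memory-limited case, the three-variable idea from the question can be sketched as a single pass that uses O(1) extra memory and only a handful of comparisons per element. This is an illustration under my own naming (TopThree, topThreeDistinct), not code from any of the answers.

```java
import java.util.ArrayList;
import java.util.List;

class TopThree {
    /** Returns up to three largest distinct values, largest first. */
    static List<Integer> topThreeDistinct(Iterable<Integer> values) {
        Integer first = null, second = null, third = null;
        for (int v : values) {
            // skip values that are already held, so the result stays distinct
            if ((first != null && v == first)
                    || (second != null && v == second)
                    || (third != null && v == third)) {
                continue;
            }
            if (first == null || v > first) {
                third = second;
                second = first;
                first = v;
            } else if (second == null || v > second) {
                third = second;
                second = v;
            } else if (third == null || v > third) {
                third = v;
            }
        }
        List<Integer> result = new ArrayList<>();
        if (first != null) result.add(first);
        if (second != null) result.add(second);
        if (third != null) result.add(third);
        return result;
    }
}
```

For most elements of a large random list the value is smaller than the current third-highest, so it falls through every branch and is discarded without touching the three variables.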

Find all the ways you can go up an n step staircase if you can take k steps at a time such that k <= n

This is a problem I'm trying to solve on my own to be a bit better at recursion (not homework). I believe I found a solution, but I'm not sure about the time complexity (I'm aware that DP would give me better results).
Find all the ways you can go up an n step staircase if you can take k steps at a time such that k <= n
For example, if my step sizes are [1,2,3] and the size of the staircase is 10, I could take 10 steps of size 1 [1,1,1,1,1,1,1,1,1,1]=10 or I could take 3 steps of size 3 and 1 step of size 1 [3,3,3,1]=10
Here is my solution:
static List<List<Integer>> problem1Ans = new ArrayList<List<Integer>>();

public static void problem1(int numSteps){
    int [] steps = {1,2,3};
    problem1_rec(new ArrayList<Integer>(), numSteps, steps);
}

public static void problem1_rec(List<Integer> sequence, int numSteps, int [] steps){
    if(problem1_sum_seq(sequence) > numSteps){
        return;
    }
    if(problem1_sum_seq(sequence) == numSteps){
        problem1Ans.add(new ArrayList<Integer>(sequence));
        return;
    }
    for(int stepSize : steps){
        sequence.add(stepSize);
        problem1_rec(sequence, numSteps, steps);
        sequence.remove(sequence.size()-1);
    }
}

public static int problem1_sum_seq(List<Integer> sequence){
    int sum = 0;
    for(int i : sequence){
        sum += i;
    }
    return sum;
}

public static void main(String [] args){
    problem1(10);
    System.out.println(problem1Ans.size());
}
My guess is that this runtime is k^n where k is the number of step sizes, and n is the number of steps (3 and 10 in this case).
I came to this answer because each step size has a loop that calls k number of step sizes. However, the depth of this is not the same for all step sizes. For instance, the sequence [1,1,1,1,1,1,1,1,1,1] has more recursive calls than [3,3,3,1] so this makes me doubt my answer.
What is the runtime? Is k^n correct?
TL;DR: Your algorithm is O(2^n), which is a tighter bound than O(k^n), but because of some easily corrected inefficiencies the implementation runs in O(k^2 × 2^n).
In effect, your solution enumerates all of the step-sequences with sum n by successively enumerating all of the viable prefixes of those step-sequences. So the number of operations is proportional to the number of step sequences whose sum is less than or equal to n. [See Notes 1 and 2].
Now, let's consider how many possible prefix sequences there are for a given value of n. The precise computation will depend on the steps allowed in the vector of step sizes, but we can easily come up with a maximum, because any step sequence corresponds to a subset of the set of integers from 1 to n (the set of partial sums it reaches), and we know that there are precisely 2^n such subsets.
Of course, not all subsets qualify. For example, if the set of step-sizes is [1, 2], then you are enumerating Fibonacci sequences, and there are O(φ^n) such sequences. As k increases, you will get closer and closer to O(2^n). [Note 3]
Because of the inefficiencies in your code, as noted, your algorithm is actually O(k^2 α^n) where α is some number between φ and 2, approaching 2 as k approaches infinity. (φ is 1.618..., or (1+sqrt(5))/2.)
There are a number of improvements that could be made to your implementation, particularly if your intent was to count rather than enumerate the step sizes. But that was not your question, as I understand it.
Notes
1. That's not quite exact, because you actually enumerate a few extra sequences which you then reject; the cost of these rejections is a multiplier by the size of the vector of possible step sizes. However, you could easily eliminate the rejections by terminating the for loop as soon as a rejection is noticed.
2. The cost of an enumeration is O(k) rather than O(1) because you compute the sum of the sequence arguments for each enumeration (often twice). That produces an additional factor of k. You could easily eliminate this cost by passing the current sum into the recursive call (which would also eliminate the multiple evaluations). It is trickier to avoid the O(k) cost of copying the sequence into the output list, but that can be done using a better (structure-sharing) data-structure.
3. The question in your title (as opposed to the problem solved by the code in the body of your question) does actually require enumerating all possible subsets of {1…n}, in which case the number of possible sequences would be exactly 2^n.
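The two easy fixes mentioned in the notes (carrying the remaining sum into the recursive call, and stopping the loop at the first step that overshoots) might look roughly like this. It is a sketch with names of my own choosing (StairSequences, enumerate), and it assumes every step size is a positive integer; the step sizes are sorted first so that breaking out of the loop is safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StairSequences {
    /** Enumerates every ordered sequence of step sizes that sums to numSteps. */
    static List<List<Integer>> enumerate(int numSteps, int[] stepSizes) {
        int[] sorted = stepSizes.clone();
        Arrays.sort(sorted); // ascending, so the loop can stop at the first oversized step
        List<List<Integer>> out = new ArrayList<>();
        recurse(numSteps, sorted, new ArrayList<>(), out);
        return out;
    }

    private static void recurse(int remaining, int[] steps,
                                List<Integer> sequence, List<List<Integer>> out) {
        if (remaining == 0) {
            out.add(new ArrayList<>(sequence)); // copying is still linear in the sequence length
            return;
        }
        for (int step : steps) {
            if (step > remaining) break;        // all later steps are at least as large
            sequence.add(step);
            recurse(remaining - step, steps, sequence, out); // no re-summing of the sequence
            sequence.remove(sequence.size() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(enumerate(10, new int[] {1, 2, 3}).size());
    }
}
```

With these changes no rejected sequence is ever built, so the work is proportional to the number of viable prefixes plus the cost of copying each complete sequence.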
If you want to solve this recursively, you should use a different pattern that allows caching of previous values, like the one used when calculating Fibonacci numbers. The code for the Fibonacci function is basically the same as what you seek: it adds the previous two numbers by index and returns the result as the current number. You can use the same technique in your recursive function, but instead of adding f(k-1) and f(k-2), sum f(k-steps[i]) over all step sizes. Something like this (I don't have a Java syntax checker, so bear with syntax errors please):
// needs java.util.ArrayList, java.util.HashMap, java.util.List, java.util.Map
static Map<Integer, Integer> cache = new HashMap<>(); // numSteps -> number of ways
static List<Integer> storedSteps = null; // if used with same value of steps, don't clear cache
public static Integer problem1(Integer numSteps, List<Integer> steps) {
    if (!steps.equals(storedSteps)) { // check equality data wise, not link wise
        storedSteps = new ArrayList<>(steps); // keep a copy of the step sizes
        cache.clear(); // remove all data - now invalid
        // TODO make cache+storedSteps a single structure
    }
    return problem1_rec(numSteps, steps);
}
private static Integer problem1_rec(Integer numSteps, List<Integer> steps) {
    if (0 > numSteps) { return 0; }
    if (0 == numSteps) { return 1; }
    Integer cached = cache.get(numSteps);
    if (cached != null) { return cached; } // cache hit
    Integer acc = 0;
    for (Integer i : steps) { acc += problem1_rec(numSteps - i, steps); }
    cache.put(numSteps, acc); // cache miss: remember the result
    return acc;
}

Java recursive difference in array [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm currently learning Java and I stumbled on an exercise I can't finish.
The task is to write a recursive method that takes an array and returns the difference of the greatest and smallest value.
For example {12, 5, 3, 8} should return 5 (8 - 3). It is important to note that it is only allowed to compare values in their right order (result = rightValue - leftValue). For example 12-3 = 9 would not be allowed. Think of it like stock values. You want to find out which time to buy and sell the stocks to make the largest profit.
It was quite easy to implement this iteratively, but I have no idea how to do it recursively. Also, it is part of the task to solve it by using divide and conquer.
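I've used a divide and conquer approach here. I believe the trick here is to include middle in both the arrays that we're splitting the main array into, and to also consider the case where the smaller value lies in the left part and the larger value in the right part.
/* edge cases ignored here: assumes left < right */
int findMax(int[] arr, int left, int right){
    if(right-left == 1) return (arr[right]-arr[left]);
    int middle = left + (right-left)/2;
    int max1 = findMax(arr, left, middle);  // both values in the left part
    int max2 = findMax(arr, middle, right); // both values in the right part
    // left value from [left..middle], right value from (middle..right]
    int min = arr[left], max = arr[right];
    for(int i = left; i <= middle; i++) min = Math.min(min, arr[i]);
    for(int i = middle + 1; i <= right; i++) max = Math.max(max, arr[i]);
    return Math.max(Math.max(max1, max2), max - min);
}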
Well, I don't think recursion is very effective for this. You would probably never do this (other than for homework). Something like this would do it:
int findGreatestDifference(Vector<Integer> numbers, int greatestDifference){
    if(numbers.size() <= 1) //nothing left to compare against
        return greatestDifference;
    //take the last value as the candidate right-hand value
    int last = numbers.remove(numbers.size() - 1);
    //its best left-hand partner is the smallest of the remaining (earlier) values
    int newDifference = last - Collections.min(numbers);
    if (newDifference > greatestDifference)
        greatestDifference = newDifference;
    return findGreatestDifference(numbers, greatestDifference);
}
The first time you call it, pass 0 as the greatest difference (note that the method empties the vector as it goes). Again, I don't find this an effective way to do it. Iteration would be way better for this.
I hope this helps.
Algorithm (this is pretty much a sorting task, then the subtraction step is trivial)
1) First sort the array (use recursive merge sort for large arrays and recursive insertion sort for smaller arrays).
Merge sort (https://en.wikipedia.org/wiki/Merge_sort)
Insertion sort (https://en.wikipedia.org/wiki/Insertion_sort)
2) Use index [0] of the sorted array to get the smallest value and index [array.length-1] to get the largest.
3) Compute the difference (I don't know what you mean by right order?)

Select 100 random lines from a file with 1 million lines which can't be read into memory [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Actually, this question was asked in one of my interviews. I do not know the exact answer; could you explain in detail?
How would you select 100 random lines from a file with 1 million
lines? You can't read the file into memory.
Typically, in such scenarios, you do not know the number of items in the input file in advance (and you want to avoid requiring two passes over your data, to check the number of available items first). In that case the solution proposed by @radoh and others, where you create indices to select from, will not work.
In such cases, you can use reservoir sampling: You only need to know the number of items you want to select (k as follows) and you iterate over the input data (S[1..n]). Below is the pseudocode taken from Wikipedia, I'll leave it to your practice to convert this into a working Java method (the method would typically look something like List<X> sample(Stream<X> data, int k)):
/*
  S has items to sample, R will contain the result
 */
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]
  // replace elements with gradually decreasing probability
  for i = k+1 to n
    j := random(1, i)   // important: inclusive range
    if j <= k
        R[j] := S[i]
Note that although the code mentions n explicitly (i.e. the number of input items), you do not need to know that value prior to computation. You can simply iterate over an Iterator or Stream (representing lines from a file in your case) and only need to keep the result array or collection R in memory. You can even sample a continuous stream, and at each point in time (at least as soon as you've seen k samples) you have k randomly chosen items.
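One possible Java rendering of the pseudocode above, as a sketch: it takes an Iterator rather than the Stream mentioned earlier, uses 0-based indices, and assumes the input has fewer than 2^31 items. The class and method names (ReservoirSampler, sample) are illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class ReservoirSampler {
    /** Returns k items chosen uniformly at random (or all items if fewer than k). */
    static <T> List<T> sample(Iterator<T> data, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        int seen = 0; // number of items read so far
        while (data.hasNext()) {
            T item = data.next();
            seen++;
            if (reservoir.size() < k) {
                // fill the reservoir with the first k items
                reservoir.add(item);
            } else {
                // j is uniform in [0, seen), so the item is kept with probability k / seen
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

For the file case, the iterator could come from something like Files.lines(path).iterator(), which reads lazily, so only the k sampled lines are held in memory at any time.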
Generate 100 random (unique) numbers (ranging from 0..1000000-1) and then go through the file, keeping the lines whose indexes are among those numbers. Ideally, the collection of numbers should be a Set, so that each lookup is fast.
Pseudocode:
int i = 0;
List<String> myLines = new ArrayList<>();
while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    if (myRandomNumbers.contains(i)) {
        myLines.add(line);
    }
    i++;
}
Here's a pretty efficient way to do it:
Iterator<String> linesIter = ...
List<String> selectedLines = new ArrayList<>();
Random rng = new Random(seed);
int linesStillNeeded = 100;
int linesRemaining = 1000000;
while (linesStillNeeded > 0) {
    String line = linesIter.next();
    // select this line with probability linesStillNeeded / linesRemaining
    if (rng.nextInt(linesRemaining) < linesStillNeeded) {
        selectedLines.add(line);
        linesStillNeeded--;
    }
    // decrement only after the draw, so the current line is still counted in it
    linesRemaining--;
}
I haven't coded in Java in a while, so you might want to treat this as pseudo-code.
This algorithm is based on the fact that, when k distinct lines are selected uniformly out of a total of n lines, any given line is contained in the selection with probability k/n. This follows from
1) the number of collections of k distinct lines (out of n lines) is choose(n, k),
2) the number of collections of k distinct lines (out of n lines) which contain a particular line is choose(n-1, k-1), and
3) choose(n-1,k-1)/choose(n,k) = k/n
Note that k and n here correspond to linesStillNeeded and linesRemaining in the code respectively.
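For completeness, the ratio in step 3 can be checked by expanding both binomial coefficients (a short derivation, assuming 1 <= k <= n):

```latex
\frac{\binom{n-1}{k-1}}{\binom{n}{k}}
  = \frac{(n-1)!}{(k-1)!\,(n-k)!} \cdot \frac{k!\,(n-k)!}{n!}
  = \frac{k!}{(k-1)!} \cdot \frac{(n-1)!}{n!}
  = \frac{k}{n}
```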
