Determinism of Java 8 streams - java

Motivation
I've just rewritten some 30 mostly trivial parsers and I need that the new versions behave exactly like the old ones. Therefore, I stored their example input files and some signature of the outputs produced by the old parsers for comparison with the new ones. This signature contains the counts of successfully parsed items, sums of some hash codes and up to 10 pseudo-randomly chosen items.
I thought this was a good idea as the equality of the hash code sums sort of guarantee that the outputs are exactly the same and the samples allow me to see what's wrong. I'm only using samples as otherwise it'd get really big.
The problem
Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should work also when something is missing, so ideas like taking the 100th smallest element don't work.
ImmutableList<String> selectSome(Collection<String> list) {
if (list.isEmpty()) return ImmutableList.of();
return IntStream.range(1, 20)
.mapToObj(seed -> selectOne(list, seed))
.distinct()
.limit(10)
.collect(ImmutableList.toImmutableList());
}
So I start with numbers from 1 to 20 (so that after distinct I still most probably have my 10 samples), call a stateless deterministic function selectOne (defined below) returning one string which is maximal according to some funny criteria, remove duplicates, limit the result and collect it using Guava. All steps should be IMHO deterministic and "ordered", but I may be overlooking something. The other possibility would be that all my 30 new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the results of the parsing look correct.
String selectOne(Collection<String> list, int seed) {
// some boring mixing, definitely deterministic
for (int i=0; i<10; ++i) {
seed *= 123456789;
seed = Integer.rotateLeft(seed, 16);
}
// ensure seed is odd
seed = 2*seed + 1;
// first element is the candidate result
String result = list.iterator().next();
// the value is the hash code multiplied by the seed
// overflow is fine
int value = seed * result.hashCode();
// looking for s maximizing seed * s.hashCode()
for (final String s : list) {
final int v = seed * s.hashCode();
if (v < value) continue;
// tiebreaking by taking the bigger or smaller s
// this is needed for determinism
if (s.compareTo(result) * seed < 0) continue;
result = s;
value = v;
}
return result;
}
This sampling doesn't seem to work. I get a sequence like
"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...
with one old parser and
"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...
with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.
Is my usage of streams wrong? Do I have to add .sequential() or alike?
Update
Sorting the input collection has solved the problem:
ImmutableList<String> selectSome(Collection<String> collection) {
final List<String> list = Lists.newArrayList(collection);
Collections.sort(list);
.... as before
}
What's still missing is an explanation why.
The explanation
As stated in the answers, my tiebreaker was an all-breaker as I missed to check for a tie. Something like
if (v==value && s.compareTo(result) < 0) continue;
works fine.
I hope that my confused question may be at least useful for someone looking for "consistent sampling". It wasn't really Java 8 related.
I should've used Guava ComparisonChain or better Java 8 arg max to avoid my stupid mistake:
String selectOne(Collection<String> list, int seed) {
.... as before
final int multiplier = 2*seed + 1;
return list.stream()
.max(Comparator.comparingInt(s -> multiplier * s.hashCode())
.thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
.get();
}

The mistake is that your tiebreaker is not in fact breaking a tie. We should be selecting s when v > value, but instead we're falling back to compareTo(). This breaks comparison symmetry, making your algorithm dependent on encounter order.
As a bonus, here's a simple test case to reproduce the bug:
System.out.println(selectOne(Arrays.asList("1", "2"), 4)); // 1
System.out.println(selectOne(Arrays.asList("2", "1"), 4)); // 2

In selectOne you just want to select String s with max rank of value = seed * s.hashCode(); for that given seed.
The problem is with the "tiebreaking" line:
if (s.compareTo(result) * seed < 0) continue;
It is not deterministic - for different order of elements it omits different elements from being check, and thus change in order of elements is changing the result.
Remove the tiebreaking if and the result will be insensitive to the order of elements in input list.

Related

Creating combinations of a BitSet

Assume I have a Java BitSet. I now need to make combinations of the BitSet such that only Bits which are Set can be flipped. i.e. only need combinations of Bits which are set.
For Eg. BitSet - 1010, Combinations - 1010, 1000, 0010, 0000
BitSet - 1100, Combination - 1100, 1000, 0100, 0000
I can think of a few solutions E.g. I can take combinations of all 4 bits and then XOR the combinations with the original Bitset. But this would be very resource-intensive for large sparse BitSets. So I was looking for a more elegant solution.
It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
Set<BitSet> sets = new HashSet<>();
if (set.isEmpty()) {
sets.add(new BitSet(0));
return sets;
}
Integer head = set.nextSetBit(0);
BitSet rest = set.get(0, set.size());
rest.clear(head);
for (BitSet s : powerset(rest)) {
BitSet newSet = s.get(0, s.size());
newSet.set(head);
sets.add(newSet);
sets.add(s);
}
return sets;
}
You can perform the operation in a single linear pass instead of recursion, if you realize the integer numbers are a computer’s intrinsic variant of “on off” patterns and iterating over the appropriate integer range will ultimately produce all possible permutations. The only challenge in your case, is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
static List<BitSet> powerset(BitSet set) {
int nBits = set.cardinality();
if(nBits > 30) throw new OutOfMemoryError(
"Not enough memory for "+BigInteger.ONE.shiftLeft(nBits)+" BitSets");
int max = 1 << nBits;
int[] targetBits = set.stream().toArray();
List<BitSet> sets = new ArrayList<>(max);
for(int onOff = 0; onOff < max; onOff++) {
BitSet next = new BitSet(set.size());
for(int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet>>>=1) {
if((bitsToSet & 1) == 0) {
int skip = Integer.numberOfTrailingZeros(bitsToSet);
ix += skip;
bitsToSet >>>= skip;
}
next.set(targetBits[ix]);
}
sets.add(next);
}
return sets;
}
It uses an int value for the iteration, which is already enough to represent all permutations that can ever be stored in one of Java’s builtin collections. If your source BitSet has 2³¹ one bits, the 2³² possible combinations do not only require a hundred GB heap, but also a collection supporting 2³² elements, i.e. a size not representable as int.
So the code above terminates early if the number exceeds the capabilities, without even trying. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases, until it will fail with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a permutation of your input BitSet, you wouldn’t even need to generate all permutations, a simple bit operation, e.g. andNot would tell you that already. So for storing and iterating the permutations, an ArrayList is more efficient.

Common element in Java infinite streams

I have three infinite Java IntStream objects. I want to find the smallest element that is present in all three of them.
IntStream a = IntStream.iterate(286, i->i+1).map(i -> (Integer)i*(i+1)/2);
IntStream b = IntStream.iterate(166, i->i+1).map(i -> (Integer)i*(3*i-1)/2);
IntStream c = IntStream.iterate(144, i->i+1).map(i -> i*(2*i-1));
I can always employ a brute force solution (without streams) which involves iterating in nested loops, but I was wondering if we can do it more efficiently with streams?
You need to iterate all 3 in parallel, advancing the one with the lowest value, checking if all 3 are equal.
You code will not find an answer for next value after 40755, because the next value is 1_533_776_805, which has intermediate value (before division by 2) higher than Integer.MAX_VALUE (2_147_483_647).
So, here is one way to use your streams, after changing them to long and guarding against overflow.
LongStream a = LongStream.iterate(286, i->i+1).map(i -> Math.multiplyExact(i, i+1)/2);
LongStream b = LongStream.iterate(166, i->i+1).map(i -> Math.multiplyExact(i, 3*i-1)/2);
LongStream c = LongStream.iterate(144, i->i+1).map(i -> Math.multiplyExact(i, 2*i-1));
OfLong aIter = a.iterator();
OfLong bIter = b.iterator();
OfLong cIter = c.iterator();
long aVal = aIter.nextLong();
long bVal = bIter.nextLong();
long cVal = cIter.nextLong();
while (aVal != bVal || bVal != cVal) {
long min = Math.min(Math.min(aVal, bVal), cVal);
if (aVal == min)
aVal = aIter.nextLong();
if (bVal == min)
bVal = bIter.nextLong();
if (cVal == min)
cVal = cIter.nextLong();
}
System.out.println(aVal);
These functions are always increasing. So the code should stop when the magic equal triplet is found.
The thing to code is:
a) when a stream's current value is below any other, it can iterate next for itself.
b) when it meets the same candidate value, it waits for the 3rd stream to take a decision.
c) when it has a higher value than the all others, it changes the candidate and waits for both others.
Reference juggling.
There may not be a solution too (at least in short time).
Notice that stream c can only produce even numbers (when seeded with even). There might be some optimization there to skip a and b faster.
I don't think there is anything smart possible with stream API. The main reason is that you can't really go over one stream until some condition is met - instead, you look at current 3 elements and pick the next element from one of the streams before comparing the elements again.
The most efficient (and might be also the cleanest) solution is to use iterators and keep calling next() method on the right streams until the answer is found.
To start with, you can focus on two streams only and find their first common value:
while (elementA != elementB) {
if (elementA < elementB) {
elementA = iteratorA.next();
} else {
elementB = iteratorB.next();
}
}
Then you need to do make third stream catch up with these two:
while (elementC < elementA) {
elementC = iteratorC.next();
}
At this point there are two options:
either elementC == elementA in which case you have the answer
or elementC > elementA in which case you can go to next value on all three streams and start over
One thing to remember is the max value of integer. Because you have i^2, this means that it will overflow for i about 46k, so you need to change streams of ints to streams of longs (the answer is about 1.5 billion - and that's after division by 2 in these functions).
Since you are doing exercises for practice, I don't think it's right to give you the full working code, but let me know if you still struggle with it ;)

Find all the ways you can go up an n step staircase if you can take k steps at a time such that k <= n

This is a problem I'm trying to solve on my own to be a bit better at recursion(not homework). I believe I found a solution, but I'm not sure about the time complexity (I'm aware that DP would give me better results).
Find all the ways you can go up an n step staircase if you can take k steps at a time such that k <= n
For example, if my step sizes are [1,2,3] and the size of the stair case is 10, I could take 10 steps of size 1 [1,1,1,1,1,1,1,1,1,1]=10 or I could take 3 steps of size 3 and 1 step of size 1 [3,3,3,1]=10
Here is my solution:
static List<List<Integer>> problem1Ans = new ArrayList<List<Integer>>();
public static void problem1(int numSteps){
int [] steps = {1,2,3};
problem1_rec(new ArrayList<Integer>(), numSteps, steps);
}
public static void problem1_rec(List<Integer> sequence, int numSteps, int [] steps){
if(problem1_sum_seq(sequence) > numSteps){
return;
}
if(problem1_sum_seq(sequence) == numSteps){
problem1Ans.add(new ArrayList<Integer>(sequence));
return;
}
for(int stepSize : steps){
sequence.add(stepSize);
problem1_rec(sequence, numSteps, steps);
sequence.remove(sequence.size()-1);
}
}
public static int problem1_sum_seq(List<Integer> sequence){
int sum = 0;
for(int i : sequence){
sum += i;
}
return sum;
}
public static void main(String [] args){
problem1(10);
System.out.println(problem1Ans.size());
}
My guess is that this runtime is k^n where k is the numbers of step sizes, and n is the number of steps (3 and 10 in this case).
I came to this answer because each step size has a loop that calls k number of step sizes. However, the depth of this is not the same for all step sizes. For instance, the sequence [1,1,1,1,1,1,1,1,1,1] has more recursive calls than [3,3,3,1] so this makes me doubt my answer.
What is the runtime? Is k^n correct?
TL;DR: Your algorithm is O(2n), which is a tighter bound than O(kn), but because of some easily corrected inefficiencies the implementation runs in O(k2 × 2n).
In effect, your solution enumerates all of the step-sequences with sum n by successively enumerating all of the viable prefixes of those step-sequences. So the number of operations is proportional to the number of step sequences whose sum is less than or equal to n. [See Notes 1 and 2].
Now, let's consider how many possible prefix sequences there are for a given value of n. The precise computation will depend on the steps allowed in the vector of step sizes, but we can easily come up with a maximum, because any step sequence is a subset of the set of integers from 1 to n, and we know that there are precisely 2n such subsets.
Of course, not all subsets qualify. For example, if the set of step-sizes is [1, 2], then you are enumerating Fibonacci sequences, and there are O(φn) such sequences. As k increases, you will get closer and closer to O(2n). [Note 3]
Because of the inefficiencies in your coded, as noted, your algorithm is actually O(k2 αn) where α is some number between φ and 2, approaching 2 as k approaches infinity. (φ is 1.618..., or (1+sqrt(5))/2)).
There are a number of improvements that could be made to your implementation, particularly if your intent was to count rather than enumerate the step sizes. But that was not your question, as I understand it.
Notes
That's not quite exact, because you actually enumerate a few extra sequences which you then reject; the cost of these rejections is a multiplier by the size of the vector of possible step sizes. However, you could easily eliminate the rejections by terminating the for loop as soon as a rejection is noticed.
The cost of an enumeration is O(k) rather than O(1) because you compute the sum of the sequence arguments for each enumeration (often twice). That produces an additional factor of k. You could easily eliminate this cost by passing the current sum into the recursive call (which would also eliminate the multiple evaluations). It is trickier to avoid the O(k) cost of copying the sequence into the output list, but that can be done using a better (structure-sharing) data-structure.
The question in your title (as opposed to the problem solved by the code in the body of your question) does actually require enumerating all possible subsets of {1…n}, in which case the number of possible sequences would be exactly 2n.
If you want to solve this recursively, you should use a different pattern that allows caching of previous values, like the one used when calculating Fibonacci numbers. The code for Fibonacci function is basically about the same as what do you seek, it adds previous and pred-previous numbers by index and returns the output as current number. You can use the same technique in your recursive function , but add not f(k-1) and f(k-2), but gather sum of f(k-steps[i]). Something like this (I don't have a Java syntax checker, so bear with syntax errors please):
static List<Integer> cache = new ArrayList<Integer>;
static List<Integer> storedSteps=null; // if used with same value of steps, don't clear cache
public static Integer problem1(Integer numSteps, List<Integer> steps) {
if (!ArrayList::equal(steps, storedSteps)) { // check equality data wise, not link wise
storedSteps=steps; // or copy with whatever method there is
cache.clear(); // remove all data - now invalid
// TODO make cache+storedSteps a single structure
}
return problem1_rec(numSteps,steps);
}
private static Integer problem1_rec(Integer numSteps, List<Integer> steps) {
if (0>numSteps) { return 0; }
if (0==numSteps) { return 1; }
if (cache.length()>=numSteps+1) { return cache[numSteps] } // cache hit
Integer acc=0;
for (Integer i : steps) { acc+=problem1_rec(numSteps-i,steps); }
cache[numSteps]=acc; // cache miss. Make sure ArrayList supports inserting by index, otherwise use correct type
return acc;
}

How to look up range from set of contiguous ranges for given number

so simply put, this is what I am trying to do:
I have a collection of Range objects that are contiguous (non overlapping, with no gaps between them), each containing a start and end int, and a reference to another object obj. These ranges are not of a fixed size (the first could be 1-49, the second 50-221, etc.). This collection could grow to be quite large.
I am hoping to find a way to look up the range (or more specifically, the object that it references) that includes a given number without having to iterate over the entire collection checking each range to see if it includes the number. These lookups will be performed frequently, so speed/performance is key.
Does anyone know of an algorithm/equation that might help me out here? I am writing in Java. I can provide more details if needed, but I figured I would try to keep it simple.
Thanks.
If sounds like you want to use a TreeMap, where the key is the bottom of the range, and the value is the Range object.
Then to identify the correct range, just use the floorEntry() method to very quickly get the closest (lesser or equal) Range, which should contain the key, like so:
TreeMap<Integer, Range> map = new TreeMap<>();
map.put(1, new Range(1, 10));
map.put(11, new Range(11, 30));
map.put(31, new Range(31, 100));
// int key = 0; // null
// int key = 1; // Range [start=1, end=10]
// int key = 11; // Range [start=11, end=30]
// int key = 21; // Range [start=11, end=30]
// int key = 31; // Range [start=31, end=100]
// int key = 41; // Range [start=31, end=100]
int key = 101; // Range [start=31, end=100]
// etc.
Range r = null;
Map.Entry<Integer, Range> m = map.floorEntry(key);
if (m != null) {
r = m.getValue();
}
System.out.println(r);
Since the tree is always sorted by the natural ordering of the bottom range boundary, all your searches will be at worst O(log(n)).
You'll want to add some sanity checking for when your key is completely out of bounds (for instance, when they key is beyond the end of the map, it returns the last Range in the map), but this should give you an idea how to proceed.
Assuming that you lookups are of utmost importance, and you can spare O(N) memory and approximately O(N^2) preprocessing time, the algorithm would be:
introduce a class ObjectsInRange, which contains: start of range (int startOfRange) and a set of objects (Set<Object> objects)
introduce an ArrayList<ObjectsInRange> oir, which will contain ObjectsInRange sorted by the startOfRange
for each Range r, ensure that there exist ObjectsInRange (let's call them a and b) such that a.startOfRange = r.start and b.startOfRange = b.end. Then, for all ObjectsInRange x between a, and until (but not including) b, add r.obj to their x.objects set
The lookup, then, is as follows:
for integer x, find such i that oir[i].startOfRange <= x and oir[i+1].startOfRange > x
note: i can be found with bisection in O(log N) time!
your objects are oir[i].objects
If the collection is in order, then you can implement a binary search to find the right range in O(log(n)) time. It's not as efficient as hashing for very large collections, but if you have less than 1000 or so ranges, it may be faster (because it's simpler).

Number Guessing Game Over Intervals

I have just started my long path to becoming a better coder on CodeChef. People begin with the problems marked 'Easy' and I have done the same.
The Problem
The problem statement defines the following -:
n, where 1 <= n <= 10^9. This is the integer which Johnny is keeping secret.
k, where 1 <= k <= 10^5. For each test case or instance of the game, Johnny provides exactly k hints to Alice.
A hint is of the form op num Yes/No, where -
op is an operator from <, >, =.
num is an integer, again satisfying 1 <= num <= 10^9.
Yes or No are answers to the question: Does the relation n op num hold?
If the answer to the question is correct, Johnny has uttered a truth. Otherwise, he is lying.
Each hint is fed to the program and the program determines whether it is the truth or possibly a lie. My job is to find the minimum possible number of lies.
Now CodeChef's Editorial answer uses the concept of segment trees, which I cannot wrap my head around at all. I was wondering if there is an alternative data structure or method to solve this question, maybe a simpler one, considering it is in the 'Easy' category.
This is what I tried -:
class Solution //Represents a test case.
{
HashSet<SolutionObj> set = new HashSet<SolutionObj>(); //To prevent duplicates.
BigInteger max = new BigInteger("100000000"); //Max range.
BigInteger min = new BigInteger("1"); //Min range.
int lies = 0; //Lies counter.
void addHint(String s)
{
String[] vals = s.split(" ");
set.add(new SolutionObj(vals[0], vals[1], vals[2]));
}
void testHints()
{
for(SolutionObj obj : set)
{
//Given number is not in range. Lie.
if(obj.bg.compareTo(min) == -1 || obj.bg.compareTo(max) == 1)
{
lies++;
continue;
}
if(obj.yesno)
{
if(obj.operator.equals("<"))
{
max = new BigInteger(obj.bg.toString()); //Change max value
}
else if(obj.operator.equals(">"))
{
min = new BigInteger(obj.bg.toString()); //Change min value
}
}
else
{
//Still to think of this portion.
}
}
}
}
class SolutionObj //Represents a single hint.
{
String operator;
BigInteger bg;
boolean yesno;
SolutionObj(String op, String integer, String yesno)
{
operator = op;
bg = new BigInteger(integer);
if(yesno.toLowerCase().equals("yes"))
this.yesno = true;
else
this.yesno = false;
}
#Override
public boolean equals(Object o)
{
if(o instanceof SolutionObj)
{
SolutionObj s = (SolutionObj) o; //Make the cast
if(this.yesno == s.yesno && this.bg.equals(s.bg)
&& this.operator.equals(s.operator))
return true;
}
return false;
}
#Override
public int hashCode()
{
return this.bg.intValue();
}
}
Obviously this partial solution is incorrect, save for the range check that I have done before entering the if(obj.yesno) portion. I was thinking of updating the range according to the hints provided, but that approach has not borne fruit. How should I be approaching this problem, apart from using segment trees?
Consider the following approach, which may be easier to understand. Picture the 1d axis of integers, and place on it the k hints. Every hint can be regarded as '(' or ')' or '=' (greater than, less than or equal, respectively).
Example:
-----(---)-------(--=-----)-----------)
Now, the true value is somewhere on one of the 40 values of this axis, but actually only 8 segments are interesting to check, since anywhere inside a segment the number of true/false hints remains the same.
That means you can scan the hints according to their ordering on the axis, and maintain a counter of the true hints at that point.
In the example above it goes like this:
segment counter
-----------------------
-----( 3
--- 4
)-------( 3
-- 4
= 5 <---maximum
----- 4
)----------- 3
) 2
This algorithm only requires to sort the k hints and then scan them. It's near linear in k (O(k*log k), with no dependance on n), therefore it should have a reasonable running time.
Notes:
1) In practice the hints may have non-distinct positions, so you'll have to handle all hints of the same type on the same position together.
2) If you need to return the minimum set of lies, then you should maintain a set rather than a counter. That shouldn't have an effect on the time complexity if you use a hash set.
Calculate the number of lies if the target number = 1 (store this in a variable lies).
Let target = 1.
Sort and group the statements by their respective values.
Iterate through the statements.
Update target to the current statement group's value. Update lies according to how many of those statements would become either true or false.
Then update target to that value + 1 (Why do this? Consider when you have > 5 and < 7 - 6 may be the best value) and update lies appropriately (skip this step if the next statement group's value is this value).
Return the minimum value for lies.
Running time:
O(k) for the initial calculation.
O(k log k) for the sort.
O(k) for the iteration.
O(k log k) total.
My idea for this problem is similar to how Eyal Schneider view it. Denoting '>' as greater, '<' as less than and '=' as equals, we can sort all the 'hints' by their num and scan through all the interesting points one by one.
For each point, we keep in all the number of '<' and '=' from 0 to that point (in one array called int[]lessAndEqual), number of '>' and '=' from that point onward (in one array called int[]greaterAndEqual). We can easily see that the number of lies in a particular point i is equal to
lessAndEqual[i] + greaterAndEqual[i + 1]
We can easily fill the lessAndEqual and greaterAndEqual arrays by two scan in O(n) and sort all the hints in O(nlogn), which result the time complexity is O(nlogn)
Note: special treatment should be taken for the case when the num in hint is equals. Also notice that the range for num is 10^9, which require us to have some forms of point compression to fit the array into the memory

Categories