Java hashCode for floating point numbers

I want to use Double (or Float) values as keys in a HashMap:
Map<Double, String> map = new HashMap<Double, String>();
map.put(1.0, "one");
System.out.println(map.containsKey(Math.tan(Math.PI / 4)));
and this returns false.
If I were comparing these two numbers directly I would have done something like this:
final double EPSILON = 1e-6;
Math.abs(1.0 - Math.tan(Math.PI / 4)) < EPSILON
But since HashMap uses hashCode(), this breaks things for me.
I thought of implementing a roundKey function that rounds to some multiple of EPSILON before using the value as a key:
map.put(roundKey(1.0), "one");
map.containsKey(roundKey(Math.tan(Math.PI / 4)));
Is there a better way? What is the right way to implement this roundKey?

If you know what rounding is appropriate, you can use that; e.g. if you need to round to cents, you can round to two decimal places.
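For example, here is a minimal sketch of the roundKey idea from the question (an illustration only; this version returns the long bucket index rather than a rounded double, so no floating-point error is re-introduced by multiplying back):

import java.util.HashMap;
import java.util.Map;

public class RoundKeyDemo {
    static final double EPSILON = 1e-6;

    // Map a double onto the index of its EPSILON-sized bucket.
    static long roundKey(double d) {
        return Math.round(d / EPSILON);
    }

    public static void main(String[] args) {
        Map<Long, String> map = new HashMap<>();
        map.put(roundKey(1.0), "one");
        System.out.println(map.containsKey(roundKey(Math.tan(Math.PI / 4)))); // true
    }
}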
However, for the example above, discrete rounding to a fixed precision might not be appropriate; e.g. if you round to 6 decimal places, 1.4999e-6 and 1.5001e-6 will not match, as one rounds up and the other down, even though the difference is << 1e-6.
In that situation the best you can do is to use a NavigableMap:
NavigableMap<Double, String> map = new TreeMap<>();
double x = ....;
double error = 1e-6;
NavigableMap<Double, String> map2 = map.subMap(x - error, true, x + error, true);
or you can use
Map.Entry<Double, String> higher = map.higherEntry(x);
Map.Entry<Double, String> lower = map.lowerEntry(x);
// (ceilingEntry/floorEntry would also consider an exact key match)
Map.Entry<Double, String> entry = null;
if (higher == null)
    entry = lower;
else if (lower == null)
    entry = higher;
else if (Math.abs(lower.getKey() - x) < Math.abs(higher.getKey() - x))
    entry = lower;
else
    entry = higher;
// entry is the closest match.
if (entry != null && Math.abs(entry.getKey() - x) < error) {
    // found the closest entry within the error
}
The subMap approach finds all the entries within a continuous range; the higher/lower approach finds the single closest entry within the error.

The best way is to not use floating point numbers as keys, as they are (as you discovered) not going to compare reliably.
Kludgy "solutions" like treating two values as identical when they're within a certain range of each other only lead to problems later: over time you'll either have to loosen or tighten that tolerance, both of which can break existing code, and people will forget how things were supposed to work.
Of course in some applications you do want that, but as a key for looking up something? No. You're probably better off using angles in degrees, stored as integers, as the keys here. If you need greater precision than 1 degree, use the angle in, say, tenths of a degree by storing a number from 0 through 3600.
That will give you reliable behaviour of your Map while retaining the data you're planning to store.
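As an illustration of that approach (a sketch only, reusing the tangent example from the question and assuming tenths of a degree are precise enough):

import java.util.HashMap;
import java.util.Map;

public class TenthDegreeKeys {
    // Convert an angle in radians to an integer key in tenths of a degree.
    static int tenthDegreeKey(double radians) {
        return (int) Math.round(Math.toDegrees(radians) * 10);
    }

    public static void main(String[] args) {
        Map<Integer, String> byAngle = new HashMap<>();
        byAngle.put(tenthDegreeKey(Math.PI / 4), "one");                     // key 450 = 45.0 degrees

        double recomputed = Math.atan(Math.tan(Math.PI / 4));                // not bit-identical to PI / 4
        System.out.println(byAngle.containsKey(tenthDegreeKey(recomputed))); // true
    }
}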

Related

Bug: parameter 'initialCapacity' of ConcurrentHashMap's constructor?

One of the constructors of java.util.concurrent.ConcurrentHashMap:
public ConcurrentHashMap(int initialCapacity) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException();
    int cap = ((initialCapacity >= (MAXIMUM_CAPACITY >>> 1)) ?
               MAXIMUM_CAPACITY :
               tableSizeFor(initialCapacity + (initialCapacity >>> 1) + 1));
    this.sizeCtl = cap;
}
What does the parameter for method 'tableSizeFor(...)' mean?
initialCapacity + (initialCapacity >>> 1) + 1
I think the parameter should be something like:
(int)(1.0 + (long)initialCapacity / LOAD_FACTOR)
or just:
initialCapacity
I think the parameter expression is wrong; at the very least it is a bug. Did I misunderstand something?
I sent a bug report to OpenJDK, and it seems they officially confirmed it is most likely a bug: https://bugs.openjdk.java.net/browse/JDK-8202422
Update: Doug Lea commented on the bug; it seems he agrees it is a bug.
I strongly suspect it's an optimization trick.
You're on the right track. The constructor you cite uses the default load factor of 0.75, so to accommodate initialCapacity elements the hash table size needs to be at least
initialCapacity / 0.75
(roughly the same as multiplying by 1.3333333333). However, floating-point division is a bit expensive (only slightly, not terribly so), and we would additionally need to round up to an integer. An integer division would already help:
(initialCapacity * 4 + 2) / 3
(the + 2 makes sure that the result is rounded up; the * 4 ought to be cheap since it can be implemented as a left shift). The implementors did even better, since shifts are a lot cheaper than divisions:
initialCapacity + (initialCapacity >>> 1) + 1
This really multiplies by 1.5, so it gives a result that will often be greater than needed, but it's fast. The + 1 compensates for the fact that the "multiplication" rounds down.
Details: >>> is an unsigned right shift, filling a zero into the leftmost position. Since we already know initialCapacity is non-negative, this gives the same result as a division by 2, ignoring the remainder.
Edit: I may add that tableSizeFor rounds up to a power of 2, so most often the same power of 2 will be the final result even when the first calculation gave a slightly greater result than needed. For example, if you ask for capacity for 10 elements (to keep the calculation simple), table size 14 would be enough, where the formula yields 16. But the 14 would be rounded up to a power of 2, so we get 16 anyway, so in the end there is no difference. If you asked for room for 12 elements, size 16 would still suffice, but the formula yields 19, which is then rounded up to 32. This is the more unusual case.
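To make the arithmetic concrete, here is a small sketch of my own (nextPowerOfTwo stands in for the private ConcurrentHashMap.tableSizeFor, which cannot be called directly) comparing the exact load-factor computation with the constructor's 1.5x shortcut:

public class CapacityDemo {
    // Stand-in for the private tableSizeFor: the smallest power of two >= n.
    static int nextPowerOfTwo(int n) {
        return (n <= 1) ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    public static void main(String[] args) {
        for (int initialCapacity : new int[] {10, 12, 22}) {
            int exact = (int) Math.ceil(initialCapacity / 0.75);         // what a 3/4 load factor needs
            int approx = initialCapacity + (initialCapacity >>> 1) + 1;  // the constructor's 1.5x shortcut
            System.out.printf("%d -> exact %d (table %d), approx %d (table %d)%n",
                    initialCapacity, exact, nextPowerOfTwo(exact),
                    approx, nextPowerOfTwo(approx));
        }
    }
}

For 10 this reproduces the 16-versus-16 case, for 12 the 16-versus-32 case described above, and for 22 the 32-versus-64 discrepancy mentioned in the bug discussion below.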
Further edit: Thank you for the information in the comments that you have submitted this as a JDK bug and for providing the link: https://bugs.openjdk.java.net/browse/JDK-8202422. The first comment, by Martin Buchholz, agrees with you:
Yes, there is a bug here. The one-arg constructor effectively uses a
load-factor of 2/3, not the documented default of 3/4…
I myself would not have considered this a bug unless you regard it as a bug that you occasionally get a greater capacity than you asked for. On the other hand you are right, of course (in your exemplarily terse bug report) that there is an inconsistency: You would expect new ConcurrentHashMap(22) and new ConcurrentHashMap(22, 0.75f, 1) to give the same result since the latter just gives the documented default load factor/table density; but the table sizes you get are 64 from the former and 32 from the latter.
When you say (int)(1.0 + (long)initialCapacity / LOAD_FACTOR), that makes sense for HashMap, but not for ConcurrentHashMap (at least not in the same sense it does for HashMap).
For HashMap, capacity is the number of buckets before a resize happens; for ConcurrentHashMap it's the number of entries before a resize is performed.
Testing this is fairly easy:
import java.lang.reflect.AccessibleObject;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

static int currentResizeCalls = 0;

private static <K, V> void debugResize(Map<K, V> map, K key, V value) throws Throwable {
    Field table = map.getClass().getDeclaredField("table");
    AccessibleObject.setAccessible(new Field[] { table }, true);
    Object[] nodes = ((Object[]) table.get(map));
    // first put: the table is still null, so just insert and return
    if (nodes == null) {
        map.put(key, value);
        return;
    }
    map.put(key, value);
    // re-read the table and compare lengths to detect a resize
    Field field = map.getClass().getDeclaredField("table");
    AccessibleObject.setAccessible(new Field[] { field }, true);
    int x = ((Object[]) field.get(map)).length;
    if (nodes.length != x) {
        ++currentResizeCalls;
    }
}

public static void main(String[] args) throws Throwable {
    // replace with new ConcurrentHashMap<>(1024) to see a different result
    Map<Integer, Integer> map = new HashMap<>(1024);
    for (int i = 0; i < 1024; ++i) {
        debugResize(map, i, i);
    }
    System.out.println(currentResizeCalls);
}
For HashMap, a resize happened once; for ConcurrentHashMap it didn't.
And growing by 1.5x is not a new thing at all; ArrayList has the same strategy.
As for the shifts: they are cheaper than ordinary arithmetic, and >>> is used because it is an unsigned shift.

Determinism of Java 8 streams

Motivation
I've just rewritten some 30 mostly trivial parsers, and I need the new versions to behave exactly like the old ones. Therefore, I stored their example input files and a signature of the outputs produced by the old parsers for comparison with the new ones. This signature contains the count of successfully parsed items, sums of some hash codes, and up to 10 pseudo-randomly chosen items.
I thought this was a good idea, as equality of the hash code sums more or less guarantees that the outputs are exactly the same, and the samples allow me to see what's wrong. I'm only using samples, as otherwise the signature would get really big.
The problem
Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should work also when something is missing, so ideas like taking the 100th smallest element don't work.
ImmutableList<String> selectSome(Collection<String> list) {
    if (list.isEmpty()) return ImmutableList.of();
    return IntStream.range(1, 20)
            .mapToObj(seed -> selectOne(list, seed))
            .distinct()
            .limit(10)
            .collect(ImmutableList.toImmutableList());
}
So I start with numbers from 1 to 20 (so that after distinct I still most probably have my 10 samples), call a stateless deterministic function selectOne (defined below) returning one string which is maximal according to some funny criteria, remove duplicates, limit the result and collect it using Guava. All steps should be IMHO deterministic and "ordered", but I may be overlooking something. The other possibility would be that all my 30 new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the results of the parsing look correct.
String selectOne(Collection<String> list, int seed) {
    // some boring mixing, definitely deterministic
    for (int i = 0; i < 10; ++i) {
        seed *= 123456789;
        seed = Integer.rotateLeft(seed, 16);
    }
    // ensure seed is odd
    seed = 2 * seed + 1;
    // first element is the candidate result
    String result = list.iterator().next();
    // the value is the hash code multiplied by the seed
    // overflow is fine
    int value = seed * result.hashCode();
    // looking for s maximizing seed * s.hashCode()
    for (final String s : list) {
        final int v = seed * s.hashCode();
        if (v < value) continue;
        // tiebreaking by taking the bigger or smaller s
        // this is needed for determinism
        if (s.compareTo(result) * seed < 0) continue;
        result = s;
        value = v;
    }
    return result;
}
This sampling doesn't seem to work. I get a sequence like
"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...
with one old parser and
"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...
with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.
Is my usage of streams wrong? Do I have to add .sequential() or the like?
Update
Sorting the input collection has solved the problem:
ImmutableList<String> selectSome(Collection<String> collection) {
    final List<String> list = Lists.newArrayList(collection);
    Collections.sort(list);
    .... as before
}
What's still missing is an explanation why.
The explanation
As stated in the answers, my tiebreaker was an all-breaker, as I neglected to check for a tie first. Something like
if (v==value && s.compareTo(result) < 0) continue;
works fine.
I hope that my confused question may at least be useful for someone looking for "consistent sampling". It wasn't really Java 8 related.
I should've used Guava's ComparisonChain or, better, a Java 8 arg-max to avoid my stupid mistake:
String selectOne(Collection<String> list, int seed) {
    .... as before
    final int multiplier = 2 * seed + 1;
    return list.stream()
            .max(Comparator.comparingInt(s -> multiplier * s.hashCode())
                    .thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
            .get();
}
The mistake is that your tiebreaker is not, in fact, only breaking ties. When v > value we should always select s, but instead we still fall back to compareTo(). This breaks comparison symmetry, making your algorithm dependent on encounter order.
As a bonus, here's a simple test case to reproduce the bug:
System.out.println(selectOne(Arrays.asList("1", "2"), 4)); // 1
System.out.println(selectOne(Arrays.asList("2", "1"), 4)); // 2
In selectOne you just want to select the String s with the maximum rank value = seed * s.hashCode() for the given seed.
The problem is with the "tiebreaking" line:
if (s.compareTo(result) * seed < 0) continue;
It is not order-independent: for a different ordering of the elements it omits different elements from being checked, and thus a change in the order of the elements changes the result.
Remove the tiebreaking if and the result will be insensitive to the order of elements in the input list (barring exact ties in value).

Can you replicate the Floor function found in Excel in Java?

I have searched the internet but have not found any solutions for my question.
I would like to replicate the FLOOR function found in Excel in Java. In particular, I would like to be able to provide a value (double or preferably BigDecimal) and round down to the nearest multiple of a significance I provide.
Example 1:
Value = 24,519.30235
Significance = 0.01
Returned Value = 24,519.30
Example 2:
Value = 76.81485697
Significance = 1
Returned Value = 76
Example 3:
Value = 12,457,854
Significance = 100
Returned Value = 12,457,800
I am pretty new to Java and was wondering if someone knows whether an API already includes this function, or if they would be kind enough to give me a solution to the above. I am aware of BigDecimal but I might have missed the correct method.
Many thanks
Yes, you can.
Let's say the given numbers are 76.21445 and 0.01. What you can do is divide 76.21445 by 0.01 (or, equivalently, multiply by 100), round the result down to an integer (or to the nearest integer, depending on which you want), and then multiply it by the significance again.
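A minimal sketch of that divide/floor/multiply idea with plain doubles (my own illustration, using the values from the question; the caveat below about binary representation still applies):

public class FloorDemo {
    // Round value down to the nearest multiple of significance.
    static double excelFloor(double value, double significance) {
        return Math.floor(value / significance) * significance;
    }

    public static void main(String[] args) {
        System.out.println(excelFloor(24519.30235, 0.01)); // ~24519.3 (may show tiny binary round-off)
        System.out.println(excelFloor(76.81485697, 1));    // 76.0
        System.out.println(excelFloor(12457854, 100));     // 1.24578E7, i.e. 12457800
    }
}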
Note that the result may not print exactly as you expect if you stay with binary floating point, because many decimal fractions have no finite binary representation. Also note that java.lang.Math.round(double) only rounds to the nearest long; it does not take a number of decimal places, so it cannot do this on its own: http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html
One example using BigDecimal could be:
import java.math.BigDecimal;
import java.math.RoundingMode;

public static void main(String[] args) {
    BigDecimal value = new BigDecimal("2.0");
    BigDecimal significance = new BigDecimal("0.5");
    for (int i = 1; i <= 10; i++) {
        System.out.println(value + " --> " + floor(value, significance));
        value = value.add(new BigDecimal("0.1"));
    }
}

private static double floor(BigDecimal value, BigDecimal significance) {
    double result = 0;
    if (value != null) {
        // divide with an explicit scale and rounding mode so that
        // non-terminating quotients don't throw an ArithmeticException
        result = value.divide(significance, 10, RoundingMode.FLOOR).doubleValue();
        result = Math.floor(result) * significance.doubleValue();
    }
    return result;
}
To round a BigDecimal, you can use setScale(). In your case, you want RoundingMode.FLOOR.
Now you need to determine the number of digits from the "significance". Use Math.log10(significance) for that. You'll probably have to round the result up.
If the result is negative, then you have a significance < 1. In this case, use setScale(-result, RoundingMode.FLOOR) to round to that many decimal places.
If the significance is > 1, then use this code:
value
    .divide(significance)
    .setScale(0, RoundingMode.FLOOR)
    .multiply(significance);
e.g. 1024 and 100 gives 10.24 -> 10 -> 1000.
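Putting those pieces together, here is a hedged sketch of an Excel-style FLOOR for BigDecimal (it assumes a positive significance and passes an explicit scale and rounding mode to divide so non-terminating quotients don't throw):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class BigDecimalFloor {
    // Round value down to the nearest multiple of significance (Excel-style FLOOR).
    static BigDecimal excelFloor(BigDecimal value, BigDecimal significance) {
        return value.divide(significance, 0, RoundingMode.FLOOR)  // integer quotient, rounded down
                    .multiply(significance);                      // scale back up
    }

    public static void main(String[] args) {
        System.out.println(excelFloor(new BigDecimal("24519.30235"), new BigDecimal("0.01"))); // 24519.30
        System.out.println(excelFloor(new BigDecimal("76.81485697"), BigDecimal.ONE));         // 76
        System.out.println(excelFloor(new BigDecimal("12457854"), new BigDecimal("100")));     // 12457800
    }
}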

How to look up range from set of contiguous ranges for given number

So, simply put, this is what I am trying to do:
I have a collection of Range objects that are contiguous (non overlapping, with no gaps between them), each containing a start and end int, and a reference to another object obj. These ranges are not of a fixed size (the first could be 1-49, the second 50-221, etc.). This collection could grow to be quite large.
I am hoping to find a way to look up the range (or more specifically, the object that it references) that includes a given number without having to iterate over the entire collection checking each range to see if it includes the number. These lookups will be performed frequently, so speed/performance is key.
Does anyone know of an algorithm/equation that might help me out here? I am writing in Java. I can provide more details if needed, but I figured I would try to keep it simple.
Thanks.
It sounds like you want to use a TreeMap, where the key is the bottom of the range, and the value is the Range object.
Then to identify the correct range, just use the floorEntry() method to very quickly get the closest (lesser or equal) Range, which should contain the key, like so:
TreeMap<Integer, Range> map = new TreeMap<>();
map.put(1, new Range(1, 10));
map.put(11, new Range(11, 30));
map.put(31, new Range(31, 100));

// int key = 0;   // null
// int key = 1;   // Range [start=1, end=10]
// int key = 11;  // Range [start=11, end=30]
// int key = 21;  // Range [start=11, end=30]
// int key = 31;  // Range [start=31, end=100]
// int key = 41;  // Range [start=31, end=100]
int key = 101;    // Range [start=31, end=100]
// etc.

Range r = null;
Map.Entry<Integer, Range> m = map.floorEntry(key);
if (m != null) {
    r = m.getValue();
}
System.out.println(r);
Since the tree is always sorted by the natural ordering of the bottom range boundary, all your searches will be at worst O(log(n)).
You'll want to add some sanity checking for when your key is completely out of bounds (for instance, when the key is beyond the end of the map, it returns the last Range in the map), but this should give you an idea how to proceed.
Assuming that your lookups are of utmost importance, and you can spare O(N) memory and approximately O(N^2) preprocessing time, the algorithm would be:
introduce a class ObjectsInRange, which contains: start of range (int startOfRange) and a set of objects (Set<Object> objects)
introduce an ArrayList<ObjectsInRange> oir, which will contain ObjectsInRange sorted by the startOfRange
for each Range r, ensure that there exist ObjectsInRange (let's call them a and b) such that a.startOfRange = r.start and b.startOfRange = r.end. Then, for every ObjectsInRange x from a up to (but not including) b, add r.obj to its x.objects set
The lookup, then, is as follows:
for integer x, find such i that oir[i].startOfRange <= x and oir[i+1].startOfRange > x
note: i can be found with bisection in O(log N) time!
your objects are oir[i].objects
If the collection is in order, then you can implement a binary search to find the right range in O(log(n)) time. It's not as efficient as hashing for very large collections, but if you have fewer than 1000 or so ranges, it may be faster (because it's simpler).
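A sketch of that binary-search approach (an illustration only, assuming a simple Range class with inclusive start/end and a payload, and an array sorted by start):

public class RangeLookup {
    // Hypothetical Range: inclusive start/end plus the referenced object.
    static class Range {
        final int start, end;
        final Object obj;
        Range(int start, int end, Object obj) { this.start = start; this.end = end; this.obj = obj; }
    }

    // Binary search over ranges sorted by start; returns the range containing x, or null.
    static Range find(Range[] sorted, int x) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Range r = sorted[mid];
            if (x < r.start)      hi = mid - 1;
            else if (x > r.end)   lo = mid + 1;
            else                  return r;
        }
        return null;
    }

    public static void main(String[] args) {
        Range[] ranges = {
            new Range(1, 49, "A"), new Range(50, 221, "B"), new Range(222, 300, "C")
        };
        System.out.println(find(ranges, 75).obj);  // B
        System.out.println(find(ranges, 500));     // null (out of bounds)
    }
}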

Java - Taking character frequencies, creating probabilities, and then generating pseudo-random characters

I'm creating a pseudo-random text generator using a Markov model. Basically, I use a hash table to store lists of substrings of order k (the order of the Markov model), then for each substring I have a TreeMap of the suffixes with their frequencies throughout the substring.
I'm struggling with generating the random suffix. For each substring, I have a TreeMap containing all of the possible suffixes and their frequencies. I'm having trouble with using this to create a probability for each suffix, and then generating a pseudo-random suffix based on the probabilities.
Any help on the concept of this and how to go about doing this is appreciated. If you have any questions or need clarification, please let me know.
I'm not sure that a TreeMap is really the best data-structure for this, but . . .
You can use the Math.random() method to obtain a random value between 0.0 (inclusive) and 1.0 (exclusive). Then, iterate over the elements of your map, accumulating their frequencies, until you surpass that value. The suffix that first surpasses this value is your result. Assuming that your map-elements' frequencies all add up to 1.0, this will choose all suffixes in proportion to their frequencies.
For example:
import java.util.Map;
import java.util.TreeMap;

public class Demo
{
    private final Map<String, Double> suffixFrequencies =
            new TreeMap<String, Double>();

    private String getRandomSuffix()
    {
        final double value = Math.random();
        double accum = 0.0;
        for (final Map.Entry<String, Double> e : suffixFrequencies.entrySet())
        {
            accum += e.getValue();
            if (accum > value)
                return e.getKey();
        }
        throw new AssertionError(); // or something
    }

    public static void main(final String... args)
    {
        final Demo demo = new Demo();
        demo.suffixFrequencies.put("abc", 0.3); // value in [0.0, 0.3)
        demo.suffixFrequencies.put("def", 0.2); // value in [0.3, 0.5)
        demo.suffixFrequencies.put("ghi", 0.5); // value in [0.5, 1.0)
        // Print "abc" approximately three times, "def" approximately twice,
        // and "ghi" approximately five times:
        for (int i = 0; i < 10; ++i)
            System.out.println(demo.getRandomSuffix());
    }
}
Notes:
Due to roundoff error, the throw new AssertionError() probably will actually happen every so often, albeit very rarely. So I recommend that you replace that line with something that just always chooses the first or last element.
If the frequencies don't all add up to 1.0, then you should add a pass at the beginning of getRandomSuffix() that determines the sum of all frequencies. You can then scale value accordingly.
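A hedged sketch of that normalisation pass (a variation on the demo above, not required if the frequencies already sum to 1.0): sum the weights first, scale the random value by the total, and fall back to the last key to absorb round-off.

import java.util.Map;
import java.util.TreeMap;

public class WeightedPick {
    // Pick a key with probability proportional to its weight, even if weights don't sum to 1.0.
    static String getRandomSuffix(Map<String, Double> suffixFrequencies) {
        double total = 0.0;
        for (double f : suffixFrequencies.values())
            total += f;
        double value = Math.random() * total;   // scale the random value by the total weight
        double accum = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : suffixFrequencies.entrySet()) {
            accum += e.getValue();
            last = e.getKey();
            if (accum > value)
                return last;
        }
        return last; // guard against round-off: fall back to the last element
    }

    public static void main(String[] args) {
        Map<String, Double> freq = new TreeMap<>();
        freq.put("abc", 3.0);   // raw counts rather than probabilities
        freq.put("def", 2.0);
        freq.put("ghi", 5.0);
        for (int i = 0; i < 10; i++)
            System.out.println(getRandomSuffix(freq));
    }
}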
