Using Java 7 HashMap in Java 8 - java

I have updated a Java application to Java 8. The application relies heavily on HashMaps.
When I run my benchmarks, I see unpredictable behavior: for some inputs the application runs faster than before, but for larger inputs it is consistently slower.
I've checked the profiler and the most time-consuming operation is HashMap.get. I suspect the difference is due to the HashMap changes in Java 8, but that may not be true, since I have also changed some other parts of the application.
Is there an easy way to hook the original Java 7 HashMap into my Java 8 application, so that I change only the HashMap implementation and can see whether the performance difference remains?
The following is a minimal program that tries to simulate what my application is doing.
The basic idea is that I need to share nodes in the application. At some point at runtime, a node should be retrieved, or created if it does not already exist, based on some integer properties. The program below uses only two integers, but in the real application I have one-, two- and three-integer keys.
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
public class Test1 {
static int max_k1 = 500;
static int max_k2 = 500;
static Map<Node, Node> map;
static Random random = new Random();
public static void main(String[] args) {
for (int i = 0; i < 15; i++) {
long start = System.nanoTime();
run();
long end = System.nanoTime();
System.out.println((end - start) / 1000_000);
}
}
private static void run() {
map = new HashMap<>();
for (int i = 0; i < 10_000_000; i++) {
Node key = new Node(random.nextInt(max_k1), random.nextInt(max_k2));
Node val = getOrElseUpdate(key);
}
}
private static Node getOrElseUpdate(Node key) {
Node val;
if ((val = map.get(key)) == null) {
val = key;
map.put(key, val);
}
return val;
}
private static class Node {
private int k1;
private int k2;
public Node(int k1, int k2) {
this.k1 = k1;
this.k2 = k2;
}
@Override
public int hashCode() {
int result = 17;
result = 31 * result + k1;
result = 31 * result + k2;
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (!(obj instanceof Node))
return false;
Node other = (Node) obj;
return k1 == other.k1 && k2 == other.k2;
}
}
}
The benchmarking is primitive, but still, this is the result of 15 runs on Java 8:
8143
7919
7984
7973
7948
7984
7931
7992
8038
7975
7924
7995
6903
7758
7627
and this is for Java 7:
7247
6955
6510
6514
6577
6489
6510
6570
6497
6482
6540
6462
6514
4603
6270
The benchmarking is primitive, so I would appreciate it if someone familiar with JMH or another benchmarking tool could run it, but from what I observe the results are better on Java 7. Any ideas?

Your hashCode() is very poor. In the example you posted you have 250,000 unique values but only 15,969 unique hash codes. Because of the large number of collisions, Java 8 replaces the bucket lists with trees. In your case this only adds overhead, because many elements not only land in the same bucket of the hash table but also have the same hash code. The tree ends up behaving like a linked list anyway.
There are a couple of ways to fix this:
Improve your hashCode. return k1 * 500 + k2; resolves the issue.
Use THashMap. Open addressing should work better in the presence of collisions.
Make Node implement Comparable. HashMap will use this to construct a balanced tree in case of collisions (a sketch of this and the improved hashCode follows below).
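Here is a minimal sketch of the first and last options applied to the Node class from the question (the compareTo ordering is my own illustrative choice, not something prescribed by HashMap):

private static class Node implements Comparable<Node> {
    private final int k1;
    private final int k2;

    public Node(int k1, int k2) {
        this.k1 = k1;
        this.k2 = k2;
    }

    @Override
    public int hashCode() {
        // Unique for 0 <= k2 < 500, so distinct nodes no longer collide.
        return k1 * 500 + k2;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!(obj instanceof Node))
            return false;
        Node other = (Node) obj;
        return k1 == other.k1 && k2 == other.k2;
    }

    @Override
    public int compareTo(Node other) {
        // Any total order consistent with equals works; compare k1, then k2.
        int c = Integer.compare(k1, other.k1);
        return c != 0 ? c : Integer.compare(k2, other.k2);
    }
}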

Related

Large Map freezes my program

I have a large Map where I store some objects: around 200k of them. When I try to run some methods that require reading the map's values, the program freezes. When I debug it, my IDE seems to be stuck 'collecting data' and never completes the task.
I have 16GB RAM.
What can I do to speed this up?
I get performance issues around 61 million elements.
import java.util.*;
public class BreakingMaps{
public static void main(String[] args){
int count = Integer.MAX_VALUE>>5;
System.out.println(count + " objects tested");
HashMap<Long, String> set = new HashMap<>(count);
for(long i = 0; i<count; i++){
Long l = i;
set.put(l, l.toString());
}
Random r = new Random();
for(int i = 0; i<1000; i++){
long k = r.nextInt()%count;
k = k<0?-k:k;
System.out.println(set.get(k));
}
}
}
I run the program with java -Xms12G -Xmx13G BreakingMaps
I suspect your problem is not the map itself, but the circumstances surrounding the map. If I write the same program but use a key class with hashCode collisions, then the program cannot handle 200K elements.
static class Key{
final long l;
public Key(long l){
this.l = l;
}
@Override
public int hashCode(){
return 1;
}
@Override
public boolean equals(Object o){
if(o!=null && o instanceof Key){
return ((Key)o).l==l;
}
return false;
}
}
As a quick fix you can increase the heap size for your app:
java -Xmx6g myprogram
But that is not a very good solution. I'd suggest reworking your data-processing approach: maybe you can apply some filtering before fetching the data to reduce its size, or push part of the computation down to the database level.

The performance of HashMap in Java 8

I had a question when I was studying the HashMap source code in Java 8.
The source code is quite complicated; how efficient is it really?
So I wrote some code to provoke hash collisions.
import java.util.Random;
public class Test {
final int i;
public Test(int i) {
this.i = i;
}
public static void main(String[] args) {
java.util.HashMap<Test, Test> set = new java.util.HashMap<Test, Test>();
long time;
Test last;
Random random = new Random(0);
int i = 0;
for (int max = 1; max < 200000; max <<= 1) {
long c1 = 0, c2 = 0;
int t = 0;
for (; i < max; i++, t++) {
last = new Test(random.nextInt());
time = System.nanoTime();
set.put(last, last);
c1 += (System.nanoTime() - time);
last = new Test(random.nextInt());
time = System.nanoTime();
set.get(last);
c2 += (System.nanoTime() - time);
}
System.out.format("%d\t%d\t%d\n", max, (c1 / t), (c2 / t));
}
}
public int hashCode() {
return 0;
}
public boolean equals(Object obj) {
if (obj == null)
return false;
if (!(obj instanceof Test))
return false;
Test t = (Test) obj;
return t.i == this.i;
}
}
I plotted the results in Excel (chart not reproduced here).
I am using Java 6u45, Java 7u80 and Java 8u131.
I do not understand why the performance of Java 8 is so bad.
I'm trying to write my own HashMap, and I would like to learn from the Java 8 HashMap, which is supposed to be better, but I did not manage to find out how.
Your test scenario is non-optimal for Java 8 HashMap. HashMap in Java 8 optimizes collisions by using binary trees for any hash chains longer than a given threshold. However, this only works if the key type is comparable. If it isn't then the overhead of testing to see if the optimization is possible actually makes Java 8 HashMap slower. (The slow-down is more than I expected ... but that's another topic.)
Change your Test class to implement Comparable<Test> ... and you should see that Java 8 performs better than the others when the proportion of hash collisions is large enough.
Note that the tree optimization should be considered a defensive measure for the case where the hash function doesn't perform well. The optimization turns O(N) worst-case performance into O(log N) worst-case performance.
If you want your HashMap instances to have O(1) lookup, you should make sure that you use a good hash function for the key type. If the probability of collision is minimized, the optimization is moot.
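A minimal sketch of that change to the Test class from the question (only the relevant parts are shown; hashCode(), equals() and main() stay as they are):

public class Test implements Comparable<Test> {
    final int i;

    public Test(int i) {
        this.i = i;
    }

    @Override
    public int compareTo(Test other) {
        // Gives the tree bins a total order even though hashCode() is constant.
        return Integer.compare(this.i, other.i);
    }

    // hashCode(), equals() and main() unchanged from the question.
}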
The source code is quite complicated; how efficient is it really?
It is explained in the comments in the source code. And probably other places that Google can find for you :-)

What is the most efficient way to count the intersections between two sets (Java)?

Question Background
I am comparing two (at a time; actually many) text files, and I want to determine how similar they are. To do so, I have created small, overlapping groups of text from each file. I now want to determine how many of those groups from one file also appear in the other file.
I would prefer to use only Java 8 with no external libraries.
Attempts
These are my two fastest methods. The first contains a bunch of logic which allows it to stop if meeting the threshold is not possible with the remaining elements (this saves a bit of time in total, but of course executing the extra logic also takes time). The second is slower. It does not have those optimizations, actually determines the intersection rather than merely counting it, and uses a stream, which is quite new to me.
I have an integer threshold and dblThreshold (the same value cast to a double), which are the minimum percentage of the smaller file which must be shared to be of interest. Also, from my limited testing, it seems that writing all the logic for either set being larger is faster than calling the method again with reversed arguments.
public int numberShared(Set<String> sOne, Set<String> sTwo) {
int numFound = 0;
if (sOne.size() > sTwo.size()) {
int smallSize = sTwo.size();
int left = smallSize;
for (String item: sTwo) {
if (numFound < threshold && ((double)numFound + left < (dblThreshold) * smallSize)) {
break;
}
if (sOne.contains(item)) {
numFound++;
}
left--;
}
} else {
int smallSize = sOne.size();
int left = smallSize;
for (String item: sOne) {
if (numFound < threshold && ((double)numFound + left < (dblThreshold) * smallSize)) {
break;
}
if (sTwo.contains(item)) {
numFound++;
}
left--;
}
}
return numFound;
}
Second method:
public int numberShared(Set<String> sOne, Set<String> sTwo) {
if (sOne.size() < sTwo.size()) {
long numFound = sOne.parallelStream()
.filter(segment -> sTwo.contains(segment))
.collect(Collectors.counting());
return (int)numFound;
} else {
long numFound = sTwo.parallelStream()
.filter(segment -> sOne.contains(segment))
.collect(Collectors.counting());
return (int)numFound;
}
}
Any suggestions for improving upon these methods, or novel ideas and approaches to the problem are much appreciated!
Edit: I just realized that the first part of my threshold check (which seeks to eliminate, in some cases, the need for the second check with doubles) is incorrect. I will revise it as soon as possible.
If I understand you correctly, you have already determined which methods are fastest, but aren't sure how to implement your threshold-check when using Java 8 streams. Here's one way you could do that - though please note that it's hard for me to do much testing without having proper data and knowing what thresholds you're interested in, so take this simplified test case with a grain of salt (and adjust as necessary).
public class Sets {
private static final int NOT_ENOUGH_MATCHES = -1;
private static final String[] arrayOne = { "1", "2", "4", "9" };
private static final String[] arrayTwo = { "2", "3", "5", "7", "9" };
private static final Set<String> setOne = new HashSet<>();
private static final Set<String> setTwo = new HashSet<>();
public static void main(String[] ignoredArguments) {
setOne.addAll(Arrays.asList(arrayOne));
setTwo.addAll(Arrays.asList(arrayTwo));
boolean isFirstSmaller = setOne.size() < setTwo.size();
System.out.println("Number shared: " + (isFirstSmaller ?
numberShared(setOne, setTwo) : numberShared(setTwo, setOne)));
}
private static long numberShared(Set<String> smallerSet, Set<String> largerSet) {
SimpleBag bag = new SimpleBag(3, 0.5d, largerSet, smallerSet.size());
try {
smallerSet.forEach(eachItem -> bag.add(eachItem));
return bag.duplicateCount;
} catch (IllegalStateException exception) {
return NOT_ENOUGH_MATCHES;
}
}
public static class SimpleBag {
private Map<String, Boolean> items;
private int threshold;
private double fraction;
protected int duplicateCount = 0;
private int smallerSize;
private int numberLeft;
public SimpleBag(int aThreshold, double aFraction, Set<String> someStrings,
int otherSetSize) {
threshold = aThreshold;
fraction = aFraction;
items = new HashMap<>();
someStrings.forEach(eachString -> items.put(eachString, false));
smallerSize = otherSetSize;
numberLeft = otherSetSize;
}
public void add(String aString) {
Boolean value = items.get(aString);
boolean alreadyExists = value != null;
if (alreadyExists) {
duplicateCount++;
}
items.put(aString, alreadyExists);
numberLeft--;
if (cannotMeetThreshold()) {
throw new IllegalStateException("Can't meet threshold; stopping at "
+ duplicateCount + " duplicates");
}
}
public boolean cannotMeetThreshold() {
return duplicateCount < threshold
&& (duplicateCount + numberLeft < fraction * smallerSize);
}
}
}
So I've made a simplified "Bag-like" implementation that starts with the contents of the larger set mapped as keys to false values (since we know there's only one of each). Then we iterate over the smaller set, adding each item to the bag, and, if it's a duplicate, switching the value to true and keeping track of the duplicate count (I initially did a .count() at the end of .stream().allMatch(), but this'll suffice for your special case). After adding each item, we check whether we can't meet the threshold, in which case we throw an exception (arguably not the prettiest way to exit the .forEach(), but in this case it is an illegal state of sorts). Finally, we return the duplicate count, or -1 if we encountered the exception. In my little test, change 0.5d to 0.51d to see the difference.
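As a small aside on the question's second method: the collector is not needed just to count matches, so (ignoring the threshold logic) the stream version can be written a bit more directly. A sketch, assuming the same numberShared signature as in the question:

public int numberShared(Set<String> sOne, Set<String> sTwo) {
    // Iterate over the smaller set and probe the larger one, then count matches.
    Set<String> smaller = sOne.size() < sTwo.size() ? sOne : sTwo;
    Set<String> larger = smaller == sOne ? sTwo : sOne;
    return (int) smaller.parallelStream()
                        .filter(larger::contains)
                        .count();
}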

Java multi-thread scalability issue

more updates
As is explained in the selected answer, the problem lies in the JVM's garbage collection implementation.
The JVM uses a card-marking algorithm to keep track of modified references in object fields. For each reference assignment to a field, it marks the associated card table entry as dirty -- this causes false sharing and hence blocks scaling. The details are well described in this article: https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
The option -XX:+UseCondCardMark (in Java 1.7u40 and up) mitigates the problem, and makes it scale almost perfectly.
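For example, with the benchmark program shown further down it is just one extra flag on the command line (using 4 worker threads here):
java -XX:+UseCondCardMark -Darg.type=Object -server ParTest1 4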
updates
I found out (hinted by Park Eung-ju) that assigning an object to a field variable makes the difference. If I remove the assignment, it scales perfectly.
I think it probably has something to do with the Java memory model -- for instance, an object reference must point to a valid address before it becomes visible -- but I am not completely sure. Both a double and an object reference are (most likely) 8 bytes on a 64-bit machine, so it seems to me that assigning a double value and assigning an object reference should be equivalent in terms of synchronization.
Anyone has a reasonable explanation?
Here I have a weird Java multi-threading scalability problem.
My code simply iterates over an array (using the visitor pattern) to compute simple floating-point operations and assign the result to another array. There is no data dependency, nor synchronization, so it should scale linearly (2x faster with 2 threads, 4x faster with 4 threads).
When primitive (double) array is used, it scales very well. When object type (e.g. String) array is used, it doesn't scale at all (even though the value of the String array is not used at all...)
Here's the entire source code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;
class Table1 {
public static final int SIZE1=200000000;
public static final boolean OBJ_PARAM;
static {
String type=System.getProperty("arg.type");
if ("double".equalsIgnoreCase(type)) {
System.out.println("Using primitive (double) type arg");
OBJ_PARAM = false;
} else {
System.out.println("Using object type arg");
OBJ_PARAM = true;
}
}
byte[] filled;
int[] ivals;
String[] strs;
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, "Strs");
}
public boolean iterate_range(int from, int to, MyVisitor v) {
for (int i=from; i<to; i++) {
if (filled[i]==1) {
// XXX: Here we are passing double or String argument
if (OBJ_PARAM) v.visit_obj(i, strs[i]);
else v.visit(i, ivals[i]);
}
}
return true;
}
}
class HeadTable {
byte[] filled;
double[] dvals;
boolean isEmpty;
HeadTable(int size) {
filled = new byte[size];
dvals = new double[size];
Arrays.fill(filled, (byte)0);
isEmpty = true;
}
public boolean contains(int i, double d) {
if (filled[i]==0) return false;
if (dvals[i]==d) return true;
return false;
}
public boolean contains(int i) {
if (filled[i]==0) return false;
return true;
}
public double groupby(int i) {
assert filled[i]==1;
return dvals[i];
}
public boolean insert(int i, double d) {
if (filled[i]==1 && contains(i,d)) return false;
if (isEmpty) isEmpty=false;
filled[i]=1;
dvals[i] = d;
return true;
}
public boolean update(int i, double d) {
assert filled[i]==1;
dvals[i]=d;
return true;
}
}
class MyVisitor {
public static final int NUM=128;
int[] range = new int[2];
Table1 table1;
HeadTable head;
double diff=0;
int i;
int iv;
String sv;
MyVisitor(Table1 _table1, HeadTable _head, int id) {
table1 = _table1;
head = _head;
int elems=Table1.SIZE1/NUM;
range[0] = elems*id;
range[1] = elems*(id+1);
}
public void run() {
table1.iterate_range(range[0], range[1], this);
}
//YYY 1: with double argument, this function is called
public boolean visit(int _i, int _v) {
i = _i;
iv = _v;
insertDiff();
return true;
}
//YYY 2: with String argument, this function is called
public boolean visit_obj(int _i, Object _v) {
i = _i;
iv = 42;
sv = (String)_v;
insertDiff();
return true;
}
public boolean insertDiff() {
if (!head.contains(i)) {
head.insert(i, diff);
return true;
}
double old = head.groupby(i);
double newval=Math.min(old, diff);
head.update(i, newval);
head.insert(i, diff);
return true;
}
}
public class ParTest1 {
public static int THREAD_NUM=4;
public static void main(String[] args) throws Exception {
if (args.length>0) {
THREAD_NUM = Integer.parseInt(args[0]);
System.out.println("Setting THREAD_NUM:"+THREAD_NUM);
}
Table1 table1 = new Table1(Table1.SIZE1);
HeadTable head = new HeadTable(Table1.SIZE1);
MyVisitor[] visitors = new MyVisitor[MyVisitor.NUM];
for (int i=0; i<visitors.length; i++) {
visitors[i] = new MyVisitor(table1, head, i);
}
int taskPerThread = visitors.length / THREAD_NUM;
MyThread[] threads = new MyThread[THREAD_NUM];
CyclicBarrier barrier = new CyclicBarrier(THREAD_NUM+1);
for (int i=0; i<THREAD_NUM; i++) {
threads[i] = new MyThread(barrier);
for (int j=taskPerThread*i; j<taskPerThread*(i+1); j++) {
if (j>=visitors.length) break;
threads[i].addVisitors(visitors[j]);
}
}
Runtime r=Runtime.getRuntime();
System.out.println("Force running gc");
r.gc(); // running GC here (excluding GC effect)
System.out.println("Running gc done");
// not measuring 1st run (excluding JIT compilation effect)
for (int i=0; i<THREAD_NUM; i++) {
threads[i].start();
}
barrier.await();
for (int i=0; i<10; i++) {
MyThread.start = true;
long s=System.currentTimeMillis();
barrier.await();
long e=System.currentTimeMillis();
System.out.println("Iter "+i+" Exec time:"+(e-s)/1000.0+"s");
}
}
}
class MyThread extends Thread {
static volatile boolean start=true;
static int tid=0;
int id=0;
ArrayList<MyVisitor> tasks;
CyclicBarrier barrier;
public MyThread(CyclicBarrier _barrier) {
super("MyThread"+(tid++));
barrier = _barrier;
id=tid;
tasks = new ArrayList(256);
}
void addVisitors(MyVisitor v) {
tasks.add(v);
}
public void run() {
while (true) {
while (!start) { ; }
for (int i=0; i<tasks.size(); i++) {
MyVisitor v=tasks.get(i);
v.run();
}
start = false;
try { barrier.await();}
catch (InterruptedException e) { break; }
catch (Exception e) { throw new RuntimeException(e); }
}
}
}
The Java code can be compiled with no dependency, and you can run it with the following command:
java -Darg.type=double -server ParTest1 2
You pass the number of worker threads as an argument (the above uses 2 threads).
After setting up the arrays (that is excluded from the measured time), it does a same operation for 10 times, printing out the execution time at each iteration.
With the above option, it uses double array, and it scales very well with 1,2,4 threads (i.e. the execution time reduces to 1/2, and 1/4), but
java -Darg.type=Object -server ParTest1 2
With this option, it uses Object (String) array, and it doesn't scale at all!
I measured the GC time, but it was insignificant (and I also forced running GC before measuring times). I have tested with Java 6 (updates 43) and Java 7 (updates 51), but it's the same.
The code has comments with XXX and YYY describing the difference when arg.type=double or arg.type=Object option is used.
Can you figure out what is going on with the String-type argument passing here?
The HotSpot VM generates the following assembly for a reference-type putfield bytecode.
mov ref, OFFSET_OF_THE_FIELD(this) <- this puts the new value for field.
mov this, REGISTER_A
shr 0x9, REGISTER_A
movabs OFFSET_X, REGISTER_B
mov %r12b, (REGISTER_A, REGISTER_B, 1)
The putfield operation itself is completed by the first instruction, but more instructions follow.
They are "card marking" instructions. (http://www.ibm.com/developerworks/library/j-jtp11253/)
Writing a reference field of any object within the same card (512 bytes) stores a value to the same card table address.
And I guess stores to the same memory address from multiple cores mess up the caches and pipelines.
Just add
byte[] garbage = new byte[600];
to the MyVisitor definition.
Then the MyVisitor instances will be spaced far enough apart that they no longer share a card-marking entry, and you will see the program scale.
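In other words, a minimal sketch of the suggested change to the MyVisitor class from the question (600 bytes is the answer's suggested size; the exact amount needed depends on object layout):

class MyVisitor {
    public static final int NUM = 128;
    int[] range = new int[2];
    Table1 table1;
    HeadTable head;
    double diff = 0;
    int i;
    int iv;
    String sv;
    // Padding so that fields of different MyVisitor instances no longer
    // fall into the same 512-byte card (see the card-marking explanation above).
    byte[] garbage = new byte[600];

    // ... constructor and methods unchanged from the question ...
}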
This is not a complete answer but may provide a hint for you.
I have changed your code
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, "Strs");
}
to
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, new String("Strs"));
}
After this change, the running time with 4 threads and the object-type array was reduced.
According to http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.7
For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64-bit value from one write, and the second 32 bits from another write.
Writes and reads of volatile long and double values are always atomic.
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
Assigning a reference is always atomic, while assigning a double is not atomic unless it is declared volatile.
The problem is that sv can be seen by other threads and its assignment is atomic.
Therefore, wrapping the visitor's member variables (i, iv, sv) in ThreadLocal will solve the problem.
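A minimal sketch of what that suggestion could look like for the sv field (the other fields would be wrapped the same way); this is my illustration of the idea, not something I have benchmarked:

// Inside MyVisitor: sv held in a ThreadLocal instead of a plain instance field.
private final ThreadLocal<String> sv = new ThreadLocal<>();

public boolean visit_obj(int _i, Object _v) {
    i = _i;
    iv = 42;
    sv.set((String) _v);   // instead of: sv = (String) _v;
    insertDiff();
    return true;
}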
"sv = (String)_v;" makes the difference. I also confirmed that the type casting is not the factor. Just accessing _v can't make the difference. Assigning some value to sv field makes the difference. But I can't explain why.

Is there a Java data structure that is effectively an ArrayList with double indices and built-in interpolation?

I am looking for a pre-built Java data structure with the following characteristics:
It should look something like an ArrayList but should allow indexing via double-precision keys rather than integers. Note that this means you will likely ask for indices that don't line up with the original data points (i.e., asking for the value that corresponds to key "1.5"). EDIT: For clarity, based on the comments, I'm not looking to change the ArrayList implementation. I'm looking for a similar interface and developer experience.
As a consequence, the value returned will likely be interpolated. For example, if the key is 1.5, the value returned could be the average of the value at key 1.0 and the value at key 2.0.
The keys will be sorted but the values are not ensured to be monotonically increasing. In fact, there's no assurance that the first derivative of the values will be continuous (making it a poor fit for certain types of splines).
Freely available code only, please.
For clarity, I know how to write such a thing. In fact, we already have an implementation of this and some related data structures in legacy code that I want to replace due to some performance and coding issues.
What I'm trying to avoid is spending a lot of time rolling my own solution when there might already be such a thing in the JDK, Apache Commons or another standard library. Frankly, that's exactly the approach that got this legacy code into the situation that it's in right now....
Is there such a thing out there in a freely available library?
Allowing double values as indices is a pretty large change from what ArrayList does.
The reason for this is that an array or list with double as indices would almost by definition be a sparse array, which means it has no value (or depending on your definition: a fixed, known value) for almost all possible indices and only a finite number of indices have an explicit value set.
There is no prebuilt class in Java SE that supports all that.
Personally I'd implement such a data structure as a skip-list (or similar fast-searching data structure) of (index, value) tuples with appropriate interpolation.
Edit: Actually there's a pretty good match for the back-end storage (i.e. everything except the interpolation): simply use a NavigableMap such as a TreeMap to store the mapping from index to value.
With that you can easily use floorEntry() and ceilingEntry() to get the closest entries below and above the index you need, and then interpolate between those.
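A minimal sketch of that idea with linear interpolation (no error handling beyond a range check; the class and method names are mine, not from any library):

import java.util.Map;
import java.util.TreeMap;

public class TreeMapInterpolator {
    private final TreeMap<Double, Double> points = new TreeMap<>();

    public void put(double key, double value) {
        points.put(key, value);
    }

    // Linearly interpolate between the closest entries below and above the key.
    public double get(double key) {
        Map.Entry<Double, Double> lo = points.floorEntry(key);
        Map.Entry<Double, Double> hi = points.ceilingEntry(key);
        if (lo == null || hi == null)
            throw new IndexOutOfBoundsException("key outside interpolation range: " + key);
        if (lo.getKey().equals(hi.getKey()))
            return lo.getValue(); // exact hit
        double rate = (hi.getValue() - lo.getValue()) / (hi.getKey() - lo.getKey());
        return lo.getValue() + (key - lo.getKey()) * rate;
    }
}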
If your current implementation has complexity O(log N) for interpolating a value, the implementation I just made up may be for you:
package so2675929;
import java.util.Arrays;
public abstract class AbstractInterpolator {
private double[] keys;
private double[] values;
private int size;
public AbstractInterpolator(int initialCapacity) {
keys = new double[initialCapacity];
values = new double[initialCapacity];
}
public final void put(double key, double value) {
int index = indexOf(key);
if (index >= 0) {
values[index] = value;
} else {
if (size == keys.length) {
keys = Arrays.copyOf(keys, size + 32);
values = Arrays.copyOf(values, size + 32);
}
int insertionPoint = insertionPointFromIndex(index);
System.arraycopy(keys, insertionPoint, keys, insertionPoint + 1, size - insertionPoint);
System.arraycopy(values, insertionPoint, values, insertionPoint + 1, size - insertionPoint);
keys[insertionPoint] = key;
values[insertionPoint] = value;
size++;
}
}
public final boolean containsKey(double key) {
int index = indexOf(key);
return index >= 0;
}
protected final int indexOf(double key) {
return Arrays.binarySearch(keys, 0, size, key);
}
public final int size() {
return size;
}
protected void ensureValidIndex(int index) {
if (!(0 <= index && index < size))
throw new IndexOutOfBoundsException("index=" + index + ", size=" + size);
}
protected final double getKeyAt(int index) {
ensureValidIndex(index);
return keys[index];
}
protected final double getValueAt(int index) {
ensureValidIndex(index);
return values[index];
}
public abstract double get(double key);
protected static int insertionPointFromIndex(int index) {
return -(1 + index);
}
}
The concrete interpolators will only have to implement the get(double) function.
For example:
package so2675929;
public class LinearInterpolator extends AbstractInterpolator {
public LinearInterpolator(int initialCapacity) {
super(initialCapacity);
}
@Override
public double get(double key) {
final double minKey = getKeyAt(0);
final double maxKey = getKeyAt(size() - 1);
if (!(minKey <= key && key <= maxKey))
throw new IndexOutOfBoundsException("key=" + key + ", min=" + minKey + ", max=" + maxKey);
int index = indexOf(key);
if (index >= 0)
return getValueAt(index);
index = insertionPointFromIndex(index);
double lowerKey = getKeyAt(index - 1);
double lowerValue = getValueAt(index - 1);
double higherKey = getKeyAt(index);
double higherValue = getValueAt(index);
double rate = (higherValue - lowerValue) / (higherKey - lowerKey);
return lowerValue + (key - lowerKey) * rate;
}
}
And, finally, a unit test:
package so2675929;
import static org.junit.Assert.*;
import org.junit.Test;
public class LinearInterpolatorTest {
@Test
public void simple() {
LinearInterpolator interp = new LinearInterpolator(2);
interp.put(0.0, 0.0);
interp.put(1.0, 1.0);
assertEquals(0.0, interp.getValueAt(0), 0.0);
assertEquals(1.0, interp.getValueAt(1), 0.0);
assertEquals(0.0, interp.get(0.0), 0.0);
assertEquals(0.1, interp.get(0.1), 0.0);
assertEquals(0.5, interp.get(0.5), 0.0);
assertEquals(0.9, interp.get(0.9), 0.0);
assertEquals(1.0, interp.get(1.0), 0.0);
interp.put(0.5, 0.0);
assertEquals(0.0, interp.getValueAt(0), 0.0);
assertEquals(0.0, interp.getValueAt(1), 0.0);
assertEquals(1.0, interp.getValueAt(2), 0.0);
assertEquals(0.0, interp.get(0.0), 0.0);
assertEquals(0.0, interp.get(0.1), 0.0);
assertEquals(0.0, interp.get(0.5), 0.0);
assertEquals(0.75, interp.get(0.875), 0.0);
assertEquals(1.0, interp.get(1.0), 0.0);
}
@Test
public void largeKeys() {
LinearInterpolator interp = new LinearInterpolator(10);
interp.put(100.0, 30.0);
interp.put(200.0, 40.0);
assertEquals(30.0, interp.get(100.0), 0.0);
assertEquals(35.0, interp.get(150.0), 0.0);
assertEquals(40.0, interp.get(200.0), 0.0);
try {
interp.get(99.0);
fail();
} catch (IndexOutOfBoundsException e) {
assertEquals("key=99.0, min=100.0, max=200.0", e.getMessage());
}
try {
interp.get(201.0);
fail();
} catch (IndexOutOfBoundsException e) {
assertEquals("key=201.0, min=100.0, max=200.0", e.getMessage());
}
}
private static final int N = 10 * 1000 * 1000;
private double measure(int size) {
LinearInterpolator interp = new LinearInterpolator(size);
for (int i = 0; i < size; i++)
interp.put(i, i);
double max = interp.size() - 1;
double sum = 0.0;
for (int i = 0; i < N; i++)
sum += interp.get(max * i / N);
return sum;
}
@Test
public void speed10() {
assertTrue(measure(10) > 0.0);
}
@Test
public void speed10000() {
assertTrue(measure(10000) > 0.0);
}
@Test
public void speed1000000() {
assertTrue(measure(1000000) > 0.0);
}
}
So the functionality seems to work. I only measured speed in some simple cases, and these suggest that scaling will be better than linear.
Update (2010-10-17T23:45+0200): I made some stupid mistakes in checking the key argument in the LinearInterpolator, and my unit tests didn't catch them. Now I extended the tests and fixed the code accordingly.
In the Apache commons-math library, if you use a UnivariateRealInterpolator and the UnivariateRealFunction returned by its interpolate method, you'll be most of the way there.
The interpolator interface takes two arrays, x[] and y[]. The returned function has a method, value(), that takes an x' and returns the interpolated y'.
Where it fails to provide an ArrayList-like experience is in the ability to add more values to the range and domain as if the list were growing.
Additionally, it looks to be in need of some additional interpolation functions: there are only 4 implementations in the stable release of the library. As a commenter pointed out, it seems to be missing 'linear' or something even simpler like nearest neighbor. Maybe that's not really interpolation...
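A rough sketch of what using it could look like, assuming commons-math 2.x and its SplineInterpolator (package names and checked exceptions may differ between versions):

import org.apache.commons.math.analysis.UnivariateRealFunction;
import org.apache.commons.math.analysis.interpolation.SplineInterpolator;
import org.apache.commons.math.analysis.interpolation.UnivariateRealInterpolator;

public class CommonsMathInterpolation {
    public static void main(String[] args) throws Exception {
        double[] x = { 0.0, 1.0, 2.0, 3.0 };
        double[] y = { 0.0, 1.0, 0.5, 2.0 };

        // Build an interpolating function over the sample points.
        UnivariateRealInterpolator interpolator = new SplineInterpolator();
        UnivariateRealFunction f = interpolator.interpolate(x, y);

        // Evaluate at a key that lies between the original points.
        System.out.println(f.value(1.5));
    }
}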
That's a huge change from ArrayList.
Same as Joachim's response above, but I'd probably implement this as a binary tree, and when I didn't find the exact key I was looking for, average the values at the next smallest and next largest keys, which should be quick to reach.
Your description that it should be "like an ArrayList" is misleading, since what you've described is a one dimensional interpolator and has essentially nothing in common with an ArrayList. This is why you're getting suggestions for other data structures which IMO are sending you down the wrong path.
I don't know of any available in Java (and couldn't easily find one on Google), but I think you should have a look at GSL, the GNU Scientific Library, which includes a spline interpolator. It may be a bit heavy for what you're looking for since it's a two-dimensional interpolator, but it seems like you should be looking for something like this rather than something like an ArrayList.
If you'd like it to "look like an ArrayList" you can always wrap it in a Java class which has access methods similar to the List interface. You won't be able to actually implement the interface though, since the methods are declared to take integer indices.
