I'd like to run a SingleShot JMH benchmark with every level of the cache hierarchy that holds the memory being worked on reliably flushed.
The benchmark looks roughly as follows:
@State(Scope.Benchmark)
public class MyBnchmrk {
    public byte buffer[];

    @Setup(Level.Trial)
    public void generateSampleData() throws IOException {
        // writes to buffer ...
    }

    @Setup(Level.Invocation)
    public void flushCaches() {
        // Ideally I'd like to invoke something like the
        // _mm_clflushopt() intrinsic from GCC/clang here for each line of the buffer
    }

    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)
    public void benchmarkMemoryBoundCode() {
        // the benchmark
    }
}
Is there a Java way to flush caches before a single-shot measurement, or is a hand-written clflush required?
If you want to measure cache-miss access, calling clflush from Java is possible, but you end up writing a JNI library with an assembly intrinsic. Moreover, you probably can't do it reliably, since you need to provide a virtual address and the GC may move your buffer at any time.
Instead I suggest this:
Use a single-shot benchmark, as you do.
Measuring a single operation is not a good idea (measuring nanoseconds has a high error). Instead, create a million such identical buffers and perform the same operation on each of them. Every access then goes to the next buffer, which is not in the cache.
You can also run some calculation between iterations, for example reading 32+ MB of memory so that it evicts cache lines from your cache. With a million buffers, though, it brings no benefit.
The resulting code:
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1)
public class BufferBenchmarkLatency {
public static final int BATCH_SIZE = 1000000;
public static final int MY_BUFFER_SIZE = 1024;
public static final int CACHE_LINE_PADDING = 256;
public static class StateHolder extends Padder {
byte buffer[];
StateHolder() {
buffer = new byte[CACHE_LINE_PADDING + MY_BUFFER_SIZE + CACHE_LINE_PADDING];
Arrays.fill(buffer, (byte) ThreadLocalRandom.current().nextInt());
}
}
private final StateHolder[] arr = new StateHolder[BATCH_SIZE];
private int index;
@Setup(Level.Trial)
public void setUpTrial() {
for (int i = 0; i < arr.length; i++) {
arr[i] = new StateHolder();
}
ArrayUtil.shuffle(arr);
}
@Setup(Level.Iteration)
public void prepareForIteration(Blackhole blackhole) {
index = 0;
blackhole.consume(CacheUtil.evictCacheLines());
System.gc();
System.gc();
}
@Benchmark
public long read() {
byte[] buffer = arr[index].buffer;
return buffer[0];
}
@TearDown(Level.Invocation)
public void move() {
index++;
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(BufferBenchmarkLatency.class.getSimpleName())
.measurementBatchSize(BATCH_SIZE)
.warmupBatchSize(BATCH_SIZE)
.measurementIterations(10)
.warmupIterations(10)
.build();
new Runner(opt).run();
}
}
As you can see, I pad the state holder itself, so that the buffer references always sit on different cache lines (the Padder class has 24 long fields). Oh, and I also pad the buffer itself; JMH won't do it for you.
I've implemented this idea, and I get an average of 100 ns for a simple operation like reading the first element of the buffer. To read the first element you need to read two cache lines (buffer reference + first element). The full code is here
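The benchmark above calls CacheUtil.evictCacheLines() without showing it. Below is a minimal sketch of what such a helper might look like. It assumes that a 64 MB scratch array is larger than the last-level cache and that cache lines are 64 bytes; both values and the class name CacheEvictionSketch are assumptions, and this is not the original CacheUtil.

```java
// Hypothetical helper, not the original CacheUtil from the answer.
final class CacheEvictionSketch {
    // Assumption: 64 MB comfortably exceeds the last-level cache of the test machine.
    static final int SCRATCH_BYTES = 64 * 1024 * 1024;
    // Assumption: 64-byte cache lines.
    static final int CACHE_LINE_BYTES = 64;

    private static final byte[] SCRATCH = new byte[SCRATCH_BYTES];

    // Touches one byte per cache line of the scratch array so that previously
    // cached lines are likely displaced. Returns a value the caller can pass
    // to a Blackhole so the loop is not optimized away.
    static long evictCacheLines() {
        long sum = 0;
        for (int i = 0; i < SCRATCH.length; i += CACHE_LINE_BYTES) {
            SCRATCH[i]++;       // write, so the line is pulled in and dirtied
            sum += SCRATCH[i];
        }
        return sum;
    }
}
```

This only makes eviction likely. Unlike clflush, it gives no guarantee that a particular line has left the cache.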
Related
I have a large Map in which I store some objects: around 200k of them. When I run methods that need to read the map values, the program freezes. When I debug it, my IDE seems to be stuck 'collecting data' (picture) and never completes the task.
I have 16GB RAM.
What can I do to speed this up?
I get performance issues around 61 million elements.
import java.util.*;
public class BreakingMaps{
public static void main(String[] args){
int count = Integer.MAX_VALUE>>5;
System.out.println(count + " objects tested");
HashMap<Long, String> set = new HashMap<>(count);
for(long i = 0; i<count; i++){
Long l = i;
set.put(l, l.toString());
}
Random r = new Random();
for(int i = 0; i<1000; i++){
long k = r.nextInt()%count;
k = k<0?-k:k;
System.out.println(set.get(k));
}
}
}
I run the program with java -Xms12G -Xmx13G BreakingMaps
I suspect your problem is not the map, but the circumstances surrounding the map. If I write the same program but use a key class with hash code collisions, then the program cannot handle 200K elements.
static class Key{
final long l;
public Key(long l){
this.l = l;
}
@Override
public int hashCode(){
return 1;
}
@Override
public boolean equals(Object o){
if(o!=null && o instanceof Key){
return ((Key)o).l==l;
}
return false;
}
}
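To see the effect in isolation, here is a small sketch that fills one map with well-distributed Long keys and another with keys whose hashCode always returns 1. The class names, the element count and the timing approach are illustrative assumptions, and the printed numbers are not a rigorous benchmark.

```java
import java.util.HashMap;
import java.util.Map;

final class CollisionSketch {
    // Same shape as the Key class above: every instance lands in one bucket.
    static final class CollidingKey {
        final long l;
        CollidingKey(long l) { this.l = l; }
        @Override public int hashCode() { return 1; }
        @Override public boolean equals(Object o) {
            return o instanceof CollidingKey && ((CollidingKey) o).l == l;
        }
    }

    // Fills a map with n keys that all share one hash code.
    static Map<CollidingKey, String> fillColliding(int n) {
        Map<CollidingKey, String> map = new HashMap<>();
        for (long i = 0; i < n; i++) {
            map.put(new CollidingKey(i), Long.toString(i));
        }
        return map;
    }

    // Fills a map with n Long keys, whose hash codes spread over the buckets.
    static Map<Long, String> fillDistributed(int n) {
        Map<Long, String> map = new HashMap<>();
        for (long i = 0; i < n; i++) {
            map.put(i, Long.toString(i));
        }
        return map;
    }

    public static void main(String[] args) {
        int n = 5_000; // kept small on purpose: the colliding case grows much faster than linearly
        long t0 = System.nanoTime();
        fillDistributed(n);
        long t1 = System.nanoTime();
        fillColliding(n);
        long t2 = System.nanoTime();
        System.out.printf("distributed: %d us, colliding: %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```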
As a solution, you can increase the heap size for your app:
java -Xmx6g myprogram
But that is not a very good fix. I'd suggest reworking your data processing approach. Maybe you can apply some filtering before fetching the data to reduce its size, or do some of the calculation at the database level.
I have a class like this:
class Test
{
public static Test getInstance()
{
return new Test();
}
public void firstMethod()
{
//do something
}
public void secondMethod()
{
//do something
}
public void thirdMethod()
{
//do something
}
}
If another class calls Test.getInstance().methodName() several times with different methods, what happens?
Which of the following is faster and uses less memory?
Test.getInstance().firstMethod();
Test.getInstance().secondMethod();
Test.getInstance().thirdMethod();
or
Test test = Test.getInstance();
test.firstMethod();
test.secondMethod();
test.thirdMethod();
Test.getInstance().firstMethod();
Test.getInstance().secondMethod();
Test.getInstance().thirdMethod();
This will create three different instances of the Test class and call a method on each.
Test test = Test.getInstance();
test.firstMethod();
test.secondMethod();
test.thirdMethod();
Will create only one instance and invoke the three methods on that instance.
So it's a completely different behavior to begin with. Obviously, since the first creates three objects, then it should take more heap space.
If you're intending to implement a singleton class, however, both are equivalent.
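If a singleton is what was intended, a minimal sketch could look like this. SingletonTest is a hypothetical name, and the eager static final field is just one common way to do it, not something the question's code does.

```java
final class SingletonTest {
    // Created once, when the class is initialized.
    private static final SingletonTest INSTANCE = new SingletonTest();

    // Private constructor: no other class can create further instances.
    private SingletonTest() { }

    // Always returns the same instance, so repeated calls allocate nothing.
    static SingletonTest getInstance() {
        return INSTANCE;
    }

    void firstMethod() { /* do something */ }
    void secondMethod() { /* do something */ }
}
```

With this version, chaining Test.getInstance() calls and holding the instance in a local variable operate on the same object.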
Every time you call getInstance the system has to allocate heap storage for a Test object and initialize it.
Furthermore, somewhere down the line the system will have to garbage-collect all those extra Test objects. With a copying collector the overhead per object is minimal, but there is some -- if for no other reason than you're causing GC to occur more often.
class Test
{
public static Test getInstance()
{
return new Test();
}
public void firstMethod()
{
// do something
}
public void secondMethod()
{
// do something
}
public void thirdMethod()
{
// do something
}
}
public class Blah
{
public static void main(String[] args)
{
int i = 0;
long start = System.nanoTime();
Test t = new Test();
for (; i < 100000; i++)
{
t.firstMethod();
}
long stop = System.nanoTime();
System.out.println(stop - start);
i = 0;
start = System.nanoTime();
for (; i < 100000; i++)
{
Test.getInstance().firstMethod();
}
stop = System.nanoTime();
System.out.println(stop - start);
}
}
output:
~3486938
~4894574
Creating a single instance with new Test() proved to be consistently faster, by about 30%.
Memory is harder to compare because it cannot be measured for both variants in one run. However, if we run only the first loop (changing what's inside) and use:
Runtime runtime = Runtime.getRuntime();
long memory = runtime.totalMemory() - runtime.freeMemory();
just before printing, then in two separate runs we can determine the difference: ~671200 or ~1342472 (it seemed to change between runs rather randomly, with no clear influence on runtime) for new Test(), and ~2389288 (no big differences this time) for getInstance(), over 100000 iterations. Again, a clear victory for the single instance.
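The measurement could be wrapped in a small helper like the following sketch. The class and method names are made up for illustration, and the result is only a rough estimate, since totalMemory() - freeMemory() is affected by GC timing and by thread-local allocation buffers.

```java
final class HeapDeltaSketch {
    // Approximate number of heap bytes currently in use.
    static long usedHeap() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Runs the task and returns the change in used heap.
    // The value may be negative if a GC ran in between.
    static long heapDelta(Runnable task) {
        long before = usedHeap();
        task.run();
        return usedHeap() - before;
    }

    public static void main(String[] args) {
        Object[] keep = new Object[100_000]; // keeps the objects reachable during the measurement
        long delta = heapDelta(() -> {
            for (int i = 0; i < keep.length; i++) {
                keep[i] = new Object(); // stand-in for the object created by getInstance()
            }
        });
        System.out.println("approximate heap delta in bytes: " + delta);
    }
}
```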
more updates
As is explained in the selected answer, the problem is in JVM's garbage collection algorithm.
The JVM uses a card-marking algorithm to keep track of modified references in object fields. For each reference assignment to a field, it marks the associated card-table entry as dirty; this causes false sharing and hence blocks scaling. The details are well described in this article: https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
The option -XX:+UseCondCardMark (in Java 1.7u40 and up) mitigates the problem, and makes it scale almost perfectly.
updates
I found out (hinted from Park Eung-ju) that assigning an object into a field variable makes the difference. If I remove the assignment, it scales perfectly.
I think it probably has something to do with the Java memory model, for example that an object reference must point to a valid address before it becomes visible, but I am not completely sure. Both a double and an Object reference (likely) take 8 bytes on a 64-bit machine, so it seems to me that assigning a double value and assigning an Object reference should be the same in terms of synchronization.
Does anyone have a reasonable explanation?
Here I have a weird Java multi-threading scalability problem.
My code simply iterates over an array (using the visitor pattern) to compute simple floating-point operations and assign the result to another array. There is no data dependency, nor synchronization, so it should scale linearly (2x faster with 2 threads, 4x faster with 4 threads).
When primitive (double) array is used, it scales very well. When object type (e.g. String) array is used, it doesn't scale at all (even though the value of the String array is not used at all...)
Here's the entire source code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;
class Table1 {
public static final int SIZE1=200000000;
public static final boolean OBJ_PARAM;
static {
String type=System.getProperty("arg.type");
if ("double".equalsIgnoreCase(type)) {
System.out.println("Using primitive (double) type arg");
OBJ_PARAM = false;
} else {
System.out.println("Using object type arg");
OBJ_PARAM = true;
}
}
byte[] filled;
int[] ivals;
String[] strs;
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, "Strs");
}
public boolean iterate_range(int from, int to, MyVisitor v) {
for (int i=from; i<to; i++) {
if (filled[i]==1) {
// XXX: Here we are passing double or String argument
if (OBJ_PARAM) v.visit_obj(i, strs[i]);
else v.visit(i, ivals[i]);
}
}
return true;
}
}
class HeadTable {
byte[] filled;
double[] dvals;
boolean isEmpty;
HeadTable(int size) {
filled = new byte[size];
dvals = new double[size];
Arrays.fill(filled, (byte)0);
isEmpty = true;
}
public boolean contains(int i, double d) {
if (filled[i]==0) return false;
if (dvals[i]==d) return true;
return false;
}
public boolean contains(int i) {
if (filled[i]==0) return false;
return true;
}
public double groupby(int i) {
assert filled[i]==1;
return dvals[i];
}
public boolean insert(int i, double d) {
if (filled[i]==1 && contains(i,d)) return false;
if (isEmpty) isEmpty=false;
filled[i]=1;
dvals[i] = d;
return true;
}
public boolean update(int i, double d) {
assert filled[i]==1;
dvals[i]=d;
return true;
}
}
class MyVisitor {
public static final int NUM=128;
int[] range = new int[2];
Table1 table1;
HeadTable head;
double diff=0;
int i;
int iv;
String sv;
MyVisitor(Table1 _table1, HeadTable _head, int id) {
table1 = _table1;
head = _head;
int elems=Table1.SIZE1/NUM;
range[0] = elems*id;
range[1] = elems*(id+1);
}
public void run() {
table1.iterate_range(range[0], range[1], this);
}
//YYY 1: with double argument, this function is called
public boolean visit(int _i, int _v) {
i = _i;
iv = _v;
insertDiff();
return true;
}
//YYY 2: with String argument, this function is called
public boolean visit_obj(int _i, Object _v) {
i = _i;
iv = 42;
sv = (String)_v;
insertDiff();
return true;
}
public boolean insertDiff() {
if (!head.contains(i)) {
head.insert(i, diff);
return true;
}
double old = head.groupby(i);
double newval=Math.min(old, diff);
head.update(i, newval);
head.insert(i, diff);
return true;
}
}
public class ParTest1 {
public static int THREAD_NUM=4;
public static void main(String[] args) throws Exception {
if (args.length>0) {
THREAD_NUM = Integer.parseInt(args[0]);
System.out.println("Setting THREAD_NUM:"+THREAD_NUM);
}
Table1 table1 = new Table1(Table1.SIZE1);
HeadTable head = new HeadTable(Table1.SIZE1);
MyVisitor[] visitors = new MyVisitor[MyVisitor.NUM];
for (int i=0; i<visitors.length; i++) {
visitors[i] = new MyVisitor(table1, head, i);
}
int taskPerThread = visitors.length / THREAD_NUM;
MyThread[] threads = new MyThread[THREAD_NUM];
CyclicBarrier barrier = new CyclicBarrier(THREAD_NUM+1);
for (int i=0; i<THREAD_NUM; i++) {
threads[i] = new MyThread(barrier);
for (int j=taskPerThread*i; j<taskPerThread*(i+1); j++) {
if (j>=visitors.length) break;
threads[i].addVisitors(visitors[j]);
}
}
Runtime r=Runtime.getRuntime();
System.out.println("Force running gc");
r.gc(); // running GC here (excluding GC effect)
System.out.println("Running gc done");
// not measuring 1st run (excluding JIT compilation effect)
for (int i=0; i<THREAD_NUM; i++) {
threads[i].start();
}
barrier.await();
for (int i=0; i<10; i++) {
MyThread.start = true;
long s=System.currentTimeMillis();
barrier.await();
long e=System.currentTimeMillis();
System.out.println("Iter "+i+" Exec time:"+(e-s)/1000.0+"s");
}
}
}
class MyThread extends Thread {
static volatile boolean start=true;
static int tid=0;
int id=0;
ArrayList<MyVisitor> tasks;
CyclicBarrier barrier;
public MyThread(CyclicBarrier _barrier) {
super("MyThread"+(tid++));
barrier = _barrier;
id=tid;
tasks = new ArrayList(256);
}
void addVisitors(MyVisitor v) {
tasks.add(v);
}
public void run() {
while (true) {
while (!start) { ; }
for (int i=0; i<tasks.size(); i++) {
MyVisitor v=tasks.get(i);
v.run();
}
start = false;
try { barrier.await();}
catch (InterruptedException e) { break; }
catch (Exception e) { throw new RuntimeException(e); }
}
}
}
The Java code can be compiled with no dependency, and you can run it with the following command:
java -Darg.type=double -server ParTest1 2
You pass the number of worker threads as an argument (the above uses 2 threads).
After setting up the arrays (which is excluded from the measured time), it performs the same operation 10 times, printing the execution time of each iteration.
With the above option, it uses the double array, and it scales very well with 1, 2 and 4 threads (i.e. the execution time drops to 1/2 and 1/4), but
java -Darg.type=Object -server ParTest1 2
With this option, it uses Object (String) array, and it doesn't scale at all!
I measured the GC time, but it was insignificant (and I also forced running GC before measuring times). I have tested with Java 6 (updates 43) and Java 7 (updates 51), but it's the same.
The code has comments with XXX and YYY describing the difference when arg.type=double or arg.type=Object option is used.
Can you figure out what is going on with the String-type argument passing here?
The HotSpot VM generates the following assembly for a putfield bytecode on a reference-type field.
mov ref, OFFSET_OF_THE_FIELD(this) <- this puts the new value for field.
mov this, REGISTER_A
shr 0x9, REGISTER_A
movabs OFFSET_X, REGISTER_B
mov %r12b, (REGISTER_A, REGISTER_B, 1)
The putfield operation itself is completed in one instruction,
but more instructions follow.
They are "card marking" instructions. (http://www.ibm.com/developerworks/library/j-jtp11253/)
Writing a reference field on any object within the same card (512 bytes) stores a value at the same memory address.
And I guess that stores to the same memory address from multiple cores mess with the caches and pipelines.
Just add
byte[] garbage = new byte[600];
to the MyVisitor definition.
Then the MyVisitor instances will be spaced far enough apart not to share a card-marking entry, and you will see the program scale.
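The shift by 9 in the assembly corresponds to 512-byte cards. The sketch below only illustrates that index arithmetic. The addresses are hypothetical (Java does not expose object addresses), and CardIndexSketch is a made-up name.

```java
final class CardIndexSketch {
    // 2^9 = 512 bytes per card, matching the "shr 0x9" in the assembly.
    static final int CARD_SHIFT = 9;

    // Index of the card-table entry that covers the given (hypothetical) address.
    static long cardIndex(long address) {
        return address >>> CARD_SHIFT;
    }

    public static void main(String[] args) {
        long a = 0x10000L; // hypothetical address of one MyVisitor
        long b = a + 64;   // a neighbour 64 bytes away
        long c = a + 600;  // a neighbour pushed away by 600 bytes of padding
        System.out.println(cardIndex(a) == cardIndex(b)); // true: same card, same marked entry
        System.out.println(cardIndex(a) == cardIndex(c)); // false: different cards
    }
}
```

Objects that map to the same index make every thread write the same card-table entry, which is the contention the padding is meant to avoid.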
This is not a complete answer but may provide a hint for you.
I have changed your code
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, "Strs");
}
to
Table1(int size) {
filled = new byte[size];
ivals = new int[size];
strs = new String[size];
Arrays.fill(filled, (byte)1);
Arrays.fill(ivals, 42);
Arrays.fill(strs, new String("Strs"));
}
After this change, the running time with 4 threads and the object-type array was reduced.
According to http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.7
For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64-bit value from one write, and the second 32 bits from another write.
Writes and reads of volatile long and double values are always atomic.
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
Assigning a reference is always atomic,
and assigning a double is not atomic unless it is declared volatile.
The problem is that sv can be seen by other threads and its assignment is atomic.
Therefore, wrapping the visitor's member variables (i, iv, sv) in ThreadLocal will solve the problem.
"sv = (String)_v;" makes the difference. I also confirmed that the type casting is not the factor. Just accessing _v can't make the difference. Assigning some value to sv field makes the difference. But I can't explain why.
Assume I have a large array of relatively small objects, which I need to iterate frequently.
I would like to optimize my iteration by improving cache performance, so I would like to allocate the objects [and not the references] contiguously in memory, so that I get fewer cache misses and the overall performance could be significantly better.
In C++, I could just allocate an array of the objects, and it would lay them out as I wanted, but in Java, when allocating an array, I only allocate the references, and the objects are allocated one at a time.
I am aware that if I allocate the objects "at once" [one after the other], the JVM is most likely to allocate them as contiguously as it can, but that might not be enough if the memory is fragmented.
My questions:
Is there a way to tell the JVM to defragment the memory just before I start allocating my objects? Will it be enough to ensure [as much as possible] that the objects are allocated contiguously?
Is there a different solution to this issue?
New objects are created in the Eden space. The Eden space is never fragmented. It is always empty after a GC.
The problem you have is that when a GC is performed, objects can be arranged randomly in memory, or even, surprisingly, in the reverse order of how they are referenced.
A workaround is to store the fields as a series of arrays. I call this a column-based table instead of a row-based table.
e.g. Instead of writing
class PointCount {
double x, y;
int count;
}
PointCount[] pc = new PointCount[n]; // n small objects, each allocated separately
use column-based data types.
class PointCounts {
double[] xs, ys;
int[] counts;
}
or
class PointCounts {
TDoubleArrayList xs, ys;
TIntArrayList counts;
}
The arrays themselves could be in up to three different places, but the data is otherwise always continuous. This can even be marginally more efficient if you perform operations on a subset of fields.
public int totalCount() {
int sum = 0;
// counts are continuous without anything between the values.
for(int i: counts) sum += i;
return sum;
}
A solution I use to avoid the GC overhead of holding large amounts of data is to use an interface to access a direct or memory-mapped ByteBuffer.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
public class MyCounters {
public static void main(String... args) {
Runtime rt = Runtime.getRuntime();
long used1 = rt.totalMemory() - rt.freeMemory();
long start = System.nanoTime();
int length = 100 * 1000 * 1000;
PointCount pc = new PointCountImpl(length);
for (int i = 0; i < length; i++) {
pc.index(i);
pc.setX(i);
pc.setY(-i);
pc.setCount(1);
}
for (int i = 0; i < length; i++) {
pc.index(i);
if (pc.getX() != i) throw new AssertionError();
if (pc.getY() != -i) throw new AssertionError();
if (pc.getCount() != 1) throw new AssertionError();
}
long time = System.nanoTime() - start;
long used2 = rt.totalMemory() - rt.freeMemory();
System.out.printf("Creating an array of %,d used %,d bytes of heap and took %.1f seconds to set and get%n",
length, (used2 - used1), time / 1e9);
}
}
interface PointCount {
// set the index of the element referred to.
public void index(int index);
public double getX();
public void setX(double x);
public double getY();
public void setY(double y);
public int getCount();
public void setCount(int count);
public void incrementCount();
}
class PointCountImpl implements PointCount {
static final int X_OFFSET = 0;
static final int Y_OFFSET = X_OFFSET + 8;
static final int COUNT_OFFSET = Y_OFFSET + 8;
static final int LENGTH = COUNT_OFFSET + 4;
final ByteBuffer buffer;
int start = 0;
PointCountImpl(int count) {
this(ByteBuffer.allocateDirect(count * LENGTH).order(ByteOrder.nativeOrder()));
}
PointCountImpl(ByteBuffer buffer) {
this.buffer = buffer;
}
@Override
public void index(int index) {
start = index * LENGTH;
}
@Override
public double getX() {
return buffer.getDouble(start + X_OFFSET);
}
@Override
public void setX(double x) {
buffer.putDouble(start + X_OFFSET, x);
}
@Override
public double getY() {
return buffer.getDouble(start + Y_OFFSET);
}
@Override
public void setY(double y) {
buffer.putDouble(start + Y_OFFSET, y);
}
@Override
public int getCount() {
return buffer.getInt(start + COUNT_OFFSET);
}
@Override
public void setCount(int count) {
buffer.putInt(start + COUNT_OFFSET, count);
}
@Override
public void incrementCount() {
setCount(getCount() + 1);
}
}
Run with the -XX:-UseTLAB option (to get accurate memory allocation sizes), this prints
Creating an array of 100,000,000 used 12,512 bytes of heap and took 1.8 seconds to set and get
As it's off heap, it has next to no GC impact.
Sadly, there is no way of ensuring objects are created/stay at adjacent memory locations in Java.
However, objects created in sequence will most likely end up adjacent to each other (of course this depends on the actual VM implementation). I'm pretty sure that the writers of the VM are aware that locality is highly desirable and don't go out of their way to scatter objects randomly around.
The garbage collector will probably move the objects at some point. If your objects are short-lived, that should not be an issue. For long-lived objects it depends on how the GC implements moving the surviving objects. Again, I think it's reasonable to assume that the people writing the GC have given the matter some thought and will perform copies in a way that does not hurt locality more than is unavoidable.
There are obviously no guarantees for any of the above assumptions, but since we can't do anything about it anyway, stop worrying :)
The only thing you can do at the java source level is to sometimes avoid composition of objects - instead you can "inline" the state you would normally put in a composite object:
class MyThing {
int myVar;
// ... more members
// composite object
Rectangle bounds;
}
instead:
class MyThing {
int myVar;
// ... more members
// "inlined" rectangle
int x, y, width, height;
}
Of course this makes the code less readable and duplicates potentially a lot of code.
Ordering class members by access pattern seems to have a slight effect (I noticed a slight change in a benchmarked piece of code after I had reordered some declarations), but I've never bothered to verify whether it's true. It would make sense if the VM does no reordering of members.
On the same topic, it would also be nice (from a performance point of view) to be able to reinterpret an existing primitive array as another type (e.g. cast int[] to float[]). And while you're at it, why not wish for union members as well? I sure do.
But we'd have to give up a lot of platform and architecture independence in exchange for these possibilities.
Doesn't work that way in Java. Iteration is not a matter of increasing a pointer. There is no performance impact based on where on the heap the objects are physically stored.
If you still want to approach this in a C/C++ way, think of a Java array as an array of pointers to structs. When you loop over the array, it doesn't matter where the actual structs are allocated, you are looping over an array of pointers.
I would abandon this line of reasoning. It's not how Java works and it's also sub-optimization.
I'm comparing methods of building an array without repeated elements, measuring the time each one takes and the memory it uses.
For the hash method and the TreeSet method the memory prints without a problem, but for the brute-force search it doesn't print any memory. Is it possible that brute force doesn't use any "respectable" amount of memory because it just compares the elements one by one? This is the code I have. Is something wrong with it?
public static void main(String[] args)
{
Random r = new Random();
int warmup = 0;
while(warmup<nr) {
tempoInicial = System.nanoTime();
memoriaInicial = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
while(ne<max)
{
valor = r.nextInt(maxRandom);
acrescentar();
}
tempoFinal = System.nanoTime();
memoriaFinal = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
retirar();
System.gc();
warmup++;
}
and
private static void acrescentar()
{
if(usaTreeSet)
{
if(ts.contains(valor))
return;
ts.add(valor);
}
if(usaHashSet)
{
if(hs.contains(valor))
return;
hs.add(valor);
}
if(usaBruteForce)
{
for(int i=0; i<ne; i++)
{
if(pilha[i]==valor)
return;
}
}
pilha[ne]=valor;
ne++;
}
When testing small amounts of memory, try turning off the TLAB with -XX:-UseTLAB, and then no object is too small. ;) The TLAB allocates memory to each thread a block at a time. These blocks do not count towards the free memory.
You might find this article on Getting the size of an Object useful.