Java: does a brute-force array search not use memory? - java

I'm comparing methods for building an array without repeated elements, measuring the time each takes and the memory it uses.
For the HashSet method and the TreeSet method the memory prints without problem, but for the brute-force search no memory is printed at all. Is it possible that brute force uses no measurable memory because it just compares elements one by one? This is the code I have. Is it possible that something is wrong?
public static void main(String[] args)
{
    Random r = new Random();
    int warmup = 0;
    while (warmup < nr) {
        tempoInicial = System.nanoTime();
        memoriaInicial = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        while (ne < max)
        {
            valor = r.nextInt(maxRandom);
            acrescentar();
        }
        tempoFinal = System.nanoTime();
        memoriaFinal = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        retirar();
        System.gc();
        warmup++;
    }
}
and
private static void acrescentar()
{
    if (usaTreeSet)
    {
        if (ts.contains(valor))
            return;
        ts.add(valor);
    }
    if (usaHashSet)
    {
        if (hs.contains(valor))
            return;
        hs.add(valor);
    }
    if (usaBruteForce)
    {
        for (int i = 0; i < ne; i++)
        {
            if (pilha[i] == valor)
                return;
        }
    }
    pilha[ne] = valor;
    ne++;
}

When testing small amounts of memory, try turning off the TLAB so that no object is too small to measure. ;) Use -XX:-UseTLAB. The TLAB hands out blocks of memory to each thread in advance, and allocations served from those blocks do not show up in the free-memory figure.
You might find this article on Getting the size of an Object useful.
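The measurement idea the question relies on can be sketched as below. This is a minimal illustration, not the asker's program, and the class and method names are made up. Small allocations may be served entirely from an already-reserved TLAB and show a delta of zero, which is why -XX:-UseTLAB (or a large, TLAB-bypassing allocation, as here) makes the numbers visible.

```java
// Sketch: measuring heap usage around an allocation with Runtime.
// Run with -XX:-UseTLAB so small allocations are reflected in
// freeMemory(); very large arrays bypass the TLAB anyway.
public class MemoryDelta {
    public static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeap();
        byte[] big = new byte[50_000_000]; // large enough to skip the TLAB
        long after = usedHeap();
        System.out.println("approx. bytes allocated: " + (after - before)
                + " for an array of length " + big.length);
    }
}
```

Note that the delta is only approximate: a concurrent GC can shrink the used figure between the two samples.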

Related

Flush cache lines in case of a Single Shot benchmark

I'd like to run a SingleShot JMH benchmark with every level of the cache hierarchy touching the memory the benchmark works on reliably flushed.
The benchmark looks roughly as follows:
@State(Scope.Benchmark)
public class MyBnchmrk {
    public byte buffer[];

    @Setup(Level.Trial)
    public void generateSampleData() throws IOException {
        // writes to buffer ...
    }

    @Setup(Level.Invocation)
    public void flushCaches() {
        // Ideally I'd like to invoke here something like the
        // _mm_clflushopt() intrinsic, as in GCC/clang, for each line of the buffer
    }

    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)
    public void benchmarkMemoryBoundCode() {
        // the benchmark
    }
}
Is there a Java way to flush caches before a single-shot measurement, or is a hand-written clflush required?
Calling clflush directly from Java is possible if you want to measure cache-miss costs, but you would end up writing a JNI library with an ASM intrinsic. Even then you probably can't do it reliably, since you need to supply a virtual address, and the GC may move your buffer at any time.
Instead I offer you this:
Use a single-shot benchmark, as you do.
Measuring a single operation is not a good idea (measuring nanoseconds has high error). Instead, create a million identical buffers and perform the same operation on each of them; every time you access the next buffer, it is not in the cache.
You can also run some calculation between iterations, for example reading 32+ MB of memory so it evicts lines from your cache. But with a million buffers this doesn't show any additional benefit.
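The cache-eviction step mentioned in the last point can be sketched as follows; the 32 MB size and the 64-byte line stride are assumptions, chosen to exceed a typical last-level cache and to touch each cache line once:

```java
// Sketch: evict cache lines by streaming over a buffer larger than
// the last-level cache, displacing whatever was cached before.
public class CacheEvictor {
    private static final byte[] JUNK = new byte[32 * 1024 * 1024];

    // Returns a sum so the JIT cannot eliminate the reads as dead code.
    public static long evict() {
        long sum = 0;
        for (int i = 0; i < JUNK.length; i += 64) { // step one cache line at a time
            sum += JUNK[i];
        }
        return sum;
    }
}
```

In a JMH setting the returned value would be fed to a Blackhole for the same dead-code reason.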
The resulting code:
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1)
public class BufferBenchmarkLatency {
    public static final int BATCH_SIZE = 1000000;
    public static final int MY_BUFFER_SIZE = 1024;
    public static final int CACHE_LINE_PADDING = 256;

    public static class StateHolder extends Padder {
        byte buffer[];
        StateHolder() {
            buffer = new byte[CACHE_LINE_PADDING + MY_BUFFER_SIZE + CACHE_LINE_PADDING];
            Arrays.fill(buffer, (byte) ThreadLocalRandom.current().nextInt());
        }
    }

    private final StateHolder[] arr = new StateHolder[BATCH_SIZE];
    private int index;

    @Setup(Level.Trial)
    public void setUpTrial() {
        for (int i = 0; i < arr.length; i++) {
            arr[i] = new StateHolder();
        }
        ArrayUtil.shuffle(arr);
    }

    @Setup(Level.Iteration)
    public void prepareForIteration(Blackhole blackhole) {
        index = 0;
        blackhole.consume(CacheUtil.evictCacheLines());
        System.gc();
        System.gc();
    }

    @Benchmark
    public long read() {
        byte[] buffer = arr[index].buffer;
        return buffer[0];
    }

    @TearDown(Level.Invocation)
    public void move() {
        index++;
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(BufferBenchmarkLatency.class.getSimpleName())
                .measurementBatchSize(BATCH_SIZE)
                .warmupBatchSize(BATCH_SIZE)
                .measurementIterations(10)
                .warmupIterations(10)
                .build();
        new Runner(opt).run();
    }
}
As you see, I pad the state holder itself, so the buffer references being read always land on different cache lines (the Padder class has 24 long fields). I also pad the buffer itself; JMH won't do that for you.
I've implemented this idea, and I get an average result of 100 ns for a simple operation like reading the first element of the buffer. Reading the first element requires touching two cache lines (the buffer reference plus the element itself). The full code is here

Is Java 7 intelligent enough to collect objects that are no longer used within a scope, even though the scope has not yet ended?

Consider the following code:
public class BitSetTest
{
    public static void main(final String[] args) throws IOException
    {
        System.out.println("Start?");
        int ch = System.in.read();
        List<Integer> numbers = getSortedNumbers();
        System.out.println("Generated numbers");
        ch = System.in.read();
        RangeSet<Integer> rangeSet = TreeRangeSet.create();
        for (Integer number : numbers)
        {
            rangeSet.add(Range.closed(number, number));
        }
        System.out.println("Finished rangeset");
        ch = System.in.read();
        BitSet bitset = new BitSet();
        for (Integer number : numbers)
        {
            bitset.set(number.intValue());
        }
        System.out.println("Finished bitset");
        ch = System.in.read();
        //System.out.println(numbers.size());
        //System.out.println(rangeSet.isEmpty());
        //System.out.println(bitset.size());
    }

    private static List<Integer> getSortedNumbers()
    {
        int max = 200000000;
        int n = max / 10;
        List<Integer> numbers = Lists.newArrayListWithExpectedSize(max);
        File file = new File("numbers.txt");
        if (file.exists())
        {
            try (BufferedReader reader = new BufferedReader(new FileReader(file)))
            {
                String line;
                while ((line = reader.readLine()) != null)
                {
                    numbers.add(Integer.valueOf(line));
                }
            }
            catch (IOException e1)
            {
                throw new RuntimeException(e1);
            }
        }
        else
        {
            for (int i = 0; i < n; i++)
            {
                int number = (int) (Math.random() * max);
                numbers.add(number);
                if (i > 0 && i % 10000 == 0)
                {
                    System.out.println(i);
                }
            }
            Collections.sort(numbers);
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(file)))
            {
                writer.write(numbers.get(0) + "");
                for (int i = 1; i < n; i++)
                {
                    writer.write("\n");
                    writer.write(numbers.get(i) + "");
                }
            }
            catch (IOException e1)
            {
                throw new RuntimeException(e1);
            }
        }
        return numbers;
    }
}
At the first pause(System.in.read()), JConsole shows memory usage as 4MB.
At the second pause("Generated Numbers"), since a large list is instantiated, the memory usage jumps to 922MB.
At the next pause ("Finished rangeset"), after running GC the memory comes back to 4MB which means the list is collected although the function has not ended in scope.
When the commented-out sysouts are uncommented, the list does not get collected until the sysout executes.
I just wanted to understand: is the JVM intelligent enough to treat an object as collectable from the point where it is no longer used, rather than from the end of its scope?
Garbage collection is based on generations (there are some changes in Java 8). Until Java 8, memory was divided into three parts: the Young generation, the Old generation, and PermGen. All newly created objects go into the Young generation, and if they are still reachable after some time they are migrated to the Old generation. PermGen was used mostly for the JVM's own data. Garbage collection of the Young generation is called minor garbage collection and happens relatively frequently.
Java's approach to garbage collection is "mark and sweep" (see the first link): it marks all objects that are not referenced by any live code as dead, then cleans them up (the sweep).
In your particular case the following happens:
Java loads your class into the young memory generation;
Java kicks off the main method;
Your code allocates more memory in the young generation;
While your code runs, all this memory is referenced by live code and therefore is not cleaned;
Your program stops running;
Garbage collection kicks in and detects that all the data, and the class itself, is no longer referenced by any live code, so it marks everything for deletion and eventually cleans it up.
Based on what you are saying, there is a good chance that your class and all your data never even make it to the Old generation.
To be clear: garbage collection happens in parallel with your code and can therefore detect that some data is no longer referenced before the method ends. Assuming that an object only becomes unreferenced when its method ends is not always correct (as your test proves).
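The reachability-rather-than-scope behavior described above can be observed with a WeakReference. This is an illustrative sketch; whether the reference is actually cleared after System.gc() is JVM-dependent, so the code prints the result rather than assuming a particular outcome:

```java
import java.lang.ref.WeakReference;

// Sketch: an object becomes collectable as soon as no live code can
// reach it, even though the variable's lexical scope hasn't ended.
public class ReachabilityDemo {
    public static void main(String[] args) {
        Object obj = new Object();
        WeakReference<Object> weak = new WeakReference<>(obj);
        System.out.println("reachable before: " + (weak.get() != null)); // true

        obj = null;    // last strong reference dropped here, so the
        System.gc();   // collector may reclaim the object now, long
                       // before main() returns

        System.out.println("reachable after gc: " + (weak.get() != null));
    }
}
```

On HotSpot with the commented-out sysouts restored (i.e. with `obj` still used later), the reference would remain strongly reachable and could not be cleared.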

Multi Threading Java

In my program I essentially have a method similar to:
for (int x = 0; x < numberofimagesinmyfolder; x++) {
    for (int y = 0; y < numberofimagesinmyfolder; y++) {
        compare(imagex, imagey);
        if (match == true) {
            System.out.println("image x matches image y");
        }
    }
}
So basically I have a folder of images and I compare all combinations of images: image 1 against all other images, then image 2, and so on. My problem is that searching for matching images takes a long time, so I am trying to multithread the process. Does anyone have an idea of how to do this?
Instead of comparing the images every time, hash the images, save the hashes, and then compare the hashes of each pair of images. Since a hash is far smaller, you can fit more into memory and cache, which should significantly speed up comparisons.
There is probably a better way to do the search for equality as well, but one option would be to stick all the hashes into an array then sort them by hash value. Then iterate over the list looking for adjacent entries that are equal. This should be O(n*log(n)) instead of O(n^2) like your current version.
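A sketch of the hash-then-sort approach, using Arrays.hashCode over raw pixel data as a stand-in for a real image hash (any perceptual or cryptographic hash would slot in the same way):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: hash each image once, sort indices by hash, then only
// full-compare adjacent entries with equal hashes --
// O(n log n) instead of the O(n^2) all-pairs comparison.
public class DuplicateFinder {
    public static List<int[]> findDuplicatePairs(int[][] images) {
        int n = images.length;
        int[] hashes = new int[n];
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
            hashes[i] = Arrays.hashCode(images[i]); // stand-in image hash
        }
        Arrays.sort(idx, Comparator.comparingInt(i -> hashes[i]));

        List<int[]> pairs = new ArrayList<>();
        for (int k = 1; k < n; k++) {
            int a = idx[k - 1], b = idx[k];
            // equal hashes: do one full compare to rule out collisions
            if (hashes[a] == hashes[b] && Arrays.equals(images[a], images[b])) {
                pairs.add(new int[] { Math.min(a, b), Math.max(a, b) });
            }
        }
        return pairs;
    }
}
```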
The inner loop should start at y = x + 1 to take advantage of symmetry.
Load all images into memory first; don't do the comparisons from disk.
Use a Java ExecutorService (basically a thread pool). Queue tasks for all index combinations and let threads pull them from the task queue and run the comparisons.
Here is some general code to do the multi threading:
public static class CompareTask implements Runnable {
    CountDownLatch completion;
    Object imgA;
    Object imgB;

    public CompareTask(CountDownLatch completion, Object imgA, Object imgB) {
        this.completion = completion;
        this.imgA = imgA;
        this.imgB = imgB;
    }

    @Override
    public void run() {
        // TODO: Do computation...
        try {
            System.out.println("Thread simulating task start.");
            Thread.sleep(500);
            System.out.println("Thread simulating task done.");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        completion.countDown();
    }
}

public static void main(String[] args) throws Exception {
    Object[] images = new Object[10];
    ExecutorService es = Executors.newFixedThreadPool(5);
    CountDownLatch completion = new CountDownLatch(images.length * (images.length - 1) / 2);
    for (int i = 0; i < images.length; i++) {
        for (int j = i + 1; j < images.length; j++) {
            es.submit(new CompareTask(completion, images[i], images[j]));
        }
    }
    System.out.println("Submitted tasks. Waiting...");
    completion.await();
    System.out.println("Done");
    es.shutdown();
}

Can I allocate objects contiguously in java?

Assume I have a large array of relatively small objects, which I need to iterate frequently.
I would like to optimize my iteration by improving cache performance, so I would like the objects themselves [and not just the references] allocated contiguously in memory; I would get fewer cache misses, and overall performance could be significantly better.
In C++ I could just allocate an array of the objects and they would be laid out as I want, but in Java allocating an array only allocates the references, and the objects are allocated one at a time.
I am aware that if I allocate the objects "at once" [one after the other] the JVM is likely to allocate them as contiguously as it can, but that might not be enough if memory is fragmented.
My questions:
Is there a way to tell the JVM to defragment the memory just before I start allocating my objects? Would that be enough to ensure [as far as possible] that the objects are allocated contiguously?
Is there a different solution to this issue?
New objects are created in the Eden space. The Eden space is never fragmented; it is always empty after a GC.
The problem is that when a GC is performed, objects can be rearranged randomly in memory, or even, surprisingly, in the reverse of the order in which they are referenced.
A workaround is to store the fields as a series of arrays. I call this a column-based table instead of a row-based table.
e.g. Instead of writing
class PointCount {
    double x, y;
    int count;
}
PointCount[] pc = ... // lots of small objects
use column-based data types:
class PointCounts {
    double[] xs, ys;
    int[] counts;
}
or
class PointCounts {
    TDoubleArrayList xs, ys;
    TIntArrayList counts;
}
The arrays themselves could live in up to three different places, but within each array the data is always contiguous. This can even be marginally more efficient if you perform operations on only a subset of the fields.
public int totalCount() {
    int sum = 0;
    // counts are contiguous with nothing between the values.
    for (int i : counts) sum += i;
    return sum;
}
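Putting the column-based pieces together, a minimal self-contained version (class name illustrative) might look like this:

```java
// Sketch: a column-based layout keeps each field in its own
// contiguous primitive array, so summing one field streams over
// a single int[] with no pointer chasing.
public class PointColumns {
    final double[] xs, ys;
    final int[] counts;

    PointColumns(int size) {
        xs = new double[size];
        ys = new double[size];
        counts = new int[size];
    }

    public int totalCount() {
        int sum = 0;
        for (int c : counts) sum += c; // contiguous int[] scan
        return sum;
    }
}
```

Element i of the logical "table" is simply (xs[i], ys[i], counts[i]).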
A solution I use to avoid GC overhead when holding large amounts of data is to access a direct or memory-mapped ByteBuffer through an interface:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MyCounters {
    public static void main(String... args) {
        Runtime rt = Runtime.getRuntime();
        long used1 = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        int length = 100 * 1000 * 1000;
        PointCount pc = new PointCountImpl(length);
        for (int i = 0; i < length; i++) {
            pc.index(i);
            pc.setX(i);
            pc.setY(-i);
            pc.setCount(1);
        }
        for (int i = 0; i < length; i++) {
            pc.index(i);
            if (pc.getX() != i) throw new AssertionError();
            if (pc.getY() != -i) throw new AssertionError();
            if (pc.getCount() != 1) throw new AssertionError();
        }
        long time = System.nanoTime() - start;
        long used2 = rt.totalMemory() - rt.freeMemory();
        System.out.printf("Creating an array of %,d used %,d bytes of heap and took %.1f seconds to set and get%n",
                length, (used2 - used1), time / 1e9);
    }
}

interface PointCount {
    // set the index of the element referred to.
    public void index(int index);
    public double getX();
    public void setX(double x);
    public double getY();
    public void setY(double y);
    public int getCount();
    public void setCount(int count);
    public void incrementCount();
}

class PointCountImpl implements PointCount {
    static final int X_OFFSET = 0;
    static final int Y_OFFSET = X_OFFSET + 8;
    static final int COUNT_OFFSET = Y_OFFSET + 8;
    static final int LENGTH = COUNT_OFFSET + 4;
    final ByteBuffer buffer;
    int start = 0;

    PointCountImpl(int count) {
        this(ByteBuffer.allocateDirect(count * LENGTH).order(ByteOrder.nativeOrder()));
    }

    PointCountImpl(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public void index(int index) {
        start = index * LENGTH;
    }

    @Override
    public double getX() {
        return buffer.getDouble(start + X_OFFSET);
    }

    @Override
    public void setX(double x) {
        buffer.putDouble(start + X_OFFSET, x);
    }

    @Override
    public double getY() {
        return buffer.getDouble(start + Y_OFFSET);
    }

    @Override
    public void setY(double y) {
        buffer.putDouble(start + Y_OFFSET, y);
    }

    @Override
    public int getCount() {
        return buffer.getInt(start + COUNT_OFFSET);
    }

    @Override
    public void setCount(int count) {
        buffer.putInt(start + COUNT_OFFSET, count);
    }

    @Override
    public void incrementCount() {
        setCount(getCount() + 1);
    }
}
Run with the -XX:-UseTLAB option (to get accurate memory allocation sizes), this prints
Creating an array of 100,000,000 used 12,512 bytes of heap and took 1.8 seconds to set and get
As it's off-heap, it has next to no GC impact.
Sadly, there is no way to ensure objects are created at, or stay in, adjacent memory locations in Java.
However, objects created in sequence will most likely end up adjacent to each other (this depends, of course, on the actual VM implementation). I'm pretty sure the writers of the VM are aware that locality is highly desirable and don't go out of their way to scatter objects randomly around.
The garbage collector will probably move the objects at some point. If your objects are short-lived, that should not be an issue. For long-lived objects it depends on how the GC moves the surviving objects; again, I think it's reasonable to assume the people writing the GC have given the matter some thought and will perform copies in a way that doesn't hurt locality more than is unavoidable.
There are obviously no guarantees for any of the above assumptions, but since we can't do anything about it anyway, stop worrying :)
The only thing you can do at the Java source level is to sometimes avoid composition of objects: instead, you can "inline" the state you would normally put in a composite object:
class MyThing {
    int myVar;
    // ... more members

    // composite object
    Rectangle bounds;
}
instead:
class MyThing {
    int myVar;
    // ... more members

    // "inlined" rectangle
    int x, y, width, height;
}
Of course this makes the code less readable and potentially duplicates a lot of code.
Ordering class members by access pattern also seems to have a slight effect (I noticed a slight change in a benchmarked piece of code after reordering some declarations), but I've never bothered to verify whether it's real. It would make sense, though, if the VM does not reorder members.
On the same topic it would also be nice (from a performance point of view) to be able to reinterpret an existing primitive array as another type (e.g. cast int[] to float[]). And while we're at it, why not wish for union members as well? I sure do.
But we'd have to give up a lot of platform and architecture independence in exchange for these possibilities.
It doesn't work that way in Java. Iteration is not a matter of incrementing a pointer, and there is no performance impact based on where on the heap the objects are physically stored.
If you still want to approach this in a C/C++ way, think of a Java array as an array of pointers to structs. When you loop over the array, it doesn't matter where the actual structs are allocated, you are looping over an array of pointers.
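The array-of-pointers analogy can be made concrete: rearranging a Java object array moves only the references, and the objects themselves keep their identity (and heap location):

```java
// Sketch: a Java array of objects holds references; swapping entries
// moves the references, not the objects' field data.
public class RefArrayDemo {
    static class Box {
        int value;
        Box(int v) { value = v; }
    }

    public static void swap(Box[] arr, int i, int j) {
        Box tmp = arr[i]; // copies a reference, not the Box contents
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}
```

After a swap, arr[0] and arr[1] are the very same two Box objects as before, just reachable through different slots.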
I would abandon this line of reasoning. It's not how Java works and it's also sub-optimization.

Benchmarking small Arrays vs. Lists in Java: Is my benchmarking code wrong?

Disclaimer: I have looked through this question and this question, but they both got derailed by small details and general optimization-is-unnecessary concerns. I really need all the performance I can get in my current app, which receives, processes, and spews out MIDI data in realtime. It also needs to scale up as well as possible.
I am comparing array performance on a high number of reads for small lists to ArrayList and also to just having the variables in hand. I'm finding that an array beats ArrayList by a factor of 2.5 and even beats just having the object references.
What I would like to know is:
Is my benchmark okay? I have switched the order of the tests and the number of runs with no change. I've also used milliseconds instead of nanoseconds, to no avail.
Should I be specifying any Java options to minimize this difference?
If this difference is real, in this case shouldn't I prefer Test[] to ArrayList<Test> in this situation and put in the code necessary to convert them? Obviously I'm reading a lot more than writing.
JVM is Java 1.6.0_17 on OSX and it is definitely running in Hotspot mode.
public class ArraysVsLists {
    static int RUNS = 100000;

    public static void main(String[] args) {
        long t1;
        long t2;
        Test test1 = new Test();
        test1.thing = (int) Math.round(100 * Math.random());
        Test test2 = new Test();
        test2.thing = (int) Math.round(100 * Math.random());
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            test1.changeThing(i);
            test2.changeThing(i);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long NO collection");
        ArrayList<Test> list = new ArrayList<Test>(1);
        list.add(test1);
        list.add(test2);
        // tried this too: helps a tiny tiny bit
        list.trimToSize();
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            for (Test eachTest : list) {
                eachTest.changeThing(i);
            }
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long collection");
        Test[] array = new Test[2];
        list.toArray(array);
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            for (Test test : array) {
                test.changeThing(i);
            }
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long array ");
    }
}

class Test {
    int thing;
    int thing2;

    public void changeThing(int addThis) {
        thing2 = addThis + thing;
    }
}
Microbenchmarks are very, very hard to get right on a platform like Java. You definitely have to extract the code to be benchmarked into separate methods, run them a few thousand times as warmup, and only then measure. I've done that (code below), and the result is that direct access through references is then three times as fast as through an array, while the collection is still slower by a factor of 2.
These numbers are based on the JVM options -server -XX:+DoEscapeAnalysis. Without -server, using the collection is drastically slower (but strangely, direct and array access are quite a bit faster, indicating that something weird is going on). -XX:+DoEscapeAnalysis yields another 30% speedup for the collection, but it's very questionable whether it will work as well for your actual production code.
Overall my conclusion would be: forget about microbenchmarks; they are too easily misleading. Measure as close to production code as you can without having to rewrite your entire application.
import java.util.ArrayList;

public class ArrayTest {
    static int RUNS_INNER = 1000;
    static int RUNS_WARMUP = 10000;
    static int RUNS_OUTER = 100000;

    public static void main(String[] args) {
        long t1;
        long t2;
        Test test1 = new Test();
        test1.thing = (int) Math.round(100 * Math.random());
        Test test2 = new Test();
        test2.thing = (int) Math.round(100 * Math.random());
        for (int i = 0; i < RUNS_WARMUP; i++) {
            testRefs(test1, test2);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testRefs(test1, test2);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long NO collection");
        ArrayList<Test> list = new ArrayList<Test>(1);
        list.add(test1);
        list.add(test2);
        // tried this too: helps a tiny tiny bit
        list.trimToSize();
        for (int i = 0; i < RUNS_WARMUP; i++) {
            testColl(list);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testColl(list);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long collection");
        Test[] array = new Test[2];
        list.toArray(array);
        for (int i = 0; i < RUNS_WARMUP; i++) {
            testArr(array);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testArr(array);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long array ");
    }

    private static void testArr(Test[] array) {
        for (int i = 0; i < RUNS_INNER; i++) {
            for (Test test : array) {
                test.changeThing(i);
            }
        }
    }

    private static void testColl(ArrayList<Test> list) {
        for (int i = 0; i < RUNS_INNER; i++) {
            for (Test eachTest : list) {
                eachTest.changeThing(i);
            }
        }
    }

    private static void testRefs(Test test1, Test test2) {
        for (int i = 0; i < RUNS_INNER; i++) {
            test1.changeThing(i);
            test2.changeThing(i);
        }
    }
}

class Test {
    int thing;
    int thing2;

    public void changeThing(int addThis) {
        thing2 = addThis + thing;
    }
}
Your benchmark is only valid if your actual use case matches the benchmark code, i.e. very few operations on each element, so that execution time is largely determined by access time rather than the operations themselves. If that is the case then yes, you should be using arrays if performance is critical. If however your real use case involves a lot more actual computation per element, then the access time per element will become a lot less significant.
It is probably not valid. If I understand the way JIT compilers work, compiling a method won't affect a call to that method that is already executing. Since the main method is only called once, it will end up being interpreted, and since most of the work is done in the body of that method, the numbers you get won't be particularly indicative of normal execution.
JIT compilation effects may go some way toward explaining why the no-collections case was slower than the arrays case. That result is counter-intuitive, and it casts doubt on the other benchmark results you reported.
