I do not have a background in CS. I am really new to parallel programming, and I do not know exactly how the hardware behaves when running a program. However, I have noticed the following. Say I have:
public class Counter {
    private static int parallelCount = 0;
    private static int sequentialCount = 0;

    public static void main(String[] args) {
        int n = 1000;
        // I count in parallel:
        IntStream.range(0, n).parallel().forEach(i -> {
            parallelCount++;
        });
        // I count sequentially:
        for (int i = 0; i < n; i++) {
            sequentialCount++;
        }
        System.out.println("parallelCount = " + parallelCount);
        System.out.println("sequentialCount = " + sequentialCount);
    }
}
Why might I get:
parallelCount = 984
sequentialCount = 1000
I guess this has to do with the hardware and the way the compiler accesses memory. I am really interested to know why this happens, and what is one possible solution?
Whenever more than one thread can access a mutable value, you run into exactly the kind of problem you are facing: the threads' updates can interleave unpredictably. No one can be sure what the result will be, and much of the time the result will be wrong, because you cannot guarantee which thread will write the value last.
Therefore, you need to synchronize access to the shared resource (the integer you are incrementing) so that all threads see the latest updated value and the answer is always correct.
Coming to your program, you can make the parallelCount variable an AtomicInteger: AtomicInteger parallelCount = new AtomicInteger();. An AtomicInteger is thread-safe, meaning it can be updated concurrently without the lost updates described above.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class Counter {
    private static AtomicInteger parallelCount = new AtomicInteger();
    private static int sequentialCount = 0;

    public static void main(String[] args) {
        int n = 1000;
        // I count in parallel:
        IntStream.range(0, n).parallel().forEach(i -> {
            parallelCount.getAndIncrement();
        });
        // I count sequentially:
        for (int i = 0; i < n; i++) {
            sequentialCount++;
        }
        System.out.println("parallelCount = " + parallelCount);
        System.out.println("sequentialCount = " + sequentialCount);
    }
}
As you would expect, the standard for loop increments sequentialCount 1000 times.
Regarding the parallel stream, the application will use multiple threads to execute your function in parallel. Multiple threads can then read, increment, and store the value at the same time.
For example, suppose two threads working in parallel both want to increment parallelCount while it contains the value 50. Both threads read 50, both compute the new value 51, and both store 51 back to memory, so one increment is lost.
This approach can produce other concurrency problems as well. To solve it, you can use synchronization, locking, atomic classes, or another approach.
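For instance, here is a minimal sketch of the locking approach applied to the question's counter (the lock object and its name are illustrative, not from the original code):

private static final Object lock = new Object();
private static int parallelCount = 0;

// inside main:
IntStream.range(0, n).parallel().forEach(i -> {
    synchronized (lock) {
        parallelCount++; // only one thread at a time can execute this block
    }
});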
Multiple threads perform an operation that is not atomic (incrementing a value).
The code you wrote translates to byte code in which the increment is several separate instructions, so an unlucky interleaving of two threads might cause something like this:
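(A sketch of one possible interleaving; the instruction names are the simplified bytecode for a static int increment: getstatic, iconst_1, iadd, putstatic.)

Thread A: getstatic parallelCount    // A reads 50
Thread B: getstatic parallelCount    // B also reads 50
Thread A: iconst_1, iadd             // A computes 51
Thread B: iconst_1, iadd             // B computes 51
Thread A: putstatic parallelCount    // A writes 51
Thread B: putstatic parallelCount    // B writes 51 again: one increment is lost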
To avoid this, you need to synchronize the access to that critical code.
But note, that if all of your code is critical code, then it's redundant to use multiple threads.
AtomicInteger
We can make use of the AtomicInteger class from the Java concurrency package when working with parallel streams, since the behavior of a plain primitive counter is unpredictable under concurrent updates:
import java.util.stream.IntStream;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {
    private static AtomicInteger parallelCount = new AtomicInteger();
    private static int sequentialCount = 0;

    public static void main(String[] args) {
        int n = 100000;
        // I count in parallel:
        IntStream.range(0, n).parallel().forEach(i -> {
            parallelCount.incrementAndGet();
        });
        // I count sequentially:
        for (int i = 0; i < n; i++) {
            sequentialCount++;
        }
        System.out.println("parallelCount = " + parallelCount);
        System.out.println("sequentialCount = " + sequentialCount);
    }
}
Related
Consider the following code:
public static void main(String[] args) throws InterruptedException {
    int nThreads = 10;
    MyThread[] threads = new MyThread[nThreads];
    AtomicReferenceArray<Object> array = new AtomicReferenceArray<>(nThreads);
    for (int i = 0; i < nThreads; i++) {
        MyThread thread = new MyThread(array, i);
        threads[i] = thread;
        thread.start();
    }
    for (MyThread thread : threads)
        thread.join();
    for (int i = 0; i < nThreads; i++) {
        Object obj_i = array.get(i);
        // do something with obj_i...
    }
}

private static class MyThread extends Thread {
    private final AtomicReferenceArray<Object> pArray;
    private final int pIndex;

    public MyThread(final AtomicReferenceArray<Object> array, final int index) {
        pArray = array;
        pIndex = index;
    }

    @Override
    public void run() {
        // some entirely local time-consuming computation...
        pArray.set(pIndex, /* result of the computation */);
    }
}
Each MyThread computes something entirely locally (without need to synchronize with other threads) and writes the result to its specific array cell. The main thread waits until all MyThreads have finished, and then retrieves the results and does something with them.
Using the get and set methods of AtomicReferenceArray provides a memory ordering which guarantees that the main thread will see the results written by the MyThreads.
However, since every array cell is written only once, and no MyThread has to see the result written by any other MyThread, I wonder whether these strong ordering guarantees are actually necessary, or whether the following code, with plain array-cell accesses, would be guaranteed to always yield the same results as the code above:
public static void main(String[] args) throws InterruptedException {
    int nThreads = 10;
    MyThread[] threads = new MyThread[nThreads];
    Object[] array = new Object[nThreads];
    for (int i = 0; i < nThreads; i++) {
        MyThread thread = new MyThread(array, i);
        threads[i] = thread;
        thread.start();
    }
    for (MyThread thread : threads)
        thread.join();
    for (int i = 0; i < nThreads; i++) {
        Object obj_i = array[i];
        // do something with obj_i...
    }
}

private static class MyThread extends Thread {
    private final Object[] pArray;
    private final int pIndex;

    public MyThread(final Object[] array, final int index) {
        pArray = array;
        pIndex = index;
    }

    @Override
    public void run() {
        // some entirely local time-consuming computation...
        pArray[pIndex] = /* result of the computation */;
    }
}
On the one hand, under plain mode access the compiler or runtime might happen to optimize away the read accesses to array in the final loop of the main thread and replace Object obj_i = array[i]; with Object obj_i = null; (the implicit initialization of the array) as the array is not modified from within that thread. On the other hand, I have read somewhere that Thread.join makes all changes of the joined thread visible to the calling thread (which would be sensible), so Object obj_i = array[i]; should see the object reference assigned by the i-th MyThread.
So, would the latter code produce the same results as the above?
Yes.
The "somewhere" that you've read about Thread.join could be JLS 17.4.5 (The "Happens-before order" bit of the Java Memory Model):
All actions in a thread happen-before any other thread successfully returns from a join() on that thread.
So, all of your writes to individual elements will happen before the final join().
With this said, I would strongly recommend that you look for alternative ways to structure your problem that don't require you to be worrying about the correctness of your code at this level of detail (see my other answer).
An easier solution here would appear to be to use the Executor framework, which hides typically unnecessary details about the threads and how the result is stored.
For example:
ExecutorService executor = ...
List<Future<Object>> futures = new ArrayList<>();
for (int i = 0; i < nThreads; i++) {
    futures.add(executor.submit(new MyCallable(i)));
}
executor.shutdown();
Object[] array = new Object[nThreads];
for (int i = 0; i < nThreads; ++i) {
    array[i] = futures.get(i).get();
}
for (int i = 0; i < nThreads; i++) {
    Object obj_i = array[i];
    // do something with obj_i...
}
where MyCallable is analogous to your MyThread:
private static class MyCallable implements Callable<Object> {
    private final int pIndex;

    public MyCallable(final int index) {
        pIndex = index;
    }

    @Override
    public Object call() {
        // some entirely local time-consuming computation...
        return /* result of the computation */;
    }
}
This results in simpler and more-obviously correct code, because you're not worrying about memory consistency: this is handled by the framework. It also gives you more flexibility, e.g. running it on fewer threads than work items, reusing a thread pool etc.
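The elided executor can be created however you like; for example (an illustrative choice, not part of the original answer):

ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors()); // roughly one thread per core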
Atomic operations are required to ensure memory barriers are present when multiple threads access the same memory location. Without memory barriers, there is no happens-before relationship between the threads and no guarantee that the main thread will see the modifications done by the other threads; hence, a data race. So what you really need is memory barriers for the write and read operations. You can achieve that using AtomicReferenceArray or a synchronized block on a common object.
You have Thread.join in the second program before the read operations. That removes the data race. Without the join, you would need explicit synchronization.
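For example (a sketch, assuming the join calls were removed), a CountDownLatch would provide the same happens-before edge between the workers' writes and the main thread's reads:

CountDownLatch done = new CountDownLatch(nThreads);

// in each worker, after storing its result:
pArray[pIndex] = result;
done.countDown();

// in the main thread, before reading the array:
done.await(); // everything a worker did before countDown() is now visible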
I know this has been asked a few times before:
Java: how to synchronize array accesses and what are the limitations on what goes in a synchronized condition
Synchronizing elements in an array
but I couldn't quite find the answer to my question: when I have an array with multiple elements, and my threads are supposed to modify these elements at the same time (otherwise there would be no advantage to using threads, right?), and I use the array itself as the lock (which kind of makes sense, because it is the critical data that could end up in an inconsistent state due to time-slicing), then my operations on the array are safe, BUT all the parallelization is lost, isn't it?!
Here is some code I want to try this on:
import java.util.Arrays;

public class ParallelNine extends Thread {

    public static final int[] input = new int[]{
            119_119_119,
            119_119_119,
            111_111_111,
            999_999_999,
    };

    private static int completed = 0;

    static void process(int currentIndex) {
        System.out.println("Processing " + currentIndex);
        String number = Integer.toString(input[currentIndex]);
        int counter = 0;
        for (int index = 0; index < number.length(); index++) {
            if (number.charAt(index) == '9')
                counter++;
        }
        input[currentIndex] = counter;
    }

    @Override public void run() {
        while (completed < input.length) {
            synchronized (input) {
                if (completed >= input.length) // re-check under the lock
                    break;
                process(completed);
                completed++;
            }
        }
    }

    public static void main(final String... args) throws InterruptedException {
        Thread[] threads = new Thread[]{
                new ParallelNine(), new ParallelNine(),
                new ParallelNine(), new ParallelNine()};
        for (final Thread next : threads) {
            next.start(); // start(), not run(), so the work happens on the new threads
        }
        for (final Thread next : threads)
            next.join();
        System.out.println(Arrays.toString(input));
    }
}
We have an array of primitive int values. A dedicated method (process) counts how often the digit 9 appears in the integer value at a given index of our array, and then overwrites that element with the counted number of nines. So the correct output would be [3, 3, 0, 9]. This array (input) is pretty small, but if we had a few thousand entries, for example, it would make sense to have multiple threads counting nines. So I synchronized on the array, but as mentioned above: all the parallelization is lost, because only ONE thread at a time has access to the array!
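(As an aside, a sketch of one way to keep the parallelism, not taken from the original thread: since each element is processed exactly once, the threads only need to agree on who takes which index, and an AtomicInteger can hand out indices without locking the whole array.)

private static final java.util.concurrent.atomic.AtomicInteger nextIndex =
        new java.util.concurrent.atomic.AtomicInteger();

@Override public void run() {
    int i;
    // getAndIncrement() hands each caller a unique index, so the
    // threads process different elements concurrently, lock-free.
    while ((i = nextIndex.getAndIncrement()) < input.length) {
        process(i);
    }
}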
I'm playing around with threads and I don't understand why this isn't working as I thought.
I am trying to calculate a sum using threads and was expecting the thread pool to wait for all tasks to finish by the time I print out the result (due to the shutdown() call and the isTerminated() check).
What am I missing here?
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Test5 {
    private Integer sum = new Integer(0);

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        Test5 obj = new Test5();
        for (int i = 0; i < 1000; i++) {
            pool.execute(obj.new Adding());
        }
        pool.shutdown();
        while (!pool.isTerminated()) {
            //could be empty loop...
            System.out.println(" Is it done? : " + pool.isTerminated());
        }
        System.out.println(" Is it done? : " + pool.isTerminated());
        System.out.println("Sum is " + obj.sum);
    }

    class Adding implements Runnable {
        public void run() {
            synchronized (this) {
                int tmp = sum;
                tmp += 1;
                sum = new Integer(tmp);
            }
        }
    }
}
While I do get good results, I also get output such as this:
Is it done? : true
Sum is 983
You need to synchronize on the main object instance. I'm using int below; Integer would work too (but it needs to be initialized to zero explicitly).
Here is the working code
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AppThreadsSum {
    int sum;

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        AppThreadsSum app = new AppThreadsSum();
        for (int i = 0; i < 1000; i++) {
            pool.execute(app.new Adding());
        }
        pool.shutdown();
        while (!pool.isTerminated()) {
            System.out.println(" Is it done? : " + pool.isTerminated());
        }
        System.out.println(" Is it done? : " + pool.isTerminated());
        System.out.println("Sum is " + app.sum);
    }

    class Adding implements Runnable {
        public void run() {
            synchronized (AppThreadsSum.this) {
                sum += 1;
            }
        }
    }
}
p.s. Busy waiting is an anti-pattern to be avoided (copied from the neighboring answer for completeness; see that answer for the details).
You have a number of issues.
1. Your code is not thread-safe.
2. Busy waiting is an anti-pattern to be avoided.
What do I mean by 1.?
Let's suppose we have two threads, A and B:
A reads sum into tmp as 1
B reads sum into tmp as 1
A increments its tmp to 2
A writes sum as 2
B increments its tmp to 2
B writes sum as 2
So we end up with 2 after two increments. Not quite right.
Now you may say "but I have used synchronized, this should not happen". Well, you haven't.
When you create your Adding instances you new each one, so you have 1000 separate Adding instances.
When you use synchronized(this) you are synchronizing on the current instance, not across all Adding instances, so your synchronized block does nothing.
Now, the simple solution would be to use synchronized(Adding.class), which makes the code block synchronize correctly across all Adding instances.
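A sketch of that one-line fix applied to the Adding class from the question:

class Adding implements Runnable {
    public void run() {
        // Adding.class is one shared object, so this lock is
        // common to all 1000 Adding instances.
        synchronized (Adding.class) {
            int tmp = sum;
            tmp += 1;
            sum = new Integer(tmp);
        }
    }
}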
The better solution would be to use an AtomicInteger rather than an Integer, as it increments atomically and is designed for exactly this sort of task.
Now onto 2.
You have a while(thing){} loop, which basically runs the thread flat out, testing thousands of times a millisecond until thing is true. This is a huge waste of CPU cycles. An ExecutorService has a special, blocking method that waits until it has shut down: awaitTermination.
Here is an example:
static final AtomicInteger sum = new AtomicInteger(0);

public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newCachedThreadPool();
    for (int i = 0; i < 1000; i++) {
        pool.execute(new Adding());
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
    System.out.println(" Is it done? : " + pool.isTerminated());
    System.out.println("Sum is " + sum);
}

static class Adding implements Runnable {
    public void run() {
        sum.addAndGet(1);
    }
}
I would also suggest not using a cachedThreadPool in this circumstance; as you have 1000 Runnables being submitted, it will spawn far more threads than you have CPUs. I would suggest using newFixedThreadPool with a sane number of threads.
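For example (the core count here is just a common default, not a prescription):

ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());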
I'm not even going to go into the use of int literals and Integer and why new Integer() is not needed.
I have a List<String> called lines and a huge (~3G) Set<String> called voc. I need to find all lines from lines that are in voc. Can I do this in a multithreaded way?
Currently I have this straightforward code:
for (String line : lines) {
    if (voc.contains(line)) {
        // Great!!
    }
}
Is there a way to search for several lines at the same time? Maybe there are existing solutions?
PS: I am using javolution.util.FastMap, because it behaves better during filling up.
Here is a possible implementation. Please note that error/interruption handling has been omitted but this might give you a starting point. I included a main method so you could copy and paste this into your IDE for a quick demo.
Edit: Cleaned things up a bit to improve readability and List partitioning
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelizeListSearch {

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        List<String> searchList = new ArrayList<String>(7);
        searchList.add("hello");
        searchList.add("world");
        searchList.add("java");
        searchList.add("debian");
        searchList.add("linux");
        searchList.add("jsr-166");
        searchList.add("stack");

        Set<String> targetSet = new HashSet<String>(searchList);
        Set<String> matchSet = findMatches(searchList, targetSet);

        System.out.println("Found " + matchSet.size() + " matches");
        for (String match : matchSet) {
            System.out.println("match: " + match);
        }
    }

    public static Set<String> findMatches(List<String> searchList, Set<String> targetSet) throws InterruptedException, ExecutionException {
        Set<String> locatedMatchSet = new HashSet<String>();

        int threadCount = Runtime.getRuntime().availableProcessors();
        List<List<String>> partitionList = getChunkList(searchList, threadCount);

        if (partitionList.size() == 1) {
            //if we only have one "chunk" then don't bother with a thread-pool
            locatedMatchSet = new ListSearcher(searchList, targetSet).call();
        } else {
            ExecutorService executor = Executors.newFixedThreadPool(threadCount);
            CompletionService<Set<String>> completionService = new ExecutorCompletionService<Set<String>>(executor);

            for (List<String> chunkList : partitionList)
                completionService.submit(new ListSearcher(chunkList, targetSet));

            for (int x = 0; x < partitionList.size(); x++) {
                Set<String> threadMatchSet = completionService.take().get();
                locatedMatchSet.addAll(threadMatchSet);
            }
            executor.shutdown();
        }
        return locatedMatchSet;
    }

    private static class ListSearcher implements Callable<Set<String>> {

        private final List<String> searchList;
        private final Set<String> targetSet;
        private final Set<String> matchSet = new HashSet<String>();

        public ListSearcher(List<String> searchList, Set<String> targetSet) {
            this.searchList = searchList;
            this.targetSet = targetSet;
        }

        @Override
        public Set<String> call() {
            for (String searchValue : searchList) {
                if (targetSet.contains(searchValue))
                    matchSet.add(searchValue);
            }
            return matchSet;
        }
    }

    private static <T> List<List<T>> getChunkList(List<T> unpartitionedList, int splitCount) {
        int totalProblemSize = unpartitionedList.size();
        int chunkSize = (int) Math.ceil((double) totalProblemSize / splitCount);

        List<List<T>> chunkList = new ArrayList<List<T>>(splitCount);

        int offset = 0;
        int limit = 0;
        for (int x = 0; x < splitCount; x++) {
            limit = offset + chunkSize;
            if (limit > totalProblemSize)
                limit = totalProblemSize;

            List<T> subList = unpartitionedList.subList(offset, limit);
            chunkList.add(subList);
            offset = limit;
        }
        return chunkList;
    }
}
Simply splitting lines among different threads would (in the Oracle JVM, at least) spread the work across all CPUs, if that is what you are looking for.
I like using CyclicBarrier; it makes those threads easier to control.
http://javarevisited.blogspot.cz/2012/07/cyclicbarrier-example-java-5-concurrency-tutorial.html
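As a minimal illustration of splitting the work across CPUs (using Java 8 parallel streams rather than explicit threads; lines and voc are the names from the question):

Set<String> matches = lines.parallelStream()
        .filter(voc::contains)  // reads only, so no locking is needed
        .collect(java.util.stream.Collectors.toSet());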
It's absolutely possible to parallelize this using multiple threads. You could do the following:
Break the list up into different "blocks," one per thread that will do the search.
Have each thread look over its block, checking whether each string is in the set, and if so adding the string to the resulting set.
For example, you might have the following thread routine:
public void scanAndAdd(List<String> allStrings, Set<String> toCheck,
                       Set<String> matches, int start, int end) {
    // matches should be a thread-safe set, e.g. ConcurrentHashMap.newKeySet()
    for (int i = start; i < end; i++) {
        if (toCheck.contains(allStrings.get(i))) {
            matches.add(allStrings.get(i));
        }
    }
}
You could then spawn off as many threads as you needed to run the above method and wait for all of them to finish. The resulting matches would then be stored in matches.
For simplicity, I've had the output set be a concurrent set (the JDK has no ConcurrentSet type as such, but ConcurrentHashMap.newKeySet() gives you one), which automatically eliminates race conditions due to writes. Since you are only doing reads on the list of strings and the set of strings to check against, no synchronization is required when reading from allStrings or performing lookups in toCheck.
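A sketch of the driver code that spawns the threads (the thread count and chunking here are illustrative, and interruption handling is omitted):

int nThreads = Runtime.getRuntime().availableProcessors();
Set<String> matches = ConcurrentHashMap.newKeySet();
int chunk = (allStrings.size() + nThreads - 1) / nThreads; // ceiling division
Thread[] workers = new Thread[nThreads];
for (int t = 0; t < nThreads; t++) {
    final int start = Math.min(t * chunk, allStrings.size());
    final int end = Math.min(start + chunk, allStrings.size());
    workers[t] = new Thread(() -> scanAndAdd(allStrings, toCheck, matches, start, end));
    workers[t].start();
}
for (Thread w : workers)
    w.join(); // wait until every block has been scanned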
Hope this helps!
Another option would be to use Akka, it does these kinds of things quite simply.
Actually, having done some search work with Akka, one thing I can tell you is that it supports two ways of parallelizing such things: through Composable Futures or Agents. For what you want, the Composable Futures would be completely sufficient. Akka is actually not adding that much: Netty provides the massively parallel I/O infrastructure, and Futures are part of the JDK, but Akka does make it super simple to put the two together and extend them when/if needed.
Disclaimer: I have looked through this question and this question, but they both got derailed by small details and general optimization-is-unnecessary concerns. I really need all the performance I can get in my current app, which is receiving-processing-spewing MIDI data in realtime. It also needs to scale up as well as possible.
I am comparing array performance on a high number of reads for small lists to ArrayList and also to just having the variables in hand. I'm finding that an array beats ArrayList by a factor of 2.5 and even beats just having the object references.
What I would like to know is:
1. Is my benchmark okay? I have switched the order of the tests and number of runs with no change. I've also used milliseconds instead of nanoseconds, to no avail.
2. Should I be specifying any Java options to minimize this difference?
3. If this difference is real, shouldn't I prefer Test[] to ArrayList<Test> in this situation and put in the code necessary to convert them? Obviously I'm reading a lot more than writing.
JVM is Java 1.6.0_17 on OSX and it is definitely running in Hotspot mode.
import java.util.ArrayList;

public class ArraysVsLists {

    static int RUNS = 100000;

    public static void main(String[] args) {
        long t1;
        long t2;

        Test test1 = new Test();
        test1.thing = (int) Math.round(100 * Math.random());
        Test test2 = new Test();
        test2.thing = (int) Math.round(100 * Math.random());

        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            test1.changeThing(i);
            test2.changeThing(i);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long NO collection");

        ArrayList<Test> list = new ArrayList<Test>(1);
        list.add(test1);
        list.add(test2);
        // tried this too: helps a tiny tiny bit
        list.trimToSize();

        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            for (Test eachTest : list) {
                eachTest.changeThing(i);
            }
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long collection");

        Test[] array = new Test[2];
        list.toArray(array);

        t1 = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            for (Test test : array) {
                test.changeThing(i);
            }
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) + " How long array ");
    }
}

class Test {
    int thing;
    int thing2;

    public void changeThing(int addThis) {
        thing2 = addThis + thing;
    }
}
Microbenchmarks are very, very hard to get right on a platform like Java. You definitely have to extract the code to be benchmarked into separate methods, run them a few thousand times as warmup, and only then measure. I've done that (code below), and the result is that direct access through references is then three times as fast as access through an array, but the collection is still slower by a factor of 2.
These numbers are based on the JVM options -server -XX:+DoEscapeAnalysis. Without -server, using the collection is drastically slower (but strangely, direct and array access are quite a bit faster, indicating that something weird is going on). -XX:+DoEscapeAnalysis yields another 30% speedup for the collection, but it's very much questionable whether it will work as well for your actual production code.
Overall my conclusion would be: forget about microbenchmarks, they can too easily be misleading. Measure as close to production code as you can without having to rewrite your entire application.
import java.util.ArrayList;

public class ArrayTest {

    static int RUNS_INNER = 1000;
    static int RUNS_WARMUP = 10000;
    static int RUNS_OUTER = 100000;

    public static void main(String[] args) {
        long t1;
        long t2;

        Test test1 = new Test();
        test1.thing = (int) Math.round(100 * Math.random());
        Test test2 = new Test();
        test2.thing = (int) Math.round(100 * Math.random());

        for (int i = 0; i < RUNS_WARMUP; i++) {
            testRefs(test1, test2);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testRefs(test1, test2);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long NO collection");

        ArrayList<Test> list = new ArrayList<Test>(1);
        list.add(test1);
        list.add(test2);
        // tried this too: helps a tiny tiny bit
        list.trimToSize();

        for (int i = 0; i < RUNS_WARMUP; i++) {
            testColl(list);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testColl(list);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long collection");

        Test[] array = new Test[2];
        list.toArray(array);

        for (int i = 0; i < RUNS_WARMUP; i++) {
            testArr(array);
        }
        t1 = System.nanoTime();
        for (int i = 0; i < RUNS_OUTER; i++) {
            testArr(array);
        }
        t2 = System.nanoTime();
        System.out.println((t2 - t1) / 1000000.0 + " How long array ");
    }

    private static void testArr(Test[] array) {
        for (int i = 0; i < RUNS_INNER; i++) {
            for (Test test : array) {
                test.changeThing(i);
            }
        }
    }

    private static void testColl(ArrayList<Test> list) {
        for (int i = 0; i < RUNS_INNER; i++) {
            for (Test eachTest : list) {
                eachTest.changeThing(i);
            }
        }
    }

    private static void testRefs(Test test1, Test test2) {
        for (int i = 0; i < RUNS_INNER; i++) {
            test1.changeThing(i);
            test2.changeThing(i);
        }
    }
}

class Test {
    int thing;
    int thing2;

    public void changeThing(int addThis) {
        thing2 = addThis + thing;
    }
}
Your benchmark is only valid if your actual use case matches the benchmark code, i.e. very few operations on each element, so that execution time is largely determined by access time rather than the operations themselves. If that is the case then yes, you should be using arrays if performance is critical. If however your real use case involves a lot more actual computation per element, then the access time per element will become a lot less significant.
It is probably not valid. If I understand the way JIT compilers work, compiling a method won't affect a call to that method that is already executing. Since the main method is only called once, it ends up being interpreted, and since most of the work is done in the body of that method, the numbers you get won't be particularly indicative of normal execution.
JIT compilation effects may go some way toward explaining why the no-collections case was slower than the arrays case. That result is counter-intuitive, and it casts doubt on the other benchmark result you reported.