Java Stream API: why the distinction between sequential and parallel execution mode?

Java Stream API: why the distinction between sequential and parallel execution mode? - java

From the Stream javadoc:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution.
My assumptions:
There is no functional difference between a sequential/parallel streams. Output is never affected by execution mode.
A parallel stream is always preferable, given appropriate number of cores and problem size to justify the overhead, due to the performance gains.
We want to write code once and run anywhere without having to care about the hardware (this is Java, after all).
Assuming these assumptions are valid (nothing wrong with a bit of meta-assumption), what's the value in having the execution mode exposed in the api?
It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
Sure, assuming parallel streams also work on a single core machine, perhaps just always using a parallel stream achieves this. But this is really ugly - why have explicit references to parallel streams in my code when it's the default option?
Even if there is a scenario where you'd deliberately want to hard code the use of a sequential stream - why is there not just a sub-interface SequentialStream for that purpose, rather than polluting Stream with an execution mode switch?

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
The reality is that a) streams are a library, and have no special JVM magic, and b) you can't really design a library smart enough to automagically figure out what the right decision is in this particular case. There's no sensible way to estimate how costly a particular function will be without running it -- even if you could introspect its implementation, which you can't -- and now you're introducing a benchmark into every stream operation, trying to figure out if parallelizing it will be worth the cost of the parallelism overhead. That's just not practical, especially given that you don't know in advance how bad the parallelism overhead is, either.
A parallel stream is always preferable, given appropriate number of cores and problem size to justify the overhead, due to the performance gains.
Not always, in practice. Some tasks are just so small that they're not worth parallelizing, and parallelism does always have some overhead. (And frankly, most programmers tend to overestimate the usefulness of parallelism, slapping it everywhere when it's really hurting performance.)
Basically, it's a hard enough problem that you basically have to shove it off onto the programmer.

There's an interesting case in this question showing that sometimes parallel stream might be slower in orders of magnitude. In that particular example parallel version runs for ten minutes while sequential takes several seconds.

There is no functional difference between a sequential/parallel
streams. Output is never affected by execution mode.
There is a difference between sequential/parallel streams execution.
In the below code TEST_2 results shows that parallel thread execution is very much faster than the sequential way.
A parallel stream
is always preferable, given appropriate number of cores and problem
size to justify the overhead, due to the performance gains.
Not really. if task is not worthy(simple tasks) to be executed in parallel threads, then it is simply we are adding overhead to our code.
TEST_1 results shows this. Also note that if all the worker threads are busy on one parallel execution tasks; then other parallel stream operation elsewhere in your code will be waiting for that.
We want to
write code once and run anywhere without having to care about the
hardware (this is Java, after all).
Since only programmer knows about; is it worthy to execute this task in parallel/sequential irrespective of CPU's. So java API exposed both option to the developer.
import java.util.ArrayList;
import java.util.List;
/*
* Performance test over internal(parallel/sequential) and external iterations.
* https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
*
*
* Parallel computing involves dividing a problem into subproblems,
* solving those problems simultaneously (in parallel, with each subproblem running in a separate thread),
* and then combining the results of the solutions to the subproblems. Java SE provides the fork/join framework,
* which enables you to more easily implement parallel computing in your applications. However, with this framework,
* you must specify how the problems are subdivided (partitioned).
* With aggregate operations, the Java runtime performs this partitioning and combining of solutions for you.
*
* Limit the parallelism that the ForkJoinPool offers you. You can do it yourself by supplying the -Djava.util.concurrent.ForkJoinPool.common.parallelism=1,
* so that the pool size is limited to one and no gain from parallelization
*
* #see ForkJoinPool
* https://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
*
* ForkJoinPool, that pool creates a fixed number of threads (default: number of cores) and
* will never create more threads (unless the application indicates a need for those by using managedBlock).
* * http://stackoverflow.com/questions/10797568/what-determines-the-number-of-threads-a-java-forkjoinpool-creates
*
*/
public class IterationThroughStream {
private static boolean found = false;
private static List<Integer> smallListOfNumbers = null;
public static void main(String[] args) throws InterruptedException {
// TEST_1
List<String> bigListOfStrings = new ArrayList<String>();
for(Long i = 1l; i <= 1000000l; i++) {
bigListOfStrings.add("Counter no: "+ i);
}
System.out.println("Test Start");
System.out.println("-----------");
long startExternalIteration = System.currentTimeMillis();
externalIteration(bigListOfStrings);
long endExternalIteration = System.currentTimeMillis();
System.out.println("Time taken for externalIteration(bigListOfStrings) is :" + (endExternalIteration - startExternalIteration) + " , and the result found: "+ found);
long startInternalIteration = System.currentTimeMillis();
internalIteration(bigListOfStrings);
long endInternalIteration = System.currentTimeMillis();
System.out.println("Time taken for internalIteration(bigListOfStrings) is :" + (endInternalIteration - startInternalIteration) + " , and the result found: "+ found);
// TEST_2
smallListOfNumbers = new ArrayList<Integer>();
for(int i = 1; i <= 10; i++) {
smallListOfNumbers.add(i);
}
long startExternalIteration1 = System.currentTimeMillis();
externalIterationOnSleep(smallListOfNumbers);
long endExternalIteration1 = System.currentTimeMillis();
System.out.println("Time taken for externalIterationOnSleep(smallListOfNumbers) is :" + (endExternalIteration1 - startExternalIteration1));
long startInternalIteration1 = System.currentTimeMillis();
internalIterationOnSleep(smallListOfNumbers);
long endInternalIteration1 = System.currentTimeMillis();
System.out.println("Time taken for internalIterationOnSleep(smallListOfNumbers) is :" + (endInternalIteration1 - startInternalIteration1));
// TEST_3
Thread t1 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t2 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t3 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t4 = new Thread(IterationThroughStream :: internalIterationOnThread);
t1.start();
t2.start();
t3.start();
t4.start();
Thread.sleep(30000);
}
private static boolean externalIteration(List<String> bigListOfStrings) {
found = false;
for(String s : bigListOfStrings) {
if(s.equals("Counter no: 1000000")) {
found = true;
}
}
return found;
}
private static boolean internalIteration(List<String> bigListOfStrings) {
found = false;
bigListOfStrings.parallelStream().forEach(
(String s) -> {
if(s.equals("Counter no: 1000000")){ //Have a breakpoint to look how many threads are spawned.
found = true;
}
}
);
return found;
}
private static boolean externalIterationOnSleep(List<Integer> smallListOfNumbers) {
found = false;
for(Integer s : smallListOfNumbers) {
try {
Thread.sleep(100);
} catch (Exception e) {
e.printStackTrace();
}
}
return found;
}
private static boolean internalIterationOnSleep(List<Integer> smallListOfNumbers) {
found = false;
smallListOfNumbers.parallelStream().forEach( //Removing parallelStream() will behave as single threaded (sequential access).
(Integer s) -> {
try {
Thread.sleep(100); //Have a breakpoint to look how many threads are spawned.
} catch (Exception e) {
e.printStackTrace();
}
}
);
return found;
}
public static void internalIterationOnThread() {
smallListOfNumbers.parallelStream().forEach(
(Integer s) -> {
try {
/*
* DANGEROUS
* This will tell you that if all the 7 FJP(Fork join pool) worker threads are blocked for one single thread (e.g. t1),
* then other normal three(t2 - t4) thread wont execute, will wait for FJP worker threads.
*/
Thread.sleep(100); //Have a breakpoint here.
} catch (Exception e) {
e.printStackTrace();
}
}
);
}
}

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
To add to the already given answers:
Thats a pretty bold assumption. Imagine simulating a board-game for training some form of AI, it's pretty easy to parallelize the execution of different playthroughs - just create a new instance and let it run on its own thread. As it doesn't share any state with another playthrough you don't even have to consider multi-threading issues in your game logic. If you on the other hand parallelize the game logic itself you get all sorts of multi-threading issues and most likely pay a steep price for complexity and even performance.
Having control over the behaviour of streams gives you (appropriately limited) flexibility which in and of itself is a key feature for good library design.

Related

Parallelize a for loop in java

I have a for loop that is looping over a list of collections. Inside the loop some select/update queries are taking place on collection which are exclusive of the other collections. Since each collection has a lot of data to process on i would like to parallelize it.
The code snippet looks something like this:
//Some variables that are used within the for loop logic
for(String collection : collections) {
//Select queries on collection
//Update queries on collection
}
How can i achieve this in java?

You can use the parallelStream() method (since java 8):
collections.parallelStream().forEach((collection) -> {
//Select queries on collection
//Update queries on collection
});
More informations about streams.
Another way to do it is using Executors :
try
{
final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
for (final String collection : collections)
{
exec.submit(() -> {
// Select queries on collection
// Update queries on collection
});
}
// We want to wait that the jobs are done.
final boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
if (terminated == false)
{
exec.shutdownNow();
}
} catch (final InterruptedException e)
{
e.printStackTrace();
}
This example is more powerfull since you can easily know when the job is done, force termination... and more.

final int numberOfThreads = 32;
final ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
// List to store the 'handles' (Futures) for all tasks:
final List<Future<MyResult>> futures = new ArrayList<>();
// Schedule one (parallel) task per String from "collections":
for(final String str : collections) {
futures.add(executor.submit(() -> { return doSomethingWith(str); }));
}
// Wait until all tasks have completed:
for ( Future<MyResult> f : futures ) {
MyResult aResult = f.get(); // Will block until the result of the task is available.
// Optionally do something with the result...
}
executor.shutdown(); // Release the threads held by the executor.
// At this point all tasks have ended and we can continue as if they were all executed sequentially
Adjust the numberOfThreads as needed to achieve the best throughput. More threads will tend to utilize the local CPU better, but may cause more overhead at the remote end. To get good local CPU utilization, you want to have (much) more threads than CPUs (/cores) so that, whenever one thread has to wait, e.g. for a response from the DB, another thread can be switched in to execute on the CPU.

There are a number of question that you need to ask yourself to find the right answer:
If I have as many threads as the number of my CPU cores, would that be enough?
Using parallelStream() will give you as many threads as your CPU cores.
Will parallelizing the loop give me a performance boost or is there a bottleneck on the DB?
You could spin up 100 threads, processing in parallel, but this doesn't mean that you will do things 100 times faster, if your DB or the network cannot handle the volume. DB locking can also be an issue here.
Do I need to process my data in a specific order?
If you have to process your data in a specific order, this may limit your choices. E.g. forEach() doesn't guarantee that the elements of your collection will be processed in a specific order, but forEachOrdered() does (with a performance cost).
Is my datasource capable of fetching data reactively?
There are cases when our datasource can provide data in the form of a stream. In that case, you can always process this stream using a technology such as RxJava or WebFlux. This would enable you to take a different approach on your problem.
Having said all the above, you can choose the approach you want (executors, RxJava etc.) that fit better to your purpose.

Why threads do not cache object locally?

I have a String and ThreadPoolExecutor that changes the value of this String. Just check out my sample:
String str_example = "";
ThreadPoolExecutor poolExecutor = new ThreadPoolExecutor(10, 30, (long)10, TimeUnit.SECONDS, runnables);
for (int i = 0; i < 80; i++){
poolExecutor.submit(new Runnable() {
#Override
public void run() {
try {
Thread.sleep((long) (Math.random() * 1000));
String temp = str_example + "1";
str_example = temp;
System.out.println(str_example);
} catch (Exception e) {
e.printStackTrace();
}
}
});
}
so after executing this, i get something like that:
1
11
111
1111
11111
.......
So question is: i just expect the result like this if my String object has volatile modifier. But i have the same result with this modifier and without.

There are several reasons why you see "correct" execution.
First, CPU designers do as much as they can so that our programs run correctly even in presence of data races. Cache coherence deals with cache lines and tries to minimize possible conflicts. For example, only one CPU can write to a cache line at some point of time. After write was done other CPUs should request that cache line to be able to write to it. Not to say x86 architecture(most probable which you use) is very strict comparing to others.
Second, your program is slow and threads sleep for some random period of time. So they do almost all the work at different points of time.
How to achieve inconsistent behavior? Try something with for loop without any sleep. In that case field value most probably will be cached in CPU registers and some updates will not be visible.
P.S. Updates of field str_example are not atomic so you program may produce the same string values even in presense of volatile keyword.

When you talk about concepts like thread caching, you're talking about the properties of a hypothetical machine that Java might be implemented on. The logic is something like "Java permits an implementation to cache things, so it requires you to tell it when such things would break your program". That does not mean that any actual machine does anything of the sort. In reality, most machines you are likely to use have completely different kinds of optimizations that don't involve the kind of caches that you're thinking of.
Java requires you to use volatile precisely so that you don't have to worry about what kinds of absurdly complex optimizations the actual machine you're working on might or might not have. And that's a really good thing.

Your code is unlikely to exhibit concurrency bugs because it executes with very low concurrency. You have 10 threads, each of which sleep on average 500 ms before doing a string concatenation. As a rough guess, String concatenation takes about 1ns per character, and because your string is only 80 characters long, this would mean that each thread spends about 80 out of 500000000 ns executing. The chance of two or more threads running at the same time is therefore vanishingly small.
If we change your program so that several threads are running concurrently all the time, we see quite different results:
static String s = "";
public static void main(String[] args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(5);
for (int i = 0; i < 10_000; i ++) {
executor.submit(() -> {
s += "1";
});
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.MINUTES);
System.out.println(s.length());
}
In the absence of data races, this should print 10000. On my computer, this prints about 4200, meaning over half the updates have been lost in the data race.
What if we declare s volatile? Interestingly, we still get about 4200 as a result, so data races were not prevented. That makes sense, because volatile ensures that writes are visible to other threads, but does not prevent intermediary updates, i.e. what happens is something like:
Thread 1 reads s and starts making a new String
Thread 2 reads s and starts making a new String
Thread 1 stores its result in s
Thread 2 stores its result in s, overwriting the previous result
To prevent this, you can use a plain old synchronized block:
executor.submit(() -> {
synchronized (Test.class) {
s += "1";
}
});
And indeed, this returns 10000, as expected.

It is working because you are using Thread.sleep((long) (Math.random() * 100));So every thread has different sleep time and executing may be one by one as all other thread in sleep mode or completed execution.But though your code is working is not thread safe.Even if you use Volatile also will not make your code thread safe.Volatile only make sure visibility i.e when one thread make some changes other threads are able to see it.
In your case your operation is multi step process reading the variable,updating then writing to memory.So you required locking mechanism to make it thread safe.

Weak performance of CyclicBarrier with many threads: Would a tree-like synchronization structure be an alternative?

Our application requires all worker threads to synchronize at a defined point. For this we use a CyclicBarrier, but it does not seem to scale well. With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
EDIT: Synchronization happens very frequently, in the order of 100k to 1M times.
If synchronization of many threads is "hard", would it help building a synchronization tree? Thread 1 waits for 2 and 3, which in turn wait for 4+5 and 6+7, respectively, etc.; after finishing, threads 2 and 3 wait for thread 1, thread 4 and 5 wait for thread 2, etc..
1
| \
2 3
|\ |\
4 5 6 7
Would such a setup reduce synchronization overhead? I'd appreciate any advice.
See also this featured question: What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
Honestly, there's your problem right there. Figure out a performance benchmark and prove that this is the problem, or risk spending hours / days solving the entirely wrong problem.

You are thinking about the problem in a subtly wrong way that tends to lead to very bad coding. You don't want to wait for threads, you want to wait for work to be completed.
Probably the most efficient way is a shared, waitable counter. When you make new work, increment the counter and signal the counter. When you complete work, decrement the counter. If there is no work to do, wait on the counter. If you drop the counter to zero, check if you can make new work.

If I understand correctly, you're trying to break your solution up into parts and solve them separately, but concurrently, right? Then have your current thread wait for those tasks? You want to use something like a fork/join pattern.
List<CustomThread> threads = new ArrayList<CustomThread>();
for (Something something : somethings) {
threads.add(new CustomThread(something));
}
for (CustomThread thread : threads) {
thread.start();
}
for (CustomThread thread : threads) {
thread.join(); // Blocks until thread is complete
}
List<Result> results = new ArrayList<Result>();
for (CustomThread thread : threads) {
results.add(thread.getResult());
}
// do something with results.
In Java 7, there's even further support via a fork/join pool. See ForkJoinPool and its trail, and use Google to find one of many other tutorials.
You can recurse on this concept to get the tree you want, just have the threads you create generate more threads in the exact same way.
Edit: I was under the impression that you wouldn't be creating that many threads, so this is better for your scenario. The example won't be horribly short, but it goes along the same vein as the discussion you're having in the other answer, that you can wait on jobs, not threads.
First, you need a Callable for your sub-jobs that takes an Input and returns a Result:
public class SubJob implements Callable<Result> {
private final Input input;
public MyCallable(Input input) {
this.input = input;
}
public Result call() {
// Actually process input here and return a result
return JobWorker.processInput(input);
}
}
Then to use it, create an ExecutorService with a fix-sized thread pool. This will limit the number of jobs you're running concurrently so you don't accidentally thread-bomb your system. Here's your main job:
public class MainJob extends Thread {
// Adjust the pool to the appropriate number of concurrent
// threads you want running at the same time
private static final ExecutorService pool = Executors.newFixedThreadPool(30);
private final List<Input> inputs;
public MainJob(List<Input> inputs) {
super("MainJob")
this.inputs = new ArrayList<Input>(inputs);
}
public void run() {
CompletionService<Result> compService = new ExecutorCompletionService(pool);
List<Result> results = new ArrayList<Result>();
int submittedJobs = inputs.size();
for (Input input : inputs) {
// Starts the job when a thread is available
compService.submit(new SubJob(input));
}
for (int i = 0; i < submittedJobs; i++) {
// Blocks until a job is completed
results.add(compService.take())
}
// Do something with results
}
}
This will allow you to reuse threads instead of generating a bunch of new ones every time you want to run a job. The completion service will do the blocking while it waits for jobs to complete. Also note that the results list will be in order of completion.
You can also use Executors.newCachedThreadPool, which creates a pool with no upper limit (like using Integer.MAX_VALUE). It will reuse threads if one is available and create a new one if all the threads in the pool are running a job. This may be desirable later if you start encountering deadlocks (because there's so many jobs in the fixed thread pool waiting that sub jobs can't run and complete). This will at least limit the number of threads you're creating/destroying.
Lastly, you'll need to shutdown the ExecutorService manually, perhaps via a shutdown hook, or the threads that it contains will not allow the JVM to terminate.
Hope that helps/makes sense.

If you have a generation task (like the example of processing columns of a matrix) then you may be stuck with a CyclicBarrier. That is to say, if every single piece of work for generation 1 must be done in order to process any work for generation 2, then the best you can do is to wait for that condition to be met.
If there are thousands of tasks in each generation, then it may be better to submit all of those tasks to an ExecutorService (ExecutorService.invokeAll) and simply wait for the results to return before proceeding to the next step. The advantage of doing this is eliminating context switching and wasted time/memory from allocating hundreds of threads when the physical CPU is bounded.
If your tasks are not generational but instead more of a tree-like structure in which only a subset need to be complete before the next step can occur on that subset, then you might want to consider a ForkJoinPool and you don't need Java 7 to do that. You can get a reference implementation for Java 6. This would be found under whatever JSR introduced the ForkJoinPool library code.
I also have another answer which provides a rough implementation in Java 6:
public class Fib implements Callable<Integer> {
int n;
Executor exec;
Fib(final int n, final Executor exec) {
this.n = n;
this.exec = exec;
}
/**
* {#inheritDoc}
*/
#Override
public Integer call() throws Exception {
if (n == 0 || n == 1) {
return n;
}
//Divide the problem
final Fib n1 = new Fib(n - 1, exec);
final Fib n2 = new Fib(n - 2, exec);
//FutureTask only allows run to complete once
final FutureTask<Integer> n2Task = new FutureTask<Integer>(n2);
//Ask the Executor for help
exec.execute(n2Task);
//Do half the work ourselves
final int partialResult = n1.call();
//Do the other half of the work if the Executor hasn't
n2Task.run();
//Return the combined result
return partialResult + n2Task.get();
}
}
Keep in mind that if you have divided the tasks up too much and the unit of work being done by each thread is too small, there will negative performance impacts. For example, the above code is a terribly slow way to solve Fibonacci.

Why adding cores slows down my java program after around 10 cores?

My program uses fork/join as shown below to run thousands of tasks:
private static class Generator extends RecursiveTask<Long> {
final MyHelper mol;
final static SatChecker satCheck = new SatChecker();
public Generator(final MyHelper mol) {
super();
this.mol = mol;
}
#Override
protected Long compute() {
long count = 0;
try {
if (mol.isComplete(satCheck)) {
count = 1;
}
ArrayList<MyHelper> molList = mol.extend();
List<Generator> tasks = new ArrayList<>();
for (final MyHelper child : molList) {
tasks.add(new Generator(child));
}
for(final Generator task : invokeAll(tasks)) {
count += task.join();
}
} catch (Exception e){
e.printStackTrace();
}
return count;
}
}
My program makes heavy use of a third party library for isComplete and extend methods. The extend method also uses a native library. As far as the MyHelper class is concerned, there is no shared variable or synchronization between the tasks.
I use the taskset command from linux to restrict the number of cores used by my application. I get the best speed by using around 10 cores (say around 60 seconds). It means that using more than 10 cores results in slowing down the application, such that 16 cores finishes in the same time as 6 cores (around 90 seconds).
I am more confused because the selected cores are 100% busy (except for garbage collection every now and then).
Does anyone know what could cause such a slow down? And where should I look to solve this problem?
PS: I have made also implementations in Scala/akka and using ThreadPoolExecutor, but with similar results (although slower than fork/join)
PPS: My guess is that down deep in MyHelper or SatCheck, someone crosses the memory barrier (poisoning the cache). But how can I find that and fix or go about it?

There might be an overload due to assigning threads/tasks to the different cores. Also, are you sure that your program is entirely parallelizable? Indeed, some program cannot always use 100% efficiently all the cpus available and the time taken to dispatch the tasks might slow down the program more than it helps it.

I think that you should use a threshold on the size of your molList (or mol) variable to avoid forking on too small sets of data.
I'd been playing a bit with fork/join just to understand framework and my first examples did not take the threshold into consideration. Obviously i got awful perfomances. Fixing a proper limit on size of problem did the trick.
Finding the right value for the threshold requires you spend a bit of time trying different values and see how performances changes.
So, put an if at the very beginning of the compute method like this:
#Override
protected Long compute() {
if (mol.getSize() < THRESHOLD) //getSize or whatever gives you size of problem
return noForkJoinCompute(mol); //noForkJoinCompute gives you count without FJ
long count = 0;
try {
if (mol.isComplete(satCheck)) {
count = 1;
}
...

Dynamically spawn threads Java for a case like this

Suppose I have a List of integers. Each int I have must be multiplied by 100. To do this with a for loop I'd construct something like the following:
for(Integer i : numbers){
i = i*100;
}
But suppose for performance reasons I wanted to simultaneously spawn a thread for each number in numbers and perform a single multiplication on each thread returning the result to the same List. What would be the best way of doing such a thing?
My actual problem isn't as trivial as multiplication of ints but rather a task that each iteration of the loop takes a substantial amount of time, and so I'd like to do them all at the same time in order to decrease execution time.

If you can use Java 7, the Fork/Join framework is created for precisely this problem. If not, there is a JSR166 (the fork/join proposal) source code at this link.
Essentially, you would create a task for each step (in your case, for each index in the array) and submit it to a service that can pool threads (the fork part). Then you wait for everything to complete and merge the results (the join part).
The reason to use a service as opposed to launching your own threads, is there can be an overhead in creating threads, and in some cases, you may want to limit the number of threads. For example, if you're on a four CPU machine, it wouldn't make much sense to have more than four threads running concurrently.

If your tasks are independent of each other, you can use Executors framework.
Note that you would gain more speed if you create no more threads than you have CPU cores at your disposal.
Sample:
class WorkInstance {
final int argument;
final int result;
WorkInstance(int argument, int result) {
this.argument = argument;
this.result = result;
}
public String toString() {
return "WorkInstance{" +
"argument=" + argument +
", result=" + result +
'}';
}
}
public class Main {
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
int numOfCores = 4;
final ExecutorService executor = Executors.newFixedThreadPool(numOfCores);
List<Integer> toMultiplyBy100 = Arrays.asList(1, 3, 19);
List<Future<WorkInstance>> tasks = new ArrayList<Future<WorkInstance>>(toMultiplyBy100.size());
for (final Integer workInstance : toMultiplyBy100)
tasks.add(executor.submit(new Callable<WorkInstance>() {
public WorkInstance call() throws Exception {
return new WorkInstance(workInstance, workInstance * 100);
}
}));
for (Future<WorkInstance> result : tasks)
System.out.println("Result: " + result.get());
executor.shutdown();
}
}

Spawning a new thread for
each number in numbers
is not a good idea. However, using a fixed thread pool of size matching the number of cores/CPUs might increase your performace slightly.

The quick and dirty way to get started is to use a thread pool, such as one returned by Executors.newCachedThreadPool(). Then create tasks that implement Runnable and submit() them to your thread pool. Also read up on the classes and interfaces linked by those Javadocs, lots of cool stuff you can try.
See the concurrency chapter in Effective Java, 2nd ed for a great introduction to multithreaded Java.

Take a look at ThreadPoolExecutor and create a task for each iteration. A prerequisite is that those tasks are independent though.
The use of a thread pool allows you to create a task per iteration but only run as many concurrently as there a threads, since you'd want to reduce the number of thread, for example to the number of cores or hardware threads available. Creating a whole lot of threads would be counter productive since they'd require a lot of context switching which hurts performance.

I assume you are on a commodity PC. You will at most have N threads executing at the same time on your machine, where N is the # of cores of your CPUs, so most likely in the [1, 4] range. Plus the contention on the shared list.
But even more importantly, the cost of spawning a new thread is much greater than the cost of doing a multiplication. One could have a thread pool... but in this specific case, it's not even worth talking about it. Really.

If it is the only application on a node you should determine which number of threads will finish the job most quickly (max_throughput). This depends on the processor you use how much JIT can optimize your code, so there is no general advise but measure.
After that you could distribute the jobs to a pool of worker threads by numbers modulo max_throughput

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.