Alternate approach to processing a huge array using multithreading - java

I am new to multithreading in Java. I have implemented a multithreaded program in Java to process an array and would like help and suggestions to optimise and refactor it if possible.
Scenario
We get a huge CSV file with thousands of rows that we need to process.
So I convert the rows to an array, split it, and pass each subset of the array to the processing code as input.
Right now I am splitting the array into 20 equal subsets and passing them to 20 threads for execution. It takes ~2 minutes, which is fine. Without multithreading it takes 30 minutes.
Help needed
I am giving a snapshot of my code below.
Although it works fine, I am wondering whether there is a way to standardize and refactor it. Right now it looks clumsy.
To be more specific, it would be great if I could parameterize it instead of creating individual thread runners.
Code
private static void ProcessRecords(List<String[]> inputCSVData)
{
// Do some operation
}
**In the main program**
public static void main(String[] args)throws ClassNotFoundException, SQLException, IOException, InterruptedException
{
int size = csvData.size();
// Split the array
int firstArraySize = (size / 20);
int secondArrayEndIndex = (firstArraySize * 2) - 1;
csvData1 = csvData.subList(1, firstArraySize);
csvData2 = csvData.subList(firstArraySize, secondArrayEndIndex);
// .... and so on
Thread thread1 = new Thread(new Runnable() {
public void run() {
try {
ProcessRecords(csvData1);
} catch (ClassNotFoundException | SQLException | IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
});
Thread thread2 = new Thread(new Runnable() {
public void run()
{
try {
ProcessRecords(csvData2);
} catch (ClassNotFoundException | SQLException | IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
});
**and so on for 20 times**
thread1.start();
thread2.start();
//... For all remaining threads
// thread20.start();
thread1.join();
thread2.join();
//... For all remaining threads
// thread20.join();
}

Since Java 7, you can implement such a mechanism efficiently out of the box thanks to the Fork/Join framework. Starting with Java 8, you can do it directly with the Stream API, more precisely with a parallel stream, which uses a ForkJoinPool behind the scenes and leverages its work-stealing algorithm for good performance.
In your case, you could process it line by line as follows:
csvData.parallelStream().forEach(MyClass::ProcessRecord);
with a method ProcessRecord in the class MyClass of the form:
private static void ProcessRecord(String[] inputCSVData){
// Do some operation
}
By default a parallel stream uses the common ForkJoinPool, sized according to Runtime.getRuntime().availableProcessors(), which is enough for tasks with very little I/O. If your tasks do I/O and you would like a larger pool, submit the initial task to your own ForkJoinPool; the parallel stream will then use your pool instead of the common pool.
ForkJoinPool forkJoinPool = new ForkJoinPool(20);
forkJoinPool.submit(() -> csvData.parallelStream().forEach(MyClass::ProcessRecord)).get();
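As a self-contained illustration of the custom-pool variant above, here is a minimal sketch. The class name ParallelCsvSketch, the counter, and the sample rows are hypothetical stand-ins, not part of the original code. Note that a parallel stream running inside the pool it was submitted to is long-standing observed behaviour rather than something the Stream documentation promises, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelCsvSketch {
    // Hypothetical counter so the sketch has an observable result.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for the real per-row work.
    static void processRecord(String[] row) {
        processed.incrementAndGet();
    }

    // Runs the parallel stream from inside a dedicated pool and
    // returns the number of rows that were processed.
    static int processAll(List<String[]> csvData, int parallelism)
            throws InterruptedException, ExecutionException {
        processed.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // get() blocks until every row has been handled.
            pool.submit(() -> csvData.parallelStream()
                    .forEach(ParallelCsvSketch::processRecord)).get();
        } finally {
            pool.shutdown(); // release the worker threads
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        List<String[]> csvData = Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b"},
                new String[] {"3", "c"});
        System.out.println("Rows processed: " + processAll(csvData, 4));
    }
}
```

Shutting the pool down in a finally block keeps its worker threads from lingering after the work is finished.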

You have done a lot of redundant work here. Instead of hard-coding 20 threads, you can use an ExecutorService backed by a fixed thread pool and submit tasks to it.
Also, how was the value of 20 for the number of threads decided? Use
Runtime.getRuntime().availableProcessors();
to determine the core count at runtime.
public static void main(String[] args) throws ClassNotFoundException, SQLException, IOException, InterruptedException {
    int size = csvData.size();
    int threadCount = Runtime.getRuntime().availableProcessors();
    ExecutorService executorService = Executors.newFixedThreadPool(threadCount);
    int index = 0;
    // Round up so every row is covered and the chunk size is never zero
    int chunkSize = Math.max(1, (size + threadCount - 1) / threadCount);
    while (index < size) {
        final int start = index;
        // subList takes an exclusive end index, so clamp it to the list size
        final int end = Math.min(start + chunkSize, size);
        executorService.submit(new Runnable() {
            @Override
            public void run() {
                try {
                    ProcessRecords(csvData.subList(start, end));
                } catch (ClassNotFoundException | SQLException | IOException e) {
                    e.printStackTrace();
                }
            }
        });
        index += chunkSize;
    }
    executorService.shutdown();
    while (!executorService.isTerminated()) {
        Thread.sleep(1000); // main already declares InterruptedException
    }
}

Related

How to multithread with threads generated in a loop?

So I'm writing code that will parse multiple text files in a folder, gather information on them, and deposit that information in two static List instance variables. The order in which the information is deposited does not matter since I will end up sorting it anyway. But for some reason, increasing the number of threads does not affect the speed. Here's my run method and the portion of my main method that uses multithreading.
public void run() {
parseFiles();
}
public static void main(String[] args) {
while (filesLeft != 0) {
Thread t = new Thread(new fileParser());
t.start();
try {
t.join();
}
catch (InterruptedException e) {
System.out.println("error.");
}
}
If extra information is required, I basically have a static instance variable as an array of the files I need to go through, as well as a constant being the number of threads (which is manually changed for testing purposes). If I were to have, say, 4 threads and 8 files, each call to parseFiles goes through the next 2 files of the array, the indices being monitored by a static instance variable. If I had, say, 4 threads and 9 files, the first thread parses 3 files, the following parse 2, with a statement something along the lines of filesToParse = Math.ceil(filesLeft / threadsLeft), the latter two variables within the ceiling function being static as well.
Is there an error in my code, or should I simply be testing larger text files with more words to see a speedup from added threads? (Currently I have 5 text files, each with 20+ paragraphs, and I get around 60-70 ms.)
I wrote a little piece of code that might be useful:
public static void main(String[] args) {
long startTime = System.nanoTime();
final List<Runnable> tasks = generateTasks(NUM_TASKS);
List<Thread> threadPool = new LinkedList<>();
for(int i = 0; i < NUM_THREADS; i++) {
Thread thread = new Thread(() -> {
Runnable task = null;
while ((task = getTask(tasks)) != null) {
task.run();
}
});
threadPool.add(thread);
thread.start();
}
for(Thread thread: threadPool) {
try {
thread.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
long runTimeMs = (System.nanoTime() - startTime) / 1000000;
System.out.println(String.format("Ran %d tasks with %d threads in %d ms", NUM_TASKS, NUM_THREADS, runTimeMs));
}
private static Runnable getTask(List<Runnable> tasks) {
synchronized (tasks) {
return tasks.isEmpty() ? null : tasks.remove(0);
}
}
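One thing visible in the question's loop is that t.join() is called right after t.start(), so main waits for each thread to finish before creating the next one, and the work runs sequentially whatever the thread count. A minimal sketch of deferring the joins until every thread has been started follows; the class name and the counter standing in for parseFiles() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class StartThenJoinSketch {
    // Starts threadCount threads, joins them afterwards, and returns
    // how many of them ran their work.
    static int runAll(int threadCount) throws InterruptedException {
        AtomicInteger finished = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            // Hypothetical stand-in for the parseFiles() call.
            Thread t = new Thread(() -> finished.incrementAndGet());
            threads.add(t);
            t.start(); // no join here, otherwise the threads run one after another
        }
        for (Thread t : threads) {
            t.join(); // wait only after every thread has been started
        }
        return finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Threads finished: " + runAll(4));
    }
}
```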

Java parallel tasks, only executing once

This code I have is not executing tasks in parallel;
it only executes the code once in this case (whatever is in the for loop), but it should be twice:
public class mqDirect {
public static void main(String args[]) throws Exception {
int parallelism = 2;
ExecutorService executorService =
Executors.newFixedThreadPool(parallelism);
Semaphore semaphore = new Semaphore(parallelism);
for (int i = 0; i < 1; i++) {
try {
semaphore.acquire();
// snip ... do stuff..
semaphore.release();
} catch (Throwable throwable) {
semaphore.release();
}
executorService.shutdownNow();
}
}
}
In Java, the basic way to make code run in parallel is to create a Thread with a new Runnable as a constructor parameter. You then need to start it.
There are many tutorials to help you get this to happen properly.
As your code stands you are merely creating an ExecutorService (and not using it), creating a Semaphore (which should be done in the thread but isn't), performing some process and then shutting down the Executor.
BTW: shutdownNow() is probably not what you want; you should just use shutdown().
OK, So I found this good tutorial
http://programmingexamples.wikidot.com/threadpoolexecutor
And I have done something like
public class mqDirect {
int poolSize = 2;
int maxPoolSize = 2;
long keepAliveTime = 10;
ThreadPoolExecutor threadPool = null;
final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(
5);
public mqDirect()
{
threadPool = new ThreadPoolExecutor(poolSize, maxPoolSize,
keepAliveTime, TimeUnit.SECONDS, queue);
}
public void runTask(Runnable task)
{
threadPool.execute(task);
System.out.println("Task count.." + queue.size());
}
public void shutDown()
{
threadPool.shutdown();
}
public static void main (String args[]) throws Exception
{
mqDirect mtpe = new mqDirect();
// start first one
mtpe.runTask(new Runnable()
{
public void run()
{
for (int i = 0; i < 2; i++)
{
try
{
System.out.println("First Task");
runMqTests();
Thread.sleep(1000);
} catch (InterruptedException ie)
{
}
}
}
});
// start second one
/*
* try{ Thread.sleep(500); }catch(InterruptedException
* ie){}
*/
mtpe.runTask(new Runnable()
{
public void run()
{
for (int i = 0; i < 2; i++)
{
try
{
System.out.println("Second Task");
runMqTests();
Thread.sleep(1000);
} catch (InterruptedException ie)
{
}
}
}
});
mtpe.shutDown();
// runMqTests();
}
And it works!
But the problem is this duplicated code: runMqTests() is the same task in both cases. Is there a way to run it in parallel without duplicating the code?
The example I based this on assumes each task is different.
This code I have is not executing tasks in parallel, it only executes the code in this case once (whatever is in the for loop, but it should be 2) :
Just because you instantiate an ExecutorService instance doesn't mean that things magically run in parallel. You actually need to use that object aside from just shutting it down.
If you want the stuff in the loop to run in the threads in the service then you need to do something like:
int parallelism = 2;
ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
for (int i = 0; i < parallelism; i++) {
executorService.submit(() -> {
// the code you want to be run by the threads in the executor-service
// ...
});
}
// once you have submitted all of the jobs, you can shut it down
executorService.shutdown();
// you might want to call executorService.awaitTermination(...) here
It is important to note that this will run your code in the service but there are no guarantees that it will be run "in parallel". This depends on your number of processors and the race conditions inherent with threads. For example, the first task might start up, run, and finish its code before the 2nd one starts. That's the nature of threaded programs which are by design asynchronous.
If, however, you have at least 2 cores, and the code that you submit to be run by the executor-service takes a long time to run then most likely they will be running at the same time at some point.
Lastly, as @OldCurmudgeon points out, you should call shutdown() on the service, which allows jobs already submitted to the service to run, as opposed to shutdownNow(), which cancels any queued jobs and also calls thread.interrupt() on any running jobs.
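To tie this back to the duplicated-code concern, here is a minimal sketch that defines the task once, submits it as many times as needed, and waits for completion with awaitTermination(). The class name, the counter inside the stand-in runMqTests(), and the one-minute timeout are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SameTaskSketch {
    static final AtomicInteger runs = new AtomicInteger();

    // Hypothetical stand-in for the real runMqTests().
    static void runMqTests() {
        runs.incrementAndGet();
    }

    // Submits the same task `parallelism` times and returns how often it ran.
    static int runInParallel(int parallelism) throws InterruptedException {
        runs.set(0);
        ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
        Runnable task = SameTaskSketch::runMqTests; // defined once, reused below
        for (int i = 0; i < parallelism; i++) {
            executorService.submit(task);
        }
        executorService.shutdown(); // already submitted jobs still run
        // The timeout is an arbitrary choice for this sketch.
        if (!executorService.awaitTermination(1, TimeUnit.MINUTES)) {
            executorService.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Task ran " + runInParallel(2) + " times");
    }
}
```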
Hope this helps.

How can multithreading help increase performance in this situation?

I have a piece of code like this:
while(){
x = jdbc_readOperation();
y = getTokens(x);
jdbc_insertOperation(y);
}
public List<String> getTokens(String divText){
List<String> tokenList = new ArrayList<String>();
Matcher subMatcher = Pattern.compile("\\[[^\\]]*]").matcher(divText);
while (subMatcher.find()) {
String token = subMatcher.group();
tokenList.add(token);
}
return tokenList;
}
What I know is that multithreading can save time when one thread is blocked by I/O or the network. In these synchronous operations, every step has to wait for the previous step to finish. What I want here is to maximize CPU utilization in getTokens().
My first thought was to put getTokens() in the run method of a class and create multiple threads. But I suspect that will not work, since it seems there is no performance benefit from running multiple threads on pure computation.
Would adopting multithreading help increase performance in this case? If so, how can I do that?
It will depend on the pace at which jdbc_readOperation() produces data compared with the pace at which getTokens(x) processes it. Knowing that will help you figure out whether multithreading is going to help you.
You could try something like this (just for you to get the idea):
int workToBeDoneQueueSize = 1000;
int workDoneQueueSize = 1000;
BlockingQueue<String> workToBeDone = new LinkedBlockingQueue<>(workToBeDoneQueueSize);
BlockingQueue<List<String>> workDone = new LinkedBlockingQueue<>(workDoneQueueSize); // getTokens returns List<String>
new Thread(() -> {
try {
while (true) {
workToBeDone.put(jdbc_readOperation());
}
} catch (InterruptedException e) {
e.printStackTrace();
// handle InterruptedException here
}
}).start();
int numOfWorkerThreads = 5; // just an example
for (int i = 0; i < numOfWorkerThreads; i++) {
new Thread(() -> {
try {
while (true) {
workDone.put(getTokens(workToBeDone.take()));
}
} catch (InterruptedException e) {
e.printStackTrace();
// handle InterruptedException here
}
}).start();
}
new Thread(() -> {
// you could improve this by making a batch operation
try {
while (true) {
jdbc_insertOperation(workDone.take());
}
} catch (InterruptedException e) {
e.printStackTrace();
// handle InterruptedException here
}
}).start();
Or you could learn how to use the ThreadPoolExecutor. (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html)
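As a rough illustration of the executor route, the sketch below tokenizes a batch of rows on a fixed thread pool with invokeAll(). It assumes the rows have already been read into a list; the class name, the tokenizeAll helper, and the sample input are hypothetical, while getTokens() follows the question's regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeBatchSketch {
    // Pattern instances are thread-safe, so one compiled copy can be shared.
    private static final Pattern TOKEN = Pattern.compile("\\[[^\\]]*]");

    static List<String> getTokens(String divText) {
        List<String> tokenList = new ArrayList<>();
        Matcher subMatcher = TOKEN.matcher(divText); // a Matcher is per-call, not shared
        while (subMatcher.find()) {
            tokenList.add(subMatcher.group());
        }
        return tokenList;
    }

    // Tokenizes every row on a fixed pool; results keep the order of the input rows.
    static List<List<String>> tokenizeAll(List<String> rows, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String row : rows) {
                tasks.add(() -> getTokens(row));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tokenizeAll(Arrays.asList("a [x] b [y]", "no tokens"), 2));
    }
}
```

Whether this beats the single-threaded loop still depends on how expensive getTokens() is relative to the JDBC calls, as noted above.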
To speed up getTokens(), you can split the input String divText using the String.substring() method. Split it into as many substrings as there are threads running getTokens(), choosing split points that do not fall inside a bracketed token; each thread then works on its own substring of divText.
Avoid creating more threads than the CPU can handle, since context switches add overhead.
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#substring-int-int-
An alternative is to split the input String of getTokens with the String.split method (http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split%28java.lang.String%29), e.g. if the text is made up of words separated by spaces or other symbols. Specific parts of the resulting String array can then be passed to different threads.

How to maintain a list of threads?

I have hundreds of files to process. I do each file one at a time and it takes 30 minutes.
I'm thinking I can do this processing in 10 simultaneous threads, 10 files at a time, and I might be able to do it in 3 minutes instead of 30.
My question is, what is the "correct" way to manage my 10 threads, so that when one is done a new one is created, up to a maximum of 10?
This is what I have so far ... is this the "correct" way to do it?
public class ThreadTest1 {
public static int idCounter = 0;
public class MyThread extends Thread {
private int id;
public MyThread() {
this.id = idCounter++;
}
public void run() {
// this run method represents the long-running file processing
System.out.println("I'm thread '"+this.id+"' and I'm going to sleep for 5 seconds!");
try {
Thread.sleep(5000);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("I'm thread '"+this.id+"' and I'm done sleeping!");
}
}
public void go() {
int MAX_NUM_THREADS = 10;
List<MyThread> threads = new ArrayList<MyThread>();
// this for loop represents the 200 files that need to be processed
for (int i=0; i<200; i++) {
// if we've reached the max num of threads ...
while (threads.size() == MAX_NUM_THREADS) {
// loop through the threads until we find a dead one and remove it
for (MyThread t : threads) {
if (!t.isAlive()) {
threads.remove(t);
break;
}
}
}
// add new thread
MyThread t = new MyThread();
threads.add(t);
t.start();
}
}
public static void main(String[] args) {
new ThreadTest1().go();
}
}
You can use an ExecutorService to manage your threads.
You can also add a while loop to the thread's run method to execute file-processing tasks repeatedly.
It is also worth reading about BlockingQueue. I think it fits well for distributing new files (tasks) among threads.
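A minimal sketch of the ExecutorService suggestion: a fixed pool of 10 threads works through 200 queued tasks, so there is no need to track live threads by hand. The class name, the processFile stand-in, and the one-hour timeout are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FilePoolSketch {
    static final AtomicInteger done = new AtomicInteger();

    // Hypothetical stand-in for the long-running processing of one file.
    static void processFile(int fileId) {
        done.incrementAndGet();
    }

    // Queues fileCount tasks on a pool of at most maxThreads threads and
    // returns how many files were processed.
    static int processAll(int fileCount, int maxThreads) throws InterruptedException {
        done.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (int i = 0; i < fileCount; i++) {
            final int fileId = i;
            // Tasks wait in the pool's queue until one of the threads is free.
            pool.submit(() -> processFile(fileId));
        }
        pool.shutdown(); // no new tasks; queued ones still run
        // The timeout is an arbitrary choice for this sketch.
        pool.awaitTermination(1, TimeUnit.HOURS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Files processed: " + processAll(200, 10));
    }
}
```

The pool reuses its 10 threads across all tasks, which replaces the polling loop over isAlive() in the question.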
I would suggest using Camel's File component if you are open to it. The component will handle all the issues with concurrency to ensure that multiple threads don't try to process the same file. The biggest challenge with making your code multi-threaded is making sure the threads don't interact. Let a framework take care of this for you.
Example:
from("file://incoming?maxMessagesPerPoll=1&idempotent=true&moveFailed=failed&move=processed&readLock=none")
.threads(10).process()

How to start two processes at the same time and then wait until both have completed?

I want to start two processes at the same time and make sure both complete before proceeding to other steps. Can you help? I already tried Thread, but I can't get it to start two at the same time and wait until they are done.
final CyclicBarrier gate = new CyclicBarrier(3);
Thread r2 = new Thread()
{
public void run()
{
try
{
int i = 0;
while (i < 3)
{
System.out.println("Goodbye, " + "cruel world!");
Thread.sleep(2000L);
i++;
gate.await();
}
}
catch (InterruptedException | BrokenBarrierException iex)
{
}
}
};
Thread r3 = new Thread()
{
public void run()
{
try
{
int i = 0;
while (i < 3)
{
System.out.println("Goodbye, " + "cruel world!");
Thread.sleep(2000L);
i++;
gate.await();
}
}
catch (InterruptedException | BrokenBarrierException iex)
{
}
}
};
r2.start();
r3.start();
gate.await();
System.out.println("Donew");
You can use Thread.join() to wait until your subprocesses/threads have finished.
You should not need CyclicBarrier.
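A minimal sketch of the join() approach, assuming two short tasks; the class name and the log messages are hypothetical. Both threads are started first, then main joins each one, so the last entry is recorded only after both have finished.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class JoinTwoSketch {
    // Starts two threads, waits for both, and returns what was logged.
    static List<String> runBoth() throws InterruptedException {
        // Synchronized wrapper because both threads append to the same list.
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        Thread first = new Thread(() -> log.add("first finished"));
        Thread second = new Thread(() -> log.add("second finished"));
        first.start();
        second.start();
        first.join();  // blocks until the first thread has ended
        second.join(); // blocks until the second thread has ended
        log.add("both finished"); // reached only after both joins return
        return log;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth());
    }
}
```

The order of the first two entries can vary between runs; only the position of the final entry is guaranteed.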
Your problem is that you are repeatedly waiting for three parties, but only two threads are calling await() repeatedly. I would expect your code to immediately print, "Goodbye, cruel world!" twice, and "Done", then hang, because the loops are waiting for a third thread to invoke await() again, but the main thread has now terminated.
One solution is for your main thread to loop, invoking await() the same number of times that your task does. But that would be kind of ugly.
I'd suggest using the invokeAll() method of an ExecutorService. This will submit your tasks to the service at (approximately) the same time, then block until all tasks complete. If you want to try to improve the simultaneity of the task commencing, you could add a CyclicBarrier, but it looks like you are more concerned with when the tasks end, and invokeAll() will take care of that for you.
final class Sample
implements Callable<Void>
{
private static final int ITERATIONS = 3;
private static final long AVG_TIME_MS = 2000;
public static void main(String[] args)
throws InterruptedException
{
List<Sample> tasks = Arrays.asList(new Sample(), new Sample());
ExecutorService workers = Executors.newFixedThreadPool(tasks.size());
for (int i = 1; i <= ITERATIONS; ++i) {
/* invokeAll() blocks until all tasks complete. */
List<Future<Void>> results = workers.invokeAll(tasks);
for (Future<?> result : results) {
try {
result.get();
}
catch (ExecutionException ex) {
ex.getCause().printStackTrace();
return;
}
}
System.out.printf("Completed iteration %d.%n", i);
}
workers.shutdown();
System.out.println("Done");
}
@Override
public Void call()
throws InterruptedException
{
/* The average wait time will be AVG_TIME_MS milliseconds. */
ThreadLocalRandom random = ThreadLocalRandom.current();
long wait = (long) (-AVG_TIME_MS * Math.log(1 - random.nextDouble()));
System.out.printf("Goodbye, cruel world! (Waiting %d ms)%n", wait);
Thread.sleep(wait);
return null;
}
}
Notice how I spiced things up with a random wait time. Yet, invokeAll() waits until all of the tasks in that iteration complete.
This is impossible on a single-processor machine.
Even though you will find lots of answers about threads, they will not start two processes at exactly the same time.
If you accept relative simultaneity, it is easy.
