So I'm writing code that will parse through multiple text files in a folder, gather information on them, and deposit that information in two static List instance variables. The order in which the information is deposited does not matter, since I will end up sorting it anyway. But for some reason, increasing the number of threads does not impact the speed. Here's my run method and the portion of my main method that uses multithreading.
public void run() {
    parseFiles();
}

public static void main(String[] args) {
    while (filesLeft != 0) {
        Thread t = new Thread(new fileParser());
        t.start();
        try {
            t.join();
        }
        catch (InterruptedException e) {
            System.out.println("error.");
        }
    }
If extra information is required: I basically have a static instance variable that is an array of the files I need to go through, as well as a constant for the number of threads (which is manually changed for testing purposes). If I were to have, say, 4 threads and 8 files, each call to parseFiles goes through the next 2 files of the array, with the indices being tracked by a static instance variable. If I had, say, 4 threads and 9 files, the first thread parses 3 files and the following ones parse 2, using a statement along the lines of filesToParse = Math.ceil(filesLeft / threadsLeft), the latter two variables within the ceiling function being static as well.
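(A small aside, and only a guess since the declarations aren't shown: if filesLeft and threadsLeft are both ints, the division inside Math.ceil is integer division and the ceiling has no effect, so the first thread would get 2 files rather than 3 in the 9-file example. An all-integer ceiling division avoids that:)

// Hypothetical names matching the description above; assumes both variables are ints.
int filesToParse = (filesLeft + threadsLeft - 1) / threadsLeft; // ceiling of filesLeft / threadsLeft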
Is there any error in my code, or should I simply be testing larger text files with more words to see a decrease in runtime with added threads? (Currently I have 5 text files, each with 20+ paragraphs, and I get around 60-70 ms.)
Wrote a little piece of code that might be useful. The worker threads pull tasks from a shared list and are joined only after all of them have been started, so they can actually run concurrently:
public static void main(String[] args) {
    long startTime = System.nanoTime();
    final List<Runnable> tasks = generateTasks(NUM_TASKS);
    List<Thread> threadPool = new LinkedList<>();
    for (int i = 0; i < NUM_THREADS; i++) {
        Thread thread = new Thread(() -> {
            Runnable task = null;
            while ((task = getTask(tasks)) != null) {
                task.run();
            }
        });
        threadPool.add(thread);
        thread.start();
    }
    for (Thread thread : threadPool) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    long runTimeMs = (System.nanoTime() - startTime) / 1000000;
    System.out.println(String.format("Ran %d tasks with %d threads in %d ms", NUM_TASKS, NUM_THREADS, runTimeMs));
}

private static Runnable getTask(List<Runnable> tasks) {
    synchronized (tasks) {
        return tasks.isEmpty() ? null : tasks.remove(0);
    }
}
I am trying to learn multithreading and parallel execution in Java. I wrote example code like this:
public class MemoryManagement1 {

    public static int counter1 = 0;
    public static int counter2 = 0;
    public static final Object lock1 = new Object();
    public static final Object lock2 = new Object();

    public static void increment1() {
        synchronized (lock1) {
            counter1++;
        }
    }

    public static void increment2() {
        synchronized (lock2) {
            counter2++;
        }
    }

    public static void processes() {
        Thread thread1 = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < 4; i++) {
                    increment1();
                }
            }
        });
        Thread thread2 = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < 4; i++) {
                    increment2();
                }
            }
        });
        thread1.start();
        thread2.start();
        try {
            thread1.join();
            thread2.join();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Counter value is :" + counter1);
        System.out.println("Counter value is :" + counter2);
    }

    public static void main(String[] args) {
        processes();
    }
}
The code is running properly, but how can I know whether the code is running according to time-slicing or whether it is running with parallel execution? I have a CPU with 4 cores. As I understand it, the program should run with parallel execution, but I am not sure.
The code is running properly, but how can I know that the code is running according to time-slicing or whether it is running with parallel execution.
A complete answer to this question would have to cover several factors, but I will be concise and focus mainly on the two points that are (IMO) most relevant here. For simplicity, let us assume that whenever possible each thread (created by the application) will be assigned to a different core.
First, it depends on the number of cores of the hardware that the application is being executed on, and how many threads (created by the application) are running simultaneously. For instance, if the hardware only has a single core or if the application creates more threads than the number of cores available, then some of those threads will inevitably not be executing truly in parallel (i.e., will be mapped to the same core).
Second, it depends on whether the threads doing the work synchronize with each other or not. In your code, two threads are created, each synchronizing on a different object, and since your machine has 4 cores, in theory each thread runs in parallel with the other.
It gets more complex than that, because you can have parts of your code that are executed in parallel and other parts that are executed sequentially by the threads involved. For instance, if the increment1 and increment2 methods were synchronizing on the same object, those methods would not be executed in parallel.
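As a rough illustration (a sketch of mine, not part of the original code): you can have each thread report when it starts and finishes its work; on a multi-core machine, overlapping intervals suggest the threads really did run at the same time rather than being purely time-sliced.

import java.util.ArrayList;
import java.util.List;

public class ParallelCheck {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("Available cores: " + Runtime.getRuntime().availableProcessors());

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            Thread t = new Thread(() -> {
                long start = System.nanoTime();
                long dummy = 0;
                for (long j = 0; j < 100_000_000L; j++) { // busy work so the thread runs long enough to observe
                    dummy += j;
                }
                long end = System.nanoTime();
                System.out.printf("%s ran from %d to %d (dummy=%d)%n",
                        Thread.currentThread().getName(), start, end, dummy);
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}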
Your program is indeed running with parallel execution. In this particular example, however, you don't need the locks in your code; it would run perfectly well without them.
This code I have is not executing tasks in parallel; it only executes the code once in this case (whatever is in the for loop, though it should be 2):
public class mqDirect {
    public static void main(String args[]) throws Exception {
        int parallelism = 2;
        ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
        Semaphore semaphore = new Semaphore(parallelism);
        for (int i = 0; i < 1; i++) {
            try {
                semaphore.acquire();
                // snip ... do stuff..
                semaphore.release();
            } catch (Throwable throwable) {
                semaphore.release();
            }
            executorService.shutdownNow();
        }
    }
}
In Java, the main way to make code run in parallel is to create a Thread with a new Runnable as a constructor parameter, and then start it.
There are many tutorials to help you get this to happen properly.
As your code stands, you are merely creating an ExecutorService (and not using it), creating a Semaphore (which should be used from the threads but isn't), performing some processing, and then shutting down the Executor.
BTW: shutdownNow() is probably not what you want; you should just use shutdown().
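A minimal sketch of that Thread-plus-Runnable pattern (the body of the Runnable is just a placeholder):

public class RunnableDemo {
    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            // whatever should run in parallel goes here
            System.out.println("running on " + Thread.currentThread().getName());
        };
        Thread worker = new Thread(work); // Runnable passed as constructor parameter
        worker.start();                   // begins executing run() on a new thread
        worker.join();                    // optionally wait for it to finish
    }
}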
OK, so I found this good tutorial:
http://programmingexamples.wikidot.com/threadpoolexecutor
And I have done something like this:
public class mqDirect {
    int poolSize = 2;
    int maxPoolSize = 2;
    long keepAliveTime = 10;
    ThreadPoolExecutor threadPool = null;
    final ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(5);

    public mqDirect() {
        threadPool = new ThreadPoolExecutor(poolSize, maxPoolSize,
                keepAliveTime, TimeUnit.SECONDS, queue);
    }

    public void runTask(Runnable task) {
        threadPool.execute(task);
        System.out.println("Task count.." + queue.size());
    }

    public void shutDown() {
        threadPool.shutdown();
    }

    public static void main(String args[]) throws Exception {
        mqDirect mtpe = new mqDirect();
        // start first one
        mtpe.runTask(new Runnable() {
            public void run() {
                for (int i = 0; i < 2; i++) {
                    try {
                        System.out.println("First Task");
                        runMqTests();
                        Thread.sleep(1000);
                    } catch (InterruptedException ie) {
                    }
                }
            }
        });
        // start second one
        /*
         * try{ Thread.sleep(500); }catch(InterruptedException ie){}
         */
        mtpe.runTask(new Runnable() {
            public void run() {
                for (int i = 0; i < 2; i++) {
                    try {
                        System.out.println("Second Task");
                        runMqTests();
                        Thread.sleep(1000);
                    } catch (InterruptedException ie) {
                    }
                }
            }
        });
        mtpe.shutDown();
        // runMqTests();
    }
}
And it works!
But the problem is this duplicated code: runMqTests() is the same task. Is there a way to have it run in parallel without duplicating the code?
The example I based this on assumes each task is different.
This code I have is not executing tasks in parallel, it only executes the code in this case once (whatever is in the for loop, but it should be 2):
Just because you instantiate an ExecutorService instance doesn't mean that things magically run in parallel. You actually need to use that object aside from just shutting it down.
If you want the stuff in the loop to run in the threads in the service then you need to do something like:
int parallelism = 2;
ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
for (int i = 0; i < parallelism; i++) {
    executorService.submit(() -> {
        // the code you want to be run by the threads in the executor-service
        // ...
    });
}
// once you have submitted all of the jobs, you can shut it down
executorService.shutdown();
// you might want to call executorService.awaitTermination(...) here
It is important to note that this will run your code in the service, but there are no guarantees that it will be run "in parallel". That depends on your number of processors and the race conditions inherent in threads. For example, the first task might start up, run, and finish its code before the 2nd one starts. That's the nature of threaded programs, which are by design asynchronous.
If, however, you have at least 2 cores, and the code that you submit to the executor-service takes a long time to run, then most likely the tasks will be running at the same time at some point.
Lastly, as @OldCurmudgeon points out, you should call shutdown() on the service, which allows jobs already submitted to the service to run, as opposed to shutdownNow(), which cancels any queued jobs and also calls thread.interrupt() on any running jobs.
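For example, a typical shutdown sequence might look like this (the timeout value is arbitrary):

executorService.shutdown();                               // stop accepting new jobs, let submitted ones finish
try {
    if (!executorService.awaitTermination(5, TimeUnit.MINUTES)) {
        executorService.shutdownNow();                    // give up and interrupt anything still running
    }
} catch (InterruptedException e) {
    executorService.shutdownNow();
    Thread.currentThread().interrupt();                   // preserve the interrupt status
}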
Hope this helps.
I am new to multithreading in Java. I have implemented a multithreaded program in Java to process an array, and I need your help and suggestions to optimise it and refactor it if possible.
Scenario
We get a huge csv file, which has thousands of rows, and we need to process it.
So I basically convert it to an array, split it, and pass the pieces to the processing method; each input is a subset of the array.
Right now I am splitting the array into 20 equal subsets and passing them to 20 threads for execution. It takes ~2 mins, which is fine. Without multithreading it takes 30 mins.
Help needed
I am giving a snapshot of my code below.
Although it works fine, I am wondering whether there is any way to standardize it more and refactor it. Right now it looks clumsy.
To be more specific, instead of creating individual thread runners, if I could parameterize it, that would be great.
Code
private static void ProcessRecords(List<String[]> inputCSVData)
{
    // Do some operation
}
In the main program:
public static void main(String[] args) throws ClassNotFoundException, SQLException, IOException, InterruptedException
{
    int size = csvData.size();
    // Split the array
    int firstArraySize = (size / 20);
    int secondArrayEndIndex = (firstArraySize * 2) - 1;
    csvData1 = csvData.subList(1, firstArraySize);
    csvData2 = csvData.subList(firstArraySize, secondArrayEndIndex);
    // .... and so on

    Thread thread1 = new Thread(new Runnable() {
        public void run() {
            try {
                ProcessRecords(csvData1);
            } catch (ClassNotFoundException | SQLException | IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    });
    Thread thread2 = new Thread(new Runnable() {
        public void run() {
            try {
                ProcessRecords(csvData2);
            } catch (ClassNotFoundException | SQLException | IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    });
    // ... and so on for 20 times

    thread1.start();
    thread2.start();
    // ... for all remaining threads
    // thread20.start();

    thread1.join();
    thread2.join();
    // ... for all remaining threads
    // thread20.join();
}
Since Java 7, you can implement such a mechanism efficiently out of the box thanks to the Fork/Join framework. Starting from Java 8, you can do it directly with the Stream API, more precisely with a parallel stream, which behind the scenes uses a ForkJoinPool and its work-stealing algorithm to provide the best possible performance.
In your case, you could process the data row by row, as follows:
csvData.parallelStream().forEach(MyClass::ProcessRecord);
where ProcessRecord is a method of the class MyClass with the signature:
private static void ProcessRecord(String[] inputCSVData) {
    // Do some operation
}
By default, a parallel stream uses the common ForkJoinPool, whose size corresponds to Runtime.getRuntime().availableProcessors(); that is enough for tasks that do very little IO. If your tasks do enough IO that you would like a bigger pool, simply submit the initial task to your own ForkJoinPool; the parallel stream will then use your pool instead of the common pool.
ForkJoinPool forkJoinPool = new ForkJoinPool(20);
forkJoinPool.submit(() -> csvData.parallelStream().forEach(MyClass::ProcessRecord)).get();
You have done a lot of redundant work to get here. You can use an ExecutorService with a fixed thread pool and submit tasks to it, instead of hard-coding 20 threads.
Also, how was the value of 20 for the number of threads decided? Use
Runtime.getRuntime().availableProcessors();
to determine the core count at runtime.
public static void main(String[] args) throws ClassNotFoundException, SQLException, IOException, InterruptedException {
    int size = csvData.size();
    int threadCount = Runtime.getRuntime().availableProcessors();
    ExecutorService executorService = Executors.newFixedThreadPool(threadCount);
    int index = 0;
    int chunkSize = size / threadCount;
    while (index < size) {
        final int start = index;
        final int end = Math.min(start + chunkSize, size); // subList takes an exclusive end index, not a length
        executorService.submit(new Runnable() {
            @Override
            public void run() {
                try {
                    ProcessRecords(csvData.subList(start, end));
                } catch (ClassNotFoundException | SQLException | IOException e) {
                    e.printStackTrace();
                }
            }
        });
        index += chunkSize;
    }
    executorService.shutdown();
    while (!executorService.isTerminated()) {
        Thread.sleep(1000); // surround with try/catch for InterruptedException if main does not declare it
    }
}
public class MyResource {

    private int count = 0;

    void increment() {
        count++;
    }

    void insert() { // incrementing shared resource count
        for (int i = 0; i < 100000000; i++) {
            increment();
        }
    }

    void insert1() { // incrementing shared resource count
        for (int i = 0; i < 100000000; i++) {
            increment();
        }
    }

    void startThread() {
        Thread t1 = new Thread(new Runnable() { // thread incrementing count using insert()
            @Override
            public void run() {
                insert();
            }
        });
        Thread t2 = new Thread(new Runnable() { // thread incrementing count using insert1()
            @Override
            public void run() {
                insert1();
            }
        });
        t1.start();
        t2.start();
        try {
            t1.join(); // t1 and t2 race to increment count by telling current thread to wait
            t2.join();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    void entry() {
        long start = System.currentTimeMillis();
        startThread(); // commenting insert(); insert1() gives output as time taken = 452 (approx) 110318544 (obvious)
        // insert(); insert1(); // commenting startThread() gives output as time taken = 452 (approx) 200000000
        long end = System.currentTimeMillis();
        long time = end - start;
        System.out.println("time taken = " + time);
        System.out.println(count);
    }
}
The program entry point is the entry() method.
1. Using only insert(); insert1(); (normal method calls) and commenting out startThread() (which runs the threads) gives me the result shown in the code.
2. Now, commenting out insert(); insert1(); and using startThread() (which runs the threads) gives me the result shown in the code.
3. Now, if I synchronize increment(), I get: time taken = 35738, and count = 200000000.
As shown above, synchronizing protects the shared resource from concurrent access, but on the other hand it takes a lot of time to process.
So what's the use of synchronizing if it decreases performance?
Sometimes you just want two or more things to go on at the same time. Imagine the server of a chat application, or a program that updates the GUI while a long task is running to let the user know that processing is going on.
You are not supposed to use synchronization to increase performance; you are supposed to use it to protect shared resources.
Is this a real code example? Because if you want to use threads here in order to split the work, synchronizing increment() is not the best approach...
EDIT
As described here, you can change the design of this specific code to divide the work between the 2 threads more efficiently. I altered their example to fit your needs, but all the methods described there are good.
import java.util.*;
import java.util.concurrent.*;
import static java.util.Arrays.asList;

public class Sums {

    static class Counter implements Callable<Long> {
        private final long _limit;

        Counter(long limit) {
            _limit = limit;
        }

        @Override
        public Long call() {
            long counter = 0;
            for (long i = 0; i <= _limit; i++) {
                counter++;
            }
            return counter;
        }
    }

    public static void main(String[] args) throws Exception {
        long counter = 0;
        ExecutorService executor = Executors.newFixedThreadPool(2);
        List<Future<Long>> results = executor.invokeAll(asList(
                new Counter(500000), new Counter(500000)
        ));
        executor.shutdown();
        for (Future<Long> result : results) {
            counter += result.get();
        }
        System.out.println(counter);
    }
}
And if you must use synchronisation, AtomicLong will do a better job.
Performance is not the only factor. Correctness can also be very important. Here is another question that has some low-level details about the synchronized keyword.
If you are looking for performance, consider using the java.util.concurrent.atomic.AtomicLong class. It has been optimized for fast, atomic access.
EDIT:
Synchronized is overkill in this use case. Synchronized would be much more useful for file IO or network IO, where the calls are much longer and correctness is much more important. Here is the source code for AtomicLong. Volatile was chosen because it is much more performant for short calls that change shared memory.
Adding the synchronized keyword adds extra Java bytecode that does a lot of checking to acquire the lock safely. Volatile keeps the data in main memory, which takes longer to access, but the CPU enforces atomic access instead of the JVM generating extra code under the hood.
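As a rough sketch (class and method names here are made up, not from the original code), a counter based on AtomicLong needs no synchronized block and still ends up with the expected total:

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    private final AtomicLong count = new AtomicLong();

    void increment() {
        count.incrementAndGet(); // atomic read-modify-write, no explicit lock needed
    }

    long value() {
        return count.get();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicCounter resource = new AtomicCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100000000; i++) {
                resource.increment();
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(resource.value()); // 200000000, without a synchronized block
    }
}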
I have hundreds of files to process. I do each file one at a time and it takes 30 minutes.
I'm thinking I can do this processing in 10 simultaneous threads, 10 files at a time, and I might be able to do it in 3 minutes instead of 30.
My question is, what is the "correct" way to manage my 10 threads, so that when one is done, a new one is created, up to a maximum of 10?
This is what I have so far ... is this the "correct" way to do it?
public class ThreadTest1 {

    public static int idCounter = 0;

    public class MyThread extends Thread {
        private int id;

        public MyThread() {
            this.id = idCounter++;
        }

        public void run() {
            // this run method represents the long-running file processing
            System.out.println("I'm thread '" + this.id + "' and I'm going to sleep for 5 seconds!");
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.out.println("I'm thread '" + this.id + "' and I'm done sleeping!");
        }
    }

    public void go() {
        int MAX_NUM_THREADS = 10;
        List<MyThread> threads = new ArrayList<MyThread>();
        // this for loop represents the 200 files that need to be processed
        for (int i = 0; i < 200; i++) {
            // if we've reached the max num of threads ...
            while (threads.size() == MAX_NUM_THREADS) {
                // loop through the threads until we find a dead one and remove it
                for (MyThread t : threads) {
                    if (!t.isAlive()) {
                        threads.remove(t);
                        break;
                    }
                }
            }
            // add new thread
            MyThread t = new MyThread();
            threads.add(t);
            t.start();
        }
    }

    public static void main(String[] args) {
        new ThreadTest1().go();
    }
}
You can use an ExecutorService to manage your threads.
And you can add a while loop to the thread's run method to execute file-processing tasks repeatedly.
Also, you can read about BlockingQueue usage. I think it would fit perfectly for allocating new files (tasks) between threads.
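A rough sketch of that idea with a fixed thread pool instead of hand-managed Thread objects (the directory name and processFile are placeholders, not from the original code):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FileProcessor {

    static void processFile(File file) {
        // placeholder for the long-running per-file work
        System.out.println("Processing " + file.getName() + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws InterruptedException {
        File[] files = new File("input-dir").listFiles();        // hypothetical input directory
        ExecutorService pool = Executors.newFixedThreadPool(10);  // at most 10 files in flight
        if (files != null) {
            for (File file : files) {
                pool.submit(() -> processFile(file));              // queued tasks wait until a thread frees up
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);                  // wait for everything to finish
    }
}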
I would suggest using Camel's File component if you are open to it. The component will handle all the concurrency issues and ensure that multiple threads don't try to process the same file. The biggest challenge in making your code multi-threaded is making sure the threads don't interfere with one another. Let a framework take care of this for you.
Example:
from("file://incoming?maxMessagesPerPoll=1&idempotent=true&moveFailed=failed&move=processed&readLock=none")
.threads(10).process()