Parallelizing class with static member variables - java

I'm currently working with an application that does some heavy computational work. It was ported from C to Java years ago, and it shows. Among other things, it uses public static variables to share data between classes.
The work is well suited to parallelization: multiple files are processed, and every file can be handled completely independently of the others. But simply starting multiple threads doesn't work because of the static variables. I would like to avoid a rewrite because the classes are fast, mature and bug-free.
Is there an easy way to start multiple threads/processes from within the Java program so that each thread gets its own copy of the static variables, or will I have to resort to launching the JVM multiple times as separate processes?

Yes, you can use multiple class loaders, or start multiple processes.
However, I suggest just fixing the code; it would be much simpler. Make all the static fields non-static, and hold the per-thread state in a ThreadLocal so that each thread works on its own copy.
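A minimal sketch of that idea, assuming the legacy state is gathered into a hypothetical LegacyState holder class (the names here are illustrative, not from the original code):

// Hypothetical stand-in for the data that used to live in public static fields.
class LegacyState {
    int counter;
    String currentFile;
}

class LegacyStateHolder {
    // Each thread that reads this field gets its own LegacyState,
    // created lazily on first access (withInitial requires Java 8+).
    private static final ThreadLocal<LegacyState> STATE =
            ThreadLocal.withInitial(LegacyState::new);

    static LegacyState get() {
        return STATE.get();
    }

    // Clear the per-thread copy once a file is done, to avoid stale
    // state when threads are reused by a pool.
    static void reset() {
        STATE.remove();
    }
}

Call sites then change from reading the static field directly to calling LegacyStateHolder.get().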

Related

Java instance members and concurrency

Given what I understand of concurrency in Java, it seems that access to instance members needs to be coded for multi-threaded access only if the threads access the same instance of a given object, such as a servlet. See here:
Why instance variable in Servlet is not thread-safe
Since not all applications are servlet based, how do you determine which objects need to accommodate multi-threaded access? For example, in a large, non-servlet-based enterprise application, given the sheer number of classes, how do you determine from a design standpoint which objects will have a single instance shared across multiple threads at run-time? The only situation I can think of is a singleton.
In Java's EL API, javax.el.BeanELResolver has a private inner class that uses synchronization to serialize access to one of its members. Unless I am missing something, BeanELResolver does not look like a singleton, and so each thread should have its own instance of BeanELResolver. What could have been the design consideration behind synchronizing one of its members?
There are many cases in which the state of one class can be shared across many threads, not just singletons. For example you could have a class or method creating objects (some sort of factory) and injecting the same dependency in all the created objects. The injected dependency will be shared across all the threads that call the factory method. The dependency could be anything: a counter, database access class, etc.
For example:
import java.util.concurrent.atomic.AtomicInteger;

class ThreadSafeCounter {
    private final String name;
    private final AtomicInteger i = new AtomicInteger();

    ThreadSafeCounter(String name) { this.name = name; }

    int increment() { return i.incrementAndGet(); }
}

class SheepTracker {
    private final ThreadSafeCounter sheepCounter;

    public SheepTracker(ThreadSafeCounter c) { sheepCounter = c; }

    public int addSheep() { return sheepCounter.increment(); }
}

class SheepTrackerFactory {
    private final ThreadSafeCounter c;

    public SheepTrackerFactory(ThreadSafeCounter c) { this.c = c; }

    // Every SheepTracker created here shares the same counter,
    // so the count is global across all calling threads.
    public SheepTracker newSheepAdder() {
        return new SheepTracker(c);
    }
}
In the above, the SheepTrackerFactory can be used by many threads that all need to do the same thing, i.e., keeping track of sheep. The number of sheep across all the threads is maintained in a global state variable, the ThreadSafeCounter (it could be just an AtomicInteger in this example, but bear with me, you can imagine how this class could contain additional state/operations). Now each SheepTracker can be a lightweight class that performs other operations that don't require synchronization, but when they need to increment the number of sheep, they will do it in a thread-safe way.
You're asking a very broad question, so I'll try to answer with a broad answer. One of the first things your design has to consider, long before you dive into classes, is the design of the application's threading. In this step you consider the task at hand, and how to best utilize the hardware that has to solve it. Based on that, you choose the best threading design for your application.
For instance: does the application perform intense computations? If so, can parts of the computation be parallelized to make better use of a multi-core CPU? If so, design for multiple threads that compute on different cores in parallel.
Does your application perform a lot of I/O operations? If so, it's better to parallelize them so multiple threads could handle the input/output (which is slow and requires a lot of waiting for external devices) while other threads continue working on their own tasks. This is why servlets are executed by multiple threads in parallel.
Once you decide on the tasks you want to parallelize and the ones you prefer to execute in a single thread, you move on to the design of the classes themselves. Now it's clear which parts of your software have to be thread safe, and which don't. You have a data structure that's being accessed by a thread pool responsible for I/O? It has to be thread safe. You have an object that's being accessed by a single thread that performs maintenance tasks? It doesn't have to be.
Anyway, this has nothing to do with singletons. Singleton is a design pattern that means that only a single instance of a certain object can be created. It doesn't say anything about the number of threads accessing it or its members.
Any instance can be shared between threads, not only singletons.
That's why it's pretty hard to come up with a design where anyone in a development team can instantly see which types or instances will be shared between threads and which won't. It is outright impossible to prevent the sharing of some instances. So the solution must lie somewhere else. Read up on "memory barriers" to understand the details.
Synchronization is used for two purposes:
Define memory barriers (i.e. when changes should become visible to other threads)
Make sure that complex data structures can be shared between threads (i.e. locking).
Since there is no way to prevent people from sharing a single BeanELResolver instance between different threads, they probably need to make sure that concurrent access doesn't break some complex structure (probably a Map).
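To illustrate the general pattern (this is a sketch, not the actual BeanELResolver source): a class that may be shared between threads and that keeps a plain HashMap internally has to serialize access to that map, otherwise concurrent modification can corrupt its structure.

import java.util.HashMap;
import java.util.Map;

// Sketch of a resolver-like class whose instances might be shared by threads.
class PropertyResolver {
    private final Map<String, String> cache = new HashMap<String, String>();

    String resolve(String name) {
        synchronized (cache) {          // the lock guards both reads and writes
            return cache.get(name);
        }
    }

    void store(String name, String value) {
        synchronized (cache) {
            cache.put(name, value);
        }
    }
}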

use of Static variable in mapreduce programs

I'm currently coding a program on Hadoop. In my reduce task I have to use a static variable because I want it to be edited by many threads (these threads are started from the reduce function).
The problem is that this variable is being edited by the threads of the current reduce task and also by the threads of the other reduce tasks, and I want to avoid this.
So my question is: is there a way or a trick to make sure this variable is modified only by the threads of the current reducer?
I hope that my question is clear enough to help you to help me ;).
Thank you very much stack-community :)
You may want to review the "shared nothing" aspect of map/reduce programs. They are not intended (and in fact are not able) to share JVM objects, including static variables.
The reducers would typically be mostly independent of each other. They operate on a particular partition of the data that has been processed in the Mapper stage.
In unusual circumstances there is still the opportunity to use Counters to share data across reducers. However, it is more likely that you would want to study existing map/reduce programs and see how they maintain the separation across reducers.
Thank you javadba,
You said that the reducers are independent; yes, but if more than one reducer is executed on the same node (within the same JVM), then static variables will be shared.
So the solution I used, which may help others, is to use a different variable for each reducer, according to the number of unique keys.
For example, where my first program declared "public static int x;", to avoid the problem I cited above I now declare "public static ArrayList x;", and each element of this ArrayList is dedicated to a unique reducer.
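A minimal sketch of that workaround in plain Java (reducerIndex is hypothetical: it stands for whatever unique, zero-based index each reducer can derive for itself, and MAX_REDUCERS is an assumed upper bound):

import java.util.concurrent.atomic.AtomicIntegerArray;

// One slot per reducer instead of a single shared static int. Threads
// belonging to reducer k only touch slot k, so reducers no longer
// interfere with each other even when they end up in the same JVM.
class PerReducerState {
    private static final int MAX_REDUCERS = 16;
    private static final AtomicIntegerArray SLOTS =
            new AtomicIntegerArray(MAX_REDUCERS);

    static int increment(int reducerIndex) {
        return SLOTS.incrementAndGet(reducerIndex);
    }

    static int get(int reducerIndex) {
        return SLOTS.get(reducerIndex);
    }
}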

Inter-thread communication

The easiest setup is when, from a single class's main method, we start other classes that implement Runnable:
public static void main(String[] args) {
    // declarations ...
    receiver.start();
    player.start();
}
Say that inside the receiver I have a while loop which receives a packet value, and I want to send that value to the second thread. How do I do that?
Just to clarify: I don't yet care about one thread controlling another, I just want the first thread to share values with the second.
And a tiny question aside: does JDK 7's Fork/Join really dramatically increase performance for the Java concurrency API?
Thank You For Your Time,
A simple option is to use a java.util.concurrent.atomic.AtomicReference (or one of the other Atomic... classes). Create a single instance of AtomicReference, and pass it to the code that the various threads run, and set the value from the receiver thread. The other thread(s) can then read the value at their leisure, in a thread-safe manner.
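A minimal sketch of that approach (the Packet class and the thread bodies are placeholders, not taken from the question):

import java.util.concurrent.atomic.AtomicReference;

public class SharedValueDemo {

    // Hypothetical payload standing in for whatever the receiver produces.
    static class Packet {
        final int value;
        Packet(int value) { this.value = value; }
    }

    public static void main(String[] args) throws InterruptedException {
        final AtomicReference<Packet> latest = new AtomicReference<Packet>();

        Thread receiver = new Thread(new Runnable() {
            public void run() {
                // The receiver publishes each new value; the write is
                // visible to any thread that later calls latest.get().
                latest.set(new Packet(42));
            }
        });

        Thread player = new Thread(new Runnable() {
            public void run() {
                Packet p = latest.get();   // may be null until the receiver has set something
                System.out.println(p == null ? "nothing yet" : "got " + p.value);
            }
        });

        receiver.start();
        player.start();
        receiver.join();
        player.join();
    }
}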
does JDK 7's Fork/Join really dramatically increase performance for the Java concurrency API?
No, it's just a new API to make some things easier. It's not there to make things faster.
The java.util.concurrent package contains many helpful interfaces and classes for safely communicating between threads. I'm not sure I understand your use case here, but if your player (producer) is supposed to pass tasks to the receiver (consumer), you could for example use a BlockingQueue implementation.
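A short producer/consumer sketch using ArrayBlockingQueue (the names, capacity and the poison-pill value are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueHandoffDemo {

    private static final int POISON_PILL = -1;   // illustrative end-of-stream marker

    public static void main(String[] args) {
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(100);

        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 10; i++) {
                        queue.put(i);            // blocks if the queue is full
                    }
                    queue.put(POISON_PILL);      // tell the consumer we are done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        int value = queue.take();   // blocks until a value is available
                        if (value == POISON_PILL) {
                            break;
                        }
                        System.out.println("consumed " + value);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        consumer.start();
    }
}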

Manually Increasing the Amount of CPU a Java Application Uses

I've just made a program with Eclipse that takes a really long time to execute. It's taking even longer because it's loading my CPU to 25% only (I'm assuming that is because I'm using a quad-core and the program is only using one core). Is there any way to make the program use all 4 cores to max it out? Java is supposed to be natively multi-threaded, so I don't understand why it would only use 25%.
You still have to create and manage threads manually in your application. Java can't determine that two tasks can run asynchronously and automatically split the work into several threads.
This is a pretty vague question because we don't know much about what your program does. If your program is single-threaded, then no number of cores on your machine is going to make it run any faster. Java does have threading support, but it won't automatically parallelize your code for you. To speed it up, you'll need to identify parts of the computation that can be run in parallel with one another and add code as appropriate to split up and reconstitute the work. Without more info on what your program does, I can't help you out.
Another important detail to note is that Java threads are not the same as system threads. The JVM often has its own thread scheduler that tries to put Java threads onto actual system threads in a way that's fair, but there's no actual guarantee that it will do so.
Yes, Java is multi-threaded, but the multi-threading doesn't happen "by magic".
Have a look either at the Thread class or at the Executor framework. Essentially you need to split your job into "subtasks", each of which can run on a single processor, then do something like this:
Executor ex = Executors.newFixedThreadPool(4);
while (thereAreMoreSubtasksToDo) {
    ex.execute(new Runnable() {
        public void run() {
            // ... do subtask ...
        }
    });
}
Turning a serial routine/algorithm into a parallel one isn't necessarily trivial: you need to know in particular about a range of issues broadly termed "thread-safety". You may be interested in some material I've written about thread-safety in Java, and threading in general if you follow the links: the key thing to bear in mind is that if any data/objects are being shared among the different threads running, then you need to take special precautions. That said, for independent things that you just want to "run at the same time", then the above pattern will get you started.
Java is multi-threaded but if your application runs in only one thread, only one thread will be used. (Apart from the internal threads Java uses for finalization, garbage collection and so on.)
If you want your code to use multiple threads, you have to split it up manually, either by starting threads by yourself or using a third party thread pool. I'd suggest the latter option as it's safer but both can work equally well.
You've got a bit of learning ahead of you (actually, quite a bit of learning) - but it's learning you should do if you are going to be doing any serious programming.
Here's a starting point: http://download.oracle.com/javase/tutorial/essential/concurrency/
But you might want to look into a good book on Java multi-threading (I did this so long ago that any book I could recommend would be out of print). This sort of hard topic is well suited for learning from a text instead of online tutorials.

How good is the JVM at parallel processing? When should I create my own Threads and Runnables? Why might threads interfere?

I have a Java program that runs many small simulations. It runs a genetic algorithm, where each fitness function is a simulation using parameters on each chromosome. Each one takes maybe 10 or so seconds if run by itself, and I want to run a pretty big population size (say 100?). I can't start the next round of simulations until the previous one has finished. I have access to a machine with a whack of processors in it and I'm wondering if I need to do anything to make the simulations run in parallel. I've never written anything explicitly for multicore processors before and I understand it's a daunting task.
So this is what I would like to know: To what extent and how well does the JVM parallelize? I have read that it creates low-level threads, but how smart is it? How efficient is it? Would my program run faster if I made each simulation a thread? I know this is a huge topic, but could you point me towards some introductory literature concerning parallel processing and Java?
Thanks very much!
Update:
Ok, I've implemented an ExecutorService and made my small simulations implement Runnable and have run() methods. Instead of writing this:
Simulator sim = new Simulator(args);
sim.play();
return sim.getResults();
I write this in my constructor:
ExecutorService executor = Executors.newFixedThreadPool(32);
And then each time I want to add a new simulation to the pool, I run this:
RunnableSimulator rsim = new RunnableSimulator(args);
executor.execute(rsim);
return rsim.getResults();
The RunnableSimulator::run() method calls the Simulator::play() method; neither takes arguments.
I think I am getting thread interference, because now the simulations error out. By error out I mean that variables hold values that they really shouldn't. No code within the simulation was changed, and before this the simulation ran perfectly over many, many different arguments. The sim works like this: each turn it's given a game piece and loops through all the locations on the game board. It checks whether the location given is valid, and if so, commits the piece and measures that board's goodness. Now, obviously invalid locations are being passed to the commit method, resulting in index-out-of-bounds errors all over the place.
Each simulation is its own object right? Based on the code above? I can pass the exact same set of arguments to the RunnableSimulator and Simulator classes and the runnable version will throw exceptions. What do you think might cause this and what can I do to prevent it? Can I provide some code samples in a new question to help?
Java Concurrency Tutorial
If you're just spawning a bunch of stuff off to different threads, and it isn't going to be talking back and forth between different threads, it isn't too hard; just write each in a Runnable and pass them off to an ExecutorService.
You should skim the whole tutorial, but for this particular task, start here.
Basically, you do something like this:
ExecutorService executorService = Executors.newFixedThreadPool(n);
where n is the number of things you want running at once (usually the number of CPUs). Each of your tasks should be an object that implements Runnable, and you then execute it on your ExecutorService:
executorService.execute(new SimulationTask(parameters...));
Executors.newFixedThreadPool(n) will start up n threads, and execute will insert the tasks into a queue that feeds to those threads. When a task finishes, the thread it was running on is no longer busy, and the next task in the queue will start running on it. Execute won't block; it will just put the task into the queue and move on to the next one.
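Putting those pieces together, a minimal end-to-end sketch (SimulationTask here is a placeholder for your own Runnable):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SimulationRunner {

    // Placeholder task; your real SimulationTask would hold the parameters
    // for one simulation and do the work in run().
    static class SimulationTask implements Runnable {
        private final int id;
        SimulationTask(int id) { this.id = id; }
        public void run() {
            System.out.println("running simulation " + id);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService executorService = Executors.newFixedThreadPool(n);

        for (int i = 0; i < 100; i++) {
            executorService.execute(new SimulationTask(i));
        }

        // No new tasks will be accepted; already-queued tasks still run.
        executorService.shutdown();
        // Wait for everything to finish before using the results.
        executorService.awaitTermination(1, TimeUnit.HOURS);
    }
}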
The thing to be careful of is that you really AREN'T sharing any mutable state between tasks. Your task classes shouldn't depend on anything mutable that will be shared among them (i.e. static data). There are ways to deal with shared mutable state (locking), but if you can avoid the problem entirely it will be a lot easier.
EDIT: Reading your edits to your question, it looks like you really want something a little different. Instead of implementing Runnable, implement Callable. Your call() method should be pretty much the same as your current run(), except it should return getResults();. Then, submit() it to your ExecutorService. You will get a Future in return, which you can use to test if the simulation is done, and, when it is, get your results.
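A hedged sketch of the Callable/Future variant described above (SimulationResult and the constructor argument are placeholders for your own types):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableSimulationRunner {

    // Placeholder result type.
    static class SimulationResult {
        final double fitness;
        SimulationResult(double fitness) { this.fitness = fitness; }
    }

    // Each simulation returns its result instead of stashing it somewhere shared.
    static class CallableSimulator implements Callable<SimulationResult> {
        private final int chromosomeId;
        CallableSimulator(int chromosomeId) { this.chromosomeId = chromosomeId; }

        public SimulationResult call() {
            // ... run the simulation for this chromosome ...
            return new SimulationResult(chromosomeId * 0.5);   // dummy value
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<Future<SimulationResult>> futures = new ArrayList<Future<SimulationResult>>();
        for (int i = 0; i < 100; i++) {
            futures.add(executor.submit(new CallableSimulator(i)));
        }

        // get() blocks until that particular simulation is done.
        for (Future<SimulationResult> f : futures) {
            SimulationResult r = f.get();
            System.out.println("fitness = " + r.fitness);
        }

        executor.shutdown();
    }
}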
You can also look at the new fork/join framework by Doug Lea. One of the best books on the subject is certainly Java Concurrency in Practice. I would strongly recommend you take a look at the fork/join model.
Java threads are just too heavyweight. We have implemented parallel branches in Ateji PX as very lightweight scheduled objects. As in Erlang, you can create tens of millions of parallel branches before you start noticing any overhead. But it's still Java, so you don't need to switch to a different language.
If you are doing full-out processing all the time in your threads, you won't benefit from having more threads than processors. If your threads occasionally wait on each other or on the system, then Java scales well up to thousands of threads.
I wrote an app that discovered a class B network (65,000 hosts) in a few minutes by pinging each node, with each ping retried with an increasing delay. When I put each ping on a separate thread (this was before NIO; I could probably improve it now), I could run about 4,000 threads on Windows before things started getting flaky. On Linux the number was nearer 1,000 (never figured out why).
No matter what language or toolkit you use, if your data interacts, you will have to pay attention to the areas where it does. Java uses the synchronized keyword to prevent two threads from entering a section at the same time. If you write your Java in a more functional manner (making all your members final) you can run without synchronization, but, well, let's just say solving problems takes a different approach that way.
Java has other tools to manage units of independent work; look in the java.util.concurrent package for more information.
Java is pretty good at parallel processing, but there are two caveats:
Java threads are relatively heavyweight (compared with e.g. Erlang), so don't start creating them in the hundreds or thousands. Each thread gets its own stack memory (typically several hundred KB to 1 MB by default, depending on the JVM and platform), so you could run out of memory, among other things.
If you run on a very powerful machine (especially one with many CPUs and a large amount of RAM), then the VM's default settings (especially concerning GC) may result in suboptimal performance, and you may have to spend some time tuning them via command-line options. Unfortunately, this is not a simple task and requires a lot of knowledge.
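For illustration only, a few commonly used HotSpot options one might experiment with (the values are placeholders, not recommendations; always measure before and after changing them):

java -Xms4g -Xmx4g -Xss512k -XX:+UseParallelGC -XX:ParallelGCThreads=8 MySimulationApp

Here -Xms/-Xmx fix the heap size, -Xss sets the per-thread stack size, and the -XX options select the parallel collector and the number of GC worker threads; MySimulationApp is a stand-in for your own main class.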
