use of Static variable in mapreduce programs

I'm writing a program on Hadoop. In my reduce task I have to use a static variable because I want it to be modified by many threads (these threads are started from the reduce function).
The problem is that this variable is being modified by the threads of the current reduce task and also by the threads of the other reduce tasks, and I want to avoid this.
So my question is: is there a way, or a trick, to have this variable modified only by the threads of the current reducer?
I hope my question is clear enough for you to help me ;).
Thank you very much, Stack community :)

You may want to review the "shared nothing" aspect of map/reduce programs. They are not intended (and in fact are not able) to share JVM objects, including static variables.
Reducers are typically independent of each other. Each one operates on a particular partition of the data that was processed in the Mapper stage.
In unusual circumstances you can use Counters to share data across reducers. However, it is more likely that you would want to study existing map/reduce programs and see how they maintain the separation across reducers.

Thank you, javadba.
You said that the reducers are independent. Yes, but if more than one reducer is executed on the same node, then static variables will be shared.
The solution I used, which may help others, was to use a different variable for each reducer, according to the number of unique keys.
For example, if my first program declared public static int x; then to avoid the problem described above I now declare public static ArrayList x; and each element of this ArrayList is dedicated to a single reducer.
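An alternative to a static list indexed by reducer is to keep the value in an instance field of the object that starts the helper threads, so that each reducer object has its own copy. The following is only a minimal sketch in plain Java with no Hadoop classes; ReduceWorker and sumWithThreads are hypothetical names standing in for a reducer and its reduce function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one reducer; this is not a Hadoop class.
class ReduceWorker {
    // Instance field: shared by this worker's helper threads only,
    // never by the threads of another ReduceWorker in the same JVM.
    private final AtomicInteger x = new AtomicInteger();

    // Adds every value to x, using one helper thread per value.
    int sumWithThreads(int[] values) {
        List<Thread> helpers = new ArrayList<>();
        for (int v : values) {
            Thread t = new Thread(() -> x.addAndGet(v));
            helpers.add(t);
            t.start();
        }
        for (Thread t : helpers) {
            try {
                t.join(); // wait until this helper has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return x.get();
    }

    public static void main(String[] args) {
        ReduceWorker first = new ReduceWorker();
        ReduceWorker second = new ReduceWorker();
        System.out.println(first.sumWithThreads(new int[] {1, 2, 3}));  // 6
        System.out.println(second.sumWithThreads(new int[] {10, 20}));  // 30
    }
}
```

Because x is not static, the two workers never see each other's updates, even when they run in the same JVM.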

Related

Parallelizing class with static member variables

I'm currently working with an application that does some heavy computational work. It was ported from C to Java years ago and it shows a bit. Among other things, it uses public static variables to share data between classes.
The work is well suited to parallelizing, as multiple files are processed and every file can be handled completely independently of the others. But just starting multiple threads doesn't work because of the static variables. I would like to avoid a rewrite because the classes are quite fast, mature and bug-free.
Is there an easy way for me to start multiple threads/processes from within the Java program, where each thread has its own copy of the static variables, or will I have to resort to launching the JVM multiple times by executing commands?

Yes, you can use multiple class loaders, or start multiple processes.
However, I suggest just fixing the code; it would be much simpler. Make all the static fields non-static and have a ThreadLocal variable that holds the instance copy for each thread.
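The ThreadLocal suggestion could look roughly like the sketch below. FileState, linesSeen and countOnNewThread are made-up names for illustration; the real classes would have their own fields.

```java
// Hypothetical legacy-style class: its formerly static field is now an instance field.
class FileState {
    int linesSeen; // was: public static int linesSeen

    // One FileState per thread, created lazily on that thread's first access.
    private static final ThreadLocal<FileState> CURRENT =
            ThreadLocal.withInitial(FileState::new);

    static FileState current() {
        return CURRENT.get();
    }
}

class ThreadLocalDemo {
    // Increments the counter `count` times on a fresh thread and
    // returns the value that thread ended up with.
    static int countOnNewThread(int count) {
        int[] result = new int[1];
        Thread t = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                FileState.current().linesSeen++;
            }
            result[0] = FileState.current().linesSeen;
        });
        t.start();
        try {
            t.join(); // join makes the write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(countOnNewThread(5)); // 5
        System.out.println(countOnNewThread(7)); // 7, not 12: each thread starts from its own copy
    }
}
```

Existing code that read the static field would call FileState.current() instead, which keeps the change mechanical.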

Java multi-threading accessing same variable

I have a Java program which creates 2 threads. Inside these 2 threads, they try to update the global variable abc to different values, let's say integer 1 and integer 3.
Let's say they execute the code at the same time (in the same millisecond), for example:
    public class MyThread implements Runnable {
        public void run() {
            while (true) {
                if (currentTime == specificTime) {
                    abc = 1; // another thread updates abc to 3
                }
            }
        }
    }
In this case, how can we determine the resulting value of the variable abc? I am very curious how the operating system schedules the execution.
(I know synchronization should be used, but I just want to know how the system naturally handles this kind of conflict.)

The operating system has little involvement in this: at the time your threads are running, the memory allocated to abc is under the control of the JVM running your program, so it's your program that is in control.
When two threads access the same memory location, the last writer wins. Which particular thread gets to be the last writer, however, is non-deterministic, unless you use synchronization.
Moreover, without you taking special care of accessing the shared data, one thread may not even see the results of the other thread writing to the abc location.
To avoid synchronization issues, you should use synchronization or one of the java.util.concurrent.atomic classes.
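As a small illustration of the java.util.concurrent.atomic option, here is a sketch in which two threads update the same AtomicInteger. The class and method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

class AtomicCounterDemo {
    // Two threads each increment a shared counter `perThread` times.
    static int incrementFromTwoThreads(int perThread) {
        AtomicInteger abc = new AtomicInteger();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                abc.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return abc.get();
    }

    public static void main(String[] args) {
        System.out.println(incrementFromTwoThreads(100_000)); // 200000 on every run
    }
}
```

With a plain int and abc++ in place of the AtomicInteger, the printed total could come out lower and differ between runs, because increments from the two threads can overwrite each other.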

From Java's perspective the situation is fairly simple if abc is not volatile or accessed with appropriate synchronisation.
Let's assume that abc is 0 originally. After your two threads have updated it to respectively 1 and 3, abc could be observed in three states: 0, 1 or 3. Which value you get is not deterministic and the result may vary from one run to the other.

Depends on the operating system, running environment etc.
Some environments will actually stop you from doing this - known as thread safety.
Otherwise the results are totally unpredictable which is why it is so dangerous to do this.
It mainly just depends on which thread updated it last for what the value will be. One thread will get CPU cycles before the other to do the atomic operation first.
Also, I don't think that operating systems go as far as to schedule threads, because in most operating systems it is the program that is responsible for them, and without explicit calls like synchronize, or a thread pool model, I think the order of execution is pretty hard to predict. It's a very environment-dependent thing.

From the system's perspective the result will depend on many software, hardware and run-time factors that cannot be known in advance. From this perspective there is no conflict nor a problem.
From the programmer's perspective the result is not deterministic and is therefore a problem/conflict. The conflict needs to be resolved at design time.

"In this case, how can we determine the result of the variable abc? I am very curious how Operating System schedule the execution?"
The result will not be deterministic, as the value will be the last one written. You cannot make any guarantee about the result. The execution is scheduled like any other. As you request no synchronization in your code, the JVM will not enforce anything for you.
"I know Synchronize should be used, but I just want to know naturally how the system will handle this kind of conflict problem."
Simply put: it won't, because for the system there is no conflict. Problems will occur only for you, the programmer, since you will eventually run into a data race and non-deterministic behavior. It is completely up to you.

Just add the volatile modifier to your variable; then updates to it will be visible to all threads, and a thread reading it will get its current value. volatile means that the value will always be up to date for all threads accessing it.
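A sketch of what volatile buys you, using an invented stop flag: the worker is guaranteed to see the write made by the other thread. Note that volatile only covers visibility of single reads and writes; it does not make a compound update such as abc++ atomic.

```java
class VolatileFlagDemo {
    // volatile: a write by one thread is guaranteed to be seen by later reads in other threads.
    private static volatile boolean stop;

    // Starts a spinning worker, sets the flag, and reports whether the worker exited.
    static boolean workerStopsAfterFlagIsSet() {
        stop = false;
        Thread worker = new Thread(() -> {
            while (!stop) {
                // busy-wait until another thread sets the flag
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive for this thread
        worker.start();
        stop = true;
        try {
            worker.join(5000); // wait at most five seconds
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(workerStopsAfterFlagIsSet()); // true
    }
}
```

Without volatile, the JVM would be allowed to keep the worker spinning forever, since nothing would force it to re-read the flag.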

Simple data thread question - java

If I have a static array which does not change after being populated, multiple threads can read this array at the same time, can't they? I believe problems arise when one thread tries to read the array while another is modifying it.
Thank you for your response.

Just don't access it with multiple threads while the array is being populated. If nothing is modifying the data (only reads) then you should be fine. Your assumptions are correct.

It's safe. The first step in static initialization is to synchronize on the class [1]
If all other accesses are read, the program is correctly synchronized.
[1] http://java.sun.com/docs/books/jls/third_edition/html/execution.html#12.4.2
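To make the static-initialization case concrete, here is a sketch in which the array is filled in a static initializer and then only read. Squares and ReadOnlyArrayDemo are illustrative names, not anything from the question.

```java
import java.util.concurrent.atomic.AtomicLong;

class Squares {
    // Filled once during class initialization, before any thread can call get().
    private static final int[] TABLE = new int[100];

    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = i * i;
        }
    }

    static int size() {
        return TABLE.length;
    }

    static int get(int i) {
        return TABLE[i];
    }
}

class ReadOnlyArrayDemo {
    // Every thread reads the whole table without any locking.
    static long sumFromThreads(int threadCount) {
        AtomicLong total = new AtomicLong();
        Thread[] readers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            readers[t] = new Thread(() -> {
                long local = 0;
                for (int i = 0; i < Squares.size(); i++) {
                    local += Squares.get(i);
                }
                total.addAndGet(local);
            });
            readers[t].start();
        }
        for (Thread r : readers) {
            try {
                r.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(sumFromThreads(4)); // 4 * 328350 = 1313400
    }
}
```

The guarantee relies on the array never being written after the initializer finishes; a lazily populated array would need its own synchronization.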

Yes, exactly. Google "critical region" for more details.

To avoid multiple threads accessing the same data at the same moment, you have to use synchronized.

Yes, as long as everything is read-only then you'll be fine. Just make sure that no threads attempt to read while the array is being populated (e.g. if it's lazily populated).

Yes, data reads are thread safe - it is only when you are changing the data that you have to be concerned.

If you are mostly reading the array but it is occasionally modified, ReadWriteLock may perform better than synchronized: http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/locks/ReadWriteLock.html
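For an array that is read often and written rarely, a ReadWriteLock wrapper might be sketched as follows. GuardedArray is a hypothetical class; whether it actually beats synchronized depends on the read/write mix and should be measured.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper: many concurrent readers, one writer at a time.
class GuardedArray {
    private final int[] data;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    GuardedArray(int size) {
        data = new int[size];
    }

    int get(int index) {
        lock.readLock().lock(); // shared: does not block other readers
        try {
            return data[index];
        } finally {
            lock.readLock().unlock();
        }
    }

    void set(int index, int value) {
        lock.writeLock().lock(); // exclusive: blocks readers and other writers
        try {
            data[index] = value;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        GuardedArray array = new GuardedArray(3);
        array.set(1, 42);
        System.out.println(array.get(1)); // 42
    }
}
```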

I am sure you already have your answer by now thanks to the answers above. Since you are working with multiple threads I will assume that you are working with multiple cores, either on the same chip or distributed. In both cases there is something that you can do to improve performance: let each thread have its own copy of the data, since access is read-only. This way they can use caches or local memory efficiently and avoid reading over inter-core or inter-processor links. This is obviously impractical if the array is too big (bigger than the largest cache).

Approach to a thread safe program

All,
What should be the approach to writing a thread-safe program? Given a problem statement, my perspective is:
1 > Start off by writing the code for a single-threaded environment.
2 > Underline the fields which need atomicity and replace them with suitable concurrent classes.
3 > Underline the critical sections and enclose them in synchronized.
4 > Test for deadlocks.
Does anyone have any suggestions on other approaches or improvements to my approach? So far, I can see myself enclosing most of the code in synchronized blocks and I am sure this is not correct.
Programming in Java

Writing correct multi-threaded code is hard, and there is not a magic formula or set of steps that will get you there. But, there are some guidelines you can follow.
Personally I wouldn't start with writing code for a single threaded environment and then converting it to multi-threaded. Good multi-threaded code is designed with multi-threading in mind from the start. Atomicity of fields is just one element of concurrent code.
You should decide which areas of the code need to be multi-threaded (in a multi-threaded app, typically not everything needs to be threadsafe). Then you need to design how those sections will be made threadsafe. The method of making one area of the code threadsafe may differ from the method used for other areas. For example, understanding whether there will be a high volume of reading vs writing is important and might affect the types of locks you use to protect the data.
Immutability is also a key element of threadsafe code. When elements are immutable (i.e. cannot be changed), you don't need to worry about multiple threads modifying them since they cannot be changed. This can greatly simplify thread safety issues and allow you to focus on where you will have multiple data readers and writers.
Understanding details of concurrency in Java (and details of the Java memory model) is very important. If you're not already familiar with these concepts, I recommend reading Java Concurrency In Practice http://www.javaconcurrencyinpractice.com/.

You should use final and immutable fields wherever possible, any other data that you want to change add inside:
synchronized (this) {
// update
}
And remember, sometimes stuff breaks, and if that happens, you don't want to prolong the program execution by trying every possible way to counter it; instead, "fail fast".

As you have asked about "thread-safety" and not concurrent performance, then your approach is essentially sound. However, a thread-safe program that uses synchronisation probably does not scale much in a multi cpu environment with any level of contention on your structure/program.
Personally I like to try and identify the highest level state changes and try and think about how to make them atomic, and have the state changes move from one immutable state to another – copy-on-write if you like. Then the actual write can be either a compare-and-set operation on an atomic variable or a synchronised update or whatever strategy works/performs best (as long as it safely publishes the new state).
This can be a bit difficult to structure if your new state is quite different (requires updates to several fields for instance), but I have seen it very successfully solve concurrent performance issues with synchronised access.
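The "move from one immutable state to another" idea can be sketched with an AtomicReference and a compare-and-set loop. Stats and StatsHolder are invented names; the two fields stand in for any state that must change together.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot: both fields always change together.
final class Stats {
    final int count;
    final long sum;

    Stats(int count, long sum) {
        this.count = count;
        this.sum = sum;
    }

    // Returns a new state instead of mutating this one (copy-on-write).
    Stats plus(long value) {
        return new Stats(count + 1, sum + value);
    }
}

class StatsHolder {
    private final AtomicReference<Stats> state =
            new AtomicReference<>(new Stats(0, 0));

    void record(long value) {
        Stats old;
        Stats next;
        do {
            old = state.get();
            next = old.plus(value);
            // Retry if another thread replaced the state in the meantime.
        } while (!state.compareAndSet(old, next));
    }

    Stats snapshot() {
        return state.get();
    }

    public static void main(String[] args) {
        StatsHolder holder = new StatsHolder();
        holder.record(5);
        holder.record(7);
        Stats s = holder.snapshot();
        System.out.println(s.count + " " + s.sum); // 2 12
    }
}
```

Readers never see a half-updated state, because they only ever obtain a complete Stats object; the cost is one allocation per update and possible retries under contention.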

Buy and read Brian Goetz's "Java Concurrency in Practice".

Any variables (memory) accessible by multiple threads potentially at the same time, need to be protected by a synchronisation mechanism.

How good is the JVM at parallel processing? When should I create my own Threads and Runnables? Why might threads interfere?

I have a Java program that runs many small simulations. It runs a genetic algorithm, where each fitness function is a simulation using parameters on each chromosome. Each one takes maybe 10 or so seconds if run by itself, and I want to run a pretty big population size (say 100?). I can't start the next round of simulations until the previous one has finished. I have access to a machine with a whack of processors in it and I'm wondering if I need to do anything to make the simulations run in parallel. I've never written anything explicitly for multicore processors before and I understand it's a daunting task.
So this is what I would like to know: to what extent, and how well, does the JVM parallelize? I have read that it creates low-level threads, but how smart is it? How efficient is it? Would my program run faster if I made each simulation a thread? I know this is a huge topic, but could you point me towards some introductory literature concerning parallel processing and Java?
Thanks very much!
Update:
Ok, I've implemented an ExecutorService and made my small simulations implement Runnable and have run() methods. Instead of writing this:
Simulator sim = new Simulator(args);
sim.play();
return sim.getResults();
I write this in my constructor:
ExecutorService executor = Executors.newFixedThreadPool(32);
And then each time I want to add a new simulation to the pool, I run this:
RunnableSimulator rsim = new RunnableSimulator(args);
executor.execute(rsim);
return rsim.getResults();
The RunnableSimulator::run() method calls the Simulator::play() method; neither takes arguments.
I think I am getting thread interference, because now the simulations error out. By error out I mean that variables hold values that they really shouldn't. No code within the simulation was changed, and before the change the simulation ran perfectly over many, many different arguments. The sim works like this: each turn it's given a game piece and loops through all the locations on the game board. It checks whether the given location is valid, and if so, commits the piece and measures that board's goodness. Now, obviously invalid locations are being passed to the commit method, resulting in index-out-of-bounds errors all over the place.
Each simulation is its own object right? Based on the code above? I can pass the exact same set of arguments to the RunnableSimulator and Simulator classes and the runnable version will throw exceptions. What do you think might cause this and what can I do to prevent it? Can I provide some code samples in a new question to help?

Java Concurrency Tutorial
If you're just spawning a bunch of stuff off to different threads, and it isn't going to be talking back and forth between different threads, it isn't too hard; just write each in a Runnable and pass them off to an ExecutorService.
You should skim the whole tutorial, but for this particular task, start here.
Basically, you do something like this:
ExecutorService executorService = Executors.newFixedThreadPool(n);
where n is the number of things you want running at once (usually the number of CPUs). Each of your tasks should be an object that implements Runnable, and you then execute it on your ExecutorService:
executorService.execute(new SimulationTask(parameters...));
Executors.newFixedThreadPool(n) will start up n threads, and execute will insert the tasks into a queue that feeds to those threads. When a task finishes, the thread it was running on is no longer busy, and the next task in the queue will start running on it. Execute won't block; it will just put the task into the queue and move on to the next one.
The thing to be careful of is that you really AREN'T sharing any mutable state between tasks. Your task classes shouldn't depend on anything mutable that will be shared among them (i.e. static data). There are ways to deal with shared mutable state (locking), but if you can avoid the problem entirely it will be a lot easier.
EDIT: Reading your edits to your question, it looks like you really want something a little different. Instead of implementing Runnable, implement Callable. Your call() method should be pretty much the same as your current run(), except it should return getResults();. Then, submit() it to your ExecutorService. You will get a Future in return, which you can use to test if the simulation is done, and, when it is, get your results.
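Putting the Callable and Future suggestion together, a sketch could look like this. The simulate method is a trivial placeholder for the real Simulator from the question, and runAll is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SimulationDemo {
    // Placeholder for the real simulation; it uses no shared mutable state.
    static int simulate(int parameter) {
        return parameter * parameter;
    }

    // Submits one task per parameter and collects the results in submission order.
    static List<Integer> runAll(List<Integer> parameters, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int p : parameters) {
                Callable<Integer> task = () -> simulate(p);
                futures.add(executor.submit(task)); // returns immediately
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList(1, 2, 3, 4), 2)); // [1, 4, 9, 16]
    }
}
```

Calling get() only after all tasks have been submitted is what lets the simulations overlap; reading the result right after execute(), as in the question's update, reads it before the task has necessarily run.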

You can also look at the new fork/join framework by Doug Lea. One of the best books on the subject is certainly Java Concurrency in Practice. I would strongly recommend that you take a look at the fork/join model.

Java threads are just too heavyweight. We have implemented parallel branches in Ateji PX as very lightweight scheduled objects. As in Erlang, you can create tens of millions of parallel branches before you start noticing an overhead. But it's still Java, so you don't need to switch to a different language.

If you are doing full-out processing all the time in your threads, you won't benefit from having more threads than processors. If your threads occasionally wait on each other or on the system, then Java scales well up to thousands of threads.
I wrote an app that discovered a class B network (65,000 addresses) in a few minutes by pinging each node, and each ping had retries with an increasing delay. When I put each ping on a separate thread (this was before NIO; I could probably improve it now), I could run about 4000 threads on Windows before things started getting flaky. On Linux the number was nearer 1000 (I never figured out why).
No matter what language or toolkit you use, if your data interacts, you will have to pay some attention to those areas where it does. Java uses the synchronized keyword to prevent two threads from accessing a section at the same time. If you write your Java in a more functional manner (making all your members final) you can run without synchronization, but solving problems takes a different approach that way.
Java has other tools to manage units of independent work; look in the java.util.concurrent package for more information.

Java is pretty good at parallel processing, but there are two caveats:
Java threads are relatively heavyweight (compared with e.g. Erlang), so don't start creating them in the hundreds or thousands. Each thread gets its own stack memory (the default depends on the platform and JVM, typically from a few hundred KB up to 1 MB, and can be changed with -Xss) and you could run out of memory, among other things.
If you run on a very powerful machine (especially with a lot of CPUs and a large amount of RAM), then the VM's default settings (especially concerning GC) may result in suboptimal performance and you may have to spend some times tuning them via command line options. Unfortunately, this is not a simple task and requires a lot of knowledge.
