Should the logic of a Spark transformation or action be thread safe? - java

This may be a stupid question. However, I would like to know if I have something like this - rdd.mapPartitions(func). Should the logic in func be threadsafe?
Thanks

The short answer is no, it does not have to be thread safe.
The reason for this is that Spark divides the data between partitions. It then creates a task for each partition, and the function you write runs within that specific partition as a single-threaded operation (i.e. no other thread accesses the same data).
That said, you have to make sure you do not create thread "unsafety" manually by accessing resources which are not the RDD data. For example, if you create a static object and access that, it might cause issues as multiple tasks might run in the same executor (JVM) and access it as well. That said, you shouldn't be doing something like that to begin with unless you know exactly what you are doing...
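To make the point about shared objects concrete, here is a minimal sketch that only simulates the situation: a plain thread pool stands in for the task threads of one executor JVM, and a "partition" is just a List<Integer>. No Spark API is used, and the names PartitionTaskSketch, RunningSum, SUM_PARTITION and runOnPool are made up for illustration. The function stays safe because the mutable helper is created inside it, so each task gets its own copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Simulation only: a thread pool stands in for task threads inside one executor JVM.
// All names here are hypothetical; nothing below is Spark API.
class PartitionTaskSketch {

    // Deliberately not thread safe: it keeps mutable state between calls.
    static class RunningSum {
        private long total = 0;
        long add(int value) { total += value; return total; }
    }

    // Safe: the mutable helper is created inside the function, so each task owns its own copy.
    static final Function<List<Integer>, Long> SUM_PARTITION = partition -> {
        RunningSum sum = new RunningSum(); // one instance per task, never shared
        long last = 0;
        for (int v : partition) last = sum.add(v);
        return last;
    };

    // Runs the function once per partition on a small pool, like tasks sharing one JVM.
    static List<Long> runOnPool(List<List<Integer>> partitions, Function<List<Integer>, Long> func) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<Integer> p : partitions) futures.add(pool.submit(() -> func.apply(p)));
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : futures) results.add(f.get());
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(List.of(1, 2, 3), List.of(10, 20), List.of(100));
        System.out.println(runOnPool(partitions, SUM_PARTITION)); // [6, 30, 100]
    }
}
```

If RunningSum were instead held in a static field and shared by every task, the per-partition totals could interleave and come out wrong, which is the kind of manual "unsafety" described above.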

Any function passed to the mapPartitions (or any other action or transformation) has to be thread safe. Spark on JVM (this is not necessarily true for guest languages) uses executor threads and doesn't guarantee any isolation between individual tasks.
This is particularly important when you use resources which are not initialized in the function but passed in with the closure, for example objects that are initialized in the main function and then referenced in the function.
It goes without saying you should not modify any of the arguments unless it is explicitly allowed.

When you do "rdd.mapPartitions(func)", the func may actually execute in a different JVM. Threads have no significance across JVMs.
If you are running in local mode and using global state or thread-unsafe functions, the job might work as expected, but the behaviour is not defined or supported.

Related

synchronize on method parameter

I have a pretty basic method,
handleOrder(IOrder order) {
    //do stuff
}
I was having issues in that new quotes would update the order, so I wanted to synchronize on the order parameter. So my code would look like:
handleOrder(IOrder order) {
    synchronized(order){
        //do stuff
    }
}
Now however, intellij is complaining that:
Synchronization on method parameter 'order'
Inspection info: Reports synchronization on a local variable or parameter. It is very difficult to guarantee correctness when such synchronization is used. It may be possible to improve code like this by controlling access through e.g. a synchronized wrapper class, or by synchronizing on a field.
Is this something I actually need to be concerned about?
Yes, because this type of synchronization is generally an indication that the code cannot easily be reviewed to ensure that deadlocks don't take place.
When you synchronize on a field, you're combining the synchronization code with the instance being used in a way that permits you to have most, if not all of the competing methods in the same file. This makes it easier to review the file for deadlocks and errors in the synchronization approach. The same idea applies when using a synchronized wrapper class.
When you synchronize on a passed instance (local field) then you need to review all of the code of the entire application for other synchronization efforts on the same instance to get the same level of security that a mistake was not made. In addition, this will have to be done frequently, as there is little assurance that after the next commit, a developer will have done the same code scan to make sure that their synchronization didn't impact code that lived in some remote directory (or even in a remote JAR file that doesn't have source code on their machine).
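As a rough sketch of the "synchronize on a field" alternative that the inspection suggests: the class below guards its state with a private lock object, so every piece of code that can take the lock lives in this one file. The names OrderBook, applyQuote and currentPrice are hypothetical and are not from the question. Note that this locks per book rather than per order, which is a different granularity from the original code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; the point is only where the lock lives.
class OrderBook {
    // Private lock: only code in this class can synchronize on it.
    private final Object lock = new Object();
    private final Map<String, Double> priceByOrderId = new HashMap<>();

    // Called when a new quote arrives for an order.
    void applyQuote(String orderId, double price) {
        synchronized (lock) {
            priceByOrderId.put(orderId, price);
        }
    }

    // Returns the latest price, or null if the order is unknown.
    Double currentPrice(String orderId) {
        synchronized (lock) {
            return priceByOrderId.get(orderId);
        }
    }

    public static void main(String[] args) {
        OrderBook book = new OrderBook();
        book.applyQuote("A-1", 101.5);
        book.applyQuote("A-1", 102.0);
        System.out.println(book.currentPrice("A-1")); // 102.0
    }
}
```

Because the lock is private, a reviewer only has to read this class to see every place it is acquired.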

How to make all class methods run only on single thread? (synchronized class?)

I know the synchronized keyword makes a method run on only a single thread at a time. But here is the problem.
I have a database class with methods e.g. insertAccount, updateSetting, etc. If I make insertAccount, updateSetting synchronized, each of them will be able to run only on one thread at a time.
If there were one method for the whole database, it would be great, but there isn't just one. If one thread calls insertAccount and another thread calls updateSetting at the same time, it will go bad, right?
Because only one of these methods can be run at any time. So what do I do?
Is there a way to apply something like synchronized to the whole class? So that if 1st thread calls insertAccount and 2nd thread calls updateSetting at the same time, 2nd thread has to wait until 1st thread finishes accessing database.
The real answer here: step back and do some studying. You should not be using synchronized here, but rather look into a lock object that a reader/writer needs to acquire prior to turning to that "DB class". See here for more information.
On the other hand, you should understand what transactions are, and how your database supports those. Meaning: there are different kinds of problems; and the different layers (application code, database) have different responsibilities.
You see, "trial and error" isn't an approach that will work out here. You should spend some serious time studying the underlying concepts. Otherwise you risk damaging your data set; and worse, you risk writing code that works fine most of the time but fails in obscure ways "randomly". Because that is what happens when multiple threads manipulate shared data in an uncontrolled manner.
You misunderstood how synchronized works.
If you mark two methods of a class as synchronized, only one of them can execute at any given moment on the same instance (unless the running thread calls wait).
Also note that if you have several instances of this class you can execute methods of different instances simultaneously.
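A small sketch of that behaviour, using a made-up AccountStore class in place of the database class from the question (no real database is involved). Both methods are synchronized, so they share the monitor of the instance; the counters record how many threads were ever inside at once.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the "database class" in the question; no real database is touched.
class AccountStore {
    private final AtomicInteger inside = new AtomicInteger();    // threads currently inside a method
    private final AtomicInteger maxInside = new AtomicInteger(); // highest value ever observed

    // Both methods lock the same monitor (this), so they exclude each other.
    synchronized void insertAccount() { work(); }
    synchronized void updateSetting() { work(); }

    private void work() {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(1); // pretend to do some I/O
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        inside.decrementAndGet();
    }

    int maxConcurrent() { return maxInside.get(); }

    // Calls both methods from two threads on the SAME instance and reports the overlap.
    static int demo() {
        AccountStore store = new AccountStore();
        Thread a = new Thread(() -> { for (int i = 0; i < 20; i++) store.insertAccount(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 20; i++) store.updateSetting(); });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return store.maxConcurrent();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1
    }
}
```

With two separate AccountStore instances the two threads would not block each other, since each instance has its own monitor.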
@Test(singleThreaded = true): put this annotation on the class and its tests will run on a single thread, even if you have set parallel="methods" in your testng.xml file.

Java multi-threading accessing same variable

I have a Java program which creates 2 threads. Inside these 2 threads, they try to update the global variable abc to different values, let's say integer 1 and integer 3.
Let's say they execute the code at the same time (in the same millisecond), for example:
public class MyThread implements Runnable{
    public void run(){
        while(true){
            if (currentTime == specificTime){
                abc = 1; //another thread update abc to 3
            }
        }
    }
}
In this case, how can we determine the result of the variable abc? I am very curious how the operating system schedules the execution.
(I know synchronization should be used, but I just want to know how the system naturally handles this kind of conflict.)
The operating system has little involvement in this: at the time your threads are running, the memory allocated to abc is under control of JVM running your program, so it's your program that is in control.
When two threads access the same memory location, the last writer wins. Which particular thread gets to be the last writer, however, is non-deterministic, unless you use synchronization.
Moreover, without you taking special care of accessing the shared data, one thread may not even see the results of the other thread writing to the abc location.
To avoid synchronization issues, you should use synchronization or one of the java.util.concurrent.atomic classes.
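For illustration, a minimal sketch using java.util.concurrent.atomic.AtomicInteger. The class and method names (SharedAbcSketch, lastWriterWins, countWithAtomic) are invented for this example. The first method mirrors the question: two threads store 1 and 3, and the result is whichever store came last. The second shows what the atomic class does guarantee: no increment is lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names; only AtomicInteger and Thread come from the JDK.
class SharedAbcSketch {

    // Two threads store different values; the survivor is whichever store happened last.
    static int lastWriterWins() {
        AtomicInteger abc = new AtomicInteger(0);
        Thread t1 = new Thread(() -> abc.set(1));
        Thread t2 = new Thread(() -> abc.set(3));
        runBoth(t1, t2);
        return abc.get(); // 1 or 3; which one can vary between runs
    }

    // Two threads each add 1 perThread times; atomic increments are never lost.
    static int countWithAtomic(int perThread) {
        AtomicInteger counter = new AtomicInteger(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) counter.incrementAndGet();
        };
        runBoth(new Thread(work), new Thread(work));
        return counter.get();
    }

    private static void runBoth(Thread a, Thread b) {
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lastWriterWins());
        System.out.println(countWithAtomic(100_000)); // 200000
    }
}
```

The atomic class makes each individual operation safe and visible; it does not decide which thread goes last.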
From Java's perspective the situation is fairly simple if abc is not volatile or accessed with appropriate synchronisation.
Let's assume that abc is 0 originally. After your two threads have updated it to respectively 1 and 3, abc could be observed in three states: 0, 1 or 3. Which value you get is not deterministic and the result may vary from one run to the other.
Depends on the operating system, running environment etc.
Some environments will actually stop you from doing this - known as thread safety.
Otherwise the results are totally unpredictable which is why it is so dangerous to do this.
It mainly just depends on which thread updated it last for what the value will be. One thread will get CPU cycles before the other to do the atomic operation first.
Also, I don't think that operating systems go as far as to schedule threads, because in most operating systems it is the program that is responsible for them, and without explicit calls like synchronise or a thread pool model, I think the order of execution is pretty hard to predict. It's a very environment-dependent thing.
From the system's perspective the result will depend on many software, hardware and run-time factors that cannot be known in advance. From this perspective there is no conflict nor a problem.
From the programmer's perspective the result is not deterministic and is therefore a problem/conflict. The conflict needs to be resolved at design time.
"In this case, how can we determine the result of the variable abc? I am very curious how Operating System schedule the execution?"
The result will not be deterministic, as the value will be the last written one. You can not make any guarantee about the result. The execution is scheduled like any other one. As you demand no synchronization in your code the JVM will not enforce anything for you.
"I know Synchronize should be used, but I just want to know naturally how the system will handle this kind of conflict problem."
Simply put: it won't, as for the system there is no conflict. Problems will occur only for you, the programmer, since you will eventually run into a data race and non-deterministic behavior. It is completely up to you.
Just add the volatile modifier to your variable; then updates will be visible to all threads, and a thread reading it will get its current value. volatile means the value is always up to date for every thread accessing it.

Non blocking strategy for executing a pair of operations atomically in Java

Let's say I have a Set and another Queue. I want to check whether the set contains(element) and, if not, add(element) to the queue. I want to do the two steps atomically.
One obvious way is to use synchronized blocks or Lock.lock()/unlock() methods. Under thread contention, these will cause context switches. Is there any simple design strategy for achieving this in a non-blocking manner? Maybe using some atomic constructs?
I don't think you can rely on any mechanism, except the ones you pointed out yourself, simply because you're operating on two structures.
There's decent support for concurrent/atomic operations on one data structure (like "put if not exists" in a ConcurrentHashMap), but for a sequence of operations, you're stuck with either a lock or a synchronized block.
For some operations you can employ what is called a "safe sequence", where concurrent operations may overlap without conflicting. For instance, you might be able to add a member to a set (in theory) without the need to synchronize, since two threads simultaneously adding the same member do not conceptually conflict with each other.
But to query one object and then conditionally operate on a different object is a much more complicated scenario. If your sequence was to query the set, then conditionally insert the member into the set and into the queue, the query and first insert could be replaced with a "compare and swap" operation that syncs without stalling (except perhaps at the memory access level), and then one could insert the member into the queue based on the success of the first operation, only needing to synchronize the queue insert itself. However, this sequence leaves the scenario where another thread could fail the insert and still not find the member in the queue.
Since the contention case is the relevant case you should look at "spin locks". They do not give away the CPU but spin on a flag expecting the flag to be free very soon.
Note however that real spin locks are seldom useful in Java because the normal Lock is quite good. See this blog where someone had first implemented a spinlock in Java only to find that after some corrections (i.e. after making the test correct) spin locks are on par with the standard stuff.
You can use java.util.concurrent.ConcurrentHashMap to get the semantics you want. It has a putIfAbsent method that does an atomic insert. You then essentially try to add an element to the map, and if it succeeds, you know that the thread that performed the insert is the only one that has, and you can then put the item in the queue safely. The other significant point here is that the operations on a ConcurrentMap ensure "happens-before" semantics.
ConcurrentMap<Element,Boolean> set = new ConcurrentHashMap<Element,Boolean>();
Queue<Element> queue = ...;
void maybeAddToQueue(Element e) {
    if (set.putIfAbsent(e, true) == null) {
        queue.offer(e);
    }
}
Note, the actual value type (Boolean) of the map is unimportant here.

Java logging across multiple threads

We have a system that uses threading so that it can concurrently handle different bits of functionality in parallel. We would like to find a way to tie all log entries for a particular "transaction" together. Normally, one might use 'threadName' to gather these together, but clearly that fails in a multithreaded situation.
Short of passing a 'transaction key' down through every method call, I can't see a way to tie these together. And passing a key into every single method is just ugly.
Also, we're kind of tied to Java logging, as our system is built on a modified version of it. So, I would be interested in other platforms for examples of what we might try, but switching platforms is highly unlikely.
Does anyone have any suggestions?
Thanks,
Peter
EDIT: Unfortunately, I don't have control over the creation of the threads as that's all handled by a workflow package. Otherwise, the idea of caching the ID once for each thread (on ThreadLocal maybe?) then setting that on the new threads as they are created is a good idea. I may try that anyway.
You could consider creating a globally-accessible Map that maps a Thread's name to its current transaction ID. Upon beginning a new task, generate a GUID for that transaction and have the Thread register itself in the Map. Do the same for any Threads it spawns to perform the same task. Then, when you need to log something, you can simply lookup the transaction ID from the global Map, based on the current Thread's name. (A bit kludgy, but should work)
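A rough sketch of that map-based idea, assuming thread names are unique for the lifetime of a transaction (Java does not guarantee this, so treat it as an assumption). The class TransactionRegistry and its methods begin, adopt, current, end and format are hypothetical names, and the formatted string stands in for whatever your logger produces.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: maps a thread's name to the transaction it is currently working on.
class TransactionRegistry {
    private static final Map<String, String> TX_BY_THREAD = new ConcurrentHashMap<>();

    // Starts a new transaction on the calling thread and returns its generated ID.
    static String begin() {
        String id = UUID.randomUUID().toString();
        TX_BY_THREAD.put(Thread.currentThread().getName(), id);
        return id;
    }

    // Registers the calling thread under an existing transaction (for spawned threads).
    static void adopt(String txId) {
        TX_BY_THREAD.put(Thread.currentThread().getName(), txId);
    }

    static String current() {
        return TX_BY_THREAD.getOrDefault(Thread.currentThread().getName(), "no-tx");
    }

    // Must be called when the work is done, or reused threads keep a stale ID.
    static void end() {
        TX_BY_THREAD.remove(Thread.currentThread().getName());
    }

    // Stand-in for the logging call: prefixes the message with the transaction ID.
    static String format(String message) {
        return "[" + current() + "] " + message;
    }

    // Parent and child thread log under the same transaction ID.
    static String[] demo() {
        String tx = begin();
        String[] lines = new String[2];
        lines[0] = format("parent step");
        Thread worker = new Thread(() -> {
            adopt(tx);
            lines[1] = format("child step");
            end();
        }, "worker-1");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        end();
        return lines;
    }

    public static void main(String[] args) {
        for (String line : demo()) System.out.println(line);
    }
}
```

The child still has to be handed the ID once, when it starts; after that, nothing further needs to be passed down the call chain.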
This is a perfect example for AspectJ crosscuts. If you know the methods that are being called you can put interceptors on them and bind dynamically.
This article will give you several options http://www.ibm.com/developerworks/java/library/j-logging/
However you mentioned that your transaction spans more than one thread, take a look at how log4j cope with binding additional information to current thread with MDC and NDC classes. It uses ThreadLocal as you were advised before, but interesting thing is how log4j injects data into log messages.
//In the code:
MDC.put("RemoteAddress", req.getRemoteAddr());
//In the configuration file, add the following:
%X{RemoteAddress}
Details:
http://onjava.com/pub/a/onjava/2002/08/07/log4j.html?page=3
http://wiki.apache.org/logging-log4j/NDCvsMDC
How about naming your threads to include the transaction ID? Quick and Dirty, admittedly, but it should work (until you need the thread name for something else or you start reusing threads in a thread pool).
If you are logging, then you must have some kind of logger object. You should have a separate instance in each thread.
Add a method to it called setID(String id).
When it is initialized in your thread, set a unique ID using the method.
Prepend the set ID to each log entry.
A couple people have suggested answers that have the newly spawned thread somehow knowing what the transaction ID is. Unless I'm missing something, in order to get this ID into the newly spawned thread, I would have to pass it all the way down the line into the method that spawns the thread, which I'd rather not do.
I don't think you need to pass it down, but rather the code responsible for handing work to these threads needs to have the transactionID to pass. Wouldn't the work-assigner have this already?
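One way this could look, assuming the work-assigner can wrap the tasks it hands out (which may not hold if the workflow package creates them itself). The wrapper installs the ID in a ThreadLocal for the duration of the task, so code further down can read it without it being passed as a parameter. TransactionContext, propagating and current are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper: carries a transaction ID onto whichever thread runs the task.
class TransactionContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Returns the ID installed on the calling thread, or null if there is none.
    static String current() {
        return CURRENT.get();
    }

    // Wraps a task so the given ID is visible while it runs and restored afterwards.
    static Runnable propagating(String txId, Runnable task) {
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(txId);
            try {
                task.run();
            } finally {
                if (previous == null) CURRENT.remove();
                else CURRENT.set(previous);
            }
        };
    }

    // Runs a wrapped task and then a plain task on the same pooled thread.
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            String[] seen = new String[2];
            pool.submit(propagating("tx-42", () -> seen[0] = current())).get();
            pool.submit(() -> { seen[1] = String.valueOf(current()); }).get();
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println(seen[0]); // tx-42
        System.out.println(seen[1]); // null
    }
}
```

The cleanup in the finally block matters with thread pools: without it, the next unrelated task on the same thread would log under the old transaction ID.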
