Distributed Concurrency Control - Java

I've been working on this for a few days now, and I've found several solutions, but none of them is especially simple or lightweight. The problem is basically this: we have a cluster of 10 machines, each of which is running the same software on a multithreaded ESB platform. I can deal with concurrency issues between threads on the same machine fairly easily, but what about concurrency on the same data on different machines?
Essentially the software receives requests to feed a customer's data from one business to another via web services. However, the customer may or may not exist yet on the other system. If it does not, we create it via a web service method. So it requires a sort of test-and-set, but I need a semaphore of some sort to lock out the other machines from causing race conditions. I've had situations before where a remote customer was created twice for a single local customer, which isn't really desirable.
Solutions I've toyed with conceptually are:
Using our fault-tolerant shared file system to create "lock" files which each machine will check for, keyed by customer
Using a special table in our database, and locking the whole table in order to do a "test-and-set" for a lock record.
Using Terracotta, open source server software which assists in scaling but uses a hub-and-spoke model.
Using EHCache for synchronous replication of my in-memory "locks."
I can't imagine that I'm the only person who's ever had this kind of problem. How did you solve it? Did you cook something up in-house or do you have a favorite 3rd-party product?

You might want to consider using Hazelcast distributed locks. Super light and easy.
java.util.concurrent.locks.Lock lock = Hazelcast.getLock("mymonitor");
lock.lock();
try {
    // do your stuff
} finally {
    lock.unlock();
}
Hazelcast - Distributed Queue, Map, Set, List, Lock

We use Terracotta, so I would like to vote for that.
I've been following Hazelcast and it looks like another promising technology, but I can't vote for it since I've not used it, and knowing that it uses a P2P-based system at its heart, I really would not trust it for large scaling needs.
But I have also heard of ZooKeeper, which came out of Yahoo and is moving under the Hadoop umbrella. If you're adventurous about trying out new technology, this one really has lots of promise since it's very lean and mean, focusing on just coordination. I like the vision and promise, though it might be too green still.
http://www.terracotta.org
http://wiki.apache.org/hadoop/ZooKeeper
http://www.hazelcast.com

Terracotta is closer to a "tiered" model - all client applications talk to a Terracotta Server Array (and more importantly for scale they don't talk to one another). The Terracotta Server Array is capable of being clustered for both scale and availability (mirrored, for availability, and striped, for scale).
In any case, as you probably know, Terracotta gives you the ability to express concurrency across the cluster the same way you do in a single JVM, by using POJO synchronized/wait/notify or by using any of the java.util.concurrent primitives such as ReentrantReadWriteLock, CyclicBarrier, AtomicLong, FutureTask and so on.
There are a lot of simple recipes demonstrating the use of these primitives in the Terracotta Cookbook.
As an example, I will post the ReentrantReadWriteLock example (note there is no "Terracotta" version of the lock - you just use normal Java ReentrantReadWriteLock)
import java.util.concurrent.locks.*;

public class Main
{
    public static final Main instance = new Main();
    private int counter = 0;
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock(true);

    public void read()
    {
        while (true) {
            rwl.readLock().lock();
            try {
                System.out.println("Counter is " + counter);
            } finally {
                rwl.readLock().unlock();
            }
            try { Thread.sleep(1000); } catch (InterruptedException ie) { }
        }
    }

    public void write()
    {
        while (true) {
            rwl.writeLock().lock();
            try {
                counter++;
                System.out.println("Incrementing counter. Counter is " + counter);
            } finally {
                rwl.writeLock().unlock();
            }
            try { Thread.sleep(3000); } catch (InterruptedException ie) { }
        }
    }

    public static void main(String[] args)
    {
        if (args.length > 0) {
            // args --> Writer
            instance.write();
        } else {
            // no args --> Reader
            instance.read();
        }
    }
}

I recommend using Redisson. It implements over 30 distributed data structures and services, including java.util.concurrent.locks.Lock. Usage example:
Config config = new Config();
config.addAddress("some.server.com:8291");
Redisson redisson = Redisson.create(config);
Lock lock = redisson.getLock("anyLock");
lock.lock();
try {
    ...
} finally {
    lock.unlock();
}
redisson.shutdown();

I was going to advise using memcached as a very fast, distributed RAM storage for keeping locks; but it seems that EHCache is a similar project, only more Java-centric.
Either one is the way to go, as long as you're sure to use atomic updates (memcached supports them; I don't know about EHCache). It's by far the most scalable solution.
As a related datapoint, Google uses 'Chubby', a fast, RAM-based distributed lock storage, as the root of several systems, among them BigTable.

I have done a lot of work with Coherence, which allowed several approaches to implementing a distributed lock. The naive approach was to request a lock on the same logical object on all participating nodes; in Coherence terms, this was locking a key in a Replicated Cache. This approach doesn't scale that well, because the network traffic increases linearly as you add nodes. A smarter way was to use a Distributed Cache, where each node in the cluster is naturally responsible for a portion of the key space, so locking a key in such a cache always involved communication with at most one node. You could roll your own approach based on this idea, or, better still, get Coherence. It really is the scalability toolkit of your dreams.
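A rough sketch of that key-based approach, assuming the classic Coherence explicit-locking API (the cache name and timeout here are made up):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class CustomerLock {
    public void createRemoteCustomerOnce(String customerId) {
        // each key is owned by exactly one storage node, so this lock call
        // talks to at most one other member of the cluster
        NamedCache locks = CacheFactory.getCache("customer-locks");
        if (locks.lock(customerId, 5000)) { // wait up to 5 seconds
            try {
                // test-and-set: check whether the remote customer exists,
                // and create it via the web service only if it does not
            } finally {
                locks.unlock(customerId);
            }
        }
    }
}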
I would add that any half decent multi-node network based locking mechanism would have to be reasonably sophisticated to act correctly in the event of any network failure.

Not sure if I understand the entire context, but it sounds like you have a single database backing this? Why not make use of the database's locking: if creating the customer is a single INSERT, then this statement alone can serve as a lock, since the database will reject a second INSERT that would violate one of your constraints (e.g. the fact that the customer name is unique).
If the "inserting of a customer" operation is not atomic and is a batch of statements, then I would introduce (or use) an initial INSERT that creates a simple basic record identifying your customer (with the necessary UNIQUE constraints) and then do all the other inserts/updates in the same transaction. Again the database will take care of consistency, and any concurrent modifications will result in one of them failing.

I made a simple RMI service with two methods: lock and release. Both methods take a key (my data model used UUIDs as PKs, so that was also the locking key).
RMI is a good solution for this because it's centralized. You can't do this with EJBs (especially in a cluster, as you don't know on which machine your call will land). Plus, it's easy.
It worked for me.
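A hedged sketch of what such a service might look like (the interface and names here are made up; the single implementation would be bound in one RMI registry that all machines point at):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// the single, central lock service every node in the cluster calls
public interface LockService extends Remote {
    // returns true if the key was free and is now held by the caller
    boolean lock(String key) throws RemoteException;
    void release(String key) throws RemoteException;
}

// in its own file
class LockServiceImpl extends UnicastRemoteObject implements LockService {
    private final Set<String> held = Collections.synchronizedSet(new HashSet<String>());

    public LockServiceImpl() throws RemoteException { }

    public boolean lock(String key) throws RemoteException {
        return held.add(key); // add is atomic on a synchronized set
    }

    public void release(String key) throws RemoteException {
        held.remove(key);
    }
}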

If you can set up your load balancing so that requests for a single customer always get mapped to the same server then you can handle this via local synchronization. For example, take your customer ID mod 10 to find which of the 10 nodes to use.
Even if you don't want to do this in the general case your nodes could proxy to each other for this specific type of request.
Assuming your users are uniform enough (i.e. if you have a ton of them) that you don't expect hot spots to pop up where one node gets overloaded, this should still scale pretty well.
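A trivial sketch of that routing rule (assuming numeric customer IDs and a hypothetical table of node addresses; string IDs would be hashed first):

public class CustomerRouter {
    private final String[] clusterHosts; // the 10 node addresses

    public CustomerRouter(String[] clusterHosts) {
        this.clusterHosts = clusterHosts;
    }

    // every request for a given customer lands on the same node,
    // so plain JVM-local synchronization on that node suffices
    public String hostFor(long customerId) {
        return clusterHosts[(int) (customerId % clusterHosts.length)];
    }
}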

You might also consider Cacheonix for distributed locks. Unlike anything else mentioned here, Cacheonix supports ReadWrite locks, with lock escalation from read to write when needed:
ReadWriteLock rwLock = Cacheonix.getInstance().getCluster().getReadWriteLock();
Lock lock = rwLock.getWriteLock();
lock.lock();
try {
    ...
} finally {
    lock.unlock();
}
Full disclosure: I am a Cacheonix developer.

Since you are already connecting to a database, before adding another infra piece, take a look at JdbcSemaphore. It is simple to use:
JdbcSemaphore semaphore = new JdbcSemaphore(ds, semName, maxReservations);
boolean acquired = semaphore.tryAcquire(1, TimeUnit.MINUTES);
if (acquired) {
    try {
        // do stuff
    } finally {
        semaphore.release();
    }
} else {
    throw new TimeoutException();
}
It is part of the spf4j library.

Back in the day, we'd use a specific "lock server" on the network to handle this. Bleh.
Your database server might have resources specifically for doing this kind of thing. MS SQL Server has application locks, usable through the sp_getapplock/sp_releaseapplock stored procedures.
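A hedged sketch of calling it from JDBC (the resource name is made up; sp_getapplock returns a value >= 0 on success):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

public boolean acquireAppLock(Connection conn, String resource) throws SQLException {
    // session-scoped exclusive lock, waiting up to 10 seconds
    try (CallableStatement cs = conn.prepareCall("{? = call sp_getapplock(?, ?, ?, ?)}")) {
        cs.registerOutParameter(1, Types.INTEGER);
        cs.setString(2, resource);    // @Resource, e.g. "customer-42"
        cs.setString(3, "Exclusive"); // @LockMode
        cs.setString(4, "Session");   // @LockOwner
        cs.setInt(5, 10000);          // @LockTimeout in milliseconds
        cs.execute();
        return cs.getInt(1) >= 0;     // 0 = granted, 1 = granted after waiting
    }
}

Release it the same way with sp_releaseapplock when the work is done.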

We have been developing an open source distributed synchronization framework; currently DistributedReentrantLock and DistributedReentrantReadWriteLock have been implemented, but they are still in the testing and refactoring phase. In our architecture, lock keys are divided into buckets and each node is responsible for a certain number of buckets. So, effectively, a successful lock request takes only one network request. We are also using the AbstractQueuedSynchronizer class as the local lock state, so all failed lock requests are handled locally; this drastically reduces network traffic.
We are using JGroups (http://jgroups.org) for group communication and Hessian for serialization.
for details, please check out http://code.google.com/p/vitrit/.
Please send me your valuable feedback.
Kamran

Related

Do concurrent web crawlers typically store visited URLs in a concurrent map, or use synchronization to avoid crawling the same pages twice?

I'm playing around writing a simple multi-threaded web crawler. I see a lot of sources talk about web crawlers as obviously parallel because you can start crawling from different URLs, but I never see them discuss how web crawlers handle URLs that they've already seen before. It seems that some sort of global map would be essential to avoid re-crawling the same pages over and over, but how would the critical section be structured? How fine grained can the locks be to maximize performance? I just want to see a good example that's not too dense and not too simplistic.
Specific domain use case: use in-memory
If it is a specific domain, say abc.com, then it is better to have a visited-URL Set or ConcurrentHashMap in memory: checking the visited status in memory will be faster, and memory consumption will be comparatively low. A DB has IO overhead, which is costly, and the visited-status check will be very frequent, so it would hit your performance drastically. Depending on your use case, you can use in-memory or a DB. My use case was specific to one domain, where a visited URL will not be visited again, so I used a ConcurrentHashMap.
If you insist on doing it using only the Java concurrency framework, then ConcurrentHashMap may be the way to go. The interesting method in it is ConcurrentHashMap.putIfAbsent; it will give you very good efficiency, and the idea of how to use it is:
You will have some "multithreaded source of incoming URL addresses" from crawled pages - you can use some concurrent queue to store them, or just create an ExecutorService with an (unbounded?) queue in which you will place Runnables that will crawl the URLs.
Inside the crawling Runnables you should have a reference to this common ConcurrentHashMap of already-crawled pages, and at the very beginning of the run method do:
private final ConcurrentHashMap<String, Long> crawledPages = new ConcurrentHashMap<String, Long>();
...

private class Crawler implements Runnable {
    private final String urlToBeCrawled;

    public Crawler(String urlToBeCrawled) {
        this.urlToBeCrawled = urlToBeCrawled;
    }

    public void run() {
        // putIfAbsent returns null only for the first thread to claim this URL
        if (crawledPages.putIfAbsent(urlToBeCrawled, System.currentTimeMillis()) == null) {
            doCrawlPage(urlToBeCrawled);
        }
    }
}
If crawledPages.putIfAbsent(urlToBeCrawled, ...) returns null, then you know that this page was not crawled by anyone; since this method atomically puts the value, you can proceed with crawling this page - you're the lucky thread. If it returns a non-null value, then you know someone has already taken care of this URL, so your Runnable should finish, and the thread goes back to the pool to be used by the next Runnable.
You can use a ConcurrentHashMap to find duplicate URLs. A ConcurrentHashMap also uses a striped-lock mechanism instead of a global lock.
Or you can use your own implementation where you split all your data among different keys.
For an example using the Guava API:
Striped<ReadWriteLock> rwLockStripes = Striped.readWriteLock(10);
String key = "taskA";
ReadWriteLock rwLock = rwLockStripes.get(key);
rwLock.writeLock().lock();
try {
    .....
} finally {
    rwLock.writeLock().unlock();
}
ConcurrentHashMap example
private Set<String> urls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
For a crawler, don't use ConcurrentHashMap; rather, use a database
The number of visited URLs will grow very fast, so it is not a good idea to store them in memory; better to use a database: store the URL and the date it was last crawled, then just check whether the URL already exists in the DB or is eligible for refreshing. I use, for example, a Derby DB in embedded mode, and it works perfectly for my web crawler. I don't advise using an in-memory DB like H2, because with the number of crawled pages you will eventually get an OutOfMemoryError.
You will rather rarely have the case of crawling the same page more than once at the same time, so checking in the DB whether it was already crawled recently is enough to not waste significant resources on "re-crawling the same pages over and over". I believe this is "a good solution that's not too dense and not too simplistic".
Also, using the database with a "last visit date" for each URL, you can stop and continue the work whenever you want; with a ConcurrentHashMap you will lose all the results when the app exits. You can use the "last visit date" for a URL to determine whether it needs recrawling or not.
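A minimal sketch of that check with embedded Derby (the table and columns are hypothetical; the connection URL jdbc:derby:crawlerdb;create=true creates the database on first use):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// returns true if the URL is new (recording it) or stale enough to recrawl
public boolean shouldCrawl(Connection conn, String url, long refreshMillis) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT LAST_VISIT FROM VISITED_URLS WHERE URL = ?")) {
        ps.setString(1, url);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                // known URL: eligible only if the last visit is old enough
                return System.currentTimeMillis() - rs.getTimestamp(1).getTime() > refreshMillis;
            }
        }
    }
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO VISITED_URLS (URL, LAST_VISIT) VALUES (?, CURRENT_TIMESTAMP)")) {
        ps.setString(1, url);
        ps.executeUpdate();
        return true; // first sighting
    }
}

The SELECT-then-INSERT is not atomic, but as noted above, crawling the same page twice at the same instant is rare enough not to matter.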

Global distributed lock that can be set to expire - Java

I have a use case where I want to have a globally distributed lock. We started out using SELECT ... FOR UPDATE, but that quickly started to have problems as we scaled up the number of servers. It also didn't account for processes that checked out the lock and then died and failed to return it.
We need to be able to set an expiration on the lock (i.e. if the process that checked out the lock does not return it within 2 hours, the lock is automatically returned to the pool). I realize that this introduces the issue of ignoring locks, but we are fairly certain that the process has died if it has not completed within 2 hours. Also, the job is idempotent, so if it is done more than once it's not a big deal.
I've looked through a number of distributed locking systems and come across questions that have been extremely helpful. All of the solutions extend Java's java.util.concurrent.locks.Lock, which may actually be the issue I'm running into, because that interface doesn't have the expiration feature I need. We have a similar strategy to mongo-java-distributed-lock, where we use MongoDB's findAndModify. We're considering:
mongo-java-distributed-lock
Redisson
Hazelcast
as our distributed locking mechanism (all of which happen to implement java.util.concurrent.locks.Lock).
The biggest problem is that, because java.util.concurrent.locks.Lock doesn't have an option for expiring a lock, these don't fit all the goals. This answer probably gets closest with Hazelcast, but it relies on an entire server failing, not just a thread taking too long. Another option is possibly using a Semaphore with Hazelcast, as described here. I could have a reaper thread that is then able to cancel the locks of others if they are taking too long. With Mongo and Redis I could take advantage of their ability to expire objects, but that doesn't seem to be part of either of the libraries, since they just implement java.util.concurrent.locks.Lock in the end.
So this was just a long-winded way of asking: is there a distributed locking mechanism out there that can automatically expire after N seconds? Should I be looking at a different mechanism than java.util.concurrent.locks.Lock in this situation altogether?
You may use Redisson, which is based on the Redis server. It implements familiar Java data structures, including java.util.concurrent.locks.Lock, with distributed and scalable abilities - including the ability to set a lock release timeout. Usage example:
Config config = new Config();
// for a single server
config.useSingleServer()
      .setAddress("127.0.0.1:6379");
// or, for master/slave servers:
// config.useSentinelConnection()
//       .setMasterName("mymaster")
//       .addSentinelAddress("127.0.0.1:26389", "127.0.0.1:26379");

Redisson redisson = Redisson.create(config);

RLock lock = redisson.getLock("anyLock");
// unlocks automatically after 10 seconds of hold
lock.lock(10, TimeUnit.SECONDS);
try {
    // ... do the work ...
} finally {
    lock.unlock();
}
...
redisson.shutdown();
You should consider using ZooKeeper. There is an easy-to-use library for this kind of "distributed" stuff built on top of ZooKeeper: the Curator framework. I think what you are looking for is the shared reentrant lock. You can also check the other locks in the recipes.
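A hedged sketch with Curator's InterProcessMutex (the connection string and lock path are made up; ZooKeeper also releases the lock automatically if the holder's session dies, which covers the crashed-process case):

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-job");
        // wait up to 30 seconds for the lock
        if (lock.acquire(30, TimeUnit.SECONDS)) {
            try {
                // do the guarded work
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}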
What about this one?
http://www.gemstone.com/docs/5.5.0/product/docs/japi/com/gemstone/gemfire/distributed/DistributedLockService.html
Its lock method seems to have what you need:
public abstract boolean lock(Object name,
                             long waitTimeMillis,
                             long leaseTimeMillis)
Attempts to acquire a lock named name. Returns true as soon as the lock is acquired. If the lock is currently held by another thread in this or any other process in the distributed system, or another thread in the system has locked the entire service, this method keeps trying to acquire the lock for up to waitTimeMillis before giving up and returning false. If the lock is acquired, it is held until unlock(Object name) is invoked, or until leaseTimeMillis milliseconds have passed since the lock was granted - whichever comes first.
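A rough usage sketch based on that javadoc (the service and lock names are made up, and ds is assumed to be an already-connected DistributedSystem):

import com.gemstone.gemfire.distributed.DistributedLockService;

DistributedLockService dls = DistributedLockService.create("customer-locks", ds);

// wait up to 30 seconds for the lock; the lease expires after 2 hours even if we die
if (dls.lock("customer-42", 30000L, 2L * 60 * 60 * 1000)) {
    try {
        // do the guarded work
    } finally {
        dls.unlock("customer-42");
    }
}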
Actually, as far as I can tell, mongo-java-distributed-lock has the ability to expire a lock through the use of DistributedLockOptions.setInactiveLockTimeout().
I haven't tried it yet, but think I will...
EDIT: I have now also tried it, and it works well...
String lockName = "com.yourcompany.yourapplication.somelock";
int lockTimeoutMilliSeconds = 500;
String dbURI = CWConfig.get().getMongoDBConfig().getDbURI();
DistributedLockSvcFactory lockSvcFactory = new DistributedLockSvcFactory(new DistributedLockSvcOptions(dbURI));
DistributedLockSvc lockSvc = lockSvcFactory.getLockSvc();
DistributedLock lock = lockSvc.create(lockName);
lock.getOptions().setInactiveLockTimeout(lockTimeoutMilliSeconds);
lock.lock();
try {
    // Do work
} finally {
    lock.unlock();
}

Java Concurrent Locks failing in Spring Web App deployed in Clustered Environment

In my Spring web app, I have a service method containing a block of code guarded by a lock.
Only a single thread can enter the code block at a time.
This works fine in a non-clustered environment but fails in a clustered one. In a clustered environment, synchronization happens within a node, but among different nodes the code block is executed in parallel. Is this because a separate Lock object is created in each node?
Can anyone advise me?
Code Sample
//Service Class
@Service
class MyServiceClass {

    private final Lock globalLock;

    @Autowired
    public MyServiceClass(@Qualifier("globalLock") final Lock globalLock) {
        this.globalLock = globalLock;
    }

    public void myServiceMethod() {
        ...
        globalLock.lock();
        try {
            ...
        } finally {
            globalLock.unlock();
        }
        ...
    }
}//End of MyServiceClass

//Spring Configuration XML
<bean id="globalLock" class="java.util.concurrent.locks.ReentrantLock" scope="singleton" />
If you want to synchronize objects in a cluster environment, meaning many VMs are involved, your solution will have to involve some kind of communication between those VMs.
In this case, it will require some imagination to get the thing done: you will need the mutual exclusion implemented on some object that is common to all the VMs involved, and that may become a point of contention as you put additional machines into the cluster. Have you thought of a solution based on JNDI? Here you have something on it, though I am afraid it looks rather like an academic discussion:
http://jsr166-concurrency.10961.n7.nabble.com/Lock-implementation-td2180.html
There is always the chance to implement something based on DB mechanisms (remembering that your DB is a central and common resource to all the nodes in the cluster). You could devise something based on a SELECT FOR UPDATE mechanism implemented in your database, over some table used only for synchronization...
You have an interesting problem! :) Good luck
You are right; the reason is that each node has its own lock. To solve this, consider introducing in the database a table SERVICE_LOCKS, with the columns service class name, service ID, lock status, and acquisition timestamp.
For the service ID, make each service generate a unique distributed ID using UUID.randomUUID().
To acquire a lock, issue an UPDATE to try to grab it, and then query to see if you have the lock - but don't do a SELECT, then check, then UPDATE. Locks older than a certain amount of time should not be taken into account.
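A hedged sketch of that grab-via-UPDATE step (the table and column names follow the suggestion above but are otherwise made up; an expired lock is treated as free):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public boolean tryAcquire(Connection conn, String serviceName, String myUuid, long timeoutMs)
        throws SQLException {
    // try to grab the lock: it is free if it has no owner or the owner's lease expired
    try (PreparedStatement ps = conn.prepareStatement(
            "UPDATE SERVICE_LOCKS SET OWNER_ID = ?, ACQUIRED_AT = CURRENT_TIMESTAMP "
          + "WHERE SERVICE_NAME = ? AND (OWNER_ID IS NULL OR ACQUIRED_AT < ?)")) {
        ps.setString(1, myUuid);
        ps.setString(2, serviceName);
        ps.setTimestamp(3, new Timestamp(System.currentTimeMillis() - timeoutMs));
        ps.executeUpdate();
    }
    // then query to see whether we are the one actually holding it
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT OWNER_ID FROM SERVICE_LOCKS WHERE SERVICE_NAME = ?")) {
        ps.setString(1, serviceName);
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next() && myUuid.equals(rs.getString(1));
        }
    }
}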
This is an implementation of the coarse-grained lock design pattern, where an application-level pessimistic lock is acquired to lock shared resources.
Depending on the business logic of the services and the type of transaction manager you use, increasing the isolation level of the service method to REPEATABLE_READ might be an option.
For a solution that does not involve the database, have a look at a framework for distributed concurrent processing based on the Actor concurrency model - the Akka Framework (click the Remoting button).

Java: TaskExecutor for Asynchronous Database Writes?

I'm thinking of using Spring's TaskExecutor to fire off asynchronous database writes. Understandably, threads don't come for free, but assuming I'm using a fixed thread pool size of, say, 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?
Why doesn't every application use asynchronous writes?
It's often necessary/useful/easier to deal with a write failure in a synchronous manner.
I'm not sure a threadpool is even necessary. I would consider using a dedicated databaseWriter thread which does all writing and error handling for you. Something like:
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncDatabaseWriter implements Runnable {

    private final LinkedBlockingQueue<Data> queue = new LinkedBlockingQueue<Data>();
    private volatile boolean terminate = false;

    public void run() {
        while (!terminate) {
            try {
                Data data = queue.take(); // blocks until work arrives
                // write data to the database
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public void scheduleWrite(Data data) {
        queue.add(data);
    }
}
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way, just adding it as an alternative.
The idea is not bad at all. Actually, I just tried it yesterday because I needed to create a copy of an online database which has 5 different categories with around 60,000 items each.
By moving the parse/save operation of each category into parallel tasks, and partitioning each category's import into smaller batches run in parallel, I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found a good piece of code for splitting a collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run the tasks. Your tasks are just simple implementations of the Callable interface.
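A minimal sketch of that setup (assuming Spring's ThreadPoolTaskExecutor; what each batch does is hypothetical):

import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(5);
executor.setMaxPoolSize(10);
executor.initialize();

// one task per batch; the Future lets you collect results or failures later
Future<Integer> saved = executor.submit(new Callable<Integer>() {
    public Integer call() {
        int rows = 0;
        // parse this batch and write it to the database, counting saved rows
        return rows;
    }
});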
Why doesn't every application use asynchronous writes? - erm, because every application does a different thing.
Can you believe some applications don't even use a database? OMG!
Seriously though, given that you don't say what your failure strategies are, it sounds like it could be reasonable. What happens if the write fails, or the DB goes away somehow?
Some databases - like Sybase - have (or at least had) a thing where they really don't like multiple writers to a single table - all the writers ended up blocking each other - so maybe it won't actually make much difference...

Redis & Java in a multithreaded application help!

We have an application which is currently threaded (about 50 threads) to process transactions.
We have set up a Redis database and are using DECRBY to deduct credits from a user's account.
Here is an example of the process:
1. Get amount of credits for this transaction
2. Get the current credit amount from Redis: GET <key>
3. If the credit amount exceeds the cost of the transaction, continue
4. DECRBY the transaction amount in Redis.
The issue I have here is obvious: when the user's credits reach 0, it does fail the transaction (good), but it lets about 10-20 transactions through because of the threading.
I have thought of setting up WATCH, MULTI, EXEC with Redis and then retrying, but won't this cause a bottleneck (I think it's called contention) because the threads will be constantly fighting to complete the transaction?
Any suggestions ?
Locking is what you need. Since DB locks are expensive, you can implement a simple locking scheme in Redis using SETNX and avoid race conditions. It's well explained here: http://redis.io/commands/setnx. But you will still need to implement retries at the application level.
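A rough sketch of that scheme with the Jedis client (an assumption - the answer doesn't name a client; the key name and timeout are made up):

import redis.clients.jedis.Jedis;

public boolean tryLock(Jedis jedis, String userId) {
    String lockKey = "lock:credits:" + userId;
    // SETNX succeeds only for the first caller; the expiry guards against
    // a crashed holder keeping the lock forever
    if (jedis.setnx(lockKey, "locked") == 1) {
        jedis.expire(lockKey, 10); // seconds; newer Redis can do SET key val NX EX atomically
        return true;
    }
    return false; // caller should retry after a short sleep
}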
It isn't the most conventional way of doing it IMO (the most usual way is probably to use a lock in an RDBMS), but using WATCH, MULTI, EXEC looks akin to CAS, and it doesn't seem too weird to me.
I'd assume that the author of Redis intended WATCH to be used like this. The performance implication obviously depends on how this thing is implemented (which I don't know), but my bet is that it will perform pretty well.
This is because it seems likely that there will be very little to almost no contention for the same keys in your situation (what is the chance of a user frantically issuing transactions for him/herself?), so the success rate of the first swap operation will be really good and the retry will only happen in very rare cases. Since Redis seems to be a credible framework, they also probably know what they are doing (i.e. less contention = an easy job for Redis, so it can probably handle it!).
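For concreteness, a hedged sketch of the check-and-decrement as a WATCH/MULTI/EXEC loop (again assuming the Jedis client; the key name is made up):

import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public boolean deductCredits(Jedis jedis, String userId, long cost) {
    String key = "credits:" + userId;
    while (true) {
        jedis.watch(key); // EXEC aborts if the key changes after this point
        long credits = Long.parseLong(jedis.get(key));
        if (credits < cost) {
            jedis.unwatch();
            return false; // not enough credits: fail the transaction
        }
        Transaction t = jedis.multi();
        t.decrBy(key, cost);
        List<Object> result = t.exec(); // null means another thread won the race
        if (result != null) {
            return true;
        }
        // otherwise retry - rare unless one user hammers their own account
    }
}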
You could try using the Redis-based Lock object for Java provided by the Redisson framework instead of retrying with WATCH-MULTI commands. Working with WATCH-MULTI involves extra requests to Redis on each attempt, which is much slower than working with an already-acquired lock.
Here is the code sample:
Lock lock = redisson.getLock("transactionLock");
lock.lock();
try {
    ... // instructions
} finally {
    lock.unlock();
}
