I think I found more bugs in my web application. Normally, I do not worry about concurrency issues, but when you get a ConcurrentModificationException, you begin to rethink your design.
I am using JBoss Seam in conjunction with Hibernate and EHCache on Jetty. Right now, it is a single application server with multiple cores.
I briefly looked over my code and found a few places that haven't thrown an exception yet, but I am fairly sure they can.
The first servlet filter I have basically checks if there are messages to notify the user of an event that occurred in the background (from a job, or another user). The filter simply adds messages to the page in a modal popup. The messages are stored on the session context, so it is possible another request could pull the same messages off the session context.
Right now, it works fine, but I am not hitting a page with many concurrent requests. I am thinking that I might need to write some JMeter tests to ensure this doesn't happen.
The second servlet filter logs all incoming requests along with the session. This lets me know where the client is coming from, what browser they're running, etc. The problem I am seeing more recently is that on image gallery pages (where there are many requests at about the same time), I end up getting a ConcurrentModificationException because I'm adding a request to the session.
The session contains a list of requests, and this list appears to be getting hit by multiple threads.
@Entity
public class HttpSession
{
    protected List<HttpRequest> httpRequests;

    @Fetch(FetchMode.SUBSELECT)
    @OneToMany(mappedBy = "httpSession")
    public List<HttpRequest> getHttpRequests()
    { return httpRequests; }
    ...
}
@Entity
public class HttpRequest
{
    protected HttpSession httpSession;

    @ManyToOne(optional = false)
    @JoinColumn(nullable = false)
    public HttpSession getHttpSession()
    { return httpSession; }
    ...
}
In that second servlet filter, I am doing something like this:
httpSession.getHttpRequests().add(httpRequest);
session.saveOrUpdate(httpSession);
The part that errors out is when I do some comparison to see what changed from request to request:
for(HttpRequest httpRequest:httpSession.getHttpRequests())
That line there blows up with a concurrent modification exception.
Things to walk away with:
1. Will JMeter tests be useful here?
2. What books do you recommend for writing web applications that scale under concurrent load?
3. I tried placing synchronized around where I think I need it, ie on the method that loops through the requests, but it still fails. What else might I need to do?
I added some comments:
I had thought about making the logging of the http requests a background task. I can easily spawn a background task to save that information. I am trying to remember why I didn't evaluate that too much. I think there is some information that I would like to have access to on the spot.
If I made it asynchronous, that would speed up the throughput quite a bit - well I'd have to use JMeter to measure those differences.
I would still have to deal with the concurrency issue there.
Thanks,
Walter
A ConcurrentModificationException occurs when any collection is modified while you are iterating over it. You can trigger it even in a single thread, e.g.:
for (Object o : someList) {
    someList.add(new Object());
}
Wrap your list with Collections.synchronizedList or return an unmodifiable copy of the list.
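For example, a minimal sketch of the synchronizedList approach (the HttpRequest type and variable names just mirror the question; the loop body is a placeholder):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<HttpRequest> requests = Collections.synchronizedList(new ArrayList<HttpRequest>());

// Single calls such as add() are safe without extra work.
requests.add(httpRequest);

// Iteration, however, must hold the list's own lock, otherwise another thread
// adding a request mid-loop can still trigger a ConcurrentModificationException.
synchronized (requests) {
    for (HttpRequest r : requests) {
        // compare r with the previous request here
    }
}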
I'm not sure about scaling web applications in particular, but Java Concurrency in Practice is a fantastic book on concurrency in general.
The list should be replaced with a version that is threadsafe, or all access to it has to be synchronized (readers and writers) on the same object. It is not enough to synchronize just the method that reads from the list.
It's caused by the list being modified by another request while you're still iterating over it in the current request. Replacing the List with a ConcurrentLinkedQueue should fix this particular problem.
As to your other questions:
1: Will JMeter tests be useful here?
Yes, it is certainly useful to stress-test web applications and spot concurrency bugs.
2: What books do you recommend for writing web applications that scale under concurrent load?
Not specifically tied to web applications, but more to concurrency in general, the book Java Concurrency in Practice is the most recommended one in that area. You can perfectly well apply what you learn to web applications; they are a perfect real-world example of heavily concurrent applications.
3: I tried placing synchronized around where I think I need it, ie on the method that loops through the requests, but it still fails. What else might I need to do?
You basically need to synchronize any access to the list on the same lock. But just replacing it with a ConcurrentLinkedQueue is easier.
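A rough sketch of that replacement, reusing the field names from the question:
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Replace the List<HttpRequest> field with a concurrent queue.
Queue<HttpRequest> httpRequests = new ConcurrentLinkedQueue<HttpRequest>();

// Adding is thread-safe without any external locking.
httpRequests.add(httpRequest);

// The iterator is weakly consistent: it never throws
// ConcurrentModificationException; it simply may or may not
// reflect additions made while you are iterating.
for (HttpRequest r : httpRequests) {
    // inspect r
}
(Whether Hibernate will happily map a Queue is a separate question; if not, keep the mapped List and copy or synchronize it as described elsewhere on this page.)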
You're getting an exception on the iterator, because another thread is altering the collection backing the iterator while you're in mid-iteration.
You could wrap access to the list (both adding and iterating) in synchronized blocks, but there are problems with this: it could take significantly longer to iterate through the list, along with the processing that goes along with it, and you'd be holding the lock on the list for all of that time.
Another option would be to copy the list and pass out the copy for iteration, which might be a better idea if the objects are small, as you'd only be holding the lock while you make the copy, rather than while you're iterating through the list.
Store your values in a ConcurrentHashMap, which uses lock striping to minimize lock contention. You could then have your get method return a copied list of the keys you want, rather than the complete objects, and access them one at a time directly from the map.
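A sketch of the copy-for-iteration idea from the previous paragraph; the lock is held only while the snapshot is taken, not for the whole loop:
import java.util.ArrayList;
import java.util.List;

List<HttpRequest> snapshot;
synchronized (httpRequests) {
    // Hold the lock only long enough to copy the current contents.
    snapshot = new ArrayList<HttpRequest>(httpRequests);
}

// Iterate over the private copy without holding the lock; other requests
// can keep appending to the original list in the meantime.
for (HttpRequest r : snapshot) {
    // compare r with the previous request here
}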
As is mentioned in another answer here, Java Concurrency in Practice is a great book.
The other posters are correct in stating that you need to be writing to a threadsafe data structure. In doing so, you may slow down your response time due to thread contention. Since this is essentially a logging operation that is a side effect of the request itself (or am I not understanding you correctly?), you could spawn a new thread responsible for writing to the threadsafe data structure. That allows you to proceed with the actual response instead of burning response time on a logging operation. It might be worth investigating setting up a thread pool to reduce the time required to use the logging threads.
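One possible shape for that hand-off, using a small thread pool; buildHttpRequest() and LogStore are made-up names standing in for whatever builds and persists the log entry:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Created once (e.g. in Filter.init()) and shut down in Filter.destroy().
ExecutorService logPool = Executors.newFixedThreadPool(2);

// In doFilter(): hand the logging off and let the response proceed immediately.
final HttpRequest httpRequest = buildHttpRequest(request); // hypothetical helper
logPool.submit(new Runnable() {
    public void run() {
        LogStore.save(httpRequest); // placeholder for thread-safe persistence
    }
});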
Any concurrency book by Doug Lea is worth reading.
Related
I'm playing around writing a simple multi-threaded web crawler. I see a lot of sources talk about web crawlers as obviously parallel because you can start crawling from different URLs, but I never see them discuss how web crawlers handle URLs that they've already seen before. It seems that some sort of global map would be essential to avoid re-crawling the same pages over and over, but how would the critical section be structured? How fine grained can the locks be to maximize performance? I just want to see a good example that's not too dense and not too simplistic.
Specific domain use case: use in memory
If it is a specific domain, say abc.com, then it is better to keep a visited-URL set or ConcurrentHashMap in memory; checking the visited status in memory is faster, and memory consumption will be comparatively low. A DB has I/O overhead, which is costly, and the visited-status check will be very frequent, so it would hit your performance drastically. Depending on your use case, you can use memory or a DB. My use case was specific to a single domain where a visited URL would not be visited again, so I used a ConcurrentHashMap.
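A minimal in-memory visited check along those lines might look like this (crawl() is a placeholder for the actual fetching and parsing; newKeySet() requires Java 8):
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Thread-safe Set view backed by a ConcurrentHashMap.
Set<String> visited = ConcurrentHashMap.newKeySet();

// add() returns true only for the first thread to see this URL,
// so every page is crawled at most once.
if (visited.add(url)) {
    crawl(url);
}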
If you insist on doing it using only the Java concurrency framework, then ConcurrentHashMap may be the way to go. The interesting method on it is ConcurrentHashMap.putIfAbsent; it gives you very good efficiency, and the idea of how to use it is:
You will have some multithreaded source of incoming URL addresses from crawled pages - you can use a concurrent queue to store them, or just create an ExecutorService with an (unbounded?) queue in which you place Runnables that will crawl the URLs.
Inside the crawling Runnables you should have a reference to this common ConcurrentHashMap of already crawled pages, and at the very beginning of the run method do:
private final ConcurrentHashMap<String, Long> crawledPages = new ConcurrentHashMap<String, Long>();
...
private class Crawler implements Runnable {
    private final String urlToBeCrawled;

    public Crawler(String urlToBeCrawled) {
        this.urlToBeCrawled = urlToBeCrawled;
    }

    public void run() {
        // putIfAbsent is atomic: only the first thread to claim this URL gets null back
        if (crawledPages.putIfAbsent(urlToBeCrawled, System.currentTimeMillis()) == null) {
            doCrawlPage(urlToBeCrawled);
        }
    }
}
If crawledPages.putIfAbsent(urlToBeCrawled, System.currentTimeMillis()) returns null, then you know that this page has not been crawled by anyone; since this method puts the value atomically, you can proceed with crawling this page - you're the lucky thread. If it returns a non-null value, then you know someone has already taken care of this URL, so your Runnable should finish, and the thread goes back to the pool to be used by the next Runnable.
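Wiring the Crawler above into the ExecutorService mentioned earlier could look roughly like this; how discoveredUrls gets filled (and when to shut the pool down) is left out:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(8);

// Submit every URL found on a crawled page; the putIfAbsent check
// inside Crawler.run() silently drops the duplicates.
for (String url : discoveredUrls) {
    pool.submit(new Crawler(url));
}

// ... once the crawl is finished:
pool.shutdown();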
You can use a ConcurrentHashMap to detect duplicate URLs. ConcurrentHashMap also uses a lock-striping mechanism instead of a single global lock.
Or you can use your own implementation where you split all your data among different keys.
For example, with the Guava API:
Striped<ReadWriteLock> rwLockStripes = Striped.readWriteLock(10);
String key = "taskA";
ReadWriteLock rwLock = rwLockStripes.get(key);
rwLock.writeLock().lock();
try {
    .....
} finally {
    rwLock.writeLock().unlock();
}
ConcurrentHashMap example
private Set<String> urls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
For a crawler, don't use ConcurrentHashMap; rather, use a database
The number of visited URLs will grow very fast, so it is not a good idea to store them in memory; better to use a database. Store the URL and the date it was last crawled, then just check whether the URL already exists in the DB or is eligible for refreshing. I use, for example, a Derby DB in embedded mode, and it works perfectly for my web crawler. I don't advise using an in-memory DB like H2, because with the number of crawled pages you will eventually get an OutOfMemoryException.
You will rather rarely have the case of crawling the same page more than once at the same time, so checking in the DB whether it was already crawled recently is enough not to waste significant resources on "re-crawling the same pages over and over". I believe this is "a good solution that's not too dense and not too simplistic".
Also, by storing the "last visit date" for each URL in the database, you can stop and continue the work whenever you want; with a ConcurrentHashMap you will lose all the results when the app exits. You can use the "last visit date" to determine whether a URL needs recrawling or not.
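A rough sketch of that check with embedded Derby; the table name, column names and the 24-hour refresh interval are made up for illustration:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

Connection con = DriverManager.getConnection("jdbc:derby:crawlerdb;create=true");

// Has this URL been crawled within the last 24 hours?
PreparedStatement check = con.prepareStatement(
        "SELECT last_crawled FROM crawled_urls WHERE url = ?");
check.setString(1, url);
ResultSet rs = check.executeQuery();

boolean seenBefore = rs.next();
long dayAgo = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
boolean stale = seenBefore && rs.getTimestamp(1).getTime() < dayAgo;

if (!seenBefore || stale) {
    // crawl the page here, then record the visit
    PreparedStatement write = con.prepareStatement(seenBefore
            ? "UPDATE crawled_urls SET last_crawled = ? WHERE url = ?"
            : "INSERT INTO crawled_urls (last_crawled, url) VALUES (?, ?)");
    write.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
    write.setString(2, url);
    write.executeUpdate();
}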
I have an Actor that - in its very essence - maintains a list of objects. It has three basic operations, an add, update and a remove (where sometimes the remove is called from the add method, but that aside), and works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls interleaving each other constantly.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway through rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain - and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.
Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mix the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]
You don't need to synchronize the state of the actors. The aim of actors is to avoid tricky, error-prone and hard-to-debug concurrent programming.
The actor model ensures that the actor consumes messages one by one and that you will never have two threads consuming messages for the same actor.
Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed, as explained in the Akka documentation:
the actor send rule: the send of a message to an actor happens before the receive of that message by the same actor.
the actor subsequent processing rule: processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."
What collection type should I use in a concurrent access situation, and how is it used?
See #hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.
This is a recent interview question to my friend:
How would you handle a situation where users enter some data on the screen and, let's say, 5 of them clicked on the Submit button at the *SAME time*?
(By "same time", the interviewer insisted that they are the same down to the nanosecond.)
My answer was just to make the method that handles the request synchronized, so only one request can acquire the lock on the method at a given time.
But it looks like the interviewer kept insisting there was a "better way" to handle it.
One other approach is to handle locking at the database level, but I don't think it is "better".
Are there any other approaches? This seems to be a fairly common problem.
If you have only one network card, you can only have one request coming down it at once. ;)
The answer he is probably looking for is something like
Make the servlet stateless so requests can be executed concurrently.
Use components which allow thread-safe concurrent access, like the Atomic* or Concurrent* classes (a minimal sketch follows at the end of this answer).
Use locks only where you absolutely have to.
What I prefer to do is to make the service so fast it can respond before the next request can come in. ;) Though I don't have the overhead of Java EE or databases to worry about.
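To illustrate the stateless/Atomic* points above, here is a hypothetical servlet whose only shared state is an AtomicLong; five (or five thousand) simultaneous submits neither corrupt the counter nor queue up behind a lock:
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SubmitServlet extends HttpServlet {
    // The only shared state, updated atomically - no synchronized needed.
    private final AtomicLong submissions = new AtomicLong();

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Everything else lives in local variables, so concurrent requests
        // cannot interfere with each other.
        String data = req.getParameter("data");
        long count = submissions.incrementAndGet();
        resp.getWriter().println("Received submission #" + count + " (" + data + ")");
    }
}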
Does it matter that they click at the same time, e.g. are they both updating the same record in the database?
A synchronized method will not cut it, especially if it's a webapp distributed amongst multiple JVMs. Also the synchronized method may block, but then the other threads would just fire after the first completes and you'd have lost writes.
So locking at the database level seems to be the option here, i.e. if the record has already been updated, report an error back to the users whose updates were serviced after the first.
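A typical way to do that at the database level is optimistic locking with a version column. This is only a sketch: the table and column names are invented, and connection, newValue, recordId and versionTheUserSaw are assumed to exist:
import java.sql.PreparedStatement;

// Update only if nobody changed the row since this user read it.
PreparedStatement ps = connection.prepareStatement(
        "UPDATE record SET value = ?, version = version + 1 " +
        "WHERE id = ? AND version = ?");
ps.setString(1, newValue);
ps.setLong(2, recordId);
ps.setLong(3, versionTheUserSaw);

if (ps.executeUpdate() == 0) {
    // Someone else's update already bumped the version: report the conflict
    // to this user instead of silently overwriting the earlier write.
}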
You do not have to worry about this in general, as the web server launches each request in an isolated thread and manages it.
But if you have a shared resource, like a file used for logging, then you need to handle the concurrency yourself and put a lock on it, both within a request and across requests.
I need to use the Stanford Parser in a web service. As SentenceParser loads a big object, I will make sure it is a singleton, but in this case, is it thread safe (no, according to http://nlp.stanford.edu/software/parser-faq.shtml)? How else would it be done efficiently? One option is locking the object while it is being used.
Any idea how the people at Stanford are doing this for http://nlp.stanford.edu:8080/parser/ ?
If the contention is not a factor, locking (synchronization) would be one option as you mentioned, and it might be good enough.
If there are contentions, however, I see three general options.
(1) instantiating it every time
Just instantiate it as a local variable every time you perform parsing. Local variables are trivially thread-safe. The instantiation is not free of course, but it may be acceptable depending on the specific situation.
(2) using threadlocals
If instantiation turns out to be costly, consider using threadlocals. Each thread would retain its own copy of the parser, and the parser instance would be reused on a given thread. Threadlocals are not without problems, however. Threadlocals may not be garbage collected without being set to null or until the holding thread goes away. So there is a memory concern if there are too many of them. Second, beware of the reuse. If these parsers are stateful, you need to ensure to clean up and restore the initial state so subsequent use of the threadlocal instance does not suffer from the side effect of previous use.
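A sketch of the threadlocal option; SentenceParser and its constructor argument are stand-ins for however the Stanford parser is actually created in your code:
public class ParserHolder {
    // One parser per thread, created lazily on first use and then reused.
    private static final ThreadLocal<SentenceParser> PARSER =
            new ThreadLocal<SentenceParser>() {
                @Override
                protected SentenceParser initialValue() {
                    return new SentenceParser("englishPCFG.ser.gz"); // assumed constructor
                }
            };

    public static SentenceParser get() {
        return PARSER.get();
    }
}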
(3) pooling
Pooling is in general no longer recommended, but if the object sizes are truly large so that you need to have a hard limit on the number of instances you can allow, then using an object pool might be the best option.
I don't know how the people at Stanford have implemented their service, but I would build such a service based on a message framework, such as http://www.rabbitmq.com/. So your front-end service will receive documents and use a message queue to communicate (store documents and retrieve results) with several workers that execute the NLP parsing. The workers -- after finishing processing -- will store results in a queue that is consumed by the front-end service. This architecture will let you dynamically add new workers in case of high load, especially since NLP tagging takes some time - up to several seconds per document.
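For what it's worth, handing a document to such a queue with the RabbitMQ Java client could look roughly like this; the host, queue name and documentText variable are assumptions, and the workers would consume from the same queue and publish their results to a reply queue:
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

ConnectionFactory factory = new ConnectionFactory();
factory.setHost("localhost");
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();

// Durable queue, so pending documents survive a broker restart.
channel.queueDeclare("nlp.parse.requests", true, false, false, null);
channel.basicPublish("", "nlp.parse.requests", null, documentText.getBytes("UTF-8"));

channel.close();
connection.close();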
Suppose that I have a method called doSomething() and I want to use this method in a multithreaded application (each servlet inherits from HttpServlet). I'm wondering if it is possible that a race condition will occur in the following cases:
doSomething() is not a static method and it writes values to a database.
doSomething() is a static method but it does not write values to a database.
What I have noticed is that many methods in my application may lead to a race condition or dirty read/write. For example, I have a poll system, and for each voting operation, a certain method will change a single cell value for that poll, as in the following:
[poll_id | poll_data ]
[1 | {choice_1 : 10, choice_2 : 20}]
Will the JSP/Servlets app solve these issues by itself, or do I have to solve all that by myself?
Thanks..
It depends on how doSomething() is implemented and what it actually does. I assume writing to the database uses JDBC connections, which are not threadsafe. The preferred way of doing that would be to create ThreadLocal JDBC connections.
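A bare-bones version of that threadlocal-connection idea; the JDBC URL is a placeholder, and in practice a connection pool or framework would normally manage this for you:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionHolder {
    // Each thread gets its own connection, so nothing is shared between requests.
    private static final ThreadLocal<Connection> CONNECTION =
            new ThreadLocal<Connection>() {
                @Override
                protected Connection initialValue() {
                    try {
                        return DriverManager.getConnection("jdbc:yourdb://localhost/app");
                    } catch (SQLException e) {
                        throw new RuntimeException(e);
                    }
                }
            };

    public static Connection get() {
        return CONNECTION.get();
    }
}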
As for the second case, it depends on what is going on in the method. If it doesn't access any shared, mutable state then there isn't a problem. If it does, you probably will need to lock appropriately, which may involve adding locks to every other access to those variables.
(Be aware that just marking these methods as synchronized does not fix any concurrency bugs. If doSomething() incremented a value on a shared object, then all accesses to that variable need to be synchronized since i++ is not an atomic operation. If it is something as simple as incrementing a counter, you could use AtomicInteger.incrementAndGet().)
The Servlet API certainly does not magically make concurrency a non-issue for you.
When writing to a database, it depends on the concurrency strategy in your persistence layer. Pessimistic locking, optimistic locking, last-in-wins? There's way more going on when you 'write to a database', and you need to decide how you're going to handle it. What is it you want to have happen when two people click the button at the same time?
Making doSomething static doesn't seem to have too much bearing on the issue. What's happening in there is the relevant part. Is it modifying static variables? Then yes, there could be race conditions.
The Servlet API will not do anything for you to make your concurrency problems disappear. Things like using the synchronized keyword on your servlets are a bad idea, because you are basically forcing your threads to be processed one at a time and it ruins your ability to respond quickly to multiple users.
If you use Spring or EJB3, either one will provide threadlocal database connections and the ability to specify transactions. You should definitely check out one of those.
Case 1, your servlet uses some code that accesses a database. Databases have locking mechanisms that you should exploit. Two important reasons for this: the database itself might be used from other applications that read and write that data, it's not enough for your app to deal with contending with itself. And: your own application may be deployed to a scaled, clustered web container, where multiple copies of your code are executing on separate machines.
So, there are many standard patterns for dealing with locks in databases, you may need to read up on Pessimistic and Optimistic Locking.
The Servlet API and JDBC connection pooling give you some helpful guarantees, so that you can write your servlet code without using Java synchronisation provided your variables are in method scope. In concept you have:
Start transaction (perhaps implicit, perhaps on entry to an ejb)
Get a connection to the DB (gets you a connection from the pool, associated with your transaction)
read/write/update code
Close connection (actually keeps it for your thread until your transaction commits)
Commit (again maybe implicitly)
So your only real issue is dealing with any contention in the DB. All of the above tends to be done rather more nicely using things such as JPA these days, but under the covers that's more or less what's happening.
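In raw JDBC terms, those steps look something like the following sketch; dataSource, updatedPollData and pollId are assumed to exist, and the SQL just mirrors the poll table from the question:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

void recordVote(DataSource dataSource, String updatedPollData, long pollId) throws SQLException {
    Connection con = dataSource.getConnection();   // connection from the container's pool
    try {
        con.setAutoCommit(false);                  // start the transaction
        PreparedStatement ps = con.prepareStatement(
                "UPDATE poll SET poll_data = ? WHERE poll_id = ?");
        ps.setString(1, updatedPollData);
        ps.setLong(2, pollId);
        ps.executeUpdate();
        con.commit();                              // make the change visible
    } catch (SQLException e) {
        con.rollback();                            // undo everything on failure
        throw e;
    } finally {
        con.close();                               // hands the connection back to the pool
    }
}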
Case 2: a static method. This presumably implies that you now keep everything in a memory structure. This (barring remote invocation of some sort) implies a single JVM and you managing your own locking. Should your JVM or machine crash, I guess you lose your data. If you care about your data then using a DB is probably better.
OR, how about a completely different approach: the servlet simply records the "vote" by writing a message to a persistent JMS queue. Have some other process pick up the votes from the queue and add them up. You won't give immediate feedback to the voter this way, but you decouple the user's experience from the actual (in similar scenarios) quite complex processing.
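A hedged sketch of what the servlet side of that could look like with plain JMS; the JNDI names are container-specific assumptions, and pollId and choiceId stand in for whatever the vote payload actually is:
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

InitialContext ctx = new InitialContext();
ConnectionFactory factory = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory"); // assumed JNDI name
Queue voteQueue = (Queue) ctx.lookup("jms/VoteQueue");                               // assumed JNDI name

Connection connection = factory.createConnection();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageProducer producer = session.createProducer(voteQueue);

// Fire-and-forget: the vote is now safely queued, counting happens elsewhere.
producer.send(session.createTextMessage(pollId + ":" + choiceId));
connection.close();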
I think that the best solution for your problem is to use something like the synchronized keyword together with wait/notify!