I have a webapp that sometimes needs to download some bytes from a URL, package them up, and send them back to the requester. The downloaded bytes are stored for a little while so they can be reused if the same URL is requested again. I am trying to figure out how best to prevent two threads from downloading the same URL at the same time when the requests come in simultaneously. I was thinking of creating a class like the one below that would prevent the same URL from being downloaded concurrently. If a URL cannot be locked, the thread waits until it is unlocked and then downloads it itself, unless the content already exists after the unlock.
public class URLDownloader
{
    private final HashMap<String, String> activeThreads = new HashMap<String, String>();

    public synchronized void lockURL(String url, String threadID) throws UnableToLockURLException
    {
        if (!activeThreads.containsKey(url))
            activeThreads.put(url, threadID);
        else
            throw new UnableToLockURLException();
    }

    public synchronized String unlockURL(String url, String threadID)
    {
        // need to check to make sure it's locked, and by the passed-in thread
        return activeThreads.remove(url);
    }

    public synchronized boolean isURLStillLocked(String url)
    {
        return activeThreads.containsKey(url);
    }
}
Does anyone have a better solution for this? Does my solution seem valid? Are there any open source components out there that already do this very well that I can leverage?
Thanks
I would suggest keeping a concurrent Set<String> to keep track of your unique URLs, visible to all your threads. This construct does not exist directly in the Java library, but can easily be constructed from a ConcurrentHashMap like so: Collections.newSetFromMap(new ConcurrentHashMap<String,Boolean>())
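A minimal sketch of how such a set could be used so that only one thread actually downloads a given URL at a time; the class and method names here are hypothetical, not from the original question:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DownloadDeduplicator {
    // Set backed by a ConcurrentHashMap; add() is atomic and returns
    // false if the URL was already present.
    private final Set<String> inProgress =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    public void download(String url) {
        if (!inProgress.add(url)) {
            return; // another thread is already downloading this URL
        }
        try {
            // ... actually fetch the bytes and cache them ...
        } finally {
            inProgress.remove(url); // allow the URL to be downloaded again later
        }
    }
}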
It sounds like you don't need a lock, since if there are multiple requests to download the same URL, the point is to download it only once.
Also, I think it would make more sense in terms of encapsulation to put the check for a stored URL / routine to store new URLs in the URLDownloader class, rather than in the calling classes. Your threads can simply call e.g. fetchURL(), and let URLDownloader handle the specifics.
So, you can implement this in two ways. If you don't have a constant stream of download requests, the simpler way is to have only one URLDownloader thread running, and to make its fetchURL method synchronized, so that you only download one URL at a time. Otherwise, keep the pending download requests in a central LinkedHashSet<String>, which preserves order and ignores repeats.
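A rough sketch of the first (single, synchronized downloader) variant, assuming a simple in-memory cache; the field and helper names are illustrative only:

import java.util.HashMap;
import java.util.Map;

public class URLDownloader {
    private final Map<String, byte[]> cache = new HashMap<String, byte[]>();

    // Only one thread can be inside fetchURL at a time, so a URL is never
    // downloaded twice concurrently and repeat requests hit the cache.
    public synchronized byte[] fetchURL(String url) {
        byte[] bytes = cache.get(url);
        if (bytes == null) {
            bytes = downloadBytes(url); // hypothetical helper doing the actual HTTP fetch
            cache.put(url, bytes);
        }
        return bytes;
    }

    private byte[] downloadBytes(String url) {
        // ... open a connection and read the body ...
        return new byte[0];
    }
}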
Related
I'm playing around writing a simple multi-threaded web crawler. I see a lot of sources talk about web crawlers as obviously parallel because you can start crawling from different URLs, but I never see them discuss how web crawlers handle URLs that they've already seen before. It seems that some sort of global map would be essential to avoid re-crawling the same pages over and over, but how would the critical section be structured? How fine grained can the locks be to maximize performance? I just want to see a good example that's not too dense and not too simplistic.
Specific-domain use case: use in memory
If you are crawling a specific domain, say abc.com, it is better to keep a visited-URL set or a ConcurrentHashMap in memory: an in-memory check of the visited status is faster, and memory consumption is comparatively small. A DB has I/O overhead, which is costly, and the visited-status check happens very frequently, so it would hit your performance drastically. Depending on your use case, you can use memory or a DB. My use case was specific to one domain, where a visited URL would not be visited again, so I used a ConcurrentHashMap.
If you insist on doing it using only the Java concurrency framework, then ConcurrentHashMap may be the way to go. The interesting method is ConcurrentHashMap.putIfAbsent; it will give you very good efficiency, and the idea of how to use it is:
You will have some "multithreaded source of incoming URL addresses" from crawled pages - you can use a concurrent queue to store them, or just create an ExecutorService with an (unbounded?) queue into which you place Runnables that will crawl the URLs.
Inside the crawling Runnables you should have a reference to this common ConcurrentHashMap of already crawled pages, and at the very beginning of the run method do:
private final ConcurrentHashMap<String, Long> crawledPages = new ConcurrentHashMap<String, Long>();
...
private class Crawler implements Runnable {
    private final String urlToBeCrawled;

    public Crawler(String urlToBeCrawled) { // constructor, no return type
        this.urlToBeCrawled = urlToBeCrawled;
    }

    public void run() {
        if (crawledPages.putIfAbsent(urlToBeCrawled, System.currentTimeMillis()) == null) {
            doCrawlPage(urlToBeCrawled);
        }
    }
}
If crawledPages.putIfAbsent(urlToBeCrawled, timestamp) returns null, you know that this page has not been crawled by anyone yet; since this method puts the value atomically, you can proceed with crawling this page - you're the lucky thread. If it returns a non-null value, then someone has already taken care of this URL, so your Runnable should finish, and the thread goes back to the pool to be used by the next Runnable.
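For completeness, a hedged sketch of feeding those Crawler runnables to an ExecutorService, as described above; it assumes the pool lives in the same class as crawledPages and Crawler, and the pool size and enqueue method are arbitrary illustrative choices:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A pool with an unbounded internal work queue; every discovered link becomes
// a Crawler task, and putIfAbsent() inside run() filters out duplicates.
private final ExecutorService pool = Executors.newFixedThreadPool(8);

public void enqueue(String url) {
    pool.submit(new Crawler(url));
}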
You can use a ConcurrentHashMap to detect duplicate URLs. ConcurrentHashMap also uses lock striping (several internal locks) instead of a single global lock.
Or you can use your own implementation where you split all your data among different keys.
For example, using the Guava API:
Striped<ReadWriteLock> rwLockStripes = Striped.readWriteLock(10);
String key = "taskA";
Lock writeLock = rwLockStripes.get(key).writeLock();
writeLock.lock();
try {
    // ... critical section for this key ...
} finally {
    writeLock.unlock();
}
ConcurrentHashMap example
private Set<String> urls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
For a crawler, don't use a ConcurrentHashMap; rather, use a database
The number of visited URLs will grow very fast, so it is not a good idea to store them in memory. Better to use a database: store the URL and the date it was last crawled, then just check whether the URL already exists in the DB or is eligible for refreshing. I use, for example, a Derby DB in embedded mode, and it works perfectly for my web crawler. I don't advise using an in-memory DB like H2, because with the number of crawled pages you will eventually get an OutOfMemoryError.
You will rather rarely have the case of crawling the same page more than once at the same time, so checking in the DB whether it was already crawled recently is enough to avoid wasting significant resources on "re-crawling the same pages over and over". I believe this is "a good solution that's not too dense and not too simplistic".
Also, by storing a "last visit date" for each URL in the database, you can stop and continue the work whenever you want; with a ConcurrentHashMap you will lose all the results when the app exits. The "last visit date" also lets you determine whether a URL needs recrawling or not.
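A minimal sketch of that database-backed check, assuming an embedded Derby connection and a pages(url, last_crawled) table; the table name, column names, and refresh interval are illustrative assumptions, not from the answer above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class VisitedUrlStore {
    private final Connection conn; // e.g. jdbc:derby:crawldb;create=true
    private final long refreshMillis = 24L * 60 * 60 * 1000; // recrawl after one day (arbitrary)

    public VisitedUrlStore(Connection conn) {
        this.conn = conn;
    }

    /** Returns true if the URL was never crawled, or its last crawl is older than refreshMillis. */
    public boolean needsCrawling(String url) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT last_crawled FROM pages WHERE url = ?")) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return true; // never seen before
                }
                Timestamp last = rs.getTimestamp(1);
                return System.currentTimeMillis() - last.getTime() > refreshMillis;
            }
        }
    }

    /** Records (or refreshes) the last-crawled timestamp for a URL. */
    public void markCrawled(String url) throws SQLException {
        try (PreparedStatement upd =
                 conn.prepareStatement("UPDATE pages SET last_crawled = ? WHERE url = ?")) {
            upd.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
            upd.setString(2, url);
            if (upd.executeUpdate() == 0) {
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO pages (url, last_crawled) VALUES (?, ?)")) {
                    ins.setString(1, url);
                    ins.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                    ins.executeUpdate();
                }
            }
        }
    }
}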
If multiple requests are handled by a server running a single servlet, where do we need to take care of synchronization?
I got the answer to how multiple requests are handled from How does a single servlet handle multiple requests from client side. But then again the question arises: why do we need synchronization if all requests are handled separately?
Can you give some real-life example of how shared state works and how a servlet can depend on it? I am not much interested in code, but am looking for an explanation with an example from a portal application - for instance, if there is a login page, how is it accessed by n users concurrently?
If more than one request is handled by the server - what I read is that the server makes a thread pool of n threads to serve the requests, and I guess each thread will have its own set of parameters to maintain the session - is there any chance that two or more threads (i.e. two or more requests) can collide with each other?
Synchronization is required when multiple threads are modifying a shared resource.
So, when all your servlets are independent of each other, you don't worry about the fact that they run in parallel.
But, if they work on "shared state" somehow (for example by reading/writing values into some sort of centralized data store), then you have to make sure that things don't go wrong. Of course, the layer/form in which you provide the necessary synchronization to your application depends on your exact setup.
Yes, my answer is pretty generic; but so is your question.
Synchronization in Java is only needed if a shared object is mutable. If your shared object is either read-only or immutable, then you don't need synchronization, despite running multiple threads. The same is true of what the threads are doing with an object: if all the threads are only reading its value, then you don't require synchronization in Java.
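As a hedged illustration of that point (the class here is made up, not from the question): a shared object whose fields are final and never change after construction can be read from any number of request threads without locks.

// Immutable: all fields are final and set once in the constructor,
// so concurrent reads from many request threads need no synchronization.
public final class SiteConfig {
    private final String siteName;
    private final int maxUploadMb;

    public SiteConfig(String siteName, int maxUploadMb) {
        this.siteName = siteName;
        this.maxUploadMb = maxUploadMb;
    }

    public String getSiteName() { return siteName; }
    public int getMaxUploadMb() { return maxUploadMb; }
}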
Basically, if your servlet application is multi-threaded, then data associated with the servlet will not be thread-safe. The common example given in many textbooks is something like a hit counter, stored as a private variable:
e.g.
public class YourServlet implements Servlet {
    private int counter;

    public void service(ServletRequest req, ServletResponse res) {
        // this is not thread safe
        counter++;
    }
}
This is because the service method of the Servlet is operated on by multiple threads, one per incoming HTTP request. The increment operator has to first read the current value, add one, and then write the value back. Another thread doing the same operation concurrently may increment the value after the first thread has read it, but before it is written back, resulting in a lost write.
So in this case you should use synchronisation, or even better, the AtomicInteger class included as part of Java Concurrency from 1.5 onwards.
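A minimal sketch of the AtomicInteger variant of the same counter, mirroring the skeletal servlet from the example above (the other Servlet interface methods are omitted for brevity):

import java.util.concurrent.atomic.AtomicInteger;

public class YourServlet implements Servlet {
    // incrementAndGet() is an atomic read-modify-write, so no increments are lost
    private final AtomicInteger counter = new AtomicInteger();

    public void service(ServletRequest req, ServletResponse res) {
        counter.incrementAndGet();
    }
}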
I have reviewed a sample of a chat server with Node.js and Socket.IO at http://ahoj.io/nodejs-and-websocket-simple-chat-tutorial. In that sample a simple history variable was used on the server to save chat history data. As Node.js is single-threaded, everything works fine. (You can ignore the Node.js example above if you are not interested in it :) I will explain it in Java below.)
Consider the servlet below, which gets a message String from the request and appends it to a String. This code could be an example of a chat server: it gets user messages from requests and appends them to a history String that other clients can read.
public class ChatServlet implements Servlet {
    private static String history = "";

    public void service(ServletRequest request, ServletResponse response) {
        history = history.concat(request.getParameter("message"));
    }
}
Theoretically, this code is not thread-safe, as it uses a global static variable (How do servlets work? Instantiation, sessions, shared variables and multithreading).
However, I have tested the above code with JMeter with lots of concurrent requests, and the history string always stored all the messages (so no client message was lost or overwritten); nothing went wrong!
I have not worked with threads, so I wonder if I am missing something here. Is the above code thread-safe, and can it be trusted?
As others have confirmed, this is indeed not thread-safe in that it cannot be trusted. Some quirk in JVM implementation may make this a workable servlet, but there is no guarantee that it will work at another JVM or even at another time.
To add to the variety of proposed implementations, here's one with AtomicReference:
private final AtomicReference<String> history = new AtomicReference<>("");

public void service(ServletRequest request, ServletResponse response) {
    history.updateAndGet(h -> h.concat(request.getParameter("message")));
}
No, it's not. Thread safety bugs can be difficult to trigger - maybe your program will miss one message in a billion, or maybe it will never miss a message by coincidence. If it was thread safe, though, it would be guaranteed to never happen.
You could simply use a synchronized block to ensure that only one thread accesses history at a time, like this:
synchronized(ChatServlet.class) {
history = history.concat(request.getParameter("message"));
}
This means: lock ChatServlet.class, add the message to the history, then unlock ChatServlet.class.
You can never have two threads lock the same object at the same time - if they try, one of them will proceed, and the rest will wait around for the first one to unlock the object (and then another one will proceed, and the rest will wait for it to unlock the object, and so on).
Also make sure to only read history inside a synchronized(ChatServlet.class) block - otherwise, it's not guaranteed that the reading thread will see the latest updates.
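For example, a read of the history could look like this (a sketch; the helper method name is hypothetical):

// Reading must hold the same lock as the writes, otherwise the Java memory
// model does not guarantee this thread sees the latest concatenations.
private String readHistory() {
    synchronized (ChatServlet.class) {
        return history;
    }
}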
It isn't thread-safe. Code that isn't thread-safe isn't guaranteed to fail, but it's not guaranteed to work either.
I'd like to multithread my GAE servlets so that the same servlet on the same instance can handle up to 10 (on frontend instances I believe the max number of threads is 10) concurrent requests from different users at the same time, timeslicing between each of them.
public class MyServlet extends HttpServlet {
    private ExecutorService executor;

    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        if (executor == null) {
            ThreadFactory threadFactory = ThreadManager.currentRequestThreadFactory();
            executor = Executors.newCachedThreadPool(threadFactory);
        }
        Future<MyResult> result = executor.submit(new MyTask(request));
        writeResponseAndReturn(response, result);
    }
}
So basically when GAE starts up, the first time it gets a request to this servlet, an Executor is created and then saved. Then each new servlet request uses that executor to spawn a new thread. Obviously everything inside MyTask must be thread-safe.
What I'm concerned about is whether or not this truly does what I'm hoping it does. That is, does this code create a non-blocking servlet that can handle multiple requests from multiple users at the same time? If not, why and what do I need to do to fix it? And, in general, is there anything else that a GAE maestro can spot that is dead wrong? Thanks in advance.
I don't think your code would work.
The doGet method is running in threads managed by the servlet container. When a request comes in, a servlet thread is occupied, and it will not be released until the doGet method returns. In your code, executor.submit returns a Future object. To get the actual result you need to invoke the get method on the Future object, and it will block until MyTask finishes its task. Only after that does doGet return and new requests can kick in.
I am not familiar with GAE, but according to their docs, you can declare your servlet as thread-safe and then the container will dispatch multiple requests to each web server in parallel:
<!-- in appengine-web.xml -->
<threadsafe>true</threadsafe>
You implicitly asked two questions, so let me answer both:
1. How can I get my AppEngine Instance to handle multiple concurrent requests?
You really only need to do two things:
Add the statement <threadsafe>true</threadsafe> to your appengine-web.xml file, which you can find in the war\WEB-INF folder.
Make sure that the code inside all your request handlers is actually thread-safe, i.e. use only local variables in your doGet(...), doPost(...), etc. methods or make sure you synchronize all access to class or global variables.
This will tell the AppEngine instance server framework that your code is thread-safe and that you are allowing it to call all of your request handlers multiple times in different threads to handle several requests at the same time. Note: AFAIK, it is not possible to set this on a per-servlet basis. So, ALL your servlets need to be thread-safe!
So, in essence, the executor-code you posted is already included in the server code of each AppEngine instance, and actually calls your doGet(...) method from inside the run method of a separate thread that AppEngine creates (or reuses) for each request. Basically doGet() already is your MyTask().
The relevant part of the Docs is here (although it doesn't really say much): https://developers.google.com/appengine/docs/java/config/appconfig#Using_Concurrent_Requests
2. Is the posted code useful for this (or any other) purpose?
AppEngine in its current form does not allow you to create and use your own threads to accept requests. It only allows you to create threads inside your doGet(...) handler, using the currentRequestThreadFactory() method you mentioned, but only to do parallel processing for this one request and not to accept a second one in parallel (this happens outside doGet()).
The name currentRequestThreadFactory() might be a little misleading here. It does not mean that it will return the current Factory of RequestThreads, i.e. threads that handle requests. It means that it returns a Factory that can create Threads inside the currentRequest. So, unfortunately it is actually not even allowed to use the returned ThreadFactory beyond the scope of the current doGet() execution, like you are suggesting by creating an Executor based on it and keeping it around in a class variable.
For frontend instances, any threads you create inside a doGet() call will get terminated immediately when your doGet() method returns. For backend instances, you are allowed to create threads that keep running, but since you are not allowed to open server sockets for accepting requests inside these threads, these will still not allow you to manage the request handling yourself.
You can find more details on what you can and cannot do inside an appengine servlet here:
The Java Servlet Environment - The Sandbox (specifically the Threads section)
For completeness, let's see how your code can be made "legal":
The following should work, but it won't make a difference in terms of your code being able to handle multiple requests in parallel. That will be determined solely by the <threadsafe>true</threadsafe> setting in your appengine-web.xml. So, technically, this code is just really inefficient and splits an essentially linear program flow across two threads. But here it is anyway:
public class MyServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        ThreadFactory threadFactory = ThreadManager.currentRequestThreadFactory();
        ExecutorService executor = Executors.newCachedThreadPool(threadFactory);
        Future<MyResult> result = executor.submit(new MyTask(request)); // Fires off request handling in a separate thread
        writeResponse(response, result.get()); // Waits for the thread to complete and builds the response. After that, doGet() returns
    }
}
Since you are already inside a separate thread that is specific to the request you are currently handling, you should definitely save yourself the "thread inside a thread" and simply do this instead:
public class MyServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) {
        writeResponse(response, new MyTask(request).call()); // Delegate request handling to MyTask in the current thread and write out the returned response
    }
}
Or, even better, just move the code from MyTask.call() into the doGet() method. ;)
Aside - Regarding the limit of 10 simultaneous servlet threads you mentioned:
This is a (temporary) design-decision that allows Google to control the load on their servers more easily (specifically the memory use of servlets).
You can find more discussion on those issues here:
Issue 7927: Allow configurable limit of concurrent requests per instance
Dynamic Backend Instance Scaling
If your bill shoots up due to increased latency, you may not be refunded the charges incurred
This topic has been bugging the heck out of me, too, since I am a strong believer in ultra-lean servlet code, so my usual servlets could easily handle hundreds, if not thousands, of concurrent requests. Having to pay for more instances due to this arbitrary limit of 10 threads per instance is a little annoying to me to say the least. But reading over the links I posted above, it sounds like they are aware of this and are working on a better solution. So, let's see what announcements Google I/O 2013 will bring in May... :)
I second the assessments of ericson and Markus A.
If however, for some reason (or for some other scenario) you want to follow the path that uses your code snippet as a starting point, I'd suggest that you change your executor definition to:
private static Executor executor;
so that it is shared across servlet instances.
I want to create a servlet method like the one below. In this method I want to perform some data download, so when a request for a data download comes in, I just do the download. If a download is already going on, I somehow want the second request to wait until the first thread is done with the download. Once the first thread is done, the second thread can start automatically.
DoTheDownloadAction(){
}
How can I achieve the above requirement?
Assuming you have a DownloadHelper class and you have created one instance of it in your servlet, you can do something like this:
DoTheDownloadAction() {
    synchronized (downloadHelper) {
        // Downloading something
    }
}
Let's imagine you have a button called "download" with id="download" in your JSP, and you have this code in your JavaScript:
var globalDownloadStatus = false;
jQuery(document).ready(function() {
    jQuery('#download').click(function() {
        if (globalDownloadStatus == true) {
            alert('download already in progress, please wait');
            return;
        }
        globalDownloadStatus = true;
        jQuery.get('yourservletpath', function(data) {
            globalDownloadStatus = false;
            alert('Download Complete');
        });
    });
});
Sounds like the perfect candidate for a semaphore, or (depending on the complexity and the downstream effects) the simpler way to effect the same change would be to synchronize the download code on a relevant key for your application.
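A minimal sketch of the semaphore approach, assuming the whole download should be serialized with a single permit; the class, field, and method names are illustrative, not from the question:

import java.util.concurrent.Semaphore;

public class DownloadAction {
    // One permit: only one download at a time; fair=true so waiting requests proceed in arrival order.
    private static final Semaphore downloadPermit = new Semaphore(1, true);

    public void doTheDownload() throws InterruptedException {
        downloadPermit.acquire(); // a second request blocks here until the first releases
        try {
            // ... perform the actual download ...
        } finally {
            downloadPermit.release();
        }
    }
}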
Take into account that web servers are often distributed for scalability. In that case the appropriate solution is usually to synchronize via database locks. However, for you it may be enough to use the Java synchronized keyword on the object you want to wait on.
Also, you are asking for a pessimistic lock, which is usually a questionable architectural choice.