HashMap vs Tomcat app server - java

I was asked this in an interview: are you using HashMap or Hashtable in your current project?
My answer: I said I use HashMap, not Hashtable, because it is not a multithreaded environment (the project does not do any multi-threaded processing).
Q: Tomcat creates multiple threads for request processing, so why are you using HashMap?
My answer:
It creates multiple threads, and each thread has its own thread stack memory for keeping those objects and processing the requests.
Was my answer correct? If not, please correct it for me.

It depends on context.
If you have some shared data structure that is used between requests, then yes, you'd need some kind of synchronization. You might want to consider a java.util.concurrent.ConcurrentHashMap, however, which offers lower-contention reading than a Hashtable.
You are right though that if you create the structure inside a request, and do not share it between threads / requests, a HashMap would be fine.
Just to flesh this out, to reply to a comment:
Imagine you are writing an endpoint that accepts an array of key/value pairs. If this endpoint repeatedly needs to refer to these request values according to the key, but the values aren't needed by any other request, you may wish to put them into a HashMap. If the server services n concurrent requests to the same endpoint, it would run the handler method on n threads, each with its own stack (as you pointed out) and its own copy of the HashMap. Importantly, each instance of the HashMap will never have to deal with concurrent access from multiple threads.
Now imagine a second scenario where a site wants to stop users from trying to log in too often. You could use a dictionary in the application context that stores counts of each user's login activity, to try to find out whether an account is being attacked (by the way, this is illustrative - don't implement this scenario in this way). In this case, n simultaneous requests would all be updating the dictionary at the same time. If multiple threads attempt to add new keys to a plain HashMap at the same time, this could corrupt the map's internal state and take the application down.
Your comment below refers to application / session contexts. The session is still shared; even though it belongs to one user, that user could make multiple concurrent requests to the server, which all update the same HashMap, e.g. their shopping cart.
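To make the shared-map case concrete, here is a minimal sketch of the login-counting scenario using ConcurrentHashMap. The class name LoginAttemptCounter, its methods, and the user name "alice" are made up for illustration, and the plain threads stand in for Tomcat's request threads. It only shows the choice of map; as noted above, it is not a recommended way to do rate limiting.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical class: an application-scoped counter shared by all request threads.
class LoginAttemptCounter {
    private final ConcurrentMap<String, Integer> attempts = new ConcurrentHashMap<>();

    // merge() does the read-modify-write atomically for the given key.
    int recordAttempt(String user) {
        return attempts.merge(user, 1, Integer::sum);
    }

    int attemptsFor(String user) {
        return attempts.getOrDefault(user, 0);
    }

    // Simulates `threads` concurrent requests, each recording `perThread` attempts.
    static int simulate(int threads, int perThread) {
        LoginAttemptCounter counter = new LoginAttemptCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.recordAttempt("alice");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.attemptsFor("alice");
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, the same simulation may lose updates or damage the map, which is the difference the interviewer was probing for.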

Related

If multiple requests are handled by a server to run a single servlet, where do we need to take care of synchronization?

If multiple requests are handled by a server to run a single servlet, where do we need to take care of synchronization?
I learned how multiple requests are handled from "How does a single servlet handle multiple requests from client side". But that raises another question: why do we need synchronization if all requests are handled separately?
Can you give a real-life example of how shared state works and how a servlet can be dependent on it? I am not much interested in code; I am looking for an explanation with an example from any portal application. For instance, if there is a login page, how is it accessed by n users concurrently?
If more than one request is handled by the server (from what I read, the server makes a thread pool of n threads to serve the requests, and I guess each thread has its own set of parameters to maintain the session), is there any chance that two or more threads (meaning two or more requests) can collide with each other?
Synchronization is required when multiple threads are modifying a shared resource.
So, when all your servlets are independent of each other, you don't need to worry about the fact that they run in parallel.
But if they work on "shared state" somehow (for example by reading/writing values in some sort of centralized data store), then you have to make sure that things don't go wrong. Of course, the layer and form in which you provide the necessary synchronization depend on your exact setup.
Yes, my answer is pretty generic; but so is your question.
Synchronization in Java is only needed if the shared object is mutable. If your shared object is read-only or immutable, you don't need synchronization, despite running multiple threads. The same is true of what the threads do with an object: if all the threads only read its value, you don't require synchronization in Java.
Basically, if your servlet application is multi-threaded, then data associated with the servlet will not be thread safe. The common example given in many textbooks is something like a hit counter, stored as a private variable:
e.g.
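import javax.servlet.GenericServlet;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class YourServlet extends GenericServlet {
    // one field shared by every request thread, since the container reuses the servlet instance
    private int counter;

    @Override
    public void service(ServletRequest req, ServletResponse res) {
        // this is not thread safe: counter++ is a read, an add and a write
        counter++;
    }
}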
This is because the service method of the one servlet instance is executed by multiple threads, one per incoming HTTP request. The unary increment operator first has to read the current value, add one, and then write the value back. Another thread doing the same operation concurrently may increment the value after the first thread has read it but before it is written back, thus resulting in a lost write.
So in this case you should use synchronisation or, even better, the AtomicInteger class included as part of Java Concurrency from 1.5 onwards.
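A minimal sketch of the AtomicInteger variant follows. To keep it runnable without a servlet container, the counter lives in a made-up HitCounter class and plain threads play the role of request threads; in a real servlet the AtomicInteger would simply replace the int field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the servlet's counter field.
class HitCounter {
    private final AtomicInteger counter = new AtomicInteger();

    // incrementAndGet() is a single atomic read-modify-write, so no update is lost.
    int hit() {
        return counter.incrementAndGet();
    }

    int current() {
        return counter.get();
    }

    // Simulates `threads` request threads, each registering `hitsPerThread` hits.
    static int simulate(int threads, int hitsPerThread) {
        HitCounter hits = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < hitsPerThread; j++) {
                    hits.hit();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return hits.current();
    }
}
```

HitCounter.simulate(8, 10000) always ends at 80000, whereas the same loop over a plain int field with counter++ can end lower because of lost writes.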

Large number of single threaded task queues

At our company we have a server which is distributed across a few instances. The server handles user requests. Requests from different users can be processed in parallel. Requests from the same user must be executed strictly sequentially, but they can arrive at different instances due to load balancing. Currently we use Redis-based distributed locks, but this is error-prone and requires more work on concurrency than on business logic.
What I want is something like this (more like a concept):
- A distinct queue for each user
- Each queue is named after the user id
- Each request is identified by a request id
Imagine two requests from the same user arriving at two different instances concurrently:
1. Each instance puts its request id into this user's queue.
2. Additionally, they both store their request ids locally.
3. Then some broker takes the request id from the top of "some_user_queue" and moves it into "some_user_queue_processing".
4. Both instances listen for "some_user_queue_processing". They peek into it and see if this is the request id they stored locally. If yes, they do the processing. If not, they ignore it and wait.
5. When the work is done, the server deletes this id from "some_user_queue_processing".
6. Then go back to step 3.
And all of this happens concurrently for a lot (thousands of them) of different users (and their queues).
Now, I know this sounds a lot like actors, but:
- We need a solution that requires as few changes as possible, to make a fast transition from locks. Akka would force us to rewrite almost everything from scratch.
- We need a production-ready solution. Quasar sounds good, but is not production ready yet (more precisely, its Galaxy cluster is not).
- The managers at my work are very conservative; they simply don't want another dependency which we'll need to support. But we already use Redis (for distributed locks), so I thought maybe it could help with this too.
Thanks
The best solution that matches the description of your problem is Redis Cluster.
Basically, the cluster solves your concurrency problem, in the following way:
Two (or more) requests from the same user will always go to the same Redis instance, assuming that you use the user id as the key and the request as the value. The value must actually be a list of requests. When you receive one, you append it to that list. In other words, that is your queue of requests (a single one for every user).
This mapping is made possible by the design of the cluster implementation. It is based on a range of hash slots spread over all the instances.
When a set command is executed, the cluster performs a hashing operation on the key, which results in a value (the hash slot that we are going to write to), which is located on a specific instance. The cluster finds the instance that owns the right range, and then performs the write.
Likewise, when a get is performed, the cluster follows the same procedure: it finds the instance that contains the key, and then it gets the value.
The transition from locks is very easy to perform, because you only need to have the instances ready (with the cluster-enabled directive set to "yes") and then run the cluster creation command from the redis-trib.rb script.
I worked with the cluster in a production environment last summer, and it behaved very well.
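To illustrate the hashing step, here is a small sketch of how a key maps to a hash slot. It follows the rule from the Redis Cluster specification (CRC16 of the key, XMODEM variant, modulo 16384), but the class and method names are made up, and it deliberately ignores hash tags (the {...} part of a key), so treat it as an illustration and not as a client implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: computes which of the 16384 hash slots a key falls into.
class HashSlot {
    static final int SLOT_COUNT = 16384;

    // Bitwise CRC16, XMODEM variant: polynomial 0x1021, initial value 0.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                if ((crc & 0x8000) != 0) {
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF;
                } else {
                    crc = (crc << 1) & 0xFFFF;
                }
            }
        }
        return crc;
    }

    // Same key, same slot: this is why one user's queue always lives on one node.
    static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }
}
```

The specification gives 0x31C3 as the CRC16 of "123456789", which is a convenient check for the function above. Because the slot depends only on the key, every append to the list stored under a given user id lands on the node that owns that slot.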

How can I preserve order when multiple threads read from a queue

In my app, I receive some user data, put it into an ArrayBlockingQueue, and then put it into a database. Several threads are used for getting the data from the queue and putting it into the database. Then an issue came up.
The database is used to store each user's current status, so the time order of the data is very important. But when using multiple threads to "get and put", the order cannot be guaranteed.
So I came up with an idea, something like "field grouping": for different users' data, multiple threads are fine and the order between them can be ignored, but each user's data must be retrieved by the same thread.
Now the question is, how can I do that?
Is the number of users limited? Then you can simply cache a thread for each user.
import java.util.HashMap;
import java.util.Map;

// thread cache: one thread per user, keyed by the user's primary key
Map<String, Thread> threadCache = new HashMap<String, Thread>();
threadCache.put("primary_key", t);
// when accessing the data, look up the thread that owns this user's work
Thread toRun = threadCache.get(queue.peek());
toRun.start(); // note: a Thread object can only be started once
Otherwise, a Java thread can be named via Thread.setName()/getName(). Use that to identify a thread; reuse is still something you have to handle according to your business logic.
Try using PriorityBlockingQueue<E>. E should be Comparable (or you can supply a Comparator). Implement the comparison so that each user's data is sorted individually by the required attributes. Also, use thread pools instead of managing threads by hand.
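One way to get the "same user, same thread" behaviour with thread pools is to keep a fixed set of single-thread executors and pick one by hashing the user id. This is only a sketch under assumptions: PerUserDispatcher and its methods are made-up names, the list append stands in for the database write, and per-user order is only as good as the order in which tasks are submitted (here a single producer).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher: each "lane" is one thread with its own FIFO queue.
class PerUserDispatcher {
    private final ExecutorService[] lanes;

    PerUserDispatcher(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same user id always hashes to the same lane, so its tasks run in submission order.
    void submit(String userId, Runnable task) {
        lanes[Math.floorMod(userId.hashCode(), lanes.length)].execute(task);
    }

    void shutdownAndWait() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                lane.awaitTermination(30, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Demo: interleaves updates of several users and records what each user's "row" received.
    static Map<String, List<Integer>> demo(int users, int updatesPerUser) {
        PerUserDispatcher dispatcher = new PerUserDispatcher(4);
        Map<String, List<Integer>> written = new ConcurrentHashMap<>();
        for (int seq = 0; seq < updatesPerUser; seq++) {
            for (int u = 0; u < users; u++) {
                String userId = "user-" + u;
                int value = seq;
                dispatcher.submit(userId, () ->
                        written.computeIfAbsent(userId, k -> new CopyOnWriteArrayList<>()).add(value));
            }
        }
        dispatcher.shutdownAndWait();
        return written;
    }
}
```

Different users still run in parallel across lanes, while each user's updates are applied in the order they were submitted. Unlike one thread per user, the number of threads stays fixed no matter how many users there are.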

Is Google App Engine single threaded? (Java)

My question is: is Google App Engine single threaded? Now, when I ask that, I know that I cannot start my own threads by using threading in Java. But we can start threads using a backend.
I am concerned about threading with respect to how requests are handled. I read somewhere that in App Engine each request is queued and then served one by one, and that I can configure the maximum time for which a request can be queued. If the time to serve a request exceeds that maximum, a new instance is created.
So what if I want to use a single instance (free quota)?
If I get multiple requests r1, r2, r3, r4 (in this order), will the requests be served one after another (in the case of a single instance)?
If multiple instances are created when the load increases, and a new instance is created dynamically, will the data that is present in the main memory of instance one be cloned to the new instance too?
Will the data in the two instances be in sync all the time?
Agree with what Nick said, but also want to point out that this statement:
"Now when i ask that i know that i cannot start my own threads by using threading in java"
is no longer true. For more details, see the section about threads here:
https://developers.google.com/appengine/docs/java/runtime#The_Sandbox
So, in summary, App Engine is multi-threaded in a couple of ways:
- requests can be handled concurrently by a single instance using a thread per request
- a single request may explicitly start additional threads
As stated in the docs, you can enable concurrent requests on your Java app, in which case multiple threads will be spawned, each of which handles requests independently.
Instances are not cloned off already running instances, nor are they synchronized in any way - you are expected to write your code in a manner that doesn't depend on specific mutable instance state.

Fastest architecture for multithreaded web crawler

There should be a frontier object, holding the set of visited URLs and the set of URLs waiting to be crawled.
There should be some threads responsible for crawling web pages.
There would also be some kind of controller object to create the crawling threads.
I don't know which architecture would be faster and easier to extend. How should I divide responsibilities to need as little synchronization as possible, and also to minimize the number of checks for whether the current URL has already been visited?
Should the controller object be responsible for providing new URLs to the worker threads? This means the worker threads would need to crawl all the given URLs and then sleep for an undefined time. The controller would be interrupting these threads, so a crawling thread would have to handle InterruptedException (how expensive is that in Java? It seems that exception handling is not very fast).
Or maybe the controller should only start the threads and let the crawling threads fetch from the frontier themselves?
Create a shared, thread-safe list with the URLs to be crawled. Create an Executor with the number of threads corresponding to the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list and does whatever you need it to do, looping until the list is empty.
It's been a few years since this question was asked, but as of Nov 2015 we are using frontera and scrapyd.
Scrapy uses Twisted, an event-driven networking framework, which makes it a good concurrent crawler, and on multi-core machines that means we are only limited by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.
Create a central resource with a hash map that can store URL as key with last time scanned. Make this thread safe. Then just spawn threads with links in a queue which can be picked up by the crawlers as starting point. Each thread would then carry on crawling and updating the resource. A thread in the resource clears up outdated crawls. The in memory resource can be serialised at start or it could be in a db depending on your app needs.
You could make this resource accessible via remote services to allow multiple machines. You could make the resource itself spread over several machines by segregating urls. Etc...
You should use a blocking queue that contains the URLs that need to be fetched. Then you can create multiple consumers that fetch URLs in multiple threads. If the queue is empty, all fetchers will block. With this design you start all the threads at the beginning and do not need to control them later.
Also you need to maintain a list of already downloaded pages in some persistent storage and check before adding to the queue.
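A minimal sketch of the blocking-queue design is below. Several things in it are assumptions made to keep it self-contained: CrawlerSketch is a made-up name, the links map is an in-memory stand-in for fetching a page and extracting its links, the visited set is kept in memory instead of in persistent storage, and a counter of pending URLs is added so the workers know when to stop.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CrawlerSketch {
    // `links` is a hypothetical replacement for real HTTP fetching and link extraction.
    static Set<String> crawl(Map<String, List<String>> links, String seed, int workers) {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(); // URLs queued or currently being processed

        visited.add(seed);
        pending.incrementAndGet();
        frontier.add(seed);

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (pending.get() > 0) {
                        // Block briefly when the frontier is empty, then re-check for completion.
                        String url = frontier.poll(50, TimeUnit.MILLISECONDS);
                        if (url == null) {
                            continue;
                        }
                        for (String next : links.getOrDefault(url, Collections.emptyList())) {
                            // Set.add is an atomic check-and-mark: only one thread enqueues a URL.
                            if (visited.add(next)) {
                                pending.incrementAndGet();
                                frontier.add(next);
                            }
                        }
                        pending.decrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return visited;
    }
}
```

The only shared state is the queue, the visited set and the counter, all of which are thread-safe on their own, so the workers need no explicit locks and the controller never has to interrupt them.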
If you don't want to re-invent the wheel, why not look at Apache Nutch?
