My question is: is Google App Engine single-threaded? I ask this knowing that I cannot start my own threads using Java's threading APIs, although threads can be started on a backend.
My concern is how threading relates to the way requests are handled. I read somewhere that in App Engine each request is queued and then served one by one, and that I can configure the maximum time a request may wait in the queue. If the time to serve a request exceeds that maximum, a new instance is created.
So what if I want to use a single instance (free quota)?
If I get multiple requests r1, r2, r3, r4 (in this order), will each request be served one after the other (in the case of a single instance)?
If the load increases and a new instance is created dynamically, will the data held in the main memory of the first instance be cloned to the new instance too?
Will the data in the two instances be in sync all the time?
Agree with what Nick said, but also want to point out that this statement:
"Now when i ask that i know that i cannot start my own threads by using threading in java"
is no longer true. For more details, see the section about threads here:
https://developers.google.com/appengine/docs/java/runtime#The_Sandbox
So, in summary, App Engine is multi-threaded in a couple of ways:
- requests can be handled concurrently by a single instance using a thread per request
- a single request may explicitly start additional threads
As stated in the docs, you can enable concurrent requests on your Java app, in which case multiple threads will be spawned, each of which handles requests independently.
Instances are not cloned off already running instances, nor are they synchronized in any way - you are expected to write your code in a manner that doesn't depend on specific mutable instance state.
Related
I was asked this question in an interview: did you use HashMap or Hashtable in your current project?
My answer: I said I have used HashMap, not Hashtable, because it is not a multithreaded environment (the project does not do any multi-threaded processing).
Q: Tomcat creates multiple threads for request processing, so why are you using HashMap?
My answer:
It creates multiple threads, and each thread has its own stack memory for holding those objects and processing the requests.
Was my answer correct? If not, please correct it.
It depends on context.
If you have some shared data structure that is used between requests, then yes, you'd need some kind of synchronization. However, you might want to consider a java.util.concurrent.ConcurrentHashMap, which offers lower-contention reading than a Hashtable.
You are right though that if you create the structure inside a request, and do not share it between threads / requests, a HashMap would be fine.
Just to flesh this out, to reply to a comment:
Imagine you are writing an endpoint that accepts an array of key/value pairs. If this endpoint repeatedly needs to look up these request values by key, but the values aren't needed by any other request, you may wish to put them into a HashMap. If the server handles n requests to the same endpoint concurrently, it runs the handler method on n threads, each with its own stack (as you pointed out) and its own HashMap created inside the method. Importantly, each HashMap instance will never have to deal with concurrent access from multiple threads.
Now imagine a second scenario where a site wants to stop users from trying to log in too often. You could use a dictionary in the application context that stores counts of each user's login activity to detect whether an account is being attacked (by the way, this is illustrative - don't implement this scenario in this way). In this case, n simultaneous requests would all be updating the dictionary at the same time. If multiple threads add new keys to a plain HashMap at the same time, the map can be corrupted (updates can be lost, for example), and this could kill the application.
Your comment below refers to application / session contexts. The session is still shared; even though it belongs to one user, that user could make multiple concurrent requests to the server, which all update the same HashMap, e.g. their shopping cart.
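To make the shared-map case concrete, here is a minimal sketch of a counter that many request threads can update safely. The class name LoginAttemptCounter and the simulate() helper are made up for illustration, and (as noted above) this is not a recommended way to implement login throttling; it only shows ConcurrentHashMap.merge doing an atomic increment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical application-scoped counter shared by every request thread. */
class LoginAttemptCounter {
    private final ConcurrentMap<String, Integer> attempts = new ConcurrentHashMap<>();

    /** Atomically increments and returns the attempt count for one user. */
    int recordAttempt(String username) {
        return attempts.merge(username, 1, Integer::sum);
    }

    int attemptsFor(String username) {
        return attempts.getOrDefault(username, 0);
    }

    /** Simulates `threads` request threads, each recording `perThread` attempts for one user. */
    static int simulate(int threads, int perThread) {
        LoginAttemptCounter counter = new LoginAttemptCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.recordAttempt("alice");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.attemptsFor("alice");
    }

    public static void main(String[] args) {
        System.out.println(simulate(8, 1000));
    }
}
```

With a plain HashMap and a get-then-put increment, the same simulation can lose updates; a map created inside the request method does not need any of this.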
I have a Java EE web application. When a particular request comes in (say, for the /xyz URL pattern), I want to do some complex processing, as follows.
Each of the first three steps below is complex and takes time.
1) Get data from one table in the DB. The table holds a huge amount of data and querying takes time.
2) Make a web service call to another web service, A, and get its data.
3) Make another web service call to another web service, B, and get its data.
4) Do some processing using the output of 1, 2, and 3.
1, 2, and 3 are independent of each other, so they can be called in parallel.
Now the questions are:
1) Can I do operations 1, 2, and 3 in three separate threads?
2) Is it advisable to create 3 threads for each request?
3) Should I use thread pooling?
To address your first question, I'll go through the 4 steps:
1) Yes, if the database driver you are using allows concurrent access, i.e. it is safe to use from different threads.
2) A web service is normally designed to deal with different requests at the same time, so this should work as well. The question here is how many threads you want to use (and how long it takes to process one request) and whether the web service will guard itself against too many requests at once.
3) The same applies here.
4) Yes, but you have to do synchronization here, as in: wait until all threads have received their results. You can implement this with a java.util.concurrent.CyclicBarrier.
Second question: That depends on your data and especially on how fast the web services will answer; you should try it out.
Third question: Definitely, that's what they are for. This will also help you to structure your application.
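As a rough illustration of running steps 1 to 3 on a pool and waiting for all of them before step 4, here is a sketch that uses CompletableFuture.join() for the "wait until everything is back" part rather than a CyclicBarrier. The class ParallelSteps and the methods queryTable(), callServiceA(), callServiceB() and combine() are placeholders invented for this example; real code would do the JDBC query and the web service calls there.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelSteps {
    // The pool is passed in so that one pool can be shared by all requests.
    private final ExecutorService pool;

    ParallelSteps(ExecutorService pool) {
        this.pool = pool;
    }

    // Hypothetical stand-ins for the slow DB query and the two web service calls.
    String queryTable() { return "rows"; }
    String callServiceA() { return "A"; }
    String callServiceB() { return "B"; }

    // Step 4: only runs once all three results are available.
    String combine(String rows, String a, String b) {
        return rows + "+" + a + "+" + b;
    }

    String handleRequest() {
        // Steps 1-3 are submitted together and run in parallel on the pool.
        CompletableFuture<String> rows = CompletableFuture.supplyAsync(this::queryTable, pool);
        CompletableFuture<String> a = CompletableFuture.supplyAsync(this::callServiceA, pool);
        CompletableFuture<String> b = CompletableFuture.supplyAsync(this::callServiceB, pool);
        // join() blocks until the result is ready and rethrows a failure
        // as an unchecked CompletionException.
        return combine(rows.join(), a.join(), b.join());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            System.out.println(new ParallelSteps(pool).handleRequest());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size of 3 in main() is arbitrary; in a container you would size (or obtain) the pool according to how many such requests you expect at once.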
1) Can I do operations 1, 2, and 3 in three separate threads?
Yes, you can.
2) Is it advisable to create 3 threads for each request?
As long as these things don't depend on each other, and as long as you're not depending on getting these in the same transaction, then it seems like it should be ok. You will have to handle the case where one or more threads don't succeed, of course. You'll need a separate watchdog thread to cancel the threads if they take too long or if one comes back with a failure.
3) Should I use thread pooling?
Regardless of what else you do, whenever you use threads you should use a pool. That way if there's a problem where threads don't complete or go into some bad state or otherwise become unavailable, you protect your application from running out of threads.
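For the "takes too long" case, one option that avoids writing your own watchdog thread is the timed form of ExecutorService.invokeAll, which cancels any task that has not finished when the timeout expires. The sketch below is only an illustration under that assumption: TimedTasks, runAll() and demo() are invented names, and a null entry is used here to mean "this task timed out or failed".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class TimedTasks {
    /** Runs all tasks on the pool; a null entry means the task timed out or failed. */
    static List<String> runAll(ExecutorService pool, List<Callable<String>> tasks, long timeoutMillis) {
        List<String> results = new ArrayList<>();
        try {
            // Waits up to the timeout, then cancels every task that is still running.
            List<Future<String>> futures = pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<String> future : futures) {
                if (future.isCancelled()) {
                    results.add(null); // did not finish in time
                    continue;
                }
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(null); // the task itself threw an exception
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        }
        return results;
    }

    /** One fast task and one deliberately slow task, with a 500 ms limit. */
    static List<String> demo() {
        Callable<String> fast = () -> "fast";
        Callable<String> slow = () -> {
            Thread.sleep(5_000);
            return "slow";
        };
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return runAll(pool, Arrays.asList(fast, slow), 500);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Cancellation works by interrupting the worker thread, so a task that ignores interruption will keep running even though its Future reports it as cancelled.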
I am confused about the applicability of multi threading in general...
I am creating an application which executes some code which has been saved in xml format. The work is to use apache http client and retrieve some data from websites...More than 1 website can be visited by one block of code in xml...
Now I want that if 2 users have created their own respective codes and saved them in XML, then each user's 'job' (i.e. block of code in XML format) runs in a separate thread.
I have with me code to execute one user's code...Now I want that multiple persons' code can be run in parallel. But I have some doubts--
(1) The Apache HTTP client provides a way of multithreaded communication, currently I am simply using the default HTTP client- this same client can be made to visit multiple websites, one after the other- as per code block in xml. Am I correct in thinking that I do not need to change my code so that it uses the recommended multithreaded communication?
(2) I am thinking of creating a servlet that, when invoked, executes one block of XML code. So to execute 2 blocks of code as given by 2 different users, I will have to invoke this servlet twice. I am going to deploy this application using Amazon Elastic Beanstalk, so what I am confused about is: do I need to use multithreading at all in my program? Can I not simply invoke the existing code (which is used to execute one block of code at a time) from the servlet? And I do want to keep processing of the different blocks of XML code separate from each other, so I don't think I should use multithreading here. Am I correct in my assumption?
Running it one after the other, as per your 1st option, will not be considered 'concurrent'.
Coming to the servlet method: the way you describe it will work concurrently, but you also need to think about how many users there will be concurrently. Since there would be a separate request for each user, there would be some network latency involved for multiple calls. You need to think about all these factors before going ahead with this option.
Since you have the code for one user's job, you can define a thread class which has userid as an attribute. In the run() method, call the code for that particular user's job.
Now create two threads, set the appropriate userid for each thread, and start them.
If there are many users, you can look at using Java's ThreadPoolExecutor.
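A minimal sketch of that idea, written as a Runnable submitted to a fixed-size pool instead of a Thread subclass. UserJob, JobRunner and the `finished` set are names made up for this example; the body of run() is where your existing "execute one user's XML block" code would be called.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** One user's job; the user id is carried as a field, as suggested above. */
class UserJob implements Runnable {
    private final String userId;
    private final Set<String> finished;

    UserJob(String userId, Set<String> finished) {
        this.userId = userId;
        this.finished = finished;
    }

    @Override
    public void run() {
        // Placeholder: call the existing code that executes this user's XML block here.
        finished.add(userId);
    }
}

class JobRunner {
    /** Runs one job per user on a fixed-size pool and returns the ids that completed. */
    static Set<String> runJobs(List<String> userIds, int poolSize) {
        Set<String> finished = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (String userId : userIds) {
            pool.execute(new UserJob(userId, finished));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new TreeSet<>(finished);
    }

    public static void main(String[] args) {
        System.out.println(runJobs(Arrays.asList("user-1", "user-2"), 2));
    }
}
```

Executors.newFixedThreadPool returns a ThreadPoolExecutor configured with a fixed number of threads, so jobs beyond the pool size simply wait in its queue.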
Since you are going to use a servlet container, it is going to manage multithreading for you. Every servlet request will be executed in a different thread. In that scenario, one servlet call would execute one block of code from the provided XML in a single-threaded manner. If several sites are declared per block of code, they would be visited serially. Another user may call the same server at the same time with another block of code, which would run in parallel with the first one.
There should be a frontier object holding the set of visited URLs and the URLs waiting to be crawled.
There should be some threads responsible for crawling web pages.
There would also be some kind of controller object to create the crawling threads.
I don't know which architecture would be faster and easier to extend: how should I divide the responsibilities so as to need as little synchronization as possible, and also minimize the number of checks for whether the current URL has already been visited?
Should the controller object be responsible for providing new URLs to the worker threads? This means the worker threads will need to crawl all the given URLs and then sleep for an undefined time. The controller will be interrupting these threads, so each crawling thread should handle InterruptedException (how expensive is that in Java? It seems that exception handling is not very fast).
Or maybe the controller should only start the threads and let the crawling threads fetch from the frontier themselves?
Create a shared, thread-safe list of the URLs to be crawled. Create an Executor with as many threads as the number of crawlers you want to run concurrently. Start your crawlers as Runnables that hold a reference to the shared list, and submit each of them to the Executor. Each crawler removes the next URL from the list and does whatever you need it to do, looping until the list is empty.
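Roughly, that could look like the sketch below. It assumes a ConcurrentLinkedQueue as the shared thread-safe list, and QueueCrawler, crawl() and fetch() are names invented for the example; fetch() is a placeholder where the real download and parsing would go.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class QueueCrawler {
    /** Placeholder for the real work (HTTP fetch, parsing, storing). */
    static String fetch(String url) {
        return url;
    }

    /** Starts `crawlers` workers that drain the shared queue until it is empty. */
    static Set<String> crawl(Collection<String> urls, int crawlers) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(urls);
        Set<String> done = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(crawlers);
        for (int i = 0; i < crawlers; i++) {
            pool.execute(() -> {
                String url;
                // poll() removes the next URL atomically, or returns null when none are left.
                while ((url = pending.poll()) != null) {
                    done.add(fetch(url));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("crawl did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return done;
    }

    public static void main(String[] args) {
        System.out.println(crawl(Arrays.asList("http://a.example", "http://b.example"), 2));
    }
}
```

Note that "loop until the list is empty" only works as a stop condition when the list is filled up front; if crawlers add newly discovered links, an empty list may just mean another crawler is still busy.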
It's been a few years since this question was asked, but as of Nov 2015 we are using Frontera and Scrapyd.
Scrapy uses Twisted, an event-driven (asynchronous) networking framework, so it handles many requests concurrently without needing a thread per request, and on multi-core machines that means we are only limited by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.
Create a central resource with a hash map that stores each URL as the key and the time it was last scanned as the value. Make this thread-safe. Then just spawn threads, with the links placed in a queue from which the crawlers pick up their starting points. Each thread would then carry on crawling and updating the resource. A thread in the resource clears out outdated crawls. The in-memory resource can be loaded from a serialised copy at start, or it could live in a DB, depending on your app's needs.
You could make this resource accessible via remote services to allow multiple machines. You could make the resource itself spread over several machines by segregating urls. Etc...
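A single-machine sketch of such a resource, assuming a ConcurrentHashMap keyed by URL with the last scan time as the value. CrawlRegistry, claim() and purgeOlderThan() are invented names; the clean-up is shown as a plain method that a background thread could call periodically.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical central registry: URL -> time it was last scanned. */
class CrawlRegistry {
    private final ConcurrentMap<String, Instant> lastScanned = new ConcurrentHashMap<>();

    /**
     * Atomically records the URL if it is not already present.
     * Returns true only for the one caller that should go and crawl it.
     */
    boolean claim(String url, Instant now) {
        return lastScanned.putIfAbsent(url, now) == null;
    }

    /** Drops entries scanned more than maxAge ago, so they can be claimed again. */
    void purgeOlderThan(Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        lastScanned.values().removeIf(scannedAt -> scannedAt.isBefore(cutoff));
    }

    public static void main(String[] args) {
        CrawlRegistry registry = new CrawlRegistry();
        Instant start = Instant.now();
        System.out.println(registry.claim("http://example.com/", start)); // first claim
        System.out.println(registry.claim("http://example.com/", start)); // duplicate claim
    }
}
```

Because putIfAbsent is atomic, two crawler threads that discover the same link at the same moment cannot both claim it.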
You should use a blocking queue that contains the URLs that need to be fetched. In this case you could create multiple consumers that will fetch URLs in multiple threads. If the queue is empty, then all fetchers will block. In this case you should start all the threads at the beginning and should not need to control them later.
You also need to maintain a list of already downloaded pages in some persistent storage and check it before adding a URL to the queue.
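Here is a small sketch of that design with some simplifications that are assumptions of the example, not part of the advice above: the "already downloaded" list is an in-memory set instead of persistent storage, the web is faked with an in-memory link map so the code runs without a network, and workers stop after the queue has been idle for 200 ms instead of blocking forever. BlockingCrawler and its methods are invented names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingCrawler {
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Map<String, List<String>> web; // hypothetical page -> links map standing in for HTTP

    BlockingCrawler(Map<String, List<String>> web) {
        this.web = web;
    }

    /** Checks the "already seen" set before adding a URL to the queue. */
    void enqueue(String url) {
        if (seen.add(url)) {
            frontier.add(url);
        }
    }

    private void work() {
        try {
            String url;
            // Block for up to 200 ms waiting for work; a longer idle period is treated as "done".
            while ((url = frontier.poll(200, TimeUnit.MILLISECONDS)) != null) {
                for (String link : web.getOrDefault(url, Collections.emptyList())) {
                    enqueue(link);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts all workers up front and returns every URL that was queued. */
    Set<String> crawl(String seed, int workers) {
        enqueue(seed);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(this::work);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("crawl did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new TreeSet<>(seen);
    }

    static Set<String> demo() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("a", "d"));
        return new BlockingCrawler(web).crawl("a", 3);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The idle timeout is only a convenient way to end the demo; with real, slow fetches you would want a proper shutdown signal instead.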
If you don't want to re-invent the wheel, why not look at Apache Nutch?
We have been using ThreadLocal so far to carry some data so as not to clutter the API. However, below are some of the issues with using thread-locals that I don't like:
1) Over the years, the number of data items carried in the thread-local has grown.
2) Since we started using threads (for some lightweight processing), we have also been migrating this data to the threads in the pool and copying it back again.
I am thinking of using an in-memory DB for this data (we don't want to add it to the API). I am wondering whether this approach is good. What are the pros and cons?
OK, here is a simple scenario:
the user logs in and submits a request
the system establishes a context for this entire request, which includes
- a unique id for this request
- the username
- the system logged into (a user can log into multiple systems)
- some DOMAIN EVENTS for later use
the request passes through multiple logical layers (presentation, business domain, rules, integration, etc.)
in the integration layer, we borrow a few threads from the pool to pull data from multiple partners in parallel. Each of the pulls needs some data stored earlier in the thread-local, so we migrate it to the pooled threads
after all the data is received from the partners, we migrate the new thread-local data accumulated in the child threads back to the main thread
at the end of the interaction we persist the DOMAIN EVENTS to the DB
You may want to introduce a request context: http://www.corej2eepatterns.com/Patterns2ndEd/ContextObject.htm
You can handle creation and destruction of such an object in a Filter if you're using a web container, or in an Interceptor if you're using an application server.
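As a rough sketch of what such a context object could look like for the scenario in the question: the fields below mirror the items listed there, but RequestContext, PartnerIntegration, pullAll() and pull() are names invented for this example, and the Filter/Interceptor wiring is left out. The point is that the same object is handed to the pooled tasks, so nothing has to be copied into and back out of thread-locals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical per-request context: immutable identity plus a thread-safe event list. */
final class RequestContext {
    private final String requestId;
    private final String username;
    private final String system;
    private final Queue<String> domainEvents = new ConcurrentLinkedQueue<>();

    RequestContext(String requestId, String username, String system) {
        this.requestId = requestId;
        this.username = username;
        this.system = system;
    }

    String requestId() { return requestId; }
    String username() { return username; }
    String system() { return system; }

    /** Safe to call from any thread working on this request. */
    void addEvent(String event) { domainEvents.add(event); }

    /** Snapshot of the events, e.g. for persisting at the end of the interaction. */
    List<String> events() { return new ArrayList<>(domainEvents); }
}

class PartnerIntegration {
    /** Pulls from every partner in parallel; each task receives the same context object. */
    static List<String> pullAll(RequestContext ctx, List<String> partners, ExecutorService pool) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String partner : partners) {
            futures.add(CompletableFuture.supplyAsync(() -> pull(ctx, partner), pool));
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> future : futures) {
            results.add(future.join()); // waits for this pull to finish
        }
        return results;
    }

    // Placeholder for the real partner call; it reads from and writes to the shared context.
    private static String pull(RequestContext ctx, String partner) {
        ctx.addEvent("pulled " + partner + " for request " + ctx.requestId());
        return partner + ":" + ctx.username();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            RequestContext ctx = new RequestContext("req-1", "alice", "billing");
            System.out.println(pullAll(ctx, Arrays.asList("partner-1", "partner-2"), pool));
            System.out.println(ctx.events());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether the context is passed as a method argument (as here) or looked up from a single thread-local slot set by the Filter is a design choice; either way the growing list of data items lives in one object rather than in many separate thread-locals.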