Fastest architecture for multithreaded web crawler - java

There should be a frontier object holding the set of visited URLs and the URLs waiting to be crawled.
There should be some threads responsible for crawling web pages.
There would also be some kind of controller object to create the crawling threads.
I don't know which architecture would be faster and easier to extend. How should I divide responsibilities so that as little synchronization as possible is needed, while also minimizing the number of checks for whether the current URL has already been visited?
Should the controller object be responsible for providing new URLs to the worker threads? This would mean the worker threads crawl all the URLs they are given and then sleep for an indefinite time. The controller would interrupt these threads, so each crawling thread would have to handle InterruptedException (how expensive is that in Java? It seems that exception handling is not very fast).
Or should the controller only start the threads and let the crawling threads fetch from the frontier themselves?

Create a shared, thread-safe list with the URLs to be crawled. Create an Executor with as many threads as the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list and does whatever you need it to do, looping until the list is empty.
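A minimal sketch of the approach described above, assuming the URL list is known up front. The class name SharedListCrawl and the PageHandler interface are hypothetical stand-ins for the real fetch-and-parse code, and a ConcurrentLinkedQueue plays the role of the shared thread-safe list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedListCrawl {

    /** Hypothetical stand-in for the real fetch-and-parse work. */
    interface PageHandler {
        void handle(String url);
    }

    /** Runs `workers` crawler tasks that drain `urls`; returns how many URLs were handled. */
    static int crawl(List<String> urls, int workers, PageHandler handler) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(urls);
        AtomicInteger handled = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String url;
                // poll() is atomic and returns null once the queue is empty,
                // so no URL is handed to two crawlers.
                while ((url = pending.poll()) != null) {
                    handler.handle(url);
                    handled.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // accept no new tasks; the submitted crawlers keep running
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        int n = crawl(Arrays.asList("http://example.com/a", "http://example.com/b"), 2,
                url -> System.out.println(Thread.currentThread().getName() + " -> " + url));
        System.out.println("handled " + n);
    }
}
```

Note that "loop until the list is empty" only works cleanly for a fixed list: if crawlers add newly discovered links to the same queue, a worker that sees it momentarily empty will exit early.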

It's been a few years since this question was asked, but as of Nov 2015 we are using Frontera and scrapyd.
Scrapy is built on Twisted, an event-driven networking framework, so it handles many requests concurrently through asynchronous I/O rather than with a thread per request. In our setup on multi-core machines that means we are limited only by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.

Create a central resource with a hash map that stores each URL as the key and the time it was last scanned as the value. Make this thread-safe. Then spawn threads and put links in a queue, which the crawlers can pick up as starting points. Each thread then carries on crawling and updating the resource. A thread in the resource clears out outdated crawls. The in-memory resource can be serialised and reloaded at start, or it could live in a DB, depending on your app's needs.
You could make this resource accessible via remote services to allow multiple machines to use it. You could also spread the resource itself over several machines by partitioning the URLs, and so on.
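One way the central resource described above might look in memory, as a sketch only. The class name CrawlRegistry, its method names, and the maxAge policy are assumptions for illustration; the time is passed in explicitly so the behaviour is easy to check.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class CrawlRegistry {
    // URL -> time of the last scan
    private final ConcurrentMap<String, Instant> lastScanned = new ConcurrentHashMap<>();
    private final Duration maxAge;

    CrawlRegistry(Duration maxAge) {
        this.maxAge = maxAge;
    }

    private boolean isOutdated(Instant scannedAt, Instant now) {
        return Duration.between(scannedAt, now).compareTo(maxAge) > 0;
    }

    /**
     * Atomically claims a URL for crawling. Returns true if the URL was never
     * scanned or its last scan is older than maxAge; the entry is then stamped
     * with `now` so that no other thread claims it as well.
     */
    boolean tryClaim(String url, Instant now) {
        boolean[] claimed = {false};
        // compute() runs the function atomically for this key
        lastScanned.compute(url, (key, previous) -> {
            if (previous == null || isOutdated(previous, now)) {
                claimed[0] = true;
                return now;
            }
            return previous;
        });
        return claimed[0];
    }

    /** Housekeeping pass: removes outdated entries and returns how many were removed. */
    int evictOutdated(Instant now) {
        int removed = 0;
        for (Map.Entry<String, Instant> e : lastScanned.entrySet()) {
            // remove(key, value) only succeeds if the entry was not refreshed meanwhile
            if (isOutdated(e.getValue(), now) && lastScanned.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }
}
```

A crawler thread would call tryClaim before fetching a URL, and the housekeeping thread would call evictOutdated periodically. Persisting the map or exposing it over a remote service is left out of this sketch.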

You should use a blocking queue that contains the URLs that need to be fetched. You can then create multiple consumers that fetch URLs in multiple threads. If the queue is empty, all fetchers block until a URL arrives. In this design you start all the threads at the beginning and do not need to control them afterwards.
You also need to maintain a list of already downloaded pages in some persistent storage and check it before adding a URL to the queue.
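The blocking-queue design above could be sketched as follows. This is an illustration under assumptions: BlockingFrontier and LinkExtractor are hypothetical names, the "already downloaded" list is an in-memory concurrent set standing in for persistent storage, and an in-flight counter is added so the demo can stop once no work remains (the answer itself does not specify a termination scheme).

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class BlockingFrontier {

    /** Hypothetical stand-in: fetches a page and returns the links found on it. */
    interface LinkExtractor {
        List<String> fetchAndExtract(String url);
    }

    /** Crawls from `seed` with `fetchers` threads; returns every URL that was queued. */
    static Set<String> crawl(String seed, int fetchers, LinkExtractor extractor) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet(); // stand-in for persistent storage
        AtomicInteger inFlight = new AtomicInteger();      // URLs queued but not yet finished
        CountDownLatch done = new CountDownLatch(1);

        seen.add(seed);
        inFlight.incrementAndGet();
        queue.add(seed);

        ExecutorService pool = Executors.newFixedThreadPool(fetchers);
        for (int i = 0; i < fetchers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String url = queue.take(); // blocks while the queue is empty
                        try {
                            for (String link : extractor.fetchAndExtract(url)) {
                                if (seen.add(link)) { // atomic check-and-mark before queueing
                                    inFlight.incrementAndGet();
                                    queue.add(link);
                                }
                            }
                        } catch (RuntimeException e) {
                            // assumed policy for this sketch: skip pages that fail
                        } finally {
                            if (inFlight.decrementAndGet() == 0) {
                                done.countDown(); // nothing queued and nothing being fetched
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shutdownNow() ends the loop
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdownNow(); // wake up fetchers still blocked in take()
        return seen;
    }
}
```

Checking the visited set before queueing, rather than after dequeueing, keeps duplicates out of the queue, so each URL costs exactly one membership check.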

If you don't want to reinvent the wheel, why not look at Apache Nutch?

Related

Thread Management Best Practices for Network Bound Web Server

I have a network server that has a few dozen backend controllers, each of which processes a different user request (i.e., a user clicking something on the website).
Each controller makes network calls to a handful of services to get the data it needs. Each network call takes around 200ms. These network calls can be done in parallel, so I want to launch one thread for each of them and then collect the results at the end, maximizing parallelization (5 network calls in parallel take 200ms, whereas 5 in sequence take 1000ms).
However, I am unsure of the best practice for designing the thread management strategy.
Should I have one thread pool with, say, 1000 threads in it (an arbitrary number, for example), with each controller drawing from that pool?
Should I not have a thread pool at all and create new threads in each controller as I need them? This option seems "dumb", but I wonder: how much does creating a thread cost in CPU cycles compared to waiting for a network response? I suspect it is quite minimal.
Should I have one thread pool per controller (meaning dozens of thread pools, each with around 5 or 6 threads for that specific controller)?
Seeking pros / cons of each strategy, best practices, or an alternate strategy I haven't considered.
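For reference, the first option in the question (one shared pool that every controller draws from) might be sketched like this. ParallelCalls and callAll are hypothetical names, and the pool is passed in so that its size and lifecycle stay with the application; this illustrates the option rather than recommending it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class ParallelCalls {

    /**
     * Runs all calls on the shared pool and returns their results in the same
     * order as `calls`. Calls that have not finished within the timeout are
     * cancelled by invokeAll, and reading their result then throws an
     * unchecked CancellationException.
     */
    static <T> List<T> callAll(ExecutorService sharedPool, List<Callable<T>> calls,
                               long timeoutMillis) {
        List<T> results = new ArrayList<>();
        try {
            // invokeAll blocks until every task is done or the timeout expires
            for (Future<T> f : sharedPool.invokeAll(calls, timeoutMillis, TimeUnit.MILLISECONDS)) {
                results.add(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for service calls", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a service call failed", e.getCause());
        }
        return results;
    }
}
```

A controller would build one Callable per backend service and pass the list to callAll. With enough free threads in the pool, the wall-clock time is roughly that of the slowest call rather than the sum of all calls.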

Java C3P0 Resource Starving

I have been looking at an issue where a main workflow table with its DAO, which is hit heavily by multiple threads, is struggling to serve requests fairly; it looks almost like resource starvation.
The threads are all responsible for pulling data (around 5 of them) from various external systems.
The issue is that when one thread receives a lot of information at once, it hammers the table with requests, which leaves the others competing for access. As a result, they typically time out and need to be restarted.
Are there any mechanisms or strategies to manage this kind of thing? Off the top of my head (this is only my initial thought), I was thinking of creating some form of blocking list that all the threads can add to (on a first come, first served basis, maybe) and then funnelling the work through the SimpleJdbcOperations that way.
I would be open to any approaches that are considered standard for this kind of problem.
Thanks

How to effectively process a lot of objects in a list on the server side

I have a List which contains a lot of objects.
The problem is that I have to process these objects (processing includes cloning, deep copying, making DB calls, running business logic, etc.).
Doing this in the normal fashion, first come first served, is really time-consuming, and in a web application this generally results in transaction timeouts on the server side (as this processing is async from the client's perspective).
How do I process those objects so as to take minimal time and not overload the DB?
I'm using Java 7 in a server environment.
I'm already using a messaging solution, RabbitMQ, which gets me the item and its quantity. The problem occurs when I try to deep copy items to mimic real items (business logic: every item should be uniquely processed) and save them to the DB.
After some discussion, the viable solution is to use an ArrayBlockingQueue (ABQ) which is processed by a pool of threads.
These are the expected benefits:
1) We won't have to manage third-party queues, e.g. the ones created in RabbitMQ.
2) At any point in time the blocking queue won't hold all the items to be processed, as the consumer threads will be processing them simultaneously, so it will leave a smaller memory footprint.
@cody123 I'm using Spring Batch for retry mechanisms in this case.
After another round of profiling I found that the bottleneck was the DB connection pool having a low maximum number of connections.
I deduced this by running the same transaction without the DB thread pool; it went perfectly well and completed without any exception.
However, combining this with the previous approach, i.e. managing an ABQ and light commits with an HA DB, will be the best solution.
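A rough sketch of the ABQ-plus-thread-pool idea from the post. BoundedPipeline, processAll, and the END marker are hypothetical, and the code uses lambdas for brevity, so on Java 7 (as in the question) they would have to be written as anonymous classes. The point it shows is the bounded capacity: put() blocks the producer when the queue is full, so only a limited number of items is in memory at once.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class BoundedPipeline {
    // End-of-input marker, compared by identity so it cannot clash with a real item
    private static final String END = new String("END");

    /** Feeds `items` through a bounded queue to `consumers` threads; returns the processed count. */
    static int processAll(List<String> items, int capacity, int consumers,
                          Consumer<String> processor) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    // each consumer stops when it takes one END marker
                    for (String item = queue.take(); item != END; item = queue.take()) {
                        processor.accept(item); // e.g. deep copy + DB save in the real app
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (String item : items) {
                queue.put(item); // blocks while the queue is full (back-pressure)
            }
            for (int i = 0; i < consumers; i++) {
                queue.put(END); // one marker per consumer
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

Sizing the consumer count at or below the DB connection pool's maximum would be consistent with the bottleneck found during profiling, since extra consumers would only wait for a connection. The sketch assumes the processor does not throw; a real version would need a failure policy so that a dying consumer cannot leave the producer blocked.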

Is Google App Engine single-threaded? (Java)

My question is: "Is Google App Engine single-threaded?" Now, when I ask that, I know that I cannot start my own threads by using threading in Java. But we can start threads using a backend.
I am concerned about threading with respect to how requests are handled. I read somewhere that in App Engine each request is queued and then served one by one, and that I can configure the maximum time for which a request can be queued. If the time to serve a request exceeds that maximum, a new instance is created.
So what if I want to use a single instance (free quota)?
If I get multiple requests r1, r2, r3, r4 (in this order), will the requests be served one after another (in the case of a single instance)?
If the load increases and a new instance is created dynamically, will the data present in the main memory of the first instance be cloned to the new instance too?
Will the data in the two instances be in sync all the time?
Agree with what Nick said, but also want to point out that this statement:
"Now when i ask that i know that i cannot start my own threads by using threading in java"
is no longer true. For more details, see the section about threads here:
https://developers.google.com/appengine/docs/java/runtime#The_Sandbox
So, in summary, App Engine is multi-threaded in a couple of ways:
- requests can be handled concurrently by a single instance using a thread per request
- a single request may explicitly start additional threads
As stated in the docs, you can enable concurrent requests on your Java app, in which case multiple threads will be spawned, each of which handles requests independently.
Instances are not cloned off already running instances, nor are they synchronized in any way - you are expected to write your code in a manner that doesn't depend on specific mutable instance state.

What's the proper design of a data reading/storing application?

I need to read 200,000 or so records from a website and store them in a DB. The application is a desktop app implemented on top of the NetBeans Rich Client Platform. Using the Apache HttpComponents library, I can send a request to the website and retrieve the response that contains the record information; then, using regex, I can fairly easily extract the dozen or so fields that I need from the HTML.
I am thinking of having 2 worker threads besides the GUI thread. One worker thread handles the HTTP request/response part and also extracts the record from the HTML using regex, while the other worker thread stores the records in the DB. So there will be a data structure to hold the records, shared between the two worker threads. I am also considering a buffer of size 100 (for example) in which the HTTP worker thread stores records; when the buffer is full, it transfers 100 records at a time to the shared records holder.
Please comment on my design. My questions are:
what is the proper data structure to hold the records?
how to synchronize it between the two worker threads?
how would multithreading be implemented in the modular system of the NetBeans Platform?
what is the proper data structure to hold the records?
Depends on the data. Probably a simple class with a bunch of fields (preferably immutable, to make using it from multiple threads safer).
how to synchronize it between the two worker threads?
One of the BlockingQueue implementations might be good for that. ArrayBlockingQueue can be used as a fixed-size buffer for passing work between the threads.
how would multithreading be implemented in the modular system of the NetBeans Platform?
No idea whether the NetBeans Platform has anything to say about that. Launching your own threads should work.
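The immutable record class and the ArrayBlockingQueue suggested above might fit together roughly as follows. All names here (RecordPipeline, Rec, BatchWriter, the two fields) are hypothetical, the scraper just fabricates records in place of real HTTP and regex work, and for brevity the DB-writing side runs on the calling thread instead of a second worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class RecordPipeline {

    /** Immutable value object; the two fields are placeholders for the real dozen. */
    static final class Rec {
        final int id;
        final String title;

        Rec(int id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Hypothetical stand-in for the code that stores a batch in the DB. */
    interface BatchWriter {
        void write(List<Rec> batch);
    }

    private static final Rec END = new Rec(-1, null); // end-of-stream marker

    /** Produces `total` records on a worker thread and writes them in batches of at most `batchSize`. */
    static int run(int total, int batchSize, BatchWriter writer) {
        BlockingQueue<Rec> buffer = new ArrayBlockingQueue<>(batchSize * 2);
        Thread scraper = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    buffer.put(new Rec(i, "title-" + i)); // blocks while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "http-worker");
        scraper.start();

        int written = 0;
        List<Rec> batch = new ArrayList<>(batchSize);
        boolean finished = false;
        try {
            while (!finished) {
                batch.add(buffer.take());                        // wait for at least one record
                buffer.drainTo(batch, batchSize - batch.size()); // take what else is ready, without blocking
                if (batch.get(batch.size() - 1) == END) {        // FIFO order: END can only be last
                    batch.remove(batch.size() - 1);
                    finished = true;
                }
                if (!batch.isEmpty()) {
                    writer.write(new ArrayList<>(batch));
                    written += batch.size();
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            scraper.interrupt();
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

Using drainTo means the writer stores whatever has accumulated, up to the batch size, instead of waiting for exactly 100 records, so a slow scraper never stalls records that are already available.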
First of all, this kind of HTML parsing would slow down your app quite badly. Also, the code would be quite fragile, since HTML changes quite often for aesthetic enhancements. You should use 'HTML scraping' only as a last resort. Most customers agree to open up a web service or data service for this once you explain the disadvantages.
If you really have no other alternative, then I think your approach is good. But instead of waiting for the buffer to be full, you could have a set of threads writing into the buffer and a set of threads reading from the buffer simultaneously. I would suggest using more HTTP scraper threads and fewer DB-write threads, since the HTTP request-response cycle and HTML parsing would be many times slower than a database write.
