I need to read about 200,000 records from a website and store them in a database. The application is a desktop app built on top of the NetBeans Rich Client Platform. Using the Apache HttpComponents library, I can send requests to the website and retrieve the responses that contain the record information; then, using regex, I can fairly easily extract the dozen or so fields that I need from the HTML.
I am thinking of having two worker threads besides the GUI thread. One worker thread handles the HTTP request/response part and also extracts the records from the HTML using regex, while the other worker thread stores the records in the database. So there will be a data structure holding the records that is shared between the two worker threads. I am also considering giving the HTTP worker thread a buffer of size 100 (for example) to store the records and, when the buffer is full, transferring all 100 records at once to the shared records holder.
Please comment on my design. My questions are:
what is the proper data structure to hold the records?
how to synchronize it between the two worker threads?
how would the multi-threads be implemented in the modular system of Netbeans Platform?
what is the proper data structure to hold the records?
Depends on the data. Probably a simple class with a bunch of fields (preferably immutable to make using multiple threads safer).
how to synchronize it between the two worker threads?
One of the BlockingQueue implementations might be good for that. ArrayBlockingQueue can be used as a fixed-size buffer for passing work between the threads.
how would the multi-threads be implemented in the modular system of Netbeans Platform?
No idea whether NetBeans Platform has anything to say about that. Launching your own threads should work.
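A minimal sketch of that hand-off, assuming one HTTP worker and one DB worker. The `ScrapedRecord` class, its two fields, and the `END` marker are hypothetical placeholders, and the list stands in for the real database insert:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class RecordHandoff {
    // Hypothetical immutable record; a real one would carry the dozen scraped fields.
    static final class ScrapedRecord {
        final int id;
        final String title;

        ScrapedRecord(int id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    // Unique end-of-stream marker, compared by identity.
    private static final ScrapedRecord END = new ScrapedRecord(-1, "");

    // Runs one producer and one consumer; returns what the consumer received.
    static List<ScrapedRecord> run(int recordCount, int bufferSize) {
        BlockingQueue<ScrapedRecord> queue = new ArrayBlockingQueue<>(bufferSize);
        List<ScrapedRecord> stored = new ArrayList<>();

        Thread httpWorker = new Thread(() -> {
            try {
                for (int i = 0; i < recordCount; i++) {
                    // put() blocks while the buffer is full.
                    queue.put(new ScrapedRecord(i, "record-" + i));
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread dbWorker = new Thread(() -> {
            try {
                // take() blocks while the buffer is empty.
                for (ScrapedRecord r = queue.take(); r != END; r = queue.take()) {
                    stored.add(r); // stand-in for the database insert
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        httpWorker.start();
        dbWorker.start();
        try {
            httpWorker.join();
            dbWorker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 100).size() + " records handed over");
    }
}
```

Because the queue is bounded, a slow DB worker automatically throttles the HTTP worker, so no separate 100-record staging buffer is needed.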
First of all, this kind of HTML parsing can slow your app down considerably. The code will also be fragile, since HTML often changes for purely aesthetic reasons. You should treat HTML scraping as a last resort. Most customers agree to open up a web service or data service for this once you explain the disadvantages.
If you really have no alternative, then I think your approach is good. But instead of waiting for the buffer to fill up, you could have one set of threads writing into the buffer and another set reading from it simultaneously. I would suggest using more HTTP scraper threads and fewer DB-write threads, since the HTTP request-response cycle plus HTML parsing is likely to be orders of magnitude slower than a database write.
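A rough sketch of that arrangement, assuming two thread pools around one bounded buffer. The class name, the `STOP` marker, the buffer size of 100, and the generated row strings are illustrative assumptions; real scrapers would issue HTTP requests and real writers would run inserts:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ScrapePool {
    // Unique marker object (compared by identity); one is sent per writer.
    private static final String STOP = new String("STOP");

    // Returns the number of rows the writers consumed.
    static int run(int pages, int scrapers, int writers) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(100);
        AtomicInteger nextPage = new AtomicInteger();
        AtomicInteger written = new AtomicInteger();
        ExecutorService scraperPool = Executors.newFixedThreadPool(scrapers);
        ExecutorService writerPool = Executors.newFixedThreadPool(writers);

        for (int s = 0; s < scrapers; s++) {
            scraperPool.execute(() -> {
                try {
                    int page;
                    // Each scraper claims the next unprocessed page number.
                    while ((page = nextPage.getAndIncrement()) < pages) {
                        buffer.put("row-from-page-" + page); // stand-in for fetch + parse
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int w = 0; w < writers; w++) {
            writerPool.execute(() -> {
                try {
                    for (String row = buffer.take(); row != STOP; row = buffer.take()) {
                        written.incrementAndGet(); // stand-in for the database insert
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            scraperPool.shutdown();
            scraperPool.awaitTermination(1, TimeUnit.MINUTES);
            // All rows are queued; tell every writer to finish.
            for (int w = 0; w < writers; w++) {
                buffer.put(STOP);
            }
            writerPool.shutdown();
            writerPool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println(run(500, 8, 2) + " rows written");
    }
}
```

The two pool sizes are independent parameters, so the scraper-to-writer ratio can be tuned by measurement rather than fixed up front.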
Can you help me with the following problems?
A. We have a table on which read and write operations happen simultaneously. Writes are so heavy that reads are very slow; sometimes my web application does not come up at all because of the write load on this table. How can I handle this scenario? The writes come from a different Java application while the reads come from our web application, so the web application becomes very slow. Any ideas?
B. The writes to this table come from 200 threads. These threads take connections from a connection pool and write into the table, and this application runs 24/7. Could thread priority be the issue that is blocking the read operations from the web application?
C. Can we have master-master replication for that table only, so that writes happen in one table and reads happen in the other, and every two minutes the data migrates from one table to the other?
Please advise.
Thanks in advance.
Check the connection pool size: it may be too small, so your threads waste time waiting for a connection from the pool.
Check your database settings: if you are running with out-of-the-box parameters, there may be plenty of room for improvement.
You probably need some kind of event-driven system: when a vehicle sends data, the DB is not updated directly; instead a message is added to a queue (e.g. JMS). Your app then caches the data on startup and updates both the cache and the database on receiving each message. The key point is that the only component that interacts with the DB is your app, and data changes only when you receive an event, so you don't need to query the DB to read the data, and you can do the updates in the background using only a few threads. There are quite good open-source messaging systems (e.g. Apache ActiveMQ) and caching libraries (e.g. Ehcache), so you can build a reasonably performant and fault-tolerant system without too much effort.
I guess introducing messaging would be a serious re-engineering effort, so to solve your immediate problem replication might be the best solution: merge data from the updatable table into another one every 2 minutes, and have the tracker read from that other table. Obviously this works well only if the web app just reads the data and does not update it; otherwise you need to put in a lot of effort to keep the 2 tables in sync. A variation on that is batching: data from the vehicles is inserted into an intermediate table and then, every 2 minutes, transferred into the main table that the reader queries; the intermediate table is cleaned after the transfer.
One way to solve this is to use a queue of write events and to pause the writing periodically so that the reader has a chance.
Create a queue for incoming write updates
Create an atomic variable such as AtomicBoolean (see java.util.concurrent.atomic) to use as a lock
Create a thread pool to read from the queue and execute the updates when the lock is unset
Use a timer (javax.swing.Timer in a Swing application, otherwise java.util.Timer or a ScheduledExecutorService) to periodically set the lock and read the table data.
Before trying anything too complicated try this perhaps:
1) Don't use Thread priorities, they are rarely what you want.
2) Set up your own priority scheme, perhaps simply by having a (priority) queue for both reads and writes where reads are prioritized. That is: add read and write requests to a single queue and have them block or be notified of the result.
3) check your database features to optimize write heavy tables
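The priority scheme in point 2 could be sketched with a `PriorityBlockingQueue`, as below. The `Request` class, the `Kind` enum, and the labels are hypothetical; a real request would carry the SQL to run and a way to return the result to the caller:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class DbRequestQueue {
    // READ is declared first, so it sorts ahead of WRITE.
    enum Kind { READ, WRITE }

    static final class Request implements Comparable<Request> {
        private static final AtomicLong SEQ = new AtomicLong();

        final Kind kind;
        final String label;
        // Submission order; keeps FIFO ordering among requests of the same kind.
        final long seq = SEQ.getAndIncrement();

        Request(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }

        @Override
        public int compareTo(Request other) {
            int byKind = kind.compareTo(other.kind);
            return byKind != 0 ? byKind : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<>();

    void submit(Kind kind, String label) {
        queue.put(new Request(kind, label));
    }

    // A database worker thread would loop on this; it blocks while the queue is empty.
    Request next() throws InterruptedException {
        return queue.take();
    }

    // Non-blocking drain, used here only to show the resulting order.
    List<String> drainLabels() {
        List<String> labels = new ArrayList<>();
        for (Request r = queue.poll(); r != null; r = queue.poll()) {
            labels.add(r.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        DbRequestQueue q = new DbRequestQueue();
        q.submit(Kind.WRITE, "w1");
        q.submit(Kind.READ, "r1");
        q.submit(Kind.WRITE, "w2");
        q.submit(Kind.READ, "r2");
        System.out.println(q.drainLabels()); // prints [r1, r2, w1, w2]
    }
}
```

One caveat with strict prioritization: a steady stream of reads can starve the writes, so some form of aging or a cap on consecutive reads may be needed.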
I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it because it is the processing of each line that takes the most time to complete rather than reading each line. (the bottleneck lies in the CPU processing, rather than the IO)
There are two options I came up with:
1) Open the file from main thread, create a lock on the file handle and pass the file handle around the worker threads and then let each worker read-access the file directly
2) Create a producer / consumer setup where only the main thread has direct read-access to the file, and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in speed performance for this task
Each line is independent
I am working this in C++ but I guess the issue here is a bit language-independent
Which option would you choose and why?
I would suggest the second option, since it is a cleaner design and less complicated than the first. The first option is less scalable and requires additional communication among the threads to synchronize their progress through the file. In the second option you have one dispatcher that deals with the I/O and hands work to the worker threads, and each worker is completely independent of the others, which lets you scale. The second option also separates your logic more clearly.
If we are talking about massively large file, which needs to be processed with a large cluster - MapReduce is probably the best solution.
The framework allows you great scalability, and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system [originally GFS] as input.
Note that there is an open source implementation of map-reduce: Apache Hadoop
If each line is really independent and processing is much slower than reading the file, you can read all the data at once and store it in an array, so that each line is one element of the array.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread could perform the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
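The split can be sketched as follows. It is written in Java rather than the C++ of the question, but the same partitioning applies with `std::thread` or an OpenMP parallel loop. The `process` function is a hypothetical stand-in for the expensive per-line calculation, and at least one thread is assumed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedLines {
    // Hypothetical per-line computation; stands in for the expensive work.
    static long process(String line) {
        return line.length();
    }

    // Splits the lines into one contiguous chunk per thread (threads must be >= 1).
    static long[] processAll(List<String> lines, int threads) {
        long[] results = new long[lines.size()];
        int chunk = (lines.size() + threads - 1) / threads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(t * chunk, lines.size());
                final int to = Math.min(from + chunk, lines.size());
                pending.add(pool.submit(() -> {
                    // Each task writes only to its own index range, so no locking is needed.
                    for (int i = from; i < to; i++) {
                        results[i] = process(lines.get(i));
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the chunk and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        System.out.println(Arrays.toString(processAll(lines, 2)));
    }
}
```

Static chunking works well when lines cost roughly the same to process; if costs vary a lot, smaller chunks handed out dynamically balance the load better.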
I would suggest the second option because it is definitely better design-wise and would allow you to have better control over the work that the worker threads are doing.
Moreover that would increase the performance since the inter-thread communication in that case is the minimum of the two options you described
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion between the threads.
There should be a frontier object holding a set of visited URLs and a set of URLs waiting to be crawled.
There should be some threads responsible for crawling web pages.
There would also be some kind of controller object to create the crawling threads.
I don't know which architecture would be faster and easier to extend. How should I divide the responsibilities so that there is as little synchronization as possible, while also minimizing the number of checks for whether the current URL has already been visited?
Should the controller object be responsible for providing new URLs to the worker threads? This means the worker threads will need to crawl all the given URLs and then sleep for an undefined time. The controller will be interrupting these threads, so a crawling thread must handle InterruptedException (how expensive is that in Java? It seems that exception handling is not very fast).
Or should the controller only start the threads and let the crawling threads fetch from the frontier themselves?
Create a shared, thread-safe list with the URLs to be crawled. Create an Executor with the number of threads corresponding to the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list and does whatever you need it to do, looping until the list is empty.
It's been a few years since this question was asked, but as of Nov 2015 we are using Frontera and Scrapyd.
Scrapy uses Twisted, an event-driven networking framework, which makes it a good asynchronous crawler rather than a multithreaded one, and on multi-core machines that means we are only limited by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.
Create a central resource with a hash map that can store URL as key with last time scanned. Make this thread safe. Then just spawn threads with links in a queue which can be picked up by the crawlers as starting point. Each thread would then carry on crawling and updating the resource. A thread in the resource clears up outdated crawls. The in memory resource can be serialised at start or it could be in a db depending on your app needs.
You could make this resource accessible via remote services to allow multiple machines. You could make the resource itself spread over several machines by segregating urls. Etc...
You should use a blocking queue that contains the URLs that need to be fetched. You can then create multiple consumers that fetch URLs in multiple threads. If the queue is empty, all fetchers will block. With this design you start all the threads at the beginning and do not need to control them later.
Also you need to maintain a list of already downloaded pages in some persistent storage and check before adding to the queue.
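A small sketch of this design, under some loud assumptions: the `web` map is a hypothetical stand-in for fetching a page and extracting its links, the visited set lives in memory rather than in persistent storage, and the `pending` counter plus `STOP` marker are one possible way to let blocked fetchers shut down once the crawl is finished:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class MiniCrawler {
    // Unique marker object (compared by identity) that tells a fetcher to exit.
    private static final String STOP = new String("STOP");

    private final Map<String, List<String>> web; // hypothetical: URL -> links found on that page
    private final int fetchers;
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger(); // URLs queued or in progress

    MiniCrawler(Map<String, List<String>> web, int fetchers) {
        this.web = web;
        this.fetchers = fetchers;
    }

    private void enqueue(String url) {
        // add() is an atomic check-and-mark, so each URL is queued at most once.
        if (visited.add(url)) {
            pending.incrementAndGet();
            frontier.add(url);
        }
    }

    private void fetchLoop() {
        try {
            for (String url = frontier.take(); url != STOP; url = frontier.take()) {
                for (String link : web.getOrDefault(url, List.of())) {
                    enqueue(link);
                }
                // Children were counted before this URL is released, so zero
                // really means nothing is queued or in progress anywhere.
                if (pending.decrementAndGet() == 0) {
                    for (int i = 0; i < fetchers; i++) {
                        frontier.add(STOP);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    Set<String> crawl(String seed) {
        enqueue(seed);
        Thread[] threads = new Thread[fetchers];
        for (int i = 0; i < fetchers; i++) {
            threads[i] = new Thread(this::fetchLoop);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "c", List.of("d"));
        System.out.println(new MiniCrawler(web, 3).crawl("a").size() + " pages visited");
    }
}
```

Checking the visited set at enqueue time rather than at fetch time keeps duplicates out of the queue entirely, which is the check the answer asks for.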
If you don't want to re-invent the wheel, why not look at Apache Nutch.
As you know, trading strategies take actions based on a real-time feed, such as when the bid or the last trade price changes. A data feed provider streams quotes to our desktop application asynchronously in a separate thread from the main thread. This data feed thread is spawned when you make a request to the data feed provider and lives until you explicitly send a request to stop the streaming.
As it stands, the data feed thread executes trading strategies because most of them are designed to enter or update orders upon tick data. Do you see any problem with this approach? Is this design common in trading applications?
I'm using Java.
You definitely don't want to execute a trading strategy on the data feed thread, particularly if the execution takes a while. That execution should happen on a different thread. I am not that familiar with Java, but I assume you could make use of a thread pool there. In C# a very powerful way to spread out work over multiple threads would be using Tasks.
Another thing you might want to think about is what to do when there are new ticks for an instrument while you are still processing the previous tick. In many cases it makes sense to only process the most recent one. I have written up a little post on what I termed the most recent update pattern with a sample implementation in C#. Maybe you find that useful.
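The post mentioned above is in C#; the following is an independent Java sketch of the same idea (keep only the newest tick per instrument), not a port of that implementation. The class name and the symbols in `main` are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

final class LatestTickConflater {
    // Newest unprocessed price per symbol.
    private final ConcurrentHashMap<String, Double> latest = new ConcurrentHashMap<>();
    // Symbols that currently have an unprocessed price; each appears at most once.
    private final BlockingQueue<String> dirty = new LinkedBlockingQueue<>();

    // Called on the data feed thread: cheap, and never runs strategy code.
    void onTick(String symbol, double price) {
        if (latest.put(symbol, price) == null) {
            dirty.add(symbol); // first unprocessed tick for this symbol: schedule it once
        }
        // Otherwise the older unprocessed price was simply overwritten.
    }

    // Called on a strategy thread: blocks until some symbol has an unprocessed tick.
    Map.Entry<String, Double> takeLatest() throws InterruptedException {
        String symbol = dirty.take();
        return Map.entry(symbol, latest.remove(symbol));
    }

    // Non-blocking variant: returns null when nothing is waiting.
    Map.Entry<String, Double> pollLatest() {
        String symbol = dirty.poll();
        return symbol == null ? null : Map.entry(symbol, latest.remove(symbol));
    }

    public static void main(String[] args) {
        LatestTickConflater feed = new LatestTickConflater();
        feed.onTick("ABC", 10.0);
        feed.onTick("ABC", 10.5); // replaces 10.0 before any strategy saw it
        feed.onTick("XYZ", 7.0);
        for (Map.Entry<String, Double> e = feed.pollLatest(); e != null; e = feed.pollLatest()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Strategies that need every tick (for example, to build bars or volume totals) cannot use this kind of conflation and need a plain queue instead.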
As it stands, the data feed thread executes trading strategies because most of them are designed to enter or update orders upon tick data.
Not quite. The data feed thread triggers the execution of trading strategies. You don't want any other processing to slow down the data feed thread.
Our company has a batch application that runs every day. It mostly does database-related jobs, for example importing data from a file into a database table.
There are 20+ tasks defined in that application, and each one may or may not depend on others.
The application executes the tasks one by one; the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think that is too long, so maybe I can improve performance with multi-threading.
Since there are dependencies between tasks, it is not easy to run the tasks in parallel, but maybe I can use multi-threading to improve performance inside a task.
For example, we have a task defined as "ImportBizData", which copies data into a database table from a data file (usually containing 100,0000+ rows). Is it worth using multi-threading for that?
As I know only a little about multi-threading, I hope someone can provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections should not be shared between threads because each connection carries its own transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys)
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
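The three steps can be sketched as a pipeline with two bounded queues. Everything concrete here is an assumption made for illustration: the lines are taken from a list instead of a file, "parsing" is just `Long.parseLong` on well-formed input, the "upload" adds to a running sum instead of calling JDBC, and an empty `Optional` marks the end of each stream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class ImportPipeline {
    // Returns the sum of all uploaded values so the result is easy to check.
    static long run(List<String> fileLines, int parsers) {
        BlockingQueue<Optional<String>> raw = new ArrayBlockingQueue<>(64);
        BlockingQueue<Optional<Long>> parsed = new ArrayBlockingQueue<>(64);
        AtomicInteger parsersLeft = new AtomicInteger(parsers);
        AtomicLong uploadedSum = new AtomicLong();
        List<Thread> threads = new ArrayList<>();

        // Step 1: a single reader turns lines into jobs.
        threads.add(new Thread(() -> {
            try {
                for (String line : fileLines) {
                    raw.put(Optional.of(line));
                }
                for (int i = 0; i < parsers; i++) {
                    raw.put(Optional.empty()); // one end marker per parser
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        // Step 2: any number of threads do the calculations.
        for (int p = 0; p < parsers; p++) {
            threads.add(new Thread(() -> {
                try {
                    for (Optional<String> job = raw.take(); job.isPresent(); job = raw.take()) {
                        parsed.put(Optional.of(Long.parseLong(job.get().trim())));
                    }
                    // The last parser to finish closes the second queue.
                    if (parsersLeft.decrementAndGet() == 0) {
                        parsed.put(Optional.empty());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        // Step 3: one uploader; a real one would own its JDBC connection.
        threads.add(new Thread(() -> {
            try {
                for (Optional<Long> row = parsed.take(); row.isPresent(); row = parsed.take()) {
                    uploadedSum.addAndGet(row.get()); // stand-in for the insert
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return uploadedSum.get();
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("1", "2", "3", "4"), 2));
    }
}
```

The bounded queues provide back-pressure: if the uploader is the bottleneck, the parsers and the reader slow down instead of filling memory.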
Multi-threading may help. If the lines are uncorrelated, you could start two processes, one reading the even lines and the other the odd lines, get your DB connections from a connection pool (DBCP), and analyze the performance. But first I would investigate whether JDBC is the best approach; databases normally have optimized solutions for imports like this. These solutions may also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.
Also, you may want to check out Spring Batch, which is designed for batch processing.
As far as I know, the JDBC-ODBC Bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it speeds up your application itself.
I am not all that familiar with JDBC, but regarding the multithreading part of your question, keep in mind that parallel processing relies on effectively dividing your problem into pieces that are independent of one another and then putting their outputs back together in some way. If you don't know the underlying dependencies between tasks, you might end up with really odd errors or exceptions in your code. Even worse, it might all execute without any problems while the results are wrong. Multi-threading is a tricky business: fun to learn in a way (at least I think so), but a pain in the neck when things go south.
Here are a couple of links that might prove useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort into learning multi-threading, I can recommend Brian Goetz, Java Concurrency in Practice. An amazing book, really.
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL Loader(Oracle) for uploading data into database(very fast) OR any similar bulk update tools for your database.
STEP2:
Running each uploading process in a different thread(for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify different inter-related jobs in your application and categorize them in groups; and running each group in different threads.
Links to get you started:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
If multi-threading would complicate your work, you could go with asynchronous messaging. I'm not fully aware of what your needs are, so the following is based on what I can see so far.
Create a file-reader Java program whose purpose is to read the biz file and put messages onto the JMS queue on the server. This could be plain Java with a static void main().
Consume the JMS messages in message-driven beans (you can set a limit on the number of beans created in the pool, 50 or 100 depending on the need). If you have multiple servers, all the better: your job is now split across multiple servers.
Each row of data is asynchronously split between, say, 2 servers with 50 beans on each server.
You do not have to deal with threads anywhere in the process. JMS is ideal because your data is within a transaction: if something fails before you send an ack to the server, the message will be resent to the consumer, and the load will be split between the servers without you doing anything special like multi-threading.
Also, Spring provides Spring Batch, which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios