I am stuck on a serious problem. I am sending requests to a server, each of which contains a URL as its data. To explain: I have a file which contains URLs in sequential order, and I have to read that sequential data using threads. Now, the problem is that there are one hundred thousand URLs, and each URL must be sent to the server within a particular time (say 30 seconds), so I have to create threads that finish the task in the desired time. I also have to read the file in such a way that if the first thread serves the first 100 URLs, the second thread serves the next 100 URLs, and so on for the other threads. I am doing this with socket programming, so there is only one port I can use at a time. How can I solve this problem? Please give me a nice and simple idea, and if possible an example as well.
Thanks in Advance
Nice and simple idea (if I understand your question correctly): you can use a LinkedList as a queue. Read the URLs in from the file and put them in the list. Spawn your threads, which then pull (and remove) the next 100 URLs from the list. LinkedList is not thread-safe though, so you must synchronize access yourself.
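For illustration, here is a minimal sketch of that idea (the class names, the batch size of 100, and the "send over the socket" step are assumptions taken from the question, not a definitive implementation):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class UrlDispatcher {
    private final Queue<String> urls = new LinkedList<>();

    public UrlDispatcher(List<String> allUrls) {
        urls.addAll(allUrls);               // e.g. read from the file beforehand
    }

    // Synchronized so only one thread removes a batch at a time
    public synchronized List<String> nextBatch(int size) {
        List<String> batch = new ArrayList<>(size);
        while (batch.size() < size && !urls.isEmpty()) {
            batch.add(urls.remove());
        }
        return batch;
    }
}

// Each worker repeatedly asks for the next 100 URLs and sends them
class UrlWorker implements Runnable {
    private final UrlDispatcher dispatcher;

    UrlWorker(UrlDispatcher dispatcher) { this.dispatcher = dispatcher; }

    @Override
    public void run() {
        List<String> batch;
        while (!(batch = dispatcher.nextBatch(100)).isEmpty()) {
            for (String url : batch) {
                // send the URL to the server over your socket here
            }
        }
    }
}
```

Because each call to nextBatch is atomic, two threads can never receive the same URL.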
One thing that you could look into is the fork/join framework. The way the Java tutorial explains it: "It is designed for work that can be broken into smaller pieces recursively. The goal is to use all the available processing power to make your application wicked fast". Then all you really need to do is figure out how to break up your tasks.
http://download.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
you can find the jar for this at: http://g.oswego.edu/dl/concurrency-interest/
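As a rough, hedged illustration of how fork/join breaks work into smaller pieces recursively (the threshold, class name, and the per-URL work are made up for this sketch):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Processes a slice of URLs, splitting it in half until it is small enough
class ProcessUrls extends RecursiveAction {
    private static final int THRESHOLD = 100;   // arbitrary cut-off
    private final String[] urls;
    private final int from, to;

    ProcessUrls(String[] urls, int from, int to) {
        this.urls = urls; this.from = from; this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                // send urls[i] to the server
            }
        } else {
            int mid = (from + to) / 2;
            invokeAll(new ProcessUrls(urls, from, mid),
                      new ProcessUrls(urls, mid, to));
        }
    }
}

// Usage: new ForkJoinPool().invoke(new ProcessUrls(allUrls, 0, allUrls.length));
```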
I have a problem with counting responses from a response queue. Once per day we run a job which gathers some data from the DB and sends it to a queue. When we have received all the responses we should shut down the connection. The problem is: how can we check whether all the responses have arrived? Keeping the count in a global variable is risky because of concurrency issues. Any ideas? I am quite new to JMS, so maybe the solution is obvious, but I don't see it.
I don't know what your stack is or what tools you might be using to accomplish this, but I have the following in mind and it might help you out (hopefully).
Generate a hash for each job you plan on queuing and store it in a concurrent list/map (e.g. ConcurrentHashMap).
Send the job to the queue.
Once the job is done and sends back a response, reproduce the hash and store it in a separate concurrent list/map that holds all the jobs that are done.
Now you have two lists: all the jobs that are supposed to be executed, and the jobs you have received a response for. There are multiple ways to compare them; if you look up Java concurrency you will find plenty of tutorials and documentation. I like to use CyclicBarrier and CountDownLatch. If you plan on using any of these, take extra precautions to prevent your application from hanging or, worse, a filthy memory leak.
OR, you could simply keep count of how many requests you have queued and how many responses you have received, and when they are equal to each other, drop the connection.
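If the number of jobs is known up front, a CountDownLatch is probably the simplest of those options. A minimal sketch, assuming you can call into it from your JMS message listener (the class and method names are illustrative):

```java
import java.util.concurrent.CountDownLatch;

public class ResponseTracker {
    private final CountDownLatch latch;

    public ResponseTracker(int expectedResponses) {
        this.latch = new CountDownLatch(expectedResponses);
    }

    // Call this from your message listener for every response that arrives
    public void responseArrived() {
        latch.countDown();
    }

    // Blocks until every expected response has been counted down;
    // after that it is safe to shut down the JMS connection
    public void awaitAllResponses() throws InterruptedException {
        latch.await();
    }
}
```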
I have two (Java) processes on different JVMs running repeatedly. The first one regularly finds some "information" and needs to store it somewhere. The second process regularly reads this information to handle it. The intervals are more or less random, so process 1 may find three pieces of information until process 2 reads them or vice versa.
My approach is to write this information to text files. But I am afraid that appending to and reading the text file could accidentally happen at the same time, so that I run into locking problems. On the other hand, writing a new text file for each piece of information seems like overkill.
What would be a better solution?
EDIT: I am sorry, I did not make clear: The java processes run in different JVMs. They cannot see each other directly.
You can get this to work, provided you are careful with file handling and you don't have a high update rate e.g. 10 updates per second.
Note: you could do it with file renaming instead of locks.
What would be a better solution?
Just about anything. SO is not for recommending things, but in this case I could recommend just about anything without more specific requirements. I could, for example, recommend my library Chronicle Queue, because I wrote it and I am sure it could do what you want; however, there are many possible alternatives.
I am sending about one line of text every minute.
So you can write a temporary file for each message, rename it when finished. The consumer can have a directory watcher so it knows as soon as you have done this. The consumer could delete the file when done. This has an overhead but it would be less than 10 ms.
If you want to keep a record of all messages, the producer can also write to a log file.
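A hedged sketch of that rename-plus-directory-watcher idea using java.nio (the directory, file names, and message format are assumptions):

```java
import java.io.IOException;
import java.nio.file.*;

public class FileExchangeSketch {
    static final Path DIR = Paths.get("/tmp/inbox");   // assumed exchange directory

    // Producer: write to a temp file, then rename so the consumer never sees a partial file
    static void publish(String message) throws IOException {
        Path tmp = DIR.resolve("msg.tmp");
        Files.write(tmp, message.getBytes());
        Files.move(tmp, DIR.resolve("msg-" + System.nanoTime() + ".txt"),
                   StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer: watch the directory and process files as soon as they appear
    static void consume() throws IOException, InterruptedException {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        DIR.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();              // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = DIR.resolve((Path) event.context());
                if (created.toString().endsWith(".txt")) {
                    System.out.println(new String(Files.readAllBytes(created)));
                    Files.delete(created);              // done with this message
                }
            }
            key.reset();
        }
    }
}
```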
I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it because it is the processing of each line that takes the most time to complete rather than reading each line. (the bottleneck lies in the CPU processing, rather than the IO)
There are two options I came up with:
1) Open the file from main thread, create a lock on the file handle and pass the file handle around the worker threads and then let each worker read-access the file directly
2) Create a producer / consumer setup where only the main thread has direct read-access to the file, and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in speed performance for this task
Each line is independent
I am doing this in C++, but I guess the issue here is somewhat language-independent
Which option would you choose and why?
I would suggest the second option, since it is cleaner design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads in order to synchronize their progress through the file's lines. In the second option you have one dispatcher which deals with the I/O and feeds the worker threads, and each computational thread is completely independent of the others, which allows you to scale. Moreover, the second option separates your logic more cleanly.
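Since the question notes the issue is largely language-independent, here is a rough sketch of that dispatcher/worker design in Java using a bounded blocking queue (the file path comes from the command line; the poison-pill shutdown and per-line work are illustrative assumptions):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LineProcessing {
    // Unique instance used as an end-of-input marker (deliberately compared by reference)
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        int workers = Runtime.getRuntime().availableProcessors();

        // Worker threads: take lines off the queue and process them
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        process(line);
                    }
                } catch (InterruptedException ignored) { }
            }).start();
        }

        // Dispatcher (main thread): the only reader of the file
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                queue.put(line);            // blocks if the workers fall behind
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);              // one sentinel per worker so all of them stop
        }
    }

    static void process(String line) {
        // the expensive per-line calculation goes here
    }
}
```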
If we are talking about massively large file, which needs to be processed with a large cluster - MapReduce is probably the best solution.
The framework allows you great scalability, and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to take files read from a file system (originally GFS) as input.
Note that there is an open source implementation of map-reduce: Apache Hadoop
If each line is really independent and the processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each line is one element of the array.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread could perform the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
I would suggest the second option because it is definitely better design-wise and would allow you to have better control over the work that the worker threads are doing.
Moreover, it would increase performance, since that option has the lower amount of inter-thread communication of the two you described.
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion between the threads.
Our company has a batch application which runs every day. It mostly does database-related jobs, for example importing data into database tables from files.
There are 20+ tasks defined in that application, each of which may or may not depend on other tasks.
The application executes the tasks one by one; the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think that is too long, so maybe I can improve performance with multi-threading.
I think that, as there are dependencies between tasks, it is not good (or not easy) to make the tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
For example: we have a task called "ImportBizData", which copies data into a database table from a data file (usually containing 1,000,000+ rows). Is it worth using multi-threading there?
As I only know a little about multi-threading, I would appreciate some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate on the last point: currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue.
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the thread in step #1 waits for the slow hard disk to return new lines of data. The result of this conversion step goes into the next queue.
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus, inserting data into a database is a fairly complex task (allocating space, updating indexes, checking foreign keys).
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
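A hedged sketch of the reader-to-queue-to-JDBC part of that pipeline (the file name, table, columns, JDBC URL, and chunk size of 1000 are invented; error handling is reduced to the bare minimum):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ImportPipeline {
    // Unique instance used as an end-of-input marker (compared by reference)
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // Reader thread: the only one touching the file
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("biz_data.csv"))) {
                String line;
                while ((line = in.readLine()) != null) queue.put(line);
                queue.put(POISON);
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        // Uploader thread: its own connection, batched inserts, commit every 1000 rows
        Thread uploader = new Thread(() -> {
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pw")) {
                con.setAutoCommit(false);
                PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO biz_data (col1, col2) VALUES (?, ?)");
                int count = 0;
                String line;
                while ((line = queue.take()) != POISON) {
                    String[] f = line.split(",");
                    ps.setString(1, f[0]);
                    ps.setString(2, f[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) {     // keeps the rollback/undo segments small
                        ps.executeBatch();
                        con.commit();
                    }
                }
                ps.executeBatch();
                con.commit();
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        reader.start();
        uploader.start();
        reader.join();
        uploader.join();
    }
}
```

A calculation/conversion thread could be inserted between the two by adding a second queue, exactly as described above.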
Multi-threading may help. If the lines are uncorrelated, you could start two processes, one reading the even lines and the other the odd lines, get your DB connections from a connection pool (DBCP), and analyze the performance. But first I would investigate whether JDBC is the best approach; databases normally have optimized solutions for imports like this. Those solutions can also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.
Also, you may want to check out Spring Batch, which is designed for batch processing.
As far as I know, the JDBC bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC, but regarding the multi-threading part of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into pieces that are independent of one another and then putting them (their output, that is) back together. If you don't know the underlying dependencies between tasks, you might end up with really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from the true values. Multi-threading is tricky business; in a way it is fun to learn (at least I think so), but a pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting in the effort to get into multi-threading, I can recommend Brian Goetz's Java Concurrency in Practice; an amazing book, really.
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Use SQL*Loader (Oracle) for uploading data into the database (very fast), or any similar bulk-load tool for your database.
STEP2:
Run each uploading process in a different thread (for unrelated tasks), and in a single thread for related tasks.
P.S. You could identify the different inter-related jobs in your application, categorize them into groups, and run each group in a different thread.
Links to get you started:
JAVA Threading
Follow the last example in the above link (Example: Partitioning a large task with multiple threads).
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
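Roughly, that Oracle-specific batching looks like the sketch below (the table, columns, and connection string are invented; setExecuteBatch/sendBatch are extensions of the Oracle driver, and newer drivers steer you toward standard addBatch/executeBatch instead, so treat this as illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import oracle.jdbc.OraclePreparedStatement;

public class ArrayInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/service", "user", "pw")) {
            con.setAutoCommit(false);
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO target_table (id, name) VALUES (?, ?)");

            // Oracle-specific: buffer executeUpdate() calls and send them in arrays of 100
            ((OraclePreparedStatement) ps).setExecuteBatch(100);

            for (int i = 0; i < 10_000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "row " + i);
                ps.executeUpdate();            // queued, not sent, until 100 are pending
            }
            ((OraclePreparedStatement) ps).sendBatch();   // flush the remainder
            con.commit();
        }
    }
}
```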
If multi-threading would complicate your work, you could go with asynchronous messaging. I'm not fully aware of what your needs are, so the following is based on what I'm seeing currently.
Create a file reader in Java whose purpose is to read the biz file and put messages onto the JMS queue on the server. This could be plain Java with a static void main().
Consume the JMS messages in message-driven beans (you can set a limit on the number of beans created in the pool, 50 or 100 depending on the need). If you have multiple servers, well and good; your job is now split across multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads anywhere in the process. JMS is ideal because your data is within a transaction: if something fails before you send an acknowledgement to the server, the message will be resent to the consumer. The load is split between the servers without you doing anything special like multi-threading.
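A hedged sketch of the plain-Java file reader/producer side (the JNDI names, queue name, and file name are assumptions that depend on how your server is configured; the message-driven bean side is configured in the container):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

public class BizFileProducer {
    public static void main(String[] args) throws Exception {
        // JNDI names are assumptions; they depend on your application server setup
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/BizDataQueue");

        try (BufferedReader in = new BufferedReader(new FileReader("biz_data.csv"))) {
            Connection con = cf.createConnection();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);

            String line;
            while ((line = in.readLine()) != null) {
                producer.send(session.createTextMessage(line));  // one message per row
            }
            con.close();
        }
    }
}
```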
Also, Spring provides Spring Batch, which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios
I need to read 200,000 or so records from a website and store them in a DB. The application is a desktop app implemented on top of the NetBeans Rich Client Platform. By using the Apache HttpComponents library, I can send requests to the website and retrieve the responses that contain the record information; then, using regex, I can fairly easily extract the dozen or so fields that I need from the HTML.
I am thinking of having 2 worker threads besides the GUI thread. One worker thread handles the HTTP request/response part and also extracts the records from the HTML using regex, while the other worker thread stores the records in the DB. So there will be a data structure holding the records that is shared between the two worker threads. I am also considering having a buffer of size 100 (for example) for the HTTP worker thread to store records in; when the buffer is full, it would transfer 100 records at a time to the shared record holder.
Please comment on my design and also my questions are:
what is the proper data structure to hold the records?
how to synchronized it between the two worker threads?
how would the multi-threads be implemented in the modular system of Netbeans Platform?
what is the proper data structure to hold the records?
Depends on the data. Probably a simple class with a bunch of fields (preferably immutable to make using multiple threads safer).
how to synchronized it between the two worker threads?
One of the BlockingQueue implementations might be good for that. ArrayBlockingQueue can be used as a fixed-size buffer for passing work between the threads.
how would the multi-threads be implemented in the modular system of Netbeans Platform?
No idea whether NetBeans Platform has anything to say about that. Launching your own threads should work.
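A minimal sketch of that handoff, assuming an immutable record class and a fixed-size buffer of 100 (the Record fields and the per-thread work are invented placeholders):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Immutable record holder, as suggested above (field names are invented)
final class Record {
    final String id;
    final String name;
    Record(String id, String name) { this.id = id; this.name = name; }
}

public class ScrapePipeline {
    public static void main(String[] args) {
        // Fixed-size buffer of 100 records between the two workers
        BlockingQueue<Record> buffer = new ArrayBlockingQueue<>(100);

        Thread scraper = new Thread(() -> {
            // fetch pages, extract the fields with your regex, then:
            try {
                buffer.put(new Record("1", "example"));   // blocks if the buffer is full
            } catch (InterruptedException ignored) { }
        });

        Thread dbWriter = new Thread(() -> {
            try {
                while (true) {
                    Record r = buffer.take();             // blocks if the buffer is empty
                    // insert r into the database here
                }
            } catch (InterruptedException ignored) { }
        });

        scraper.start();
        dbWriter.start();
    }
}
```

The blocking queue handles the synchronization for you, so neither thread needs explicit locks.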
First of all, this kind of HTML parsing would slow down your app quite badly. Also, the code would be quite fragile, since HTML changes quite often for aesthetic enhancements. You should resort to 'HTML scraping' only as a last resort. Most customers agree to open up a web service/data service for this once you explain the disadvantages.
If you really have no other alternative, then I think your approach is good. But instead of waiting for the buffer to be full, you could have a set of threads writing into the buffer and a set of threads reading from the buffer simultaneously. I would suggest using more HTTP scraper threads and fewer DB-write threads, since the HTTP request-response cycle and HTML parsing will be orders of magnitude slower than a database write.