Number of threads for precise timed large dataset calculations?

Number of threads for precise timed large dataset calculations? - java

I'm not very competent on the details of Java's concurrency execution with multiple threads as I don't have an extensive understanding of how microprocessors execute concurrent operations on the same thread, make use of hyper-threading, control the cache, the OS's ability to make use of threads ect...
I have done a good amount of research to inform myself but, I still don't quite understand how to optimize my code.
Specifically I need to be able to simultaneously retrieve input from a network connection, write the data to a file, and perform complex mathematical operations (most taxing being a polynomial regression) which requires often tens of numbers in excess of 1.0e32 being processed. And I need this to be done sometimes up to thousands of times within a matter seconds.
So what would be a good way of approaching my concurrency for these elements, assuming that the application may be run from a server or a common desktop? If the question is too vague anyone who could point me in the right direction to understanding multiprocessing in Java would be greatly appreciated also.

The multi-threading model depends on the application. Is it like the network connection keeps sending data at a very fast rate? Are your complex calculations made on each single datum that you receive over the connection? Are the calculations independent of each other or can be accumulated after individual processing? If the answer to all these questions is a yes, then a good model would be to have a Socket reader thread write data from the network into a queue, and have several threads read from this queue and compute the operation on the datum that it read. This concept is known as a thread pool.

Related

Recommended size of a thread pool where several related tasks are submitted once

I am trying to make a program that will execute a variable number of possibly (but not certainly) computationally heavy tasks in parallel. These tasks (of Runnable type) will all be submitted at the same time and the thread pool should shut down once all these tasks are complete (in other words, the pool will only need to accept the initial tasks and nothing more).
In most of the answers that I found on this site, the question was about a server-based task (I am running my program on a decent desktop) or a pool that accepts tasks over irregular time intervals. In the questions that were not specific about the use, the answer was usually "it depends."
I have basically zero experience with threads, so I really do not know what is the optimal "thread count to task intensity" ratio.
For context, the program that I am working on deals with collections of matrices (represented by 3D arrays) where each matrix can contain up to 1000x1000 elements. One of the tasks may be to perform a convolution operation, and each task is an operation on one of the matrices in the collection.
Is there a recommendation for this specific type of problem?

The same that you hear when that question gets asked for a server: don't make assumptions, make experiments.
Try to identify (worst case: guess) the typical hardware setup that your users are running your software on. Then make sure you can do nicely automated performance testing. And then see what happens.
But thing is: that won't help much. You see, when you run your own server, you are (hopefully) in control about the workload that these machines are busy with. For a desktop setup, where remote users run your code on their boxes ... you have zero insights what else is running there. You might find that 16 threads are fine for 50% of the users. But the rest is maybe doing a lot of other things on their machines, and 16 is already way too much for them.
And that is the real crux. No matter what number you find "good to go" for a specific hardware configuration, you have no control about other workloads.
From that point of view, I would be pretty conservative. For a CPU intensive workload "too many" threads isn't helpful anyway, so go with the number of CPUs, or better number of cores as starting point.
Beyond that, what might be really helpful here: add some sort of "data gathering" to your application. Meaning: have it call home regularly, to tell you things like: "this is the hardware I am running on, I am using X threads, and the other workload on the system is Y". That might help you to get to some heuristics to adapt to the most important user setups. But be diligent about what data to collect. Define the questions you want to be answered upfront, and then pull the data you need to answer these questions.

If you workload is computationally intensive (CPU bound) you might want to look into ForkJoinPool which implements worker stealing.
A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as when many small tasks are submitted to the pool from external clients.

Stream processing architecture

I am in the process of designing a system where there's a main stream of objects and there are multiple workers which produces some result from that object. Finally, there is some special/unique worker (sort of a "sink", in terms of graph theory) which takes all the results, and process them to some final object which is written to some DB.
It is possible for a worker to be dependent on the result of some other workers (hence, waiting for their results)
Now, I'm facing several problems:
It could be that one worker is much slower than another. How do you deal with that? Adding more workers (= scaling) of the slower type? (maybe dynamically)
Suppose W_B is dependent on W_A. If W_B is down for some reason then the flow will stop and the system will stop working. So I'd like the system to bypass this worker, somehow.
Moreover, how do the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacking the result of C. It may be that C is down or it's just very slow at the moment. How can it make a decision?
It is worth mentioning that it's not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time, it has to deal with relatively large amount of objects in an "high pace".
Regarding technologies,
I'm developing the system with Java but I'm not bounded to a specific technology.
I'd be glad if you could help me with the general design of the system.
Thanks a lot!

As Peter said, it really depends on the use case. Some general remarks though:
If a worker is slower than the other, maybe create more instances of that type; eg Kubernetes allows dynamic Node creation, and Kafka allows to partition a topic so more than one instance can read off and process it.
If B depends on A and A is down, B can't work and that's it. Maybe restart A? Maybe you can do a regular health check on it.
If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, install a timer, and if that goes off without C having arrived, continue.

Some additional thoughts:
If you mean to say that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker is doing a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean to say that some machines are slower than others, then you could run fewer workers on the slow machines, and more on the faster ones, so as to balance things so that each worker has roughly the same resources.
You might want to decouple your architecture with some sort of durable queueing between the workers.
It's common to use heartbeats with timeouts and restarts.
Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top a stream processing framework that provides high availability and exactly-once semantics out of the box.

Synchronous multithreading in Java (Apache HTTPClient)

I am wondering how I would go about doing this. Say I load a list of 1,000 words and for each word a thread is created and say it does a google search on each word. The problem here is obvious. I can't have 1k threads, can I. Keep in mind I am extremely new to threads and synchronization. So basically I am wondering how I would go about using less threads. I assume I have to set thread amount to a fixed number and synchronize the threads. Was wondering how to do this with Apache HttpClient using GetThread and then run it. In run I'm getting the data from webpage and turning it into a String and then checking if it contains a certain word.

Surely you can have as many threads as you want. But in general it is not recommended to use more threads than there are processing cores on your computer.
And don't forget that creating 1000 internet sessions at once affects your networking. A size of one single google page is nearly 0.3 megabytes. Are you really going to download 300 megabytes of data at once?
By the way,
There is a funny thing about concurrency.
Some people say: "synchronization is like concurrency". It is not true.
Synchronization is the opposite of concurrency.
Concurrency is when lots of things happen in parallel.
Synchronization is when I am blocking you.
(Joshua Bloch)

Maybe you can look at this problem this way.
You have 1000 words and for each word you are going to carry out a search.
In other words there are 1000 tasks to be executed and they are not related
to each other, so there is no need for synchronization in the case of this
problem as per the following definition from Wiki.
"In computer science, synchronization refers to one of two distinct but related concepts: synchronization of processes, and synchronization of data. Process synchronization refers to the idea that multiple processes are to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action. Data Synchronization refers to the idea of keeping multiple copies of a dataset in coherence with one another, or to maintain data integrity"
So in this problem you do not have to synchronize the 1000 processes which
execute the word searches since they can run independently and dont need
to join forces. So it is not a Process synchronization.
It is not a Data synchronization either since the data of each search is
independent of the other 999 searches.
Hence when Joshua says Synchronization is when I am blocking you, there is no need of blocking in this case.
Yes all tasks can concurrently get executed in different threads.
Of course your system may not have the resources to run 1000 threads
concurrently ( read same time ).
So you need concepts like pools where a pool has a certain no of
threads...say if it has 10 threads...then those 10 will start
10 independent searches on 10 words from your list.
If any of them is done with its task then it will take up the next
word search task available and the process goes on....

Use threads as "sessions"

I am developing a text-based game, MUD. I have the base functions of the program ready, and now I would like to allow to connect more than one client at a time. I plan to use threads to accomplish that.
In my game I need to store information such as current position or health points for each player. I could hold it in the database, but as it will change very quick, sometimes every second, the use of database would be inefficient (am I right?).
My question is: can threads behave as "sessions", ie hold some data unique to each user?
If yes, could you direct me to some resources that I could use to help me understand how it works?
If no, what do you suggest? Is database a good option or would you recommend something else?
Cheers,
Eleeist

Yes, they can, but this is a mind-bogglingly stupid way to do things. For one thing, it permanently locks you into a "one thread per client" model. For another thing, it makes it difficult (maybe even impossible) to implement interactions between users, which I'm sure your MUD has.
Instead, have a collection of some kind that stores your users, with data on each user. Save persistent data to the database, but you don't need to update ephemeral data on every change.
One way to handle this is to have a "changed" boolean in each user. When you make a critical change to a user, write them to the database immediately. But if it's a routine, non-critical change, just set the "changed" flag. Then have a thread come along every once in a while and write out changed users to the database (and clear the "changed" flag).
Use appropriate synchronization, of course!

A Thread per connection / user session won't scale. You can only have N number of threads active where N is equal to the number of physical cores / processors your machine has. You are also limited by the amount of memory in your machine for how many threads you can create a time, some operating systems just put arbitrary limits as well.
There is nothing magical about Threads in handling multiple clients. They will just make your code more complicated and less deterministic and thus harder to reason about what is actually happening when you start hunting logic errors.
A Thread per connection / user session would be an anti-pattern!
Threads should be stateless workers that pull things off concurrent queues and process the data.
Look at concurrent maps for caching ( or use some appropriate caching solution ) and process them and then do something else. See java.util.concurrent for all the primitive classes you need to implement something correctly.

Instead of worrying about threads and thread-safety, I'd use an in-memory SQL database like HSQLDB to store session information. Among other benefits, if your MUD turns out to be the next Angry Birds, you could more easily scale the thing up.

Definitely you can use threads as sessions. But it's a bit off the mark.
The main point of threads is the ability of concurrent, asynchronous execution. Most probably, you don't want events received from your MUD clients to happen in an parallel, uncontrolled order.
To ensure consistency of the world I'd use an in-memory database to store the game world. I'd serialize updates to it, or at least some updates to it. Imagine two players in parallel hitting a monster with HP 100. Each deals 100 damage. If you don't serialize the updates, you could end up giving credit for 100 damage to both players. Imagine two players simultaneously taking loot from the monster. Without proper serialization they could end up each with their own copy of the loot.
Threads, on the other hand, are good for asynchronous communication with clients. Use threads for that, unless something else (like a web server) does that for you already.

ThreadLocal is your friend! :)
http://docs.oracle.com/javase/6/docs/api/java/lang/ThreadLocal.html
ThreadLocal provides storage on the Thread itself. So the exact same call from 2 different threads will return/store different data.
The biggest danger is having a leak between Threads. You would have to be absolutely sure that if a different user used a Thread that someone else used, you would reset/clear the data.

Tutorial about Using multi-threading in jdbc

Our company has a Batch Application which runs every day, It does some database related jobs mostly, import data into database table from file for example.
There are 20+ tasks defined in that application, each one may depends on other ones or not.
The application execute tasks one by one, the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think it's too long, so I think maybe I can improve performance by multi-threading.
I think as there is dependency between tasks, it not good (or it's not easy) to make tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
for example : we have a task defined as "ImportBizData", which copy data into a database table from a data file(usually contains 100,0000+ rows). I wonder is that worth to use multi-threading?
As I know a little about multi-threading, I hope some one provide some tutorial links on this topic.

Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contains a small, but not too small "unit of work". Push those into a queue
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys)
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance

Multi threading may be of help, if the lines are uncorrelated, you may start off two processes one reading even lines, another uneven lines, and get your db connection from a connection pool (dbcp) and analyze performance. But first I would investigate whether jdbc is the best approach normally databases have optimized solutions for imports like this. These solutions may also temporarily switch of constraint checking of your table, and turn that back on later, which is also great for performance. As always depending on your requirements.
Also you may want to checkout springbatch which is designed for batch processing.

As far as I know,the JDBC Bridge uses synchronized methods to serialize all calls to ODBC so using mutliple threads won't give you any performance boost unless it boosts your application itself.

I am not all that familiar with JDBC but regarding the multithreading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and in some way putting them back together (their output that is). If you dont know the underlying dependencies between tasks you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from true values. Multi-threading is tricky business, in a way fun to learn (at least I think so) but pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort to getting into multi-threading I can recommend GOETZ, BRIAN: JAVA CONCURRENCY, amazing book really..
Good luck

I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL Loader(Oracle) for uploading data into database(very fast) OR any similar bulk update tools for your database.
STEP2:
Running each uploading process in a different thread(for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify different inter-related jobs in your application and categorize them in groups; and running each group in different threads.
Links to run you up:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance

The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc

If Multi threading would complicate your work, you could go with Async messaging. I'm not fully aware of what your needs are, so, the following is from what I am seeing currently.
Create a file reader java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with static void main()
Consume the JMS messages in the Message driven beans(You can set the limit on the number of beans to be created in the pool, 50 or 100 depending on the need) if you have mutliple servers, well and good, your job is now split into multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads in the whole process, JMS is ideal because your data is within a transaction, if something fails before you send an ack to the server, the message will be resent to the consumer, the load will be split between the servers without you doing anything special like multi threading.
Also, spring is providing spring-batch which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.