Parallel job execution with split-and-aggregate in Java

We are rewriting an existing application and need to support a high volume of database reads and writes. For this, we are sharding on MySQL. Since we expose bulk APIs for reads and writes, this means executing queries on different shards in parallel.
Can you suggest Java frameworks that support this, focusing mainly on split-and-aggregate jobs? Basically, I will define two interfaces, ReadTask and WriteTask; implementations of these tasks will be jobs, and they will be submitted as a list for parallel execution.
I may not have phrased this question in the right way, but I hope the description gives you the context. Let me know if any more information is needed.

BLUF: This sounds like a common processing pattern in Akka.
This sounds like a Scatter-Gather patterned API.
For any given job, you should first answer whether it will touch only one shard or several. If it will touch many shards, you may choose to reject it (allowing only single-shard actions) or you may choose to break it up (scatter it) across other workers.
Akka gives you APIs, especially the Streaming API, that address this style of work. Akka is best expressed in Scala, but it has a Java API that gives you all the functionality of the Scala one. Since you are talking about "mapping" and "reducing" (or "folding") data, these are functional operations, and Scala provides the functional idioms for them.
If you scatter it across other workers, you'll need to communicate the manifest of jobs to the gather side of the system.
Hope that's helpful.
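To make the pattern concrete without committing to Akka, here is a minimal scatter-gather sketch in plain Java (Java 9+). The names ScatterGather and queryShard are hypothetical; queryShard stands in for a real per-shard query:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ScatterGather {
    // Scatter: launch one query per shard in parallel; Gather: wait and merge.
    public static List<String> queryAllShards(List<Integer> shardIds, ExecutorService pool) {
        List<CompletableFuture<String>> futures = shardIds.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> queryShard(id), pool))
                .collect(Collectors.toList());
        // Gather: join every shard's future and aggregate into one result list
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    // Placeholder for a real shard query (e.g. a JDBC call against one MySQL shard)
    public static String queryShard(int shardId) {
        return "rows-from-shard-" + shardId;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println(queryAllShards(List.of(0, 1, 2, 3), pool));
        pool.shutdown();
    }
}
```

The manifest mentioned above corresponds here to the `futures` list: the gather side knows exactly which scattered jobs it must wait for.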

You can use ThreadPoolExecutor and the Executors factory class in Java to create thread pools to which you can submit your read and write tasks. They accept both Runnable and Callable, depending on your situation.

Related

Java 8 CompletableFuture vs Netty Future

How does the CompletableFuture introduced in JDK 8 compare with Netty's io.netty.util.concurrent.Future?
The Netty documentation mentions that
"JDK 8 adds CompletableFuture which somewhat overlaps io.netty.util.concurrent.Future"
(http://netty.io/wiki/using-as-a-generic-library.html)
The questions I'm trying to get answers to are:
What would their similarities and differences be?
How would the performance characteristics of the two differ? Which one would be able to scale better?
With respect to the similarities/differences, I have been able to come up with the following:
Similarities:
The fundamental similarity is that both are non-blocking, as compared to the standard Java Future. Both classes have methods to add a listener to the future, introspect failure and success of the task, and get results from the task.
Differences:
CompletableFuture seems to have a much richer interface for things like composing multiple async activities, etc. Netty's io.netty.util.concurrent.Future, on the other hand, allows multiple listeners to be added to the same Future, and moreover allows listeners to be removed.
If we look at that whole paragraph (especially the first sentence):
Java sometimes advances by adopting ideas that subsume constructs
provided by Netty. For example, JDK 8 adds CompletableFuture which
somewhat overlaps io.netty.util.concurrent.Future. In such a case,
Netty's constructs provide a good migration path to you; We will
diligently update the API with future migration in mind.
What it's basically saying is that the Netty Future and CompletableFuture are the same concept, implemented at different times by different people.
Netty created its own Future because there wasn't a suitable one available in Java, and they didn't want to pull one in as a dependency from something like Guava. Now that Java provides one, it's available for use.
At the end of the paragraph, they're essentially saying that the Netty API may replace its Future with CompletableFuture in the future.
As far as similarities/differences go, they're both just two of many implementations of the future/promise pattern. Use the Netty one when you're working with the Netty API and Netty-specific features; otherwise use CompletableFuture.
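To illustrate the "richer interface for composing async activities" claimed above, here is a small CompletableFuture sketch (class and method names are made up for the example):

```java
import java.util.concurrent.CompletableFuture;

public class FutureDemo {
    // Chain a transformation onto one future, then merge it with a second one,
    // all without blocking between stages.
    public static CompletableFuture<Integer> compute() {
        return CompletableFuture.supplyAsync(() -> 21)
                .thenApply(x -> x * 2)                  // transform: 21 -> 42
                .thenCombine(CompletableFuture.supplyAsync(() -> 100),
                             Integer::sum);             // merge two futures: 42 + 100
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> result = compute();
        // The listener-style equivalent: a callback fired on success or failure
        result.whenComplete((value, error) ->
                System.out.println(error == null ? "got " + value : "failed: " + error));
        System.out.println(result.join()); // 142
    }
}
```

Netty's Future exposes addListener/removeListener instead; CompletableFuture has no listener removal, but each `then*` call creates a new composed stage, which is what the comparison above is getting at.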

sending and receiving events in Java threads

I'm used to C++/Qt's concept of signals (emit/listen) and now I'm doing a project in Java which requires some sort of data sending/receiving mechanism.
My needs are:
Emit an event (with some data) and let all threads listen/catch it.
Obviously, given the previous requirement, being able to listen/catch signals with attached data.
Is this possible in Java, and how? (I'll appreciate a small compilable example/link)
Java by default doesn't have a simple event-handling mechanism such as .NET's events or Qt's Signals and Slots. It does have the notion of Listeners in the various Java GUI frameworks, but I don't think that's what you're looking for.
You should consider a pub-sub library like Google Guava's EventBus framework.
If you don't want to use a third party lib then I suggest you start looking into using one of the sub-classes of BlockingQueue. See the FileCrawler example from page 62 of Java Concurrency in Practice to see how to use a BlockingQueue to send events/data to worker threads.
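In the same spirit as that example, here is a minimal sketch of passing an event with data to a worker thread via a BlockingQueue (the Event record and names are assumptions for illustration; requires Java 16+ for records):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventQueue {
    // A tiny event type carrying data, as the asker wants
    public record Event(String name, int payload) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

        // Listener thread: take() blocks until an event is available
        Thread listener = new Thread(() -> {
            try {
                Event e = queue.take();
                System.out.println("received " + e.name() + " with " + e.payload());
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        listener.start();

        queue.put(new Event("dataReady", 42));  // "emit" the event
        listener.join();
    }
}
```

One caveat versus Qt signals: a single BlockingQueue delivers each event to exactly one consumer. To broadcast to all threads, give each listener its own queue and put the event on every one.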
If you're looking for a more complicated solution for message/event notifications across the process boundary or the local machine boundary then you may want to look into:
RabbitMq
Redis
JMS
Not sure if this matches your exact requirement, but have you tried CountDownLatch or CyclicBarrier?
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CyclicBarrier.html

Java Multi-Core Processing

I would like to learn about Java's multi-core processing. From what I understand, threading is one form of multi-core processing, and I feel I have a decent grasp of it. I know there are other ways to do multi-core processing, but I don't know what they are. Does anyone know of any good, simple tutorials/examples, or have their own that I could look at, to learn more about multi-core processing in Java?
All the tutorials that I have found get too in depth with charts, graphs, background information, etc. and that really isn't my learning style with programming. I would preferably like something quick and simple.
The primary way to use multiple cores is to use multiple threads. The simplest way to use these is via the High Level Concurrency Objects, which you should be familiar with. This uses threads, but you don't have to deal with them directly.
Another way is to use multiple processes, but this is an indirect way of using multiple threads.
You might find this library interesting: Java Thread Affinity. It allows you to assign threads to sockets, cores, or CPUs.
Here is Oracle's tutorial on the Java 7 fork/join framework:
http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
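A short fork/join example in the style of that tutorial, for illustration (the SumTask name and threshold value are choices made for this sketch):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;                // otherwise: split in half
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                              // run left half on another worker
        return right.compute() + left.join();     // compute right here, then join left
    }

    public static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] nums = new long[10_000];
        for (int i = 0; i < nums.length; i++) nums[i] = i + 1;
        System.out.println(parallelSum(nums)); // 50005000
    }
}
```

The recursion spreads the work over all cores via the common pool's work-stealing scheduler, which is the point of the framework.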
Beyond the mentioned High Level Concurrency Objects (fork/join should be added there), which are part of the Java implementation, there are many libraries and frameworks. Google for "actor framework", "dataflow framework", mapreduce, "scientific dataflow". The dataflow model is the mainstream; all the others are its variations (e.g. actor: a dataflow node with a single input port; mapreduce: persistent distributed actors created on demand; etc.). A minimal dataflow framework (no persistence or distribution over a machine cluster) is my df4j library.

java API or framework for queue processing

I need an open-source Java API or framework for processing items in a queue. I could develop something myself, but I don't want to reinvent the wheel (and I don't have much experience in multi-threading). Is there such a thing?
The closest solution I can think of is a business process management (BPM) solution.
Right now, I am using multiple Quartz jobs to process the items in my queue. It is not really working out because of scalability and concurrency issues.
Sounds like you'd want to use an Executor
A queue of what sort? How many items? Is Quartz not working out because it's too big or too small?
I'd give some serious thought to using message queues in something like OpenMQ.
You can use JMS with ActiveMQ to create an optimized queue system, as well as an ESB. If you want to manage a workflow-based system, then tpdi is right: use JBoss jBPM.
You can also process JMS messages with a thread pool; in that case, use Executors.
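A minimal sketch of the Executor approach suggested in these answers: a ThreadPoolExecutor whose internal BlockingQueue plays the role of the item queue (class name, pool size, and queue bound are assumptions for the example):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueProcessor {
    // Process n items on a fixed pool; returns how many completed.
    public static int processAll(int n) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1000));   // bounded queue holds waiting items
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            pool.execute(done::incrementAndGet);    // stand-in for real item processing
        }
        pool.shutdown();                            // accept no new work, drain the queue
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(100)); // 100
    }
}
```

Unlike multiple Quartz jobs polling the same store, the pool handles the concurrency internally: the bounded queue gives backpressure, and the worker count is a single tuning knob.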
Would the actor model fit your process? It's based around the idea of asynchronously passing messages between other actors. So you can set up a simple state machine to model your process and have all the transitions handled concurrently.
You need to determine whether the problem is in the framework you are using or in your code. I suggest you measure how fast your application is running and how fast your framework will go if it's not doing anything at all (just passing trivial tasks around). You should be able to perform between 100K and 1 million tasks per second using an in-process framework. Even using JMS you should be able to achieve 10K messages per second. If you need to do closer to 10 million tasks per second, I suggest you try grouping your tasks together so each task does more work.
I would be very surprised if your framework were the bottleneck, in which case I would suggest using an Executor.
If the framework isn't the cause of your scalability and concurrency issues (which is more likely), you need to restructure your code so it can run for longer periods of time without interdependencies. In other words, you have to fix your code; a framework won't do that for you.
I know it is 5 years late, but this might help someone else that has been driven into this question.
Nowadays, there is http://queues.io and it contains a whole lot of queuing (and messaging) frameworks...

Scalability of Java EE Application. How would you approach it?

I've been working on a solution for the financial industry. The main functionality of the application is the ability to load massive input files, digest them, update state in a persistent store, and generate extracts from the persistent store on request. Pretty straightforward.
The input files are large (hundreds of megabytes or more) XML messages in an industry-standard format, containing many repeated entries. The persistent storage is a relational database. The engine has been implemented as a POJO-based Java application (with the Spring Framework as its backbone), deployable on a J2EE application server.
The question is about the scalability and performance of the solution. If the application processes entries from the XML in sequence, scalability is rather poor: there is no way to engage more than one instance of the application in the processing of a single file. This is why I've introduced parallel processing for entries from the input XML file.

Basically, the idea is to dispatch the processing of individual entries to workers from a pool. I decided to use JMS for dispatching: the component that loads the file reads the stream, extracts single entries, and feeds the dispatching queue. There is a number of concurrent consumers on the other end of the queue. Each picks one message off the queue, processes the entry, and is immediately available to process another entry. This is pretty similar to servlets within a web container.

What I found particularly powerful about this approach is that the workers can reside within separate instances of the application deployed on remote servers, as long as the queue is shared. Unfortunately, all workers connect to the same database that maintains the persistent store, and this might be a bottleneck if the database server is not powerful enough to handle the load from concurrent workers.
What is your opinion on this architecture? Did you have similar application to design? What was your design choice then?
You can also have a look at Hadoop, a very handy platform for Map/Reduce jobs. The huge advantage is that all the infrastructure is provided by Hadoop, so you only add new hardware nodes to scale. Implementing the Map and Reduce jobs should only be done once; after this, you can feed your cluster with massive load.
I think the architecture is generally sound. If the database is having trouble dealing with a high number of concurrent updates from the workers, you could introduce a second queue on the other "side" of the app: as each worker completes its task, it adds the results of that task to the queue. Then a single worker process periodically grabs the result objects from the second queue and updates the database in a large batch operation. That would reduce database concurrency and might increase the efficiency of updates.
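The batching worker on that second queue might be sketched like this in plain Java, using BlockingQueue.drainTo to collect a batch (names and timeouts are illustrative assumptions; a real version would write the batch via a JDBC batch statement):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchWriter {
    // Collect up to maxBatch results: block briefly for the first one,
    // then drain whatever else is already queued without blocking.
    public static List<String> nextBatch(BlockingQueue<String> results, int maxBatch)
            throws InterruptedException {
        List<String> batch = new ArrayList<>();
        String first = results.poll(100, TimeUnit.MILLISECONDS);
        if (first == null) return batch;              // nothing arrived this round
        batch.add(first);
        results.drainTo(batch, maxBatch - 1);         // grab the rest in one call
        return batch;                                 // caller writes this batch to the DB
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        results.put("row-1"); results.put("row-2"); results.put("row-3");
        System.out.println(nextBatch(results, 100)); // [row-1, row-2, row-3]
    }
}
```

With a JMS queue instead of an in-process BlockingQueue, the same shape applies: one consumer receives messages in a loop and flushes to the database whenever the batch is full or a time limit expires.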
Also, take a look at the Terracotta clustering solution.
For parallel processing, as Mork0075 said, Hadoop is a great solution. Many companies actually use it for very large log analysis. And an interesting project, Hive, has been built on top of Hadoop for data warehousing.
Anyway, I think your current design is quite scalable. As for your concern about all the workers hitting the database, you can put another messaging queue between the workers and the database: workers put processing results in the queue, and another program subscribes to the queue and updates the database. The drawback is that two queues might make the system too complicated; of course, you could instead just add another topic to the existing MQ system, which keeps things simpler. Another approach is to use a shared file system, such as NFS: each worker machine mounts the same directory on the shared file server and writes its processing results into a separate file there, and a separate program checks for new files and updates the database. This approach introduces a different piece of complexity: the shared file server. You can judge which one is simpler in your case.
I recently spent some of my spare time investigating Spring Batch 2.0. This is a new version of the Java batching engine based on the Spring Framework. The people who implemented Spring Batch concentrated on concurrency and parallelization of execution for this release. I must say, it looks promising!
In answer to your questions:
What is your opinion on this architecture? Did you have similar application to design? What was your design choice then?
I think it's a good architecture, and you're right: the DB is your bottleneck. However, the design is flexible enough that you can control the amount of input to the database.
I have, and multi-threading across nodes works. I'm not entirely sure that Hadoop or another distributed processing system will give you much more than what you already have, since you're simply doing I/O to a database.
I've implemented something similar using JMS queues for centralized logging, and it worked quite well, with less impact on the code than writing the logs to disk. I think it'll work well for your application.
If you are already using Spring/Java EE, it is only natural to apply Spring Batch as a solution for your "concurrency architecture".
Two benefits right off the bat:
Spring Batch (starting from 2.0) implements partitioning, which means the framework will take care of partitioning the data for you into separate partition steps (StepExecution), and delegating the actual execution of these steps to multiple threads or other distributed systems (PartitionHandlers, e.g. TaskExecutorPartitionHandler, or, for a more distributed setup, MessageChannelPartitionHandler, etc.)
Spring has a nice OXM package for dealing with XML, and Spring Batch has a StaxEventItemReader that extracts fragments from the input XML document corresponding to records for processing
Give Spring Batch a try. Let me know if you have any questions, I'll be glad to help out.
EDIT:
Also look at Scala/Akka actors and/or Scala parallel collections. If your task can be sharded/partitioned/distributed, that's what the actor model is for.
If you'd like to consider a non JVM solution, take a look at Erlang OTP => simple and elegant.
