Divide process workflow between remote workers

I need to develop a Java platform to download and process information from Twitter. The basic idea is to have a centralized controller that generates tasks (basically an id and keywords) and sends these tasks to remote workers (one per computer). I need to receive a status report periodically to know the status of both the task and the worker. I'll have at least 60 workers (ten times more in the near future).
My initial idea was to use RMI, but I need to communicate in both directions and I don't feel comfortable with RMI. The other approach was to use SSLSockets to send serialized objects, but I would have to handle a lot of errors and add a lot of code to monitor tasks and workers. Some people suggested using a framework like Spring Batch, Gigaspaces or Quartz.
What do you think would be the best option for this project? So far I've read a lot of good things about Gigaspaces, but I can't find a good tutorial on how to implement it, and Quartz seems promising. What do you think? Is it worth using any of them?

It's not easy to recommend a technology based on your question. GigaSpaces is certainly up to the job, but so is Spring Batch. Quartz covers only the scheduling part of your question, not the remoting or the distribution of workload.
GigaSpaces is a fully fledged application platform for scenarios where parallelism, high throughput and scalability are factors. Spring Batch can definitely also do the job but, unlike GigaSpaces, it is not an application platform, so you would still need to deploy your application somewhere.
However, GigaSpaces is a commercial product (a free version is available). Other frameworks that can help you, such as Storm Project (http://storm-project.net/) and Hazelcast (www.hazelcast.com), also come to mind.
So without clarifying your use case it's hard to give a single answer. It all depends on what exactly you want and how you want to use it, now and in the future.

Related

Job scheduling - how to choose the best one for springMVC based application

I am planning to add job scheduling to my Spring MVC application, and while searching I came across this. But I really have no idea whether there are many alternatives like Quartz, or which is the best scheduling API for a Spring-based application.

I think it really depends upon your requirements. For example:
Do jobs need to survive a restart of your infrastructure?
How critical is the availability of the scheduling framework?
How complex is the type of job you're trying to execute?
Quartz is a dedicated Job scheduling framework and as you would expect comes with many 'enterprisey' features that allow you to build a very highly available, highly performant Job scheduling implementation. It is fairly easy to get started with as well.
Other alternatives could be something like Amazon SQS, which again provides a very highly available job queue that operates as a service. However, the clue is in the name in terms of 'simple': you lose a lot of the features that something like Quartz would offer. Amazon does however provide a Java wrapper for the SQS API, so managing it as part of your build should be simple enough.
Alternatively the JDK comes with its own built in options. Take a look at the various implementations of the java.util.concurrent.ExecutorService interface. Again depending upon your requirements there may be something in there that fits the bill without having to depend upon external libraries or APIs.
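
As a rough sketch of the JDK-only route, assuming the job only needs to run at a fixed interval inside a single JVM: the class and method names below are made up for illustration, and nothing here survives a restart, which is one of the questions raised above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class: runs a job at a fixed rate using only the JDK.
class SimpleJobScheduler {

    // Runs `job` every `periodMillis` milliseconds until it has been executed
    // `runs` times, then stops the scheduler and returns the number of executions.
    static int runRepeatedly(Runnable job, long periodMillis, int runs)
            throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger executed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        scheduler.scheduleAtFixedRate(() -> {
            if (done.getCount() == 0) {
                return; // enough runs already; ignore any late tick before shutdown
            }
            try {
                job.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all future runs,
                // so it is caught and logged here instead.
                System.err.println("job failed: " + e);
            }
            executed.incrementAndGet();
            done.countDown();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        done.await();
        scheduler.shutdownNow();
        return executed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = runRepeatedly(() -> System.out.println("polling..."), 100, 3);
        System.out.println("executed " + n + " times");
    }
}
```

A real application would usually keep the scheduler running for the lifetime of the app and shut it down when the web context is destroyed, rather than stopping after a fixed number of runs.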
There is also this list of open-source job scheduling frameworks that should help you to compare other offerings with Quartz.

Decouple web services from other backend heavy computing service in Java

Background of the web application:
I am using java/spring-mvc/tomcat to provide my web service as well as exposing my restful API to mobile clients. I am happy with everything on the web surface right now. The problem is that my application has a really heavy computing process at its core, which invokes a separate Java program to process the images and return computed data back to the web service.
It sometimes eats up a lot of my EC2 instance's memory, or causes an exception that shuts down my Tomcat7 server.
Question:
Right now everything is running under the same tomcat7 container, and I am seeking a solution to decouple the two so that I can install them on different servers, perhaps finding a high-memory server for the computing program alone.
What are the options out there that allow me to decouple them and improve scalability and stability?
Update:
I can invoke computing engine programmatically or from command line.
Update2:
I have done some research based on the answers. After reading another post, What exactly is Apache Camel?, I feel I should probably learn a little more about EIP patterns. Hopefully, it is not overkill.
Solution based on suggestion
After reading up on the EIP concepts, Camel in Action, and ActiveMQ, I finally came up with a solution. It might not be elegant, but it's working. Suggestions and comments would be appreciated!
I wrote a queue router based on apache-camel that connects to an activemq broker and runs as a standalone program on one server. The computing engine runs in a standalone container, and the router is responsible for processing JMS requests from my spring container on the web server. Later on I just need to configure load balancing for the computing engine in camel if more intensive computing is needed.

The option you are pointing at right now is adding more hardware. You need to think through whether this solves your problem. E.g., if you are using a 32-bit JVM there are limits on how large a heap you can specify. If you are lucky enough to have a 64-bit JVM then you will have more room for memory. But there is always the possibility of using too much CPU, at which point your application becomes unresponsive.
I prefer breaking the compute-intensive tasks into jobs and working them out in a separate JVM. Persist your jobs in a datastore/JMS so that they do not get lost. Be careful if you are doing DB updates from those jobs, to avoid any locking.

If I understand correctly, it seems you need a load balancer.
Have a load balancer route to one of multiple instances of your webservice/compute engine. You can achieve this using an ESB, a routing engine, clustering, master-slave, a distributed cache, etc., most of which are interrelated.
And you can also spin up additional nodes in real time on EC2 based on load.
Else, if the task can be broken, then delegate it to multiple nodes/services. You will need some orchestration mechanism.
There are open source solutions that can address 1 and 2 above.

Does the backend work synchronously? I mean, when a mobile client requests something, does it have to wait for the backend to do a lot of processing?
If yes, you can grow horizontally, adding more worker nodes (backend webapps) behind Nginx or any other balancer. It's the fastest way.
Do you have reusable data? If yes, you can use something like memcached.
Hope it helps, if you give us more information I'm pretty sure that we will provide better advice.

General architecture for a long-running data-processing system in Java?

I've been asked to port a legacy data processing application over to Java.
The current version of the system is composed of a number of (badly written) Excel sheets. The sheets implement a big loop: a number of data sources are polled. These sources are a mixture of CSV and XML-based web services.
The process is conceptually simple:
It's stateless, which means the calculations that run are purely dependent on the inputs. The results of the calculations are published (currently by writing a number of CSV files to some standard locations on the network).
Having published the results the polling cycle begins again.
The process will not need an admin GUI; however, it would be neat if I could implement some kind of web-based control panel. It would be nothing pretty and purely for internal use. The control panel would do little more than display stats about the source feeds and possibly force a refresh of the input feeds in the event of a problem. This component is purely optional in the first delivery round.
A critical feature of this system will be fault-tolerance. Some of the input feeds are notoriously buggy. I'd like my system to be able to recover in the event that some of the inputs are broken. In this case it would not be possible to update the output - I'd like it to keep polling until the system is resolved, possibly generating some XMPP messages to indicate the status of the system. Overall the system should work without intervention for long periods of time.
Users currently have a custom-client which polls the CSV files which (hopefully) will not need to be re-written. If I can do this job properly then they will not notice that the engine that runs this system has been re-implemented.
I'm not a Java developer (I mainly do Python), but the JVM is a requirement in this case. My manager has given me generous time to learn.
What I want to know is how to begin architecting this kind of project. I'd like to make use of frameworks and good patterns where possible. Are there any big building blocks that might help me get a good-quality system running faster?
UPDATE0: Nobody mentioned Spring yet - Does this framework have a role to play in this kind of application?

You can use lots of big complex frameworks to "help" you do this. Learning these can be CV++.
In your case I would suggest you try making the system as simple as possible. It will perform better and be easier to maintain (it's also more likely to work).
So I would take each of the requirements and ask yourself; How simple can I make this? This is not about being lazy (you have to think harder) but good practice IMHO.

1) Write the code that processes the files. Keep it simple, one class per task; you might find the Apache CSV and Apache Commons libraries useful.
2) Then look at Java thread pools to create a separate process runner that runs those classes as separate tasks; if they error, it can restart them.
3) The best approach to startup depends on the platform, but I'll assume your mention of Excel indicates it's a Windows PC. The simplest solution would therefore be to run the process runner from a Windows->Startup menu item. A slightly better solution would be to use a Windows service wrapper. Alternatively you could run this under something like Apache ACD.
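
Step 2 could be sketched roughly as follows, assuming each processing task is a Callable and that "restart" simply means resubmitting the task to the pool a bounded number of times. RetryingTaskRunner and runWithRetry are hypothetical names for illustration, not part of any library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical process runner: runs a task on a thread pool and restarts it on failure.
class RetryingTaskRunner {

    // Submits `task` to `pool`; if it throws, resubmits it, up to `maxAttempts` times.
    // Returns the first successful result or rethrows the last failure.
    static <T> T runWithRetry(ExecutorService pool, Callable<T> task, int maxAttempts)
            throws InterruptedException, ExecutionException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ExecutionException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = pool.submit(task);
            try {
                return future.get(); // blocks until this attempt finishes
            } catch (ExecutionException e) {
                lastFailure = e;
                System.err.println("attempt " + attempt + " failed: " + e.getCause());
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            String result = runWithRetry(pool, () -> "feed parsed", 3);
            System.out.println(result);
        } finally {
            pool.shutdown();
        }
    }
}
```

For a buggy input feed you would probably also want a delay between attempts so that a broken source is not hammered; that is left out here to keep the sketch short.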

There is a tool in the Java ecosystem which solves (almost) all integration problems.
It is called Apache Camel (http://camel.apache.org/). It relies on a concept of Consumers and Producers with Enterprise Integration Patterns in between. It provides fault tolerance and configurable concurrent processing. There is support for periodic polling. It has components for XML, CSV and XMPP. It is easy to define time-triggered background jobs and to integrate with any messaging system you like for job queuing.
If you were writing such a system from scratch it would take weeks and weeks, and you would still probably miss some of the error conditions.

Have a look at Pentaho ETL tool or Talend OpenStudio.
These tools provide access to files, databases and so on. You can write your own plugin or adapter if you need it. Talend creates Java code which you can compile and run.

java API or framework for queue processing

I need an open-source Java API or framework for processing items in a queue. I can develop something myself, but I do not want to re-invent the wheel (and I don't have much experience in multi-threading). Is there such a thing?
The closest solution that I can think of is a business process management (BPM) solution.
Right now, I am using multiple Quartz jobs to process the items in my queue. It is not really working out because of scalability and concurrency issues.

Sounds like you'd want to use an Executor
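
A minimal sketch of what that could look like, assuming the queue items are already available as a list and each item can be handled independently. QueueProcessor and processAll are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: processes queue items on a fixed-size thread pool.
class QueueProcessor {

    // Applies `handler` to every item using `threads` worker threads and
    // returns the results in the same order as the input items.
    static <I, O> List<O> processAll(List<I> items, Function<I, O> handler, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I item : items) {
                Callable<O> job = () -> handler.apply(item);
                futures.add(pool.submit(job)); // the pool's internal queue holds pending jobs
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get()); // waits for the job; rethrows its failure if any
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> lengths = processAll(Arrays.asList("a", "bb", "ccc"), String::length, 2);
        System.out.println(lengths);
    }
}
```

The executor takes care of the thread management and the hand-off between producer and workers, which is the part that is easy to get wrong when written by hand.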

A queue of what sort? How many items? Is Quartz not working out because it's too big or too small?
I'd give some serious thought to using message queues in something like OpenMQ.

You can use JMS with ActiveMQ to create an optimized queue system as well as an ESB. And if you want to manage a workflow-based system, then tpdi is right: use JBoss jBPM.
You can process JMS messages with ThreadPool also. In this case, you can use Executors.

Would the actor model fit your process? It's based around the idea of asynchronously passing messages between other actors. So you can set up a simple state machine to model your process and have all the transitions handled concurrently.
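
To make the idea concrete, here is a very small sketch in plain Java, not tied to any actor library: each actor owns its state and handles messages one at a time from a private mailbox (a single-threaded executor here), so many actors can make progress concurrently without locking. MiniActor, tell and stopAndGetState are names invented for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiFunction;

// Hypothetical minimal actor: a state plus a transition function applied to
// each incoming message, one message at a time.
class MiniActor<S, M> {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final BiFunction<S, M, S> transition;
    private volatile S state;

    MiniActor(S initialState, BiFunction<S, M, S> transition) {
        this.state = initialState;
        this.transition = transition;
    }

    // Asynchronously delivers a message; returns immediately.
    void tell(M message) {
        mailbox.execute(() -> state = transition.apply(state, message));
    }

    // Stops accepting messages, waits for the mailbox to drain, and returns the final state.
    S stopAndGetState(long timeoutSeconds) throws InterruptedException {
        mailbox.shutdown();
        if (!mailbox.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
            throw new IllegalStateException("mailbox did not drain in time");
        }
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        // A queue item modelled as a tiny state machine: NEW -> RUNNING -> DONE.
        MiniActor<String, String> item = new MiniActor<>("NEW", (s, m) ->
                s.equals("NEW") && m.equals("start") ? "RUNNING"
                : s.equals("RUNNING") && m.equals("finish") ? "DONE"
                : s);
        item.tell("start");
        item.tell("finish");
        System.out.println(item.stopAndGetState(5));
    }
}
```

One thread per actor does not scale to many actors; a dedicated actor library would schedule mailboxes on a shared pool, which this sketch deliberately skips.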

You need to determine whether the problem is in the framework you are using or in your code. I suggest you measure how fast your application is running and how fast your framework will go if it's not doing anything at all (just passing trivial tasks around). You should be able to perform between 100K and 1 million tasks per second using an in-process framework. Even using JMS you should be able to achieve 10K messages per second. If you need to do closer to 10 million tasks per second, I suggest you try grouping your tasks together so each task does more work.
I would be very surprised if your framework was the bottleneck; if it is, I would suggest using an Executor.
If the framework isn't the cause of your scalability and concurrency issues (which is more likely), you need to restructure your code so it can run for longer periods of time without interdependencies, i.e. you have to fix your code; a framework won't do that for you.
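
The "group your tasks together" suggestion can be illustrated with a small helper that splits a list of items into batches, so that each submitted task handles a whole batch instead of a single item. TaskBatcher and partition are made-up names, and the right batch size is something you would have to measure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups items into batches so each task does more work.
class TaskBatcher {

    // Splits `items` into consecutive batches of at most `batchSize` elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end))); // copy, not a view
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = partition(Arrays.asList(1, 2, 3, 4, 5), 2);
        System.out.println(batches); // prints [[1, 2], [3, 4], [5]]
    }
}
```

Each batch can then be submitted to an Executor as one task, which cuts the per-task hand-off overhead by roughly the batch size.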

I know it is 5 years late, but this might help someone else that has been driven into this question.
Nowadays, there is http://queues.io and it contains a whole lot of queuing (and messaging) frameworks...

JMS alternative? something for decoupling sending emails from http reqs

We have a web application that does various things and sometimes emails users depending on a given action. I want to decouple the HTTP request threads from actually sending the email in case there is some trouble with the SMTP server or a backlog. In the past I've used JMS for this and had no problem with it. However, for the web app we're doing at the moment JMS just feels like a bit of overkill (in terms of setup etc.), and I was wondering what other alternatives are out there.
Ideally I'd just like something that I can run in-process (JVM/Tomcat), but where any pending items in the queue would be swapped to disk/db when the servlet context is unloaded. I could of course just code something together involving an in-memory queue, but I'm looking to gain the benefit of open-source projects, so I'm wondering what's out there, if anything.
If JMS really is the answer, does anyone know of something that could fit our simple requirements?
thanks

I'm using JMS for something similar. Our reasons for using JMS:
We already had a JMS server for something else (so it was just adding a new queue)
We wanted our application to be decoupled from the processing, so errors on either side would stay on their side
The app could drop the message in a queue, commit, and go on. No need to worry about how to persist the messages, how to start over after a crash, etc. JMS does all that for you.

I would think spring integration would work in this case as well.
http://www.springsource.org/spring-integration

Wow, this issue comes up a lot. CommonJ WorkManager is what you are looking for. A Tomcat implementation can be found here. It allows you to safely create threads in a Java EE environment but is much lighter weight than using JMS (which will obviously work as well).

Beyond JMS, for short messages you could also use Amazon Simple Queue Service (SQS).
While you might think it an overkill too, consider the fact there's minimal maintenance required, scales nicely, has ultra-high availability, and doesn't cost all that much.
There is no cost for creating new queues etc., or for having an account. As far as I recall, pricing is purely based on the number of operations you do (sending messages, polling/retrieving).
The main limitation really is the message size (there are others, like ordering not being guaranteed due to its distributed nature), but that might work as is. Or, for larger messages, use the related AWS service, S3, to store the actual body and just pass headers through SQS.

You could use a scheduler. Have a look at Quartz.
The idea is that you schedule a job to start at regular intervals. All requests need to be persisted somewhere. The scheduled job will read them and process them. You need to define the interval between two subsequent jobs to fit your needs.
This is the recommended way of doing things. Full-fledged application servers offer Java EE Timers for this, but these aren't available in Tomcat. Quartz is fine though, and you avoid starting your own threads, which can cause a mess in some situations (e.g. during application updates).

I agree that JMS is overkill for this.
You can just send the e-mail in a separate thread (i.e. separate from the request handling thread). The only thing to be careful about is that if your app gets any kind of traffic at all, you may want to use a thread pool to avoid resource depletion issues. The java.util.concurrent package has some nice stuff for thread pools.
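
A rough sketch of that approach, assuming a bounded pool and a bounded backlog: AsyncMailer and MailSender are hypothetical names, and the actual SMTP call is hidden behind the MailSender interface. Note that the backlog lives only in memory, so pending mail is lost if the JVM stops, which the question wanted to avoid.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: sends mail off the request thread using a bounded thread pool.
class AsyncMailer {

    // Stand-in for whatever actually talks to the SMTP server.
    interface MailSender {
        void send(String recipient, String body) throws Exception;
    }

    private final ThreadPoolExecutor pool;
    private final MailSender sender;

    AsyncMailer(MailSender sender, int threads, int queueCapacity) {
        this.sender = sender;
        // Fixed number of threads plus a bounded backlog, so a slow SMTP server
        // cannot exhaust memory or threads.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
    }

    // Returns immediately. False means the backlog is full and the mail was not queued.
    boolean enqueue(String recipient, String body) {
        try {
            pool.execute(() -> {
                try {
                    sender.send(recipient, body);
                } catch (Exception e) {
                    System.err.println("mail to " + recipient + " failed: " + e);
                }
            });
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Stops accepting mail and waits for queued mail to be sent.
    // Returns false if the timeout elapsed first.
    boolean shutdown(long timeoutSeconds) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncMailer mailer = new AsyncMailer(
                (to, body) -> System.out.println("sending to " + to), 2, 100);
        mailer.enqueue("user@example.com", "Welcome!");
        mailer.shutdown(5);
    }
}
```

In a web app, shutdown would typically be called from a ServletContextListener when the context is destroyed, so that queued mail gets a chance to go out.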

Since you say the app "sometimes" emails users it doesn't sound like you're talking about a high volume of mail. A quick and dirty solution would be to just Runtime.getRuntime().exec():
sendmail recipient@domain.com
and dump the message into the resulting Process's getOutputStream(). After that it's sendmail's problem.
Figure a minute to see if you have sendmail available on the server, about fifteen minutes to throw together a test if you do, and nothing to install assuming you found sendmail. A few more minutes to construct the email headers properly (easy - here are some examples) and you're done.
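
Something along these lines, as a sketch only: it uses ProcessBuilder rather than Runtime.exec() (both start an external process), assumes a sendmail binary is on the PATH, and assumes the header values contain no line breaks. SendmailPipe, buildMessage and pipeTo are names made up for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: hands a message to an external command such as sendmail.
class SendmailPipe {

    // Builds a minimal message: headers, a blank line, then the body.
    // Assumes the arguments contain no line breaks (otherwise headers could be injected).
    static String buildMessage(String from, String to, String subject, String body) {
        return "From: " + from + "\n"
                + "To: " + to + "\n"
                + "Subject: " + subject + "\n"
                + "\n"
                + body + "\n";
    }

    // Starts `command`, writes `message` to its standard input, and returns its exit code.
    static int pipeTo(String[] command, String message)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Let the child write to our stdout/stderr so it never blocks on a full pipe.
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(message.getBytes(StandardCharsets.UTF_8));
        } // closing stdin signals end of message
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        String message = buildMessage("app@example.com", "user@example.com",
                "Hello", "This is a test.");
        // Assumption: `sendmail` is installed and accepts the recipient as an argument.
        int exitCode = pipeTo(new String[] {"sendmail", "user@example.com"}, message);
        System.out.println("sendmail exited with " + exitCode);
    }
}
```

Since waitFor() blocks until sendmail exits, you would still want to call this from a background thread rather than from the request thread.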
Hope this helps...
