General architecture for a long-running data-processing system in Java?

General architecture for a long-running data-processing system in Java? - java

I've been asked to port a legacy data processing application over to Java.
The current version of the system is composed of a nubmer of (badly written) Excel sheets. The sheets implement a big loop: A number of data-sources are polled. These source are a mixture of CSV and XML-based web-servics.
The process is conceptually simple:
It's stateless, that means the calculations which run are purely dependant on the inputs. The results from the calculations are published (currently by writing a number of CSV files in some standard locations on the network).
Having published the results the polling cycle begins again.
The process will not need an admin GUI, however it would be neat if I could implemnt some kind of web-based control panel. It would be nothing pretty and purely for internal use. The control panel would do little more than dispay stats about the source feeds and possibly force refresh the input feeds in the event of a problem. This component is purely optional in the first delivery round.
A critical feature of this system will be fault-tolerance. Some of the input feeds are notoriously buggy. I'd like my system to be able to recover in the event that some of the inputs are broken. In this case it would not be possible to update the output - I'd like it to keep polling until the system is resolved, possibly generating some XMPP messages to indicate the status of the system. Overall the system should work without intervention for long periods of time.
Users currently have a custom-client which polls the CSV files which (hopefully) will not need to be re-written. If I can do this job properly then they will not notice that the engine that runs this system has been re-implemented.
I'm not a java devloper (I mainly do Python), but JVM is the requirement in this case. The manager has given me generous time to learn.
What I want to know is how to begin architecting this kind of project. I'd like to make use of frameworks & good patterns possible. Are there any big building-blocks that might help me get a good quality system running faster?
UPDATE0: Nobody mentioned Spring yet - Does this framework have a role to play in this kind of application?

You can use lots of big complex frameworks to "help" you do this. Learning these can be CV++.
In your case I would suggest you try making the system as simple as possible. It will perform better and be easier to maintain (its also more likely to work)
So I would take each of the requirements and ask yourself; How simple can I make this? This is not about being lazy (you have to think harder) but good practice IMHO.

1) Write the code that processes the files, keep it simple one class per task, you might find the Apache CSV and Apache Commons useful.
2) Then look at Java Thread Pools to create a sperate process runner for those classes as seperate tasks, if they error it can restart them.
3) The best approach to start up depends on platform, but I'll assume your mention of Excel indicates it's windows PC. The simplest solution would therefore be to run the process runner from Windows->Startup menu item. A slightly better solution would be to use a windows service wrapper Alternatively you could run this under something like Apache ACD

There is a tool in Java ecosystem, which solves all (almost) integration problems.
It is called Apache Camel (http://camel.apache.org/). It relies on a concept of Consumers and Producers and Enterprise Integration Patterns in between. It provides fault-tolerance and concurrent processing configuration capabilities. There is a support for periodical polling. It has components for XML, CSV and XMPP. It is easy to define time-triggered background jobs and integrate with any messaging system you like for job queuing.
If you would be writing such system from scratch it would takes weeks and weeks and still you would probably miss some of the error conditions.

Have a look at Pentaho ETL tool or Talend OpenStudio.
This tools provide access to files, databases and so on. You can write your own plugin or adapter if you need it. Talend creates Java code which you can compile and run.

Related

How to effectively manage a bunch of jar files and their plumbing?

This is a rather high-level question so apologies if it's off-topic. I'm new to the enterprise Java world.
Suppose I have written some individual Java packages that do things like parse data feeds and store the parsed information to a queue. Another package might read from that queue and ingest those entries into a rules engine package. Tripped alerts get fed into another queue, which is polled by an alerting service (assume it's written in Python) that reads from the queue and issues emails.
As it stands I have to manually run each jar file and stick it in the background. While I could probably daemonize some or all of these services for resiliency or write some kind of service manager to do the same, this strikes me as being very amateur. Especially since I'd have to start a dozen services for this single workflow at boot.
I feel like I'm missing something, but I don't know what I don't know. Short of writing one giant, monolithic application, what should I be looking into to help me manage all these discrete components and be able to (conceptually) deliver a holistic application? I'd like to end up with some sort of hypervisor where I can click one button, it starts/stops all the above services, provides me some visibility into their status and makes sure the services are running when they should.
Is this where frameworks come into play? I see a number of them but don't know if that's just overkill, especially if I'm not actively developing a solution for that framework.

It seems you architected a system with a lot of components, and then after some time you decided to aggregate some of them because they happen to share the same programming language: Java. So, first a warning: this is not the best way to wire components together.
Also, it seems you don't know Java very well because you mix terms like package, jar and executable that are totally unrelated and distinct concepts.
However, let's assume that the current state of the art is the best possible and is immutable. Your current requirement is building a graphical interface (I guess HTTP/HTML based) to manage all the distinct components of the system written in Java. I suggest you use a single JVM, writing your components as EJB (essentially a start(), stop() and a method to query the component state that returns a custom object), and finally wire everything up with the Spring framework, that has a nice annotation-driven configuration for #Bean's.
SpringBoot also has an actuator package that simplify exposing objects. You may also find it useful to register your beans as Managed beans, and using the Hawtio framework to administer them (via a Jolokia agent).

I am not sure if you're actually using J2EE (i.e. Java Enterprise Edition). It is possible to write enterprise software also in J2SE. J2SE is not having too much available off the shelf for this, but in contrast has a lot of micro-frameworks such as Ninja, or full stack frameworks such as Play framework which work quite well, much easier to program, and performs much better than J2EE.
If you're not using J2EE, then you can go as simple as:
make one new Java project
add all the jars as dependency to that project (see the comment on Maven above by NimChimpsky)
start the classes in the jars by simply calling their constructor
This is quite a naive approach, but can serve you at this point. Of course, if you're aiming for a scalable platform, there is a lot more you need to learn first. For scalability, I suggest the Play! framework as a good start. Alternatively you can use Vert.x which has its own message queue implementation as well as support for high performance distributed caches.
The standard J2EE approach is doable (and considered "de-facto" in many oldschool enterprises) but has fundamental -flaws- or "differences" which makes a very steep learning curve and a very much non-scalable application.

It seems like you're writing your application in a microservice architecture.
You need an orchestrator.
If you are running everything in a single machine, a simple orchestrator that you probably is already running is systemd. You write systemd service description, and systemd will maintain your services according to your services description. You can specify the order the services should be brought up based on dependencies between services, restart policy if your service goes down unexpectedly, logging for stdout/stderr, etc. Note that this is the same systemd that runs the startup sequence of most modern Linux distros.
If you're running multiple machines, you can still keep using single machine orchestrator like systemd, but usually the requirement for the orchestrator will also become more complex. With multiple machines, you now have to take into account things like moving services between machines, phased roll out, etc. For these setups, there are software that adapts systemd for multi machine orchestration, like CoreOS's fleetd; and there are also standalone multi machine orchestrator like Kubernetes. Both uses docker as application container mechanism.
None of what I've described here is Java specific, which means you can use the same orchestration for Java as you used for Python or other languages or architecture.

You have to choose, As Raffaele suggested you can choose to write all your requirements into one app/service. Seems like a possible mission, using java Ejb's or using spring integration - ampqTemplate ( can write to a queue with ampqTemplate and receive the message with a dedicated listener (example).
Or choosing implementation with microservices architecture. write a service that will push to the queue another one that will contain the listener etc. a task that can be done easily with spring boot.
"One button to control them all" - in the case of a monolithic app - it's easy.
In case that you choose microservices architecture. It depends what are you needs. if its just the "start" "stop" operation I guess that that start and stop of your tomcat/other server will do. For other metrics, there is a variety of solutions. again, it depends on your needs.

Divide process workflow between remote workers

I need to develop a Java platform to download and process information from Twitter. The basic idea is to have a centralized controller to generate tasks (id and keywords basically) and send this tasks to remote workers (one per computer). I need to receive an status report periodically to know about the status of both, the task and the worker. I'll have at least 60 workers (ten times more in a near future).
My initial idea was to use RMI but I need to communicate in both directions and I don't feel comfortable with RMI. The other approach was to use SSLSockets to send serialized objects but I would have to control a lot of errors and add a lot of code to monitor tasks and workers. Some people told me about use a framework like Spring Batch, Gigaspaces or Quartz.
What do you think would be the best option for this project? By the time being I've read a lot of good things about Gigaspaces but I don't find a good tutorial about how to implement it and Quartz seems promising. What do you think? Is it worth using any of them?

It's not easy to tell you to go for a technology based on your question. GigaSpaces is certainly up to the job but so is Spring Batch. Quartz is just the scheduling part of your question and not so much the remoting and the distribution of workload.
GigaSpaces is a fully fledged application platform to handle scenario's where parallelism, high throughput and scalability is a factor. Spring Batch can definitely also do the job, but unlike GigaSpaces, it is not an application platform. So you would still need to deploy your application somewhere.
However, GigaSpaces is a commericial product (free version available) but there are other frameworks that can help you such as Storm Project (http://storm-project.net/) and Hazelcast (www.hazelcast.com) also come to mind.
So without clarifying your use case it's hard to give a single answer. It all depends on what exactly you want and how you want to use it, now and in the future.

Logging Framework considerations

I am designing a server for logging. The business logic for this application is written in multiple languages (C++ & Java for now, but other languages might be added to the mix at a later stage.
I am considering making this a separate server with a well defined interface to ensure that I need not port this for other languages at a later date. For scalability, the main application has the ability to run multiple instances on multiple machines supported by load balancers.
One of the important considerations for the design (other than the usuals like the logging level) is performance and support for multiple logging targets (flat file, console, DB(?) etc) .
How do I ensure that the logger is not impacting the performance of the application? Would communicating using a socket make sense? Is there a better way to do this?

Is there a need to have all your logs shared? I would use whatever logging mechanism is best for each stage of the logic (log4j or java's logging in java, and I guess I don't know C++'s logging libraries enough to suggest one.)
For the most part, logs should only be used for debugging and outside-the-app parsing. I would not recommend integrating logging as part of your business logic. If you really need data in the logs, you'll be much better off making a direct communication rather than spitting out the log to have it slurped in by another application.
If you absolutely need it, you can have an external (very low priority) application that feeds off the logs and sends them back to a centralized logging server.

There is data you need to see in near real time and data which needs to be recorded for offline processing. They have different requirements.
real time data needs to be in a machine readable format and is usually directed to the places where it is used. The central logger can be on this path provided it doesn't delay the real time information unacceptably. For this I would use a sockets (or JMS) rather than a buffered file.
offline processing logs can be machine readable format (for over night reports) or be human readable (for debugging) For this I would use a file or a database or both. File can be simpler to manage, esp if that are large. A database makes building reports easier.
In either case I would pass the information which needs to be send via socket or written to a file, to another thread so that any random delays in the system do not impact the code which is producing the log. In fact, I would consider delaying send any logs until whatever the critical process is complete. i.e. You process everything which needs to be done first, then log everything of interest later.

Check this:
http://logging.apache.org/log4j/1.2/manual.html
Take a look at the performance section. It will address your concerns as far as the logging overhead in your application is concerned.
As far as supporting multiple logging targets this is easily achievable with log4j but you need to delve into some details (refer to the URL I posted you).
In general, from my experience log4j is excellent. I have generated thousands of static & dynamic logs of "considerable size" ( for my application - this term may be interpreted differently for your application ) without any problem despite the heavy processing I perform (for history I am evaluating/simulating a distributed P2P algorithm in a local PC and all is going well despite creating hundred of logger instances for the simulation ).

Is Hadoop right for running my simulations?

have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations, across multiple input parameter configurations, and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.
I've been reading about Hadoop and wondering if it can help me running lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
The thing I'm worried about is HDFS, which seems to meant for huge files, not a smattering of small CSV files, (none of which would big enough to even make up the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
Is Hadoop the wrong tool for me?

I see a number of answers here that basically are saying, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for construction of a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here's the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):
Access: I can lease Hadoop clusters through either Amazon Elastic Map Reduce and I don't have to invest any time and energy into the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node gets run on another node.
Upgradeable: Being a rather generic map reduce engine with a great distributed file system if you later have problems that involve large data if you're used to using Hadoop you don't have to migrate to a new solution. So Hadoop gives you a simulation platform that will also scale to a large data platform for (nearly) free!
Support: Being open source and used by so many companies, the number of resources, both on line and off, for Hadoop are numerous. Many of those resources are written with the assumption of "big data" but they are still useful for learning to think in a map reduce way.
Portability: I have built analysis on top of proprietary engines using proprietary tools which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.

Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.
I like Gearman (http://gearman.org/) for this sort of thing.

While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.

Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.

Simply said, though Hadoop may solve your problem here, its not the right tool for your purpose.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, it should be able to get significant space savings as compared to uncompressed CSVS.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from simulation. Furthermore, the post-processing code is doing more than it should have to do, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (csv files) based on foreign keys (the unique agent IDs). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine getting some subset of the raw data to play with in MatLab or SPSS).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but the paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need (provides two different ways of querying data, either through Python list comprehension or through [in-kernel (C level) searches][5].
The proposed architecture I envision is this:
What I'm not really sure how to do is to link together the python code that will be written for querying, with the Java code that serves up the HDF5 files, and the Java code that does the post processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on same machine as the Datahose so that the large .h5 files do not have to be transferred over the network, but rather the much smaller, filtered views of it would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?

This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another contraint that's significant is your IT infrastructure. Do you have high speed network available storage? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using-application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language agnostic communication method so these components can be changed later if necessarily. This is just as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library, Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if data is predictable and batchable.
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on it's own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes and HDF5 file, they arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate.
What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
How often does it need it.
Approximately how much resources does the application need.
What does the application do now that it doesn't do that well.
What can this application do now that would help, but it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation where you have quite a bit of java post-processing code that is doing "joins" on tables of data in CSV files, thats a "do now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult the solution to implement and the more likely it is to fail. Use the least number operations, use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki-them or file them away so you can whip them out again when the topic come s up.
And the answer to the direct question, "How to get Python and Java to play nice together?" is simply "use a language agnostic communication method." The truth of the matter is that Python and Java are both not important to your describe problem-set. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.

Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".

You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.

Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered pyTables and had a similar idea to yours. I was hoping to change our storage format to hdf5 and then run our summary reports/queries using pytables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using pytables?

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.