Java tools for Concurrent XML Editing

I am building a user interface where users must be able to concurrently make edits on a web page. The standard technique employed in 90+% of our application is optimistic locking; that is, we expect that concurrent edits will only happen very rarely, and if they do, the second committer gets an error and loses their edit.
In limited parts of the application, concurrent edits are expected, and we need to handle them in a better way. Locking to force sequential editing is not an option, as it would require a timeout on the lock for the situation where a user walks away from their screen without explicitly unlocking, which would be very annoying.
If possible, I would like to hold the form data as XML, but to allow concurrent edits to this document to be made.
Some tools I have briefly looked into that may be able to do this, or provide some inspiration in how best to handle it are:
Apache Wave (Google Wave) http://incubator.apache.org/wave/
MobWrite http://code.google.com/p/google-mobwrite/
Does anyone have some more suggestions of toolkits or techniques that could handle this? Thanks for your help.

Two basic algorithms can help you here:
1. Operational transformation, which is behind Apache Wave.
The main idea is to capture elementary editing events and merge concurrent event streams from different users. Although simple at first glance, this approach may require a complex implementation:
Robust cross-browser capture and decomposition of all editing events can be tricky to implement.
There's no easy way to restore the synchronized state of several communicating agents in case of network errors or timeouts.
2. Differential sync, which is behind Google Mobwrite.
This one is based on computing a diff between the edited copy and some base state, the so-called "shadow". Concurrent diffs are then merged. Because a diff can be computed between any two arbitrary states of the document, there is no problem recovering from the loss of a sync packet. The following tech talk recording explains this technique in detail; a minimal sketch of one sync cycle is shown below.
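For illustration, here is a minimal sketch of one differential-sync cycle in Java, assuming the Java port of the google-diff-match-patch library. The class name and fields are made up, the single-shadow simplification is mine, and the version numbering of the full algorithm (which is what makes packet loss recoverable) is omitted:

```java
import name.fraser.neil.plaintext.diff_match_patch;

import java.util.LinkedList;

public class DiffSyncDocument {

    private final diff_match_patch dmp = new diff_match_patch();

    private String serverText;  // authoritative server copy (may receive concurrent edits)
    private String shadow;      // what this particular client last saw

    public DiffSyncDocument(String initial) {
        this.serverText = initial;
        this.shadow = initial;
    }

    /**
     * One half-duplex sync cycle for a single client. Assumes the returned patch
     * is delivered and applied; the real algorithm adds version numbers to cope
     * with lost or duplicated packets.
     */
    public synchronized LinkedList<diff_match_patch.Patch> sync(String clientText) {
        // 1. What did this client change since its shadow? Fold that into the server copy.
        LinkedList<diff_match_patch.Patch> clientEdits = dmp.patch_make(shadow, clientText);
        serverText = (String) dmp.patch_apply(clientEdits, serverText)[0];

        // 2. What does the client still not know about (e.g. other users' edits)?
        shadow = clientText;
        LinkedList<diff_match_patch.Patch> serverEdits = dmp.patch_make(shadow, serverText);

        // 3. Assume the client applies these edits, so advance the shadow to match.
        shadow = serverText;
        return serverEdits;
    }
}
```

In the full algorithm each client keeps its own shadow on the server, and patches travel over the network tagged with version numbers so that a lost sync packet simply produces a larger diff on the next cycle.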
Facing a requirement similar to yours, I based my solution on the second approach due to its simplicity and robustness. The original MobWrite implementation uses a Python backend, so I reimplemented it in Java. You can find a working prototype here; it's a simple web-based collaborative editor I made as a proof of concept before adopting the same approach in our proprietary software.

Related

Solution to provide shared entities between multiple Java processes

I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows, I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.
If you only want to exchange objects between Java applications, you could also look into tuple spaces. There is a specific implementation of spaces for Java, JavaSpaces, which should be able to do what you need. I'm not sure it can keep up with the performance requirement, though. I'm also not sure how widely this technology is still being used, since it only supports Java and isn't as flexible as NoSQL stores would be these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis; it has notifications and can certainly scale to the requirements you are looking for.
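As a rough illustration of what that could look like from Java, here is a minimal sketch assuming the Jedis client; the key layout ("flow:<id>") and the field names are made up for this example, not prescribed by Redis:

```java
import redis.clients.jedis.Jedis;

public class FlowStore {
    public static void main(String[] args) {
        // One Redis instance on localhost shared by all JVMs (assumed setup).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Any process can upsert its piece of a flow as a hash field...
            jedis.hset("flow:42", "part1", "payload written by process A");
            jedis.hset("flow:42", "part2", "payload written by process B");

            // ...and any other process can read the assembled flow back.
            System.out.println(jedis.hgetAll("flow:42"));
        }
    }
}
```

Keyspace notifications or pub/sub could then tell interested processes that a flow changed; whether Redis reaches more than 100k mixed operations per second on your hardware is something you would have to benchmark.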
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view I would say distributed caches are the replacement for that nowadays. In particular, take a look at Hazelcast and Infinispan.
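If you go the distributed-cache route, an embedded shared map is only a few lines. Here is a minimal Hazelcast sketch; the map name and payload are made up, and the IMap import location varies between Hazelcast versions:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class SharedFlows {
    public static void main(String[] args) {
        // Each JVM that runs this joins the same cluster and sees the same map.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> flows = hz.getMap("flows");

        flows.put("flow-42", "partial information about flow 42");
        System.out.println(flows.get("flow-42"));

        hz.shutdown();
    }
}
```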
From the performance viewpoint, I am not happy with the performance of the "big" distributed caching solutions when only a single in-memory cache is needed; see my write-up on the cache2k benchmarks page (Hazelcast still needs to be added there).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category. You are not describing the actual problem, and your question just boils down to a "fast, reliable, distributed objects" solution. What kind of data comes in? What is the rate? How is it accessed? What consistency guarantees need to be met, given that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.

Real life experience with the Axon Framework [closed]

As part of researching CQRS for use with a project, I ran across the Axon Framework, and I was wondering if anyone has any real life experience with it. Just to be clear, I'm asking about the framework, not CQRS as an architectural pattern.
My project already uses Spring and Spring Integration, which fits nicely with Axon's own requirements, but before I dedicate a lot of time to it, I would like to know if anyone has some first-hand experience. In particular, I'm interested in possible pitfalls that are not immediately apparent from the documentation.
"The framework relies heavily on eventsourcing, which means that all state changes are written to the data store as events."
This is completely untrue; it does not rely heavily on event sourcing. One of the implementations for storing the aggregate in this framework uses event sourcing, but you can just as easily use the provided classes for a standard relational model.
It is just better with event-sourcing.
"So you have a historical reference of all your data. This is nice but makes changing your domain after you've gone in production a very daunting proposition, especially if you sold the customer on the system's 'strong auditability'."
I don't think it is a lot easier with a standard relational model that only stores the current state.
"The framework encourages denormalizing your data, to the point that some have suggested having a table per view in the application. This makes your application extremely difficult to maintain, especially when the original developers are gone."
This is unrelated to the framework but to the architectural pattern in use (CQRS).
And sorry to mention it, but having one denormalizer/view is a good idea, as it stays a simple object.
So maintenance is easy, because the SQL queries and insertions are also easy.
So this argument is not very strong.
How about a view which uses a 1000-table model with inner joins everywhere and complex SQL queries?
Again, CQRS helps because, basically, the view data is just a SELECT * from the table which corresponds to the view.
"if somehow you made a mistake in one of the eventhandlers, your only option is to 'replay' the eventlog, which depending on the size of your data can take a very long time. The tooling for this however is non-existent."
I agree that there is currently a lack of tooling to replay events and that this can take a long time. However, it is theoretically possible to replay only a portion of the events rather than the entire content of the event store.
"Replaying can have side effects, so developers become scared of doing it."
Replaying events has side effects -> that's untrue. For me, side effects mean modifying the state of the system. In an event-sourced CQRS application, the state is stored in the event store. Replaying the events does not modify the event store.
You can have side effects on the query side of the model, yes. But you don't care if you have made a mistake, because you are still able to correct it and replay the events once again.
"it's extremely easy to have developers mess up using this framework. if they don't store changes to domain objects in events, next time you replay your events you are in for a surprise."
Well, if you misuse and misunderstand the architecture, the concepts, etc., then OK, I agree with you. But perhaps the problem is not the framework here.
"Should you store deltas? absolute values? if you don't keep tabs on your developers you are bound to end up with both and you will be f***ed"
I could say that of every system; it is not directly related to the framework itself. It's like saying, "Java is crap because you can mess everything up if someone codes a bad implementation of the hashCode and equals methods."
And as for the last part of your comment, I have already seen samples like HelloWorld with the Spring framework.
Of course it is completely useless in a simple example.
Be careful in your comment to distinguish between the concept (CQRS + event sourcing) and the framework. Please make that distinction.
Since you have stated that you want to use CQRS for your project (and I assume that the JVM is your target platform) I think Axon Framework is an excellent choice.
I have built a fairly complex trading platform on it (no, the trading sample is not complex) and I have not seen any obvious flaws of the framework.
Since I use EventSourcing, the test fixtures made it very easy to write BDD style "given, when, then" tests. This lets you treat an aggregate as a black box and concentrate on checking that the correct set of events come out when you put in a certain command.
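For a flavour of what those fixture-based tests look like, here is a minimal sketch against the Axon 2-era test API; MyAggregate, CreateOrder and OrderCreated are hypothetical application classes (not shown, and not part of Axon), so treat this as an outline rather than a drop-in test:

```java
import org.axonframework.test.FixtureConfiguration;
import org.axonframework.test.Fixtures;
import org.junit.Before;
import org.junit.Test;

public class MyAggregateTest {

    private FixtureConfiguration<MyAggregate> fixture;

    @Before
    public void setUp() {
        // The fixture wires a command bus, event bus and in-memory event store
        // around the aggregate under test.
        fixture = Fixtures.newGivenWhenThenFixture(MyAggregate.class);
    }

    @Test
    public void creatingAnOrderEmitsOrderCreated() {
        // Black-box style: given past events (none here), when a command arrives,
        // then exactly these events are expected to come out.
        fixture.givenNoPriorActivity()
               .when(new CreateOrder("order-1"))
               .expectEvents(new OrderCreated("order-1"));
    }
}
```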
About pitfalls: before jumping in, make sure
That you have the concepts of CQRS figured out.
Make a list (paper, whiteboard, whatever) of all your aggregates, command handlers, event handlers, sagas, commands and events. This is the hard part of building your system, figuring out what it should do and how. After this, the reference manual should show you how to wire it all together with Axon.
Some non-Axon-specific points:
Being able to rebuild the view store from events is a concept of EventSourcing, and not something that is exclusive to Axon, but I found it pretty easy to create a service that will send me all events from an aggregate type, aggregate id or a certain event type.
Being able to build a new reporting component one year after the project is launched and instantly get reports on data from the time of the project launch and onwards is awesome.
I've been using AxonFramework for more than one year on a complex project developed for a big bank.
The requirements were demanding, customer's expectations were high, and release times narrow.
I chose Axon Framework because, at the project kick-off, it was the most complete and best-documented implementation of CQRS available in Java: well designed, and easy to integrate, test, and extend.
After more than one year I think that these considerations are still valid and current.
Another consideration guided my choice: I wanted the commitment to such a difficult project to become a training opportunity for me and other members of the team.
We started to develop with AxonFramework version 1.0 and moved to version 1.4 as newer versions were released.
Our team experience with CQRS and the implementation provided by the AxonFramework was absolutely positive.
It provided us with a consistent and uniform way to develop each feature, which guided us and put us at ease.
Without it some features of the application would have been much more complicated to develop.
I am referring mainly to the various long-running processes that need to be handled and to the related compensation logic, but also to the many pieces of business logic that were necessary here and there, and that fitted nicely and stayed decoupled in the event-driven architecture promoted by CQRS.
Our choice was to be conservative in the write model, so we preferred a JPA based persistence instead of the event sourced one.
The query model is made up of views. We have tried to make sure that each view contains all the required data from a single page using intermediate views when necessary.
Anyhow, we developed the write model as if we were applying event sourcing, so we took care to modify the state of aggregates exclusively through events. When the customer asked for a cloning function for a very complex aggregate, it was just a matter of replaying the source events (with translated UUIDs) to a brand-new instance; the downside in this case was the event upcasting (but this functionality was greatly improved in the imminent 2.0 version).
As in every project, during development we found a lot of bugs, mainly in our own code, but also in components supposed to be mature and stable, like the application server, the IoC container, the cache, the workflow engine and some of the other libraries easily found in any large J2EE application.
Like any other human product, Axon Framework was not immune to bugs either, but surprisingly for such a young and niche project, they have been few, not critical, and quickly resolved in new releases.
The kind and immediate support provided by the author on the mailing list is another invaluable feature and helped me a lot when I was in trouble.
The application was released in production a year ago and is currently maintained and under active development of new features.
The customer is satisfied and asks for more.
When to use Axon Framework is more a matter of when to use CQRS. For an answer, it's worth going back to the official documentation: http://www.axonframework.org/docs/1.4/introduction.html#d4e51
In our case it was definitely worth it.
The OP specifically asks about the pitfalls relating to the Axon Framework rather than CQRS. This makes the question difficult to answer, as Axon started as a fairly faithful implementation of the ideas in the famous book by Eric Evans.
The main advantage is that it does exactly what it says on the tin: it handles the hard parts of a CQRS based design for you: aggregates, sagas, event sourcing, command handlers, event handlers, BASE consistency etc. When you follow the best practices, you end up with a highly responsive and horizontally scalable application. If you use it with event sourcing, your data is completely auditable, and at least in theory, you can determine the state your application had at any given point in time. Tooling to do this is not provided; you will have to roll your own.
The main developer of the framework is very approachable and extremely knowledgeable on the subject of high performance and scalable computing in java. He tends to answer every question on the mailing list within a few hours. This is both an advantage and the major pitfall: at this time (early 2014), the Axon Framework depends heavily on one person. The rest of the pitfalls I would like to mention are probably more the result of event sourcing than of CQRS or Axon (as of 2018 the framework is supported by the company Axoniq)
Design your data model very carefully upfront. Though it is easy to add to, making fundamental changes to your data model can be very difficult. If you make a fundamental mistake in the data model, your application may not perform well, or even fail to work at all. For example, if you choose a tree-shaped data model, with one long-lived aggregate root at the top, this aggregate may grow very large as it accrues more and more events over time, and it may take a long time to load and store. I don't know what will happen if this goes on until an instance of the aggregate no longer fits in RAM, but I imagine it could be bad. Don't do it that way.
Another pitfall (event sourcing related) is that, after a number of revisions, it can become increasingly difficult to reason about the state of an aggregate, as you sometimes have to keep in mind not only what the code does today, but also what it did in the past. This definitely makes replaying (a portion of) the event store to rebuild a view table a non-trivial task.
Fixing data errors can be more difficult than with a 'traditional' design. Rather than a simple SQL statement, you will often need to make a command to change the state of your application. If the error in your data was caused by a faulty event handler, you can usually just fix the bug, clear the snapshots and let the events for the aggregate be replayed. If your bug caused spurious events to be applied, it can be much more trouble to fix. The faulty events will stay in the event store, and you may have to apply some new ones to restore your data to the correct state, or change the code to ignore or fix their behaviour.
While the framework itself is written decently enough, using it in a real-world project has been nothing short of a nightmare, and the choice of this framework was, IMO, a major contributing factor to this project's failure.
The framework relies heavily on eventsourcing, which means that all state changes are written to the data store as events. So you have a historical reference of all your data. This is nice but makes changing your domain after you've gone in production a very daunting proposition especially if you sold the customer on the system's "strong auditability"
You cannot have ops guys make ad-hoc changes to the database
The framework encourages denormalizing your data, to the point that some have suggested having a table per view in the application. This makes your application extremely difficult to maintain, especially when the original developers are gone
if somehow you made a mistake in one of the eventhandlers, your only option is to "replay" the eventlog, which depending on the size of your data can take a very long time. The tooling for this however is non-existent. Replaying can have side effects, so developers become scared of doing it
it's extremely easy to have developers mess up using this framework. if they don't store changes to domain objects in events, next time you replay your events you are in for a surprise. Should you store deltas? absolute values? if you don't keep tabs on your developers you are bound to end up with both and you will be f***ed
There is practically no adoption of this framework, so googling for answers will not do you any good
Even though the framework does not yet support distribution, it's written with it in mind and the APIs are a pain to work with because of it. Firing off an event is async by default, and if you want to check whether an exception was raised while executing the command, say a duplicate username exception, you need to pass a listener to your command handler which is a future, then wait for the future's result to come in, handle any checked exceptions, InterruptedException etc., and then you can grab the exception that was thrown from the future (see the sketch at the end of this answer). Of course, which exceptions a command can raise is not apparent from the API, defeating the purpose of checked exceptions.
Check out some of the example apps. I somehow need a unit of work listener to create an addressbook application? My goodness...
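To make the complaint above concrete, this is roughly the dispatch-and-wait dance being described, sketched against the Axon 2-era CommandBus and FutureCallback API; the CreateUser command is a hypothetical application class and the error handling is simplified:

```java
import org.axonframework.commandhandling.CommandBus;
import org.axonframework.commandhandling.GenericCommandMessage;
import org.axonframework.commandhandling.callbacks.FutureCallback;

import java.util.concurrent.ExecutionException;

public class CreateUserService {

    /** Hypothetical command object; in a real application this lives in your domain model. */
    public static class CreateUser {
        public final String username;
        public CreateUser(String username) { this.username = username; }
    }

    private final CommandBus commandBus;

    public CreateUserService(CommandBus commandBus) {
        this.commandBus = commandBus;
    }

    public void createUser(String username) throws InterruptedException {
        // Dispatching is asynchronous; to find out whether the handler threw
        // (e.g. a duplicate-username exception) you block on a FutureCallback.
        FutureCallback<Object> callback = new FutureCallback<>();
        commandBus.dispatch(GenericCommandMessage.asCommandMessage(new CreateUser(username)), callback);
        try {
            callback.get();
        } catch (ExecutionException e) {
            // The handler's exception arrives wrapped, and nothing in the API
            // declares which checked exceptions a given command can raise.
            throw new RuntimeException("CreateUser failed", e.getCause());
        }
    }
}
```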
I am currently with a team working on an online casino platform, launching our brand Casumo this summer. The domain and platform are built using Axon Framework, and so far it has served us solidly.
A lot of time has been saved by not having to build all the infrastructure needed for command handling, event routing, event sourcing, snapshotting etc., and the APIs are really nice to work with. The one bug we found in the framework so far was fixed in .. release 12 hours later, and Allard is always quick to take suggestions on new features and to discuss ways to leverage the framework to fulfill your needs.

Is it wise to develop a prototype GUI before designing other part of the system?

I am using Java for this small project. It will be a program with GUI and database connection. Say the database has table A and B, the user can choose which table to interact with. The program then display the contents of, say, table A in the GUI, and allows the user to change the content and submit the changes, or delete, or insert.
I think the GUI should be developed before any back-end development starts. There are a couple of reasons to do this:
You gain clarity on how model objects should interact.
Usability poses lots of restrictions on the way you want to pull data. You will probably want to develop and architect only after you're 100% sure what constraints are there.
From a business point of view, managers like to have a dumb, functional UI before any development starts. Many times, the feedback leads to major changes in back-end assumptions, which is a lot less painful than getting a change request after the back-end development is over.
My personal experience is that simultaneous development of the GUI and back-end is a bit messy. Plus, the GUI provides a solid expectation of behavior from the back-end. Moreover, this approach makes sure all the developers, your client and your manager are on the same page.
I agree with Joel Spolsky that it is a great idea to write a functional spec before writing code. Part of that spec should include a collection of screen mockups. #O.D. is right, Balsamiq is a great tool. It has saved me a lot of time in the past.
Once you have a functional spec in place that the business users are happy with, you will then have a better idea of how to design your system to meet the requirements. e.g. is high performance a requirement, domain model vs simple crud etc.
Then you should start by taking a single use case and building a vertical slice of your application. Build a GUI, service layer, persistence layer, database schema in one iteration. This will hopefully point out any problems with your design and give you the chance to modify it before you start building out the horizontal functionality.
I'd say yes and no.
No, because you should design your application to be modular enough that your logic and data do not depend on the UI design.
Yes, because it is always smart to design everything before you actually start implementing it.
So what I mean is that you should make a concept, but not let your UI concept 'tie your hands' when you implement your logic. That way, if your managers or clients don't like your conceptual UI, you can always change it without actually changing your application logic.
Well, showing your GUI before starting to program is a very good idea, especially since you enable the end user (customer) to check whether the UI is up to their expectations, which can save you lots of time.
In order to do that you don't necessarily need to develop a "real" prototype; you can use programs which let you quickly design the UI of your app, including a minimal workflow simulation instead of full functionality.
I had a very good experience with Balsamiq and can really recommend it.
Writing a spec before your code is always a good idea, because it makes you think. But most specs I have seen are not that good. And if the spec is too technical, users will in the end sign off on your spec without really understanding what they are going to get.
I have seen the best results when either presenting the user manual to the client, or discussing mockups of the system one scenario at a time.
Note that half-baked mockups won't do the trick. You need your mockups to be fully populated with relevant data (Ever tried to discuss some screens with accounting while the numbers on the screen don't match? There's no way at all you could explain to them these are only dummy numbers...)
And the caveat of using mockups is that users will more often than not believe the app is "almost finished", whatever you do or say. It must be some subconscious thing, I'm not sure. But to avoid that, most specialized tools have either only a "black & white" look and feel, or multiple skins you can switch between.
There is a pretty complete list of mockup tools here. Many of them are free:
http://c2.com/cgi/wiki?GuiPrototypingTools
My own tool is pretty popular: http://MockupScreens.com. I created it a long time ago precisely because of my own frustration with the above-mentioned problems.

Database or XML which one is better for Commentary based implementation?

I have a cricket-based website and implemented a commentary feature using the database and PHP, but I really feel it's slow.
So my question is: how has espncricinfo.com implemented their commentary section?
http://www.espncricinfo.com/the-ashes-2010-11/engine/current/match/446963.html
What is the technology behind this: a database, XML files, or something else?
Is a database the right choice for this kind of requirement, given that there are millions of users accessing the system and only a few systems keep updating the ongoing cricket match?
Database and XML aren't opposites - this looks like an ideal application for an XML database!
But whichever technology you use, you will need to pay careful attention to performance at every stage of implementation. Poor performance is rarely the consequence of choosing technology that isn't up to the job, but it's often the consequence of using technology that you don't understand well enough.
It's impossible to say without knowing the exact specifications of your system, including system resources, number of users etc.
Given that you will have very few update sources and many just viewing the data it would certainly be possible to implement this in XML. Depending on how you've built the auto-refresh function of the commentary you might already be using XML to send updates back.
I suspect that whichever system you choose the main performance enhancement you could make would be to ensure that data that is frequently accessed is cached in memory and refreshed only when your data storage is updated.
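To make that concrete, here is a minimal Java sketch of the read-through-and-invalidate idea; the site in the question is PHP, so treat this purely as an illustration, and the class and method names are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommentaryCache {

    private final Map<String, String> htmlByMatchId = new ConcurrentHashMap<>();

    /** Called by the many readers: serve from memory, render only on a cache miss. */
    public String getCommentary(String matchId) {
        return htmlByMatchId.computeIfAbsent(matchId, this::renderFromStorage);
    }

    /** Called by the few writers whenever new commentary is stored. */
    public void onCommentaryUpdated(String matchId) {
        htmlByMatchId.put(matchId, renderFromStorage(matchId));
    }

    private String renderFromStorage(String matchId) {
        // Hypothetical placeholder: read the latest commentary from the database or
        // XML store and render it once, instead of on every page view.
        return "<div>commentary for match " + matchId + "</div>";
    }
}
```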

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has the option of either writing locally to CSV files, or remotely, via a connection to a Java server, to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, we should be able to get significant space savings as compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to do, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine) or to get some subset of the raw data to play with in MATLAB or SPSS.
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but that paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need (it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5]).
The proposed architecture I envision is this:
What I'm not really sure how to do is to link together the python code that will be written for querying, with the Java code that serves up the HDF5 files, and the Java code that does the post processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of them would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed network storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch-processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when it queries the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This is as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the Python standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.) A small Java-side sketch follows.
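As an illustration of the language-agnostic option, here is a rough Java-side sketch assuming the Apache XML-RPC client library talking to a hypothetical Python/PyTables query service; the URL, the method name "query.filter" and its parameters are invented for this example:

```java
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

import java.net.URL;

public class QueryClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the (hypothetical) Python service running next to the HDF5 files.
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://datahose.example:8000/RPC2"));

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Ask for a filtered view of one run instead of shipping the whole .h5 file over the wire.
        Object result = client.execute("query.filter",
                new Object[]{"study-1", "run-7", "energy > 0.5"});
        System.out.println(result);
    }
}
```

The Python side would expose the same method via its standard-library XML-RPC server (SimpleXMLRPCServer / xmlrpc.server), which is what keeps the coupling language-agnostic.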
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on-disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate.
What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
How often does it need it.
Approximately how many resources does the application need.
What does the application do now that it doesn't do that well.
What can this application do now that would help, but it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation where you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "do now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and, before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the smallest number of operations and the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How to get Python and Java to play nice together?", is simply "use a language-agnostic communication method." The truth of the matter is that Python and Java are both unimportant to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
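For what it's worth, a minimal sketch of that suggestion using the standard ProcessBuilder API; the script name, flags and working directory are made up:

```java
import java.io.File;
import java.io.IOException;

public class QueryLauncher {

    /** Fork a Python/PyTables query and return immediately; Java keeps going while it runs. */
    public static Process launchQuery(String h5Path, String whereExpression) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(
                "python", "run_query.py",
                "--input", h5Path,
                "--where", whereExpression,
                "--output", "filtered.csv");
        pb.directory(new File("/opt/sim/postprocessing")); // hypothetical working directory
        pb.redirectErrorStream(true);                       // merge stderr into stdout for logging
        return pb.start();
    }
}
```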
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?
