Remote Procedure Calls - java

I am doing a Software Engineering course in which different teams are building different prototype subsystems of a big system (different subsystems of an F35 Lightning aircraft!).
The problem is that teams can use different programming languages (like C++ and Java) depending on what they are most comfortable with. However, these subsystems need to communicate with each other (for example, the radar needs to provide object coordinates to navigation and control). Hence we need to come up with a solution in which different modules can interact in real time.
Someone suggested XML-RPC, so I have been reading about it. From what I've read, it seems to be used in client-server architectures. Is this a good way of doing this kind of interprocess communication? What are my options?
Any help would be appreciated.
regards,
Newbie

There are a couple of options besides XML-RPC. For a short bullet-point comparison, take a look at:
http://michaeldehaan.net/2008/07/17/xmlrpc-vs-rest-vs-soap-vs-all-your-rpc-options/
If your exchange is more data-oriented, Protocol Buffers might be an alternative.
Protocol Buffers are a way of encoding structured data in an efficient yet extensible format.
Personally, I would go for a lightweight exchange format or method first, since the components are considered prototypes. Something like REST or some custom message passing might be simple enough, yet sufficient.

If you are already familiar with XML, it can be a reasonable answer. An advantage of XML is that you don't have to worry about how different machines represent numbers. A disadvantage is the time it takes to keep converting numbers to text and back to numbers.
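If you do go the XML-RPC route, calling a remote method from Java is fairly compact. Below is a minimal client sketch assuming the Apache XML-RPC library; the endpoint URL and the radar.getCoordinates method name are invented for illustration.

    import java.net.URL;
    import org.apache.xmlrpc.client.XmlRpcClient;
    import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

    public class RadarClient {
        public static void main(String[] args) throws Exception {
            // Point the client at the (hypothetical) radar subsystem endpoint.
            XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
            config.setServerURL(new URL("http://localhost:8080/xmlrpc"));

            XmlRpcClient client = new XmlRpcClient();
            client.setConfig(config);

            // Call a hypothetical remote method; the server could be written
            // in C++, Python, etc. -- XML-RPC hides the language difference.
            Object result = client.execute("radar.getCoordinates",
                    new Object[]{ Integer.valueOf(42) }); // 42 = made-up track ID
            System.out.println("coordinates: " + result);
        }
    }

The same call could be made from any language with an XML-RPC library, which is the main attraction for mixed C++/Java teams.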

Related

socket -V- rest performance

I have done some searching but haven't come up with anything on this topic. I was wondering if anyone has ever compared (to some degree) the performance difference between an RPC over a socket and a REST web service. If both do the same thing, which would have a tendency to be the better performer? I've already started building some socket code and would like to know if REST would give better performance before I progress much further. Any input would be really appreciated. Thanks indeed
RMI:
- Feels like a local API, much like XML-RPC
- Can provide some fairly nice remote exception data
- Java-specific, which causes lock-in and limits your options
- Has horrible versioning problems between different versions of clients
- Skeleton files must be compiled in, like CORBA, which is not very flexible
REST:
- Easy to route around firewalls
- Useful for uploading files, as it can be rather lightweight
- Very simple if you just want to shove simple things at something and get back an integer (like for uploaders)
- Easy to proxy security behind Apache and let it take the heat
- Does not define any standard format for the way the data is being exchanged (could be JSON, YAML 1.0, YAML 2.0, an arbitrary XML format, etc.)
- Does not define any convention about having remote faults sent back to the caller; integer codes are frequently used, but the method of sending back data is not defined. Ideally this would be standardized.
- May require a lot of work on the client-side caller of the library to make use of the data (custom serialization and so forth)
In short, from here:
"Web services do allow a loosely coupled architecture. With RMI, you have to make sure that the objects stay in sync in all applications. RMI works best for smaller applications that are not internet-related and thus not scalable."
It's hard to imagine that REST is faster than a simple socket connection, given that it also goes over a socket.
However, REST may be performant enough, more standard, and easier to use. I would first test whether REST (or one of the many other existing solutions) is fast enough and meets your requirements before attempting your own socket solution.
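If you do want to measure this for your own workload, a quick-and-dirty comparison can be as simple as timing round trips against both transports. A minimal sketch, assuming a hypothetical HTTP echo endpoint on port 8080 and a line-based socket echo server on port 9090:

    import java.io.*;
    import java.net.*;

    public class RoundTripTimer {
        static final int CALLS = 1000;

        public static void main(String[] args) throws Exception {
            // REST-ish: one HTTP request per call (hypothetical echo endpoint).
            long t0 = System.nanoTime();
            for (int i = 0; i < CALLS; i++) {
                URL url = new URL("http://localhost:8080/echo?msg=ping");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream()))) {
                    in.readLine(); // consume the echoed response
                }
            }
            System.out.printf("HTTP:   %.1f ms for %d calls%n",
                    (System.nanoTime() - t0) / 1e6, CALLS);

            // Raw socket: one persistent connection, line-based echo protocol.
            long t1 = System.nanoTime();
            try (Socket s = new Socket("localhost", 9090);
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                for (int i = 0; i < CALLS; i++) {
                    out.println("ping");
                    in.readLine();
                }
            }
            System.out.printf("socket: %.1f ms for %d calls%n",
                    (System.nanoTime() - t1) / 1e6, CALLS);
        }
    }

Note the comparison is deliberately unfair: the socket keeps one connection open while each HTTP call pays connection and header overhead, which is exactly the trade-off in question.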

.net webservices using complex types with other platforms e.g. Java

I'm working on a .NET system that will both expose and consume web services with another system to pass data back and forth - the other system is Java-based.
Our proposed XSD contains complex types, and some concern has been expressed about using complex types and whether we'd be better off sticking to simple types. I'd have thought .NET would be able to support complex types, so I was hoping someone could elaborate on what problems I'm likely to face. I've tried googling but haven't found anything specific.
The Exposing .NET WebService to Other Platform (Java) stackoverflow question has an answer that says:
"This should work out of the box, but I would advise against returning complex data structures or expecting such as input arguments. If you need complexity of that kind, I would suggest returning/accepting XML instead."
but it doesn't really explain why, so any thoughts / explanations would be greatly appreciated.
EDIT - note that I'm not planning to transfer platform-specific objects over these services; instead I want to model business entities in a shared XSD as complex types, built out of simple types (so that they can be easily controlled and reused in other XSDs), and these are the elements the concern has been raised about.
I plan to do some proof of concept on this to see if I can prove this working / surface any problems, but thought I'd get some views of SO users who have done this before, first.
There are many platform-specific types that can be used easily as long as both endpoints are homogeneous, but which don't map cleanly to XSD or to other platforms. For example, DataTable in .NET is a royal PITA from anywhere else, and anything implementing IXmlSerializable in .NET is most likely completely lying in the schema.
In the interop scenario, it is usually worth starting from the XSD, as that gives a common standard that all reasonable clients should expect.
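For what it's worth, a shared complex type built from simple types usually maps to an unremarkable class on the Java side (e.g. as generated by JAXB's xjc). A hand-written sketch of that shape; the Customer/Address names are invented for illustration:

    import javax.xml.bind.annotation.XmlAccessType;
    import javax.xml.bind.annotation.XmlAccessorType;
    import javax.xml.bind.annotation.XmlRootElement;

    // Roughly what an xsd:complexType composed of simple types becomes in Java.
    @XmlRootElement(name = "customer")
    @XmlAccessorType(XmlAccessType.FIELD)
    public class Customer {
        private String name;     // maps to an xs:string element
        private Address address; // a nested complex type, reusable across XSDs

        @XmlAccessorType(XmlAccessType.FIELD)
        public static class Address {
            private String street;
            private String city;
        }
    }

As long as the contract stays composed of schema primitives like this, both .NET and Java tooling handle it cleanly; the trouble described above starts when platform-specific types (DataTable, IXmlSerializable implementations, etc.) leak into the schema.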

how to design messages in a java client-server model

I have set up a basic client and a basic server using Java sockets, and I can successfully send strings between them.
Now I want to design some basic messages.
Could you give me any recommendations on how to lay them out?
Should I use Java's serialization to send classes?
Or should I just encode the information I need in a custom string and decode it on the other side?
What about recognizing the type of a message? Is there some convention for this, like the first 4 characters of each message being an identifier for the message?
Thanks!
I would recommend that you not reinvent the wheel. If Java serialization suits you, just use it.
Also take into account that there are some nice serialization frameworks around: Thrift, from Facebook, and Protocol Buffers, from Google.
Thrift is also an RPC mechanism, so you could use it instead of opening and reading raw sockets, but this, of course, depends on your problem domain.
Edit: And answering your question about the message formatting: if you want to implement your own protocol and you have more than one type of message, then yes, you should implement a header. But I warn you that implementing a protocol is hard and very error-prone. Just create an object containing the different inner objects and methods you need, add a version field if you want, and make it implement the java.io.Serializable interface.
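A minimal sketch of that advice, assuming both ends share the class definition (the type names and field choices here are made up):

    import java.io.*;
    import java.net.Socket;

    // A versioned, serializable message; both client and server must have
    // this exact class on their classpath.
    class ChatMessage implements Serializable {
        private static final long serialVersionUID = 1L; // guards against silent class drift
        final int version = 1;   // application-level version field
        final String type;       // e.g. "LOGIN" or "TEXT" -- invented type tags
        final String payload;

        ChatMessage(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    public class Sender {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket("localhost", 9090);
                 ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream())) {
                out.writeObject(new ChatMessage("TEXT", "hello"));
            }
            // The receiving side does the mirror image:
            //   ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
            //   ChatMessage msg = (ChatMessage) in.readObject();
        }
    }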
Maybe JMS would help you; it's hard to say without knowing the details. But JMS is standard, well thought out, and versatile, and there are an impressive number of implementations available, open source and commercial. We use Sun's OpenMQ implementation and we're quite happy with it. It's fast enough for our needs, very mature, and reliable.
Mind you, JMS is not a lightweight affair by any standard so it may very well be overkill for your needs.
If you're going to deploy this in a production environment, I'd advise you to look at either RMI or XML web services. (Google's Protocol Buffers are interesting too, but they do not include a standard protocol for message transport, although third-party implementations exist.)
If you're doing this for the pleasure of learning, there are tons of ways to go about this. In general, a message in a generic messaging system will have some kind of "envelope format" which contains not only the message body, but also metadata about the message. A bare minimum for the header is something that identifies the intended receiver - either an integer identifier, a string representing a method name or a file, or something like it.
A simple example is HTTP, a plain-text format where the envelope is made up of all the lines before the first blank line. The first line identifies the protocol version and the intended receiver (≈ the file requested), the following lines are metadata about the request, and the message body follows the first blank line.
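The same idea in a custom binary protocol can be as small as a type tag plus a length prefix. A minimal sketch (the 4-byte integer type codes are an arbitrary choice, not any standard):

    import java.io.*;

    // Envelope layout on the wire: [int type][int bodyLength][body bytes].
    public final class Envelope {
        public static void write(DataOutputStream out, int type, byte[] body)
                throws IOException {
            out.writeInt(type);        // header: message type identifier
            out.writeInt(body.length); // header: body length in bytes
            out.write(body);           // the message body itself
            out.flush();
        }

        // Returns the body; the message type is left in typeHolder[0].
        public static byte[] read(DataInputStream in, int[] typeHolder)
                throws IOException {
            typeHolder[0] = in.readInt();
            byte[] body = new byte[in.readInt()];
            in.readFully(body); // keep reading until the whole body arrives
            return body;
        }
    }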
In general, XML is a common format for distributed services (mostly because of its good schema capabilities and cross-platform support), although some schemes use other formats for simplicity and/or performance. RMI uses standard Java object serialization, for example.
What you choose to use is ultimately based on your needs. If you want to make it easy to interact with your system from a large amount of platforms, use XML web services (or REST). For communication between distributed Java subsystems, use RMI. If your system is extremely transaction intensive, maybe a custom binary format is best for faster processing and smaller messages - but before doing this "optimization", remember that it requires a lot more work to get it working properly and that most apps won't benefit a lot from it.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has the option of either writing locally to CSV files, or writing remotely, via a connection to a Java server, to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place rather than scattered across dozens of plain-text files. Furthermore, since it is a binary file format, it should yield significant space savings compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from the simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine) or to get some subset of the raw data to play with in MATLAB or SPSS.
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given that this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use XQuery as the query language on HDF5 files, but the paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter PyTables. It seems to do exactly what we need (it provides two different ways of querying data, either through Python list comprehensions or through in-kernel (C-level) searches).
The proposed architecture I envision is this:
What I'm not really sure how to do is link together the Python code that will be written for querying with the Java code that serves up the HDF5 files and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for communicating between Java and Python, but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of the data would be transmitted to the clients. Pyro seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the data files seems solid. However, one design constraint that wasn't mentioned is among the most critical in all data processing: which of these data processing tasks can be done in batch style, and which are more of a live stream?
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another significant constraint is your IT infrastructure. Do you have high-speed network-attached storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means the server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when it queries the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This is as simple as finding libraries that support both Java and Python and trying them; the API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices.) But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
- What data formats or methods can this application use to communicate?
- What data does it actually want? (Is this always the same, or does it change on a whim depending on other requirements?)
- How often does it need it?
- Approximately how many resources does the application need?
- What does the application do now that it doesn't do well?
- What can this application do now that would help, but that it isn't doing?
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation: you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "does now but doesn't do well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file, and before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the least number of operations, and use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is simplest; sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great: the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language and the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that is certain to work very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How do I get Python and Java to play nice together?", is simply: "use a language-agnostic communication method." The truth of the matter is that Python and Java are both unimportant to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the operating system do what OSes do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java code can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
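A minimal sketch of this, assuming a hypothetical query.py script that takes an HDF5 path and a PyTables filter expression on its command line:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class QueryLauncher {
        public static void main(String[] args) throws Exception {
            // Fork the Python query as a child process; the paths and the
            // filter expression here are invented for illustration.
            ProcessBuilder pb = new ProcessBuilder(
                    "python", "query.py", "/data/study.h5", "agent_id == 42");
            pb.redirectErrorStream(true); // fold stderr into stdout

            Process p = pb.start();
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    System.out.println(line); // stream filtered results as they arrive
                }
            }
            System.out.println("query.py exited with " + p.waitFor());
        }
    }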
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette; I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this is going for you. We have a very, very, very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary, we have to make or modify handwritten code to do the summaries. Our binary files are about 10 GB in size, and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?

Performance comparison of Thrift, Protocol Buffers, JSON, EJB, other?

We're looking into transport/protocol solutions and were about to do various performance tests, so I thought I'd check with the community if they've already done this:
Has anyone done server performance tests for simple echo services, as well as serialization/deserialization for various message sizes, comparing EJB3, Thrift, and Protocol Buffers on Linux?
The primary languages will be Java, C/C++, Python, and PHP.
Update: I'm still very interested in this; if anyone has done any further benchmarks, please let me know. Also, there's a very interesting benchmark showing compressed JSON performing similarly to or better than Thrift / Protocol Buffers, so I'm throwing JSON into this question as well.
The latest comparison is available at the thrift-protobuf-compare project wiki. It includes many other serialization libraries.
I'm in the process of writing some code in an open source project named thrift-protobuf-compare, comparing protobuf and Thrift. For now it covers a few serialization aspects, but I intend to cover more. The results (for Thrift and Protobuf) are discussed in my blog; I'll add more when I get to it.
You may look at the code to compare the API, description language, and generated code. I'll be happy to have contributions to achieve a more rounded comparison.
You may be interested in this question: "Biggest differences of Thrift vs Protocol Buffers?"
I did test the performance of PB against a number of other data formats (XML, JSON, default Java object serialization, Hessian, one proprietary one) and libraries (JAXB, Fast Infoset, hand-written) for the data binding task (both reading and writing), but Thrift's format(s) was not included. Performance for formats with multiple converters (like XML) had very high variance, from very slow to pretty darn fast. The correlation between the authors' claims and the observed performance was rather weak, especially so for packages that made the wildest claims.
For what it is worth, I found PB performance to be a bit overhyped (usually not by its authors, but by others who only know who wrote it). With default settings it did not beat the fastest textual XML alternative. With the optimized mode (why is this not the default?), it was a bit faster, comparable with the fastest JSON package. Hessian was rather fast, and so was textual JSON. The proprietary binary format (no name here, it was company-internal) was the slowest. Java object serialization was fast for larger messages, less so for small objects (i.e. high fixed per-operation overhead).
With PB the message size was compact, but given all the trade-offs you have to make (data is not self-descriptive: if you lose the schema, you lose the data; there are indexes of course, and value types, but you would have to reverse-engineer the field names from what you have, if you want them), I personally would only choose it for specific use cases: size-sensitive, closely coupled systems where the interface/format never (or very, very rarely) changes.
My opinion on this is that (a) the implementation often matters more than the specification (of the data format), and (b) end-to-end, the differences between best-of-breed implementations (for different formats) are usually not big enough to dictate the choice.
That is, you may be better off choosing the format + API/lib/framework you like using most (or that has the best tool support), finding the best implementation, and seeing if it works fast enough.
If (and only if!) it doesn't, consider the next best alternative.
ps. Not sure what EJB3 would be here. Maybe just plain old Java serialization?
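Along the lines of "find the best implementation and see if it works fast enough": a quick-and-dirty way to sanity-check plain Java serialization for your own message shapes. A rough sketch, not a rigorous benchmark (no JIT warmup isolation, single run), so treat the numbers as ballpark only:

    import java.io.*;
    import java.util.ArrayList;

    public class SerializationTimer {
        public static void main(String[] args) throws Exception {
            // A stand-in message: replace with your real payload shape.
            ArrayList<String> msg = new ArrayList<>();
            for (int i = 0; i < 1000; i++) msg.add("field-" + i);

            long t0 = System.nanoTime();
            int lastSize = 0;
            for (int i = 0; i < 10_000; i++) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                    out.writeObject(msg); // serialize
                }
                lastSize = buf.size();
                try (ObjectInputStream in = new ObjectInputStream(
                        new ByteArrayInputStream(buf.toByteArray()))) {
                    in.readObject();      // deserialize
                }
            }
            System.out.printf("10000 round trips: %.1f ms, %d bytes/message%n",
                    (System.nanoTime() - t0) / 1e6, lastSize);
        }
    }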
If raw network performance is the target, then nothing beats IIOP (see RMI/IIOP).
Smallest possible footprint -- only binary data, no markup at all. Serialization/deserialization is very fast too.
Since it's IIOP (that is CORBA), almost all languages have bindings.
But I presume the performance is not the only requirement, right?
One of the things near the top of my "to-do" list for PBs is to port Google's internal Protocol Buffer performance benchmark - it's mostly a case of taking confidential message formats and turning them into entirely bland ones, and then doing the same for the data.
When that's been done, I'd imagine you could build the same messages in Thrift and then compare the performance.
In other words, I don't have the data for you yet - but hopefully in the next couple of weeks...
To back up Vladimir's point about IIOP, here's an interesting performance test, that should give some additional info over the google benchmarks, since it compares Thrift and CORBA. (Performance_TIDorb_vs_Thrift_morfeo.pdf // link no longer valid)
To quote from the study:
"Thrift is very efficient with small data (basic types as operation arguments). Thrift's transports are not as efficient as CORBA with medium and large data (structs and complex types > 1 kilobyte)."
Another odd limitation, not having to do with performance, is that Thrift can only return multiple values by wrapping them in a struct - although this, like performance, can surely be improved.
It is interesting that the Thrift IDL closely matches the CORBA IDL; nice. I haven't used Thrift, but it looks interesting, especially for smaller messages, and one of its design goals was a less cumbersome install, so these are other advantages of Thrift. That said, CORBA has a bad rap, and there are many excellent implementations out there, like omniORB for example, which has bindings for Python that are easy to install and use.
Edited: The Thrift and CORBA link is no longer valid, but I did find another useful paper from CERN. They evaluated replacements for their CORBA system, and while they evaluated Thrift, they eventually went with ZeroMQ. While Thrift performed the fastest in their performance tests, at 9000 msg/sec vs. 8000 (ZeroMQ) and 7000+ (RDA, CORBA-based), they chose not to test Thrift further because of other issues, notably:
It is still an immature product with a buggy implementation
I have done a study of Spring Boot, mappers (manual, Dozer, and MapStruct), Thrift, REST, SOAP, and Protocol Buffers integration for my job.
The server side: https://github.com/vlachenal/webservices-bench
The client side: https://github.com/vlachenal/webservices-bench-client
It is not finished and has only been run on my personal computers (I have to ask for servers to complete the tests), but the results can be consulted here:
Laptop: https://github.com/vlachenal/webservices-bench/blob/master/results.md
Desktop: https://github.com/vlachenal/webservices-bench/blob/master/results-desktop.md
As a conclusion:
- Thrift offers the best performance and is easy to use.
- RESTful web services with JSON content type are pretty close to Thrift in performance, are "browser ready to use", and are quite elegant (from my point of view).
- SOAP has very poor performance but offers the best data control.
- Protocol Buffers has good performance ... until 3 simultaneous calls, and I don't know why. It is very difficult to use: I gave up (for now) on making it work with MapStruct, and I didn't try with Dozer.
Projects can be completed through pull requests (either for fixes or other results).
