Performance comparison of Thrift, Protocol Buffers, JSON, EJB, other? - java

We're looking into transport/protocol solutions and were about to do various performance tests, so I thought I'd check with the community if they've already done this:
Has anyone done server performance tests for simple echo services as well as serialization/deserialization for various messages sizes comparing EJB3, Thrift, and Protocol Buffers on Linux?
Primarily languages will be Java, C/C++, Python, and PHP.
Update: I'm still very interested in this; if anyone has done any further benchmarks, please let me know. Also, there's a very interesting benchmark showing compressed JSON performing similarly to or better than Thrift / Protocol Buffers, so I'm throwing JSON into this question as well.

Latest comparison available here at the thrift-protobuf-compare project wiki. It includes many other serialization libraries.

I'm in the process of writing some code in an open source project named thrift-protobuf-compare, comparing protobuf and Thrift. For now it covers a few serialization aspects, but I intend to cover more. The results (for Thrift and Protobuf) are discussed on my blog; I'll add more when I get to them.
You may look at the code to compare the APIs, description languages and generated code. I'll be happy to have contributions to achieve a more rounded comparison.

You may be interested in this question: "Biggest differences of Thrift vs Protocol Buffers?"

I did test performance of PB with a number of other data formats (XML, JSON, default Java object serialization, Hessian, one proprietary one) and libraries (JAXB, Fast Infoset, hand-written) for the data binding task (both reading and writing), but Thrift's format(s) were not included. Performance for formats with multiple converters (like XML) had very high variance, from very slow to pretty darn fast. Correlation between the authors' claims and perceived performance was rather weak, especially for packages that made the wildest claims.
For what it is worth, I found PB performance to be a bit overhyped (usually not by its authors, but by others who only know who wrote it). With default settings it did not beat the fastest textual XML alternative. With the optimized mode (why is this not the default?), it was a bit faster, comparable with the fastest JSON packages. Hessian was rather fast, and so was textual JSON. The proprietary binary format (no name here, it was company internal) was the slowest. Java object serialization was fast for larger messages, less so for small objects (i.e. high fixed per-operation overhead).
With PB, message size was compact, but given all the trade-offs you have to make (data is not self-descriptive: if you lose the schema, you lose the data; there are field indexes and value types of course, but from what you have you'd need to reverse-engineer your way back to field names if you want them), I personally would only choose it for specific use cases: a size-sensitive, closely coupled system where the interface/format never (or very, very rarely) changes.
My opinion in this is that (a) implementation often matters more than specification (of data format), (b) end-to-end, differences between best-of-breed (for different formats) are usually not big enough to dictate the choice.
That is, you may be better off choosing format+API/lib/framework you like using most (or has best tool support), find best implementation, and see if that works fast enough.
If (and only if!) not, consider next best alternative.
ps. Not sure what EJB3 here would be. Maybe just plain old Java serialization?
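To make the "see if it works fast enough" advice concrete, here is a rough sketch of the kind of throwaway harness I mean. It is not a proper JMH benchmark; MyMessage is a made-up placeholder for your own payload class, Jackson stands in for a fast JSON package, and plain java.io serialization stands in for the default Java serialization case:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationBench {

    // Placeholder payload; swap in a class shaped like your real messages.
    public static class MyMessage implements Serializable {
        public int id = 42;
        public String name = "example";
        public double[] values = {1.0, 2.0, 3.0};
    }

    public static void main(String[] args) throws Exception {
        MyMessage msg = new MyMessage();
        ObjectMapper mapper = new ObjectMapper();
        int iterations = 100_000;

        // Round-trip with default Java object serialization
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(msg);
            oos.flush();
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            MyMessage ignored = (MyMessage) ois.readObject();
        }
        System.out.println("Java serialization: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

        // Round-trip with Jackson JSON
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            byte[] json = mapper.writeValueAsBytes(msg);
            MyMessage ignored = mapper.readValue(json, MyMessage.class);
        }
        System.out.println("Jackson JSON: " + (System.nanoTime() - t1) / 1_000_000 + " ms");
    }
}

Run each format several times and ignore the first pass (JIT warm-up); whichever format is fast enough for your real message shapes wins.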

If the raw net performance is the target, then nothing beats IIOP (see RMI/IIOP).
Smallest possible footprint -- only binary data, no markup at all. Serialization/deserialization is very fast too.
Since it's IIOP (that is CORBA), almost all languages have bindings.
But I presume the performance is not the only requirement, right?
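For a sense of what the Java side looks like, here is a minimal RMI/IIOP client sketch. The Echo remote interface and the JNDI name "EchoService" are hypothetical, the ORB/JNDI provider configuration is omitted, and note that javax.rmi.PortableRemoteObject was removed from the JDK after Java 10:

import java.rmi.Remote;
import java.rmi.RemoteException;
import javax.naming.InitialContext;
import javax.rmi.PortableRemoteObject;

public class IiopEchoClient {

    // Hypothetical remote interface; the server must export an implementation over IIOP.
    public interface Echo extends Remote {
        String echo(String message) throws RemoteException;
    }

    public static void main(String[] args) throws Exception {
        // JNDI context pointing at the CORBA naming service (provider URL set via jndi.properties)
        InitialContext ctx = new InitialContext();
        Object ref = ctx.lookup("EchoService");
        Echo echo = (Echo) PortableRemoteObject.narrow(ref, Echo.class);
        System.out.println(echo.echo("hello over IIOP"));
    }
}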

One of the things near the top of my "to-do" list for PBs is to port Google's internal Protocol Buffer performance benchmark - it's mostly a case of taking confidential message formats and turning them into entirely bland ones, and then doing the same for the data.
When that's been done, I'd imagine you could build the same messages in Thrift and then compare the performance.
In other words, I don't have the data for you yet - but hopefully in the next couple of weeks...

To back up Vladimir's point about IIOP, here's an interesting performance test that should give some additional info beyond the Google benchmarks, since it compares Thrift and CORBA. (Performance_TIDorb_vs_Thrift_morfeo.pdf // link no longer valid)
To quote from the study:
Thrift is very efficient with small data (basic types as operation arguments).
Thrift's transports are not as efficient as CORBA with medium and large data (structs and complex types > 1 kilobyte).
Another odd limitation, not having to do with performance, is that Thrift can only return multiple values as a struct - although this, like performance, can surely be improved.
It is interesting that the Thrift IDL closely matches the CORBA IDL, which is nice. I haven't used Thrift; it looks interesting, especially for smaller messages, and one of its design goals was a less cumbersome install, so these are other advantages of Thrift. That said, CORBA has a bad rap, but there are many excellent implementations out there, like omniORB for example, which has bindings for Python that are easy to install and use.
Edited: The Thrift and CORBA link is no longer valid, but I did find another useful paper from CERN. They evaluated replacements for their CORBA system and, while they evaluated Thrift, they eventually went with ZeroMQ. Although Thrift performed the fastest in their performance tests, at 9000 msg/sec vs. 8000 (ZeroMQ) and 7000+ (RDA, CORBA-based), they chose not to test Thrift further because of other issues, notably:
It is still an immature product with a buggy implementation

For my job, I have done a study of Spring Boot integration with mappers (manual, Dozer and MapStruct), Thrift, REST, SOAP and Protocol Buffers.
The server side: https://github.com/vlachenal/webservices-bench
The client side: https://github.com/vlachenal/webservices-bench-client
It is not finished and has been run on my personal computers (I have to ask for servers to complete the tests) ... but results can be consulted on:
Laptop: https://github.com/vlachenal/webservices-bench/blob/master/results.md
Desktop: https://github.com/vlachenal/webservices-bench/blob/master/results-desktop.md
As conclusion :
Thrift offers the best performance and is easy to use
RESTful webservice with JSON content type is pretty close to Thrift performance, is "browser ready to use" and is quite elegant (from my point of view)
SOAP has very poor performance but offers the best data control
Protocol Buffers has good performance ... up to 3 simultaneous calls ... and I don't know why. It is very difficult to use: I gave up (for now) on making it work with MapStruct, and I didn't try with Dozer.
The projects can be extended through pull requests (either with fixes or additional results).

Related

Is there a way to find out the time taken for serialisation of data by Kafka

I want to measure the time taken by Kafka to serialize different data formats, and I'm not sure whether I can do it on my end (since I think this is done on the Kafka side). If so, how can I do it? Is the serialisation done after the message.send()?
I was also checking the Kafka monitoring metrics available and did not find anything related to this in their documentation either. I had seen request-latency-avg as a possible metric, but its values seem too high to be just the serialisation part.
Could anybody suggest anything for this?
Kafka has built-in Serializers and Deserializers for a number of types, such as Strings, Longs, ByteArrays and ByteBuffers, and the community provides them for JSON, ProtoBuf and Avro.
If your focus is serialization and deserialization performance, you can check the results of this benchmark: https://labs.criteo.com/2017/05/serialization/
where the author concluded:
Protobuf and Thrift have similar performances, in terms of file sizes and serialization/deserialization time. The slightly better performances of Thrift did not outweigh the easier and less risky integration of Protobuf as it was already in use in our systems, thus the final choice. Protobuf also has a better documentation, whereas Thrift lacks it. Luckily there was the missing guide that helped us implement Thrift quickly for benchmarking.
https://diwakergupta.github.io/thrift-missing-guide/#_types
Avro should not be used if your objects are small. But it looks interesting for its speed if you have very big objects and don’t have complex data structures as they are difficult to express. Avro tools also look more targeted at the Java world than cross-language development. The C# implementation’s bugs and limitations are quite frustrating.
Kafka doesn't have any API to report performance numbers for the serializer/deserializer, and there is nothing to measure there if you are using a basic serializer/deserializer.
If you are really interested, you can build a custom serializer/deserializer and collect the numbers there.
You can refer to this already answered question about custom serializers/deserializers:
Custom serializer/deserializer
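If you do go the custom serializer route, a minimal sketch could just wrap one of Kafka's built-in serializers and time each call. The class name and the per-record logging below are my own choices, not anything Kafka prescribes; in practice you'd probably accumulate the timings instead of printing them:

import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TimingStringSerializer implements Serializer<String> {

    private final StringSerializer delegate = new StringSerializer();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        delegate.configure(configs, isKey);
    }

    @Override
    public byte[] serialize(String topic, String data) {
        long start = System.nanoTime();
        byte[] bytes = delegate.serialize(topic, data);
        long elapsedNanos = System.nanoTime() - start;
        // Crude per-record logging; replace with a histogram/metric in real use.
        System.out.printf("serialize(%s): %d bytes in %d ns%n",
                topic, bytes == null ? 0 : bytes.length, elapsedNanos);
        return bytes;
    }

    @Override
    public void close() {
        delegate.close();
    }
}

Point value.serializer (or key.serializer) at this class in your producer config and you get serialization timings without touching the rest of the pipeline; the same wrapping idea works for a Deserializer on the consumer side.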

What difficulties should I expect if I write a NoSQL db using golang but want to run Hadoop mapreduce on it?

I would like to build a distributed NoSQL database or key-value store using Go, to learn Go and practice the distributed-systems knowledge I've learned in school. The target use case I can think of is running MapReduce on top of it, and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph and Amazon S3.
My question is, what difficulties should I expect when integrating such a NoSQL database with Hadoop? Or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs) if I use Go to build the system?
Ok, I'm not much of a Hadoop user so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect to find some gotchas in the default HTTP library's defaults (not expiring idle keepalive connections, not necessarily knowing when a reader has closed a stream). But if you want something more compact, know that: a. the Thrift implementation for Go, last I checked, was lacking and relatively slow; b. Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the Redis protocol, or something like that.
GC. Go's GC is very simplistic (stop-the-world, not generational, etc.). If you plan on heavy in-memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists and maps, you won't have a lot of built-in data structures to work with, like a Set type. There are tons of good implementations of them out there, but you'll have to do some digging.
Take the time to learn concurrency patterns and interface patterns in Go. It's a bit different than other languages, and as a rule of thumb, if you find yourself struggling with a pattern from other languages, you're probably doing it wrong. A good talk about Go concurrency is this one IMHO http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB, which includes a Go-based implementation of the Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a Redis-compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the Redis protocol. https://bitbucket.org/dvirsky/boilerdb

moving java class bytecode from jvm to jvm

So I have a server JVM and a client JVM. The client communicates with the server by sending serialized Java objects over TCP. Now, normally the server would have the classes of the objects it was receiving on its classpath, in order to deserialize the objects properly.
But what I'm looking for is some way to avoid that; i.e., have the client "somehow" send the class bytecode over the wire, on demand. This would of course require recursing down the class tree (in case any members of the original class were themselves objects of other classes that the server didn't know about).
So I was wondering about any technologies out there that do this sort of thing.
Thanks.
RMI includes the notion of a "class server." Sounds like you're pretty much reinventing that, so consider looking into using all or part of RMI. Here's a tutorial.
RMI has the ability to dynamically download entire class file definitions over the wire on demand.
Even if you don't use (or want to use) RMI, the technologies underlying the classloading may be of interest, and they're standard Java.
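Roughly, the moving part is the java.rmi.server.codebase property: the JVM that serializes and sends the objects advertises a URL from which the receiving JVM's RMI class loader may fetch unknown classes. A sketch only; the URL is hypothetical, the HTTP or file server hosting the .class files is not shown, and the policy file granting the download is omitted:

public class RmiCodebaseSketch {

    // On the JVM that serializes and sends the objects (the client in this question):
    static void configureSendingSide() {
        // Hypothetical URL; something must actually serve the .class files from here.
        System.setProperty("java.rmi.server.codebase", "http://client-host/classes/");
    }

    // On the JVM that deserializes (the server): RMI will only download remote code
    // if a security manager is installed and the policy grants it.
    static void configureReceivingSide() {
        System.setSecurityManager(new SecurityManager());
    }
}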
You are asking about Code Mobility. Also the area of grid computing is somewhat relevant.
Take a look at Mobility-RPC, it's a library which does exactly what you ask at the same level of granularity (class-level).
Security is something to bear in mind. But I'd also remember that SQL sent to databases, bash commands executed over SSH, Business rules engines, Adobe Flash, Java Applets, RMI as described above, ActiveX, JavaScript, Hadoop/grid computing frameworks - all of these are examples of remote code execution in widespread use. Like everything, turning the security dial to the max is going to limit your options. But all of the above are used to good effect when properly firewalled or sandboxed.
In this instance, it sounds like you want something to eliminate a minor hassle, and you're not (say) designing a full-blown distributed application. So based on what you've said, despite myself being somewhat of a code mobility proponent, I'd say code mobility is probably overkill in this case. (But useful in others.)
Regarding grid computing, take a look at GridGain, and Hadoop. GridGain is a pure (CPU-centric) grid computing framework, whereas Hadoop is more a data mining/data warehouse platform with its own replicated distributed file system (HDFS).
Both GridGain and Hadoop transmit user-defined Java code implementing tasks/jobs to remote worker nodes. Last time I checked, they did this by transferring user-supplied jar files to the relevant nodes. I think the GridGain ClassLoader is more sophisticated than Hadoop's however (but less sophisticated than Mobility-RPC's). Hadoop basically starts a new JVM for every job, not especially efficient (but not exactly the bottleneck given the IO load!).
Mobility-RPC is somewhat different because it doesn't expect the remote machine to be a worker node at all, it could be any application running the library. So it's more like RPC or ad-hoc task/object transfer.
This sounds like a very bad idea. Basically, it would mean you allow your client to send code to the server that is directly executed inside the current process. Something like this is generally considered a serious vulnerability, namely arbitrary code execution, which is one of the worst vulnerabilities you could ever have.
Building a system design based on that is, well, not so smart.
Create a classloader that loads classes from a stream. See the JarFileClassLoader example for details.
This, of course, will become hugely problematic, particularly if any of the classes use reflection and don't directly name an implementation in the bytecode, in addition to potential security issues; you'll need to look into secure classloaders.
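A bare-bones sketch of that idea (the class name and method are made up; a real version needs caching, delegation checks, and the security review mentioned above):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamClassLoader extends ClassLoader {

    // Reads the raw bytecode for one class from the stream (e.g. a socket) and defines it.
    public Class<?> loadClassFromStream(String className, InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        byte[] bytecode = buffer.toByteArray();
        // defineClass turns the raw bytes into a Class object owned by this loader
        return defineClass(className, bytecode, 0, bytecode.length);
    }
}

You'd call loadClassFromStream with the class name and the socket's input stream after the client has pushed the bytecode across, then fall back to normal deserialization.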
If plain RMI does not work for your requirements take a look to mobile agents frameworks in Java (e.g., Aglets).

Remote Procedure Calls

I am doing a Software Engineering course in which different teams are building different prototype subsystems of a big system (different subsystem of F35 Lightning aircraft!).
The problem is that teams can use different programming languages (like C++ and Java) depending upon what they are most comfortable in. However, these subsystems need to communicate with each other (e.g., radar needs to provide object coordinates to navigation and control). Hence we need to come up with a solution in which different modules can interact in real time.
Someone suggested XML-RPC, and hence I was reading about it. After reading about it, I think it is used in client-server architectures. Is this a good way of doing this kind of interprocess communication? What are my options?
Any help would be appreciated.
regards,
Newbie
There are a couple of options besides XML-RPC. For a short bullet-point comparison, take a look at:
http://michaeldehaan.net/2008/07/17/xmlrpc-vs-rest-vs-soap-vs-all-your-rpc-options/
If your exchange is more data-oriented, Protocol Buffers might be an alternative.
Protocol Buffers are a way of encoding structured data in an efficient yet extensible format.
Personally, I would go for a lightweight exchange format or method first, since the components are considered prototypes. Something like REST or some custom message passing might be simple enough, yet sufficient.
If you are already familiar with XML, it can be a reasonable answer. An advantage of XML is that you don't have to worry about how different machines represent numbers. A disadvantage is the time it takes to keep converting numbers to text and back to numbers.
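If you do go the XML-RPC route, here is a minimal Java client sketch using the Apache XML-RPC library to give a sense of the effort involved. The endpoint URL and the radar.getCoordinates method are invented for illustration; the actual method names and parameters depend on the handler your teammates register on the server side:

import java.net.URL;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class RadarClient {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://localhost:8080/xmlrpc"));  // hypothetical endpoint

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Hypothetical method exposed by the radar subsystem.
        Object result = client.execute("radar.getCoordinates", new Object[]{Integer.valueOf(42)});
        System.out.println("Coordinates: " + result);
    }
}

A C++ subsystem could expose or call the same methods with a library such as xmlrpc-c, which is the main appeal of XML-RPC here: the wire format is language-neutral, at the cost of the text conversion overhead mentioned above.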

Transmitting commands from a client to a server in Clojure/Java

I'm working on Clojure app where a client needs to send some commands to a server. These will happen in quite large volumes, so I'd like it to be reasonably efficient, both in terms of processing and over-the-wire serialised size.
What would be the best way of doing this in Clojure?
Currently I'm thinking of:
Creating a simple standard representation e.g. {:command-id 1, :params [1 2 3 "abc"]}
Serialising using some efficient Java library such as Kryo, and configuring it to understand the Clojure data types
Hacking together an appropriate Client/Server socket implementation using the Java NIO libraries for the transmission over TCP/IP
However this seems a little convoluted and I'm sure other people have come up with smarter approaches. Any ideas / advice much appreciated!
If the parameters aren't too big and the source is trusted, why not send s-expressions back and forth?
(eval (read-string "(println \"Hello World\")"))
Clojure being a Lisp dialect, code is data.
EDIT:
For safety, after reading the string you can check the command against a valid set of commands:
(contains? #{'println}
           (first (read-string "(println \"Hello World\")")))
or you can use a library designed for this such as
http://github.com/Licenser/clj-sandbox
How about Google's protocol buffers? There's a library for dealing with them from Clojure: clojure-protobuf. I remember someone on Freenode #clojure is doing a Haskell vs. OCaml vs. Clojure comparison on a serious task (processing loads of Twitter data); s/he's been lavishing praise on the lib.
Update: Here's the relevant utterance from the #clojure conversation I had in mind.
My answer is not Clojure-specific, but I tend to prefer strings over HTTP - it's reasonably standard and reasonably efficient.
There are libraries for JSON in pretty much every language, I'd go with that (along with your simple standard command format) unless the data volume is massive.
My experience is that the less you need to fiddle with specialized formats, sockets and protocols the more likely it is that you can spend the weekend on the beach :).
I'd reserve anything more complicated than JSON over http until after benchmarking shows a need for something else.
