python vs java for kafka implementation

python vs java for kafka implementation - java

#kafka users:
I have been trying to understand python client for kafka, including PyPy client as well. Following was a good benchmarking i read and realized some similar results:
http://mrafayaleem.com/2016/03/31/apache-kafka-producer-benchmarks/
I am extremely confused whether Java has a massive advantage over python, as the libraries are written using Java and kafka. So my question is, does the native implementation of Kafka in Java helps with performance tremendously when Java is used, or PyPy/Python works equally better?
Being a python programmer, I am not at all comfortable with java, and hence confused.

Apache Kafka defines a language neutral wire protocol (see https://kafka.apache.org/protocol) and so clients can be written in any programming language and do not need to be based on the Java client that ships with the core Kafka distribution. For example, the c/c++ librdkafka library is a very high performance non-Java client. There are a number of python Kafka clients including one that is based on librdkafka. Benchmark results and other comparison information for the various Kafka Python clients is available here http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/

Related

Which Language to use for Kafka Consumer

I am trying to set up a Kafka system. Since most of the existing code in my project is already in PHP, I will most probably be writing the producers in PHP itself. But I am comparatively very less constrained when it comes to choosing a language to write the consumer. Now, that there are so many clients which can be used I am in a fix.
In other to order to choose the right tech here, what are the various factors that should be kept in mind?
Would especially like to apply this knowledge to choose between java client vs node client(multithreaded model vs async model)
Any help will be highly appreciated.

The Java client is the most advance client and officially supported by the Kafka Project -- most other clients are third party projects and many do not implement all available features.
Thus, I would recommend to use Java clients.

Kafka is basically written in pure Java and Kafka’s native API is java, so this is the only language where you’re not using a third-party library.You always have an edge over writing in other languages which have an additional overhead.
Node.js isn’t optimized for high throughput applications such as Kafka. So if you need the high processing rates that come standard on Kafka, or perhaps C++.
Also, I believe Kafka consumer clients written in Java has good community support. So it makes sense to implement it using java as long as you don't have any other dependency stopping you from implementing it from.
Also, check this out for the benchmarking results using various Kafka Clients. The results are contrasting.
Client Type Throughput(No of messages)
Java 40,000 - 50,0000
Go 28,000 - 30,0000
Node 6,000 - 8,0000
Kafka-pixy 700 - 800
Logstash 250

As far as Kafka goes, I'd use any of the languages with an official Confluent supported client: JVM, C/C++, .NET, Python, Go
I'm sure you can get others to work like Node or PHP, and maybe those can use the C library, but I would prefer something with official language support and a broader user to ask questions to.

Py4J has bigger overhead than Jython and JPype

After searching for an option to run Java code from Django application(python), I found out that Py4J is the best option for me. I tried Jython, JPype and Python subprocess and each of them have certain limitations:
Jython. My app runs in python.
JPype is buggy. You can start JVM just once after that it fails to start again.
Python subprocess. Cannot pass Java object between Python and Java, because of regular console call.
On Py4J web site is written:
In terms of performance, Py4J has a bigger overhead than both of the previous solutions (Jython and JPype) because it relies on sockets, but if performance is critical to your application, accessing Java objects from Python programs might not be the best idea.
In my application performance is critical, because I'm working with Machine learning framework Mahout. My question is: Will Mahout also run slower because of Py4J gateway server or this overhead just mean that invoking Java methods from Python functions is slower (in latter case performance of Mahout will not be a problem and I can use Py4J).

JPype issue that #HIP_HOP mentioned with JVM getting detached from new threads can be overcome with the following hack (add it before the first call to Java objects in the new thread which does not have JVM yet):
# ensure that current thread is attached to JVM
# (essential to prevent JVM / entire container crashes
# due to "JPJavaEnv::FindClass" errors)
if not jpype.isThreadAttachedToJVM():
jpype.attachThreadToJVM()

PySpark uses Py4J quite successfully. If all the heavylifting is done on Spark (or Mahout in your case) itself, and you just want to return result back to "driver"/Python code, then Py4J might work for you very well as well.
Py4j has slightly bigger overhead for huge results (that's not necessarily the case for Spark workloads, as you only return summaries /aggregates for the dataframes). There is an improvement discussion for py4j to switch to binary serialization to remove that overhead for higher badnwidth requirements too: https://github.com/bartdag/py4j/issues/159

My solutions
java thread/process <-> Pipes <-> py subprocess
Use pipes by java's ProcessBuilder to call py with args "-u" to transfer data via pipes.
Here is a good practice.
https://github.com/JULIELab/java-stdio-ipc
Here is my stupid research result about "java <-> py"
[Jython] Java implement of python.
[Jpype] JPype is designed to allow the user to exercise Java as fluidly as possible from within Python. We can break this down into a few specific design goals.
Unlike Jython, JPype does not achieve this by re-implementing Python, but instead by interfacing both virtual machines at the native level. This shared memory based approach achieves good computing performance while providing the access to the entirety of CPython and Java libraries.
[Runtime] The Runtime class in java (old method).
[Process] Java ProcessBuilder class gives more structure to the arguments.
[Pipes] Named pipes could be the answer for you. Use subprocess. Popen to start the Java process and establish pipes to communicate with it.
Try mkfifo() implementation in python.
https://jj09.net/interprocess-communication-python-java/
-> java<-> Pipes <-> py https://github.com/JULIELab/java-stdio-ipc
[Protobuf] This is the opensource solution Google uses to do IPC between Java and Python. For serializing and deserializing data efficiently in a language-neutral, platform-neutral, extensible way, take a look at Protocol Buffers.
[Socket] CS-arch throgh socket
Server(Python) - Client(Java) communication using sockets
https://jj09.net/interprocess-communication-python-java/
Send File From Python Server to Java Client
[procbridge]
A super-lightweight IPC (Inter-Process Communication) protocol over TCP socket. https://github.com/gongzhang/procbridge
https://github.com/gongzhang/procbridge-python
https://github.com/gongzhang/procbridge-java
[hessian binary web service protocol] using python client and java server.
[Jython]
Jython is a reimplementation of Python in Java. As a result it has much lower costs to share data structures between Java and Python and potentially much higher level of integration. Noted downsides of Jython are that it has lagged well behind the state of the art in Python; it has a limited selection of modules that can be used; and the Python object thrashing is not particularly well fit in Java virtual machine leading to some known performance issues.
[Py4J]
Py4J uses a remote tunnel to operate the JVM. This has the advantage that the remote JVM does not share the same memory space and multiple JVMs can be controlled. It provides a fairly general API, but the overall integration to Python is as one would expect when operating a remote channel operating more like an RPC front-end. It seems well documented and capable. Although I haven’t done benchmarking, a remote access JVM will have a transfer penalty when moving data.
[Jep]
Jep stands for Java embedded Python. It is a mirror image of JPype. Rather that focusing on accessing Java from within Python, this project is geared towards allowing Java to access Python as a sub-interpreter. The syntax for accessing Java resources from within the embedded Python is quite similar to support for imports. Notable downsides are that although Python supports multiple interpreters many Python modules do not, thus some of the advantages of the use of Python may be hard to realize. In addition, the documentation is a bit underwhelming thus it is difficult to see how capable it is from the limited examples.
[PyJnius]
PyJnius is another Python to Java only bridge. Syntax is somewhat similar to JPype in that classes can be loaded in and then have mostly Java native syntax. Like JPype, it provides an ability to customize Java classes so that they appear more like native classes. PyJnius seems to be focused on Android. It is written using Cython .pxi files for speed. It does not include a method to represent primitive arrays, thus Python list must be converted whenever an array needs to be passed as an argument or a return. This seems pretty prohibitive for scientific code. PyJnius appears is still in active development.
[Javabridge]
Javabridge is direct low level JNI control from Python. The integration level is quite low on this, but it does serve the purpose of providing the JNI API to Python rather than attempting to wrap Java in a Python skin. The downside being of course you would really have to know a lot of JNI to make effective use of it.
[jpy]
This is the most similar package to JPype in terms of project goals. They have achieved more capabilities in terms of a Java from Python than JPype which does not support any reverse capabilities. It is currently unclear if this project is still active as the most recent release is dated 2014. The integration level with Python is fairly low currently though what they do provide is a similar API to JPype.
[JCC]
JCC is a C++ code generator that produces a C++ object interface wrapping a Java library via Java’s Native Interface (JNI). JCC also generates C++ wrappers that conform to Python’s C type system making the instances of Java classes directly available to a Python interpreter. This may be handy if your goal is not to make use of all of Java but rather have a specific library exposed to Python.
[VOC] https://beeware.org/project/projects/bridges/voc/_
A transpiler that converts Python bytecode into Java bytecode part of the BeeWare project. This may be useful if getting a smallish piece of Python code hooked into Java. It currently list itself as early development. This is more in the reverse direction as its goals are making Python code available in Java rather providing interaction between the two.
[p2j]
This lists itself as “A (restricted) python to java source translator”. Appears to try to convert Python code into Java. Has not been actively maintained since 2013. Like VOC this is primilarly for code translation rather that bridging.
[GraalVM]
Source: https://github.com/oracle/graal

I don't know Mahout. But think about that: At least with JPype and Py4J you will have performance impact when converting types from Java to Python and vice versa. Try to minimize calls between the languages. Maybe it's an alternative for you to code a thin wrapper in Java that condenses many Javacalls to one python2java call.

Because the performance is also a question about your usage screnario (how often you call the script and how large is the data that is moved) and because the different solutions have their own specific benefits/drawbacks, I have created an API to switch between different implementations without you having to change your python script: https://github.com/subes/invesdwin-context-python
Thus testing what works best or just being flexible about what to deploy to is really easy.

How to integrate Java with nodejs for handling CPU-heavy tasks?

I am trying to pick a right web technology both for I/O heavy and CPU heavy tasks. NodeJs is perfect for handling large load and it also can be scaled out. However, I am stuck with the cpu heavy part. Is it possible to integrate another technology (e.g. Java) into node, so that I will have it running my algorithms in other threads and then use the results again in node. Is there any existing solution? Any other suggestions will be very good.

You can intergrate NodeJS with Java using node-java.

As mentioned in a previous answer, you can use node-java which is an npm module that talks to Java. You can also use J2V8 which wraps Node.js as a Java library and provides a Node.js API in Java.

The answer is lambda architecture.
NodeJs is nice by itself - handling fast queries in a lightweight manner, not doing any extra computations on data.
The CPU heavy tasks can be easily delegated to specialized components based on JVM (well, the most famous ones are on JVM). This is nicely implemented by using message brokers and microservices.
An event-based architecture, where nodejs can be hooked up to databases like Cassandra or Mongodb and cluster computing frameworks like Apache Spark (not necessarily, though, it depends on the problem) to handle the cpu-heavy parts of the system. And lightweight containers add an icing to the cake by providing nice isolated runtime environments for each of the components to live in.
That's my conclusion so far regarding this question.
I think the suggestions above sort of eliminate the need to wrap node under java or other JVM based solution for cpu-heavy tasks.

NodeJS is based on the v8 javascript engine which is written in c++.
It is therefore possible to write fully native addons in c++ for NodeJS. Check out some of these resources:
https://github.com/nodejs/node-addon-api
https://github.com/nodejs/node-addon-examples

Are there known problems using Scala with Apache Camel?

I know that there is a supported Scala DSL for Camel. Apart from that
Is it realistic to replace Java (the language) completely by Scala for a Camel based project?
Which kind of known problems are known to exist?
Which workarounds exist for those problems (other than using Java)?
I am mainly looking for less boilerplaty code.

Akka offers stable Scala-idiomatic Camel integration.
The akka-camel module allows actors,
untyped actors and typed actors to
receive and send messages over a great
variety of protocols and APIs. This
section gives a brief overview of the
general ideas behind the akka-camel
module, the remaining sections go into
the details. In addition to the native
Scala and Java actor API, actors can
now exchange messages with other
systems over large number of protcols
and APIs such as HTTP, SOAP, TCP, FTP,
SMTP or JMS, to mention a few. At the
moment, approximately 80 protocols and
APIs are supported.
Apart from that, I'm sure this replacement is possible due to a good interop, and there could hardly be any Scala-specific issues that are not peculiar to Java. E.g., Akka Actors used for publishing to/consuming from Camel endpoints are based on java.util.concurrency, and the only problem I can think of is a fixable bug in the library.

In the meantime a relatively simple Scala DSL has been developed for Camel, that should have the functionality of the Java DSL.
To decide if it is realistic for you, consider:
- The quality of the IDE support for the languages
- The Scala language complexity
- The Scala/Java language popularity
- DSL extension possibilities. In Scala, it should be possible (with some Scala magic) to to extend the DSL (add additional DSL elements)
If you decide to try it out, it would be great if you share your experience with the Apache Camel community your impressions on:
code readability, code maintainability, code efficiency, developer satisfaction, code size, the number of "man-days".

Since then (2010-2011), there is now (Sept 2016) a recent initiative named for Akka Streams Integration, codename Alpakka.
We believe that Akka Streams can be the tool for building a modern alternative to Apache Camel. That will not happen by itself overnight and this is a call for arms for the community to join us on this mission. The biggest asset of Camel is its rich set of endpoint components. We would like to see that similar endpoints are developed for Akka Streams.
See "akka/akka-stream-contrib".

Are there tools to integrate Java and C++?

thanks for reading this question.
I am doing this homework which need a GUI as frond end to integrate with back end code which written in C++.
I wanna to write this front end GUI in java as its cross-platform feature and strong graphic components.
Is there any good way I can integrate java and C++ well?
Thank you

Swig works very well. It's a means to bind C/C++ to a huge variety of languages. I have experience of using this to talk to C++ with very little grief. Here's the manual page on using Swig and Java together. The tutorial gets you going very quickly, with many examples including Java.
I would however investigate splitting your application into a client/server architecture, to separate the C++ backend from the Java front end. You'll avoid the C++/Java development and integration pain = although you'll have to implement some communication protocol between the front and back end depending on requirements (e.g. basic sockets / webservice / HTTP+REST or possibly CORBA - which comes natively to Java and is designed for cross-language communication).

Assuming you are the back end component is on the same machine you could use an interface layer as described by others
JNI
JNA
Swig
QTJambi
These all require you c++ backend to be available in a dll and usually provide Java proxies for C functions and sometimes c++ classes. There is a learning curve to all of these and some work to enable the Proxy.
Another approach would be to use a c++ process and communicate with this using either
command line
stdin/stdout
If you want to support communication across a network
sockets
CORBA
WebServices
Thrift
These also have a learning curve and some set up costs
Of these the command-line or stdin/stdout is probably the fastest to get working with the minimal amount of effort and knowledge. However it does not scale well to large interfaces as you must encode the input and output of each message as text
For the command-line approach you execute the c++ process using command line switches for the options, the results are either read from the processes standard out or its exit code.
For stdin/stdout you start the process each request is sent to stdin of the process and the results are read from stdout.

Hava a look at JNI (Java Native Interface). Sun has an online book on JNI.

How about Thrift?
Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.

If you are not writing the C++ backend library yourself, but just want to use a third-party library, the better alternative would be to use JNA.
The main benefit of using JNA over JNI in this case is that the bridging code is all written in Java (rather than in the native language, C++ in your case). This means you wouldn't need to complicate your build process by building C++ JNI interfaces, all your interface work would be written in the language of the main project.
If however, you are writing the C++ backend yourself, then any of the other options already provided would be equally applicable.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.