Mixing python with a faster language for optimization in GAE - java

I'm a newbie in the Python and GAE world and I have a question.
With Python the normal approach is to only optimize the code when needed, fixing the more urgent bottlenecks.
And one of the ways to achieve that is by rewriting the most critical parts of the program in C.
By using GAE are we losing this possibility forever?
Since Google's Go language is now (or it will be as soon as it is compiled more efficiently) the fastest language in GAE, will there be a way to mix Python and Go in the same app?
What other ways could be used to achieve a similar result?

See Can I write parts of the Google App Engine code in Java, other parts in Python? for how to use multiple languages.
Basically, each version of a given app can only use one runtime language.
But, you can have two different versions of your app, written in different languages, and they can pass information back and forth through the datastore.
Also, you can have two different apps, in two different languages, and have them pass information back and forth through requests.

I think you're falling for premature optimisation here. For nearly all webapps, the majority of time spent is in RPCs, waiting for the rest of the system to do something such as process datastore queries. Of the remainder, a significant fraction is often spent in C code anyway. There are relatively few webapps that need to do a lot of processor-intensive work in order to serve a typical query.
If your app is one of those, you may want to reconsider writing your entire app in Python, given the unavailability of C extensions on App Engine, and choose Java or Go. If your app is one of the 99% that don't need to do much processor intensive work for typical requests, don't worry about it.
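One way to check where request time actually goes, before reaching for a faster language, is to profile a handler. The following is a minimal sketch using the standard library's cProfile; `fake_datastore_query`, `render` and `handle_request` are hypothetical stand-ins, and the sleep only imitates waiting on an RPC.

```python
import cProfile
import io
import pstats
import time


def fake_datastore_query():
    """Hypothetical stand-in for an RPC: the handler just waits."""
    time.sleep(0.05)


def render(n):
    """Hypothetical stand-in for the CPU work done in Python itself."""
    return sum(i * i for i in range(n))


def handle_request():
    fake_datastore_query()
    return render(10_000)


def profile(func):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()
```

Calling `print(profile(handle_request))` lists the functions that account for most of the wall-clock time. If the waiting dominates, as the answer above expects for most webapps, rewriting the Python part in another language would not help much.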

Related

Different languages in a downloadable application

I would like your opinion on whether this idea sounds good to you, and if not, what you would do instead.
The goal of my project is to make a downloadable application that lets the user input a text file of experimental data, then performs calculations on the data to find statistical values such as the mean, standard deviation, and slope and intercept of the linear regression line. These are presented on the screen, as well as a scatter plot or histogram of the data.
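For concreteness, the statistics listed above can be sketched in a few lines. This is only an illustration under assumptions: Python is used for brevity (the thread itself discusses Java and C), `describe` is a hypothetical name, the standard deviation is the sample (n - 1) version, and the regression is ordinary least squares.

```python
import math


def describe(xs, ys):
    """Return the mean and sample standard deviation of ys, plus the
    least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean": mean_y,
        "stdev": math.sqrt(syy / (n - 1)),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }
```

For example, `describe([1, 2, 3, 4], [3, 5, 7, 9])` describes points lying exactly on the line y = 2x + 1.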
For now, my plan has been to code the interface that the user interacts with in Java using the Swing library, and the part that performs the calculations in C. My reasons for doing this are that Java is good for GUIs that can be used on any machine, and C is faster at performing big calculations. One critical step in my project is to parallelize the code using the MPICH library so that my program can do things like make many sets of randomized data and analyze them. The Java and C code would communicate with each other by inputting and outputting text files, and I have been told that I need to do some shell scripting to bridge the two together. By doing this, I would hope that the Java code would give the C code the text file of the original data, the C code would do the calculations and report the statistical values in the form of a text file, and then the Java code would read this text file to present the results of data analysis to the user.
The important characteristics of this downloadable application are:
1. Has a very clear, easy-to-use interface
2. Can be downloaded and used easily, ideally on all kinds of computers (Windows, Mac, Linux)
3. Takes advantage of parallelization to do big calculations faster
I am not very knowledgeable about these languages or environments, and I am having a few doubts about my plan.
1. I know that Java programs are easily downloadable in a jar file, but if I use Java and C, will my program still be easily downloadable and able to be used on all machines with the shell scripting?
2. Would it be best to do all my coding in one language and still preserve the important characteristics listed above? If so, what would I be losing by doing so compared to using two languages?
I appreciate your help!
Please read again quotes from your own post:
First
I am not very knowledgeable about these languages
But:
Java is good for GUIs
And
C is faster at performing big calculations
So, you do not really have knowledge of Java and C, yet you state that Java is good for GUIs and C is faster. Both statements are probably wrong.
For the last 15 years people have tended to avoid implementing Swing UIs and desktop applications in Java. They typically try to move the calculations to a server and control the process through a web-based UI. (This is probably not applicable to your use case if you really have large input data sets, e.g. tens or hundreds of GB.)
During the same period the JVM has improved significantly in terms of performance, so the assumption that C code runs faster than the same code written in Java could be incorrect.
So you should probably implement everything in Java.
If you cannot move the calculations to a server, you can implement a Java application and run it using JNLP.
However, before you start, may I recommend that you ask another design/architecture question containing more details about the amount of your input/output data and the nature of your calculations?
Let's address your characteristics first.
1. Interface. This is debatable, but I would claim that while certain languages make clean UIs easier, it is possible to make a good UI in any popular language with proper library support.
2. Portability. If you are planning to distribute in binary form and not in source form, then Java wins hands down. If you are planning to distribute in source form then C is possible, but you will have to provide a means of building the application on each platform, which will usually be different for each platform.
3. Performance through parallel computation. For your kind of application (CPU-bound), C will likely be faster, but the difference may not matter enough for you to care.
Now let's talk about your doubts.
1. This is addressed in (2) above.
2. A single-language program is generally easier to maintain and distribute. You will save yourself a lot of headaches by not dealing with two languages. You might lose some performance if you choose only Java, but you will lose no portability by choosing only C rather than C plus Java, since the C part would already be non-portable.

Strictly server-side processing (no web browser interaction): is Java or PHP better for this scenario?

Here's the situation:
I currently have a web application that uses PHP to serve HTML/CSS/JS and that talks to a MySQL DB. Completely vanilla and common. The PHP is a mixture of presentation logic (HTML generation, etc) and business logic (the app uses Ajax extensively to make requests for data or to tell the server to make changes to something).
As part of a redesign of this system I am removing all of the presentation logic from the PHP. Instead, I will be using Ext JS 4 (a javascript-based windowing toolkit / app) connected to a web socket gateway (a COMET/AJAX replacement that allows bi-directional communication) on the server. Let's wave a magic wand for a minute and forget about how the Ext JS 4 gets delivered to the browser and how it talks to the web socket gateway.
What we are left with is a web socket gateway (written in Java and running persistently listening on a specific port for web socket connections) and some business logic / DB interaction currently written in PHP.
At this point, I see one of two options:
Keep the business logic / DB interaction in the PHP and execute it by calling either PHP from the command line or by having the PHP / Apache listen on a different port only for communications from the web socket gateway.
Write a new Java or C++ application that will be persistent and listen on a specific port for communications from the web socket gateway. The business logic / DB integration is re-written in Java or C++ code and is part of this application.
Would re-writing in Java or C++ give better performance than calling PHP over and over? (The PHP code is pretty cleanly written: object-oriented using packages like CodeIgniter and Doctrine).
Would the performance benefits outweigh the hassle of re-writing all the business logic? Obviously dependent on many factors such as quantity of code but what is your gut feeling?
In case it might influence your thinking / feedback, you should know that the web socket gateway (Kaazing) supports JMS, Stomp, AMQP, XMPP, or something custom you build yourself.
Let me know if there is any other info I can provide to help you with your answers.
Thanks!
I know a lot of the solutions I mention here are "ugly" but you sound like a person who's looking to get results and refactor, so I hope it's okay.
Do it the easy way (PHP if I understood correctly) first. Then run a realistic stress test. Since you're making PHP calls, just create a realistic sequence (log in, change this, do that, log out) and run as many as you think is realistic. 100? 10000? It depends on how stressed you expect this thing to be and still perform.
That step is easier than it sounds. Don't think "ultimate test framework", think 20-line Python script that runs as many threads as you want, each executing a few lines that will keep your application busy. If it takes you more than 40 minutes, stop and simplify. The hour you spend will be worth it.
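A rough sketch of what such a script might look like, assuming the application is reachable over HTTP. The base URL and paths are hypothetical placeholders, and `run_scenario` and `stress` are illustrative names rather than part of any framework.

```python
import threading
import time
import urllib.request


def run_scenario(base_url, paths):
    """Play one user session: request each path in order, return elapsed seconds."""
    start = time.perf_counter()
    for path in paths:
        with urllib.request.urlopen(base_url + path, timeout=10) as resp:
            resp.read()
    return time.perf_counter() - start


def stress(scenario, users):
    """Run `scenario` in `users` threads at once; return (timings, errors)."""
    timings, errors = [], []
    lock = threading.Lock()

    def worker():
        try:
            elapsed = scenario()
            with lock:
                timings.append(elapsed)
        except Exception as exc:  # record the failure instead of killing the thread
            with lock:
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings, errors


# Example call (hypothetical URL and paths, adjust to your application):
# timings, errors = stress(
#     lambda: run_scenario("http://localhost:8080", ["/login", "/change", "/logout"]),
#     users=100,
# )
# print(len(errors), "errors; slowest session:", max(timings, default=0.0))
```

Watching CPU and memory on the server while this runs tells you whether there is anything worth rewriting at all.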
If CPU usage hits 100% or you run out of some resource then perhaps it's time for a rewrite, or you can probably guess what's taking the longest and write it in C. If you do use C/C++ and you're not 100% comfortable with it, avoid a major rewrite, since it's a dangerous language with lots of opportunities for introducing bugs. Maybe even call compiled code from the PHP you have if that suits your application.
I've written server-side HTML-generating C code once. It's not exactly the right tool for the job. PHP may be hackish but it gets the job done fast. I would avoid optimization unless/until it is actually needed.
Good luck, don't forget to tell us how it goes!
Edit: If you do go for a mixed-language solution, don't forget to clean it up after! Standardize what you do fast and what you do in PHP, do it in a common format, maybe write up a short readme. Again, those fifteen minutes will save you, or the next person, a few days and many hairs.
Writing in a compiled language (Java or C++, in your examples) would almost certainly give better performance than an interpreted language like PHP. The performance benefits almost certainly would not outweigh the hassle of rewriting all of the code.
If your business logic has high processing costs, Java or C++ will give you a much better performance.
If you are simply fetching some results from a DB, do not expect any great performance gains.
I would do some prototyping/testing to identify the performance bottleneck.
My opinion is that PHP is too slow for processing HUGE datasets. If you have many hundreds of thousands of objects to analyse, C++ rocks, and Java benefits from the HotSpot JIT performance optimizer.
The HotSpot effect is very specific to doing number crunching in Java. You really can see the JRE is pushing the accelerator, ironing out detected bottlenecks. In some rare cases HotSpot JIT optimised Java can be even faster than C.
In some also very rare cases HotSpot performance voodooism can make your code slower!
Have you ever thought of turning a PHP application into a faster Java or C++ app?
Maybe the HipHop php2cpp compiler is all you want:
https://github.com/facebook/hiphop-php/wiki/
Quercus is a php4java runtime which can help you migrate more cheaply to Java.
http://quercus.caucho.com/
Quite interesting was Joshua Bloch's talk about "Performance Anxiety" last year.
http://www.wiki.jvmlangsummit.com/images/1/1d/PerformanceAnxiety2010.pdf
http://parleys.com/#st=5&id=2103 (32min video)

what is faster flex and Java or flex and php?

We are designing a major web application for the www.
It is a social community site, and I would like to know which direction I need to take.
What works faster: Flex and PHP, or Java and Flex?
I've read that flex and php with amfphp is very fast (with AMFEXT).
But I have seen that 90% of the major companies here in Europe are hiring java / flex developers to develop major webapplications.
Our application needs to handle a lot of users at the same time.
Our application will be hosted in a datacenter; later it will be hosted by a major CDN provider.
Our application has video (streaming and progressive streaming), a shopping mall and a community area.
Due to the nature of our business model we think that our application will attract a lot of users a day.
So we must have a webapplication that works very fast. With a strong technology on the backend. Java or PHP (amf support)
for the Database:
We will start with MySQL and make the switch to Oracle and then to sas.
What is the right direction for our application?
flex and java or flex and php?
I have no idea which provides "faster" execution - however, I do know that "faster" isn't the only reason to choose a language. Here's a general comparison of Java and PHP and here's another that compares Java, PHP and Ruby on Rails - neither one focuses on the language executing "faster".
Especially with Flex - you will most likely spend more time executing in Flex rather than in the backing server side language. Also, since the application is Flex - it should be possible to provide similar test implementations in PHP and Java and compare the results for your specific application.
The biggest part of the choice would be whatever language and platform your developers are familiar with.
This is a pretty subjective question. I believe that PHP tends to be a little bit faster, but it really depends on your application's requirements. From personal experience, I have been able to get more done with less code with PHP. Java has a much more strictly enforced object-oriented approach, which is actually quite nice, whereas PHP is still lacking a bit in this area. For the most part, you will be able to accomplish the same things with both languages. I also feel that PHP has much better community-driven support than Java, which could be a factor. It really all depends on what you guys are most comfortable with. Both languages play well with Flash/Flex.
Java is faster than PHP in terms of pure execution time. Here is an interesting algorithm performance comparison that ranks a number of languages, showing Java to be approximately 300 times faster than PHP:
http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/
With that said, this is NOT a good approximation of the speed differences for real-world applications. A major bottleneck will typically be your database. However if your application requires a lot of processing that doesn't occur in the database, you may see performance improvement with Java.
One advantage in terms of remoting is that Adobe offers Blaze DS which is a standard implementation of AMF for Flex. They also include some messaging capabilities ("data push") which I don't believe are implemented in AMFPHP.
Language choice is largely (though not entirely) irrelevant in terms of speed. Very large deployments have been built on both, and the speed factor comes from good architecture and code. So whether you go with php or java, hopefully there are good architects/designers/developers versed in the ways of writing for performance involved.
Java is always going to be faster than PHP, unless you have done something very wrong!
BUT...
The speed of the server side script won't really be noticed by the user, because so many other things add to the time it takes to get a response from the server (network delay, propagation delay, etc). To the user PHP and Java will seem equally fast.
To the server, however, there is a difference. According to your post you plan to have many concurrent users. If each request takes 20% longer to complete with PHP, then the same server can handle roughly 17% fewer requests with PHP (1/1.2 ≈ 0.83 of the Java throughput). So if you worry that the server will fill up and run at maximum capacity, then I would pick Java. If you don't expect that to happen for quite a while, then I would pick PHP, based solely on performance.
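The capacity arithmetic can be checked with a back-of-envelope sketch. It assumes a server with a fixed amount of processing capacity, so that the sustainable request rate is inversely proportional to the time each request takes; `relative_capacity` is a hypothetical helper name.

```python
def relative_capacity(slowdown):
    """Fraction of the baseline request rate a fixed-size server can sustain
    when each request takes `slowdown` times as long (1.2 means 20% longer)."""
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    return 1.0 / slowdown


def percent_fewer_requests(slowdown):
    """Percentage drop in sustainable request rate for a given slowdown."""
    return (1.0 - relative_capacity(slowdown)) * 100.0
```

Under that assumption, requests that take 20% longer leave about 83% of the original throughput, i.e. roughly 17% fewer requests served by the same hardware.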
Of course there are other things to take into account, like what you can do with each language, libraries available, developers available to/how well you know each one.
I would also strongly advise against changing anything in the backend once the system is up and running. If you start out with MySQL, don't change to Oracle halfway. Either stick with MySQL unless that becomes impossible, or start with Oracle from the beginning.
I would say try both: build a prototype (e.g. 3-4 pages) in each language and run a few performance tests. Overall this should not take more than one week.
Each language has its own pros / cons.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, it should be able to get significant space savings as compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from simulation. Furthermore, the post-processing code is doing more than it should have to do, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine, or getting some subset of the raw data to play with in MATLAB or SPSS).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but the paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need: it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5].
The proposed architecture I envision is this:
What I'm not really sure how to do is to link together the python code that will be written for querying, with the Java code that serves up the HDF5 files, and the Java code that does the post processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on same machine as the Datahose so that the large .h5 files do not have to be transferred over the network, but rather the much smaller, filtered views of it would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high speed network available storage? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using-application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language agnostic communication method so these components can be changed later if necessary. This is as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if data is predictable and batchable.)
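As an illustration of the XML-RPC prototyping route, here is a minimal sketch of a Python query service using only the standard library. The `RUNS` dictionary is a hypothetical in-memory stand-in for the HDF5 data, and `summarize`, `start_server` and `query` are illustrative names. The client half is written in Python only to keep the sketch self-contained; the point of a language-agnostic protocol is that a Java XML-RPC client could make the same call.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for per-scenario values that would come from HDF5.
RUNS = {
    "scenario_a": [3.0, 5.0, 4.0],
    "scenario_b": [10.0, 12.0],
}


def summarize(scenario):
    """Return min, max and mean across the runs of one scenario."""
    values = RUNS[scenario]
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def start_server(host="127.0.0.1", port=0):
    """Start the query service in a background thread.

    Port 0 lets the OS pick a free port; the chosen port is returned."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(summarize)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def query(port, scenario):
    """Client side: ask the service for the summary of one scenario."""
    with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
        return proxy.summarize(scenario)
```

Only the small summary crosses the wire, which matches the idea of keeping the large files next to the querying process and shipping filtered views to clients.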
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate.
What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
How often does it need it.
Approximately how much resources does the application need.
What does the application do now that it doesn't do that well.
What can this application do now that would help, but it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation where you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "do now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the fewest operations, and use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How to get Python and Java to play nice together?" is simply "use a language agnostic communication method." The truth of the matter is that neither Python nor Java is important to your described problem set. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
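The Python side of such a subprocess could be a small command-line script that takes its parameters as options and writes its result to a file or to standard output. The sketch below is an assumption-laden illustration: it reads a CSV file with the standard library instead of querying HDF5 through PyTables, and the option names (`--input`, `--column`, `--output`) are hypothetical.

```python
import argparse
import csv
import json
import statistics
import sys


def summarize(rows, column):
    """Aggregate one numeric column over an iterable of dict-like rows."""
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def main(argv=None):
    parser = argparse.ArgumentParser(description="Summarize one column of a run file.")
    parser.add_argument("--input", required=True, help="CSV file with a header row")
    parser.add_argument("--column", required=True, help="name of the numeric column")
    parser.add_argument("--output", default="-", help="JSON output file, '-' for stdout")
    args = parser.parse_args(argv)

    with open(args.input, newline="") as f:
        summary = summarize(csv.DictReader(f), args.column)

    if args.output == "-":
        json.dump(summary, sys.stdout)
    else:
        with open(args.output, "w") as f:
            json.dump(summary, f)
    return 0

# A real script would end with:  raise SystemExit(main())
```

The parent process only needs to know the command line and where the output lands, so the querying code can later be swapped for a PyTables-based version without touching the Java side.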
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered pyTables and had a similar idea to yours. I was hoping to change our storage format to hdf5 and then run our summary reports/queries using pytables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using pytables?

Choosing Java vs Python on Google App Engine

Currently Google App Engine supports both Python & Java. Java support is less mature. However, Java seems to have a longer list of libraries and especially support for Java bytecode regardless of the languages used to write that code. Which language will give better performance and more power? Please advise. Thank you!
Edit:
http://groups.google.com/group/google-appengine-java/web/will-it-play-in-app-engine?pli=1
Edit:
By "power" I mean better expandability and inclusion of available libraries outside the framework. Python allows only pure Python libraries, though.
I'm biased (being a Python expert but pretty rusty in Java) but I think the Python runtime of GAE is currently more advanced and better developed than the Java runtime -- the former has had one extra year to develop and mature, after all.
How things will proceed going forward is of course hard to predict -- demand is probably stronger on the Java side (especially since it's not just about Java, but other languages perched on top of the JVM too, so it's THE way to run e.g. PHP or Ruby code on App Engine); the Python App Engine team however does have the advantage of having on board Guido van Rossum, the inventor of Python and an amazingly strong engineer.
In terms of flexibility, the Java engine, as already mentioned, does offer the possibility of running JVM bytecode made by different languages, not just Java -- if you're in a multi-language shop that's a pretty large positive. Likewise, if you loathe JavaScript but must execute some code in the user's browser, Java's GWT (generating the JavaScript for you from your Java-level coding) is far richer and more advanced than Python-side alternatives (in practice, if you choose Python, you'll be writing some JS yourself for this purpose, while if you choose Java, GWT is a usable alternative if you loathe writing JS).
In terms of libraries it's pretty much a wash -- the JVM is restricted enough (no threads, no custom class loaders, no JNI, no relational DB) to hamper the simple reuse of existing Java libraries as much as, or more than, existing Python libraries are hampered by the similar restrictions on the Python runtime.
In terms of performance, I think it's a wash, though you should benchmark on tasks of your own -- don't rely on the performance of highly optimized JIT-based JVM implementations while discounting their large startup times and memory footprints, because the App Engine environment is very different (startup costs will be paid often, as instances of your app are started, stopped, moved to different hosts, etc, all transparently to you -- such events are typically much cheaper with Python runtime environments than with JVMs).
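A minimal sketch of the kind of benchmark suggested above, measuring the one-off startup cost separately from the warm per-request cost. The functions `load_app` and `handle_request` are hypothetical stand-ins, not part of any App Engine API; you would replace them with your own startup and request-handling code, and ideally run the measurement on the platform itself rather than locally.

```python
import timeit


def load_app():
    """Hypothetical stand-in for startup work (imports, config, caches)."""
    return {n: str(n) for n in range(50000)}


def handle_request(state):
    """Hypothetical stand-in for the work done to serve one request."""
    return sum(len(state[n]) for n in range(0, 50000, 100))


# One-off cost, paid every time a fresh instance is started.
startup_seconds = timeit.timeit(load_app, number=1)

# Steady-state cost per request once the instance is warm.
state = load_app()
runs = 200
per_request_seconds = timeit.timeit(lambda: handle_request(state), number=runs) / runs

print("startup: %.4fs, per request: %.6fs" % (startup_seconds, per_request_seconds))
```

If instances are restarted often, the startup figure matters as much as the per-request figure, which is the point made about JVM startup costs.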
The XPath/XSLT situation (to be euphemistic...) is not exactly perfect on either side, sigh, though I think it may be a tad less bad in the JVM (where, apparently, substantial subsets of Saxon can be made to run, with some care). I think it's worth opening issues on the Appengine Issues page with XPath and XSLT in their titles -- right now there are only issues asking for specific libraries, and that's myopic: I don't really care HOW a good XPath/XSLT is implemented, for Python and/or for Java, as long as I get to use it. (Specific libraries may ease migration of existing code, but that's less important than being able to perform such tasks as "rapidly apply XSLT transformation" in SOME way!-). I know I'd star such an issue if well phrased (especially in a language-independent way).
Last but not least: remember that you can have different versions of your app (using the same datastore), some of which are implemented with the Python runtime, some with the Java runtime, and you can access versions that differ from the "default/active" one with explicit URLs. So you could have both Python and Java code (in different versions of your app) use and modify the same data store, granting you even more flexibility (though only one will have the "nice" URL such as foobar.appspot.com -- which is probably important only for access by interactive users on browsers, I imagine;-).
Watch this app for changes in Python and Java performance:
http://gaejava.appspot.com/
(edit: apologies, link is broken now. But following para still applied when I saw it running last)
Currently, Python and using the low-level API in Java are faster than JDO on Java, for this simple test. At least if the underlying engine changes, that app should reflect performance changes.
Based on experience with running these VMs on other platforms, I'd say that you'll probably get more raw performance out of Java than Python. Don't underestimate Python's selling points, however: The Python language is much more productive in terms of lines of code - the general agreement is that Python requires a third of the code of an equivalent Java program, while remaining as or more readable. This benefit is multiplied by the ability to run code immediately without an explicit compile step.
With regards to available libraries, you'll find that much of the extensive Python runtime library works out of the box (as does Java's). The popular Django Web framework (http://www.djangoproject.com/) is also supported on AppEngine.
With regards to 'power', it's difficult to know what you mean, but Python is used in many different domains, especially the Web: YouTube is written in Python, as is Sourceforge (as of last week).
June 2013: This video is a very good answer by a google engineer:
http://www.youtube.com/watch?v=tLriM2krw2E
TL;DR:
Pick the language that you and your team are most productive with
If you want to build something for production: Java or Python (not Go)
If you have a big team and a complex code base: Java (because of static code analysis and refactoring)
Small teams that iterate quickly: Python (although Java is also okay)
An important question to consider in deciding between Python and Java is how you will use the datastore in each language (and most other angles to the original question have already been covered quite well in this topic).
For Java, the standard method is to use JDO or JPA. These are great for portability but are not very well suited to the datastore.
A low-level API is available but this is too low level for day-to-day use - it is more suitable for building 3rd party libraries.
For Python there is an API designed specifically to provide applications with easy but powerful access to the datastore. It is great except that it is not portable so it locks you into GAE.
Fortunately, there are solutions being developed for the weaknesses listed for both languages.
For Java, the low-level API is being used to develop persistence libraries that are much better suited to the datastore than JDO/JPA (IMO). Examples include the Siena project and Objectify.
I've recently started using Objectify and am finding it to be very easy to use and well suited to the datastore, and its growing popularity has translated into good support. For example, Objectify is officially supported by Google's new Cloud Endpoints service. On the other hand, Objectify only works with the datastore, while Siena is 'inspired' by the datastore but is designed to work with a variety of both SQL databases and NoSQL datastores.
For Python, there are efforts being made to allow the use of the Python GAE datastore API outside of GAE. One example is the SQLite backend that Google released for use with the SDK, but I doubt they intend this to grow into something production ready. The TyphoonAE project probably has more potential, but I don't think it is production ready yet either (correct me if I am wrong).
If anyone has experience with any of these alternatives or knows of others, please add them in a comment. Personally, I really like the GAE datastore - I find it to be a considerable improvement over the AWS SimpleDB - so I wish for the success of these efforts to alleviate some of the issues in using it.
I strongly recommend Java for GAE, and here's why:
Performance: Java is potentially faster than Python.
Python development is held back by a lack of third-party libraries. For example, there is no XSLT for Python/GAE at all. Almost all Python libraries are C bindings (and those are unsupported by GAE).
Memcache API: The Java SDK has more interesting abilities than the Python SDK.
Datastore API: JDO is very slow, but the native Java datastore API is very fast and easy.
I'm using Java/GAE in development right now.
As you've identified, using a JVM doesn't restrict you to using the Java language. A list of JVM languages and links can be found here. However, the Google App Engine does restrict the set of classes you can use from the normal Java SE set, and you will want to investigate if any of these implementations can be used on the app engine.
EDIT: I see you've found such a list
I can't comment on the performance of Python. However, the JVM is a very powerful platform performance-wise, given its ability to dynamically compile and optimise code during the run time.
Ultimately performance will depend on what your application does, and how you code it. In the absence of further info, I think it's not possible to give any more pointers in this area.
I've been amazed at how clean, straightforward, and problem free the Python/Django SDK is. However I started running into situations where I needed to start doing more JavaScript and thought I might want to take advantage of the GWT and other Java utilities. I've gotten just half way through the GAE Java tutorial, and have had one problem after another: Eclipse configuration issues, JRE versionitis, the mind-numbing complexity of Java, and a confusing and possibly broken tutorial. Checking out this site and others linked from here clinched it for me. I'm going back to Python, and I'll look into Pyjamas to help with my JavaScript challenges.
I'm a little late to the conversation, but here are my two cents. I really had a hard time choosing between Python and Java, since I am well versed in both languages. As we all know, there are advantages and disadvantages for both, and you have to take into account your requirements and the frameworks that work best for your project.
As I usually do with this type of dilemma, I looked for numbers to support my decision. I decided to go with Python for many reasons, but in my case, there was one plot that was the tipping point. If you search "Google App Engine" in GitHub as of September 2014, you will find the following figure:
There could be many biases in these numbers, but overall, there are three times more GAE Python repositories than GAE Java repositories. Not only that, but if you list the projects by the "number of stars" you will see that a majority of the Python projects appear at the top (you have to take into account that Python has been around longer). To me, this makes a strong case for Python because I take into account community adoption & support, documentation, and availability of open-source projects.
It's a good question, and I think many of the responses have given good view points of pros and cons on both sides of the fence. I've tried both Python and JVM-based AppEngine (in my case I was using Gaelyk which is a Groovy application framework built for AppEngine). When it comes to performance on the platform, one thing I hadn't considered until it was staring me in the face is the implication of "Loading Requests" that occur on the Java side of the fence. When using Groovy these loading requests are a killer.
I put a post together on the topic (http://distractable.net/coding/google-appengine-java-vs-python-performance-comparison/) and I'm hoping to find a way of working around the problem, but if not I think I'll be going back to a Python + Django combination until cold-starting Java requests has less of an impact.
Based on how much I hear Java people complain about AppEngine compared to Python users, I would say Python is much less stressful to use.
There's also project Unladen Swallow, which is apparently Google-funded if not Google-owned. They're trying to implement an LLVM-based backend for Python 2.6.1 bytecode, so they can use a JIT and various nice native code/GC/multi-core optimisations. (Nice quote: "We aspire to do no original work, instead using as much of the last 30 years of research as possible.") They're looking for a 5x speed-up over CPython.
Of course this doesn't answer your immediate question, but points towards a "closing of the gap" (if any) in the future (hopefully).
The beauty of Python nowadays is how well it communicates with other languages. For instance, you can have both Python and Java on the same table with Jython. Of course, even though Jython fully supports Java libraries, it does not fully support Python libraries. But it's an ideal solution if you want to work with Java libraries. It even allows you to mix it with Java code with no extra coding.
But even Python itself has made some steps forward. See ctypes, for example: direct access to C libraries, running at C speed inside the library, all without leaving the comfort of Python coding. Cython goes one step further, allowing you to mix C code with Python code with ease, or even if you don't want to mess with C or C++, you can still code in Python but use statically typed variables, which can bring your Python programs close to the speed of C apps. Cython is both used and supported by Google, by the way.
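As a minimal sketch of the ctypes approach mentioned above, the following calls C's `strlen` straight from Python. It assumes a POSIX system where the C standard library can be loaded, and the helper name `c_strlen` is made up for illustration. Since other answers here note that C bindings are unsupported on GAE, treat this as a local illustration rather than something to deploy there.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library("c") locates the C standard library;
# if it returns None, CDLL(None) falls back to symbols already loaded in the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature size_t strlen(const char *) so ctypes converts values correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data):
    """Return the length of a bytes string as computed by C's strlen."""
    return libc.strlen(data)


print(c_strlen(b"app engine"))  # prints 10
```

Declaring `argtypes` and `restype` up front is optional for simple cases, but it lets ctypes reject wrongly typed arguments instead of passing garbage to C.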
Yesterday I even found tools for Python to inline C or even Assembly (see CorePy); you can't get any more powerful than that.
Python is surely a very mature language, not only standing on its own, but able to cooperate with any other language with ease. I think that is what makes Python an ideal solution even in very advanced and demanding scenarios.
With Python you can have access to C/C++, Java, .NET and many other libraries with almost zero additional coding, while also getting a language that minimises, simplifies and beautifies coding. It's a very tempting language.
Gone with Python even though GWT seems a perfect match for the kind of app I'm developing. JPA is pretty messed up on GAE (e.g. no @Embeddable and other obscure undocumented limitations). Having spent a week, I can tell that Java just doesn't feel right on GAE at the moment.
One thing to take into account is the frameworks you intend to use. Not all frameworks on the Java side are well suited for applications running on App Engine, which is somewhat different from traditional Java app servers.
One thing to consider is the application startup time. With traditional Java web apps you don't really need to think about this. The application starts and then it just runs. It doesn't really matter if the startup takes 5 seconds or a couple of minutes. With App Engine you might end up in a situation where the application is only started when a request comes in. This means the user is waiting while your application boots up. New GAE features like reserved instances help here, but check first.
Another thing is the different limitations GAE poses on Java. Not all frameworks are happy with the limitations on what classes you can use or the fact that threads are not allowed or that you can't access the local filesystem. These issues are probably easy to find out by just googling about GAE compatibility.
I've also seen some people complaining about issues with session size on modern UI frameworks (Wicket, namely). In general these frameworks tend to do certain trade-offs in order to make development fun, fast and easy. Sometimes this may lead to conflicts with the App Engine limitations.
I initially started developing on GAE with Java, but then switched to Python for these reasons. My personal feeling is that Python is a better choice for App Engine development. I think Java is more "at home" on, for example, Amazon's Elastic Beanstalk.
BUT with App Engine things are changing very rapidly. GAE is changing itself and as it becomes more popular, the frameworks are also changing to work around its limitations.
