Different languages in a downloadable application - java

I would like your opinion on whether this idea sounds good to you, and if not, what you would do instead.
The goal of my project is to make a downloadable application that lets the user input a text file of experimental data, then performs calculations on the data to find statistical values such as the mean, standard deviation, and slope and intercept of the linear regression line. These are presented on the screen, as well as a scatter plot or histogram of the data.
For now, my plan has been to code the interface that the user interacts with in Java using the Swing library, and the part that performs the calculations in C. My reasons for doing this are that Java is good for GUIs that can be used on any machine, and C is faster at performing big calculations. One critical step in my project is to parallelize the code using the MPICH library so that my program can do things like generate many sets of randomized data and analyze them. The Java and C code would communicate by exchanging text files, and I have been told that I need to do some shell scripting to bridge the two together. The idea is that the Java code would give the C code the text file of the original data, the C code would do the calculations and report the statistical values in a text file, and then the Java code would read this file to present the results of the analysis to the user.
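Roughly, I picture the Java side of the handoff doing something like this (a sketch only; "statscalc" and the file names are placeholders, not code I have written):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;

    public class StatsBridge {
        // Sketch: "statscalc" stands in for the compiled C program.
        // With MPICH it might instead be launched via "mpiexec -n 4 ./statscalc ...".
        public static List<String> runAnalysis(Path inputData) throws IOException, InterruptedException {
            Path results = Paths.get("results.txt");
            ProcessBuilder pb = new ProcessBuilder("./statscalc", inputData.toString(), results.toString());
            pb.redirectErrorStream(true); // merge stderr into stdout for easier debugging
            Process p = pb.start();
            int exitCode = p.waitFor(); // block until the C program finishes
            if (exitCode != 0) {
                throw new IOException("statscalc failed with exit code " + exitCode);
            }
            return Files.readAllLines(results); // e.g. lines like "mean=3.14"
        }
    }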
The important characteristics of this downloadable application are:
Has a very clear, easy-to-use interface
Can be downloaded and used easily, ideally on all kinds of computers (Windows, Mac, Linux)
Takes advantage of parallelization to do big calculations faster
I am not very knowledgeable about these languages or environments, and I am having a few doubts about my plan.
I know that Java programs are easily downloadable as a jar file, but if I use Java and C, will my program still be easy to download and run on all kinds of machines, given the shell scripting?
Would it be best to do all my coding in one language and still preserve the important characteristics listed above? If so, what would I be losing by doing so compared to using two languages?
I appreciate your help!

Please read again some quotes from your own post.
First:
I am not very knowledgeable about these languages
But:
Java is good for GUIs
And:
C is faster at performing big calculations
So you do not really have knowledge of Java and C, yet you can state that Java is good for GUIs and that C is faster. Both statements are probably wrong.
For the last 15 years, people have tended to avoid implementing Swing UIs and desktop applications in Java. They typically move the calculations to a server and control the process through a web-based UI. (This probably does not apply to your use case if you really do have large input data sets, e.g. tens or hundreds of GB.)
Over the same period the JVM has improved significantly in terms of performance, so the assumption that C code runs faster than the same code written in Java may well be incorrect.
So you should probably implement everything in Java.
If you cannot move the calculations to a server, you can implement a Java application and run it using JNLP.
However, before you start, may I recommend that you ask a separate design/architecture question containing more details about the amount of input/output data and the nature of your calculations?

Let's address your characteristics first.
1. Ease of use - This is debatable, but I would claim that while certain languages make clean UIs easier, it is possible to build a good UI in any popular language with proper library support.
2. Portability - If you are planning to distribute in binary form and not in source form, then Java wins hands down. If you are planning to distribute in source form, then C is possible, but you will have to provide a means of building the application on each platform, which will usually differ from platform to platform.
3. Performance through parallel computation - For your kind of application (CPU-bound), C will likely be faster, but the difference may not matter enough for you to care.
Now let's talk about your doubts.
1. This is addressed in (2) above.
2. A single-language program is generally easier to maintain and distribute. You will save yourself a lot of headaches by not dealing with two languages. You might lose some performance if you choose only Java, but you will lose no portability by choosing only C rather than C plus Java, since the C part would already be non-portable. (See the sketch below for what the all-Java route can look like.)
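To make the all-Java option concrete, here is a minimal sketch of the core statistics using parallel streams in place of MPI-style parallelism. The file name and one-number-per-line input format are assumptions, not something from your post:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.DoubleStream;

    public class Stats {
        public static void main(String[] args) throws IOException {
            // One number per line; path and format are illustrative.
            double[] data = Files.lines(Paths.get("data.txt"))
                                 .mapToDouble(Double::parseDouble)
                                 .toArray();

            // Parallel streams spread the reduction over the available cores,
            // which covers "parallelize big calculations" without MPI.
            double mean = DoubleStream.of(data).parallel().average().orElse(Double.NaN);
            double variance = DoubleStream.of(data).parallel()
                                          .map(x -> (x - mean) * (x - mean))
                                          .sum() / (data.length - 1);

            System.out.printf("mean=%f stddev=%f%n", mean, Math.sqrt(variance));
        }
    }

Generating and analyzing many randomized data sets parallelizes the same way: make each replicate one task and let the runtime schedule them across cores.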

Related

Mixing python with a faster language for optimization in GAE

I'm a newbie in the Python and GAE world and I have a question.
With Python the normal approach is to only optimize the code when needed, fixing the more urgent bottlenecks.
And one of the ways to achieve that is by rewriting the most critical parts of the program in C.
By using GAE are we losing this possibility forever?
Since Google's Go language is now (or will be, as soon as it is compiled more efficiently) the fastest language on GAE, will there be a way to mix Python and Go in the same app?
What other ways could be used to achieve a similar result?
See Can I write parts of the Google App Engine code in Java, other parts in Python? for how to use multiple languages.
Basically, each version of a given app can only use one runtime language.
But, you can have two different versions of your app, written in different languages, and they can pass information back and forth through the datastore.
Also, you can have two different apps, in two different languages, and you can have them pass information back and forth through requests.
I think you're falling for premature optimisation here. For nearly all webapps, the majority of time spent is in RPCs, waiting for the rest of the system to do something such as process datastore queries. Of the remainder, a significant fraction is often spent in C code anyway. There are relatively few webapps that need to do a lot of processor-intensive work in order to serve a typical query.
If your app is one of those, you may want to reconsider writing your entire app in Python, given the unavailability of C extensions on App Engine, and choose Java or Go. If your app is one of the 99% that don't need to do much processor intensive work for typical requests, don't worry about it.

GUI in Java, Backend in SML?

I'm a big fan of functional programming languages (namely Standard ML and its dialects), mainly because of their expressiveness, which allows for very concise, clean code. I can solve many problems dramatically faster with ML than with, say, Java.
However, Java is really great when it comes to programming GUIs (->SWT). I would definitely not want to do that in a functional language.
This brings us to my actual question: Is there a good way to write a program in ML and then wrap it with a GUI written in Java?
What I have come up with so far is the following:
Compile the ML program (e.g. with MLton or Poly/ML) and execute the binary as an external program from Java (http://www.rgagnon.com/javadetails/java-0014.html).
Problem: The only way the Frontend/Backend can communicate is via Strings. This might require tons of (difficult) encoding/decoding.
Use JNI/JNA. From what I read, this will allow you to transfer integers, arrays, etc. I think the external programs have to be written in C/C++ for this to work. With MLton's foreign function interface I can write an interface to my functional program in C and statically link the whole thing.
Problem: Apparently, this only works with dynamic libraries, that is, DLLs on Windows. However, MLton will only let me compile the ML/C program to an executable. When trying to create a DLL, I get a whole bunch of errors.
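In case I ever do get a shared library out of MLton, my understanding is that the Java side with JNA (5.x) would look roughly like this. This is a sketch; the library name "mlstats" and the exported function "mean" are made up:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public class MLBridge {
        // Assumes a shared library (libmlstats.so / mlstats.dll) exporting
        // a plain C function: double mean(const double* xs, int n);
        // Library and function names are hypothetical.
        public interface MLStats extends Library {
            MLStats INSTANCE = Native.load("mlstats", MLStats.class);
            double mean(double[] xs, int n); // double[] maps to double*
        }

        public static void main(String[] args) {
            double[] xs = {1.0, 2.0, 3.0};
            System.out.println(MLStats.INSTANCE.mean(xs, xs.length)); // 2.0
        }
    }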
Does anyone have experience with this? Is there a better way to do this?
Thanks in advance! -Steffen
EDIT: I know about Scala, which tries to bring concepts from functional programming to Java. I have tried it, but I don't think it can compete with an actual functional programming language (in terms of expressiveness).
That's not quite an exact answer, but there is a functional language for the JVM that is very ML-oriented: Yeti.
So if you like coding in ML, that's probably the closest you can currently get on the JVM, and of course it integrates very well with all the Java APIs.
Is there a good way to write a program in ML and then wrap it with a GUI written in Java?
I don't know if this is a good way for small applications, but it is definitely a way, and one that works for big IDE-style stuff: Isabelle/ML vs. Isabelle/Scala/JVM. That is an application of interactive theorem proving, but plain SML programming is a trivial instance of it, in a sense.
So you can write basic Isabelle/ML code that emits some messages in the manner of the old-fashioned REPL, but the output can be interpreted by GUI components on the JVM side. Isabelle/jEdit does that routinely for pretty-printing of colored text, with a tiny little bit of rich text (sub/superscripts and bold).
Concerning explicit recoding of functional values over pipes/sockets as strings: that turns out to be quite simple in Isabelle/ML/Scala, due to some imitation of the way SML would represent typed values in untyped memory, but using untyped XML trees here instead of bits. The XML transfer syntax is special-purpose, to keep things simple: YXML instead of official quasi-human-readable XML. All of that fits into approx. 8000 bytes of SML source -- I am tempted to post the sources here, but it is better to search the web for "Isabelle YXML" or "YXML PIDE".
Since Scala/JVM alone has been mentioned as a standalone alternative: it definitely works. Scala is also very powerful and flexible in imitating many programming styles (higher-order functional, object-oriented), but for sophisticated symbolic applications like theorem proving it just won't reach the purity and stability of SML. (Note that the underlying SML platform here is Poly/ML.)

Alternative to Java

I need an alternative to Java, because I am working on a genetics-calculation project.
It takes a lot of memory and most of the CPU time, and therefore it won't work when I deploy it on a server, because many people use the program at the same time.
Does anybody know another language that does not run in a virtual machine and is similar to Java (object-oriented, with exceptions and type safety)?
Best regards,
Jonathan
To answer the direct question: there are dozens of languages that fit your explicit requirements. AmmoQ listed a few; Wikipedia has many more.
And I think that you'll be disappointed with every one of them.
Despite what Java haters want you to think, Java's performance is not much different from that of any other compiled language. Just changing languages won't improve performance much.
You'll probably do better by getting a profiler, and looking at the algorithms that you used.
Good luck!
If your app is consuming most of the CPU and memory on a single-user workstation, I'm skeptical that translating it into some non-VM language will help much. With Java you're depending on the VM for things like memory management, and you would have to re-implement an equivalent in your non-VM language. Also, Java's memory management is pretty good. Your application probably isn't real-time sensitive, so having it pause once in a while isn't a problem. Besides, you're going to be running this on a multi-user system anyway, right?
Memory usage has more to do with your underlying data structures and algorithms than with anything magical about the language. Unless you've got a really great memory-allocator library for your chosen language, you may find your program uses just as much memory (if not more) due to bugs.
Since your app is compute-intensive, some other language is unlikely to make it less so, unless you insert some strategic sleep() calls throughout the code to deliberately make it yield the CPU more often. This will slow it down, but will be nicer to the other users.
Try running your app with Java's -server option. That engages a VM designed for long-running programs, with a JIT that compiles your Java into native code. It may make your program run a bit faster, but it will still be CPU- and memory-bound.
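For example, launching with the server VM might look like this (the jar name and heap size are placeholders, not from the question):

    # "genetics.jar" and -Xmx4g are illustrative
    java -server -Xmx4g -jar genetics.jar

(On many modern 64-bit JVMs the server VM is already the default, so this may change nothing.)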
If you don't like C++, you might consider D, Objective-C, or the new Go language from Google.
You may try C++; it satisfies all your requirements.
Use Python along with the numpy, scipy and matplotlib packages. numpy is a Python package with all the number-crunching code implemented in C, so runtime performance (despite the Python virtual machine) won't be an issue.
If you want a compiled, statically typed language only, have a look at Haskell.
Can your algorithms be parallelised?
No matter what language you use you may come up against limitations at some point if you use a single process. Using something like Hadoop will mean you can retain Java and ease of use but you can run in parallel across many machines.
On the same theme as #Barry Brown's answer:
If your application is compute / memory intensive in Java, it will probably be compute / memory intensive in C++ or any other "more efficient" language. You might get some extra leeway ... but you'll soon run into the same performance wall.
IMO, you need to do the following things:
You need to profile your application and look for any major performance bottlenecks. You might find some real surprises.
In the light of the previous step, review the design and algorithms, paying attention to space and time complexity issues. Do some research to see if someone has discovered better algorithms for doing the computations that are problematic from a performance perspective.
If the previous steps don't get you ahead of the curve, see if you can upgrade your platform; get a bigger machine with more processors, more memory, etc.
If you are still stuck, your only other option is a scale-out design. Assuming that individual user requests are processed in a single thread, re-architect your system so that you can run "workers" across multiple servers, with a load balancer in front. If you have a persistent back-end, look into how you can replicate that. And so on.
Figure out if the key algorithms can be parallelized / distributed so that the resource intensive parts of a user request execute in parallel on multiple processors / multiple servers; e.g. using a "map-reduce" framework.
OK, so there is no easy answer. But simply changing programming languages is NOT a good answer.
Regardless of language your program will need to share with others when running in multiple instances on a single machine. That is simply the way computers work.
The best way to let your current program scale to the available hardware resources is to chop the work into small, independent pieces and have them implement the Callable interface. These can then be executed by a suitable Executor, chosen according to the available hardware. See the Executors class for many preconfigured versions. This is what I would recommend you do here; a sketch follows below.
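A minimal sketch of that pattern, assuming the work can be split into independent chunks (the chunking scheme and the expensiveComputation stand-in are invented for illustration):

    import java.util.*;
    import java.util.concurrent.*;

    public class WorkSplitter {
        public static void main(String[] args) throws Exception {
            // Size the pool to the machine it actually runs on.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // Each Callable is one small, independent piece of the computation.
            List<Callable<Double>> pieces = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                final int chunk = i;
                pieces.add(() -> expensiveComputation(chunk));
            }

            double total = 0;
            for (Future<Double> f : pool.invokeAll(pieces)) {
                total += f.get(); // combine partial results
            }
            pool.shutdown();
            System.out.println("total = " + total);
        }

        static double expensiveComputation(int chunk) {
            return chunk * 0.5; // stand-in for the real genetics calculation
        }
    }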
If you want to switch languages, then Mac OS X 10.6 allows programming in the style described above with C and Objective-C, and if you do it properly OS X can distribute the work over all available computing resources (CPU, GPU, and whatever else is there).
If none of the above is interesting to you, then consider one of the Grid frameworks. Terracotta may be a good place to start.
F#, Ruby, or Python; they are very good for calculations, and many other things.
NASA uses Python.
Well... I think you are looking for C#.
C# is object-oriented and has excellent support for generics. You can use it to write both WinForms and server-side applications.
You can read more about C# generics here: http://msdn.microsoft.com/en-us/library/ms379564(VS.80).aspx
Edit:
My mistake: geneTIcs, not geneRIcs. It does not change the fact that C# will do the job, and using generics will reduce the load significantly.
You might find the computer language shootout here interesting.
For example, here's Java vs C++.
You might find OCaml (from which F# is derived) worth a look; it meets your requirements for OO, exceptions, and static types, and it has a native compiler. However, according to the shootout, you may be trading less memory for lower speed.

Simple Programming Theory Question

Hi guys, does anyone know why the programming language C++ is used more widely in biometric security applications than Java? The answers that I have collected so far are (1) virtual compilers and (2) the OpenCV library provided for C++. Can anyone help with this question?
Maybe it's the hardware support: I wrote an app that uses a fingerprint sensor. The library support for the device is C++, so I wrote the app in C++. Now they have a .NET version, so my next app will be C#.
I don't know specifically about biometric applications, but in general, when security is important, Java can be a stumbling block. Depending on how the security requirements are written, they can cover things that one must do manually in C++ but which are done automatically by Java. This poses a problem because one would need to demonstrate that Java properly (and in a timely manner!) satisfies the requirement. It is a lot easier to show that these requirements are met in C++ code, because the code that meets the requirement is part of the program in question.
If the security person/requirements/customer make it clear that relying on Java for some security features is acceptable, then this is no big deal. We could go round-and-round about whether or not it is reasonable to rely on/trust Java to satisfy security requirements, it really just depends on the specific security needs.
I am willing to put money on the reason being simply that the access APIs for the hardware are written in C++. Most modern/higher-level languages will not easily communicate with hardware originally exposed through a C/C++ API.
On a somewhat related note, Vala has all the language features expected of a modern/high-level language (and then some), but compiles to C source and binaries, and can easily make use of any library written in C (I'm not sure about C++). Check it out; I haven't used it much, but it's pretty cool.
Implementing a library in C++ provides a lot over Java. Once written, a C++ library can run on almost any platform (including embedded ones) and can be made available as a native import to a variety of other languages through tools like SWIG. Java can only run on something with enough speed and memory to run a JVM, and only other Java programs can include the code as a native import. For biometric applications especially, I think running on embedded systems would be a large concern, since you could build this into a small sensor.
The more glib answer would be that no one wants to wait for your garbage-collection cycle to launch the friggen missiles.
You could replace Java with any other language there. Probably it has more to do with the APIs and hardware.
Also, Java is more suited to web applications. It's not the best choice for desktop applications.
For some biometric applications, execution speed is crucial.
For instance, let's say you're doing facial recognition for a checkpoint, and Java takes twice the time to run the algorithm that a compiled language like C++ does. That means if you go with Java, either:
The checkpoint lines will be twice as long,
You'll have to pay to staff twice as many checkpoints, or
Your system will do half as good a job at recognizing faces
None of those are usually acceptable options, which makes using Java a non-starter.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It can either write locally to CSV files or write remotely, via a connection to a Java server, into an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and as I said, some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place rather than scattered across dozens of plain-text files. Furthermore, since it is a binary file format, we should get significant space savings compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from the simulation also takes place in Java, reading in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database (making joins across "tables" (CSV files) based on foreign keys (the unique agent IDs)). It is also difficult in this system to visualize the data in other ways (e.g. with Prefuse, Processing, or JMonkeyEngine, or by getting some subset of the raw data into MATLAB or SPSS to play with).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given that this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but that paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, which is way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need: it provides two different ways of querying data, either through Python list comprehensions or through [in-kernel (C level) searches][5].
The proposed architecture I envision is this:
What I'm not really sure how to do is link together the Python code that will be written for querying, the Java code that serves up the HDF5 files, and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries, and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should run on the same machine as the Datahose so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of them would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the data files seems solid. However, one design constraint that wasn't mentioned is among the most critical in all data processing: which of these data processing tasks can be done in batch-processing style, and which are more of a live stream?
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed network-attached storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch-processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means the server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when it queries the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This is as simple as finding libraries that support both Java and Python and trying them; the API you choose to implement should be the same regardless of library. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices.) But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable; see the sketch just below.
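For instance, the intermediary-file approach from the Java side might look roughly like this (the directory layout and the "QUERY ..." line format are conventions invented for illustration; Files.writeString requires Java 11+):

    import java.io.IOException;
    import java.nio.file.*;

    public class RequestWriter {
        // Writes a query request where the Python/PyTables process can pick it up.
        // "work/pending" and the request format are hypothetical conventions.
        public static void submit(String query) throws IOException {
            Path pending = Paths.get("work/pending");
            Files.createDirectories(pending);
            Path tmp = Files.createTempFile(pending, "req-", ".tmp");
            Files.writeString(tmp, "QUERY " + query + System.lineSeparator());
            // Atomic rename so the reader never sees a half-written file.
            Path done = tmp.resolveSibling(tmp.getFileName().toString().replace(".tmp", ".req"));
            Files.move(tmp, done, StandardCopyOption.ATOMIC_MOVE);
        }
    }

The atomic rename is the one important design point: the Python side should only ever see complete request files.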
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods this application can use to communicate.
What data it actually wants. (Is this always the same, or does it change on a whim depending on other requirements?)
How often it needs the data.
Approximately how many resources the application needs.
What the application does now that it doesn't do well.
What the application could do now that would help, but that it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation: you have quite a bit of Java post-processing code doing "joins" on tables of data in CSV files; that's a "does now but doesn't do well". So you look at the other side of that communication to see whether the other side can do that thing well. At this point the other side is the CSV file, and before that the simulation, so no, there is nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the fewest operations, and the least complex ones. Sometimes one application handling the queries for all the other parts of your architecture is simplest. Sometimes an application handling "live" queries and a separate application handling "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great: the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, with the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than the off-the-shelf solution that is certain to work very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How do I get Python and Java to play nice together?" is simply "use a language-agnostic communication method." The truth of the matter is that Python and Java are both unimportant to the problem set you describe. What's important is the data flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
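A minimal sketch of that fork-and-forget pattern, assuming a hypothetical PyTables script called query.py (the script name and its flags are placeholders):

    import java.io.IOException;

    public class QueryLauncher {
        // "query.py", "--file", and "--where" are placeholders for whatever
        // interface the PyTables script actually exposes.
        public static Process launch(String h5File, String expression) throws IOException {
            ProcessBuilder pb = new ProcessBuilder(
                    "python", "query.py", "--file", h5File, "--where", expression);
            pb.inheritIO(); // let the script's output show up in our console
            Process p = pb.start();
            // No waitFor() here: the Java simulation keeps running while the
            // Python backend works in the background.
            return p;
        }
    }

The returned Process handle lets the Java side check on, or wait for, the query later if it ever needs the result.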
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette; I couldn't fit all my comments into a normal comment, and the post has had no activity for 8 months.
Just wanted to see how this is going for you. We have a very, very, very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary, we have to write or modify hand-written code to produce it. Our binary files are about 10 GB each, and there is one for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?
