I am looking for something that will make it easy to run (correctly coded) embarrassingly parallel JVM code on a cluster (so that I can use Clojure + Incanter).
I have used Parallel Python in the past to do this. We have a new PBS cluster and our admin will soon set up IPython nodes that use PBS as the backend. Both of these systems make it almost a no-brainer to run certain types of code in a cluster.
I made the mistake of using Hadoop in the past (Hadoop is just not suited to the kind of data I use) - the job-startup latency meant even small runs took 1-2 minutes to execute.
Is JPPF or GridGain better for what I need? Does anyone here have experience with either? Is there anything else you can recommend?
Check out cascalog - http://github.com/nathanmarz/cascalog
Clojure is reported to work on Terracotta, subject to some patching.
Look at Skandium. Alternative licenses to the GPL can be negotiated with the developers upon request.
Edit:
The link above is no longer live, so here is the GitHub link instead:
https://github.com/mleyton/Skandium
I was wondering if there is any way to simulate some workers on my local machine. I know Spark can be set up locally and everything can be fully tested that way. However, I was wondering whether it is possible to actually simulate workers, with the idea of seeing how the repartitioning of work, the loads, and the DAG dynamics play out.
I can also think of other ways this would help me, for example debugging and tracing data transformations. What I want is to develop my program in a testing fashion, in an optimal way; I do not want to rely on the big theory behind shuffling and expensive operations. Or are we doomed to test this by trial and error on real clusters? Thanks!
Yes, this is possible; please have a look at this link: https://sparkbyexamples.com/spark/apache-spark-installation-on-windows/
Also, if you set up the Spark history server on your local machine, you can trace transformations, the DAG, etc.
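In case it helps, here is a minimal sketch of running Spark fully in-process (the class and app names are illustrative): setting the master to local[4] gives you four worker threads, and the web UI at http://localhost:4040 shows stages, shuffles and the DAG while the job runs.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalSparkSketch {
        public static void main(String[] args) {
            // "local[4]" runs Spark in-process with 4 worker threads -
            // enough to observe partitioning and shuffles in the web UI.
            SparkConf conf = new SparkConf()
                    .setAppName("local-simulation")
                    .setMaster("local[4]");

            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // 8 elements spread over 4 partitions.
                JavaRDD<Integer> numbers =
                        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
                // groupBy forces a shuffle, so the DAG gains a second stage.
                long groups = numbers.groupBy(n -> n % 2).count();
                System.out.println("groups: " + groups);
            }
        }
    }

Spark also has a local-cluster master that its own test suite uses to simulate separate executor processes, but it is not really meant for end users; local[N] plus the history server answers most partitioning/DAG questions.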
I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead connected with starting a job. I'm looking for a solution that can be used this way: jobs that can be easily scaled across multiple machines but which do not require much input data. What is more, I want to run machine learning jobs, e.g. applying a previously trained neural network, in real time.
What libraries/technologies can I use for these purposes?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought of the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice.
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
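I have not tried this myself, but to give a flavor of the programming model, here is a minimal topology sketch in the style of the Storm tutorial examples (TestWordSpout ships with Storm's testing utilities; the bolt and the names are illustrative):

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class ExclamationTopology {

        // A bolt that appends "!" to every word it receives.
        public static class ExclaimBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                collector.emit(new Values(tuple.getString(0) + "!"));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // A spout emitting random words, run with 2 parallel instances.
            builder.setSpout("words", new TestWordSpout(), 2);
            // A bolt consuming the spout's stream, with 4 parallel instances.
            builder.setBolt("exclaim", new ExclaimBolt(), 4)
                   .shuffleGrouping("words");

            // LocalCluster runs the whole topology in-process for testing;
            // on a real cluster you would use StormSubmitter instead.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("demo", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }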
Given the fact that you want a real-time response in the "seconds" range, I recommend something like this:
Set up a batch processing model for pre-computing as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed.
For this you should look at either the mentioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple and store the precomputed values all in memory and simply do the last aggregation/filter/sorting/... steps in memory. If you can do that you can really scale because each node can run completely independently of all others.
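As a purely hypothetical sketch of that last idea, assuming a simple key-to-count use case (the class and method names are made up):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Merge nightly batch results with "last second" events, all in memory.
    public class HybridCounter {
        // Loaded from the batch job (e.g. the daily Hadoop/Mahout run).
        private final Map<String, Long> precomputed = new ConcurrentHashMap<String, Long>();
        // Updated live as events arrive; reset when a fresh batch is loaded.
        private final Map<String, Long> realtimeDelta = new ConcurrentHashMap<String, Long>();

        // Not atomic across the two maps; good enough for a sketch.
        public void loadBatch(Map<String, Long> batchResults) {
            precomputed.clear();
            precomputed.putAll(batchResults);
            realtimeDelta.clear();
        }

        public synchronized void onEvent(String key) {
            Long old = realtimeDelta.get(key);
            realtimeDelta.put(key, old == null ? 1L : old + 1L);
        }

        // A query only touches local memory, so each node scales independently.
        public long count(String key) {
            Long base = precomputed.get(key);
            Long delta = realtimeDelta.get(key);
            return (base == null ? 0L : base) + (delta == null ? 0L : delta);
        }
    }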
Perhaps having a NoSQL backend for your realtime component helps.
There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...
It all depends on your real application.
Also try S4, initially released by Yahoo!; it is now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively, though.
What you're trying to do would be a better fit for HPCC, as it has both the back-end data processing engine (equivalent to Hadoop) and a front-end real-time data delivery engine, eliminating the need to increase complexity through third-party components. A nice thing about HPCC is that both components are programmed using the same exact language and programming paradigms.
Check them out at: http://hpccsystems.com
Two Java programs have to communicate with each other. To do that, I found two possibilities:
Using Sockets
Using JavaSpaces
After looking into the descriptions, I found that JavaSpaces is apparently the better solution. Sadly, I can't get it to run. Every tutorial routes me to another installation process, to other files, and so on... :(
How do I install JavaSpaces, where do I download it, etc.?
If someone offers me a better solution, I'll be thankful too (JavaSpaces seems to be from 2005).
These are the websites I found so far:
http://www.jroller.com/matsh/entry/intreagued_by_javaspaces_try_blitz
(Installation description, not working...)
http://www.jini.org/wiki/Main_Page
(Download links are broken)
http://www.jarvana.com/jarvana/inspect/com/sun/jini/jini-starterkit/2.1/jini-starterkit-2.1.zip?folder=jini2_1/
(Download of jini starter kit)
For one quick start using GigaSpaces, a commercial JavaSpaces product (with a community edition available), see http://www.gigaspaces.com/wiki/display/XAP8/Data+Grid+Quick+Start
Also see http://replay.waybackmachine.org/20070202031207/http://www.theserverside.com/tt/articles/article.tss%3Fl%3DUsingJavaSpaces and http://www.theserverside.com/news/thread.tss?thread_id=42164 and http://www.enigmastation.com/?page_id=425
JavaSpaces is great, IMO (I'm biased, as I work for GigaSpaces... but then again, I work for GigaSpaces because I think the underlying technology is great). It's got a very simple API, but the transaction model is actually pretty strong, and it's very fast. It's simpler and stronger than JMS, and has a simpler deployment/connection model.
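To give a feel for how simple the API is, here is a minimal sketch of the core write/take pattern; obtaining the space proxy itself is deployment-specific (a Jini lookup or a vendor API), so it is stubbed out:

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // An entry is a plain class with public, non-primitive fields
    // and a public no-argument constructor.
    public class Task implements Entry {
        public String id;
        public String payload;

        public Task() {}

        public Task(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Once you have a JavaSpace proxy (deployment-specific, omitted):
    //
    //   JavaSpace space = ...;
    //
    //   // Producer JVM: write a task into the space.
    //   space.write(new Task("42", "compute something"), null, Lease.FOREVER);
    //
    //   // Consumer JVM: block until a matching entry appears, then remove
    //   // it atomically. Null fields in the template act as wildcards.
    //   Task taken = (Task) space.take(new Task(), null, Long.MAX_VALUE);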
If you're GigaSpaces-averse for some reason ("yikes, someone makes money from this") you can look into Blitz as well.
In Feb 2009 another user on SO mentioned that GigaSpaces is a mature implementation of JavaSpaces.
Looking at that older question, I'm starting to believe that JavaSpaces is dead..!?
Have you also considered something like RMI, which makes it transparent that you are invoking a method on a remote system?
Or JMS where you just send and read messages -- and the infrastructure routes them to the right place/process?
Or how about another approach where you have a network cache (e.g. memcached) where both processes can put and get items to/from the cache -- thus allowing for inter-process communication to a certain extent?
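Of those three, RMI is the quickest to try with a plain JDK. A minimal sketch (the interface, names and port are illustrative):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // The remote interface that both JVMs share.
    interface Greeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    public class RmiServer implements Greeter {
        public String greet(String name) {
            return "Hello, " + name;
        }

        public static void main(String[] args) throws Exception {
            // Export the object and register it under a well-known name.
            Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new RmiServer(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("greeter", stub);
            System.out.println("Server ready");
        }
    }

    // The second JVM then calls it as if it were local:
    //   Registry registry = LocateRegistry.getRegistry("localhost", 1099);
    //   Greeter greeter = (Greeter) registry.lookup("greeter");
    //   System.out.println(greeter.greet("world"));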
I've got an app running on a grid of uniform Java processes (potentially on different physical machines). I'd like to collect CPU usage statistics from a single run of this app. I've gone over profiling tools looking for an option for automatic collection of the data, but failed to find any in NetBeans, TPTP, jvisualvm, YourKit, etc.
Maybe I'm looking in a wrong way?
What I was thinking is:
run the processes on the grid with some special setup that allows them to dump profiling info
run my app as usual - it will push tasks to the grid, the processes will execute the tasks and publish profiling info
use some tool to collect and analyze the profiling results
but I can't find anything even remotely similar to this.
Any thoughts, experience, suggestions?
Thank you!
If you have allowed remote JMX access and you are using Sun JDK 1.6, then try jvisualvm. It has the option of a remote JMX connection, though I haven't used it for profiling CPU in a distributed environment.
Note: for CPU profiling, your application should be running on Sun JDK 1.6 or above.
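For reference, remote JMX access is enabled with standard system properties along these lines (disabling authentication and SSL like this is only sensible on a trusted network; MyGridProcess is a placeholder for your main class):

    java -Dcom.sun.management.jmxremote \
         -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         MyGridProcess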
Have a look at these links:
JVisualVM
JVisualVM - Working with Remote Applications
Get heap dump from a remote application in Java using JVisualVM
Unable to profile JBoss 5 using jvisualvm
http://www.taranfx.com/java-visualvm
I have used CA Introscope for this type of monitoring. It uses instrumentation to collect metrics over time. As an example, it can be configured to provide you a view of all nodes and their performance over time. From that node view, you can drill down to the method level to help you figure out where your bottlenecks are.
Yes, it will provide CPU utilization.
It's a commercial $$$ tool, but it's a great tool for collecting, monitoring and interrogating performance data.
If you look at something like Zabbix (though there are tons of other monitoring tools), it allows gathering data via JMX from a Java app. If you enable JMX in your app and allow it to be queried externally (via TCP/IP), you will have access to a lot of the HotSpot internals (free memory etc.) as well as thread stacks, and you can have these values graphed too. It does need configuration, but I don't think what you're looking for can be done with one line of script.
Just to add that the profiling information on each node usually contains timestamps.
To match these timestamps, all machines should keep (very nearly) the same time, with a delta of at most about 10 ms, so the cluster nodes should synchronize against a single network time (NTP) server.
You can use some JMX library, e.g. jmxterm, and wrap it in some code to connect to multiple hosts and poll them for changes. If you are a bit familiar with Python, look at my simple script here for some inspiration: http://rostislav-matl.blogspot.com/2011/02/monitoring-tomcat-with-jmxterm.html
http://www.hyperic.com/products/open-source-systems-monitoring
I never tried the other tools mentioned in other answers; I was more than satisfied with Hyperic.
It also exposes a web services API, which you can use to write your own analysis tools.
If you know the critical paths you want to analyse, I would suggest timestamping your process in key places and combining the logs yourself; a minimal helper is sketched below. This is likely to be a useful addition to your profiling, can be used in production, and may be even more useful as a result. (It is for my project.)
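Something along these lines, as a hypothetical sketch (the node.id property and the output format are made up; any stable node identifier and any log sink will do):

    // Log a wall-clock and a monotonic timestamp at key points, tagged with
    // a node id so the logs from all nodes can be merged and sorted later.
    public final class Trace {
        private static final String NODE = System.getProperty("node.id", "unknown");

        private Trace() {}

        public static void mark(String event) {
            // currentTimeMillis for cross-node correlation (needs NTP-synced
            // clocks); nanoTime for precise intra-node intervals.
            System.out.printf("%d %d %s %s%n",
                    System.currentTimeMillis(), System.nanoTime(), NODE, event);
        }
    }

    // At the critical points:
    //   Trace.mark("task-received");
    //   ... do the work ...
    //   Trace.mark("task-finished");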
I have used YourKit to monitor a number of processes at once. It can show you what is happening in each in real time and collect the results when all is finished.
I don't know if it provides a combined view of what is happening.
I was looking for something similar and found Hyperic
The claim is that the tool can monitor most common applications and systems, gather all the information, and present it in a convenient fashion.
To be honest this is still on my todo list, so I can't say whether it will do the job or not. Anyway, it seems impressive.
Can anyone point me to a project out there that I can download and run, and that would load/stress test itself and then provide me with reports? I want the project to be as big as possible and to involve as many Java components as possible, and I need it to be free... or point me to some very good ready-made results on the web that I can look at to reach a decision. Thanks!
The main issue to benchmark is which would run it faster/better: Solaris or Linux.
Linux and Solaris are not much different seen from your perspective, and I do not believe that the benchmark you ask for exists. A much better approach is to take the application you want to run - which hopefully is platform independent already - deploy it to the architectures you want to test, and then attach with jvisualvm and apply your standard test suite.
This will give you quite a good look at the performance without skewing with heavy profiling.
My guess is that for identical configurations you will see that Linux is slightly better than Solaris, as the amount of unused memory available for disk caching strongly influences the performance of the system. Note that expert system tuning can also make a big difference, but I believe you are most interested in the "out of the box" performance.
It really depends on the aspects of usage you want to benchmark.
I did this for database applications; in that area, TPC could be helpful.
I would recommend googling: benchmark java numeric|transaction|rendering|olap, depending on the characteristics of your use case.
Edit: regarding your comment about a Java app running on an application server: first check, from the back-end DB server, what the maximal throughput is (TPC can help here), then write a multithreaded benchmark client to check what the business logic's performance is; a skeleton of such a client is sketched below. The last step would be to involve the web servers using Apache JMeter. This procedure allows you to tune all relevant parameters, from the OS to DB pool sizes, etc.
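A hypothetical skeleton of such a client (businessCall and all the numbers are placeholders for your real service invocation and load profile):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class BenchClient {
        public static void main(String[] args) throws Exception {
            final int threads = 16;          // concurrency level under test
            final int callsPerThread = 1000; // requests issued by each worker

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            final AtomicLong totalNanos = new AtomicLong();

            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        for (int i = 0; i < callsPerThread; i++) {
                            long s = System.nanoTime();
                            businessCall(); // stand-in for the real invocation
                            totalNanos.addAndGet(System.nanoTime() - s);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);

            long calls = (long) threads * callsPerThread;
            double elapsedSec = (System.nanoTime() - start) / 1e9;
            System.out.printf("throughput: %.1f calls/s, mean latency: %.2f ms%n",
                    calls / elapsedSec, totalNanos.get() / (double) calls / 1e6);
        }

        private static void businessCall() {
            // Replace with a call into your business logic / remote service.
        }
    }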