Is "Adopting MapReduce model" = Universal answer to scalability?

Is "Adopting MapReduce model" = Universal answer to scalability? - java

I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool here, in which data transformation happens outside of source and destination data sources (databases). Hence,the source data source is purely used for extract and destination for load.
So, this act of transformation today, say takes about X hours for a million records. I would like to address a scenario where I would have a billion records, but I would want the work done in the same X hours. So, here is the need, for my product to scale out (adding more commodity machines) based on the scale of data. As you can see, I am only worried about the ability of distributing my product's transformation functionality to different machines, there by, leveraging CPU power from all these machines.
I started looking for options and I came across Apache Hadoop and then eventually the concept of MapReduce. I was pretty successful in settin up Hadoop quickly without running into issues in cluster mode and was happy to run a wordcount demo too. Soon, I realized that for implementing my own MapReduce model, I would have to redefine my product's transformation functionality into MAP and REDUCE functions.
Here's when trouble began. I read a copy of Hadoop: Definitive Guide, and I understood that many of the common use cases of Hadoop are in scenarios where one is faced with:
Unstructed data and one would like to perform aggregation/ sort/ or something of that kind.
Unstrucuted text and there is a need to perform mining
etc!
Here is my scenario where I extract from a database and load to a database (which has structured data), and my sole purpose is about bringing in more CPUs into play, in a reliable manner, and there by distribute my transformation. And redefining my transformation to fit a Map and Reduce model makes it a huge challenge in itself. So here are my questions:
Have you used Hadoop in ETL
scenarios? If yes, could be specific
about how you handled MapReducing of
your transformation? Have you used
Hadoop purely for leveraging extra
CPU power?
Is MapReduce concept the
universal answer to distributed
computing? Are there other equally
good options?
My understanding is
that MapReduce applies to large
dataset for
sorting/analytics/grouping/counting/aggregation/etc,
is my understading correct?

If you want to scale-out a processing problem over a lot of systems you must do two things:
Make sure you can process the information in independent parts.
There should be NO shared resource that is needed among these parts.
If there are dependencies then these will be the limit in your horizontal scalability.
So if you are starting from a relational model then the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases but is a pain in the ... when trying to scale-out.
The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focussed around the part you want to do the processing around. Then you can disribute them over a huge cluster and after the processing has been completed you use the results.
If you cannot do such a jump you're in trouble.
So coming back to your questions:
# Have you used Hadoop in ETL scenarios?
Yes, the input being Apache logfiles and the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wan't put in a normal RDBMS!
# Is MapReduce concept the universal answer to distributed computing? Are there other equally good options?
MapReduce is a very simple processing model that will work great for any processing problem you are able to split into a lot of smaller 100% independent parts. The MapReduce model is so simple that as far as I know any problem that can be split into independent parts can be written as series of mapreduce steps.
HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want "realtime" processing you are currently out of luck.
I don't know of a better model at this moment that an actual implementation exists for.
# My understanding is that MapReduce applies to large dataset for sorting/analytics/grouping/counting/aggregation/etc, is my understading correct?
Yep, that is the most common application.

MapReduce is "one" solution for "some" class of problems. It does not solve all the distributed systems problems - think about large TPS systems as the ones in banks or telecoms or telco signaling - there MR might be ineffective. But for the non real-time data processing MR performs awesome and you might consider it for massive ETL.

I cannot answer #1, as I haven't used MapReduce in ETL scenarios. However, I can say that MapReduce is not an "universal answer" for distributed computing; it's a useful tool for handling certain types of situations, where data is structured in a certain way. Think of it like a hashtable; very useful for certain situations, but not an "ultimate algorithm" by any definition of terms.
My personal understanding is that MapReduce is particularly useful for large quantities of "understructured" data; that is, it's useful for imposing some structure (basically, effectively providing a "first order" operation on large unstructured datasets). However, for datasets that are very large and relatively "tightly bound" (i.e. strong association between disparate data elements), it's (in my understanding) not a great solution.

Related

Why to use Hadoop? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I am little confused about the usage of Hadoop. I dont understand when & where to use Hadoop.
Hadoop is an open-source framework that allows to store and process
big data in a distributed environment across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage.
According to the definition, this job also gets done by other databases like Oracle, MSSQL, etc i.e. storing & processing data across clusters. Then what else is the advantage of using Hadoop ?

Hadoop is basically a distributed file system (HDFS) - it lets you store large amount of file data on a cloud of machines, handling data redundancy etc.
On top of that distributed file system, Hadoop provides an API for processing all that stored data - Map-Reduce.
The basic idea is that since the data is stored in many nodes, you're better off processing it in a distributed manner where each node can process the data stored on it rather than spend a lot of time moving it over the network.
Unlike RDMS that you can query in realtime, the map-reduce process takes time and doesn't produce immediate results.
On top of this basic scheme you can build a Column Database, like HBase.
A column-database is basically a hashtable that allows realtime queries on rows.
As per my knowledge, there are lot of differences. Please read below differences.
Hadoop is not a database. Hbase or Impala may be considered databases but Hadoop is just a file system (hdfs) with built in redundancy, parallelism.
Traditional databases/RDBMS have ACID properties - Atomicity, Consistency, Isolation and Durability. You get none of these out of the box with Hadoop. So if you have to for example write code to take money from one bank account and put into another one, you have to (painfully) code all the scenarios like what happens if money is taken out but a failure occurs before its moved into another account.
Hadoop offers massive scale in processing power and storage at a very low comparable cost to an RDBMS.
Hadoop offers tremendous parallel processing capabilities. You can run jobs in parallel to crunch large volumes of data.
Some people argue that traditional databases do not work well with un-structured data, but its not as simple as that. There are many applications built using traditional RDBMS that use a lot of unstructured data or video files or PDFs that I have come across that work well.
Typically RDBMS will manage a large chunk of the data in its cache for faster processing while at the same time maintaining read consistency across sessions. I would argue Hadoop does a better job at using the memory cache to process the data without offering any other items like read consistency.
Hive SQL is almost always a magnitude of times slower than SQL you can run in traditional databases. So if you are thinking SQL in Hive is faster than in a database, you are in for a sad disappointment. It will not scale at all for complex analytics.
Hadoop is very good for parallel processing problems - like finding a set of keywords in a large set of documents (this operation can be parallelized). However typically RDBMS implementations will be faster for comparable data sets.

RDBMS is not capable of processing big data in a cost effective way. As the size of the data increases, the RDBMS systems which uses vertical scalability techniques will not work well. In this place, the big data processing frameworks such as hadoop work well in a cost effective way.
Most of the big data processing frameworks are opensource and designed to run on commodity hardware. So the cost will be very less compared to that of RDBMS required for the same set up.
In simple words, bigdata starts from where the RDBMS stops due to data size and complexity.
Another point is RDBMS mainly deals with structured data. But most of the big data frameworks can deal with Structured, Unstructured and Semi-structured data. Most of the big data frameworks are designed to process any kind of large data.

Distribute data and computation. The computation local to data prevents the network overload. Tasks are independent so, it is easy to handle partial failure. Here the entire nodes can fail and restart.
It avoids crawling horrors of failure and tolerant synchronous distributed systems.Linear scaling in the ideal case. It used to design for cheap, commodity hardware. Simple programming model. The end-user programmer only writes map-reduce tasks.

Solution to provide shared entities between multiple Java processes

I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that i don't generate the flows, i just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is i don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications i mentioned were built for distributed systems, but it's not the case here.
easy to learn/use (i want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source

You could try a seperate java process that co-ordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.

If you want to do only exchange of objects withing java applications, you could also looki into tuple spaces. There are specific implementations of spaces for java, JavaSpaces, which should be able to do what you need. Not sure if they can keep up with the performance though. Also I’m not sure how widely this technology is still being used, since it only supports Java and isn’t as flexible as NoSQL stores would be these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis, you have notifications there and it can for sure scale to the requirements you are looking for.

The old (legacy?) solution is JavaSpaces. However, from an software architects point of view I would say distributed caches are the replacements for that nowadays. Especially take a look at hazelcast and infinispan.
From the performance viewpoint I am not happy with the performance of the "big" distributed caching solutions, when only a single in-memory cache is needed, see my writeup on the cache2k benchmarks page (hazelcast needs to be added here).
Anyways, please clarify your problem statement first, because your question falls into the XyProblem category. You are not describing the actual problem, and your question just boils down to "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? Who is it accessed? What consistency guarantees need to be met, considering the fact that writing and reading is in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.

Smaller scale Java distributed programming

I'm learning a bit more about hadoop and its applications, and I understand it is geared toward massive datasets and large files. Let's say I had an application in which I was processing a relatively small number of files (say 100k), which isn't a huge number for something like hadoop/hdfs. However, it does take a macro amount of time to run on a single machine, so I'd like to distribute the process.
The problem can be broken down into a map reduce style problem (e.g. each of the files can be processed independently and then I can aggregate the results). I'm open to using infrastructure such as Amazon EC2, but I'm not so sure about what technologies to be exploring for actually aggregating the results of the process. Seems like hadoop might be a bit overkill here.
Can anyone provide guidance on this type of problem?

First off, you may want to reconsider your assumption that you can't combine files. Even images can be combined- you just need to figure out how to do that in a way that allows you to break them out again in your mappers. Combining them with some sort of sentinel value or magic number between them might make it possible to turn them into one giant file.
Other options include HBase, where you could store the images in cells. HBase also has a built-in TableMapper and TableReducer, and can store the results of your processing alongside the raw data in a semi-structured way.
EDIT: As for the "is Hadoop overkill" question, you need to consider the following:
Hadoop adds at least one machine of overhead (the HDFS NameNode). You typically dont want to store data or run jobs on that machine, since it is a SPOF.
Hadoop is best suited for processing data in batch, with relatively high latency. As #Raihan mentions, there are several other FOSS distributed compute architectures that may server your needs better if you need realtime or low-latency results.
100k files isn't so very few. Even if they are 100k each, that's 10GB of data.
Other than the above, Hadoop is a relatively low-overhead way of approaching distributed computing problems. It has a huge, helpful community behind it, so you can get help quickly if you need it. And it is focused on running on cheap hardware and a free OS, so there really isnt any significant overhead.
In short, I'd try it before you discard it for something else.

Is Hadoop right for running my simulations?

have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations, across multiple input parameter configurations, and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.
I've been reading about Hadoop and wondering if it can help me running lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
The thing I'm worried about is HDFS, which seems to meant for huge files, not a smattering of small CSV files, (none of which would big enough to even make up the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
Is Hadoop the wrong tool for me?

I see a number of answers here that basically are saying, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for construction of a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here's the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):
Access: I can lease Hadoop clusters through either Amazon Elastic Map Reduce and I don't have to invest any time and energy into the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node gets run on another node.
Upgradeable: Being a rather generic map reduce engine with a great distributed file system if you later have problems that involve large data if you're used to using Hadoop you don't have to migrate to a new solution. So Hadoop gives you a simulation platform that will also scale to a large data platform for (nearly) free!
Support: Being open source and used by so many companies, the number of resources, both on line and off, for Hadoop are numerous. Many of those resources are written with the assumption of "big data" but they are still useful for learning to think in a map reduce way.
Portability: I have built analysis on top of proprietary engines using proprietary tools which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.

Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.
I like Gearman (http://gearman.org/) for this sort of thing.

While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.

Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.

Simply said, though Hadoop may solve your problem here, its not the right tool for your purpose.

Python, PyTables, Java - tying all together

Question in nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server to an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, it should be able to get significant space savings as compared to uncompressed CSVS.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from simulation. Furthermore, the post-processing code is doing more than it should have to do, essentially performing the work of a very, very poor man's relational database (making joins across 'tables' (csv files) based on foreign keys (the unique agent IDs). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine getting some subset of the raw data to play with in MatLab or SPSS).
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use [XQuery as the query language on HDF5 files][3], but the paper describes having to write a compiler to convert from XQuery/XPath into the native HDF5 calls, way beyond our needs.
Enter [PyTables][4]. It seems to do exactly what we need (provides two different ways of querying data, either through Python list comprehension or through [in-kernel (C level) searches][5].
The proposed architecture I envision is this:
What I'm not really sure how to do is to link together the python code that will be written for querying, with the Java code that serves up the HDF5 files, and the Java code that does the post processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple google search turns up a few options for [communicating between Java and Python][7], but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on same machine as the Datahose so that the large .h5 files do not have to be transferred over the network, but rather the much smaller, filtered views of it would be transmitted to the clients. [Pyro][8] seems to be an interesting choice - does anyone have experience with that?

This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another contraint that's significant is your IT infrastructure. Do you have high speed network available storage? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch processing needs. You mentioned running your PyTables-using-application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language agnostic communication method so these components can be changed later if necessarily. This is just as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library, Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if data is predictable and batchable.
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on it's own.)
Draw on disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes and HDF5 file, they arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate.
What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
How often does it need it.
Approximately how much resources does the application need.
What does the application do now that it doesn't do that well.
What can this application do now that would help, but it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation where you have quite a bit of java post-processing code that is doing "joins" on tables of data in CSV files, thats a "do now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult the solution to implement and the more likely it is to fail. Use the least number operations, use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki-them or file them away so you can whip them out again when the topic come s up.
And the answer to the direct question, "How to get Python and Java to play nice together?" is simply "use a language agnostic communication method." The truth of the matter is that Python and Java are both not important to your describe problem-set. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.

Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the Operating System do what OS's do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".

You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.

Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has no activity for 8 months.
Just wanted to see how this was going for you? We have a very very very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary we have to make/modify handwritten code to do summaries. Our binary files are about 10 GB in size and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered pyTables and had a similar idea to yours. I was hoping to change our storage format to hdf5 and then run our summary reports/queries using pytables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using pytables?

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.