Is it wise to develop a prototype GUI before designing other part of the system?
I am using Java for this small project. It will be a program with a GUI and a database connection. Say the database has tables A and B; the user can choose which table to interact with. The program then displays the contents of, say, table A in the GUI, and allows the user to change the content and submit the changes, or to delete or insert rows.
I think the GUI should be developed first, before any back-end development starts. There are a couple of reasons to do this:
You gain clarity on how model objects should interact.
Usability poses lots of restrictions on the way you want to pull data. You will probably want to design the architecture only after you're 100% sure of what the constraints are.
From a business point of view, managers like to have a dumb, non-functional UI before any development starts. Many times, the feedback leads to major changes in back-end assumptions, which is a lot less painful than getting a change request after the back-end development is over.
My personal experience is that simultaneous development of the GUI and the back-end gets a bit messy. Plus, the GUI provides a solid expectation of behavior from the back-end. Moreover, this approach makes sure all the developers, your client, and your manager are on the same page.
I agree with Joel Spolsky that it is a great idea to write a functional spec before writing code. Part of that spec should include a collection of screen mockups. @O.D. is right, Balsamiq is a great tool. It has saved me a lot of time in the past.
Once you have a functional spec in place that the business users are happy with, you will then have a better idea of how to design your system to meet the requirements: e.g. whether high performance is a requirement, domain model vs. simple CRUD, etc.
Then you should start by taking a single use case and building a vertical slice of your application. Build a GUI, service layer, persistence layer, database schema in one iteration. This will hopefully point out any problems with your design and give you the chance to modify it before you start building out the horizontal functionality.
I'd say yes and no.
No, because you should design your application to be modularized enough that your logic and data do not depend on the UI design.
Yes, because it is always smart to design everything before you actually start implementing it.
So what I mean is that you should make a concept, but not let your UI concept 'tie your hands' when you implement your logic. That way, if your manager or clients don't like your conceptual UI, you can always change it without actually changing your application logic.
Showing your GUI before starting to program is a very good idea, especially because it enables the end user (customer) to check whether the UI is up to their expectations, which can save you lots of time.
In order to do that you don't necessarily need to develop a "real" prototype; you can use programs which let you quickly design the UI of your app, including a minimal workflow simulation instead of full functionality.
I had a very good experience with Balsamiq and can really recommend it.
Writing a spec before your code is always a good idea, because it makes you think. But most specs I have seen are not that good. And if the spec is too technical, users will end up signing off on your spec without really understanding what they are going to get.
I have seen the best results come from either presenting the user manual to the client or discussing mockups of the system one scenario at a time.
Note that half-baked mockups won't do the trick. You need your mockups to be fully populated with relevant data (Ever tried to discuss some screens with accounting while the numbers on the screen don't match? There's no way at all you could explain to them these are only dummy numbers...)
And the caveat of using mockups is that users will more often than not believe the app is "almost finished", whatever you do or say. It must be some subconscious thing, I'm not sure. To avoid that, most specialized tools have either only a "black & white" look and feel or multiple skins you can switch between.
There is a pretty complete list of mockup tools here. Many of them are free:
http://c2.com/cgi/wiki?GuiPrototypingTools
My own tool is pretty popular: http://MockupScreens.com. I created it a long time ago exactly because of my own frustration with the above-mentioned problems.
I am a junior software engineer who has been given the task of taking over an old system. Based on my preliminary assessment, this system has several problems:
1. Spaghetti code
2. Repetitive code
3. Classes with 10k lines and above
4. Misuse of, and over-logging with, log4j
5. Bad database table design
6. Missing source control -> I have set up Subversion for this
7. Missing documents -> I have no idea of the business rules, except by reading the code
How should I go about enhancing the quality of the system and resolving these issues? I can think of using static code analysis software to fix bad coding practices.
However, that can't detect bad design issues or problems. How should I go about resolving these issues step by step?
Get and read Working Effectively With Legacy Code. It deals exactly with this situation.
As others have also advised, you need a solid set of unit tests for refactoring. However, legacy code is typically very difficult to unit test as-is, since it has not been written to be unit testable. So you need to refactor first to allow unit testing, which would require the very tests you can't yet write... a catch-22.
This is where the book will help you. It gives lots of practical advice on how to make badly designed code unit testable with the minimal, and safest possible, code changes. Automatic refactorings can also help you here, but there are tricks described in the book which can only be done by hand. Then once the first set of unit tests are in place, you can start gradually refactoring towards better, more maintainable code.
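To make the idea concrete, here is a minimal sketch (with hypothetical names, not an example from the book itself) of two of the moves it describes, Extract Interface and Parameterize Constructor: the hard-wired database call is hidden behind an interface so a test can substitute a fake without touching a real database.

    interface InvoiceStore {
        double loadInvoiceTotal(long invoiceId);
    }

    class InvoiceProcessor {
        private final InvoiceStore store;

        // Was: this.store = new JdbcInvoiceStore(...) hard-wired in here.
        InvoiceProcessor(InvoiceStore store) {
            this.store = store;
        }

        double totalWithTax(long invoiceId) {
            return store.loadInvoiceTotal(invoiceId) * 1.25; // stand-in logic
        }
    }

    class InvoiceProcessorTest {
        public static void main(String[] args) {
            InvoiceStore fake = invoiceId -> 100.0; // fake: no database needed
            double result = new InvoiceProcessor(fake).totalWithTax(42L);
            System.out.println(result == 125.0 ? "ok" : "broken: " + result);
        }
    }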
Update: For hints on how to take over legacy code, you may find this earlier answer of mine useful.
As @Alex noted, unit tests are also very useful for understanding and documenting the actual behaviour of the code. This is especially useful when documentation about the system is nonexistent or outdated.
Focus on stability first. You can't enhance or refactor until you have some kind of stable environment in-place around the application.
Some thoughts:
Revision control. You've made a start by setting up Subversion. Now make sure that your database schemas, stored procedures, scripts, third-party components, etc. are under revision control too. Have a version labelling system, make sure you label versions, and can accurately access old versions in the future.
Build and release. Have a way to build stable releases on a machine other than your dev machine. You may want to use ant/nant, make, msbuild, or even a batch file or shell script. You may need deployment scripts / installers too if they don't exist.
Get it under test. Do not change the app until you have a way to know whether your change has broken it. For this you need tests. You should hopefully be able to write xUnit unit tests for some of the simpler, stand-alone classes, but try to build some system/integration tests that exercise the application as a whole. Without high code coverage (which you won't have to begin with), integration tests are your best bet. Get into the habit of running the tests as often as possible, and take every opportunity to extend them. (There's a sketch of a simple characterization test after this list.)
Make small, focused changes. Try to identify systems/subsystems within the application, and improve the boundaries between them. This reduces the knock-on effects of changes you may make. Beware the temptation to "pretty up" the code by reformatting it or imposing the latest fashionable design pattern. Turning around a system like this takes time.
Documentation. It's necessary, but don't worry too much about it. System documentation is rarely used, in my experience. Good tests are usually better than good documentation. Concentrate on documenting the interfaces between the application and the system context that it runs in (inputs, outputs, file structures, DB schemas, etc.).
Manage expectations. If it's in bad shape then it will probably resist your efforts to make changes, and timescales may be harder than usual to estimate. Make sure management and stakeholders understand that.
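As a concrete illustration, here is a minimal sketch of the kind of characterization test mentioned above, assuming JUnit 4 on the classpath; ReportGenerator and its output are hypothetical stand-ins for whatever your legacy app produces:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Stand-in for the legacy class under test; in real life this is the
    // existing code, left untouched.
    class ReportGenerator {
        String monthlyReport(int year, int month) {
            return month + "/" + year + ": 17 orders"; // imagine 500 lines here
        }
    }

    public class ReportGeneratorCharacterizationTest {

        // A characterization test pins down what the code *currently* does
        // (right or wrong), so later refactoring can be checked against it.
        @Test
        public void monthlyReportOutputStaysTheSame() {
            // The expected value was captured by running the existing code
            // once and pasting the result in -- not derived from any spec.
            assertEquals("6/2009: 17 orders",
                    new ReportGenerator().monthlyReport(2009, 6));
        }
    }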
At all costs, beware the temptation to just rewrite the whole thing. It's almost never the right thing to do in this situation. If it works, concentrate on keeping it working.
As a junior developer, don't be afraid to ask for help. As others have said, Working Effectively With Legacy Code is a good book to read, as is Martin Fowler's Refactoring.
Good luck!
First, don't fix what isn't broken. As long as the system you are to take over works, leave functionality alone.
The system is obviously broken when it comes to maintainability, however, so that is what you tackle. As mentioned above, write some tests first, get the source backed up in version control, and THEN start by cleaning up small pieces first, then the larger ones, and so on. Do NOT attack the bigger architectural issues until you have gained a good understanding of how the system works. Tools won't help you as long as you don't dive into the code yourself, but when you do, they help a lot.
Remember, nothing is "perfect". Don't over-engineer. Obey the KISS and YAGNI principles.
Your issue #7 is by far the most important. As long as you have no idea how the system is supposed to behave, all technical considerations are secondary. Everyone is suggesting unit tests - but how can you write a useful test if you can't distinguish between wanted and unwanted behaviour?
So before you start touching the code, you have to understand the system from the user's point of view: talk to users, observe them using the system, write documentation on the use case level.
Yes, I am seriously suggesting that you spend days, more likely weeks, without changing a single line of code. Because right now, any change you make is likely to break things without you realizing it.
Once you understand the app, you'll at least know which functionality is important to test (manually or automated).
Write some unit tests first, and make sure they pass. Then with each refactoring change you make, just keep making sure the tests keep passing. Then you can be confident that your application behaviour to the outside world hasn't changed.
This also has the added benefit that the tests will always be there, so for any future changes the tests should still pass, guarding against any regressions in the new changes.
First and foremost, make sure you have a source control system installed, and that all source code is versioned and can be built.
Next, you can try writing unit tests for the core parts of your system. From there, when you have a more or less solid body of regression tests, you can actually proceed with refactoring.
When I encounter a messy codebase, I usually start by renaming poorly-named types and methods to better reflect their intent. Next you can try splitting huge methods into smaller ones.
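A minimal before/after sketch of those two moves, with a hypothetical Order class: rename the method to reveal intent, then extract the calculation so it can be tested on its own.

    import java.util.List;

    class OrderReport {

        static class Order {
            final double amount;
            Order(double amount) { this.amount = amount; }
        }

        // Before: one vaguely named method mixing calculation and output.
        static void doStuff(List<Order> list) {
            double t = 0;
            for (Order o : list) t += o.amount;
            System.out.printf("Total: %.2f%n", t);
        }

        // After: a revealing name, with the calculation extracted into its
        // own method so it can be unit-tested without capturing console output.
        static void printOrderTotal(List<Order> orders) {
            System.out.printf("Total: %.2f%n", totalAmount(orders));
        }

        static double totalAmount(List<Order> orders) {
            double total = 0;
            for (Order order : orders) total += order.amount;
            return total;
        }
    }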
Keep in mind that this legacy system, with all its spaghetti code, currently works. Don't go changing things just because they don't look as pretty as they should. Focus on stability, new features, and familiarity before ripping old code out left, right, and centre.
Firstly, let me say that Working Effectively with Legacy Code is probably a really good book to read, judging by three answers within a minute of each other.
bad database table design
This one, you are probably stuck with. If you try to change an existing database design you are probably committing yourself to redesigning the whole system and writing migration tools for the existing data. Leave well alone.
My standard answer to this question is: Refactor the Low-hanging Fruit. In this case, I'd be inclined to take one of the 10K-line classes and seek out opportunities to Sprout Class, but that's just my own proclivity; you might be more comfortable changing other things first (setting up source control was an excellent first step!) Test what you can; refactor what can't be tested, take a step at a time, and make it better.
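For what it's worth, here is a rough sketch (hypothetical names; the pattern is Feathers', the code is not) of what sprouting a class out of a giant class can look like: the extracted behaviour lives in a small, fully testable class, and the monolith merely delegates to it.

    class TaxCalculator {                   // the sprouted class
        double taxFor(double amount, String region) {
            return "EU".equals(region) ? amount * 0.20 : amount * 0.08;
        }
    }

    class GiantBillingClass {               // stands in for the 10k-line class
        private final TaxCalculator tax = new TaxCalculator();

        double finalPrice(double amount, String region) {
            // ...thousands of lines elsewhere...
            return amount + tax.taxFor(amount, region); // one line of delegation
        }
    }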
Keep in mind as you progress how much better you are making things; if you concentrate only on how bad things still are, you're likely to become discouraged.
As others have noted, don't change something that works just to make it prettier. The risk that you will introduce errors is great.
My philosophy is: As I have to make changes to satisfy new requirements or to fix reported bugs, I try to make the piece of code that I have to change a little cleaner. I'm going to have to test the changed code anyway, so now is a good time to do a little clean-up at small additional cost.
Fundamental design changes are the toughest and must be saved for occasions where you have to make a big enough change that you would be testing all the changed code anyway.
Changing bad database design is hardest of all because the poorly designed tables are likely used by many programs. Any change to the database requires changing every program that reads or writes it. The best way to accomplish this is usually to try to reduce the number of places that access any given part of the database. To take a simple example: Suppose there are 20 places that read through customer records and calculate the customer account balance. Replace this with one function that reads the database and returns the total, and twenty calls to that function. Now you can change the schema for the customer records and there is only one piece of code to change instead of 20. The principle is simple enough, but in practice it is unlikely that every function that accesses a given record is doing the same thing. Even if the original programmer was clumsy enough to write the same code 20 times (not unlikely -- I've seen plenty of that), the real situation is probably not that he wrote 1 function 20 times, period, but that he wrote function A 20 times, function B 12 times, function C 4 times, etc.
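A sketch of that consolidation, with a hypothetical schema and plain JDBC (the same idea applies with any data access layer): the twenty inline balance loops collapse into one function, so a schema change touches only this method.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class CustomerAccounts {

        // The single point of access that replaces the twenty hand-rolled
        // balance calculations scattered through the codebase.
        static double accountBalance(Connection conn, long customerId)
                throws SQLException {
            String sql = "SELECT SUM(amount) FROM customer_transactions"
                       + " WHERE customer_id = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setLong(1, customerId);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getDouble(1) : 0.0;
                }
            }
        }
    }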
Working Effectively With Legacy Code might be helpful.
Design issues are very difficult to catch. The first place to start is understanding the design of the application. I find it useful to diagram it, using either UML or a process flow diagram; anything works that communicates the design and workings of the application.
From there I go into more detail and ask myself the question "Would I have done it this way?", and what other options there are. It is easy to see code debt, i.e. the debt we get from making bad choices, as always bad, but sometimes there are other factors involved, like budget, time, and availability of resources. There you have to ask whether it is worth refactoring a working but badly designed application.
If there are many upcoming new features, changes, bug fixes, etc., I would say it is good to refactor; but if the application rarely changes and is stable, then leaving it as is may be the better approach.
Another side point to note is that if the code is used by another application as a service or module, then refactoring might first mean creating a stub around the code that serves as the interface. Once that is defined clearly and has unit tests to prove it works, you can choose any technology to fill in the details.
A good book on this subject is Working Effectively with Legacy Code By Michael Feathers (2004). It goes through the process of making small changes, while working towards a bigger clean up.
Write unit tests & find and remove duplicate code.
Write unit tests & break long methods into a series of short methods.
Write unit tests & find and remove duplicate methods.
Write unit tests & break apart classes so that they follow the single responsibility principle.
Try to create some unit tests first that can trigger some actions in your code.
Commit everything in SVN and TAG it (in case something goes bad you'll have an escape pod).
Use the inCode Eclipse plugin (http://www.intooitus.com/inCode.html) and look at the refactorings it proposes. Check whether the proposed refactorings seem OK for your problem, and try to understand them.
Retest with the unit tests created before.
Now you can use FindBugs and/or PMD to check for other subtle issues.
If everything is okay, you might want to check in again.
I'd also try reading the source in order to detect some cases where patterns can be applied.
The project I am about to start is my first project, and it is very big. Though it's a great opportunity for me, I don't want to end up trapped in messy code, so I have made a whole design of the software (the software architecture) and divided it into three tiers:
Presentation Tier ---- (will be implemented through Java Swing)
BusinessLogic Tier -- (will be implemented through EJB technology)
DatabaseLayer ------- (will be implemented with the help of Hibernate)
Q1. Which Tier should I select to start with?
I don't have any experience of a standard product development environment, but I am sure that there is some specific order that is better than the others.
Q2. I think these things come under good design principles and best practices. I have searched the internet for these kinds of resources and have found some good ones, but I would be grateful if you could recommend resources that you know have short, to-the-point, quality content.
"my first project and it is very big"
Please don't do that.
Please do a small project first.
Write the "model" (business logic) first. This will be very difficult, since it's your first project. Keep it small and focused on just business logic that you can test and prove that it works.
Throw that away.
Now do another project.
Write the "model" (business logic) first. Based on lessons learned during the first project this will be much better. This will be hard, since it's your second project. Keep it small and focused on just business logic that you can test and prove that it works.
Write the persistence and the object-relational mapping second. Adding database persistence will be very hard, since it's only your first database project, and only your second project.
Throw that away.
Now, you have some idea what you're doing. Start a third project.
Write the "model" (business logic) first. Based on lessons learned during the first two projects this will be much better. By now, this will still be a lot of work because you finally understand what you're doing. The work, however, will no longer involve technical issues, it will now involve real issues around use cases and what the application actually does.
Write the persistence and the object-relational mapping second. Based on lessons learned during the first projects this will be much better. This will still be hard. Nothing makes this easy. It's only your second time, so you'll still make mistakes; there will be fewer mistakes.
Write the presentation last. Always.
This is actually the fastest way to do a large project when you don't already know the technology.
Do the implementation in vertical stripes: implement one capability of the project through all three layers, so that the design can be validated end-to-end.
The database layer could be skipped at first; data could either be hard-coded, not persisted, or simply stored in test files.
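A minimal sketch of what that can look like (hypothetical names): the business logic depends only on an interface, so the first vertical slice runs against an in-memory list, and a Hibernate-backed implementation can be slotted in later without touching the layers above.

    import java.util.ArrayList;
    import java.util.List;

    interface CustomerRepository {
        void save(String customerName);
        List<String> findAll();
    }

    // Good enough to validate the slice end-to-end; replaced later by a
    // Hibernate-backed implementation of the same interface.
    class InMemoryCustomerRepository implements CustomerRepository {
        private final List<String> customers = new ArrayList<>();

        public void save(String customerName) { customers.add(customerName); }
        public List<String> findAll()         { return new ArrayList<>(customers); }
    }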
There are two approaches to this and both have advantages and disadvantages:
Top-down
First develop the UI, and implement the data access and business logic where required by the UI.
Bottom-up
Develop the Database Layer first and then add the UI functionality later.
The first has the advantage that you have an (at least partially) working UI quickly, which is something customers like, so it's the de facto industry standard. The latter provides a stable and complete database and business logic layer, but usually with some overhead, because it is not yet clear how the data will later be required and accessed.
I think you should start with the business logic layer, as this layer should be the one that dictates which functionality the other layers should have. If you start with the data access layer, there's a risk that you will end up with a lot of YAGNI code; furthermore, any application should be centered around business logic, not persistence or presentation logic.
You should look into such best practices as DDD and TDD, and you could probably benefit from knowledge about MVVM or similar pattern too.
I would add to the existing answers:
Do not use lots of technologies you're unfamiliar with or only have a basic understanding of: EJB, Hibernate, etc. They give you no appreciation of what's going on under the covers and add to the learning curve, pushing out your deadline. Instead go for something a lot simpler, e.g. an RMI client-server application with JDBC or Spring/JDBC for persistence. You can always rework elements later, but it's better to deliver incrementally than not at all.
Q1. Take a pen and a lot of paper and start designing. That way you make "plans" before writing code.
Q2. Using paper first helps you design for quality; regular refinement, refactoring, and review help maintain quality.
Design Principles:
Start here
http://www.butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod
Question in a nutshell
What is the best way to get Python and Java to play nice with each other?
More detailed explanation
I have a somewhat complicated situation. I'll try my best to explain both in pictures and words. Here's the current system architecture:
We have an agent-based modeling simulation written in Java. It has options of either writing locally to CSV files, or remotely via a connection to a Java server into an HDF5 file. Each simulation run spits out over a gigabyte of data, and we run the simulation dozens of times. We need to be able to aggregate over multiple runs of the same scenario (with different random seeds) in order to see some trends (e.g. min, max, median, mean). As you can imagine, trying to move around all these CSV files is a nightmare; there are multiple files produced per run, and like I said some of them are enormous. That's the reason we've been trying to move towards an HDF5 solution, where all the data for a study is stored in one place, rather than scattered across dozens of plain text files. Furthermore, since it is a binary file format, we should be able to get significant space savings compared to uncompressed CSVs.
As the diagram shows, the current post-processing we do of the raw output data from simulation also takes place in Java, and reads in the CSV files produced by local output. This post-processing module uses JFreeChart to create some charts and graphs related to the simulation.
The Problem
As I alluded to earlier, the CSVs are really untenable and are not scaling well as we generate more and more data from the simulation. Furthermore, the post-processing code is doing more than it should have to, essentially performing the work of a very, very poor man's relational database: making joins across "tables" (CSV files) based on foreign keys (the unique agent IDs). It is also difficult in this system to visualize the data in other ways (e.g. Prefuse, Processing, JMonkeyEngine) or to get some subset of the raw data to play with in MATLAB or SPSS.
Solution?
My group decided we really need a way of filtering and querying the data we have, as well as performing cross-table joins. Given this is a write-once, read-many situation, we really don't need the overhead of a real relational database; instead we just need some way to put a nicer front end on the HDF5 files. I found a few papers about this, such as one describing how to use XQuery as the query language on HDF5 files, but it describes having to write a compiler to convert from XQuery/XPath into native HDF5 calls, which is way beyond our needs.
Enter PyTables. It seems to do exactly what we need, providing two different ways of querying data: through Python list comprehensions or through in-kernel (C-level) searches.
The proposed architecture I envision is this:
What I'm not really sure how to do is link together the Python code that will be written for querying with the Java code that serves up the HDF5 files and the Java code that does the post-processing of the data. Obviously I will want to rewrite much of the post-processing code that is implicitly doing queries, and instead let the excellent PyTables do this much more elegantly.
Java/Python options
A simple Google search turns up a few options for communicating between Java and Python, but I am so new to the topic that I'm looking for some actual expertise and criticism of the proposed architecture. It seems like the Python process should be running on the same machine as the Datahose, so that the large .h5 files do not have to be transferred over the network; rather, the much smaller, filtered views of them would be transmitted to the clients. Pyro seems to be an interesting choice - does anyone have experience with that?
This is an epic question, and there are lots of considerations. Since you didn't mention any specific performance or architectural constraints, I'll try and offer the best well-rounded suggestions.
The initial plan of using PyTables as an intermediary layer between your other elements and the datafiles seems solid. However, one design constraint that wasn't mentioned is one of the most critical of all data processing: Which of these data processing tasks can be done in batch processing style and which data processing tasks are more of a live stream.
This differentiation between "we know exactly our input and output and can just do the processing" (batch) and "we know our input and what needs to be available for something else to ask" (live) makes all the difference to an architectural question. Looking at your diagram, there are several relationships that imply the different processing styles.
Additionally, on your diagram you have components of different types all using the same symbols. It makes it a little bit difficult to analyze the expected performance and efficiency.
Another constraint that's significant is your IT infrastructure. Do you have high-speed network storage available? If you do, intermediary files become a brilliant, simple, and fast way of sharing data between the elements of your infrastructure for all batch-processing needs. You mentioned running your PyTables-using application on the same server that's running the Java simulation. However, that means that server will experience load for both writing and reading the data. (That is to say, the simulation environment could be affected by the needs of unrelated software when they query the data.)
To answer your questions directly:
PyTables looks like a nice match.
There are many ways for Python and Java to communicate, but consider a language-agnostic communication method so these components can be changed later if necessary. This is as simple as finding libraries that support both Java and Python and trying them. The API you choose to implement with whatever library should be the same anyway. (XML-RPC would be fine for prototyping, as it's in the standard library; Google's Protocol Buffers or Facebook's Thrift make good production choices. But don't underestimate how great and simple just "writing things to intermediary files" can be if the data is predictable and batchable.)
To help with the design process more and flesh out your needs:
It's easy to look at a small piece of the puzzle, make some reasonable assumptions, and jump into solution evaluation. But it's even better to look at the problem holistically with a clear understanding of your constraints. May I suggest this process:
Create two diagrams of your current architecture, physical and logical.
On the physical diagram, create boxes for each physical server and diagram the physical connections between each.
Be certain to label the resources available to each server and the type and resources available to each connection.
Include physical hardware that isn't involved in your current setup if it might be useful. (If you have a SAN available, but aren't using it, include it in case the solution might want to.)
On the logical diagram, create boxes for every application that is running in your current architecture.
Include relevant libraries as boxes inside the application boxes. (This is important, because your future solution diagram currently has PyTables as a box, but it's just a library and can't do anything on its own.)
Draw on-disk resources (like the HDF5 and CSV files) as cylinders.
Connect the applications with arrows to other applications and resources as necessary. Always draw the arrow from the "actor" to the "target". So if an app writes an HDF5 file, the arrow goes from the app to the file. If an app reads a CSV file, the arrow goes from the app to the file.
Every arrow must be labeled with the communication mechanism. Unlabeled arrows show a relationship, but they don't show what relationship and so they won't help you make decisions or communicate constraints.
Once you've got these diagrams done, make a few copies of them, and then right on top of them start to do data-flow doodles. With a copy of the diagram for each "end point" application that needs your original data, start at the simulation and end at the end point with a pretty much solid flowing arrow. Any time your data arrow flows across a communication/protocol arrow, make notes of how the data changes (if any).
At this point, if you and your team all agree on what's on paper, then you've explained your current architecture in a manner that should be easily communicable to anyone. (Not just helpers here on stackoverflow, but also to bosses and project managers and other purse holders.)
To start planning your solution, look at your dataflow diagrams and work your way backwards from endpoint to startpoint and create a nested list that contains every app and intermediary format on the way back to the start. Then, list requirements for every application. Be sure to feature:
What data formats or methods can this application use to communicate.
What data does it actually want. (Is this always the same or does it change on a whim depending on other requirements?)
How often does it need it.
Approximately what resources does the application need.
What does the application do now that it doesn't do that well.
What can this application do now that would help, but it isn't doing.
If you do a good job with this list, you can see how this will help define what protocols and solutions you choose. You look at the situations where the data crosses a communication line, and you compare the requirements list for both sides of the communication.
You've already described one particular situation: you have quite a bit of Java post-processing code that is doing "joins" on tables of data in CSV files; that's a "does now but doesn't do that well". So you look at the other side of that communication to see if the other side can do that thing well. At this point, the other side is the CSV file and, before that, the simulation, so no, there's nothing that can do that better in the current architecture.
So you've proposed a new Python application that uses the PyTables library to make that process better. Sounds good so far! But in your next diagram, you added a bunch of other things that talk to "PyTables". Now we've extended past the understanding of the group here at StackOverflow, because we don't know the requirements of those other applications. But if you make the requirements list like mentioned above, you'll know exactly what to consider. Maybe your Python application using PyTables to provide querying on the HDF5 files can support all of these applications. Maybe it will only support one or two of them. Maybe it will provide live querying to the post-processor, but periodically write intermediary files for the other applications. We can't tell, but with planning, you can.
Some final guidelines:
Keep things simple! The enemy here is complexity. The more complex your solution, the more difficult it is to implement and the more likely it is to fail. Use the least number of operations; use the least complex operations. Sometimes just one application to handle the queries for all the other parts of your architecture is the simplest. Sometimes an application to handle "live" queries and a separate application to handle "batch requests" is better.
Keep things simple! It's a big deal! Don't write anything that can already be done for you. (This is why intermediary files can be so great, the OS handles all the difficult parts.) Also, you mention that a relational database is too much overhead, but consider that a relational database also comes with a very expressive and well-known query language, the network communication protocol that goes with it, and you don't have to develop anything to use it! Whatever solution you come up with has to be better than using the off-the-shelf solution that's going to work, for certain, very well, or it's not the best solution.
Refer to your physical layer documentation frequently so you understand the resource use of your considerations. A slow network link or putting too much on one server can both rule out otherwise good solutions.
Save those docs. Whatever you decide, the documentation you generated in the process is valuable. Wiki them or file them away so you can whip them out again when the topic comes up.
And the answer to the direct question, "How do I get Python and Java to play nice together?" is simply "use a language-agnostic communication method." The truth of the matter is that Python and Java are both unimportant to the problem set you describe. What's important is the data that's flowing through it. Anything that can easily and effectively share data is going to be just fine.
Do not make this more complex than it needs to be.
Your Java process can -- simply -- spawn a separate subprocess to run your PyTables queries. Let the operating system do what OSes do best.
Your Java application can simply fork a process which has the necessary parameters as command-line options. Then your Java can move on to the next thing while Python runs in the background.
This has HUGE advantages in terms of concurrent performance. Your Python "backend" runs concurrently with your Java simulation "front end".
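A minimal sketch of that approach, where query_hdf5.py and its arguments are hypothetical stand-ins for your PyTables script: Java launches the Python process, passes the query parameters on the command line, and reads the filtered results from the script's stdout.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class PyTablesBridge {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "python", "query_hdf5.py",
                    "--file", "study.h5",
                    "--where", "agent_id == 42");
            pb.redirectErrorStream(true);      // merge stderr into stdout

            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);  // or parse into result objects
                }
            }
            System.out.println("python exited with " + process.waitFor());
        }
    }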
You could try Jython, a Python interpreter for the JVM which can import Java classes.
Jython project homepage
Unfortunately, that's all I know on the subject.
Not sure if this is good etiquette. I couldn't fit all my comments into a normal comment, and the post has no activity for 8 months.
Just wanted to see how this was going for you? We have a very, very, very similar situation where I work - only the simulation is written in C and the storage format is binary files. Every time a boss wants a different summary, we have to make or modify handwritten code to produce it. Our binary files are about 10 GB in size, and there is one of these for every year of the simulation, so as you can imagine, things get hairy when we want to run it with different seeds and such.
I've just discovered PyTables and had a similar idea to yours. I was hoping to change our storage format to HDF5 and then run our summary reports/queries using PyTables. Part of this involves joining tables from each year. Have you had much luck doing these types of "joins" using PyTables?
I started web programming with raw PHP, gradually moving on to its various frameworks, then to Django and Rails. In every framework I've used, pretty much everything I need to do with a database (even involving relatively complex things like many-to-many relationships) could be taken care of by the automatically generated database API without much work. Those few operations that were more complex could be done with straight SQL or by tying together multiple API calls.
Now I'm starting to learn Java, and it confuses me that the language celebrated for being so robust for back-end infrastructure requires so much more code (doesn't that mean it's harder to maintain?) to do simple things. An example from a tutorial: say you want to search by last name. You write the method in the DAO using Hibernate Query Language, then you write a method in the Service to call it (couldn't that be automated?), then you call the Service method from the controller. Whereas in any other framework I've worked with, you could call something to the effect of
Person.find_by_last_name(request.POST['last_name'])
Straight out of the controller - you don't have to write anything custom to do something like that.
Is there some kind of code generation I haven't found yet? Something in Eclipse? It just doesn't seem right to me that the language regarded as one of the best choices for complex back-ends is so much harder to work with. Is there something I'm missing?
Grails for the win. Groovy is very similar to Java but with a lot of nice dynamic language additions/simplifications. Grails has GORM, which is exactly what you're looking for.
In the example you mention, it looks like they are using more of a tiered architecture than you are used to.
Controller -> Service -> DAO
This provides for separation within the app. This is also completely dependent on the architecture of your application, not really Java as a language. Technically there is nothing in Java that would stop you from calling a Hibernate query in your controller. It just wouldn't constitute good design.
Another thing to consider is that the "Service" could be something like an EJB, which may have the role of transaction management, so that multiple calls to the DAO/Hibernate can be wrapped in a single transaction that will automatically commit or roll back on success / exception.
Again though, this is all in the architecture / framework that you are using, not Java as a language.
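To make the layering concrete, here is a bare-bones sketch (hypothetical names; the HQL mirrors the tutorial-style example in the question, and it assumes a configured Hibernate SessionFactory). The service is thin here, but it is where transaction demarcation and cross-DAO logic belong as the app grows, which is why the extra layer exists.

    import java.util.List;
    import org.hibernate.SessionFactory;

    class Person {
        String lastName; // a mapped Hibernate entity in real life
    }

    class PersonDao {
        private final SessionFactory sessionFactory;

        PersonDao(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        @SuppressWarnings("unchecked")
        List<Person> findByLastName(String lastName) {
            return sessionFactory.getCurrentSession()
                    .createQuery("from Person p where p.lastName = :lastName")
                    .setParameter("lastName", lastName)
                    .list();
        }
    }

    class PersonService {
        private final PersonDao personDao;

        PersonService(PersonDao personDao) { this.personDao = personDao; }

        // Thin today; the home for transactions/validation tomorrow.
        List<Person> searchByLastName(String lastName) {
            return personDao.findByLastName(lastName);
        }
    }

    class PersonController {
        private final PersonService personService;

        PersonController(PersonService personService) {
            this.personService = personService;
        }

        List<Person> handleSearch(String lastNameParam) {
            return personService.searchByLastName(lastNameParam);
        }
    }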
I would suggest that you look at the Spring framework for Java.
I haven't personally used it, but I've heard good things about it. From what I understand it provides the same sort of functionality that you would get Django and Rails.
I'd suggest you use Seam. It is a very good controller that does not force you to have a fully multi-tiered app. It is fully integrated with JPA (the Java Persistence API), which does the ORM.
Features
Has very nice scoping - you can have objects scoped to the session, the page load, or the conversation (a conversation is a unit of work from the user's perspective).
Does not require much XML.
Does not require much boilerplate code!
Is easy to learn (you may even generate a framework project from entity classes or a DB schema; it will still require much work, but it will at least cut down on boilerplate code)
Very nice security (you may either use role-based security or use the rules framework)
When writing a web page you use beans (normal Java objects).
You may write:
#{PersonHome.instance.name}
which will evaluate to the value of a person's name. If there was a person id in the request parameters, it will be passed to the PersonHome component (if it was annotated properly), and the person bean will be loaded transparently from the DB.
And you may even write:
<h:commandLink action="#{PersonHome.delete(person)}">
where PersonHome is a Java bean and delete takes a person object. This will be transparently translated to a link carrying a person id parameter, which will transparently be translated back into a bean before the action method is fired.
The only caveat for now is that it somewhat limits your choice of view framework: it works only with RichFaces, GWT, and something else that I can't remember now ;).
PS.
Yes I'm a huge fan of seam :).
Get iBatis for Java. It isn't as robust as Django's ORM (nothing is), but it's a big, big step above JDBC.
Well yeah, but it's not separation that's the concern for me. With these other database APIs, concerns are separated - it's not that custom DB logic is written in the controller (which I know is bad form) but that it's automatically generated. The call looks exactly the same from the controller; the difference is that with Java/Hibernate I had to write it myself, and with Django/Rails/Symfony/Cake it was already there for me.
Grails looks very interesting. The main reason I'm learning Java, though, is because I want to at least be able to work with something I can use professionally. I'm not sure Grails fits that bill, though because it is Java perhaps the enterprise will warm up to it more readily than to Rails.
Django is the most beautiful piece of code I have ever worked with, but it's not trusted by the kind of businesses who can afford custom web apps, which is the market I think I want to be in.
iBATIS looks very promising - looking through the JPetStore code it appears it does a bit more automatically. But did all that SQL have to be hand-coded? Because then I'd be back where I started.
Spring has a great and reasonably easy to work with interface layer (MVC), and it ties components together pretty nicely. Though it can be used to integrate an ORM into an app, as far as I know it's not one.
Part of the reason that you don't get automatically generated database APIs in Java is that, as a compiled language without macros (as in Lisp), it cannot do runtime code generation. Dynamically typed scripting languages like Ruby have that capability.
Another part of the reason is cultural: the J2EE world has tended to prefer configuration over convention for the last decade. The major reason for this preference is that enterprisey apps often have to work with lots of crufty assumptions that bring with it all sorts of "weird" edge cases. It is what it is. These assumptions stand in stark contrast to what the newer frameworks like Rails assume.
That said, I don't think there's anything to prevent a Java ORM API from generating database APIs at compile time. My guess is that Naked Objects is more up your alley.
It just doesn't seem right to me that the language regarded as one of the best choices for complex back-ends is so much harder to work with. Is there something I'm missing?
The reason for this is the amount of time Java has been deployed in the Enterprise.
It's not that Java is more mature than, say, Ruby, it's just that Java has been used for longer and has been proven by risk averse IT departments.
One recommended way in is therefore via JRuby (+Rails), as that runs on the Java VM, and the IT department don't need to install or reconfigure anything...