Migrating from hand-written persistence layer to ORM

Migrating from hand-written persistence layer to ORM - java

We are currently evaluating options for migrating from hand-written persistence layer to ORM.
We have a bunch of legacy persistent objects (~200), that implement simple interface like this:
interface JDBC {
public long getId();
public void setId(long id);
public void retrieve();
public void setDataSource(DataSource ds);
}
When retrieve() is called, object populates itself by issuing handwritten SQL queries to the connection provided using the ID it received in the setter (this usually is the only parameter to the query). It manages its statements, result sets, etc itself. Some of the objects have special flavors of retrive() method, like retrieveByName(), in this case a different SQL is issued.
Queries could be quite complex, we often join several tables to populate the sets representing relations to other objects, sometimes join queries are issued on-demand in the specific getter (lazy loading). So basically, we have implemented most of the ORM's functionality manually.
The reason for that was performance. We have very strong requirements for speed, and back in 2005 (when this code was written) performance tests has shown that none of mainstream ORMs were that fast as hand-written SQL.
The problems we are facing now that make us think of ORM are:
Most of the paths in this code are well-tested and are stable. However, some rarely-used code is prone to result set and connection leaks that are very hard to detect
We are currently squeezing some additional performance by adding caching to our persistence layer and it's a huge pain to maintain the cached objects manually in this setup
Support of this code when DB schema changes is a big problem.
I am looking for an advice on what could be the best alternative for us. As far as I know, ORMs has advanced in last 5 years, so it might be that now there's one that offers an acceptable performance. As I see this issue, we need to address those points:
Find some way to reuse at least some of the written SQL to express mappings
Have the possibility to issue native SQL queries without the necessity to manually decompose their results (i.e. avoid manual rs.getInt(42) as they are very sensitive to schema changes)
Add a non-intrusive caching layer
Keep the performance figures.
Is there any ORM framework you could recommend with regards to that?
UPDATE To give a feeling of what kind of performance figures we are talking about:
The backend database is TimesTen, in-memory database that runs on the same machine as the JVM
We found out that changing rs.getInt("column1") to rs.getInt(42) brings the performance increase we consider significant.

If you want a standard persistence layer that lets you issue native SQL queries, consider using iBATIS. It's a fairly thin mapping between your objects and SQL. http://ibatis.apache.org/
For caching and lazy joins, Hibernate might be a better choice. I haven't used iBATIS for these purposes.
Hibernate provides a lot of flexibility in allowing you to specify certain defaults for lazy loading as you traverse your object graph, yet also pre-fetch data with SQL or HQL queries to your heart's content when you need better-known load times. However, the conversion effort will be complicated for you as it has a fairly high bar to entry in terms of learning and configuration. Annotations made this easier for me.
Two benefits you didn't mention about switching to a standard framework:
(1) running down bugs becomes easier when you have a wealth of sites and forums out there to support you.
(2) new hires are cheaper, easier and faster.
Good luck in addressing your performance and usability issues. The tradeoffs you point out are very common. Sorry if I evangelized.

For the bulk of your queries, I'd go with hibernate. It's widely used,well documented, and generally performant. You can drop down to hand-written SQL if hibernate isn't producing efficient enough queries. Hibernate gives you a lot of control in specifying the table names and columns that the domain objects map to, and in most cases you can retro fit it to an exisitng schema.
Find some way to reuse at least some of the written SQL to express mappings
The mappings are expressed in JPA using annotations. You can use the existing SQL as a guide when creating JPQL queries.
Add a non-intrusive caching layer
Caching in hibernate is automatic and transparent, unless you specifically choose to get involved. You can mark entities as read only, or evict from the cache, control when changes are flushed to the database (inside a transaction of course - automatic use of batching improves performance when network latency is a concern.)
Have the possibility to issue native
SQL queries without the necessity to
manually decompose their results (i.e.
avoid manual rs.getInt(42) as they
are very sensitive to schema changes)
Hibernate allows you to write SQL, and have this mapped to your entities. You don't deal with the ResultSet directly - hibernate takes care of the deconstruction into your entity. See Chpt 16, Native SQL in the hibernate manual.
Support of this code when DB schema changes is a big problem.
Managing schema changes can still be a pain, since you now effectively have two schemata - the database schema and the JPA mapping (an object schema). if you choose to let hibernate generate the db schema and move your data to that, you are no longer directly responsible for what goes into the database, and so you are then faced with manging automatic changes to a machine generated schema. There are tools that can assist, such as dbmigrate, and liquibase, but it's no walk in the park. Conversely, if you are managing the db schema by hand, then you will have to carefully recraft your entities, JPA annotations and queries to accomodate the schema changes. Adding columns and new entities is relatively trivial, but more complex changes such as changing a single property to a collection of properties, or restructing an object hierarchy will involve considerably more extensive changes. There is no easy way out of this - either the db or hibernate is the "master" that decides the schema, and when one changes, the other must follow. The code changes aren't so bad - in my experience, it's migrating the data that's difficult. But this is a basic issue with databases, and will be present in any solution you choose.
So, to sum up, I'd go with hibernate, and use the JPA interface.

I've recently drilled through a bunch of Java ORMs and didn't come up with anything much better than Hibernate. Hibernate's performance may get you there and satisfy your performance goals.
Lots of people think that moving to Hibernate will make everything so awesome, but it's really just moving a set of problems from JDBC queries into Hibernate tuning. Read a bunch of books or (better) hire a "Hibernate guy" to come in and help.
During your refactor, I'd recommend using JPA so you can un-plug and re-plug a new persistence provider when the Next Big Thing comes along (or you move to Oracle)

Do you really need to migrate? What's forcing you to move? Is there some REAL need here or someone just inventing work (an 'Astronaut architect')?
I agree with the above answers though - if you HAVE to move - Hibernate or iBatis are good choices. iBatis especially if you want to stay 'closer' to the SQL.

If you need more performance: drop the database (for on-line work) and handle the persistence direct. Adding caching is not going to help you with a TimesTen DB, it just adds an extra copy (slowing you down).
You might want to take a look at GemFire.

There is a lot of good advice already in here that I won't repeat. The only thing I didn't see suggested that might work for you is caching reference data in memory.
I have done quite a bit of this in the past and it does save a lot of time. If you have a large number of fairly static reference tables, load them all into memory at startup time and refresh them every couple minutes. That way you're not hitting the DB over and over again for data that never changes.

Related

JPA performance optimization or alternatives

We are currently in a project with a high demand on performance when it comes to reads from the database.
We are currently using JPA (EclipseLink implementation), currently just because it provides convenient database access and column mapping.
For our queries we are using highly specific SQL queries. We are also using one database (SAP HANA, in-memory), so a language abstraction is not required. The database access is pretty fast, our current bottleneck really is the application server, especially the persistence layer.
The result sets often also do not contain entities because entities are made up of the context. For us, there is no point in using an #Id field like the following, because we don't have fields that are unique (only combinations, but defining an IdClass is too much overhead).
#Entity
public class Item {
#Id
public myField;
// other fields...
}
This seems to be enforced by JPA if I want to run a typed native query. Is that assumption true? Currently we haven't found a way around the ID mapping.
Are these findings valid?
If not, how can we make our use of JPA more performant (there is significant latency compared to plain JDBC), also without defining an #Id (because it is useless in our case) for result types?
If yes, is there another Java library that just provides a minimum layer on top of JDBC without too much latency that provides a more convenient use than plain JDBC (with column mapping and all that good stuff).
Thanks!
Usecase: We would like to stream historic GPS sensor data from the database. Besides just transforming this to JSON, we also do some transformations/validations. That's why we actually need to build objects. So what we basically looking for is a convenient way of mapping the fields of select statements to attributes. I hope that makes sense.

There are many articles and blogs about improving EclipseLink/JPA performance that you might look into, such as EclipseLink Performance, JPA Performance Tuning and Optimizing the EclipseLink Application
In the end though it all depends very much on your specific use case and any future use cases you may want. JPA is designed to make reading and writing overtop of JDBC easier and more maintainable and adds many performance benefits such as caching. If all you are using it for is to read raw data though, the extra layer might be extra overhead that isn't adding any value. There isn't much point to having JPA build you entities from the resultsets, maintain the cache and watch for changes only for your application to ignore it all and grab the raw data.
I do not understand why you would have an Item table with a single myField. How is it used by the application and how does it relate to other tables and potential entities?
Such a construct is not the normal use case for relational databases and ORMs, but there are still ways around it in JPA. The data could be used in element collections by other entities, or even just not mapped, and native SQL queries used which are passed straight through the JDBC layer. EclipseLink itself has many mapping types and options above and beyond JPA that might be used depending on your use cases.

Disadvantages of Object Relational Mapping

I am a fan of ORM - Object Relational Mapping and I have been using it with Rails for the past year and a half. Prior that, I use to write raw queries using JDBC and make Database do the heavy lifting via Stored Procedures. With ORM, I was initially happy to do stuff like coach.manager and manager.coaches which were very simple and easy to read.
But as time went by there were in-numerous associations creeping up and I ended up doing a.b.c.d which were firing queries in all directions, behind the scenes. With rails and ruby, the garbage collector went nuts and took insane time to load a very complex page which involves relatively lesser data. I had to replace this ORM style code by a simple Stored procedure and the result I saw was enormous. A page that took 50 seconds to load now takes only 2 seconds.
With this huge difference, should I continue using ORM? It is very clear it has severe overheads compared to a raw query.
In general, what are the general pitfalls of using an ORM framework like Hibernate, ActiveRecord?

An ORM is only a tool. If you don't use it correctly, you'll have bad results.
Nothing stops you from using dedicated HQL/criteria queries, with fetch joins or projections, to return the information that your page must display in as few queries as possible. This will take more or less the same time as dedicated SQL queries.
But of course, if you just get everything by ID and navigate through your objects without realizing how many queries it generates, it will lead to long loading times. The key is to know exactly what the ORM does behind the scene, and decide if it's appropriate or if another strategy must be adopted.

I think you've already identified the major tradeoff associated with ORM software. Every time you add a new layer of abstraction that tries to provide a generalized implementation of something that you used to do by hand there is going to be some loss of performance/efficiency.
As you noted, traversing multiple relationships such as a.b.c.d can be inefficient, because most ORM software will be doing an independent database query for each . along the way. But I'm not sure that means you should eliminate ORM altogether. Most ORM solutions (or at least, certainly Hibernate) allow you to specify custom queries where you can bring back exactly what you want in a single database operation. This should be about as fast as your dedicated SQL.
Really the issue is about understanding how the ORM layer is working behind the scenes, and realizing that while something like a.b.c.d is simple to write, what it causes the ORM layer to do as it is evaluated is not. As a general rule I always go with the simplest possible approach to begin, and then write optimized queries in areas where it makes sense/where it is obvious that the simple approach will not scale.

I'd say, one should use the appropriate tool for different tasks.
E.g., for CRUD operations, ORM frameworks like Hibernate can speed up development and it will perform well enough. Sometimes you need to do some necessary tweaks to achieve acceptable performance. I'm not sure, your task (what took 50 sec with Hibernate) could not be done properly with Hibernate, because you did not provide us with the details.
On the other hand, for example bulk operations involving hundreds of thousands of records is not the type of task you'd expect Hibernate will do without significant performance penalty.

As it was mentioned already, ORM is only a tool and you can use it eiter good or bad.
One of the most typical performance problems in ORMs is 1+N queries problem. It is caused by loading additional objects for each of objects from the list. This is caused by eager fetch of 1-to-n-relation entities for each element on list, the dealing is using HQL queries, specifying fields in projection or marking fetching 1-to-n relations to lazy.
Any time, you must exactly know what the ORM is doing in order to achieve good performance. Not understanding what operations are done in background is a way to disaster (slow, buggy and hard to analyze code because of unnecessary and wrongly written work-arounds).

I'm with Petar from your comments regarding the lazy fetching. Say you have an html table filled fields from object a.b.c.d. You could find your framework round-tripping the database thousands of times(possibly many more) . The disadvantage of ORM in this case is you have to read the documentation thoroughly. Most frameworks support disabling lazy fetching and many even support adding your own processing logic to bind the data set.
The net out is that almost any ORM is almost undoubtedly better than anything you are going to write yourself. You will find yourself saddled with maintaining huge libraries of boilerplate or worse writing the same code over and over again.

We are currently investigating to switch from our own data store layer with clean separation of transfer objects and data access objects to JPA. We used a generator to create the TOs, the DAOs and the SQL DDL as well from some documentation in docbook format. By this all of our stuff from documentation, the database structure and the generated Java classes where always in sync with a good documentation of the database itself.
What we discovered so far by using JPA:
Foreign key references cannot be used for imports, some special
queries and so on because they must not be placed in a managed
entity. JPA only allows the target class there.
Access to some user session scope is difficult upto impossible. We
still have no clue how to get the users id into the column
'userWhoLastMadeAnUpdate' in some PrePersist method.
Something expected to be quite easy with an ORM, namely "class
mapping" does not work at all. We are using HalDateTime
(http://sourceforge.net/projects/haldatetime/) internally.
Especially in the client. Mapping it with JPA directly is not
possible although HalDateTime supports it. Due to JPA restrictions
we have to use two fields in the entity.
JPA uses either one XML file to describe the mapping. So you have to
look at least into two files to even understand the relationship
between the Java class and the database. And the XML file becomes
huge for large applications.
Alternatively ORMs provide annotations in the Java class itself. So
its easier to learn and understand the relationship. But it forces
you to see all that database stuff in the client layer (which
completely breaks a proper layering).
You will have to restrict yourself to stay as close to a clean
database structure as anyhow possible. Otherwise you will for sure
end up with a mess of queries and statements by the ORM.
Use an ORM which provides a query language which is close to SQL
itself (JPA seems quite acceptable here). An ORM induced language
makes supporting a large application really expensive.

Strategies for performance optimizations on an inherited EJB3 application

I was asked to have a look at a legacy EJB3 application with significant performance problems. The original author is not available anymore so all I've got is the source code and some user comments regarding the unacceptable performance. My personal EJB3 skill are pretty basic, I can read and understand the annotated code but that's all until know.
The server has a database, several EJB3 beans (JPA) and a few stateless beans just to allow CRUD on 4..5 domain objects for remote clients. The client itself is a java application. Just a few are connected to the server in parallel. From the user comments I learned that
the client/server app performed well in a LAN
the app was practically unusable on a WAN (1MBit or more) because read and update operations took much too long (up to several minutes)
I've seen one potential problem - on all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations, is it advisable to start tuning with the fetching strategies?
But that would not explain performance issues on update operations, or would it? Update is handled by an EntityManager, the client just passes the domain object to the manager bean and persisting is done with nothing but manager.persist(obj). Maybe the domain objects that are sent to the server are just too big (maybe a side effect of the EAGER strategy).
So my actual theory is that too many bytes are sent over a rather slow network and I should look at reducing the size of result sets.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?

On all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations?
Depending on the relations betweens classes, you might be fetching much more (the whole database?) than actually wanted when retrieving entities?
is it advisable to start tuning with the fetching strategies?
I can't say that making all relations EAGER is a very standard approach. To my experience, you usually keep them lazy and use "Fetch Joins" (a type of join allowing to fetch an association) when you want to eager load an association for a given use case.
But that would not explain performance issues on update operations, or would it?
It could. I mean, if the app is retrieving a big fat object graph when reading and then sending the same fat object graph back to update just the root entity, there might be a performance penalty. But it's kinda weird that the code is using em.persist(Object) to update entities.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
The obvious ones include:
Retrieving more data than required
N+1 requests problems (bad fetching strategy)
Poorly written JPQL queries
Non appropriate inheritance strategies
Unnecessary database hits (i.e. lack of caching)
I would start with writing some integration tests or functional tests before touching anything to guarantee you won't change the functional behavior. Then, I would activate SQL logging and start to look at the generated SQL for the major use cases and work on the above points.

From DBA position.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
Turn off caching
Enable sql logging Ejb3/Hibernate generates by default a lots of extremely stupid queries.
Now You see what I mean.
Change FetchType.EAGER to FetchType.LAZY
Say "no" for big business logic between em.find em.persist
Use ehcache http://ehcache.org/
Turn on entity cache
If You can, make primary keys immutable ( #Column(updatable = false, ...)
Turn on query cache
Never ever use Hibernate if You want big performance:
http://www.google.com/search?q=hibernate+sucks

I my case a similar performance problem wasn't depending on the fetch strategy. Or lets say it was not really possible to change the business logic in the existing fetch strategies. In my case the solution was simply adding indices.
When your JPA Object model have a lot of relationsships (OneToOne, OneToMany, ...) you will typical use JPQL statements with a lot of joins. This can result in complex SQL translations. When you take a look at the datamodel (generated by the JPA) you will recognize that there are no indices for any of your table rows.
For example if you have a Customer and a Address object with an oneToOne relationship everything will work well on the first look. Customer and Address have an foreign key. But if you do selections like this
Select c from Customer as c where c.address.zip='8888'
you should take care about your table column 'zip' in the table ADDRESS. JPA will not create such an index for you during deployment. So in my case I was able to speed up the database performance by simply adding indices.
An SQL Statement in your database looks like this:
ALTER TABLE `mydatabase`.`ADDRESS` ADD INDEX `zip_index`(`IZIP`);

In the question, and in the other answers, I'm hearing a lot of "might"s and "maybe"s.
First find out what's going on. If you haven't done that, we're all just poking in the dark.
I'm no expert on this kind of system, but this method works on any language or OS.
When you find out what's making it take too long, why don't you summarize it here?
I'm especially interested to know if it was something that might have been guessed.

hibernate object vs database physical model

Is there any real issue - such as performance - when the hibernate object model and the database physical model no longer match? Any concerns? Should they be keep in sync?
Our current system was original designed for a low number of users so not much effort was done to keep the physical and objects in sync. The developers went about their task and the architects did not monitor. Now that we are in the process of rewriting/importing the legacy system into the new system, a concern has been raised in that the legacy system handles a lot of user volume and might bring the new system to its knees.
Update 20090331
From Pete's comments below - the concern was about table/data relationships in the data layer vs the object layer. If there is no dependencies between the two, then there is no performance hits if these relationships do not match? Is that correct?
The concern from my view is that the development team spends a lot of time "tuning" the hibernate queries/objects but nothing at the database layer to improve the performance of the application. I would have assumed that they would tune at both layers.
Could these issue be from just a poor initial design of the database to begin with and trying to cover/make up the difference by the use of Hibernate?
(I am new to this project so playing catchup)

Update: in response to comment: It is CRUCIAL that the database be optimized in addition to the Hibernate use. When you think about it, after all the work hibernate does, in the end it is just querying the database. If the database doesn't perform well (wrong or missing indexes, poorly set up table spaces, etc) it doesn't matter how much you tune Hibernate. On the flip side if your database is set up well but Hibernate isn't (perhaps the caching is not set up properly, etc., and you are going back to the database a lot more then you need to) then performance will suffer as well. It is always important to tune the system end to end, but start at the foundation (database) and work up.
End Update
I'm curious what you mean about 'don't match' - do you mean columns have been added to tables that aren't represented in the hibernate data objects? Tables have been added? I don't think anything like that would affect performance (more likely data integrity if you are not inserting/updating all columns)
In general, the goal of the object model should NOT be match the database schema verbatim. You want to abstract the underlying data complexity / joins / normalization, that is the whole point of using something like Hibernate.
So for example lets say you have (keeping things very simple) 'orders' and 'order items',
your application code should be able to do something like
order.getItems()
without having to know that underneath it is a one to many relationship. The details in your hibernate code control how the load is done (lazy, caching, etc).
If that doesn't answer your question then please provide more detail

You could of course code your abstraction layer in asm - "might" (awful word for a developer) be faster.
This is premature optimization - maybe breaking a clean project-layout.
As in the hibernate-manual - optimization can look different ways - plain coding some parts "might" be part of it.

It's certainly possible that the changes you describe could cause performance problems.
I would have thought that this should have been part of the design spec.
So when you're coding it, you bear the performance critiera in mind.
The only way to really know though is to load the data onto a test environment, and run some tests.
This should definately be done before going live, as it might produce some quite interesting results.

When can/should you go whole hog with the ORM approach?

It seems to me that introducing an ORM tool is supposed to make your architecture cleaner, but for efficiency I've found myself bypassing it and iterating over a JDBC Result Set on occasion. This leads to an uncoordinated tangle of artifacts instead of a cleaner architecture.
Is this because I'm applying the tool in an invalid Context, or is it deeper than that?
When can/should you go whole hog with the ORM approach?
Any insight would be greatly appreciated.
A little of background:
In my environment I have about 50 client computers and 1 reasonably powerful SQL Server.
I have a desktop application in which all 50 clients are accessing the data at all times.
The project's Data Model has gone through a number of reorganizations for various reasons including clarity, efficiency, etc.
My Data Model's history
JDBC calls directly
DAO + POJO without relations between Pojos (basically wrapping the JDBC).
Added Relations between POJOs implementing Lazy Loading, but just hiding the inter-DAO calls
Jumped onto the Hibernate bandwagon after seeing how "simple" it made data access (it made inter POJO relations trivial) and because it could decrease the number of round trips to the database when working with many related entities.
Since it was a desktop application keeping Sessions open long term was a nightmare so it ended up causing a whole lot of issues
Stepped back to a partial DAO/Hibernate approach that allows me to make direct JDBC calls behind the DAO curtain while at the same time using Hibernate.

Hibernate makes more sense when your application works on object graphs, which are persisted in the RDBMS. Instead, if your application logic works on a 2-D matrix of data, fetching those via direct JDBC works better. Although Hibernate is written on top of JDBC, it has capabilities which might be non-trivial to implement in JDBC. For eg:
Say, the user views a row in the UI and changes some of the values and you want to fire an update query for only those columns that did indeed change.
To avoid getting into deadlocks you need to maintain a global order for SQLs in a transaction. Getting this right JDBC might not be easy
Easily setting up optimistic locking. When you use JDBC, you need to remember to have this in every update query.
Batch updates, lazy materialization of collections etc might also be non-trivial to implement in JDBC.
(I say "might be non-trivial", because it of course can be done - and you might be a super hacker:)
Hibernate lets you fire your own SQL queries also, in case you need to.
Hope this helps you to decide.
PS: Keeping the Session open on a remote desktop client and running into trouble is really not Hibernate's problem - you would run into the same issue if you keep the Connection to the DB open for long.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.