Why does query caching with Hibernate make the query ten times slower?

I'm currently experimenting with EJB3 as a prestudy for a major project at work. One of the things I'm looking into is query caching.
I've made a very simple domain model with JPA annotations, a @Local business interface and a @Stateless implementation in an EJB-JAR, deployed in an EAR together with a very simple webapp to do some basic testing. The EAR is deployed in JBoss 5.0.1 default config with no modifications. This was very straightforward, and worked as expected.
However, my latest test involved query caching, and I got some strange results:
I have a domain class that only maps an ID and a String value, and have created about 10000 rows in that particular table
In the business bean, there's a very simple query, SELECT m FROM MyClass m
With no cache, this executes in about 400ms on average
With query cache enabled (through hints on the query), the first execution of course takes a little longer, about 1200ms. The next executions take 3500ms on average!
This puzzled me, so I enabled Hibernate's show_sql to look at the log. Uncached, and on the first execution with cache enabled, there is one SELECT logged, as expected. When I should get cache hits, Hibernate logs one SELECT for each row in the database table.
That would certainly explain the slow execution time, but can anyone tell me why this happens?
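(For context, a minimal sketch of how such a query-cache hint is typically set on a JPA query - the variable names are illustrative, not the asker's exact code:)

// In the @Stateless bean (em is the injected EntityManager; javax.persistence.Query):
Query query = em.createQuery("SELECT m FROM MyClass m");
query.setHint("org.hibernate.cacheable", true); // Hibernate's standard query-cache hint
List<MyClass> result = query.getResultList();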

The way that the query cache works is that it only caches the IDs of the objects returned by the query. So, your initial SELECT statement might return all the objects, and Hibernate will give them back to you and remember the IDs.
The next time you issue the query, however, Hibernate goes through the list of IDs and realizes it needs to materialize the actual data. So it goes back to the database to get the rest - and it does one SELECT per row, which is exactly what you are seeing.
Now, before you think, "this feature is obviously broken", the reason it works this way is that the Query Cache is designed to work in concert with the Second Level Cache. If the objects are stored in the L2 cache after the first query, then Hibernate will look there instead to satisfy the per-ID requests.
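For example, a hedged sketch of what that combination usually looks like - the entity below is illustrative, and it assumes a cache provider plus hibernate.cache.use_second_level_cache and hibernate.cache.use_query_cache are enabled in the configuration:

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// With the entity in the second-level cache, the per-ID lookups issued by the
// query cache are served from memory instead of one SELECT per row.
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
public class MyClass {
    @Id
    private Long id;
    private String value;
    // getters/setters omitted
}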
I highly recommend you pick up the book Java Persistence with Hibernate to learn more about this. Chapter 13 in particular covers optimizing queries, and how to use the cache effectively.

Related

How to inspect data within a database transaction?

I'm running an integration test that executes some Hibernate code within a single transaction (managed by Spring). The test is failing with a duplicate key violation and I'd like to hit a breakpoint just before this and inspect the table contents. I can't just go into MySQL Workbench and run a SELECT query as it would be outside the transaction. Is there another way?
After reading your comments, my impression is that you are predominantly interested in how to hit a breakpoint and at the same time be able to examine the database contents. Under normal circumstances I would just suggest logging the SQL. With the breakpoint in mind, my suggestion is:
Reduce the isolation level to READ_UNCOMMITTED for the integration test.
Reducing the isolation level will allow you to see the uncommitted values in the database during debugging. As long as you don't have parallel activity within the integration test, it should be fine.
The isolation level can be set on a per-connection basis; there is no need for anything to be done on the server.
One side note: if you are using Hibernate, even parallel activities may work fine when you reduce the isolation level, because Hibernate largely behaves as if it were in REPEATABLE_READ thanks to its transactional level-1 cache.
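A hedged sketch of how that might look with Spring-managed test transactions (class name illustrative; whether the isolation attribute is honoured depends on how the test transaction is created in your setup):

import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Transactional;

// Run the test transaction at the reduced isolation level suggested above.
// Depending on the database, the inspecting connection may also need to allow dirty reads.
// Per-connection alternative: connection.setTransactionIsolation(
//     java.sql.Connection.TRANSACTION_READ_UNCOMMITTED);
@Transactional(isolation = Isolation.READ_UNCOMMITTED)
public class MyDaoIntegrationTest {
    // ... test methods exercising the Hibernate code ...
}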
The following can be run from Eclipse's "Display" view:
java.util.Arrays.deepToString(
        em.createNativeQuery("SELECT mystuff FROM mytable").getResultList().toArray())
    .replace("], ", "]\n");
This displays all the data, albeit not in a very user-friendly way - e.g. you will need to work out which columns the comma-separated fields correspond to.

JDBC Query Caching and Precaching

Scenario:
I have a need to cache the results of database queries in my web service. There are about 30 tables queried during the cycle of a service call. I am confident data in a certain date range will be accessed frequently by the service, and I would like to pre-cache that data. This would mean caching around 800,000 rows at application startup; the data is read-only. The data does not need to be dynamically refreshed; this is reference data. The cache can't be loaded on each service call, there's simply too much data for that. Data outside of this 'frequently used' window is not time-critical and can be lazy-loaded. Most queries would return 1 row, and none of the tables have a parent/child relationship to each other, though there will be a few joins. There is no need for dynamic SQL support.
Options:
I intended to use myBatis, but there isn't a good method to warm up the cache. myBatis can't understand that the service query select * from table where key = ? is already covered by the startup pre-cache query select * from table.
As far as I understand it (documentation overload), Hibernate has the same problem. Additionally, these tables were designed with composite keys and no primary key, which is an extra hassle for Hibernate.
Question:
Preferred: Is there a myBatis solution for this problem? I'd very much like to use it. (Familiarity, simplicity, performance, funny name, etc.)
Alternatively: Is there an ORM or DB-friendly cache that offers what I'm looking for?
You can use a distributed caching solution like NCache or TayzGrid, which provide indexing and query features along with a cache startup loader.
You can configure indexes on attributes of your entities in the cache. A cache startup loader can be configured to load all data from the database into the cache at startup. While loading the data, the cache will create indexes for all entities in memory.
The Object Query Language (OQL) feature, which provides queries similar to SQL, can then be used to query the in-memory data.
The variety of options for third-party products (free and paid) is too broad and too dependent on your particular requirements and operational capabilities to try to "answer" here.
However, I will suggest an alternative to an explicit cache of your read-only data.
You clearly believe that the memory footprint of your dataset will fit into RAM on a reasonably sized server. My suggestion is that you use your database engine directly (no additional external cache), but configure the database with an internal cache large enough to hold your whole dataset. If all of your data is resident in the database server's RAM, it will be accessed very quickly.
I have used this technique successfully with MySQL, but I expect the same applies to all major database engines. If you cannot figure out how to configure your chosen database appropriately, I suggest that you ask a separate, detailed question.
You can warm the cache by executing representative queries when you start your system. These queries will be relatively slow because they have to actually do the disk I/O to pull the relevant blocks of data into the cache. Subsequent queries that access the same blocks of data will be much faster.
This approach should give you a huge performance boost with no additional complexity in your code or your operational environment.
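A hedged sketch of such a warm-up with plain JDBC - the table, column and date below are placeholders for whatever "representative" means in your data:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class CacheWarmer {

    // Scan the frequently-used date range once at startup so the database engine
    // pulls those blocks into its internal cache; later point lookups then hit RAM.
    public static void warmUp(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM reference_table WHERE effective_date >= '2013-01-01'")) {
            while (rs.next()) {
                rs.getObject(1); // touch a column so the driver actually fetches the rows
            }
        }
    }
}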
Sormula may do what you want. You would need to annotate each POJO to be cached, like:
@Cached(type=ReadOnlyCache.class)
public class SomePojo {
    ...
}
Pre-populate the cache by invoking the selectAll method for each:
Database db = new Database(/* one of the JNDI constructors */);
Table<SomePojo> t = db.getTable(SomePojo.class);
t.selectAll(); // selecting everything populates the read-only cache
The key is that the cache is stored in the Table object, t. So you would need to keep a reference to t and use it for subsequent queries. Or in the case of many tables, keep reference to database object, db, and use db.getTable(...) to get tables to query.
See javadoc and tests in org.sormula.tests.cache.readonly package.

Hibernate Second Level Cache In Case of Soft Delete

Read operations are very high compared to insert/update/delete for our master data module. We have been using JDBC for read, write and update operations until now. We do a soft delete (marking the IS_DELETED column 'Y') on delete. All write/update methods are synchronized to handle concurrency. We are using Oracle and have no plans to support multiple databases.
Now we are planning to cache the data, and we also have plans to go for clustering.
The easiest option we have is to change the insert/update/delete methods and use something like Ehcache to manage the cache as per our requirements, handle concurrency in the clustered environment by using a version column in the table, and remove the synchronized keyword.
The other option that people around me are suggesting (in fact asking me to do) is to move to Hibernate (I don't know much about Hibernate), which will take care of caching and concurrency automatically.
Here are my doubts:
1) Is it worth changing the complete DAO code, given that we have around 200 tables to manage for the master data?
2) Would the Hibernate second-level cache help in this case, given that we need to filter the cached data again to discard deleted rows? Or is there a mechanism in Hibernate (or any other way) by which we can perform the update operation in the database but a delete operation in the cached data?
3) We have exposed data transfer objects to other modules, having all the fields of the table, with the primary key stored in separate PK objects (i.e. the primary key fields live in a separate object), and we don't have reference DOs in them (there are no composite DOs). Given that we can't afford to change the exposed methods and DO structure, do we have to pack the Hibernate cached entities' data into our DOs again? Or can we reuse the old DO structure as the Hibernate entity (as per my understanding, the PK columns should be directly in the Hibernate entity rather than in some composite object)? I mention composite DOs because we also have a dependent-dropdown requirement which could have used Hibernate lazy loading for the child objects if we had composite DOs in the first place. The argument against this is to provide new methods which would use cached data and deprecate the old methods; other modules would slowly migrate as their need for caching arises, but we would have maintenance issues, as we would have to maintain both sets of methods in case of DB changes. Also, doubts 1 and 2 would still remain.
I am sure that Hibernate is not the way to go for us at this stage, and I have to convince the people around me, but I want to know your views on the long-term advantages of moving to Hibernate - other than automatic management of the second-level cache, concurrency handling (which we can do with a small code change in a common place) and database independence (which we are not interested in) - at the cost of changing the complete code.
If you plan to migrate to Hibernate, you should take into account:
1) You'll need to map your whole structure to POJOs (if you have not already).
2) Rewrite all DAOs to use Hibernate (bear in mind that the Hibernate QL/Criteria API has certain limitations).
3) Be ready to fight lazy initialization problems and so on...
Personally, I don't think it's worth migrating to Hibernate with a working model unless the current model is extremely painful to maintain.
Concerning your questions 2 and 3:
2) The second-level cache holds only loaded instances, accessed by primary key; i.e. if you call hibernateSession.load(User.class, 10), it will look up the User object in the second-level cache using id=10. If I understand correctly, that's not your case. Most of the time you want to load your data using a more complex query - in that case you will need the StandardQueryCache, which maps your query string to a list of loaded IDs, which in turn are retrieved from the second-level cache. But if you have a lot of queries with low similarity, both the StandardQueryCache and the second-level cache will be totally useless (take a look at http://darren.oldag.net/2008/11/hibernate-query-cache-dirty-little_04.html).
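A hedged sketch of the two lookup paths described above, in plain Hibernate API (the User entity and query are illustrative):

import java.util.List;

import org.hibernate.Session;

public class CacheLookupExample {

    @SuppressWarnings("unchecked")
    static List<User> lookUp(Session session) {
        // Primary-key access: this is what the second-level cache can answer directly.
        User byId = (User) session.get(User.class, 10L);

        // Arbitrary query: needs the query cache (StandardQueryCache) to map the query
        // string + parameters to a list of IDs, which are then resolved via the L2 cache.
        List<User> byQuery = session.createQuery("from User u where u.name like :name")
                .setParameter("name", "J%")
                .setCacheable(true)
                .list();
        return byQuery;
    }
}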
3) You can use components and such, but I'm not sure about your DTO structure.
Hope that helps

How to make unit tests for DAO classes less brittle in the absence of a static test database?

Here's the scenario:
I am working on a DAO object which uses the Hibernate Criteria API to form a number of complicated queries that perform certain tasks on the database (keyword search across multiple fields, for example).
We need to unit test this to ensure that the generated query is correct for various scenarios. One way of testing it - which could be preferable - would be to check that the Hibernate criteria is created correctly, by inspecting it at the end and mocking the database interaction. However, this is not desirable: firstly it's kind of cheating (it merely duplicates what the code is doing), and it also doesn't check whether the criteria itself causes Hibernate to barf, or causes issues when it goes to the database.
The option to use, then, is to run the query against a test database. However, for historical reasons there is no static test database (one that could be checked in as part of the code, for example), and the remit of my project does not allow me to embark on creating one; we have to be content with testing against a shared development database that's periodically refreshed with production data.
When these refreshes happen, the data behind the tests could change too, and this would make our unit tests brittle. We can get around it by not using exact numbers in tests, but it's not really adequate testing that way.
The question is then: what do people do in cases like this to make tests less brittle? One option that I have in mind is to run a native SQL query that does the same thing (behaviourally - it doesn't have to be exactly the same as the query generated by Hibernate) to get the expected number, and then run the DAO version to see if it matches. This way, the expected behaviour of the query is always expressed by the initial native SQL and you will always have the correct numbers.
Any feedback on this or other ideas on how to manage this situation would be greatly appreciated.
A.
UPDATE:
With regards to the hsqldb/h2/derby suggestions, I am familiar with them, but the company is not ready to go down that route just yet, and doing it piecemeal on just one test case won't be suitable.
With regards to my earlier suggestion I would like to elaborate a bit more - consider this scenario:
I want to ensure that my relatively complicated keyword search returns 2100 matches for "John Smith".
In order to find the expected number, I would have analyzed my database and found the number using a SQL query. What is the downside of having that query as part of the test, so that you always know you are testing the behaviour of the criteria?
So basically the question is: if for some reason you could not have a static data set for testing, how would you perform your integration tests in a non-brittle way?
One approach could be to use an in-memory database like Apache Derby or HSQLDB, and prepopulate it with data before the test starts using DBUnit.
UPDATE: Here is a nice article about the approach.
I agree with Andrey and Bedwyr that the best approach in the long term is to create an HSQLDB database specifically for testing. If you don't have the option of doing that, then your solution seems like an appropriate one. You can't test everything, but you don't want to test nothing either. I've used this approach a few times for testing web services against integration databases etc. But remember that this database has to be maintained as well, e.g. if you add new columns.
You have to decide what you're trying to test. You don't want to test Hibernate, and you don't want to test that the database is giving you what you've asked for (in terms of SQL). In your tests, you can assume that Hibernate works, as does the database.
You say:
We need to unit test this to ensure that the generated query is correct for various scenarios. One way of testing it - which could be preferable - would be to check that the Hibernate criteria is created correctly, by inspecting it at the end and mocking the database interaction. However, this is not desirable: firstly it's kind of cheating (it merely duplicates what the code is doing), and it also doesn't check whether the criteria itself causes Hibernate to barf, or causes issues when it goes to the database.
Why should Hibernate barf on the criteria you give it? Because you're giving it the wrong criteria. This is not a problem with Hibernate, but with the code that is creating the criteria. You can test that without a database.
It has problems when it gets to the database? Hibernate, in general, creates SQL that is appropriate to the criteria and database dialect you give it, so again, any problem is with the criteria.
The database does not match what hibernate is expecting? Now you are testing that the criteria and the database are aligned. For this you need a database. But you're not testing the criteria any more, you're testing that everything is aligned, a different sort of test.
So actually, it seems to me you're doing an integration test, that the whole chain from the criteria to the structure of the database works. This is a perfectly valid test.
So, what I do in my tests is create another connection to the database (JDBC) to get information. I execute SQL to get the number of rows etc., or check that an insert has happened.
I think your approach is a perfectly valid one.
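A hedged sketch of that cross-check in a JUnit test - the DAO, table and SQL below are placeholders for whatever the real keyword search looks like:

import static org.junit.Assert.assertEquals;

import javax.persistence.EntityManager;

import org.junit.Test;

public class PersonDaoSearchTest {

    private PersonDao personDao;   // wired up elsewhere in the real test
    private EntityManager em;      // a second route to the same database

    @Test
    public void keywordSearchMatchesNativeCount() {
        // The expected value is computed from the same (moving) data set the DAO queries,
        // so a refresh of the shared development database cannot break the test.
        Number expected = (Number) em.createNativeQuery(
                "SELECT COUNT(*) FROM person "
                + "WHERE first_name LIKE '%John%' OR last_name LIKE '%Smith%'")
                .getSingleResult();

        int actual = personDao.searchByKeyword("John Smith").size();

        assertEquals(expected.intValue(), actual);
    }
}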
However, for historical reasons there is no static test database (one that could be checked in as part of the code, for example) and the remit of my project does not allow me to embark on creating one
All you need to do is fire up H2 or similar, put some entities in it and execute your integration tests. Once you've done this for a few tests, you should be able to extract a data setup utility that creates a schema with some test data that you can use for all the integration tests, if you feel the need.
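A hedged sketch of a test-only EntityManagerFactory backed by an in-memory H2 database - the persistence-unit name is illustrative, while the property keys are standard JPA/Hibernate settings:

import java.util.HashMap;
import java.util.Map;

import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public final class TestPersistence {

    // Each run gets a fresh schema, so every integration test starts from a known,
    // code-controlled data set instead of a shared development database.
    public static EntityManagerFactory createInMemoryFactory() {
        Map<String, String> props = new HashMap<String, String>();
        props.put("javax.persistence.jdbc.driver", "org.h2.Driver");
        props.put("javax.persistence.jdbc.url", "jdbc:h2:mem:test;DB_CLOSE_DELAY=-1");
        props.put("hibernate.dialect", "org.hibernate.dialect.H2Dialect");
        props.put("hibernate.hbm2ddl.auto", "create-drop");
        return Persistence.createEntityManagerFactory("test-pu", props);
    }
}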

When can/should you go whole hog with the ORM approach?

It seems to me that introducing an ORM tool is supposed to make your architecture cleaner, but for efficiency I've found myself bypassing it and iterating over a JDBC Result Set on occasion. This leads to an uncoordinated tangle of artifacts instead of a cleaner architecture.
Is this because I'm applying the tool in an invalid Context, or is it deeper than that?
When can/should you go whole hog with the ORM approach?
Any insight would be greatly appreciated.
A little of background:
In my environment I have about 50 client computers and 1 reasonably powerful SQL Server.
I have a desktop application in which all 50 clients are accessing the data at all times.
The project's Data Model has gone through a number of reorganizations for various reasons including clarity, efficiency, etc.
My Data Model's history
JDBC calls directly
DAO + POJO without relations between Pojos (basically wrapping the JDBC).
Added Relations between POJOs implementing Lazy Loading, but just hiding the inter-DAO calls
Jumped onto the Hibernate bandwagon after seeing how "simple" it made data access (it made inter POJO relations trivial) and because it could decrease the number of round trips to the database when working with many related entities.
Since it was a desktop application, keeping Sessions open long-term was a nightmare, so it ended up causing a whole lot of issues.
Stepped back to a partial DAO/Hibernate approach that allows me to make direct JDBC calls behind the DAO curtain while at the same time using Hibernate.
Hibernate makes more sense when your application works on object graphs, which are persisted in the RDBMS. By contrast, if your application logic works on a 2-D matrix of data, fetching it via direct JDBC works better. Although Hibernate is written on top of JDBC, it has capabilities that might be non-trivial to implement in JDBC. For example:
Say, the user views a row in the UI and changes some of the values and you want to fire an update query for only those columns that did indeed change.
To avoid getting into deadlocks you need to maintain a global order for the SQL statements within a transaction. Getting this right with plain JDBC might not be easy.
Easily setting up optimistic locking (see the annotation sketch below). When you use JDBC, you need to remember to handle this in every update query.
Batch updates, lazy materialization of collections, etc. might also be non-trivial to implement in JDBC.
(I say "might be non-trivial", because it of course can be done - and you might be a super hacker:)
Hibernate lets you fire your own SQL queries also, in case you need to.
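For the first and third points above, a hedged sketch of the corresponding annotations - the entity is illustrative; @DynamicUpdate is the Hibernate-specific annotation in newer versions, while @Version is standard JPA:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

import org.hibernate.annotations.DynamicUpdate;

// @DynamicUpdate makes Hibernate include only the columns that actually changed in the
// UPDATE statement; @Version adds optimistic locking without touching each update query.
@Entity
@DynamicUpdate
public class Account {

    @Id
    private Long id;

    @Version
    private int version;

    private String name;

    // getters/setters omitted
}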
Hope this helps you to decide.
PS: Keeping the Session open on a remote desktop client and running into trouble is really not Hibernate's problem - you would run into the same issue if you kept the Connection to the DB open for a long time.
