Optimization of using a database - java

Is it faster to get the whole object from the database and get needed attributes from the entity in Java app, or to get only needed attributes from the database?

It depends. The general rule is to minimize the number of round trips you make to the database. If the entity itself is what you want, you are probably better off loading the entire object from the DB. If instead you need just a couple of properties from a lot of records (for example, if you are drawing, say, a bar chart), query only those properties. So there is no single general rule beyond minimizing round trips without making any one query too heavy.
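As a rough illustration of the "few properties from many records" case, here is a minimal JDBC sketch that fetches one column for many ids in a single round trip instead of issuing one query per id. The table `person`, its columns `id` and `name`, and the class name `NameLookup` are all assumptions made up for this example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; assumes a table "person" with columns "id" and "name".
final class NameLookup {

    // Builds one SELECT with `count` placeholders so a single query covers all ids.
    static String selectNamesSql(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be >= 1");
        }
        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < count; i++) {
            placeholders.add("?");
        }
        return "SELECT id, name FROM person WHERE id IN " + placeholders;
    }

    // Reads only the needed column for every id in one round trip.
    static Map<Long, String> namesById(Connection con, List<Long> ids) throws SQLException {
        Map<Long, String> result = new LinkedHashMap<>();
        if (ids.isEmpty()) {
            return result; // nothing to ask the database for
        }
        try (PreparedStatement ps = con.prepareStatement(selectNamesSql(ids.size()))) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getLong("id"), rs.getString("name"));
                }
            }
        }
        return result;
    }
}
```

Very long id lists may need to be split into chunks, since databases limit the number of parameters per statement.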

Related

Auditing using Data tables vs Separate Audit tables

I am in the process of designing a new java application which has very strict requirements for auditing. A brief context is here:
I have a complex entity with multiple one-to-many nested relationships. If any field changes, I need to treat it as a new version of the object, and all of this needs to be audited as well. Now I have two options:
1.) Do not do any update operation, just insert a new entity whenever anything changes. This would require me to create all the relational objects (even if they have not been changed), as I do not want to hold references to any previous version objects. My data tables become my auditing tables as well.
OR
2.) Always do an update operation and maintain the auditing information in separate tables. That would add some more complexity in terms of implementation.
I would like to know if there is a good vs bad practice for any of these two approaches.
Thanks,
-csn
What should define your choice is your insert/update/read patterns for both the "live" data and the audits.
Most commonly these patterns are very different for the two kinds of data.
- Concerning "live" data, it depends a lot on your application, but I can imagine you have significant inserts, significant updates, and a lot of reads. Live data also requires transactionality and has many relationships between tables for which you need to keep consistency. It might require fast and complex searches, and therefore indexes on many columns.
- Audits have a lot of inserts, almost no updates, and few reads. Reading them doesn't require complex searches (e.g. you only consult audits and sort them by date) or indexes on many columns.
So with increased load and data size you will probably need to split the data and optimize the tables for your use cases.
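To make the split concrete, here is a small in-memory sketch of option 2 from the question: live data is updated in place, while every change is also appended to a separate, insert-only audit log. The class and field names are hypothetical, and plain collections stand in for the two sets of tables; in a real application both writes would happen in the same database transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for "live table + separate audit table".
final class AuditedStore {

    // One immutable audit row: which entity, which version, and the value it had.
    static final class AuditEntry {
        final long entityId;
        final int version;
        final String value;

        AuditEntry(long entityId, int version, String value) {
            this.entityId = entityId;
            this.version = version;
            this.value = value;
        }
    }

    private final Map<Long, String> live = new HashMap<>();      // updated in place
    private final Map<Long, Integer> versions = new HashMap<>(); // current version per entity
    private final List<AuditEntry> audit = new ArrayList<>();    // append-only

    // Insert or update the live value and record the new version in the audit log.
    void save(long entityId, String value) {
        int version = versions.merge(entityId, 1, Integer::sum);
        live.put(entityId, value);
        audit.add(new AuditEntry(entityId, version, value));
    }

    String current(long entityId) {
        return live.get(entityId);
    }

    // Audit reads are simple: filter by entity, already ordered by insertion.
    List<AuditEntry> history(long entityId) {
        List<AuditEntry> result = new ArrayList<>();
        for (AuditEntry entry : audit) {
            if (entry.entityId == entityId) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

The live map stays small and can be indexed for complex lookups, while the audit list only ever grows, which mirrors the different access patterns described above.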

JDBC Query Caching and Precaching

Scenario:
I need to cache the results of database queries in my web service. There are about 30 tables queried during the cycle of a service call. I am confident that data in a certain date range will be accessed frequently by the service, and I would like to pre-cache that data. This would mean caching around 800,000 rows at application startup; the data is read-only. The data does not need to be dynamically refreshed, since it is reference data. The cache can't be loaded on each service call because there's simply too much data for that. Data outside of this 'frequently used' window is not time critical and can be lazy loaded. Most queries would return 1 row, and none of the tables have a parent/child relationship to each other, though there will be a few joins. There is no need for dynamic SQL support.
Options:
I intended to use myBatis, but there isn't a good method to warm up the cache. myBatis can't understand that the service query select * from table where key = ? is already covered by the startup pre-cache query select * from table.
As far as I understand it (documentation overload), Hibernate has the same problem. Additionally, these tables were designed with composite keys and no primary key, which is an extra hassle for Hibernate.
Question:
Preferred: Is there a myBatis solution for this problem ? I'd very much like to use it. (Familiarity, simplicity, performance, funny name, etc)
Alternatively: Is there an ORM or DB-friendly cache that offers what I'm looking for ?
You can use a distributed caching solution like NCache or Tayzgrid, which provide indexing and query features along with a cache startup loader.
You can configure indexes on attributes of your entities in cache. A cache startup loader can be configured to load all data from database in cache at cache startup. While loading data, cache will create indexes for all entities in memory.
Object Query Language (OQL) feature, which provides queries similar to SQL can then be used to query in-memory data.
The variety of options for third-party products (free and paid) is too broad and too dependent on your particular requirements and operational capabilities to try to "answer" here.
However, I will suggest an alternative to an explicit cache of your read-only data.
You clearly believe that your dataset will fit into RAM on a reasonably sized server. My suggestion is that you use your database engine directly (no additional external cache), but configure the database with an internal cache large enough to hold your whole dataset. If all of your data resides in the database server's RAM, it will be accessed very quickly.
I have used this technique successfully with MySQL, but I expect the same applies to all major database engines. If you cannot figure out how to configure your chosen database appropriately, I suggest that you ask a separate, detailed question.
You can warm the cache by executing representative queries when you start your system. These queries will be relatively slow because they have to actually do the disk I/O to pull the relevant blocks of data into the cache. Subsequent queries that access the same blocks of data will be much faster.
This approach should give you a huge performance boost with no additional complexity in your code or your operational environment.
Sormula may do what you want. You would need to annotate each POJO to be cached like:
@Cached(type=ReadOnlyCache.class)
public class SomePojo {
...
}
Pre-populate the cache by invoking selectAll method for each:
Database db = new Database(one of the JNDI constructors);
Table<SomePojo> t = db.getTable(SomePojo.class);
t.selectAll();
The key is that the cache is stored in the Table object, t. So you would need to keep a reference to t and use it for subsequent queries. Or in the case of many tables, keep reference to database object, db, and use db.getTable(...) to get tables to query.
See javadoc and tests in org.sormula.tests.cache.readonly package.
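If none of the libraries above fit, the behavior described in the scenario (bulk pre-load at startup, single-row lookups served from memory, lazy loading for keys outside the hot window) could also be hand-rolled. The following is only a sketch under that assumption: `ReferenceCache` and its constructor arguments are hypothetical names, and the two loaders stand in for a `select * from table` at startup and a `select * from table where key = ?` on a miss.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-only cache: pre-loaded once, lazily filled on misses.
final class ReferenceCache<K, V> {

    private final Map<K, V> rows = new ConcurrentHashMap<>();
    private final Function<K, V> singleRowLoader;

    // `preloaded` would come from one bulk query at startup;
    // `singleRowLoader` would run the by-key query for rows outside that window.
    ReferenceCache(Map<K, V> preloaded, Function<K, V> singleRowLoader) {
        this.rows.putAll(preloaded); // ConcurrentHashMap rejects null keys and values
        this.singleRowLoader = singleRowLoader;
    }

    // Serves pre-loaded rows from memory; loads and remembers anything else.
    // If the loader returns null, nothing is cached and null is returned.
    V get(K key) {
        return rows.computeIfAbsent(key, singleRowLoader);
    }

    int size() {
        return rows.size();
    }
}
```

Because the data is read-only reference data, no invalidation logic is included; a cache like this would need one instance per table or per query shape.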

Custom hibernate entity persister

I am in the process of performance testing/optimizing a project that maps
a document <--> Java object tree <--> mysql database
The document, Java classes, database schema and logic for mapping is orchestrated with HyperJaxb3. The ORM piece of it is JPA provided by hibernate.
There are about 50 different entities and obviously lots of relationships between them. A major feature of the application is to load the documents and then reorganize the data into new documents; all the pieces of each incoming document eventually get sent out in one outgoing document. While I would prefer not to be living in the relational world, the transactional semantics are a very good fit for this application: there is a lot of money and government regulation involved, so we need to make sure everything gets delivered exactly once.
Functionally, everything is going well and performance is decent (after a fair amount of tweaking). Each document is made up of a few thousand entities which end up creating a few thousand rows in the database. The documents vary in size, and insert performance is pretty much proportional to the number of rows that need to be inserted (no surprise there).
I see the potential for a significant optimization, and this is where my question lies.
Each document is mapped to a tree of entities. The "leaf" half of the tree contains lots of detailed information that is not used in the decisions for how to generate the outgoing documents. In other words, I don't need to be able to query/filter by the contents of many of the tables.
I would like to map the appropriate entity sub-trees to blobs, and thus save the overhead of inserting/updating/indexing the majority of the rows I am currently handling the usual way.
It seems that my best bet is to implement a custom EntityPersister and associate it with the appropriate entities. Is this the right way to go? The hibernate docs are not bad, but it is a fairly complex class that needs to be implemented and I am left with lots of questions after looking at the javadoc. Can you point me to a concrete, yet simple example that I can use as a starting point?
Any thoughts about another way to approach this optimization?
I've run into the same problem with storing large amounts of binary data. The solution I found worked best is a denormalization of the object model. For example, I create a master record, and then I create a second object that holds the binary data. On the master, use the @OneToOne mapping to the secondary object, but mark the association as lazy. Now the data will only be loaded if you need it.
The one thing that might slow you down is the outer join that hibernate performs with all objects of this type. To avoid it, you can mark the object as mandatory. But if the database doesn't give you a huge performance hit, I suggest you leave it alone. I found that Hibernate has a tendency to load the binary data immediately if I tried to get a regular join.
Finally, if you need to retrieve a lot of the binary data in a single SQL call, use an HQL fetch join. For example: from Article a join fetch a.data where a.data is the one-to-one relationship to the binary holder. The HQL compiler will see this as an instruction to get all the data in a single SQL call.
HTH
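For the "map sub-trees to blobs" idea from the question, one simple way to produce the binary payload that such a secondary holder object would carry is plain Java serialization. This is only a sketch under that assumption (the class name `SubtreeBlob` is hypothetical, and it is not a custom EntityPersister); the resulting byte array could then be stored in a single binary column, for instance a `byte[]` field mapped as a LOB.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper: turns a "leaf" sub-tree into one blob and back.
final class SubtreeBlob {

    // Serializes the whole sub-tree (every node must be Serializable) into bytes.
    static byte[] toBlob(Serializable subtree) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(subtree);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restores the sub-tree; fails if the blob does not hold the expected type.
    static <T> T fromBlob(byte[] blob, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of stored sub-tree is not on the classpath", e);
        }
    }
}
```

The trade-off is the one the question already names: the contents of the blob can no longer be queried or filtered in SQL, and Java serialization ties the stored bytes to the class definitions, so a format such as the original XML fragment may be a safer payload over the long term.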

Strategies for One-to-Many type of association where "many" side entries are in millions

To give an analogy: a Twitter-like scenario in which a person can be followed by a huge number of people (one-to-many).
A few options I could think of:
Use some OR mapping tool with lazy loading. But when you access the "followers" side of the relation, it will still load all the data, even though lazily. So not a suitable option.
Do not maintain a one-to-many relation (or do not use any OR mapping). Fetch the "followers" side in a separate call and handle the paging etc. programmatically.
Offload Fetching of large data to some search stack (Lucene/Solr) which can better handle large data. But this will introduce some latency between database update and index update.
Please share your thoughts/suggestions and any possible tools library. Stack consists of Java , MySQL.
Millions should not be a problem for an RDBMS as it is designed for those situations.
Sometimes it is also recommended to denormalize rather than normalize to optimize the performance of your application. This is specifically for applications that have very high read and very low write statistics.
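The second option from the question (fetch the followers in a separate call and page programmatically) could look roughly like the sketch below. It uses keyset paging: each page asks for rows with an id greater than the last one seen, which MySQL can answer from an index without scanning skipped rows. The interface `PageFetcher`, the class `FollowerPaging`, and the table and column names in the comment are assumptions for illustration only.

```java
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical paging loop; the actual query is supplied by the caller.
final class FollowerPaging {

    // One page of follower ids greater than `afterId`, in ascending order.
    // A JDBC implementation might run something like:
    //   SELECT follower_id FROM follower
    //   WHERE user_id = ? AND follower_id > ? ORDER BY follower_id LIMIT ?
    interface PageFetcher {
        List<Long> fetch(long afterId, int limit);
    }

    // Streams all followers through `action`, holding only one page in memory.
    // Assumes ids are positive, so 0 works as the starting cursor.
    static long forEachFollower(PageFetcher fetcher, int pageSize, LongConsumer action) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        long lastSeen = 0;
        long total = 0;
        while (true) {
            List<Long> page = fetcher.fetch(lastSeen, pageSize);
            for (long id : page) {
                action.accept(id);
            }
            total += page.size();
            if (page.size() < pageSize) {
                return total; // a short page means there is nothing left
            }
            lastSeen = page.get(page.size() - 1);
        }
    }
}
```

A web UI would normally fetch one page per request rather than loop over all of them, but the cursor logic stays the same.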

Strategies for performance optimizations on an inherited EJB3 application

I was asked to have a look at a legacy EJB3 application with significant performance problems. The original author is no longer available, so all I've got is the source code and some user comments regarding the unacceptable performance. My personal EJB3 skills are pretty basic; I can read and understand the annotated code, but that's all for now.
The server has a database, several EJB3 beans (JPA) and a few stateless beans just to allow CRUD on 4..5 domain objects for remote clients. The client itself is a java application. Just a few are connected to the server in parallel. From the user comments I learned that
the client/server app performed well in a LAN
the app was practically unusable on a WAN (1MBit or more) because read and update operations took much too long (up to several minutes)
I've seen one potential problem - on all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations, is it advisable to start tuning with the fetching strategies?
But that would not explain performance issues on update operations, or would it? Update is handled by an EntityManager, the client just passes the domain object to the manager bean and persisting is done with nothing but manager.persist(obj). Maybe the domain objects that are sent to the server are just too big (maybe a side effect of the EAGER strategy).
So my actual theory is that too many bytes are sent over a rather slow network and I should look at reducing the size of result sets.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
On all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations?
Depending on the relations between classes, you might be fetching much more (the whole database?) than you actually want when retrieving entities.
is it advisable to start tuning with the fetching strategies?
I can't say that making all relations EAGER is a very standard approach. In my experience, you usually keep them lazy and use "fetch joins" (a type of join that fetches an association) when you want to eagerly load an association for a given use case.
But that would not explain performance issues on update operations, or would it?
It could. I mean, if the app is retrieving a big fat object graph when reading and then sending the same fat object graph back to update just the root entity, there might be a performance penalty. But it's kinda weird that the code is using em.persist(Object) to update entities.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
The obvious ones include:
Retrieving more data than required
N+1 requests problems (bad fetching strategy)
Poorly written JPQL queries
Inappropriate inheritance strategies
Unnecessary database hits (i.e. lack of caching)
I would start with writing some integration tests or functional tests before touching anything to guarantee you won't change the functional behavior. Then, I would activate SQL logging and start to look at the generated SQL for the major use cases and work on the above points.
From DBA position.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
Turn off caching
Enable SQL logging. EJB3/Hibernate generates a lot of extremely stupid queries by default.
Now You see what I mean.
Change FetchType.EAGER to FetchType.LAZY
Say "no" for big business logic between em.find em.persist
Use ehcache http://ehcache.org/
Turn on entity cache
If you can, make primary keys immutable (@Column(updatable = false, ...))
Turn on query cache
Never ever use Hibernate if You want big performance:
http://www.google.com/search?q=hibernate+sucks
In my case a similar performance problem did not depend on the fetch strategy. Or let's say it was not really possible to change the business logic in the existing fetch strategies. In my case the solution was simply adding indices.
When your JPA object model has a lot of relationships (OneToOne, OneToMany, ...) you will typically use JPQL statements with a lot of joins. This can result in complex SQL translations. When you take a look at the data model (generated by JPA) you will recognize that there are no additional indices on your table columns.
For example, if you have a Customer and an Address object with a OneToOne relationship, everything will work well at first glance. Customer and Address are linked by a foreign key. But if you do selections like this
Select c from Customer as c where c.address.zip='8888'
you should take care to index the column 'zip' in the table ADDRESS. JPA will not create such an index for you during deployment. So in my case I was able to speed up the database performance by simply adding indices.
An SQL statement that adds such an index looks like this:
ALTER TABLE `mydatabase`.`ADDRESS` ADD INDEX `zip_index`(`IZIP`);
In the question, and in the other answers, I'm hearing a lot of "might"s and "maybe"s.
First find out what's going on. If you haven't done that, we're all just poking in the dark.
I'm no expert on this kind of system, but this method works on any language or OS.
When you find out what's making it take too long, why don't you summarize it here?
I'm especially interested to know if it was something that might have been guessed.
