I'm designing an application that has to consume live data from several sources and periodically report on it. Consumed data will be added to an Ehcache cache and reports will query it. Once the live data is consumed it needs to be persisted for recovery purposes only. If the application restarts it will prime the cache with historical data from the DB before connecting to the live data sources (which queue new data).
I'm leaning toward implementing it as a cache-as-sor with JDBC caching:
1. Receive data from source
2. Persist to DB
3. Add to cache
4. Confirm receipt with source
with 2-4 wrapped in a JTA transaction.
I also looked into Hibernate with Ehcache as a 2nd level cache, but that doesn't seem appropriate.
I'm relatively new to Ehcache so would like some advice on the right design.
For persistence, rather than do a "cache-aside", you probably would want to configure your caches to use read-through and some cache writer (either write-through, or write-behind). You can read about these here: http://ehcache.org/documentation/user-guide/concepts#cache-as-sor
Now I'd avoid JTA, as I fear the overhead might be overkill (except if you really need XA transaction recovery), and rather opt for a fault-tolerant approach. If you opt for asynchronous persistence (write-behind), clustering your cache with Terracotta (the write-behind queue would automatically be persistent, recoverable, and even HA if multiple nodes are available) is one approach to ensuring every element gets written out to the underlying SoR... all depending on your needs, I guess.
Ehcache would let you start with a single-node, unclustered approach, simply using read- & write-through caches, which you could grow and fine-tune to meet your SLA. As data grows, you'd then be able to move to clustered caches and asynchronous writers (should writes become the issue) or grow your cache sizes (if reads remain the issue). Obviously, you should measure (or at least know what bottlenecks you foresee) and choose accordingly. But putting a cache in front of your RDBMS is a common and well-understood pattern for scaling read (and write) access to these "slower" stores...
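To make the read-/write-through idea concrete, here is a minimal single-node sketch using Ehcache 3's CacheLoaderWriter. The JdbcLoaderWriter class, the readings table, the key/value types, and the H2-style MERGE upsert are my own illustrative assumptions, not from the original post:

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.spi.loaderwriter.CacheLoaderWriter;

// Bridges the cache and the DB; depending on your Ehcache 3 version you may also
// have to implement the bulk loadAll/writeAll/deleteAll methods.
class JdbcLoaderWriter implements CacheLoaderWriter<Long, String> {
    private final DataSource ds;
    JdbcLoaderWriter(DataSource ds) { this.ds = ds; }

    @Override
    public String load(Long key) throws Exception {            // read-through on cache miss
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement("SELECT payload FROM readings WHERE id = ?")) {
            ps.setLong(1, key);
            try (ResultSet rs = ps.executeQuery()) { return rs.next() ? rs.getString(1) : null; }
        }
    }

    @Override
    public void write(Long key, String value) throws Exception {   // write-through on put
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "MERGE INTO readings (id, payload) KEY (id) VALUES (?, ?)")) {   // H2-style upsert
            ps.setLong(1, key);
            ps.setString(2, value);
            ps.executeUpdate();
        }
    }

    @Override
    public void delete(Long key) throws Exception {
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement("DELETE FROM readings WHERE id = ?")) {
            ps.setLong(1, key);
            ps.executeUpdate();
        }
    }
}

// Wiring it up: every put() hits the DB before the cache, every miss falls back to load().
CacheManager cm = CacheManagerBuilder.newCacheManagerBuilder()
        .withCache("readings", CacheConfigurationBuilder
                .newCacheConfigurationBuilder(Long.class, String.class, ResourcePoolsBuilder.heap(100000))
                .withLoaderWriter(new JdbcLoaderWriter(dataSource)))
        .build(true);
Cache<Long, String> cache = cm.getCache("readings", Long.class, String.class);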
If all you want is to have data in a cache, then Hibernate looks like overkill. All you need is JDBC, both to implement a cache loader for cache initialization and to save the data to a database periodically. Or just set up your cache to persist to disk.
Then Ehcache + Hibernate is not the solution. What you are describing here is an asynchronous event-processing system in which one of the listeners awaits an "event processed successfully" confirmation before persisting.
NoSQL databases are a far better option in this case, unless you strictly need to rely on a relational database.
I have been pondering how to reliably implement a write-through caching mechanism to store real-time data.
Basically what we need is this:
Save data to Redis -> Save to database (underlying)
Read data from Redis <- Read from database in case unavailable in cache
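For illustration, the read path of that pattern with the Jedis client and the MongoDB sync driver might look like this (the key scheme, database, and collection names are invented for the sketch):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import redis.clients.jedis.Jedis;

public class ReadThrough {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            MongoCollection<Document> col = MongoClients.create("mongodb://localhost")
                    .getDatabase("app").getCollection("readings");
            String key = "reading:42";
            String value = jedis.get(key);                 // 1. try the cache
            if (value == null) {                           // 2. fall back to the database
                Document doc = col.find(Filters.eq("_id", "42")).first();
                if (doc != null) {
                    value = doc.toJson();
                    jedis.setex(key, 3600, value);         // repopulate the cache with a TTL
                }
            }
            System.out.println(value);
        }
    }
}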
The resources online to help in the implementation of this caching strategy seem scarce.
The problem is:
1) No built-in transaction possibility between Redis and the database (Mongo in my case).
2) No transactions means that writes to the underlying database are unreliable.
The most straightforward way I see how this can be implemented is by using a broker like Kafka and putting messages on a persistent queue to be processed later.
Therefore Kafka would be the responsible entity for reliable processing.
Another way would be a custom implementation in a scheduler that checks the Redis database for dirty records. On first thought there seem to be some tradeoffs to this approach, and I would prefer not to go down this road if possible.
I am looking for options on how this can be implemented otherwise, or whether this is in fact the most viable approach.
So the better approach is, as you mentioned above, to use Kafka and a consumer that stores the data to Mongo. But read up on its delivery guarantees: as I remember, exactly-once is guaranteed only in Kafka Streams (between two topics); in your case you get an at-least-once guarantee, so your database writes should be idempotent. And don't forget to turn AOF on in Redis so you don't lose data. And don't forget that in this case you get eventual consistency in the DB, with all the consequences.
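To sketch that Kafka-based path (the topic, group, and collection names are made up; the point is that a Mongo upsert keyed on the message id makes redelivery under at-least-once harmless):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

public class MongoSink {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "mongo-sink");
        props.put("enable.auto.commit", "false");   // commit offsets only after the DB write
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        MongoCollection<Document> col = MongoClients.create("mongodb://localhost")
                .getDatabase("app").getCollection("events");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Upsert keyed on the record key: replaying the same message is a no-op,
                    // which is what makes at-least-once delivery tolerable.
                    col.replaceOne(Filters.eq("_id", r.key()),
                            Document.parse(r.value()).append("_id", r.key()),
                            new ReplaceOptions().upsert(true));
                }
                consumer.commitSync();   // only now acknowledge the batch to Kafka
            }
        }
    }
}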
On review I will use MongoDB as a single datastore without Redis at all.
Premature optimization is evil I guess.
Anyhow, I can add additional architecture later, after benchmarking.
Refactoring toward a cache shouldn't be too hard.
Scaling is a separate concern, so I shouldn't be bothered with it during development right now.
Accepted @Ipave's answer; going with a single datastore for the moment.
I am currently developing an application using Spring MVC 4 and Hibernate 4. I have implemented the Hibernate second-level cache for performance improvement. If I use Redis (an in-memory data structure store, used as a database, cache, etc.), the performance will increase, but will it be a drastic change?
You may expect drastic differences if you cache what is good to cache and avoid caching data that should not be cached at all. Like beauty being in the eye of the beholder, the same goes for performance. Here are several aspects you should keep in mind when using a Hibernate second-level cache provider:
No Custom serialization - Memory intensive
If you use second-level caching, you will not be able to use fast serialization frameworks such as Kryo and will have to stick to Java serialization, which sucks.
On top of this, each entity type gets its own region, and within each region there is an entry for each key of each entity. This is very inefficient in terms of memory.
Lacks the ability to store and distribute rich objects
Most modern caches also provide compute-grid functionality; having your objects fragmented into many small pieces decreases your ability to execute distributed tasks with guaranteed data co-location. That depends a little on the grid provider, but for many it would be a limitation.
Suboptimal performance
Depending on how much performance you need and what type of application you have, using the Hibernate second-level cache might be a good or a bad choice: good in that it is plug-and-play ("kind of..."), bad because you will never squeeze out the performance you could otherwise have gained. Also, designing rich models means more upfront work and more OOP.
Limited querying capabilities ON the Cache itself
That depends on the cache provider, but some of the providers are really not good at JOINs or at WHERE clauses on anything other than the ID. If you try to build an in-memory index for a query on Hazelcast, for example, you will see what I mean.
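For example, with the Hazelcast 3.x API, a non-key query only performs acceptably once you add the index yourself (the map name, the Person type, and its "age" field are invented for illustration; Person would need to be serializable):

import java.util.Collection;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;

HazelcastInstance hz = Hazelcast.newHazelcastInstance();
IMap<Long, Person> people = hz.getMap("people");
people.addIndex("age", true);   // ordered index; without it this predicate scans every entry
Collection<Person> adults = people.values(Predicates.greaterThan("age", 30));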
Yes, if you use Redis, it will improve your performance.
No, it will not be a drastic change. :)
https://memorynotfound.com/spring-redis-application-configuration-example/
http://www.baeldung.com/spring-data-redis-tutorial
The links above will help you find a way to integrate Redis with your project.
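For a taste of what those tutorials walk through, a minimal Spring cache configuration backed by Redis might look roughly like this (assuming Spring Data Redis 2.x with Lettuce; all names are illustrative):

import java.time.Duration;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        return new LettuceConnectionFactory("localhost", 6379);
    }

    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofMinutes(10));   // cached entries expire after 10 minutes
        return RedisCacheManager.builder(factory).cacheDefaults(config).build();
    }
}

With this in place, any method annotated @Cacheable("users") would have its results stored in Redis transparently.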
It depends on the load.
If you have 1000 or more requests per second and you are low on RAM, then yes, use Redis nodes on another machine to take some of the load. It will greatly improve your RAM usage and request speed.
But if not, then don't use it.
Remember that you can adopt this approach later, once you see what your RAM and database connection pool usage actually looks like.
Your question was already discussed here. Check this link: Application cache v.s. hibernate second level cache, which to use?
This was the most accepted answer, which I agree with:
It really depends on your application's querying model and traffic demands.
Using Redis/Hazelcast may yield the best performance, since there won't be any round-trip to the DB anymore, but you end up having normalized data in the DB and a denormalized copy in your cache, which will put pressure on your cache update policies. So you gain the best performance at the cost of implementing the cache update whenever the persisted data changes.
Using the 2nd-level cache is easier to set up, but it only stores entities by id. There is also a query cache, storing ids returned by a given query. So the 2nd-level cache is a two-step process that you need to fine-tune to get the best performance. When you execute projection queries, the 2nd-level object cache won't help you, since it only operates on entity loading. The main advantage of the 2nd-level cache is that it's easier to keep in sync whenever data changes, especially if all your data is persisted by Hibernate.
So, if you need ultimate performance and you don't mind implementing cache update logic that ensures a minimum eventual-consistency window, then go with an external cache.
If you only need to cache entities (which usually don't change that frequently) and you mostly access them through Hibernate entity loading, then the 2nd-level cache can help you.
Hope it helps!
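For reference, a minimal sketch of what enabling the entity-level second-level cache described above involves (assuming Hibernate 4 with Ehcache as the provider; the Product entity is made up):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)   // cached by id on entity load
public class Product {
    @Id
    private Long id;
    private String name;
    // getters and setters omitted
}

// persistence.xml / hibernate.cfg.xml properties:
// hibernate.cache.use_second_level_cache = true
// hibernate.cache.use_query_cache        = true
// hibernate.cache.region.factory_class   = org.hibernate.cache.ehcache.EhCacheRegionFactory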
Scenario:
I have a need to cache the results of database queries in my web service. There are about 30 tables queried during the cycle of a service call. I am confident data in a certain date range will be accessed frequently by the service, and I would like to pre-cache that data. This would mean caching around 800,000 rows at application startup; the data is read-only. The data does not need to be dynamically refreshed; this is reference data. The cache can't be loaded on each service call, there's simply too much data for that. Data outside of this 'frequently used' window is not time-critical and can be lazy-loaded. Most queries would return 1 row, and none of the tables have a parent/child relationship to each other, though there will be a few joins. There is no need for dynamic SQL support.
Options:
I intended to use myBatis, but there isn't a good method to warm up the cache. myBatis can't understand that the service query select * from table where key = ? is already covered by the startup pre-cache query select * from table.
As far as I understand it (documentation overload), Hibernate has the same problem. Additionally, these tables were designed with composite keys and no primary key, which is an extra hassle for Hibernate.
Question:
Preferred: Is there a myBatis solution for this problem ? I'd very much like to use it. (Familiarity, simplicity, performance, funny name, etc)
Alternatively: Is there an ORM or DB-friendly cache that offers what I'm looking for ?
You can use a distributed caching solution like NCache or TayzGrid, which provide indexing and query features along with a cache startup loader.
You can configure indexes on attributes of your entities in the cache. A cache startup loader can be configured to load all data from the database into the cache at startup. While loading data, the cache will create indexes for all entities in memory.
The Object Query Language (OQL) feature, which provides SQL-like queries, can then be used to query the in-memory data.
The variety of options for third-party products (free and paid) is too broad and too dependent on your particular requirements and operational capabilities to try to "answer" here.
However, I will suggest an alternative to an explicit cache of your read-only data.
You clearly believe that the memory footprint of your dataset will fit into RAM on a reasonably sized server. My suggestion is that you use your database engine directly (no additional external cache), but configure the database with an internal cache large enough to hold your whole dataset. If all of your data resides in the database server's RAM, it will be accessed very quickly.
I have used this technique successfully with MySQL, but I expect the same applies to all major database engines. If you cannot figure out how to configure your chosen database appropriately, I suggest that you ask a separate, detailed question.
You can warm the cache by executing representative queries when you start your system. These queries will be relatively slow because they have to actually do the disk I/O to pull the relevant blocks of data into the cache. Subsequent queries that access the same blocks of data will be much faster.
This approach should give you a huge performance boost with no additional complexity in your code or your operational environment.
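A sketch of such a warm-up step, assuming plain JDBC and placeholder table names (the results are discarded; the point is the I/O side effect of pulling the hot blocks into the DB's buffer pool):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

static void warmUp(DataSource ds) throws Exception {
    String[] warmers = {
        "SELECT * FROM rates WHERE effective_date >= DATE '2015-01-01'",
        "SELECT * FROM codes WHERE effective_date >= DATE '2015-01-01'"
    };
    try (Connection c = ds.getConnection(); Statement st = c.createStatement()) {
        for (String sql : warmers) {
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) { /* discard; we only want the blocks cached */ }
            }
        }
    }
}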
Sormula may do what you want. You would need to annotate each POJO to be cached, like:
@Cached(type=ReadOnlyCache.class)
public class SomePojo {
    ...
}
Pre-populate the cache by invoking the selectAll method for each:
Database db = new Database(/* one of the JNDI constructors */);
Table<SomePojo> t = db.getTable(SomePojo.class);
t.selectAll();
The key is that the cache is stored in the Table object, t. So you would need to keep a reference to t and use it for subsequent queries. Or, in the case of many tables, keep a reference to the Database object, db, and use db.getTable(...) to get the tables to query.
See javadoc and tests in org.sormula.tests.cache.readonly package.
This may be a dumb question, but I am not even sure what to google.
I have a server which fetches some data from the DB and caches it; whenever a request involves this data, it is fetched from the cache instead of from the DB, thereby reducing the time taken to serve the request.
This cache can be modified, i.e. a key may get added to it, deleted, or updated.
Any change which occurs in the cache will also happen on the DB.
The problem: due to a heavy rush in traffic, we now want to add a load balancer in front of my server. Let's say I add one more server; the two servers will then have two different caches. If something gets added to the first server's cache, how should I inform the second server's cache to get it refreshed?
If you ultimately decide to move the cache outside your main webserver process, you could also take a look at consistent hashing. This would be an alternative to a replicated cache.
The problem with replicated caches, is they scale inversely proportional to the number of nodes participating in the cache. i.e. their performance degrades as you add additional nodes. They work fine when there is a small number of nodes. If data is to be replicated between N nodes (or you need to send eviction messages to N nodes), then every write requires 1 write to the cache on the originating node, and N-1 writes to the other nodes.
In consistent hashing, you instead define a hashing function, which takes the key of the data you want to store or retrieve as input, and it returns the id of the server in the cluster which is responsible for caching the data for that key. So each caching server is responsible for a fraction of the overall keys, the client can determine which server will contain the sought data without any lookup, and data and eviction messages do not need to be replicated between caching servers.
The "consistent" part of consistent hashing, refers to how your hashing function handles new servers being added to or removed from the cluster: some re-distribution of keys between servers is required, but the function is designed to minimize the amount of such disruption.
In practice, you do not actually need a dedicated caching cluster, as your caches could run in-process in your web servers; each web server can determine which other web server should store the cache data for a given key.
Consistent hashing is used at large scale. It might be overkill for you at this stage. But just be aware of the scalability bottleneck inherent in O(N) messaging architectures. A replicated cache is possibly a good idea to start with.
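To make the idea concrete, here is a toy consistent-hash ring (MD5 and the virtual-node count are arbitrary choices for this sketch):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VNODES = 100;   // virtual nodes smooth the key distribution

    public void addServer(String server) {
        for (int i = 0; i < VNODES; i++) ring.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < VNODES; i++) ring.remove(hash(server + "#" + i));
    }

    // The owner of a key is the first ring position at or after the key's hash,
    // wrapping around to the start of the ring if necessary.
    public String serverFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);   // first 8 digest bytes
            return h;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Each web server can call serverFor(key) locally to route a get or an eviction message to exactly one node, instead of broadcasting to all N.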
EDIT: Take a look at Infinispan, a distributed cache which indeed uses consistent hashing out of box.
Any way you like ;) If you have no idea, I suggest you look at or use Ehcache or Hazelcast. They may not be the best solutions for you, but they are some of the most widely used. (And CV++ ;) I suggest you understand what they do first.
I have a web application that receives messages through an HTTP interface, e.g.:
http://server/application?source=123&destination=234&text=hello
This request contains the ID of the sender, the ID of the recipient and the text of the message.
This message should be processed like:
finding the matching User object for both the source and the destination from the database
creating a tree of objects: a Message that contains a field for the message text and two User objects for the source and the destination
persisting this tree to a database.
The tree will be loaded by other applications that I can't touch.
I use Oracle as the backing database and JPA with Toplink for the database handling tasks. If possible, I'd stay with these.
Without much optimization I can achieve ~30 requests/sec throughput in my environment. That's not much; I'd require ~300 requests/sec. So I measured where the performance bottleneck is and found that the calls to em.persist() take most of the time. If I simply comment out that line, the throughput goes well over 1000 requests/sec.
I tried to write a small test application that used simple JDBC calls to persist 1 million messages to the same database. I used batching, meaning I did 100 inserts then a commit, and repeated until all the records were in the database. I measured ~500 requests/sec throughput in this scenario, which would meet my needs.
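For reference, the batched JDBC loop described above looks roughly like this (the messages table and the Msg holder class are placeholders for whatever the real schema is):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

class Msg { long sourceId; long destinationId; String text; }

static void insertBatched(Connection con, List<Msg> msgs) throws Exception {
    con.setAutoCommit(false);
    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO messages (source_id, destination_id, text) VALUES (?, ?, ?)")) {
        int n = 0;
        for (Msg m : msgs) {
            ps.setLong(1, m.sourceId);
            ps.setLong(2, m.destinationId);
            ps.setString(3, m.text);
            ps.addBatch();
            if (++n % 100 == 0) {        // 100 inserts per round trip, then commit
                ps.executeBatch();
                con.commit();
            }
        }
        ps.executeBatch();               // flush the final partial batch
        con.commit();
    }
}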
It is clear that I need to optimize insert performance here. However as I mentioned earlier I would like to keep using JPA and Toplink for this, not pure JDBC.
Do you know a way to create batch inserts with JPA and Toplink? Can you recommend any other technique for improving JPA persist performance?
ADDITIONAL INFO:
"requests/sec" means here: total number of requests / total time from beginning of test to last record written to database.
I tried to make the calls to em.persist() asynchronous by creating an in-memory queue between the servlet stuff and the persister. That helped performance greatly; however, the queue grew really fast, and as the application will receive ~200 requests/second continuously, it is not an acceptable solution for me.
In this decoupled approach I collected requests for 100 ms and called em.persist() on all collected items before committing the transaction. The EntityManagerFactory is cached between transactions.
You should decouple from the JPA interface and use the bare TopLink API. You can probably chuck the objects you're persisting into a UnitOfWork and commit the UnitOfWork on your schedule (sync or async). Note that one of the costs of em.persist() is the implicit clone that happens of the whole object graph. TopLink will work rather better if you uow.registerObject() your two user objects yourself, saving itself the identity tests it has to otherwise do. So you'll end up with:
UnitOfWork uow = sess.acquireUnitOfWork();
for (Message job : batch) {   // batch is the collection of incoming requests
    Thingy thingyCl = (Thingy) uow.registerObject(new Thingy());
    User user1Cl = (User) uow.registerObject(user1);
    User user2Cl = (User) uow.registerObject(user2);
    thingyCl.setUsers(user1Cl, user2Cl);
}
uow.commit();
This is very old school TopLink btw ;)
Note that the batch will help a lot, because batch writing, and especially batch writing with parameter binding, will kick in, which for this simple example will probably have a very large impact on your performance.
Other things to look for: your sequencing size. A lot of the time spent writing objects in TopLink is actually spent reading sequencing information from the database, especially with the small defaults (I would probably use several hundred or even more as my sequence size).
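In JPA terms, raising the sequence preallocation this refers to looks something like the following (the names are placeholders; allocationSize is the number of ids reserved per database round trip):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class Message {
    @Id
    @SequenceGenerator(name = "msg_seq", sequenceName = "MSG_SEQ", allocationSize = 500)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "msg_seq")
    private Long id;
    private String text;
}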
What is your measure of "requests/sec"? In other words, what happens for the 31st request? What resource is being blocked? If it is the front-end/servlet/web portion, can you run em.persist() in another thread and return immediately?
Also, are you creating transactions each time? Are you creating EntityManagerFactory objects with each request?