We are developing a distributed web application (three Tomcats behind a load balancer).
Currently we are looking for a caching solution. This solution should be cluster-safe, of course.
We are using Spring and JPA (MySQL).
We thought about the following solution:
Create a cache server that runs a simple cache; all DB operations from each Tomcat will be delegated to it (the DAO layer in the web app will communicate with that server instead of accessing the DB itself). This is appealing since the cache configuration on the cache server can be minimal.
What we are wondering about right now is:
If a complex query is passed to the cache server (i.e. a select with multiple joins and where clauses), how exactly can the standard cache structure (a map) handle this? Does it mean we have to implement a lookup for each complex query ourselves and translate it into a map search instead of a DB query?
P.S. There is a possibility that this architecture is flawed at its base and that is why a strange question like this came up; if that's the case, please suggest an alternative.
Best,
MySQL already comes with a query cache; see http://dev.mysql.com/doc/refman/5.1/en/query-cache-operation.html
If I understand correctly, you are trying to implement a method cache, using the arguments of your DAO methods as the key and the resulting object/list as the value.
This should work, but your concern about complex queries is valid: you will end up with a lot of entries in your cache. For a complex query you would hit the cache only if exactly the same query is executed with exactly the same arguments as the one already cached. You will have to figure out whether it is useful to cache those complex queries, i.e. whether there is a realistic chance they will be hit again; it really depends on the application's business logic.
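As a concrete illustration of such a method cache, here is a minimal sketch using Spring's cache abstraction; the DAO, the cache name "orders" and the Order entity are made up, and it assumes @EnableCaching plus some cache provider (a simple ConcurrentMap, Ehcache, etc.) is configured:

import java.util.Collections;
import java.util.List;

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Repository;

@Repository
public class OrderDao {

    // The cache key is built from the method arguments (customerId, status);
    // the cached value is the returned list. Same arguments -> cache hit,
    // otherwise the method body runs and the result is stored.
    @Cacheable("orders")
    public List<Order> findOrders(long customerId, String status) {
        // only reached on a cache miss: run the real (complex) JPA query here
        return Collections.emptyList();
    }

    // Writes must invalidate stale entries; allEntries = true is simple but coarse.
    @CacheEvict(value = "orders", allEntries = true)
    public void saveOrder(Order order) {
        // persist via JPA
    }

    // placeholder for whatever entity the real application uses
    public static class Order { }
}

The eviction policy is the part you really have to think about: clearing the whole region on every write keeps things correct, but it throws away exactly the complex-query results you wanted to keep.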
Another option would be to implement a cache with multiple levels: a second-level cache and a query cache, using Ehcache and BigMemory. You might find this useful:
http://ehcache.org/documentation/integrations/hibernate
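For reference, most of the wiring for that option is configuration; a rough sketch of the Hibernate properties involved, assuming the hibernate-ehcache module is on the classpath (the exact region factory class name depends on the Hibernate version):

hibernate.cache.use_second_level_cache=true
hibernate.cache.use_query_cache=true
hibernate.cache.region.factory_class=org.hibernate.cache.ehcache.EhCacheRegionFactory

Entities still have to be marked cacheable, and each query has to opt in to the query cache (for example via the org.hibernate.cacheable query hint in JPA).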
I am creating a webapp in Spring Boot (Spring + Hibernate + MySQL).
I have already created all the CRUD operations for the data of my app, and now I need to process the data and create reports.
Given the complexity of these reports, I will create some summary or pre-processed tables. This way, I can trigger the report creation once and then retrieve the reports efficiently.
My question is whether I should build all the reports in Java or in stored procedures in MySQL.
Pros of doing it in Java:
More logging
More control over the data structures (entities, maps, lists, etc.)
Catching exceptions
If I change my DB engine (it probably won't happen, but you never know)
Cons of doing it in Java:
Maybe memory?
Any thoughts on this?
Thanks!
Java, though both are possible. It depends on what is most important, what skills are available for maintenance, and the cost of maintaining it. Stored procedures are usually very fast, but availability and performance also depend on which exact database you use. You will need specialized skills, and then you have everything tied to that specific database.
Hibernate comes with a dedicated dialect for every database to get the best performance out of the persistence layer. It's not as fast as a stored procedure, but it comes pretty close. With Spring Data on top of that, most of the difficulty is gone. Maintenance will not cost that much, and people who know Spring Data are easier to find than specialists in any particular database vendor.
You can still create various "difficult" queries easily with HQL, so no blocker there. But Hibernate comes with more possibilities: you can have your caching done by Ehcache, and with Hibernate Envers you will have your auditing done in no time. That's the nice thing about this framework: it's widely used and many free-to-use Maven dependencies are there for the taking. And if in the future you want to change your database, you can do it by changing about three parameters in your application.properties file when using Spring Data.
You can play with some annotations and see what performs better. For example, you have the @Inheritance annotation, with which you can have some classes end up in the same table or split them across more tables. You also have @MappedSuperclass, with which you can have one JpaObject holding the id that all your entities extend. If you want some more tricks with JPA, maybe check this post with my answer on how to use a superclass and a general repository.
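For illustration, a minimal sketch of that @MappedSuperclass idea; the class names (JpaObject, Report) are only placeholders:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.MappedSuperclass;

// Common id (and, if you like, audit fields) shared by every entity in the model.
@MappedSuperclass
public abstract class JpaObject {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    public Long getId() {
        return id;
    }
}

// A concrete entity (normally in its own file) only declares its own columns; the id is inherited.
@Entity
class Report extends JpaObject {
    private String name;
}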
Given the complexity of these reports, I will create some summary or pre-processed tables. This way, I can trigger the report creation once and then retrieve the reports efficiently.
My first thought is: is this required? It seems like it adds complexity to the application that perhaps isn't needed. Premature optimisation and all that. Try writing the reports in SQL and looking at an execution plan. If it's good enough, you have less code to maintain and no added batch jobs to administer. Consider load testing, e.g. with JMeter or Gatling, to see how it holds up under stress.
Consider using Querydsl or jOOQ for reporting. Both provide a database abstraction layer and a fluent API for querying databases, which delivers the benefits listed in the "Pros of doing it in Java" section of the question and may be better suited to the problem. The blog post jOOQ vs. Hibernate: When to Choose Which is well worth a read.
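To give a feel for the jOOQ style, here is a small reporting sketch; the connection details, table and column names are invented, and the plain DSL.table/DSL.field variants are used so no code generation is required:

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.sum;
import static org.jooq.impl.DSL.table;
import static org.jooq.impl.DSL.using;

import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.Result;
import org.jooq.SQLDialect;

public class ReportExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/app", "user", "pass")) {
            DSLContext ctx = using(conn, SQLDialect.MYSQL);

            // Typical summary query: total order value per customer, biggest spenders first.
            Result<?> report = ctx
                    .select(field("customer_id"), sum(field("amount", Double.class)))
                    .from(table("orders"))
                    .groupBy(field("customer_id"))
                    .orderBy(sum(field("amount", Double.class)).desc())
                    .fetch();

            System.out.println(report);
        }
    }
}

The query is still plain SQL underneath, so a MySQL execution plan will tell you whether a summary table is needed at all.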
I want to run a Java function that generates a JSON document of about 1 MB in size. I need that JSON as input for the next call, which happens two minutes later. I could save the JSON in a database, but the time spent saving it to the database is not acceptable for me. Can I keep this data in memory and use it for the next call? I also need to read this data from Node.js. How can I do this?
Why don't you use a persistent asynchronous queue between your application and your database? This way you just fire and forget the persist operation and serve the result as fast as possible.
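A very rough sketch of that idea with a plain in-JVM queue and a background writer thread; a production setup would rather use a persistent broker (a JMS queue, for example) so the data survives a crash, but the shape is the same:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncPersister {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncPersister() {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String json = queue.take();   // blocks until something arrives
                    saveToDatabase(json);         // the slow DB write happens off the request thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "db-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from the request path: returns immediately ("fire and forget").
    public void persistAsync(String json) {
        queue.offer(json);
    }

    private void saveToDatabase(String json) {
        // JDBC/JPA insert goes here
    }
}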
If you also want to keep the object in memory, your best bet would be something like Infinispan or Hazelcast. Infinispan offers its own persistent store for the cache and good database integration. Hazelcast, on the other hand, works more as an in-memory key-value store, but some persistence can easily be implemented with it as well. Hazelcast is very easy to start with and the learning curve is not steep.
The good thing about this infrastructure is that you can be sure your data is in sync with the database. For example, you can configure how many backups of a particular object are kept, and these backups are created asynchronously or synchronously depending on how you configure them. You can also send the data on to the database. If persistence is a strong requirement, Infinispan is probably better in this regard.
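As a rough illustration of the Hazelcast route (Hazelcast 3.x API, map and key names made up), the JSON can be put into a distributed map with a TTL a bit longer than the two-minute gap:

import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class JsonHandoverExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> jsonCache = hz.getMap("json-results");

        // First call: keep the ~1 MB JSON around for 5 minutes (the next call comes ~2 minutes later).
        jsonCache.put("job-42", "{\"huge\":\"json ...\"}", 5, TimeUnit.MINUTES);

        // Second call, possibly from another JVM in the cluster: read it back.
        String json = jsonCache.get("job-42");
        System.out.println(json != null ? "hit" : "expired or missing");

        hz.shutdown();
    }
}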
When I read your post a second time, I realized that you may need something significantly simpler when it comes to caching. If you just need a local cache with no backup capabilities, just go for Ehcache.
Hazelcast is a Java library that provides APIs to solve the caching use case.
It extends the Java collections with capabilities suitable for caching: eviction, TTL for entries, read-through and write-through caching, etc.
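For example, TTL and eviction can be set declaratively per map; a small sketch against the Hazelcast 3.x configuration API, with a made-up map name:

import com.hazelcast.config.Config;
import com.hazelcast.config.EvictionPolicy;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MaxSizeConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class CacheConfigExample {
    public static void main(String[] args) {
        MapConfig mapConfig = new MapConfig("json-results");
        mapConfig.setTimeToLiveSeconds(300);                 // entries expire after 5 minutes
        mapConfig.setEvictionPolicy(EvictionPolicy.LRU);     // evict least-recently-used entries first
        mapConfig.setMaxSizeConfig(
                new MaxSizeConfig(1000, MaxSizeConfig.MaxSizePolicy.PER_NODE));

        Config config = new Config();
        config.addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        // hz.getMap("json-results") now honours the TTL and eviction settings above.
    }
}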
For the Node.js use case, please see my answer here: https://stackoverflow.com/a/36704734/27563
Let me know if you have any questions.
Thanks
I'm building a Java web application which may generate a lot of traffic in the future.
All in all it uses some quite simple queries against the database, but some kind of cache may still be necessary to keep latency low and to prevent a high database access rate.
Should I bother with a cache from the start? Is it a necessity?
Is it hard to implement, or are there open source solutions that can be used with the existing queries? And how would such a cache know when the database state has changed?
It all depends on how much traffic you expect. Do you have an estimate of the maximum volume or the number of users?
Most of the time you don't need to worry about the cache from the beginning and can add a Hibernate second-level cache later on.
You can start development without a cache configured, then add it later by choosing a cache provider and plugging it in as the second-level cache provider. Ehcache is a frequent choice.
You can then annotate your entities with @Cache using different strategies, for example read-only, etc.
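For instance, once a second-level cache provider such as Ehcache is plugged in, an entity can be marked like this (the entity is made up; read-mostly reference data is the typical candidate for a read-only strategy):

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Lookups by id for this entity are served from the second-level cache after the first hit.
@Entity
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
public class Country {

    @Id
    private Long id;

    private String name;
}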
I don't want to persist any data, but I still want to use Neo4j for its graph traversal and algorithm capabilities. With an embedded database, I've configured cache_type = strong and after all the writes I set the transaction to failure. But my write speeds (node and relationship creation speeds) are slow, and this is becoming a big bottleneck in my process.
So the question is: can Neo4j be run without any persistence aspects at all, just as a pure in-memory API? I tried others like JGraphT, but those don't have traversal mechanisms like the ones Neo4j provides.
As far as I know, Neo4j data storage and Lucene indexes are always written to files. On Linux, at least, you could set up a ramfs file system to hold the files in memory.
See also:
Loading all Neo4J db to RAM
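If you go down that route, the embedded database only needs to be pointed at the RAM-backed directory; a sketch against the Neo4j 2.x embedded API, with a hypothetical tmpfs/ramfs mount point that has to exist already:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class RamBackedGraph {
    public static void main(String[] args) {
        // /mnt/ramdisk is assumed to be a tmpfs/ramfs mount, so the store files never hit a physical disk.
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("/mnt/ramdisk/neo4j-store");

        // ... create nodes/relationships and run traversals as usual ...

        db.shutdown();
    }
}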
How many changes do you group in each transaction? You should try to group up to thousands of changes in each transaction, since committing a transaction forces the logical log to disk.
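For example, instead of one transaction per created node, group the writes; a sketch against the standard embedded API with arbitrary batch sizes:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class BatchedWrites {

    // One commit (and therefore one forced log flush) per batch instead of per node.
    static void createNodes(GraphDatabaseService db, int total, int batchSize) {
        int created = 0;
        while (created < total) {
            try (Transaction tx = db.beginTx()) {
                for (int i = 0; i < batchSize && created < total; i++, created++) {
                    Node node = db.createNode();
                    node.setProperty("index", created);
                }
                tx.success();
            }
        }
    }
}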
However, in your case you could instead begin your transactions with:
db.tx().unforced().begin();
Instead of:
db.beginTx();
That way the transaction does not wait for the logical log to be forced to disk, which makes small transactions much faster, but a power outage could potentially cost you the last couple of seconds of data.
The tx() method sits on GraphDatabaseAPI, which for example EmbeddedGraphDatabase implements.
You can try a virtual drive. It would make Neo4j persist to the drive, but it would all happen in memory.
https://thelinuxexperiment.com/create-a-virtual-hard-drive-volume-within-a-file-in-linux/
My application pulls a large amount of data from an external source - normally across the internet - and stores it locally on the application server.
Currently, as a user starts a new project we aggressively try to pull the data from the external source in the order we predict the user will want to access it. This process can take 2-3 hours.
It seems like a smarter approach here is to provide access to the data in a lazy-loading fashion. E.g. if a user wants to access entity A, we try to grab it from our database first. If it's not there yet, we fetch it from the remote source and populate the database at the same time.
This, combined with continuing to populate the database in the background, would give a much slicker experience for the user.
Are there frameworks which manage this level of abstraction? (My application is in Java).
There are several considerations here. For instance, my database currently enforces relational integrity, something that might have to be relaxed to facilitate this lazy-loading approach. Concurrency also seems like it would cause problems here.
Also, it seems like entities and collections could exist in a partially populated state, which requires additional schema data to distinguish complete records from partially populated ones.
As I understand it, this is just an aggregated repository pattern. Is this correct, or is there a more appropriate pattern I should study?
Have you tried JPA/Hibernate? This seems easily possible in Hibernate.
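To make that concrete, the read-through idea from the question could look roughly like the sketch below with plain JPA and Spring; every name here (EntityA, RemoteSourceClient) is a placeholder, and the nested classes only stand in for a real @Entity and a real remote client:

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class EntityAService {

    @PersistenceContext
    private EntityManager em;

    private final RemoteSourceClient remote = new RemoteSourceClient();

    // Read-through: serve from the local DB if present, otherwise fetch remotely and persist.
    @Transactional
    public EntityA find(long id) {
        EntityA local = em.find(EntityA.class, id);
        if (local != null) {
            return local;
        }
        EntityA fetched = remote.fetch(id);   // the slow call across the internet
        em.persist(fetched);                  // populate the local store for next time
        return fetched;
    }

    // Placeholders so the sketch is self-contained; the real EntityA would be an @Entity with an @Id.
    public static class EntityA { }
    public static class RemoteSourceClient {
        EntityA fetch(long id) { return new EntityA(); }
    }
}

Whether a record is complete or only partially populated still has to be modelled explicitly, as the question notes; neither JPA nor Hibernate tracks that for you.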