This may be a dumb question, but I'm not even sure what to Google for.
I have a server which fetches some data from the DB and caches it; whenever a request involves this data, it is served from the cache instead of from the DB, thereby reducing the time taken to serve the request.
This cache can be modified, i.e. a key may get added to it, deleted, or updated.
Any change that occurs in the cache is also applied to the DB.
The problem: due to a heavy rush in traffic we now want to add a load balancer in front of my server. Let's say I add one more server. Then the two servers will have two different caches. If something gets added to the first server's cache, how should I inform the second server's cache to refresh it?
If you ultimately decide to move the cache outside your main webserver process, you could also take a look at consistent hashing. This would be an alternative to a replicated cache.
The problem with replicated caches is that they scale inversely with the number of nodes participating in the cache, i.e. their performance degrades as you add nodes. They work fine when there is a small number of nodes. If data is to be replicated between N nodes (or you need to send eviction messages to N nodes), then every write requires one write to the cache on the originating node and N-1 writes to the other nodes.
In consistent hashing, you instead define a hashing function which takes the key of the data you want to store or retrieve as input and returns the id of the server in the cluster responsible for caching the data for that key. So each caching server is responsible for a fraction of the overall keys, the client can determine which server holds the sought data without any lookup, and data and eviction messages do not need to be replicated between caching servers.
The "consistent" part of consistent hashing refers to how your hashing function handles servers being added to or removed from the cluster: some re-distribution of keys between servers is required, but the function is designed to minimize that disruption.
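To make the idea concrete, here is a minimal sketch of such a ring in Java, using a TreeMap as the ring and MD5 purely as an example hash; the virtual-node count and server names are arbitrary choices for illustration, not taken from any particular library:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal consistent-hash ring: maps each cache key to the server that owns it.
    // Virtual nodes smooth out the key distribution when there are only a few servers.
    public class ConsistentHashRing {
        private static final int VIRTUAL_NODES = 100;
        private final TreeMap<Long, String> ring = new TreeMap<>();

        public ConsistentHashRing(List<String> servers) {
            servers.forEach(this::addServer);
        }

        public void addServer(String server) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(server + "#" + i), server);
            }
        }

        public void removeServer(String server) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.remove(hash(server + "#" + i));
            }
        }

        // The owner of a key is the first ring entry at or after the key's hash,
        // wrapping around to the first entry if necessary.
        public String serverFor(String key) {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static long hash(String value) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
                long h = 0;
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes become the ring position
                }
                return h;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }

A lookup is then just serverFor("user:42"), and when a server joins or leaves, only the keys whose ring positions fall on that server's arcs move elsewhere.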
In practice, you do not actually need a dedicated caching cluster, as your caches could run in-process in your web servers; each web server can determine which other web server should store the cache data for a given key.
Consistent hashing is used at large scale. It might be overkill for you at this stage. But just be aware of the scalability bottleneck inherent in O(N) messaging architectures. A replicated cache is possibly a good idea to start with.
EDIT: Take a look at Infinispan, a distributed cache which indeed uses consistent hashing out of the box.
Any way you like ;) If you have no idea, I suggest you look at or use Ehcache or Hazelcast. They may not be the best solution for you, but they are among the most widely used. (And CV++ ;) I suggest you understand what they do first.
Related
I have cloud statistics (structured data :: CSV) which I have to expose to an administrator and a user.
But for scalability, data will be collected by multiple machines (perf monitors), each connected to its own DB.
A Manager (Mgr) is responsible for multicasting a request to all perf monitors, collecting the overall stats data to satisfy a single UI request.
So my questions are:
1) How will I get the data from multiple monitors sorted according to the client request at the Mgr? Each monitor may return results matching the client request, but how do I merge the data from multiple machines in Java? In other words, how do I perform in-memory SQL aggregate/scalar functions (e.g. GROUP BY, ORDER BY, AVG) on all the results retrieved from multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar functionality on the Java side; are there any known APIs? I think what I need is the Reduce part of the MapReduce technique in Hadoop.
2) A request from the UI (assume select count(*) from DB where Memory > 1000MB) has to be forwarded to multiple machines. How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request on the Mgr?
3) Do I have to authenticate the UI user at both the Mgr and the perf monitors?
4) Do you see any drawbacks in this approach?
Notes:
1) I didn't go for NoSQL because the data is structured and no joins are required.
2) I didn't go for node.js since I am new to it and it might take more time to develop. Also, I am not developing anything concurrency-critical where single-threaded would be best suited. Here only push/retrieval of data is done; no modification happens.
3) I want an individual DB for each monitor, or at least two DB instances with multiple clusters per instance, to support faster access to real-time big statistical data.
You want to scale your app, but you have designed in an inherent bottleneck, namely the Mgr.
What I would do is split the Mgr into at least two parts: front-end and back-end. The front-end could simply be an aggregator and/or controller which collects all the requests from all the different UI servers, timestamps those requests and puts them in a queue (RabbitMQ, Kafka, Redis, whatever), tagging each message with the UI session ID or something similar which uniquely identifies the source of the request. Then you just have to wait until you get a response on the queue (with a different topic, of course).
Then on your back-end (the other side of the queue) you can set up as many nodes as your load requires and have them all perform the same task: pull requests off the queue and call those performance-monitoring APIs as necessary. You can scale these back-end nodes as much as you wish since they don't hold any state; all the state which needs to be stored is already part of the messages in the queue, which will be automagically persisted for you by Redis/Kafka/RabbitMQ or whatever else you choose.
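As a rough illustration of the front-end half, here is a sketch using the RabbitMQ Java client; the queue names, broker host and message format are made-up placeholders, not something your system already defines:

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    // Front-end side: push the UI request onto a work queue, tagged with a
    // correlation id, and tell the workers where to send the reply.
    public class StatsRequestPublisher {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumed broker location

            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {

                channel.queueDeclare("stats.requests", true, false, false, null);
                channel.queueDeclare("stats.replies", true, false, false, null);

                String correlationId = UUID.randomUUID().toString(); // e.g. the UI session id
                AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                        .correlationId(correlationId)
                        .replyTo("stats.replies")
                        .build();

                String request = "select count(*) where memory > 1000"; // the UI query, serialized somehow
                channel.basicPublish("", "stats.requests", props,
                        request.getBytes(StandardCharsets.UTF_8));
                // The front-end now only has to wait for a message on "stats.replies"
                // carrying the same correlationId; the back-end workers are free to scale out.
            }
        }
    }

A back-end worker would consume from stats.requests, call the perf monitors, and publish its result to the replyTo queue with the same correlationId, so the front-end can match responses to the waiting UI requests.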
You can also use Apache Storm or something similar to do this for you in the back-end, since it was designed for exactly this kind of application.
Apache Storm also has built-in merging capability, exposed through the Trident API.
A note on authentication: you should authenticate the HTTP requests on the front-end side and then you will be all right. Just assign unique IDs (session IDs most probably) to the users connected to your Mgr and use this internal ID when you forward your requests further to downstream servers.
How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request on the Mgr?
Well, if you have so many questions regarding handling user connections and serving those clients with responses, then I would suggest picking up a book on the Java Servlet API. You might want to read this one, for example: Servlet & JSP: A Tutorial (A Tutorial series). It is a bit outdated but well written.
But with all due respect, if you have so many questions on these quite fundamental topics, then it might be better to leave the architecture design to someone more experienced.
Don't reinvent the wheel; use some good existing BAM and database monitoring tools. They have a lot of built-in dashboards and statistics, and are easy to connect with Java and workflows.
But for scalability, data will be collected by multiple machines (perf monitors), each connected to its own DB.
Approximately what sort of scale do you anticipate: hundreds of GBs? Multiple terabytes? The reason I ask is that these days SQL Server and Oracle can handle really large volumes of data, and once the data is collected in a central DB, searching and crunching it is largely a solved problem.
A Manager (Mgr) is responsible for multicasting the request to all perf monitors, to collect the overall stats data to satisfy a single UI request.
Writing this will be a major task and it will be really complex, IMHO. That said, I am not an expert in this area.
What I would do is put a data-grid layer such as Hazelcast or Infinispan into your Performance Monitor, in front of the database. The Performance Monitor logic itself can be part of the data grid. MySQL then works as the persistent storage for this data grid. In that setup you can have more than one MySQL instance, each holding just a portion of the data, effectively extending your capacity beyond the available RAM. As you scale out your Performance Monitors over time, you also scale your persistence capacity.
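To illustrate the "MySQL as the persistent storage behind the grid" part, here is a rough sketch of a Hazelcast MapStore backed by a MySQL table; the table layout, key/value types and class names are assumptions for the example, and the interface shown is the classic com.hazelcast.core.MapStore from Hazelcast 3.x:

    import com.hazelcast.core.MapStore;

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    // MySQL acts as the persistent store behind the Hazelcast IMap.
    // An illustrative table "stats(metric_key VARCHAR PRIMARY KEY, metric_value DOUBLE)" is assumed.
    public class StatsMapStore implements MapStore<String, Double> {
        private final DataSource ds;

        public StatsMapStore(DataSource ds) { this.ds = ds; }

        @Override public void store(String key, Double value) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                     "REPLACE INTO stats(metric_key, metric_value) VALUES (?, ?)")) { // MySQL upsert
                ps.setString(1, key);
                ps.setDouble(2, value);
                ps.executeUpdate();
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        @Override public void storeAll(Map<String, Double> map) { map.forEach(this::store); }

        @Override public void delete(String key) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement("DELETE FROM stats WHERE metric_key = ?")) {
                ps.setString(1, key);
                ps.executeUpdate();
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }

        @Override public Double load(String key) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                     "SELECT metric_value FROM stats WHERE metric_key = ?")) {
                ps.setString(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getDouble(1) : null;
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        @Override public Map<String, Double> loadAll(Collection<String> keys) {
            Map<String, Double> result = new HashMap<>();
            keys.forEach(k -> result.put(k, load(k)));
            return result;
        }

        @Override public Iterable<String> loadAllKeys() { return null; } // null = don't pre-load anything
    }

You would register it through a MapStoreConfig on the map's configuration; giving it a write delay turns the persistence into write-behind, so DB writes no longer sit on the request path.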
MapReduce or other distributed functions for aggregation can then give you a massive amount of parallelism and the ability to serve significantly more requests. Such an architecture also scales horizontally. In the end it should look something like this:
And just as another note, it is not necessary in general to have one MySQL per Hazelcast node; that depends on what the goal is. I also left the Manager out of the diagram, but that part is simple: it can either work as a gateway to the data grid or be merged with the grid.
Not sure if my answer will be useful for you since this question was posted some time back.
I would like to answer it based on your questions, the problems in the current approach, and a proposed solution...
1) How will I get the data from multiple monitors sorted according to the client request at the Mgr? Each monitor may return results matching the client request, but how do I merge the data from multiple machines in Java? In other words, how do I perform in-memory SQL aggregate/scalar functions (e.g. GROUP BY, ORDER BY, AVG) on all the results retrieved from multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar functionality on the Java side; are there any known APIs? I think what I need is the Reduce part of the MapReduce technique in Hadoop.
Java provides the built-in Java DB as part of the JDK distribution, which is also available as the Apache Derby database. It can be used as an in-memory SQL database, and Java DB / Apache Derby can also persist the data to disk, so you won't lose it after a restart.
Check here: http://www.oracle.com/technetwork/java/javadb/overview/index.html and https://db.apache.org/derby/
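As a small illustration (the table and column names are invented, and derby.jar is assumed to be on the classpath), an in-memory Derby database lets you load the merged monitor rows and run the aggregate SQL directly:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Spin up an in-memory Derby database, load the per-monitor results into it,
    // then let plain SQL do the GROUP BY / AVG work.
    public class InMemoryAggregation {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:derby:memory:statsdb;create=true");
                 Statement st = conn.createStatement()) {

                st.executeUpdate("CREATE TABLE stats(host VARCHAR(64), memory_mb DOUBLE)");
                // In the real system each row would come from one perf monitor's response.
                st.executeUpdate("INSERT INTO stats VALUES ('node-1', 512), ('node-1', 2048), ('node-2', 4096)");

                try (ResultSet rs = st.executeQuery(
                        "SELECT host, AVG(memory_mb) FROM stats GROUP BY host ORDER BY host")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                    }
                }
            }
        }
    }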
For the Map-Reduce part, a simple Java collections-based approach would work; I don't think you need a dedicated Map-Reduce framework in this case. You should, however, consider out-of-memory risks, network bandwidth, etc. when you read data from multiple sources.
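For the collections-based route, the Streams API already gives you GROUP BY / AVG style reductions; here is a minimal sketch with invented field names (a Java 16+ record is used only for brevity):

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // "Reduce" step in plain Java: merge the rows returned by every perf monitor
    // and compute the equivalent of GROUP BY host / AVG(memory) with the Streams API.
    public class MergeMonitorResults {

        record Row(String host, long memoryMb) {}

        public static void main(String[] args) {
            // Pretend these lists came back from two different monitors.
            List<Row> fromMonitorA = List.of(new Row("node-1", 512), new Row("node-2", 4096));
            List<Row> fromMonitorB = List.of(new Row("node-1", 2048));

            Map<String, Double> avgMemoryByHost = Stream
                    .concat(fromMonitorA.stream(), fromMonitorB.stream())
                    .collect(Collectors.groupingBy(Row::host,
                             Collectors.averagingLong(Row::memoryMb)));

            System.out.println(avgMemoryByHost); // {node-1=1280.0, node-2=4096.0}
        }
    }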
2) A request from the UI (assume select count(*) from DB where Memory > 1000MB) has to be forwarded to multiple machines. How do I send parallel requests to the individual monitors and consume the results only when all the nodes have responded? That is, how do I make the user thread wait until all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request on the Mgr?
Ideally a NodeJS-style application is really best suited in this case, where the application gets a callback whenever there is a response to the HTTP call. However, you can implement the Observer pattern as explained here: How do I perform a JAVA callback between classes?
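If you stay on the JVM, plain CompletableFuture fan-out/fan-in also covers "wait until all monitors have responded" without NodeJS; here is a sketch using the JDK 11 HttpClient, with the monitor URLs being placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    // Fan a single UI request out to every perf monitor in parallel and only
    // continue once all of them have answered.
    public class ParallelMonitorQuery {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            List<String> monitors = List.of(
                    "http://monitor-1:8080/stats?minMemoryMb=1000",
                    "http://monitor-2:8080/stats?minMemoryMb=1000");

            List<CompletableFuture<String>> calls = monitors.stream()
                    .map(url -> client.sendAsync(
                                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                                    HttpResponse.BodyHandlers.ofString())
                            .thenApply(HttpResponse::body))
                    .collect(Collectors.toList());

            // allOf completes only when every monitor has responded (or failed);
            // join() then blocks the calling thread until that point.
            CompletableFuture.allOf(calls.toArray(new CompletableFuture[0])).join();

            List<String> responses = calls.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
            System.out.println("Merged " + responses.size() + " monitor responses");
        }
    }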
3) Do I have to authenticate UI user at both Mgr and Perf monitor?
That should be based on your requirements.
4) Do you see any drawbacks in this approach?
There are several drawbacks to this approach:
Data should not be pulled on demand from the UI. At the least, the data should already be available in a centralised database whenever there is a request to generate it; pulling data from various endpoints on demand is expensive.
Stats must be collected periodically to maintain history, and reports must be generated based on a moving time window.
The JVM might go OutOfMemory if large amounts of data need to be processed; proper handling is required.
Large amounts of data might get transferred over the network every time there is a new request, possibly for the same data again.
Notes:
1) I didn't go for NoSQL because the data is structured and no joins are required.
NoSQL doesn't mean no structure is followed. A NoSQL database can actually be a good fit for data like this, where you don't update records and transactions etc. are not required.
2) I didn't go for node.js since I am new to it and it might take more time to develop. Also, I am not developing anything concurrency-critical where single-threaded would be best suited. Here only push/retrieval of data is done; no modification happens.
NodeJS won't be a good choice since it is single-threaded. NodeJS should not be used when you have a CPU-intensive job to perform, like yours.
3) I want an individual DB for each monitor, or at least two DB instances with multiple clusters per instance, to support faster access to real-time big statistical data.
I would rather suggest you store the data in a database that can scale horizontally, and process the data either as it arrives or in batches, so that the user experience stays good.
I've read about the CAP theorem and the NoSQL eventual-consistency problem. As I understand it, you can achieve full consistency or full availability, but never both; so if you gain performance you may get stale data or partial transactions. And as I understand it, there is no way around this for clustered data storage so far.
On the other hand, Hazelcast claims it enforces full consistency for IMap.
Question: how does Hazelcast enforce full data consistency? Is that possible because it is RAM-based and may not care about availability (meaning availability is provided anyway)?
I can only answer for Hazelcast. The data is partitioned: we serialize the key, take the hash code of the serialized byte array and take it modulo the partitionCount.
partitionId = hashcode(serialize(key)) % partitionCount
Every partitionId is assigned to a single node (plus backup nodes). If you have mutating operations for a given key, the operation is sent to the owner of the partition, which applies one operation after the other. Therefore you always have a consistent view per partition, and get operations are enqueued just like everything else, so for a single partition there is no chance of seeing stale data.
If you use near-caches, you do end up with a small time window where the owner has already applied a mutation but the near-caches have not yet been invalidated (network latency).
I hope this answers your question :)
I'm designing an application that has to consume live data from several sources and periodically report on it. Consumed data will be added to an Ehcache cache and reports will query it. Once the live data is consumed it needs to be persisted for recovery purposes only. If the application restarts it will prime the cache with historical data from the DB before connecting to the live data sources (which queue new data).
I'm leaning toward implementing it as a cache-as-sor with JDBC caching:
1. Receive data from source
2. Persist to DB
3. Add to cache
4. Confirm receipt with source
with 2-4 wrapped in a JTA transaction.
I also looked into Hibernate with Ehcache as a 2nd level cache, but that doesn't seem appropriate.
I'm relatively new to Ehcache so would like some advice on the right design.
For persistence, rather than doing "cache-aside", you probably want to configure your caches to use read-through plus a cache writer (either write-through or write-behind). You can read about these here: http://ehcache.org/documentation/user-guide/concepts#cache-as-sor
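Just to illustrate the difference from cache-aside (this is not the Ehcache API itself, only the concept in plain Java with an invented table): with read-through/write-through the application only ever talks to the cache, and the cache itself loads from and writes to the system of record:

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Every miss loads from the DB, every put also writes to the DB,
    // so steps 2 and 3 from the question collapse into one cache operation.
    public class ReadWriteThroughCache {
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        private final DataSource ds;

        public ReadWriteThroughCache(DataSource ds) { this.ds = ds; }

        public String get(String key) {
            // Read-through: on a miss, load from the system of record and remember it.
            return cache.computeIfAbsent(key, this::loadFromDb);
        }

        public void put(String key, String value) {
            // Write-through: the DB write happens as part of the cache write.
            writeToDb(key, value);
            cache.put(key, value);
        }

        private String loadFromDb(String key) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement("SELECT val FROM entries WHERE id = ?")) {
                ps.setString(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        private void writeToDb(String key, String value) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                     "MERGE INTO entries(id, val) KEY(id) VALUES (?, ?)")) { // H2-style upsert; adjust for your DB
                ps.setString(1, key);
                ps.setString(2, value);
                ps.executeUpdate();
            } catch (Exception e) { throw new RuntimeException(e); }
        }
    }

With Ehcache you would not hand-roll this class; you would plug the equivalent loader/writer into the cache configuration, and a write-behind setup would move the DB write off the request path entirely.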
Now I'd avoid JTA, as I fear the overhead might be overkill (unless you really need XA transaction recovery), and rather opt for a fault-tolerant approach. If you opt for asynchronous persistence (write-behind), clustering your cache with Terracotta (the write-behind queue would automatically be persistent, recoverable and even HA if multiple nodes are available) is one approach to ensuring every element gets written out to the underlying SoR... all depending on your needs, I guess.
Ehcache would let you start with a single-node, unclustered approach, simply using read- and write-through caches, that you could grow and fine-tune to meet your SLA. As data grows, you'd then be able to move to clustered caches and asynchronous writers (should writes become the issue) or grow your cache sizes (if reads remain the issue). Obviously, you should measure (or at least know what bottlenecks you foresee) and choose accordingly. But putting a cache in front of your RDBMS is a common and well-understood pattern to scale read (and write) access to these "slower" stores...
If all you want is to have data in a cache, Hibernate looks like overkill. All you need is JDBC, both to implement a cache loader for cache initialization and to save the data to a database periodically. Or just set up your cache to persist to disk.
Then Ehcache + Hibernate is not the solution. What you are describing here is an asynchronous event-processing system in which one of the listeners awaits an "event processed successfully" notification before persisting.
NoSQL databases are a far better option in this case, unless you strictly need to rely on a relational database.
In a Java web application, I would like to know whether it is proper (or "standard"?) for all essential data such as config data, message data, code maintenance data, dropdown option data, etc. (assuming none of it is updated frequently) to be loaded into "static" variables from the database when the server starts up. Or is it preferable to retrieve the data by querying the DB per request?
Thanks for all your advice here.
It is perfectly valid to pull out all the data that is not going to be modified during the application life-cycle and keep it in memory as a singleton or something similar.
This is a good idea because it saves DB hits and retrieval is faster. A lot of environment-specific settings and other data can also be pulled once and kept in an immutable hash map for any future request.
In a common web app you generally do not have so many config data/option objects that they eat up a lot of memory and cause OOM. But if you have a table with hundreds of thousands of config entries, it is better to pull the objects as and when requested. And if you do want to keep them in memory, think of putting them in some key-value store like Memcached.
We used the DB to store config values and Ehcache to avoid a lot of DB hits. This way you don't need to worry about memory consumption (it will use whatever memory you have).
Ehcache is one of many available DB cache solutions and can be configured on top of JPA etc.
You can configure Ehcache (or many other cache providers) to treat those tables as read-only, in which case it will only go to the DB when it is explicitly told to invalidate the cache. This performs pretty well. The overhead becomes visible, though, when reads occur very frequently (like 100/sec); usually storing the config value in a local variable, avoiding reads inside loops, and passing it along through the method stack during the invocation mitigates this well enough.
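If you are on JPA/Hibernate with Ehcache as the second-level cache provider, the usual way to express this is a read-only cache concurrency strategy on the config entity; the entity and field names here are just examples:

    import javax.persistence.Cacheable;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    // Marking the config entity as a read-only member of the second-level cache means
    // repeated reads are answered from memory and the DB is only hit again after an
    // explicit eviction of this entity's cache region.
    @Entity
    @Cacheable
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
    public class ConfigEntry {

        @Id
        private String name;

        private String val;

        protected ConfigEntry() {} // required by JPA

        public String getName() { return name; }
        public String getVal() { return val; }
    }

When the underlying table does change, you evict that entity's cache region explicitly (for example via the SessionFactory's Cache API) and the next read repopulates it from the DB.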
Storing values in a singleton as Java objects performs best, but if you want to modify them without restarting the app, it becomes a little more involved.
Here is a simple way to achieve dynamic configuration with Java objects:
private volatile ImmutableMap<String,Object> param_value
Basically you'll have to start thinking about multi-threaded access and memory issues (though it's quite unlikely that you'll run out of memory because of configuration values, unless you store binary data as config values, etc.).
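A minimal sketch of that idea follows; the table and column names are invented, and while the suggestion above used Guava's ImmutableMap, an unmodifiable JDK map is used here only to keep the example dependency-free:

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // Readers always see a consistent snapshot of the config, and a reload
    // swaps the whole map in a single volatile assignment.
    public final class AppConfig {
        private static volatile Map<String, String> values = Collections.emptyMap();

        // Called at startup and whenever an admin triggers a refresh.
        public static void reload(DataSource ds) throws Exception {
            Map<String, String> fresh = new HashMap<>();
            try (Connection c = ds.getConnection();
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery("SELECT param_name, param_value FROM app_config")) {
                while (rs.next()) {
                    fresh.put(rs.getString(1), rs.getString(2));
                }
            }
            values = Collections.unmodifiableMap(fresh); // atomic reference swap, no locking needed for readers
        }

        public static String get(String name) {
            return values.get(name);
        }
    }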
In essence, I'd recommend using the DB and some cache provider unless that part of code really needs high-performance.
Hi what do you think about this problem?
We have too much information in the HttpSession, because a lot of information is computed and a few large object graphs end up needing to be stored between requests.
Is it appropriate to use a cache like memcached or similar? Or is that the same as increasing memory for the JVM?
There's a fear of storing it in the DB between requests. What would you use if we are getting OutOfMemory errors?
Thank you.
I think the real point is the lifespan of your data.
Think about these two characteristics of the HttpSession:
When in a cluster, the container is responsible for replicating the HttpSession. This is good (you don't have to manage this yourself), but can be dangerous in terms of performance if it leads to too many exchanges... If your application is not clustered, forget about this point.
The lifespan of the HttpSession can be a few minutes or a few hours, i.e. as long as the user stays active. This is perfect for information that has that lifespan (connection information, preferences, authorizations...), but it is not appropriate for data that is only useful from one screen to the next; let's call it transient+ data.
If you have clustering needs, the database takes care of it. But beware, you can't cache anything in memory then.
Storing in the database has an even longer lifespan (persistent between sessions, and even between reboots!), so the problem would be even worse (except you trade a memory problem for a performance problem).
I think this is the wrong approach for data whose lifespan is not expected to be persistent ...
Transient data
If data is useful only for one request, then it is typically stored in the HttpRequest, fine.
But if it is used over a few requests (interactions within one screen, or within a screen sequence like a wizard...), the HttpRequest is too short-lived to store it, while the HttpSession is too long-lived. The data needs to be cleaned regularly.
And many memory problems in the HttpSession are related to such data that is transient but was not cleaned (forgotten entirely, not cleaned when an exception occurred, or not cleaned when the user doesn't follow the regular flow: hits Back, uses an old bookmark, clicks on a different menu or whatever).
A caching library to get the correct lifespan
To avoid this cleaning effort altogether (and avoid the risk of OutOfMemory when things go wrong), you can store information in a data structure that has the right lifespan. As the container doesn't provide this (it is application-related anyway), you need to implement it yourself using a cache library (like the ones mentioned; we use Ehcache).
The idea is that you have technical code (not related to one functional page, but implemented globally, such as in a ServletFilter...) that ensures the cleaning always happens once the objects are no longer needed.
You can design this cache using one (or several, as needed) of the following policies for cleaning it, each policy corresponding to a functional lifespan:
for data related to a single screen only (but several requests: reloading the screen, Ajax requests...), the cache can store data for only one screen at a time (per session); call it the "currentScreenCache". That guarantees that, if the user goes to another screen (even in an unmanaged way), the new screen will override the "currentScreenCache" information, and the previous information can be garbage-collected.
Implementation idea: each request must carry its screenId, and the technical code responsible for clearing the cache detects when, for the current HttpSession id, the current screenId doesn't match the one in the cache; it then cleans or resets that item in the cache (see the sketch after this list).
for data only used in a series of connected screens (call it a functional module), the same applies at the level of the module.
Implementation: same as before, every request has to carry the module id...
for data that is expensive to recompute, the cache library can be configured to store only the last X computed values (older ones are considered less likely to be useful in the near future). In typical usage the same values are requested regularly, so you get many cache hits. Under intensive use the X limit is reached and memory doesn't inflate, preventing OutOfMemory errors (at the expense of re-computation the next time).
Implementation: cache libraries natively support this limiting factor, and several more...
for data that is only valid for a few minutes, the cache library can natively be configured to discard it after that delay...
... many more, see the caching library configuration for other ideas.
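As a sketch of the "currentScreenCache" policy mentioned above (a plain map stands in for the real cache library, and the screenId request parameter is an assumed convention):

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;
    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Each request carries a screenId parameter; when it changes for a given session,
    // the previous screen's cached data is dropped before the request proceeds.
    public class ScreenScopedCacheFilter implements Filter {

        // sessionId -> (screenId -> cached data); a real implementation would delegate to Ehcache.
        private static final Map<String, Map<String, Object>> CACHE = new ConcurrentHashMap<>();
        private static final Map<String, String> CURRENT_SCREEN = new ConcurrentHashMap<>();

        @Override
        public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpSession session = request.getSession();
            String screenId = request.getParameter("screenId");

            if (screenId != null) {
                String previous = CURRENT_SCREEN.put(session.getId(), screenId);
                if (previous != null && !previous.equals(screenId)) {
                    // User navigated away (Back button, bookmark, menu...): drop the old screen's data.
                    CACHE.remove(session.getId());
                }
            }
            chain.doFilter(req, resp);
        }

        @Override public void init(FilterConfig filterConfig) { }
        @Override public void destroy() { }
    }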
Note: Each cache can be application-wide, or specific to a user, a HttpSession id, a Company id or other functional value...
It's true that the HttpSession doesn't scale well, but that's mainly in relation to clustering. It's a convenience, but at some point, yes, you are better off using something like memcached, Terracotta, or Ehcache to persist data between requests (or between users).