My App Engine (Java) application works with a data structure that needs frequent updates to many items. The amount of data should not exceed 1000 records per client, but the number of clients is unbounded, so I'm not willing to do 1000 reads and 1000 writes every second just to update some counters.
Naturally I'm thinking about using Memcache. Ideally the data would stay in memory all the time so I can read and update it frequently, and it would only be written to the datastore when the cache is full or the VM is being shut down (my biggest concern). Can I implement some sort of write-back strategy where the cache is only written to storage when it needs to be?
In particular my two questions are:
How do I know when an item is deleted from the cache?
How do I know when the VM is being shut down, so I can persist the content of the cache?
Short answer: No.
Longer answer: Memcache offers no guarantees.
Useful answer: Look at https://developers.google.com/appengine/articles/scaling/memcache#transient. If losing data is acceptable, you can rely on memcache (just expect that some entries will occasionally be lost).
Don't worry about the VM being shut down though: Memcache runs outside of the instance VM, and is shared between all the app instance VMs.
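To make the "treat memcache as transient" approach concrete, here is a minimal read-through sketch using the App Engine Java Memcache API. The key/value types, the expiration value, and the loadFromDatastore placeholder are assumptions for illustration; the datastore stays the source of truth and the cache is purely an optimization.

    import com.google.appengine.api.memcache.Expiration;
    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;

    public class ReferenceCache {

        private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

        // Read-through: try memcache first, fall back to the datastore on a miss.
        // Anything put here can be evicted at any time, so nothing lives only in memcache.
        public String getValue(String key) {
            String cached = (String) memcache.get(key);
            if (cached != null) {
                return cached;
            }
            String value = loadFromDatastore(key);                        // hypothetical datastore read
            memcache.put(key, value, Expiration.byDeltaSeconds(3600));    // illustrative TTL
            return value;
        }

        private String loadFromDatastore(String key) {
            return ""; // placeholder: fetch the record from the datastore here
        }
    }

With this pattern an eviction only costs an extra datastore read, which is what the linked article means by keeping memcache data transient.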
Related
In a Spring Boot application, I keep a TreeMap in memory. I'm doing around 10,000 operations per second on it, and that may increase, so I keep the data in memory to sustain the throughput. I also want my app to start from the same state when the application is restarted.
These are the approaches I've been able to find for this:
Keeping data on Hazelcast.
In this case I don't risk losing the data unless Hazelcast itself dies, but if it does die, I can't restore the data. Additionally, I'm not sure it makes sense to push that volume of operations through Hazelcast.
Synchronizing events to database.
Here, my risk of data loss is very low. However, I need to execute a query after each operation. This may affect performance. Also, I need to handle exceptions on database update.
Synchronizing data in batches
The only ready-made solution I could find here is MapDB. I'm planning to try it but haven't yet (a rough sketch of what it looks like is below). If there is a more reliable, better-optimized sink that writes to a database instead of a file, I would prefer that.
Any recommendations for solving this?
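For the third option, a file-backed MapDB TreeMap looks roughly like this sketch (based on the MapDB 3.x API; the file name, map name and serializers are assumptions to be adapted to your key/value types):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    import java.util.concurrent.ConcurrentNavigableMap;

    public class MapDbSketch {
        public static void main(String[] args) {
            // Open (or create) a file-backed store with transactions enabled,
            // so a crash rolls back to the last commit instead of corrupting the file.
            DB db = DBMaker.fileDB("state.db")
                    .transactionEnable()
                    .make();

            // A sorted, persistent map that behaves like a TreeMap.
            ConcurrentNavigableMap<Long, String> map = db
                    .treeMap("data", Serializer.LONG, Serializer.STRING)
                    .createOrOpen();

            map.put(1L, "first");
            map.put(2L, "second");
            db.commit();   // persist this batch of changes
            db.close();
        }
    }

Commits can be batched (every N operations or on a timer) to trade durability for throughput, which matches the "synchronizing data in batches" idea.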
Do you need a Map or a TreeMap?
Is the collating sequence relevant for storage, for access, or neither?
For Hazelcast, the chance of data loss is configurable: you set up a cluster with the level of resilience you want. It's the same as with disks; if you have one disk and it fails, you lose data, but if you have two and one goes offline, you don't. You allocate hardware for the level of resilience you need. Three nodes is the recommended minimum.
(10,000 per second isn't worrying either, 1,000,000,000 has been done. Sync to an external store can be immediate or in batches)
Disclaimer, I work for Hazelcast, but I think your question is more fundamental -- how do you keep your store available.
Simply, don't restart.
Clustered solutions are the answer here. If you have multiple nodes, the service as a whole stays running even if a few nodes go offline.
Do rolling bounces.
If you must restart everything at once, what matters is how quickly your service can bring all the data back and what it does while the restore is 50% done (is 50% of the data visible?). Immediate replication elsewhere is only really necessary if you have a clustered solution that hasn't been configured for resilience; saving intermittently is fine once resilience is solved.
So configure your storage so that it doesn't go offline; that makes the options for backup/restore all the easier.
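As a concrete illustration of syncing to an external store "immediate or in batches", this is roughly what a write-behind MapStore looks like in Hazelcast. Package names assume Hazelcast 4.x/5.x; the map name, the write delay, and the persistence calls marked as placeholders are assumptions, not a finished implementation.

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapStoreConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;
    import com.hazelcast.map.MapStore;

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;

    public class WriteBehindSketch {

        // Persists map entries to an external database; with a write delay configured,
        // Hazelcast calls this asynchronously and in batches (storeAll) rather than per operation.
        static class DbMapStore implements MapStore<Long, String> {
            @Override public void store(Long key, String value) { /* placeholder: single-row upsert */ }
            @Override public void storeAll(Map<Long, String> entries) { /* placeholder: batched upsert */ }
            @Override public void delete(Long key) { /* placeholder: delete row */ }
            @Override public void deleteAll(Collection<Long> keys) { /* placeholder: batched delete */ }
            @Override public String load(Long key) { return null; }                       // placeholder: read row
            @Override public Map<Long, String> loadAll(Collection<Long> keys) { return Collections.emptyMap(); }
            @Override public Iterable<Long> loadAllKeys() { return null; }                // null = no eager preload
        }

        public static void main(String[] args) {
            Config config = new Config();
            MapStoreConfig storeConfig = new MapStoreConfig()
                    .setEnabled(true)
                    .setImplementation(new DbMapStore())
                    .setWriteDelaySeconds(5)      // 0 = write-through, >0 = write-behind
                    .setWriteBatchSize(1000);     // flush in batches
            config.getMapConfig("data").setMapStoreConfig(storeConfig);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            IMap<Long, String> map = hz.getMap("data");
            map.put(1L, "value");                 // hits the cluster now, the database a few seconds later
        }
    }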
I am running a Java web app.
A user uploads a file (max 1 MB) and I would like to store that file until the user completes an entire process (which consists of multiple requests).
Is it ok to store the file as a byte array in the session until the user completes the entire process? Or is this expensive in terms of resources used?
The reason I am doing this is that I ultimately store the file on an external server (e.g. AWS S3), but I only want to send it there once the whole process is completed.
Another option would be to just write the file to a temporary file on my server. However, this means I would need to remove the file if the user abandons the process. It seems excessive to add code to the sessionDestroyed method of my SessionListener to remove the file for just this one particular case (i.e. sessions are created throughout my entire application, and most of them never have a temp file to check for).
Thanks.
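For what it's worth, the temp-file variant described above doesn't have to be very invasive, since the listener only acts on sessions that actually carry the attribute. A minimal sketch, where the attribute name "tempUploadPath" is made up and would be set by the upload code when it writes the temp file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.servlet.annotation.WebListener;
    import javax.servlet.http.HttpSessionEvent;
    import javax.servlet.http.HttpSessionListener;

    @WebListener
    public class TempUploadCleanupListener implements HttpSessionListener {

        @Override
        public void sessionDestroyed(HttpSessionEvent se) {
            // Only sessions that actually staged a file carry this attribute,
            // so every other session falls through without any extra work.
            String path = (String) se.getSession().getAttribute("tempUploadPath");
            if (path != null) {
                try {
                    Files.deleteIfExists(Paths.get(path));
                } catch (IOException e) {
                    // best effort: log and move on
                }
            }
        }

        @Override
        public void sessionCreated(HttpSessionEvent se) {
            // nothing to do
        }
    }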
Maybe Yes, maybe No
Certainly it is reasonable to store such data in memory in a session if that fits your deployment constraints.
Remember that each user has their own session. So if all of your users have such a file in their session, you must multiply to calculate the approximate impact on memory usage (for example, a 1 MB file held in each of 5,000 concurrent sessions is roughly 5 GB of heap).
If you exceed the amount of memory available at runtime, there will be consequences. Your Servlet container may serialize less-used sessions to storage, which is a problem if you’ve not programmed all of your objects to support serialization. The JVM and OS may use a swap file to move contents out of real memory as part of the virtual memory system. That swapping may impact or even cripple performance.
You must consider your runtime deployment constraints, which you did not disclose. Are you running on a Raspberry Pi or inexpensive little cloud server with little memory available? Or will you run on an enterprise-class server with half a terabyte of RAM? Do you have 3 users, 300, or 30,000? You need to crunch the numbers and determine your needs, and maybe do some runtime profiling to see actual usage.
For example… I write web apps using the Vaadin Framework, a sophisticated package for creating desktop-style apps within a web browser. Being Servlet-based, Vaadin maintains a complete representation of each user's entire work data on the server side, in the Servlet session. Multiplied by the number of users, and depending on the complexity of the app, this can require a lot of memory. So I need to account for this and run my server on sufficient hardware, with a 64-bit JVM tuned to run with a large amount of memory, or take other approaches such as load-balancing across multiple servers with sticky sessions.
Fortunately, RAM is quite cheap nowadays, and 64-bit hardware with support for large amounts of physical RAM, 64-bit operating systems, and 64-bit JVM implementations (Azul, others) are all readily available.
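If the numbers do work out for your deployment, holding the upload in the session is just a plain attribute. A minimal sketch, where the servlet path, part name and attribute name are made up for the example, and readAllBytes needs Java 9+ (copy the stream manually on older versions):

    import javax.servlet.ServletException;
    import javax.servlet.annotation.MultipartConfig;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.Part;
    import java.io.IOException;

    @WebServlet("/upload")
    @MultipartConfig(maxFileSize = 1024 * 1024)   // reject anything over 1 MB up front
    public class UploadServlet extends HttpServlet {

        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            Part filePart = req.getPart("file");                      // form field name is an assumption
            byte[] bytes = filePart.getInputStream().readAllBytes();  // roughly 1 MB held per active session
            req.getSession().setAttribute("pendingUpload", bytes);
            resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
            // When the multi-step process completes: read the attribute, push the
            // bytes to S3, and remove the attribute so the memory is released early.
        }
    }

A sensible session timeout matters here too, since the bytes stay on the heap until the session is invalidated or expires.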
Does anyone know what the proper configuration/development approach is when writing an application that only uses a cache as its store?
To give some background, the application doesn't need to store any information (it actually stores a timestamp, but I'll explain that later) because it only reads what another app writes. We have a stored procedure that reads from that application's database and returns the information as of that point. From the moment the application starts, any update is notified through a topic, so the database is no longer needed (until the next restart).
Once everything is loaded, every record in the cache has to be read when certain messages are consumed, so we loop through them and process them individually. The application keeps a Map of Lock objects, one for each record in the cache, to avoid race conditions. If a record meets certain criteria, a timestamp is written to the cache and to a database using write-behind in batches of up to 5000 records.
The application is already developed but I think we have some problems with GCs. We keep getting spikes and I would like to know if there is any recommendation on what to do to reduce them.
These are the things we've done so far:
There is a collection of Strings that are repeated over and over across records, so I'm interning them (we are using Java 8).
The cache we are using is EhCache. To avoid recreating objects, the element from the cache is used directly.
Every variable is a long or a String, except for an enum value and a LocalDateTime that is required to do some date checks.
There are two caches. This is because, once the criteria are met, a timestamp has to be replicated to another instance of the app. For this we use EhCache's JMS replication, which uses topics for these updates.
The timestamp updates don't happen very often, so their impact should be minimal.
There are currently 350,000 records, each with a bunch of Strings and longs alongside the enum and LocalDateTime mentioned before.
An intermittent problem is that the application sometimes throws "GC overhead limit exceeded". Normally memory usage drops back down after a few GCs, but it seems that sometimes it can't handle the load.
The box has 3 GB of memory for this, and after a major GC the application uses around 500 MB for the cache.
Apart from this, I don't know how the JVM is configured or which garbage collector it uses. Any ideas, or any blogs or documents someone could suggest I start reading?
Thanks!
As you are running Java 8, you could change the garbage collector. The so-called "Garbage First" (G1) collector has been available as an option since early versions of Java 7; its early problems have been resolved, and it is often recommended for interactive applications that need fast response times.
It can be enabled with -XX:+UseG1GC and becomes the default in Java 9.
Read more about it at http://www.oracle.com/technetwork/tutorials/tutorials-1876574.html
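For example, a Java 8 launch line that switches to G1, adds G1's string deduplication (Java 8u20+, related to the interning you are already doing), and turns on GC logging so you can see what the spikes actually are might look like the following. The heap sizes are only illustrative; leave headroom below the box's 3 GB for the OS and off-heap memory.

    java -Xms2g -Xmx2g \
         -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
         -XX:+UseStringDeduplication \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log \
         -jar app.jar

The GC log will also show whether the "GC overhead limit exceeded" errors coincide with the cached data simply outgrowing the heap, which no collector choice can fix on its own.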
I have a webapp in development. I need to plan for what happens if the host goes down.
I will lose some very recent session status (which I can live with) and everything else should be persistently stored in the database.
If I am starting up again after an outage, can I expect a good host to reconstruct the database to within minutes or seconds of where I was up to, or should I build in a background process to continually mirror the database elsewhere?
What is normal/sensible?
Obviously a good host will have RAID and other redundancy, so the likelihood of total loss should be low, and with periodic backups I should lose only very recent data. But such backups are presumably designed with mostly static web content in mind, whereas my site is transactional, with new data being filed continuously (and a customer expectation that I never lose any of it).
Any suggestions/advice?
Are there off-the-shelf frameworks for doing this? (I'm primarily working in Java.)
Should I just plan to save the data or should I plan to have an alternative usable host implementation ready to launch in case the host doesn't come back up in a suitable timeframe?
You need a replication strategy which of course depends on your database engine.
It's usually done by configuration.
http://en.wikipedia.org/wiki/Replication_%28computer_science%29
I have experience with Informix: there you can set up data replication to keep a standby system available, or take a full backup of the data and replay the logical logs (which contain basically every SQL statement), which takes more time to recover from a crash.
Having redundant storage is also a good idea in case a disk crashes. This topic is probably better discussed on serverfault.com.
My Java application provides reference data: it loads lots of data from XML files into HashMaps, and we then look up an entry by id; there are multiple such maps for different sets of business data. The problem is that when I execute the application for the same request multiple times, the response times differ: 31 ms, 48 ms, 72 ms, 120 ms, 63 ms, etc., so there is a considerable gap between the minimum and maximum execution time. Ideally I would expect response times like 63 ms, 65 ms, 61 ms, 70 ms, 61 ms, but in my case the response time for the same request varies hugely. I used an open-source profiler to check whether any extra methods were being executed or there was a memory leak, but as far as I can tell there was no problem. Please let me know what the reasons could be and how I can address this.
There could be many causes:
Is your Java application restarted for each run? If so, the JVM startup time could be responsible for the variations. If not, it could be that the garbage collector kicks in at an unfortunate moment.
Is anything else running on that machine?
Is the disk cache "warmed up" in some cases, but not in others? That is, have the files been recently accessed so that they are still in memory?
If this is a networked application, is there any network activity during the measurements?
If there is a remote machine involved (e.g. a database server or a file server), do the above apply to that machine as well?
Use a profiler to find out which piece of code is responsible for the variations in time.
If you don't run a real-time system, then you can't be sure it will execute within a certain time.
Operating systems constantly do other things, mostly housekeeping tasks and providing services to other processes. This can easily slow parts of your system down by 50 ms.
There may also be time spent waiting for I/O, such as hard disks or network communication.
Besides that, the JVM makes no real-time promises, which means the garbage collector can run at any point. Its effect is very small on a normal application, but it can be large if you create and discard lots of objects (as you might when loading many or large files).
Finally, it could be your algorithm (do you run the same data each time?); with different data you can get different execution times.
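One simple way to separate JIT warm-up, cache effects, and GC from real variation is to time the same lookup repeatedly in one process, with an explicit warm-up phase. This is only a rough harness, not a substitute for a proper benchmarking tool such as JMH; the map contents and the looked-up id are made up for the example.

    import java.util.HashMap;
    import java.util.Map;

    public class LookupTimer {

        public static void main(String[] args) {
            Map<String, String> reference = new HashMap<>();
            reference.put("id-1", "some reference data");   // stand-in for the XML-loaded maps

            // Warm-up: lets the JIT compile the hot path and warms CPU caches,
            // so the measured runs reflect steady-state behaviour.
            for (int i = 0; i < 100_000; i++) {
                reference.get("id-1");
            }

            // Measured runs: these should be far more stable than cold runs,
            // with occasional outliers when the OS or the GC interferes.
            for (int run = 0; run < 10; run++) {
                long start = System.nanoTime();
                for (int i = 0; i < 1_000; i++) {
                    reference.get("id-1");
                }
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println("run " + run + ": " + micros + " µs for 1,000 lookups");
            }
        }
    }

If the timings stabilise here but not in your real requests, the variation is coming from the environment (GC, OS, I/O) rather than from the lookup code itself.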