To shard or not to shard? GAE/java/jdo

I'm currently porting some work from MySQL to Google App Engine/Java. I'm using JDO, as well as the lower level java API where required.
I read through the optimization guide about sharding counters: http://code.google.com/appengine/articles/sharding_counters.html
I'm still building the foundation of my app. I know that premature optimization is the root of all evil; but this is clearly documented in order to avoid contention. So I'm having trouble deciding if I should be biased one way or the other.
So should I be sharding counters (and other possibly higher frequency write operation objects) by default, or should I go forward without sharding and implement on an as needed basis?

The salient meaning of "premature" here is "before the proper time." Designing to avoid limits, when those limits are well understood, is not premature.
Shard your counters.

Even with effective sharding, maintaining aggregates can add substantial load to your application. If you need that aggregate and you can't afford an approximation, then using a sharded aggregate is not a premature optimization; there is no next best alternative. If you don't actually need the counter, then the time it would take to implement it could be better spent elsewhere.
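
To make the technique concrete, here is a minimal in-memory sketch in Java. It is not App Engine datastore code: the class name ShardedCounter and the array of slots are assumptions made for illustration, with each slot standing in for one shard entity. It only shows the shape of the idea: writes are spread over several shards, and reads sum them.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

/** In-memory illustration only; each slot stands in for one datastore shard entity. */
class ShardedCounter {
    private final AtomicLongArray shards;

    ShardedCounter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shards = new AtomicLongArray(shardCount);
    }

    /** A write touches one randomly chosen shard, so concurrent writers rarely collide. */
    void increment() {
        int i = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(i);
    }

    /** A read pays the cost instead: the total is the sum over all shards. */
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }
}
```

The trade-off is visible in the two methods: more shards mean less write contention but a more expensive read, which is the extra load on aggregates mentioned above.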

Related

Java Caching frameworks for maintaining huge data

Java Caching frameworks for storing huge data.
Context: We are developing a RESTful service using Jersey 2.6 and will deploy it on WAS 8.5. This service needs to serve more than 10 million requests per day.
We need to implement a cache to store more than 300k objects (the data will come from a DB), and we need some way to update the cache on a daily basis.
Is this approach of caching 300k objects and updating them on a daily basis recommended?
Are there any Java frameworks which support this kind of functionality?

Your question is too general to get a clear answer. You need to describe the problem you are trying to solve.
Are you concerned about response times?
Are you trying to protect your DB from doing heavy lifting?
Are you expecting to have to scale out and want to be sure that you can deal with future loads?
Additionally some more contextual information would be useful, especially:
How dynamic is your data compared to your requests?
What percentage of your data population will be requested on average per day? (How many of the 3 lakh objects will be enquired upon at least once per day? If you don't know, provide your best guess).
Your figures given as 3 lakh (300k) data points and 10M requests means that you are expecting to hit each object on average 33 times a day, which indicates that you are more concerned about back end DB load than your responses being right up to date.
In my experience there are a lot of fairly primitive solutions which will work much better than going for a heavyweight distributed system such as Mongo, Cassandra or Coherence.
My first response would be: Keep it simple - 300k objects is not too much to store in an internal hash table which you flush once a day and populate on first request.
If you need to scale horizontally, I would suggest Memcached (with the Spymemcached client) and a 1 day cache time, populating an entry whenever you don't find an existing one.
I would NOT go for something like Cassandra or Mongo unless you have real compelling reasons to require a persistent store. Rationale: Purging can become really onerous, especially if your data is fast moving. For example: Cassandra does not really know how to delete, but instead "tombstones" deleted entries, which means that your data store will simply grow and grow until you create a strategy for purging.
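
A minimal sketch of the "keep it simple" option above, assuming the objects fit on the heap of one JVM. The class name DailyFlushCache and the loader function (which would wrap the DB lookup) are hypothetical names introduced for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Local cache that loads an entry on first request and is emptied once a day. */
class DailyFlushCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical: wraps the DB lookup

    DailyFlushCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Drops everything; later requests repopulate the cache lazily. */
    void flush() {
        entries.clear();
    }

    /** Schedules flush() every 24 hours on a daemon thread. */
    ScheduledExecutorService startDailyFlush() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "cache-flush");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(this::flush, 24, 24, TimeUnit.HOURS);
        return scheduler;
    }
}
```

With roughly 33 requests per object per day, only the first request after each flush would reach the database; the rest are served from memory.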

The question is whether the cache must be distributed. Remember the caching is something you have seen. And posting this around for the chance it might be of use... well why.
Distributed cache systems: Redis, Cassandra in memory, MongoDB in memory.
A local RocksDB (which lets you store byte[] -> byte[]) on SSDs makes a fine local cache layer. You might also add a distributed layer on top of it. That is usually better than something off the shelf, and it should also be easy to implement.
10 million requests per day isn't much. Even if they all arrive within 10 hours, that is 1 million per hour / 60 / 60 => roughly 280 requests per second. Depending on the effort you can afford, you can usually get by with an efficient frontend and an efficient backend. We can do 40k pages per second per core, and with 24 cores... you do the math. Data in memory, no caching done...

For the caching provider I suggest Coherence. I am using Coherence at my company, and it is very robust and synchronized over multiple clusters.
As for how to handle the cache, it depends on the nature of your application. Based on my experience with caching, I decided to update the cache in the following scenarios:
1. Grid paging
2. Browsing
and decided to clear the cache and reload the data again:
Edit item
Add new item
Delete item
I decided this because maintaining the cache entry by entry is an overkill headache that will blow up in your face when you handle statistics and nested hierarchies.
Hope this helped you.

Yes, there are, for example Coherence and Hazelcast. Both are distributed caches.
http://java.dzone.com/articles/sneak-peek-jcache-api-jsr-107
In general you should cache what you are using, and the cache should always be in sync, not just daily. You place the recently used objects in the cache, and you read and write through the cache to your DB.
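
A minimal sketch of the read-through/write-through idea, assuming a single JVM. The class name WriteThroughCache is hypothetical, and a plain Map stands in for the database. This is not how Coherence or Hazelcast implement it, only the pattern itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Read-through / write-through cache in front of a backing store. */
class WriteThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> store; // stand-in for the database

    WriteThroughCache(Map<K, V> store) {
        this.store = store;
    }

    /** Read-through: on a miss, fetch from the store and remember the value. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Write-through: update the store first, then the cache, so both agree. */
    void put(K key, V value) {
        store.put(key, value);
        cache.put(key, value);
    }
}
```

Note that the two steps in put are not atomic; a real implementation would need to decide what happens when concurrent writers touch the same key.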

If you have the money, the best one is Coherence (its reputation has been proven at big financial companies).
Hazelcast is another distributed in-memory cache you can use; it is one level below Coherence on performance metrics.

You could try Ehcache. It can be used as a query cache or even as a Hibernate second-level cache.
You can configure how long entities should be stored in cache before they are invalidated.

If you already have WebSphere ND 8.5.5, you may take a look at WebSphere eXtreme Scale, which is provided with it. It is a distributed, partitioned caching solution that integrates with WebSphere. See the WebSphere eXtreme Scale overview for more details.

See the new JCache standard (JSR 107 in the Java Community Process). This API is implemented by Coherence and other caching implementations (ehcache etc.), and also has a small reference implementation that you can use for basic use cases.
Yes, any of the Java caching frameworks should be able to help you. Coherence (note: I work with Coherence at Oracle) for example can definitely handle 3,00,000 items easily (I assume you are from India if you use lakh!), but I suggest only using Coherence if you are deploying this on more than one server.

Solution to provide shared entities between multiple Java processes

I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows, I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source

You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.

If you only want to exchange objects between Java applications, you could also look into tuple spaces. There are specific implementations of spaces for Java, JavaSpaces, which should be able to do what you need. I'm not sure if they can keep up with the performance, though. I'm also not sure how widely this technology is still being used, since it only supports Java and isn't as flexible as NoSQL stores are these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis, you have notifications there and it can for sure scale to the requirements you are looking for.

The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view I would say distributed caches are the replacement for that nowadays. In particular, take a look at Hazelcast and Infinispan.
From the performance viewpoint I am not happy with the performance of the "big" distributed caching solutions, when only a single in-memory cache is needed, see my writeup on the cache2k benchmarks page (hazelcast needs to be added here).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category. You are not describing the actual problem, and your question just boils down to asking for a "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? How is it accessed? What consistency guarantees need to be met, considering that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.

Overhead of using coherence cache

I am considering caching key-value lists stored in a database. Right now, when rendering JSF pages, a lot of redundant queries are executed to find the names to be displayed for some keys (O/R mapper: EclipseLink).
The values are quasi-static; they change very rarely, and only through the application itself (nothing else changes the database).
A simple cache would suffice when using only one application server. However, load balancing with multiple servers should be possible without returning stale values when data is changed on one server and therefore not reflected on the other.
One idea would be to use Oracle Coherence as a distributed cache. I'm not sure whether this is overkill, given that the data changes very rarely and the cache itself does not need to be distributed, only the invalidation.
What is the overhead of Coherence in terms of memory, execution time and network communication? Are there any alternatives that better suit my use case?
I am talking about 50,000 key-value pairs, mainly short strings.

If the invalidation is that rare, then you can use a local cache and something like a JMS Topic that everyone subscribes to in order to handle the invalidation.
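
A minimal sketch of that pattern, runnable in a single JVM. InvalidationTopic is an in-process stand-in for the JMS topic (a real setup would use a JMS publisher and a MessageListener instead), and LocalNameCache and dbLookup are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Function;

/** In-process stand-in for a JMS topic: every subscriber sees every published key. */
class InvalidationTopic {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    void publish(String key) {
        subscribers.forEach(s -> s.accept(key));
    }
}

/** One instance per application server: caches locally, evicts on invalidation. */
class LocalNameCache {
    private final ConcurrentHashMap<String, String> names = new ConcurrentHashMap<>();
    private final Function<String, String> dbLookup; // hypothetical database query
    private final InvalidationTopic topic;

    LocalNameCache(Function<String, String> dbLookup, InvalidationTopic topic) {
        this.dbLookup = dbLookup;
        this.topic = topic;
        topic.subscribe(key -> names.remove(key));
    }

    /** Serves from the local cache, querying the database only on a miss. */
    String name(String key) {
        return names.computeIfAbsent(key, dbLookup);
    }

    /** Call after the application has written the new value to the database. */
    void changed(String key) {
        topic.publish(key);
    }
}
```

Only the changed key travels over the topic, so the network cost is proportional to the number of changes rather than to the 50,000 cached entries.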
There's also something like EHCache as an alternative, since it's OSS and free to use, unlike Coherence, if that's important. I like to use EHCache's pull-through ability.

Coherence has relatively low overhead, and can easily manage 50,000 (or 50,000,000) objects. However, if your use case is super simple, and you don't mind doing the invalidation work yourself, and don't need the various QoS that Coherence provides, then it probably is overkill.
Also, this simple use case can easily be done using the Coherence Standard Edition, which is far less expensive (licensed per server instead of per processor, and it's a much lower price).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

Java Fast Data Storage & Retrieval

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:
Extremely fast retrieval and insertion
Each record will have a unique key. This key will be used to retrieve the record
The data stored should be persistent i.e. should be available upon JVM restart
A separate process would move stale records to RDBMS once a day
What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not really call for SQL. I am thinking of some kind of flat-file-based solution. Any ideas? Any open source project? I am sure someone must have solved this problem before.

There are lot of diverse tools and methods, but I think none of them can shine in all of the requirements.
For low latency, you can only rely on in-memory data access - disks are physically too slow (and SSDs too). If data does not fit in the memory of a single machine, we have to distribute our data to more nodes summing up enough memory.
For persistency, we have to write our data to disk after all. Assuming optimal organization, this can be done as a background activity that does not affect latency.
However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
Data model is not restricting here: most of the methods support access based on a unique key.
We have to decide,
if data fits in the memory of one machine, or we have to find distributed solutions,
if concurrency is an issue, or there are no parallel operations,
if reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.
Solutions might be
Self-implemented data structures using the standard Java library, files etc. may not be the best solution, because reliability and low latency require clever implementations and lots of testing.
Traditional RDBMSs have a flexible data model, durable, atomic and isolated operations, caching etc. They actually know too much, and are mostly hard to distribute. That's why they are too slow if you cannot turn off the unwanted features, which is usually the case.
NoSQL and key-value stores are good alternatives. These terms are quite vague, and cover lots of tools. Examples are
BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine.
Project Voldemort as a distributed key-value store: uses BerkeleyDB java edition inside, simple and distributed,
ScalienDB as a distributed key-value store: reliable, but not too slow for writes either.
MemcacheDB, Redis and other caching databases with persistency,
popular NoSQL systems like Cassandra, CouchDB, HBase etc: used mainly for big data.
A list of NoSQL tools can be found eg. here.
Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily, however we have to be careful with the hardware too (like the network properties mentioned above).

Have a look at LinkedIn's Voldemort.

If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.

What about something like CouchDB?

I would use a BlockingQueue for that. Simple, and built into Java.
I do something similar using realtime data from the Chicago Mercantile Exchange.
The data is sent to one place for realtime use, and to another place (via TCP) using a BlockingQueue (Producer/Consumer) to persist the data to a database (Oracle, H2).
The Consumer uses a time-delayed commit to avoid disk sync issues in the database.
(H2-type databases commit asynchronously by default and avoid that issue.)
I log the persisting in the Consumer to keep track of the queue size, to be sure it is able to keep up with the Producer. Works pretty well for me.
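
A minimal sketch of such a producer/consumer pipeline. The class name BatchingPersister and the batchWriter callback (which would insert the records and commit once per batch) are assumptions for illustration; batching is used here as one simple way to approximate the delayed commit, not necessarily how the setup above does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Consumer side of a producer/consumer pipeline that persists records in batches. */
class BatchingPersister implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Consumer<List<String>> batchWriter; // hypothetical: insert + one commit
    private final int maxBatch;
    private volatile boolean running = true;

    BatchingPersister(Consumer<List<String>> batchWriter, int maxBatch) {
        if (maxBatch < 1) {
            throw new IllegalArgumentException("maxBatch must be at least 1");
        }
        this.batchWriter = batchWriter;
        this.maxBatch = maxBatch;
    }

    /** Producer side: hand a record over without waiting for the database. */
    void submit(String record) {
        queue.add(record);
    }

    /** Queue size, useful for logging whether the consumer keeps up. */
    int backlog() {
        return queue.size();
    }

    /** Asks the consumer to exit once the queue has been drained. */
    void stop() {
        running = false;
    }

    @Override
    public void run() {
        try {
            while (running || !queue.isEmpty()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue; // nothing arrived within the timeout
                }
                List<String> batch = new ArrayList<>();
                batch.add(first);
                queue.drainTo(batch, maxBatch - 1); // take whatever else is waiting
                batchWriter.accept(batch);          // one commit per batch
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In use, the consumer would run on its own thread, for example new Thread(persister).start(), while the realtime path calls submit() and moves on.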

MySQL with shards may be a good idea. However, it depends on the data volume, the transactions per second and the latency you need.
In memory databases are also a good idea. In fact MySQL provides memory-based tables as well.

Would a Tuple space / JavaSpace work? Also check out other enterprise data fabrics like Oracle Coherence and Gemstone.

MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.

Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?
You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plaintext I would still advise using SQL over a flat file solution (e.g. using a text column in your table schema) as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages, etc).
However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.

If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.
Another alternative is Tokyo Cabinet, a modern DBM implementation.

How bad would it be if you lose a couple of entries in case of a crash?
If it isn't that bad the following approach might work for you:
Create a flat file for each entry, with the file name equal to the id. Possibly use one file for a small number of consecutive entries.
Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.
Talk to a file system expert about how to make this really fast.
It is simple and it might be fast.
Of course you lose transactions including the ACID principles.
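
A minimal sketch of the one-file-per-entry idea, assuming every id is already a safe file name. The class name FlatFileStore is hypothetical, and as noted above there are no transactions here: a crash in the middle of a write can leave a partial file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** One file per record, named after the record id. No transactions, no ACID. */
class FlatFileStore {
    private final Path dir;

    FlatFileStore(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes (or overwrites) the file for this id. Assumes id is a safe file name. */
    void put(String id, String record) {
        try {
            Files.write(dir.resolve(id), record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the record, or null when no file exists for this id. */
    String get(String id) {
        Path file = dir.resolve(id);
        if (!Files.exists(file)) {
            return null;
        }
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Repeated reads of the same record would then be served by the operating system's file cache, which is where most of the speed of this approach would come from.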

Sub-millisecond r/w means you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL-based solutions, main-memory or not. In one ms, you cannot get more than about 100 KByte over a GBit network. Ask a telecom engineer; they are used to solving this kind of problem.

How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?
If you have serious reliability requirements then I think you may need to be prepared to pay some DB Overhead.
Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory copy, the other persists the data ready for the next startup.
Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.

Chronicle Map is a ConcurrentMap implementation which stores keys and values off-heap, in a memory-mapped file. So you have persistence on JVM restart.
ChronicleMap.get() is consistently faster than 1 us, sometimes as fast as 100 ns / operation. It's the fastest solution in the class.

Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String,String>, since it's Serializable.
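
A minimal sketch of that idea using standard Java serialization. The class name MapSnapshot is hypothetical. Note the limitation: anything written to the map after the last save is lost if the JVM crashes, so this only fits if occasional data loss is acceptable.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

/** Saves and restores a whole HashMap with standard Java serialization. */
class MapSnapshot {

    /** Writes the complete map to the file, replacing any earlier snapshot. */
    static void save(HashMap<String, String> map, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(map);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the map back, for example right after a JVM restart. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```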

Automatically Sharding a Java Map across multiple nodes

I have a problem where I need to assemble a Map whose eventual size is in the GBs (going on past 64GB) and I cannot assume that a user of the program will have this kind of monster machine hanging around. A nice solution would be to distribute this map across a number of machines to make a far more modest memory footprint per instance.
Does anyone know of a library/suite of tools which can perform this sharding? I do not care about replication or transactions; just spreading this memory requirement around.

Terracotta might be useful; have a look here:
http://www.terracotta.org/
It's a clustered JVM. How well it performs will depend on how often you update the map, I guess.

I suggest that you start with hazelcast:
http://www.hazelcast.com/
It is open-source, and in my opinion it is very easy to work with, so it is the best framework for rapid prototyping.
As far as I know, it performs faster than the commercial alternatives, so I wouldn't worry about performance either.
(I haven't formally benchmarked it myself)

Must it be open source? If not, Oracle Coherence can do it.

You may be able to solve your problem by using a database instead, something like http://hsqldb.org/ may provide the functionality you need with the ability to write the data to disk rather than keeping the whole thing in memory.
I would definitely take a step back and ask yourself if a map is the right data structure for GBs of data.

Gigaspaces Datagrid sounds like the kind of thing you are looking for. (Not free though)
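
For intuition about what such libraries do when they partition a map, here is a minimal single-JVM sketch. The class name PartitionedMap is hypothetical, and each inner HashMap stands in for one machine; real products add networking, rebalancing and failure handling on top of this routing step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Routes each key to one of N partitions; each partition stands in for one machine. */
class PartitionedMap<K, V> {
    private final List<Map<K, V>> partitions = new ArrayList<>();

    PartitionedMap(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        for (int i = 0; i < nodeCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** floorMod keeps the index non-negative even for negative hash codes. */
    int partitionFor(K key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    V put(K key, V value) {
        return partitions.get(partitionFor(key)).put(key, value);
    }

    V get(K key) {
        return partitions.get(partitionFor(key)).get(key);
    }

    /** Number of entries held by one partition, i.e. one node's share of the data. */
    int sizeOf(int partition) {
        return partitions.get(partition).size();
    }
}
```

With a reasonably uniform hash, each of N nodes holds about 1/N of the entries, which is the memory-spreading effect asked for in the question.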
