I'm new to NoSQL, and I'm scratching my head trying to figure out the most appropriate NoSQL implementation for the application I'm trying to build.
My Java application needs to have an in-memory hashmap containing millions to billions of entries as it models a single-layer neural network. Right now we're using Trove in order to be able to use primitives as keys and values to reduce the size of the map and increase the access speed. The map is a map of maps where the outer map's keys are longs and the inner maps have long/float key/values.
We need to be able to read the saved state from disk to the map of maps when the application starts up. The changes to the map of maps need also to be saved to disk either continuously or according to some scheduled interval.
I was at first drawn to OrientDB because of its document and object DBs, although I'm still not sure at this point which would be better. Then I came across Redis, which is a key-value store that works with an in-memory dataset that can be dumped to disk, and supports master-slave replication. However, it doesn't look like the values of the map can be anything other than Strings.
Am I looking in the right places for a solution to my needs? Right now, I like the in-memory and master-slave aspects of Redis, but I like the object/document capabilities of OrientDB, as my data structures are more complicated than simple Strings, and being able to use Trove with primitive key/value types is very advantageous. It would be better if reading were cheap and writing expensive rather than the other way around.
Thoughts?
Why not just serialize the Trove data structures directly to disk? There appears to be some sort of support for that judging by the documentation (http://trove4j.sourceforge.net/javadocs/serialized-form.html), but it's hard to tell because it's all auto-generated cruft instead of lovingly-made tutorials. Still, for your use case it's not obvious why you need a proper database, so perhaps KISS applies.
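For instance, a minimal sketch of plain Java serialization of the map of maps, assuming Trove 3.x package names and the TLongObjectHashMap/TLongFloatHashMap structure described above (Trove's hash maps implement Externalizable, so standard object streams should handle them):

import gnu.trove.map.hash.TLongFloatHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

import java.io.*;

public class TroveSnapshot {

    // Write the map of maps to disk. Trove's maps implement Externalizable,
    // so standard Java object streams can serialize them directly.
    static void save(TLongObjectHashMap<TLongFloatHashMap> network, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(network);
        }
    }

    // Read the saved state back at startup.
    @SuppressWarnings("unchecked")
    static TLongObjectHashMap<TLongFloatHashMap> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (TLongObjectHashMap<TLongFloatHashMap>) in.readObject();
        }
    }
}

Snapshotting on a schedule then just means calling save() from a background thread; whether that is cheap enough depends on how large the maps get.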
OrientDB has the most flexible engine, with indexes, graphs, transactions and complex documents as JSON. Why not?
Check out Java-Chronicle. It's a low latency persistence library. I think you may find it offers excellent performance for this type of data.
If you'd like to use Redis for this, you'd likely be best served by using either ZSETs or HASHes as the underlying structures (Redis supports structures, not just string values). Unless you need to fetch parts of your maps based on the values/sorted order of the values, HASHes would probably be best (in terms of memory and speed).
So you would probably want to use long -> {long: float, ...}, that is, longs mapping to long/float maps. You can then fetch individual entries in the map with HGET, multiple entries with HMGET, or the full map with HGETALL. You can see the command reference at http://redis.io/commands
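To make that concrete, here is a rough sketch using the Jedis client (the client and key naming are my assumptions; any Redis client exposes the same HSET/HGET/HGETALL commands):

import redis.clients.jedis.Jedis;

import java.util.Map;

public class RedisHashExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long outerKey = 42L;
            String hashKey = "nn:" + outerKey;   // one Redis HASH per outer map entry

            // HSET: inner long/float pairs are stored as string fields/values
            jedis.hset(hashKey, "1001", "0.25");
            jedis.hset(hashKey, "1002", "-0.75");

            // HGET: fetch a single weight
            float w = Float.parseFloat(jedis.hget(hashKey, "1001"));

            // HGETALL: fetch the whole inner map
            Map<String, String> all = jedis.hgetAll(hashKey);
            System.out.println(w + " " + all);
        }
    }
}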
On the space saving side of things, depending on the expected size of your HASHes, you may be able to tune them to use less space with limited/no negative effects on performance.
On the persistence side of things, you can either run Redis with snapshots or using incremental saving with append-only files. You can see the persistence documentation here: http://redis.io/topics/persistence
If you'd like to ask more pointed questions, you should head over to the mailing list https://groups.google.com/forum/?fromgroups=#!topic/redis-db/33ZYReULius
Redis supports more complex data structures than simple strings, such as lists, (sorted) sets, and hashes, which might come in handy for your domain model. On the other hand, your neural network could leverage the rich graph capabilities of OrientDB, depending on its structure.
I have seen videos and read the documentation of Cloud Firestore, from Google's Firebase service, but I can't figure this out coming from the Realtime Database.
I have this web app in mind in which I want to store my providers from different categories of products. I want to perform a search query across all my products to find which providers I have for a given product, and eventually access that provider's info.
I am planning to use this structure for this purpose:
Providers ( Collection )
    Provider 1 ( Document )
        Name
        City
        Categories
    Provider 2 ( Document )
        Name
        City
Products ( Collection )
    Product 1 ( Document )
        Name
        Description
        Category
        Provider ID
    Product 2 ( Document )
        Name
        Description
        Category
        Provider ID
So my question is, is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the Realtime Database: using the provider ID I could search for that provider in the providers section. But with Firestore I am not sure if it's possible, or if this is the right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the one that fits your needs and makes your job easier. Bear in mind that there is also no single "correct data structure" in the world of NoSQL databases: all data is modeled to allow the use-cases that your app requires, so what works for one app may be insufficient for another. There is no single correct solution for everyone. An effective structure for a NoSQL database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first is to keep a reference to the provider in the product object (as you already do); the second is to copy the entire provider object into the product document. This last technique is called denormalization and is quite a common practice when it comes to Firebase. We often duplicate data in NoSQL databases to suit queries that may not be possible otherwise. For a better understanding, I recommend you watch this video, Denormalization is normal with the Firebase Database. It's for the Firebase Realtime Database, but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing you need to keep in mind: in the same way you are adding data, you need to maintain it. In other words, if you want to update/delete a provider object, you need to do it in every place where it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to duplicate, or simply keep as references:
Is the data static or will it change over time?
If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync at the same time, then you might have a hard time in the future keeping all those duplicates up to date. This might also mean you spend a lot of money keeping all those documents fresh, as each change requires a read and a write for each duplicated document. In this case, holding only references will be the winning variant.
In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.
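To make the reference-based variant concrete, here is a rough sketch with the Firestore Java server client, using the field names from your outline (the document IDs are made up):

import com.google.api.core.ApiFuture;
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

public class ProviderLookup {
    public static void main(String[] args) throws Exception {
        Firestore db = FirestoreOptions.getDefaultInstance().getService();

        // First read: the product document, which only holds a reference (the provider ID).
        ApiFuture<DocumentSnapshot> productFuture =
                db.collection("Products").document("product1").get();
        DocumentSnapshot product = productFuture.get();
        String providerId = product.getString("Provider ID");

        // Second read: resolve the reference to get the provider details.
        DocumentSnapshot provider =
                db.collection("Providers").document(providerId).get().get();
        System.out.println(provider.getString("Name") + " - " + provider.getString("City"));
    }
}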
If you need your queries to be very fast, you may prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to depend on local client caches to make this cheaper, depending on the data the client has to read.
In this approach, you duplicate all of a provider's data into each product document. This means that the code to write this data is more complex, and you're definitely storing more data: one extra provider object for each product document. You'll also need to figure out if and how to keep it up to date in each document. But on the other hand, reading a product document now gives you all the information about the provider in one read.
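For contrast, a sketch of the duplication variant with the same Java client and field names, writing the provider's data straight into the product document (the values are illustrative):

import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

import java.util.HashMap;
import java.util.Map;

public class DenormalizedWrite {
    public static void main(String[] args) throws Exception {
        Firestore db = FirestoreOptions.getDefaultInstance().getService();

        // The provider's data is copied into the product document itself,
        // so a single read returns everything about the product and its provider.
        Map<String, Object> provider = new HashMap<>();
        provider.put("Name", "Acme Supplies");
        provider.put("City", "Madrid");

        Map<String, Object> product = new HashMap<>();
        product.put("Name", "Widget");
        product.put("Category", "Hardware");
        product.put("Provider", provider);       // duplicated data

        db.collection("Products").document("product1").set(product).get();
    }
}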
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches, and neither of them is inherently better than the other. It all depends on what your use-cases are and how comfortable you are with this technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or the Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately, in return you get more complex updates and higher storage/memory usage. Note, too, that extra calls are cheap in the Firebase Realtime Database but not in Firestore. How much duplicated data versus how many extra database calls is optimal for you depends on your needs and your willingness to let go of the "Single Point of Definition" mindset, which is admittedly subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also that some database structures are easier to protect with security rules than others, so try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.
I am wondering which approach is better: should we store fine-grained entities on the grid and later construct functionally rich domain objects out of those fine-grained entities?
Or, alternatively, should we construct coarse-grained domain objects and store them directly on the grid, using the entities only for persistence?
Edit: I think this question is not yet answered completely. So far we have comments from Hazelcast, GemFire and Ignite. We are missing Infinispan, Coherence ... That is for completion's sake :)
I agree with Valentin, it mainly depends on the system you want to use. Normally I would consider storing enhanced domain objects directly; however, if you have only very few objects and their size is massive, you end up with bad distribution and unequal memory usage across the nodes. If your domain objects are "normally" sized and you have plenty of them, you shouldn't worry.
In Hazelcast it is better to store those objects directly, but be aware of using a good serialization system, as Java serialization is slow. If you want to query on properties inside your domain objects, you should also consider adding indexes.
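As a rough illustration (Hazelcast 3.x package names and addIndex signature assumed; newer versions changed both), storing a domain object directly and indexing a property inside it might look like this:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.io.Serializable;

public class GridExample {
    // Domain object stored directly on the grid. In production you would plug in
    // a faster serialization mechanism than plain java.io.Serializable.
    static class Customer implements Serializable {
        long id;
        String city;
        Customer(long id, String city) { this.id = id; this.city = city; }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, Customer> customers = hz.getMap("customers");

        // Index a property inside the domain object to speed up predicate queries.
        customers.addIndex("city", true);

        customers.put(1L, new Customer(1L, "Berlin"));
    }
}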
I believe it can differ from one data grid to another. I'm more familiar with Apache Ignite, and in this case the fine-grained approach works much better, because it's more flexible and in many cases gives better data distribution and therefore better scalability. Ignite also provides rich SQL capabilities [1] that allow you to join different entities and execute indexed searches. This way you will not lose performance with a fine-grained model.
[1] https://apacheignite.readme.io/docs/sql-queries
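For a flavour of the fine-grained style, a sketch along these lines (the entity, annotations and SQL are illustrative; indexing via Ignite's @QuerySqlField annotation is assumed):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

import java.util.List;

public class IgniteFineGrained {
    // A fine-grained entity: one address per cache entry instead of a fat Customer object.
    static class Address {
        @QuerySqlField(index = true) long customerId;
        @QuerySqlField String city;
        Address(long customerId, String city) { this.customerId = customerId; this.city = city; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Address> cfg =
                    new CacheConfiguration<Long, Address>("addresses")
                            .setIndexedTypes(Long.class, Address.class);
            IgniteCache<Long, Address> addresses = ignite.getOrCreateCache(cfg);

            addresses.put(1L, new Address(100L, "Berlin"));

            // Indexed SQL search over the fine-grained entities.
            List<List<?>> rows = addresses.query(
                    new SqlFieldsQuery("select city from Address where customerId = ?")
                            .setArgs(100L)).getAll();
            System.out.println(rows);
        }
    }
}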
One advantage of a coarse-grained object is data consistency. Everything in that object gets saved atomically. But if you split that object up into 4 small objects, you run the risk that 3 objects save and 1 fails (for whatever reason).
We use GemFire, and tend to favor coarse-grained objects...up to a point. For example our Customer object contains a list of Addresses. An alternative design would be to create one GemFire region for "Customer" and a separate GemFire region for "CustomerAddresses" and then hope you can keep those regions in sync.
The downside is that every time someone updates an Address, we re-write the entire Customer object. That's not very efficient, but our traffic patterns show that address changes are very rare (compared to all the other activity), so this works out fine.
One experience we've had, though, is the downside of using Java serialization for long-term data storage. We avoid it now because of all the problems caused by object compatibility as objects change over time. Not to mention it becomes a headache for .NET clients to read the objects. :)
I'm developing a service that monitors computers. Computers can be added to or removed from monitoring by a web GUI. I keep reported data basically in various maps like Map<Computer, Temperature>. Now that the collected data grows and the data structures become more sophisticated (including computers referencing each other) I need a concept for what happens when removing computers from monitoring. Basically I need to delete all data reported by the removed computer. The most KISS-like approach would be removing the data manually from memory, like
public void onRemove(Computer computer) {
    temperatures.remove(computer);
    // ...
}
This method has to be changed whenever I add features :-( I know Java has a WeakHashMap, so I could store reported data like so:
Map<Computer, Temperature> temperatures = new WeakHashMap<>();
I could call System.gc() whenever a computer is removed from monitoring in order have all associated data eagerly removed from these maps.
While the first approach seems a bit like primitive MyISAM tables, the second one resembles DELETE cascades in InnoDB tables. But still it feels a bit uncomfortable and is probably the wrong approach. Could you point out advantages or disadvantages of WeakHashMaps or propose other solutions to this problem?
Not sure if it is possible in your case, but couldn't your Computer class hold all the attributes, while you keep a list of monitoredComputers (or a wrapper class called MonitoredComputers, where you can wrap any logic needed, like getTemperatures())? That way computers can be removed from that list, and you don't have to look through all the attribute maps. If a computer is referenced from another computer, then you have to loop through that list and remove the references from those that hold it.
I'm not sure using a WeakHashMap is a good idea. As you say you may reference Computer objects from several places, so you'll need to make sure all references except one go through weak references, and to remove the hard reference when the Computer is deleted. As you have no control over when weak references are deleted, you may not get consistent results.
If you don't want to have to maintain manually the removal, you could have a flag on Computer objects, like isAlive(). Then you store Computers in special subclasses of Maps and Collections that at read time check if the Computer is alive and if not silently remove it. For example, on a Map<Computer, ?>, the get method would check if the computer is alive, and if not will remove it and return null.
Or the subclasses of Maps and Collections could just register themselves to a single computerRemoved() event, and automatically know how to remove the deleted computers, and you wouldn't have to manually code the removal. Just make sure you keep references to Computer only inside your special maps and collections.
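A minimal sketch of the first variant (the read-time check), assuming a hypothetical Computer.isAlive() flag:

import java.util.HashMap;

// Hypothetical domain type carrying the isAlive() flag suggested above.
interface Computer {
    boolean isAlive();
}

// A map that silently drops entries whose Computer key is no longer alive,
// so readers never see data belonging to removed computers.
class LiveComputerMap<V> extends HashMap<Computer, V> {
    @Override
    public V get(Object key) {
        if (key instanceof Computer && !((Computer) key).isAlive()) {
            remove(key);   // lazy eviction at read time
            return null;
        }
        return super.get(key);
    }
}

A real version would override the other read methods (containsKey, entrySet, ...) the same way, or sweep dead keys periodically.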
Why not use an actual SQL database? You could use an embedded database engine such as H2, Apache Derby / Java DB, HSQLDB, or SQLite. Using an embedded database engine has the added benefits:
You could inspect the live contents of the monitoring data at any time using the corresponding DB engine's command line client.
You could build a new tool to access and manipulate the data by connecting to a shared database instance.
The schema itself is a form of documentation as to the structure of the monitoring data and the relationships between entities.
You could store different types of data for different types of computers by way of schema normalization.
You can back up the monitoring data.
If you need to restart the monitoring server, you won't lose all of the monitoring data.
Your Web UI could use a JPA implementation such as Hibernate to access the monitoring data and add new records. Or, for a more lightweight solution, you might consider using Spring Framework's JdbcTemplate and SimpleJdbcInsert classes. There is also OrmLite, ActiveJDBC, and jOOQ which each aim to offer simpler access to databases than JDBC.
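For instance, a rough sketch with an embedded H2 database and Spring's JdbcTemplate (the table and column names are made up):

import org.h2.jdbcx.JdbcDataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class MonitoringDao {
    public static void main(String[] args) {
        // Embedded H2 database persisted to a local file.
        JdbcDataSource ds = new JdbcDataSource();
        ds.setURL("jdbc:h2:./monitoring");

        JdbcTemplate jdbc = new JdbcTemplate(ds);
        jdbc.execute("CREATE TABLE IF NOT EXISTS temperature (" +
                "computer_id BIGINT, reported_at TIMESTAMP, value_celsius DOUBLE)");

        // Record a reading.
        jdbc.update("INSERT INTO temperature VALUES (?, CURRENT_TIMESTAMP, ?)", 1L, 47.5);

        // Removing a computer becomes one DELETE per table
        // (or a single cascading delete with foreign keys).
        jdbc.update("DELETE FROM temperature WHERE computer_id = ?", 1L);
    }
}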
The problem with WeakHashMap is that managing the references to Computer objects seems difficult and easily breakable.
Hash table based implementation of the Map interface, with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use. More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded its entry is effectively removed from the map, so this class behaves somewhat differently from other Map implementations.
It could be the case that a reference to a Computer object still exists somewhere, so the object would never be removed from the WeakHashMaps. I would prefer a more deterministic approach.
But if you decide to go down this route, you can mitigate the problem I point out by wrapping all these Computer object keys in a class with strict controls: this wrapper object would create and store the keys and take care never to let references to those keys leak out.
Novice coder here, so maybe this is too clunky:
Why not keep the monitored computers in a HashMap, and move removed computers to a WeakHashMap? That way all removed computers are separate and easy to work with, with the GC cleaning up the oldest entries.
My algorithm will likely not be used on the web. The object I describe may be used by multiple threads, however.
The original object I had designed emulated pointers.
Reduced, a symbol would map to multiple pointers, and each unique pointer would map to a single symbol.
When I was finally finished and had a working algorithm, it turns out I actually needed six maps in total (these maps are called tens of thousands of times).
Initial testing with a very very small sample set of symbols showed the program to be working very efficiently. However, I'm afraid that once I increase the number of symbols by a few thousand-fold it will become sluggish.
Once the program completes and closes, the pointers do not need to persist.
I was wondering if I should re-implement my algorithm using a database as a backend. Would this be better than using all of these maps?
The maps are stored in memory. The database would be stored on a hard drive (I have an SSD, so I'm afraid there will be a large difference in performance on my machine vs. a machine using SATA/PATA). The maps should also be O(1). The maps might also become very ugly once multithreading is introduced, unless I use thread-safe maps, which would slow the program down. A database would handle these tasks efficiently.
I've formally written out the proper relations, and I'm sure I can implement it in a database if that was the best option. Which is the better option?
If you don't need to persist that data structure, don't try to put it in a database. In your place, I would run some load tests with a realistic amount of data against the data structure you already have, and refine it from there if the performance is not what I expected.
Anyway, the current trend is to use relational databases on disk for persistence and to cache frequently queried data in "big hashtables" in memory for performance, so I doubt falling back to a database would improve your performance.
If your data structures fit in memory, I would be shocked if using a database would be faster (not even considering the complexity of using a database implementation). By throwing away all the assumptions, features, safety and consistency that a database must maintain, you will gain performance. Even the best DB implementation, assuming enough memory to cache everything, pretty much has a ConcurrentHashMap as an upper bound on performance. As a practical matter, you won't get CHM performance even with great caching, because a DB API will require defensive copies or cache invalidations that you can avoid with your in-memory structure.
Apart from the likely performance boost simply from using an in-memory hashmap, you may also get additional performance by tuning your structure based on your specific use case. For example, perhaps the initial lookup is multi-threaded, but individual values are only accessed by a single thread. In that case, you can avoid locking those values.
Hard drives, even fast ones, are several orders of magnitude slower than memory. So if your goal is performance, you should stay in memory and use maps. For thread safety you can use a ConcurrentHashMap, which uses a largely lock-free algorithm, so the synchronisation penalty in a multi-threaded environment should be minimal.
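For example, a minimal sketch of a thread-safe symbol/pointer mapping with ConcurrentHashMap (the names are illustrative, not your actual six maps):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SymbolTable {
    // Each symbol maps to the set of pointers that reference it...
    private final ConcurrentHashMap<String, Set<Long>> symbolToPointers = new ConcurrentHashMap<>();
    // ...and each pointer maps back to exactly one symbol.
    private final ConcurrentHashMap<Long, String> pointerToSymbol = new ConcurrentHashMap<>();

    public void add(String symbol, long pointer) {
        // newKeySet() returns a concurrent Set; computeIfAbsent is atomic per key.
        symbolToPointers.computeIfAbsent(symbol, s -> ConcurrentHashMap.newKeySet()).add(pointer);
        pointerToSymbol.put(pointer, symbol);
    }

    public String symbolOf(long pointer) {
        return pointerToSymbol.get(pointer);   // O(1), no explicit locking needed
    }
}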
You should also check whether a single thread provides enough performance on its own - multiple threads always introduce some overhead, and they need to bring enough gains to offset it.
You may also want to check in-memory DBs such as HyperSQL or H2 Database.
I have, say, a list of 1000 beans which I need to share among different projects. I use memcache for this purpose. Currently, a loop is run over the complete list and each bean is stored in memcache with its own unique memcache id. I was wondering: instead of putting each and every bean in memcache independently, could I put all the beans in a HashMap keyed by the same ids used for storing the beans in memcache, and then put this HashMap in memcache?
Will this give me any significant improvement over putting each and every bean individually in memcached? Or will this cause me trouble because of the large size of the object?
Any help is appreciated.
It won't get you any particular benefit -- it'll actually probably be slower on the load: serialization is serialization, and adding a HashMap wrapper around it just increases the amount of data that needs to be deserialized and populated. For retrievals, assuming that most lookups are discrete by the key you want to use for your HashMap, you'll have a much, much slower retrieval time, because you'll be pulling down the whole graph just to get at one of its discrete members.
Of course, if the data is entirely static and you're only using memcached to populate values in various JVMs, you can do it that way and just hold onto the HashMap in a static field... but then you're multiplying your memory consumption by the number of nodes in the cluster...
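To make the trade-off concrete, here is a rough sketch with the spymemcached client (an assumed choice; key names and expiry are made up), contrasting per-bean entries with one big map entry:

import net.spy.memcached.MemcachedClient;

import java.io.Serializable;
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

public class BeanCacheExample {
    static class Bean implements Serializable {
        long id;
        String name;
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        Map<String, Bean> beans = new HashMap<>();   // the ~1000 beans, keyed by their cache ids
        Bean b = new Bean();
        b.id = 42;
        b.name = "example";
        beans.put("bean:42", b);

        // Approach 1: one cache entry per bean -- each lookup deserializes only that bean.
        for (Map.Entry<String, Bean> e : beans.entrySet()) {
            client.set(e.getKey(), 3600, e.getValue());
        }
        Bean one = (Bean) client.get("bean:42");

        // Approach 2: the whole map as a single entry -- every lookup pulls down and
        // deserializes the entire graph just to read one member.
        client.set("all-beans", 3600, beans);
        @SuppressWarnings("unchecked")
        Map<String, Bean> all = (Map<String, Bean>) client.get("all-beans");

        System.out.println(one.name + " / " + all.size());
        client.shutdown();
    }
}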
I did some optimization work in spymemcached that helps it do the right thing when doing the wire encoding.
This may, or may not help you with your application. In general, just measure when you have performance questions about your app.