How to lock a collection with the MongoDB Java client

I need to update a whole collection in a background thread, while read operations might take place at the same time. Updating the collection takes about 3 seconds when I benchmark it. Is there any way to lock the collection while it is being updated? I tried creating a new collection, inserting all the documents into it, and renaming it to the original collection with "dropTarget=true", but I am not sure how safe and stable that is in terms of sharding. I read that renameCollection is incompatible with sharding.
It would be great if someone could suggest a good approach.
Thanks.

You presented two possible strategies for updating your collection: one updating it in place under a lock, and the other going through a temporary collection.
As the MongoDB documentation clearly states, renameCollection will not work for sharded collections (http://docs.mongodb.org/manual/reference/command/renameCollection/). From my understanding this means the collection you want to rename isn't sharded. Since you need to drop the target collection before the actual rename, you will most likely lose any previously configured sharding information, so you would need to re-enable sharding afterwards. I strongly discourage the two-collection approach, especially if you're sharding your data.
You would need to pull all the data from your sharded collection and store it centrally; once you're done updating, you rename the collection and shard it again. This causes a lot of I/O for your whole system, especially for the client doing the update.
Depending on your system architecture (if it has a single point of entry), you could simply keep a global flag that tells you whether the collection update is currently running, and reject other write operations while it is set.
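A minimal sketch of that flag idea in Java, assuming a single application instance; the class and method names here are made up for illustration:

import java.util.concurrent.atomic.AtomicBoolean;

public class CollectionMaintenanceGuard {

    // Set while the background rebuild of the collection is running.
    private final AtomicBoolean updateRunning = new AtomicBoolean(false);

    public void runBulkUpdate(Runnable update) {
        if (!updateRunning.compareAndSet(false, true)) {
            throw new IllegalStateException("Update already in progress");
        }
        try {
            update.run();              // the ~3 second bulk update of the collection
        } finally {
            updateRunning.set(false);
        }
    }

    public void write(Runnable writeOp) {
        if (updateRunning.get()) {
            throw new IllegalStateException("Collection is being rebuilt, try again later");
        }
        writeOp.run();                 // reads can bypass this check entirely
    }
}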
For multiple entry points into your MongoDB you might try $isolated, but this doesn't work with sharded collections either. I'm also not sure whether it still allows read operations; the documentation isn't very clear on that.
Is it strictly disallowed to write any data while the update is in progress? What type of updates do you perform, and can they influence each other? Or would concurrent writes be possible?

Related

Atomic read and delete in mongo

I am fairly new to mongo, so what I'm trying to achieve here might not be possible. My research so far is inconclusive...
My scenario is the following: I have an application which may have multiple instances running. These instances are processing some data, and when that processing fails, they write the ID of the failed item in a mongo collection ("error").
From time to time I want to retry processing those items. So, at fixed intervals, the application reads all the IDs from the collection, after which it deletes all the records. Now, this is an obvious race condition. Two instances may read the very same data, which would double the work to be done. Some IDs may also be missed like this.
My question would be the following: is there any way I can read and delete those records in a distributed, atomic way? I was thinking about locking the collection, but I found no support for that in the Java driver's documentation. I also tried to look for a findAndDrop()-like method, but no luck so far.
I am aware of techniques like leader election, which most probably would solve this problem, but I wanted to see if it can be done in an easier way.
You can use a BlockingQueue with a multiple-producer, single-consumer approach: your multiple producers put the IDs on the queue, and a single consumer takes them off and deletes them.
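Within a single JVM, a rough sketch of that pattern might look like this (note it does not by itself cover the multi-instance case from the question; the names are made up for illustration):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FailedIdQueue {

    private final BlockingQueue<String> failedIds = new LinkedBlockingQueue<>();

    // Called by any number of producer threads when processing an item fails.
    public void reportFailure(String id) {
        failedIds.offer(id);
    }

    // Run by a single consumer thread: each id is taken exactly once and retried.
    public void consumeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String id = failedIds.take();   // blocks until an id is available
            retry(id);
        }
    }

    private void retry(String id) {
        // hypothetical retry logic for the failed item
    }
}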
In the end, I found no way to implement this with mongo.
However, since this is a Heroku app, I stored the IDs in Redis instead. This library I found implements a distributed Redis lock for Jedis, so that workaround solved my problem.
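The core idea of such a distributed lock can be sketched with plain Jedis (this is only a sketch assuming a Jedis 3.x-style API; the key names and TTL are arbitrary, and a production lock, like the library above, should release the lock atomically):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class SimpleRedisLock {

    private final Jedis jedis;

    public SimpleRedisLock(Jedis jedis) {
        this.jedis = jedis;
    }

    // SET key token NX EX ttl succeeds only if the key does not exist yet.
    public boolean tryLock(String lockKey, String token, int ttlSeconds) {
        String result = jedis.set(lockKey, token, SetParams.setParams().nx().ex(ttlSeconds));
        return "OK".equals(result);
    }

    // Naive release: a real implementation should compare the token and delete
    // in one atomic step (e.g. a Lua script) so another holder's lock is never removed.
    public void unlock(String lockKey, String token) {
        if (token.equals(jedis.get(lockKey))) {
            jedis.del(lockKey);
        }
    }
}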

Simulating DELETE cascades with WeakHashMaps

I'm developing a service that monitors computers. Computers can be added to or removed from monitoring through a web GUI. I keep the reported data basically in various maps like Map<Computer, Temperature>. Now that the collected data grows and the data structures become more sophisticated (including computers referencing each other), I need a concept for what happens when a computer is removed from monitoring. Basically I need to delete all data reported by the removed computer. The most KISS-like approach would be removing the data manually from memory, like
public void onRemove(Computer computer) {
    temperatures.remove(computer);
    // ...
}
This method has to be changed whenever I add features :-( I know Java has a WeakHashMap, so I could store the reported data like so:
Map<Computer, Temperature> temperatures = new WeakHashMap<>();
I could call System.gc() whenever a computer is removed from monitoring in order to have all associated data eagerly removed from these maps.
While the first approach seems a bit like primitive MyISAM tables, the second one resembles DELETE cascades in InnoDB tables. But still it feels a bit uncomfortable and is probably the wrong approach. Could you point out advantages or disadvantages of WeakHashMaps or propose other solutions to this problem?
Not sure if it is possible in your case, but couldn't your Computer class hold all the attributes, and you keep a list of monitoredComputers (or a wrapper class called MonitoredComputers that wraps any logic needed, like getTemperatures())? Computers can then be removed from that list, and you don't have to go through all the attribute maps. If a computer is referenced from another computer, you still have to loop through that list and remove the references from those that hold it.
I'm not sure using a WeakHashMap is a good idea. As you say you may reference Computer objects from several places, so you'll need to make sure all references except one go through weak references, and to remove the hard reference when the Computer is deleted. As you have no control over when weak references are deleted, you may not get consistent results.
If you don't want to maintain the removal manually, you could have a flag on Computer objects, like isAlive(). Then you store Computers in special subclasses of Map and Collection that check at read time whether the Computer is alive and, if not, silently remove it. For example, on a Map<Computer, ?>, the get method would check whether the computer is alive, and if not it would remove the entry and return null.
Or the subclasses of Maps and Collections could just register themselves to a single computerRemoved() event, and automatically know how to remove the deleted computers, and you wouldn't have to manually code the removal. Just make sure you keep references to Computer only inside your special maps and collections.
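A rough sketch of such a self-cleaning map, assuming Computer exposes the isAlive() flag suggested above:

import java.util.HashMap;
import java.util.Map;

// Wraps a Map<Computer, V> and drops entries for dead computers at read time.
public class LiveComputerMap<V> {

    private final Map<Computer, V> delegate = new HashMap<>();

    public synchronized void put(Computer computer, V value) {
        delegate.put(computer, value);
    }

    public synchronized V get(Computer computer) {
        if (!computer.isAlive()) {
            delegate.remove(computer);   // lazy "cascade delete"
            return null;
        }
        return delegate.get(computer);
    }

    // Optional eager sweep, e.g. triggered by a computerRemoved() event.
    public synchronized void removeDead() {
        delegate.keySet().removeIf(c -> !c.isAlive());
    }
}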
Why not use an actual SQL database? You could use an embedded database engine such as H2, Apache Derby / Java DB, HSQLDB, or SQLite. Using an embedded database engine has these added benefits:
You could inspect the live contents of the monitoring data at any time using the corresponding DB engine's command line client.
You could build a new tool to access and manipulate the data by connecting to a shared database instance.
The schema itself is a form of documentation as to the structure of the monitoring data and the relationships between entities.
You could store different types of data for different types of computers by way of schema normalization.
You can back up the monitoring data.
If you need to restart the monitoring server, you won't lose all of the monitoring data.
Your Web UI could use a JPA implementation such as Hibernate to access the monitoring data and add new records. Or, for a more lightweight solution, you might consider using Spring Framework's JdbcTemplate and SimpleJdbcInsert classes. There are also OrmLite, ActiveJDBC, and jOOQ, each of which aims to offer simpler access to databases than raw JDBC.
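As a rough illustration of the embedded route, here is a minimal sketch using H2 over plain JDBC (the table and column names are made up; JdbcTemplate would shrink the boilerplate further):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class TemperatureStore {

    public static void main(String[] args) throws Exception {
        // File-based embedded H2 database; the data survives a restart of the monitoring server.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./monitoring", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS temperature ("
                        + "computer_id VARCHAR(64), "
                        + "reported_at TIMESTAMP, "
                        + "celsius DOUBLE)");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO temperature (computer_id, reported_at, celsius) "
                    + "VALUES (?, CURRENT_TIMESTAMP, ?)")) {
                ps.setString(1, "computer-42");
                ps.setDouble(2, 57.5);
                ps.executeUpdate();
            }
            // A DELETE ... WHERE computer_id = ? then gives you the "cascade" when a computer is removed.
        }
    }
}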
The problem with WeakHashMap is that managing the references to Computer objects seems difficult and easily breakable.
Hash table based implementation of the Map interface, with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use. More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded its entry is effectively removed from the map, so this class behaves somewhat differently from other Map implementations.
It could be the case that a reference to a Computer object still exists somewhere, in which case the object will never be dropped from the WeakHashMaps. I would prefer a more deterministic approach.
But if you decide to go down this route, you can mitigate the problem I pointed out by wrapping all these Computer keys in a class with strict controls: the wrapper object creates and stores the keys and takes care never to let references to those keys leak out.
Novice coder here, so maybe this is too clunky:
Why not keep the monitored computers in a HashMap, and move removed computers to a WeakHashMap? That way all removed computers are separate and easy to work with, with the GC cleaning up the oldest entries.

Java Map vs Backend database. Which is better for speed and for multithreading for relations?

My algorithm will likely not be used on the web. The object I describe may be used by multiple threads, however.
The original object I had designed emulated pointers.
Reduced, a symbol would map to multiple pointers, and each unique pointer would map to a single symbol.
When I was finally finished and had a working algorithm, it turns out I actually needed six maps in total (these maps are called tens of thousands of times).
Initial testing with a very very small sample set of symbols showed the program to be working very efficiently. However, I'm afraid that once I increase the number of symbols by a few thousand-fold it will become sluggish.
Once the program completes and closes, the pointers do not need to persist.
I was wondering whether I should re-implement my algorithm using a database as a backend. Would this be better than using all of these maps?
The maps are stored in memory, while the database would live on a hard drive (I have an SSD, so I'm afraid there would be a large performance difference between my machine and a machine using SATA/PATA). The map lookups should also be O(1). The maps might become very ugly once multithreading is introduced, unless I use thread-safe maps, which would slow the program down. A database would handle these concerns efficiently.
I've formally written out the proper relations, and I'm sure I can implement it in a database if that was the best option. Which is the better option?
If you don't need to persist that data structure, don't try to back it with a database. In your place, I would run some load tests with a realistic amount of data against the data structure you already have, and refine it from there if the performance is not what I expected.
Anyway, the current trend is to use relational databases on disk for persistence and to cache frequently queried data in "big hashtables" in memory for performance, so I doubt falling back to a database would improve your performance.
If your data structures fit in memory, I would be shocked if using a database would be faster (not even considering the complexity of using a database implementation). By throwing away all the assumptions, features, safety and consistency that a database must maintain, you will gain performance. Even the best DB implementation, assuming enough memory to cache everything, pretty much has a ConcurrentHashMap as an upper bound on performance. As a practical matter, you won't get CHM performance even with great caching, because a DB API will require defensive copies or cache invalidations that you can avoid with your in-memory structure.
Apart from the likely performance boost simply from using an in-memory hashmap, you may also get additional performance by tuning your structure based on your specific use case. For example, perhaps the initial lookup is multi-threaded, but individual values are only accessed by a single thread. In that case, you can avoid locking those values.
Hard drives, even fast ones, are several orders of magnitude slower than memory. So if your goal is performance, you should stay in memory and use maps. For thread safety you can just use a ConcurrentHashMap, which relies on fine-grained, largely lock-free techniques, so the synchronisation penalty in a multithreaded environment should be minimal.
You should also check whether a single thread provides enough performance - multiple threads always introduce some overhead, and they need to bring enough gains to offset it.
You may also want to check in-memory DBs such as HyperSQL or H2 Database.
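For the in-memory route, a minimal sketch of one of those symbol maps as a thread-safe structure (the names and types are illustrative only):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SymbolTable {

    // symbol -> set of pointers; both levels are concurrent structures.
    private final Map<String, Set<Long>> symbolToPointers = new ConcurrentHashMap<>();
    // pointer -> its single symbol.
    private final Map<Long, String> pointerToSymbol = new ConcurrentHashMap<>();

    public void link(String symbol, long pointer) {
        symbolToPointers
                .computeIfAbsent(symbol, s -> ConcurrentHashMap.newKeySet())
                .add(pointer);
        pointerToSymbol.put(pointer, symbol);
    }

    public Set<Long> pointersOf(String symbol) {
        return symbolToPointers.getOrDefault(symbol, Set.of());
    }

    public String symbolOf(long pointer) {
        return pointerToSymbol.get(pointer);
    }
}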

What is the benefit of holding table rows in a Collection?

I've seen some Java code in which the rows of a database table are held in a collection (usually an ArrayList or a HashMap).
What is the benefit of this approach?
How do you keep the collection and table synced?
Why not sending a query to database for each retrieval?
Is it a good practice at all?
The benefit is performance. Querying a database is resource and time intensive. If your tables are small enough that you can hold the items in memory, simple reference to local memory is orders of magnitude faster.
As far as keeping them in sync, that's a more difficult answer and would depend on the use case. In most cases, unless you've set up some good custom architecture, there will be no way to guarantee that the database and the in-memory collection are synchronized once you've retrieved it into memory.
If you wanted to take this approach and have the collection and the database be in sync (kind of like having your cake and eating it, too), you could do something like the following:
Set up database triggers on your table for any Create, Insert, Update, or Delete.
Have the triggers run a script which notifies your application somehow (monitoring thread, service, whatever).
Have the application, once notified, update the collection by reading the database.
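Step 3 might look roughly like the sketch below; how the notification arrives and how the rows are re-queried are placeholders for whatever your trigger script and data access layer actually do:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

public class CachedTable<R> {

    private final List<R> cache = new CopyOnWriteArrayList<>();
    private final Supplier<List<R>> loader;   // re-queries the table, e.g. via JDBC

    public CachedTable(Supplier<List<R>> loader) {
        this.loader = loader;
    }

    // Invoked by the monitoring thread/service once a trigger signals a change.
    public void onTableChanged() {
        List<R> fresh = loader.get();
        cache.clear();
        cache.addAll(fresh);
    }

    // Readers work against the in-memory copy.
    public List<R> rows() {
        return cache;
    }
}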
Of course, whether this would even give you a performance improvement would depend on how often your database is getting modified by other programs.
You could also have your program hold a lock on the table so that no one else can modify it for the duration of your processing (allowing you to keep the items in memory and guarantee that the table is unchanged), but this is extremely bad practice, because you would essentially break any other application using that table at the same time (for anything other than reading).
If you are constantly reading the same data and it will never change, it makes sense to keep that data in a Java collection, like a List or a Set. It is all about performance: making database calls and carrying out database transactions through Java does take time (my thesis at the University of London was all about this issue). If you hold the data in a Java collection, you do not have to keep communicating with the database, which avoids the 'impedance mismatch' between two separate paradigms: the Java object model on one side and the relational model on the other.
As for keeping them in sync, that is a whole different beast altogether.

hashmap cache in servlet

I am trying to implement a servlet for GPS monitoring and want to create a simple cache, because I think it will be faster than issuing an SQL query for every HTTP request. The simple scheme:
In the init() method, I read one point for each vehicle into a HashMap (vehicle id = key, location as JSON = value). After that, some requests read these points and some requests update them (one vehicle updates one item). Of course I want to minimize synchronization, so I read the javadoc:
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.)
If I am right, no synchronization is needed in my case, because I only do "not a structural modification == changing the value associated with a key that an instance already contains". Is that a correct statement?
Use a ConcurrentHashMap; it avoids coarse locking in favour of atomic operations and fine-grained internal locking.
Wrong. Adding an item to the hash map is a structural modification (and to implement a cache you must add items at some point).
Use java.util.concurrent.ConcurrentHashMap.
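A minimal sketch of the cache built on ConcurrentHashMap, assuming vehicle positions are keyed by vehicle id and stored as JSON strings as in the question:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PositionCache {

    private final ConcurrentMap<Long, String> lastPositionJson = new ConcurrentHashMap<>();

    // Safe to call from any request thread, even when the key is new
    // (adding a key is exactly the structural modification a plain HashMap cannot handle concurrently).
    public void update(long vehicleId, String json) {
        lastPositionJson.put(vehicleId, json);
    }

    // Readers always see a fully written value, never a partially updated map.
    public String read(long vehicleId) {
        return lastPositionJson.get(vehicleId);
    }
}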
If all the entries are read into the HashMap in init() and afterwards are only read or have their values modified, then yes, in theory the other threads do not need to synchronize. In practice, problems can still arise from threads caching values (visibility), so a ConcurrentHashMap would be better.
Rather than implementing the cache yourself, you could use a simple implementation from the Guava library.
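For example, a small Guava cache might look like this (assuming the Guava dependency is on the classpath; the size limit and expiry below are arbitrary):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.concurrent.TimeUnit;

public class GuavaPositionCache {

    private final Cache<Long, String> positions = CacheBuilder.newBuilder()
            .maximumSize(10_000)                      // bound the memory used
            .expireAfterWrite(5, TimeUnit.MINUTES)    // drop stale positions
            .build();

    public void update(long vehicleId, String json) {
        positions.put(vehicleId, json);
    }

    public String read(long vehicleId) {
        return positions.getIfPresent(vehicleId);
    }
}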
Caching is not an easy problem - but it is a known one. Before starting, I would carefully measure whether you really do have a performance problem, and whether caching actually solves it. You may think it should, and you may be right. You may also be horrendously wrong depending on the situation ("Premature optimization is the root of all evil"), so measure.
That said, do not implement a cache yourself; use a library that does it for you. I personally have had good experience with Ehcache.
If I understand correctly, you have two types of request:
Read from cache
Write to cache (to update the value)
In this case, you may potentially try to write to the same map twice at the same time, which is what the docs are referring to.
If all requests go through the same piece of code (e.g. an update method which can only be called from one thread) you will not need synchronisation.
If your system is multi-threaded and you have more than one thread or piece of code that writes to the map, you will need to externally synchronise your map or use a ConcurrentHashMap.
For clarity, the reason you need synchronisation is that if you have two threads both trying to update the JSON value for the same key, which one wins? This is either left up to chance or causes exceptions or, worse, buggy behaviour.
Any time you modify the same element from two threads, you need to synchronise on that code or, better still, use a thread-safe version of the data structure if that is applicable.
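For the "who wins" case on a single key, ConcurrentHashMap also offers atomic per-key updates; a small sketch, where the timestamp comparison and the JSON parsing helper are only illustrative:

import java.util.concurrent.ConcurrentHashMap;

public class LastFixWins {

    private final ConcurrentHashMap<Long, String> positions = new ConcurrentHashMap<>();

    // Replaces the stored JSON only if the new fix is at least as recent as the current one.
    // The remapping function runs atomically for that key.
    public void update(long vehicleId, String json, long timestampMillis) {
        positions.compute(vehicleId, (id, current) -> {
            if (current == null || extractTimestamp(current) <= timestampMillis) {
                return json;
            }
            return current;   // keep the newer value already in the map
        });
    }

    private long extractTimestamp(String json) {
        // Hypothetical helper: parse the timestamp field out of the stored JSON.
        return 0L;
    }
}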
