What is the benefit of holding table rows in Collection? - java

I've seen some Java code in which the rows of a database table are held in a collection (usually an ArrayList or a HashMap).
What is the benefit of this approach?
How do you keep the collection and table synced?
Why not send a query to the database for each retrieval?
Is it a good practice at all?

The benefit is performance. Querying a database is resource- and time-intensive. If your tables are small enough that you can hold the items in memory, a simple lookup in local memory is orders of magnitude faster.
As far as keeping them in sync goes, that's a harder question and depends on the use case. In most cases, unless you've set up some custom architecture for it, there is no way to guarantee that the database and the in-memory collection stay synchronized once you've pulled the data into memory.
If you wanted to take this approach and have the collection and the database be in sync (kind of like having your cake and eating it, too), you could do something like the following:
Set up database triggers on your table for any insert, update, or delete.
Have the triggers run a script which notifies your application somehow (monitoring thread, service, whatever).
Have the application, once notified, update the collection by reading the database.
Of course, whether this would even give you a performance improvement would depend on how often your database is getting modified by other programs.
You could also simply have your program hold a lock on the database so that no one else can modify it for the duration of your processing (allowing you to keep the items in memory and guarantee that the database is unchanged). This is extremely bad practice, though, because you will essentially break any other application that uses the table at the same time for anything other than reading.
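As a rough illustration of the caching side of this (not the trigger wiring itself), here is a minimal sketch that loads a table into a map once and re-reads it whenever the application is notified of a change; the users table, its columns, and the JDBC URL are invented for the example and assume a suitable JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class UserCache {
    private final String jdbcUrl;
    private volatile Map<Long, String> usersById = Collections.emptyMap();

    public UserCache(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    // Call once at startup and again from whatever mechanism the triggers use to notify the app.
    public void refresh() throws SQLException {
        Map<Long, String> fresh = new HashMap<>();
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM users")) {
            while (rs.next()) {
                fresh.put(rs.getLong("id"), rs.getString("name"));
            }
        }
        // Single atomic reference swap: readers see either the old or the new snapshot.
        usersById = Collections.unmodifiableMap(fresh);
    }

    // Served from memory, orders of magnitude faster than a round trip to the database.
    public String nameOf(long id) {
        return usersById.get(id);
    }
}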

If you are constantly reading the same data and it will never change, it makes sense to keep that data in a Java collection, like a List or a Set. It is all about performance: making database calls and carrying out database transactions from Java takes time (my thesis at the University of London was all about this issue). If you hold the data in a Java collection, you do not have to keep communicating with the database across the 'impedance mismatch' between the two worlds: one uses the Java paradigm and the other the database paradigm.
As for keeping them in sync, that is a whole different beast altogether.


Simulating DELETE cascades with WeakHashMaps

I'm developing a service that monitors computers. Computers can be added to or removed from monitoring by a web GUI. I keep reported data basically in various maps like Map<Computer, Temperature>. Now that the collected data grows and the data structures become more sophisticated (including computers referencing each other) I need a concept for what happens when removing computers from monitoring. Basically I need to delete all data reported by the removed computer. The most KISS-like approach would be removing the data manually from memory, like
public void onRemove(Computer computer) {
    temperatures.remove(computer);
    // ...
}
This method has to be changed whenever I add features :-( I know Java has a WeakHashMap, so I could store reported data like so:
Map<Computer, Temperature> temperatures = new WeakHashMap<>();
I could call System.gc() whenever a computer is removed from monitoring in order to have all associated data eagerly removed from these maps.
While the first approach seems a bit like primitive MyISAM tables, the second one resembles DELETE cascades in InnoDB tables. But still it feels a bit uncomfortable and is probably the wrong approach. Could you point out advantages or disadvantages of WeakHashMaps or propose other solutions to this problem?
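For reference, a tiny sketch of the behaviour the WeakHashMap approach relies on; Computer and Temperature are placeholders for the real monitoring classes, and note that System.gc() is only a hint, so the entry is not guaranteed to disappear immediately:

import java.util.Map;
import java.util.WeakHashMap;

public class WeakMapDemo {
    // Placeholder types standing in for the real classes.
    record Computer(String hostname) {}
    record Temperature(double celsius) {}

    public static void main(String[] args) {
        Map<Computer, Temperature> temperatures = new WeakHashMap<>();

        Computer c = new Computer("host-1");
        temperatures.put(c, new Temperature(42.0));

        c = null;        // drop the last strong reference to the key
        System.gc();     // a hint only; the key may be collected later (or not yet)

        // The entry vanishes only once the key has actually been collected.
        System.out.println("entries left: " + temperatures.size());
    }
}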
Not sure if it is possible in your case, but couldn't your Computer class have all the attributes, and then you keep a list of monitoredComputers (or a wrapper class called MonitoredComputers that holds any logic needed, like getTemperatures())? That way a computer can be removed from that list without looking through all the attribute maps. If the computer is referenced from another computer, then you have to loop through that list and remove the references from those that hold it.
I'm not sure using a WeakHashMap is a good idea. As you say, you may reference Computer objects from several places, so you'll need to make sure all references except one go through weak references, and remove the hard reference when the Computer is deleted. As you have no control over when weak references are cleared, you may not get consistent results.
If you don't want to have to maintain manually the removal, you could have a flag on Computer objects, like isAlive(). Then you store Computers in special subclasses of Maps and Collections that at read time check if the Computer is alive and if not silently remove it. For example, on a Map<Computer, ?>, the get method would check if the computer is alive, and if not will remove it and return null.
Or the subclasses of Maps and Collections could just register themselves to a single computerRemoved() event, and automatically know how to remove the deleted computers, and you wouldn't have to manually code the removal. Just make sure you keep references to Computer only inside your special maps and collections.
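A rough sketch of that idea, assuming Computer exposes an isAlive() flag (the names are illustrative, not from the question's code):

import java.util.HashMap;
import java.util.Map;

// Stand-in for the real class; all that matters here is the isAlive() flag.
interface Computer {
    boolean isAlive();
}

// Wraps a plain map and silently drops entries whose Computer is no longer alive.
class LiveComputerMap<V> {
    private final Map<Computer, V> delegate = new HashMap<>();

    public void put(Computer computer, V value) {
        delegate.put(computer, value);
    }

    public V get(Computer computer) {
        if (!computer.isAlive()) {
            delegate.remove(computer);   // lazy cleanup at read time
            return null;
        }
        return delegate.get(computer);
    }

    // Full sweep, e.g. registered against a computerRemoved() event.
    public void purgeDead() {
        delegate.keySet().removeIf(c -> !c.isAlive());
    }
}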
Why not use an actual SQL database? You could use an embedded database engine such as H2, Apache Derby / Java DB, HSQLDB, or SQLite. Using an embedded database engine has the added benefits:
You could inspect the live contents of the monitoring data at any time using the corresponding DB engine's command line client.
You could build a new tool to access and manipulate the data by connecting to a shared database instance.
The schema itself is a form of documentation as to the structure of the monitoring data and the relationships between entities.
You could store different types of data for different types of computers by way of schema normalization.
You can back up the monitoring data.
If you need to restart the monitoring server, you won't lose all of the monitoring data.
Your Web UI could use a JPA implementation such as Hibernate to access the monitoring data and add new records. Or, for a more lightweight solution, you might consider using Spring Framework's JdbcTemplate and SimpleJdbcInsert classes. There are also OrmLite, ActiveJDBC, and jOOQ, each of which aims to offer simpler access to databases than raw JDBC.
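As an illustration, a minimal sketch using H2 in embedded in-memory mode; the schema is invented for the example and assumes the com.h2database:h2 driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedDbSketch {
    public static void main(String[] args) throws SQLException {
        // "mem:" keeps everything in memory; a file URL (e.g. jdbc:h2:./monitoring) survives restarts.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:monitoring", "sa", "")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE computer (" +
                           "  id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY," +
                           "  hostname VARCHAR(255) NOT NULL)");
                st.execute("CREATE TABLE temperature (" +
                           "  computer_id BIGINT NOT NULL REFERENCES computer(id) ON DELETE CASCADE," +
                           "  reported_at TIMESTAMP NOT NULL," +
                           "  celsius DOUBLE NOT NULL)");
            }
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO computer (hostname) VALUES (?)")) {
                ps.setString(1, "host-1");
                ps.executeUpdate();
            }
            // Deleting a row from computer now cascades to its temperature rows,
            // which replaces the manual onRemove() cleanup from the question.
        }
    }
}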
The problem with WeakHashMap is that managing the references to Computer objects seems difficult and easily breakable.
Hash table based implementation of the Map interface, with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use. More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded its entry is effectively removed from the map, so this class behaves somewhat differently from other Map implementations.
It could be the case that a reference to a Computer object still exists somewhere, in which case the object will never be reclaimed and its entries will stay in the WeakHashMaps. I would prefer a more deterministic approach.
But if you decide to go down this route, you can mitigate the problem I point out by wrapping all these Computer object keys in a class that has strict controls: this wrapper object would create and store the keys and take care never to let references to those keys leak out.
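Such a wrapper might look roughly like this; the class and field names are invented, and the point is only that the registry holds the sole long-lived strong references to the keys:

import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the monitored-computer class.
class Computer {
    final String hostname;
    Computer(String hostname) { this.hostname = hostname; }
}

// Holds the only long-lived strong references to Computer keys. If every WeakHashMap is
// keyed by these instances, unregistering a computer (and not leaking the reference
// anywhere else) eventually lets the GC drop its entries in every weak map.
class ComputerRegistry {
    private final Map<String, Computer> byHostname = new HashMap<>();

    public synchronized Computer register(String hostname) {
        return byHostname.computeIfAbsent(hostname, Computer::new);
    }

    public synchronized void unregister(String hostname) {
        byHostname.remove(hostname);
    }
}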
Novice coder here, so maybe this is too clunky:
Why not keep the monitored computers in a HashMap, and move removed computers to a WeakHashMap? That way all removed computers are kept separate and easy to work with, and the GC cleans up their entries once nothing else references them.

Java Map vs Backend database. Which is better for speed and for multithreading for relations?

My algorithm will likely not be used on the web. The object I describe may be used by multiple threads, however.
The original object I had designed emulated pointers.
Reduced to its essentials, a symbol maps to multiple pointers, and each unique pointer maps back to a single symbol.
When I was finally finished and had a working algorithm, it turns out I actually needed six maps in total (these maps are called tens of thousands of times).
Initial testing with a very very small sample set of symbols showed the program to be working very efficiently. However, I'm afraid that once I increase the number of symbols by a few thousand-fold it will become sluggish.
Once the program completes and closes, the pointers do not need to persist.
I was wondering if I should re-implement my algorithm using a database as a backend. Would this be better than using all of these maps?
The maps are stored in memory, while the database would live on a hard drive (I have an SSD, so I'm afraid there would be a large difference in performance between my machine and a machine using a spinning SATA/PATA drive). The map operations should also be O(1). The maps might also become very ugly once multithreading is introduced, unless I use thread-safe maps, which would slow the program down. A database would handle these tasks efficiently.
I've formally written out the proper relations, and I'm sure I can implement it in a database if that was the best option. Which is the better option?
If you do not need to persist that data structure, do not try to back it with a database. In your place, I would run some load tests with a realistic amount of data against the data structure you already have and refine it from there if the performance is not what you expected.
In any case, the current trend is to use relational databases on disk for persistence and to cache frequently queried data in "big hashtables" in memory for performance, so I doubt falling back to a database would improve your performance.
If your data structures fit in memory, I would be shocked if using a database would be faster (not even considering the complexity of using a database implementation). By throwing away all the assumptions, features, safety and consistency that a database must maintain, you will gain performance. Even the best DB implementation, assuming enough memory to cache everything, pretty much has a ConcurrentHashMap as an upper bound on performance. As a practical matter, you won't get CHM performance even with great caching, because a DB API will require defensive copies or cache invalidations that you can avoid with your in-memory structure.
Apart from the likely performance boost simply from using an in-memory hashmap, you may also get additional performance by tuning your structure based on your specific use case. For example, perhaps the initial lookup is multi-threaded, but individual values are only accessed by a single thread. In that case, you can avoid locking those values.
Hard drives, even fast ones, are several orders of magnitude slower than main memory. So if your goal is performance, you should stay in memory and use maps. For thread safety you can just use a ConcurrentHashMap, which uses fine-grained locking and lock-free reads, so the synchronisation penalty in a multi-threaded environment should be minimal.
You should also check whether a single thread provides enough performance - multiple threads always introduce some overhead, and they need to bring enough gains to offset it.
You may also want to check in-memory DBs such as HyperSQL or H2 Database.
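To make the in-memory option concrete, here is a minimal thread-safe sketch of the symbol/pointer maps described in the question; Symbol and Pointer are placeholders for the actual types:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SymbolTable {
    // Placeholder types for the example.
    record Symbol(String name) {}
    record Pointer(long address) {}

    // One symbol -> many pointers; each pointer -> exactly one symbol.
    private final ConcurrentHashMap<Symbol, Set<Pointer>> pointersBySymbol = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Pointer, Symbol> symbolByPointer = new ConcurrentHashMap<>();

    public void link(Symbol symbol, Pointer pointer) {
        pointersBySymbol
            .computeIfAbsent(symbol, s -> ConcurrentHashMap.newKeySet())
            .add(pointer);
        symbolByPointer.put(pointer, symbol);
    }

    public Symbol resolve(Pointer pointer) {
        return symbolByPointer.get(pointer);   // O(1), no disk access involved
    }
}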

mongodb - how to lock collection with mongodb java client

I need to update a whole collection in a background thread while read operations might take place at the same time. It takes about 3 seconds to update the collection when I benchmark it. Is there any way to lock a collection while it is being updated? I tried creating a new collection, inserting all the documents into it, and renaming it to the original collection with "dropTarget=true", but I am not sure how safe and stable that is in terms of sharding. I read that renameCollection is incompatible with sharding.
It would be great if someone could suggest a good approach.
Thanks.
So you presented two possible strategies to update your collection: one updating it in place with a lock on it, the other building a temporary collection and renaming it.
As the MongoDB documentation clearly states, this will not work for sharded collections (http://docs.mongodb.org/manual/reference/command/renameCollection/). From my understanding this means the collection you want to rename isn't sharded; since you need to delete the other collection before you do the actual renaming, you'll most likely lose any previously kept sharding information, so you would need to re-enable sharding afterwards. I strongly discourage using the two-collection approach, especially if you're sharding your data.
You would need to get all the data from your sharded collection and store it centrally; once you're done updating, you need to rename the collection and shard it again. This causes a lot of I/O for your whole system, especially for the client doing the update.
Depending on your system architecture (if it has a single point of entry), you could easily hold a global flag telling you whether the collection update is currently running, and forbid other write operations while it is (see the sketch below).
For multiple entry points into your MongoDB you might try $isolated, but this doesn't work with sharded collections, and I'm not sure whether it allows concurrent reads; the documentation isn't very clear on that.
Is it strictly forbidden to write any data while the update is in progress? What type of updates do you perform? Can they influence each other, or would concurrent writes be possible?
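If there really is a single point of entry, the global-flag idea can be as simple as a read-write lock around the collection: the bulk update takes the write lock, application reads take the read lock. A rough sketch (the actual MongoDB calls are left out):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class CollectionGuard {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Background thread: exclusive access for the ~3 second bulk update.
    public void runBulkUpdate(Runnable update) {
        lock.writeLock().lock();
        try {
            update.run();          // rewrite or swap the collection here
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Request threads: reads run concurrently with each other, but wait while an update runs.
    public <T> T read(Supplier<T> query) {
        lock.readLock().lock();
        try {
            return query.get();    // query the collection here
        } finally {
            lock.readLock().unlock();
        }
    }
}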

Using Java locks for database concurrency

I have the following scenario.
I have two tables. One stores multiple values that act as counters for transactions. Through a Java application, a value is read from the first table, incremented, and written to the second table, and the new value is also written back to the first table. Obviously there is potential for this to go wrong, as it's a multi-user system.
My solution to the issue, in Java, is to provide locks that have to (well, should) be acquired before any action can be taken on either table. These locks, ReentrantLocks, are static, and there is one for each column in Table 1, as the values are completely independent of each other.
Is this a recommended approach?
Cheers.
No. Use implicit Database Locks¹ for Database Concurrency. Relational databases support Transactions, which are a vital part of ACID: use them.
Java-centric locks will not work cross-VM and as such will not help in multi-User/Server environments.
¹ Databases are smart enough to acquire/release locks to ensure Consistency and Isolation and may even use "lock free" implementations such as MVCC. There are rare occasions when explicit database locks must be requested, but this is an advanced use-case.
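To make the transactional approach concrete, here is a rough JDBC sketch of the read-increment-write step. The table and column names are invented, and the exact SELECT ... FOR UPDATE syntax depends on your database:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CounterService {
    // The row lock taken by FOR UPDATE serialises concurrent users on the same counter.
    public long incrementCounter(Connection con, String counterName) throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try {
            long current;
            try (PreparedStatement select = con.prepareStatement(
                     "SELECT counter_value FROM counters WHERE name = ? FOR UPDATE")) {
                select.setString(1, counterName);
                try (ResultSet rs = select.executeQuery()) {
                    rs.next();
                    current = rs.getLong(1);
                }
            }
            long next = current + 1;
            try (PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO transaction_log (counter_name, counter_value) VALUES (?, ?)")) {
                insert.setString(1, counterName);
                insert.setLong(2, next);
                insert.executeUpdate();
            }
            try (PreparedStatement update = con.prepareStatement(
                     "UPDATE counters SET counter_value = ? WHERE name = ?")) {
                update.setLong(1, next);
                update.setString(2, counterName);
                update.executeUpdate();
            }
            con.commit();
            return next;
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}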
Whilst agreeing with some of the sentiments of #pst's answer, I would say this depends slightly.
If the sequence of events is, and probably always will be, essentially "SQL oriented", then you may as well do the locking at the database level (and indeed, probably implicitly via the use of transactions).
However if there is, or you are planning to build in, significant data manipulation logic within your app tier (either generally or in the case of this specific operation), then locking at the app level may be more appropriate. (In reality, you will probably still run your SQL in transactions so that you're actually locking at both levels.)
I don't think the issue of multiple VMs is necessarily a compelling issue on its own for relying on DB-level locking. If you have multiple server apps accessing the database, you will in any case want to establish a well-defined protocol for which data is accessed concurrently under what circumstances. And in a system of moderate complexity, you will in any case want to build in a system of running periodic sanity checks on the data. (Even if your server apps are perfectly behaved 100% of the time, will back end tech support never ever ever have to run some miscellaneous SQL on the database outside your app...?)
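If you do add app-tier locking on top of the transaction, one common shape is a per-counter lock around the transactional SQL; a rough sketch, reusing the hypothetical CounterService from the sketch above (and, as noted, effective only within a single VM):

import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class AppTierCounterLocks {
    // One lock per counter, shared across this VM only.
    private static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    public long increment(CounterService service, Connection con, String counterName)
            throws SQLException {
        ReentrantLock lock = LOCKS.computeIfAbsent(counterName, n -> new ReentrantLock());
        lock.lock();
        try {
            // The SQL still runs in a transaction, so you are locking at both levels.
            return service.incrementCounter(con, counterName);
        } finally {
            lock.unlock();
        }
    }
}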

Retrieve all essential data on startup

In a Java web application, is it proper (or "standard"?) for all the essential data, such as config data, message data, code maintenance data, dropdown option data, etc. (assuming the data will not be updated frequently), to be loaded from the database into "static" variables when the server starts up? Or is it preferable to retrieve the data by querying the DB per request?
Thanks for all your advice here.
It is perfectly valid to pull out all the data that is not going to be modified during the application life-cycle and keep it in memory as a singleton or something similar.
This is a good idea because it saves DB hits and retrieval is faster. A lot of environment-specific settings and other data can also be pulled once and kept in an immutable HashMap for future requests.
In a common web app you generally do not have so many config data/option objects that they could eat up a lot of memory and cause an OOM. But if you have a table with hundreds of thousands of config rows, it is better to pull objects as and when they are requested. And if you do want to keep them in memory, think about putting them in a key-value store like Memcached.
We used the DB to store config values and Ehcache to avoid a lot of DB hits. This way you don't need to worry about memory consumption (it will use whatever memory you give it).
Ehcache is one of many available DB cache solutions and can be configured on top of JPA etc.
You can configure Ehcache (or many other cache providers) to treat the tables as read-only, in which case it will only go to the DB if it's explicitly told to invalidate the cache. This performs pretty well. The overhead becomes visible, though, when reads occur very frequently (like 100/sec); usually storing the config value in a local variable, avoiding reads inside loops, and passing it through the method stack during the invocation mitigates this well enough.
Storing values in a singleton as plain Java objects performs best, but if you want to modify them without restarting the application, it becomes a little more involved.
Here is a simple way to achieve dynamic configuration with Java objects:
private volatile ImmutableMap<String, Object> param_value;
Basically you'll have to start thinking about multi-threaded access and memory issues (though it's quite unlikely that you'll run out of memory because of configuration values, unless you have binary data as config values, etc.).
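Fleshed out a little, and assuming Guava's ImmutableMap plus a hypothetical loadFromDb() that re-reads the configuration table, the holder could look like this:

import com.google.common.collect.ImmutableMap;
import java.util.Map;

public class AppConfig {
    // volatile: a reloaded map becomes visible to all threads in one step.
    // ImmutableMap: readers can never observe a half-built snapshot.
    private volatile ImmutableMap<String, Object> paramValues = ImmutableMap.of();

    public Object get(String key) {
        return paramValues.get(key);
    }

    // Call at startup and whenever the configuration should be refreshed.
    public void reload() {
        Map<String, Object> fresh = loadFromDb();     // hypothetical DB read
        paramValues = ImmutableMap.copyOf(fresh);     // single atomic reference swap
    }

    private Map<String, Object> loadFromDb() {
        // e.g. SELECT param_name, param_value FROM app_config via JDBC or JPA
        throw new UnsupportedOperationException("left out of the sketch");
    }
}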
In essence, I'd recommend using the DB and some cache provider unless that part of the code really needs high performance.
