I'm developing a service that monitors computers. Computers can be added to or removed from monitoring by a web GUI. I keep reported data basically in various maps like Map<Computer, Temperature>. Now that the collected data grows and the data structures become more sophisticated (including computers referencing each other) I need a concept for what happens when removing computers from monitoring. Basically I need to delete all data reported by the removed computer. The most KISS-like approach would be removing the data manually from memory, like
public void onRemove(Computer computer) {
    temperatures.remove(computer);
    // ...
}
This method would have to be changed whenever I add a feature :-( I know Java has a WeakHashMap, so I could store reported data like so:
Map<Computer, Temperature> temperatures = new WeakHashMap<>();
I could call System.gc() whenever a computer is removed from monitoring in order to have all associated data eagerly removed from these maps.
While the first approach seems a bit like primitive MyISAM tables, the second one resembles DELETE cascades in InnoDB tables. But still it feels a bit uncomfortable and is probably the wrong approach. Could you point out advantages or disadvantages of WeakHashMaps or propose other solutions to this problem?
I'm not sure whether this is possible in your case, but couldn't your Computer class hold all the attributes itself? You would then keep a list of monitoredComputers (or a wrapper class called MonitoredComputers that wraps any logic you need, such as getTemperatures()). A computer can then be removed from that one list, and you don't have to look through every attribute map. If the computer is referenced from other computers, you still have to loop through the list and remove those references.
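A minimal sketch of that idea, assuming a hypothetical Computer class with a temperature field and a set of references to other computers (none of these names come from the question):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MonitoredComputersSketch {
    // Hypothetical Computer: reported data lives on the object itself.
    static class Computer {
        final String name;
        Double temperature;
        final Set<Computer> references = new HashSet<>(); // links to other computers

        Computer(String name) { this.name = name; }
    }

    // Wrapper around the single list of monitored computers.
    static class MonitoredComputers {
        private final Map<String, Computer> byName = new LinkedHashMap<>();

        void add(Computer computer) { byName.put(computer.name, computer); }

        Computer get(String name) { return byName.get(name); }

        // One removal point: drop the computer and every reference to it.
        void remove(Computer removed) {
            byName.remove(removed.name);
            for (Computer other : byName.values()) {
                other.references.remove(removed);
            }
        }

        // Derived view instead of a separate Map<Computer, Temperature>.
        Map<String, Double> getTemperatures() {
            Map<String, Double> result = new LinkedHashMap<>();
            for (Computer c : byName.values()) {
                result.put(c.name, c.temperature);
            }
            return result;
        }
    }
}
```

New features then add fields to Computer rather than new maps, so remove() rarely needs to change.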
I'm not sure using a WeakHashMap is a good idea. As you say, you may reference Computer objects from several places, so you'll need to make sure that all references except one go through weak references, and that the one hard reference is removed when the Computer is deleted. As you have no control over when the garbage collector clears weak references, you may not get consistent results.
If you don't want to have to maintain manually the removal, you could have a flag on Computer objects, like isAlive(). Then you store Computers in special subclasses of Maps and Collections that at read time check if the Computer is alive and if not silently remove it. For example, on a Map<Computer, ?>, the get method would check if the computer is alive, and if not will remove it and return null.
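A rough sketch of such a map, assuming a hypothetical Computer class with isAlive() and markRemoved() methods. Only get() is guarded here; a real version would also have to cover containsKey(), iteration, size() and so on:

```java
import java.util.HashMap;

class AliveMapSketch {
    // Hypothetical Computer carrying the "alive" flag.
    static class Computer {
        private final String name;
        private volatile boolean alive = true;

        Computer(String name) { this.name = name; }

        boolean isAlive() { return alive; }

        // Called once when the computer is removed from monitoring.
        void markRemoved() { alive = false; }

        @Override
        public String toString() { return name; }
    }

    // Map subclass that silently drops entries of removed computers on read.
    static class AliveCheckingMap<V> extends HashMap<Computer, V> {
        @Override
        public V get(Object key) {
            if (key instanceof Computer && !((Computer) key).isAlive()) {
                remove(key);   // lazy cleanup at read time
                return null;
            }
            return super.get(key);
        }
    }
}
```

Note that entries nobody reads again are never cleaned up with this approach, so it trades deterministic removal for convenience.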
Or the subclasses of Maps and Collections could just register themselves to a single computerRemoved() event, and automatically know how to remove the deleted computers, and you wouldn't have to manually code the removal. Just make sure you keep references to Computer only inside your special maps and collections.
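The event-based variant could look roughly like this. Monitoring, onComputerRemoved() and ComputerMap are made-up names for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.function.Consumer;

class RemovalEventSketch {
    // Hypothetical Computer; identity-based equals/hashCode is enough here.
    static class Computer {
        final String name;
        Computer(String name) { this.name = name; }
    }

    // Single place where the computerRemoved event is published.
    static class Monitoring {
        private final List<Consumer<Computer>> listeners = new ArrayList<>();

        void onComputerRemoved(Consumer<Computer> listener) {
            listeners.add(listener);
        }

        void remove(Computer computer) {
            for (Consumer<Computer> listener : listeners) {
                listener.accept(computer);
            }
        }
    }

    // Map that registers itself and drops the data of removed computers.
    static class ComputerMap<V> extends HashMap<Computer, V> {
        ComputerMap(Monitoring monitoring) {
            monitoring.onComputerRemoved(c -> remove(c));
        }
    }
}
```

Adding a feature then means creating another ComputerMap; the removal code itself stays untouched. The listener list keeps every map strongly reachable, which is fine as long as the maps live as long as the service does.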
Why not use an actual SQL database? You could use an embedded database engine such as H2, Apache Derby / Java DB, HSQLDB, or SQLite. Using an embedded database engine has the added benefits:
You could inspect the live contents of the monitoring data at any time using the corresponding DB engine's command line client.
You could build a new tool to access and manipulate the data by connecting to a shared database instance.
The schema itself is a form of documentation as to the structure of the monitoring data and the relationships between entities.
You could store different types of data for different types of computers by way of schema normalization.
You can back up the monitoring data.
If you need to restart the monitoring server, you won't lose all of the monitoring data.
Your Web UI could use a JPA implementation such as Hibernate to access the monitoring data and add new records. Or, for a more lightweight solution, you might consider using Spring Framework's JdbcTemplate and SimpleJdbcInsert classes. There is also OrmLite, ActiveJDBC, and jOOQ which each aim to offer simpler access to databases than JDBC.
The problem with WeakHashMap is that managing the references to Computer objects seems difficult and easily breakable.
From the WeakHashMap Javadoc: "Hash table based implementation of the Map interface, with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use. More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded its entry is effectively removed from the map, so this class behaves somewhat differently from other Map implementations."
It could be the case that a reference to a Computer object might still exist somewhere and the object will not be deleted for the WeakHashMaps. I would prefer a more deterministic approach.
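To illustrate the point, here is a small sketch (the Computer class and the referencedElsewhere list are stand-ins I made up). As long as any strong reference to the key survives, the entry stays in the WeakHashMap, no matter how often System.gc() is called:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

class WeakMapLeakSketch {
    static class Computer {
        final String name;
        Computer(String name) { this.name = name; }
    }

    // Returns the number of entries left after "removing" the computer.
    static int entriesAfterGc() {
        Map<Computer, Double> temperatures = new WeakHashMap<>();
        List<Computer> referencedElsewhere = new ArrayList<>();

        Computer pc = new Computer("pc-1");
        temperatures.put(pc, 41.5);
        referencedElsewhere.add(pc);   // e.g. another computer pointing at this one

        pc = null;                     // the "official" reference is dropped
        System.gc();                   // only a hint, and the key is still reachable anyway

        int size = temperatures.size();
        // Using the list here keeps the forgotten reference reachable up to this point.
        return referencedElsewhere.isEmpty() ? -1 : size;
    }
}
```

The stale temperature is still there, and nothing tells you which leftover reference is responsible.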
But if you decide to go down this route, you can mitigate the problem I point out by wrapping all these Computer object keys in a class that has strict controls. This wrapper object will create and store the keys and will pay attention to never let references of those keys to leak out.
Novice coder here, so maybe this is too clunky:
Why not keep the monitored computers in a HashMap, and move removed computers to a WeakHashMap? That way all removed computers are separate and easy to work with, with the GC cleaning up the oldest entries.
Related
I am wondering which approach is better. Should we use fine-grained entities on the grid and later construct functionally rich domain objects out of the fine-grained entities?
Or, alternatively, should we construct coarse-grained domain objects and store them directly on the grid, using the entities just for persistence?
Edit: I think that this question is not yet answered completely. So far we have comments from Hazelcast, GemFire and Ignite. We are missing Infinispan, Coherence .... That is for completeness' sake :)
I agree with Valentin: it mainly depends on the system you want to use. Normally I would store enhanced domain objects directly. However, if you have only very few objects but each one is massive, you end up with bad distribution and unequal memory usage across the nodes. If your domain objects are "normally" sized and you have plenty of them, you shouldn't worry.
In Hazelcast it is better to store those objects directly but be aware of using a good serialization system as Java Serialization is slow. If you want to query on properties inside your domain objects you should also consider adding indexes.
I believe it can differ from one data grid to another. I'm more familiar with Apache Ignite, and in this case the fine-grained approach works much better, because it's more flexible and in many cases gives better data distribution and therefore better scalability. Ignite also provides rich SQL capabilities [1] that allow you to join different entities and execute indexed searches. This way you will not lose performance with a fine-grained model.
[1] https://apacheignite.readme.io/docs/sql-queries
One advantage of a coarse-grained object is data consistency. Everything in that object gets saved atomically. But if you split that object up into 4 small objects, you run the risk that 3 objects save and 1 fails (for whatever reason).
We use GemFire, and tend to favor coarse-grained objects...up to a point. For example our Customer object contains a list of Addresses. An alternative design would be to create one GemFire region for "Customer" and a separate GemFire region for "CustomerAddresses" and then hope you can keep those regions in sync.
The downside is that every time someone updates an Address, we re-write the entire Customer object. That's not very efficient, but our traffic patterns show that address changes are very rare (compared to all the other activity), so this works out fine.
One experience we've had, though, is the downside of using Java Serialization for long-term data storage. We avoid it now because of all the problems caused by object compatibility as objects change over time. Not to mention it becomes a headache for .NET clients to read the objects. :)
I've seen some Java code in which the rows of a database table are held in a collection (usually an ArrayList or HashMap).
What is the benefit of this approach?
How do you keep the collection and table synced?
Why not send a query to the database for each retrieval?
Is it a good practice at all?
The benefit is performance. Querying a database is resource and time intensive. If your tables are small enough that you can hold the items in memory, simple reference to local memory is orders of magnitude faster.
As far as keeping them in sync, that's a more difficult answer and would depend on the use case. In most cases, unless you've set up some good custom architecture, there will be no way to guarantee that the database and the in-memory collection are synchronized once you've retrieved it into memory.
If you wanted to take this approach and have the collection and the database be in sync (kind of like having your cake and eating it, too), you could do something like the following:
Set up database triggers on your table for any Insert, Update, or Delete.
Have the triggers run a script which notifies your application somehow (monitoring thread, service, whatever).
Have the application, once notified, update the collection by reading the database.
Of course, whether this would even give you a performance improvement would depend on how often your database is getting modified by other programs.
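The application side of those three steps could be sketched as follows. TableCache and onTableChanged() are hypothetical names, the Supplier stands in for the real SELECT query, and how the trigger's notification actually reaches the JVM is left open:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

class TableCacheSketch {
    // In-memory copy of a table that is reloaded when a change notification arrives.
    static class TableCache<R> {
        private final Supplier<List<R>> loader;   // stands in for the database query
        private volatile List<R> rows;            // readers always see a complete snapshot

        TableCache(Supplier<List<R>> loader) {
            this.loader = loader;
            this.rows = Collections.unmodifiableList(new ArrayList<>(loader.get()));
        }

        // Fast path: no database round trip.
        List<R> rows() { return rows; }

        // Called by whatever relays the trigger's notification to the application.
        void onTableChanged() {
            rows = Collections.unmodifiableList(new ArrayList<>(loader.get()));
        }
    }
}
```

Between a change in the database and the arrival of the notification, readers still see the old snapshot, so this narrows the window of staleness rather than closing it.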
You could also in your program simply maintain a lock on the database so that no one else could modify it for the duration of your processing (allowing you to keep the items in memory and guarantee that the database is unchanged), but this is extremely bad practice, because you will essentially break any other application using that table at the same time (for anything other than reading).
If you are constantly reading the same data that will never be changed, it makes sense to keep that data within a Java Collection, like a List or a Set. It is all about performance: making database calls and carrying out database transactions through Java does take time (my thesis at the University of London was all about this issue). However, if you have that data within a Java collection, you do not have to keep communicating with the database, which has an 'impedance mismatch' as they are two separate entities; one using the Java paradigm and the other using the database paradigm.
As for keeping them in sync, that is whole different beast altogether.
I need to update a whole collection in a background thread, but read operations might take place at the same time. It takes about 3 seconds to update the collection when I benchmark it. Is there any way to lock a collection while updating it? I tried creating a new collection, inserting all the documents into it, and renaming it to the original collection with "dropTarget=true", but I am not sure how safe and stable that is in terms of sharding. I read that renameCollection is incompatible with sharding.
It would be great if someone could suggest a good approach.
Thanks.
So you presented two possible strategies for updating your collection: one in place with a lock on the collection, and the other with a temporary collection?
As the MongoDB documentation clearly states, renameCollection will not work for sharded collections (http://docs.mongodb.org/manual/reference/command/renameCollection/). From my understanding this means that the collection you want to rename must not be sharded. And since you need to drop the other collection before the actual rename, you'll most likely lose any previously kept sharding information, so you would need to re-enable sharding afterwards. I strongly discourage the two-collection approach, especially if you're sharding your data.
You would need to get all the data from your sharded collection and store it centralized, once you're done with updating you need to rename the collection and shard it again. This will cause much I/O for your whole system, especially for the client doing the update.
Depending on your system architecture (with a single point of entry), you could simply hold a global flag that tells you whether the collection update is currently running, and forbid other write operations while it is set.
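For the single-entry-point case, the flag could be as simple as the sketch below. UpdateGate and its methods are invented for illustration, and the Runnables stand in for the actual MongoDB calls. It is a coarse guard, not a lock: a write that passed the check just before the flag was set can still overlap with the update, so use something like a ReentrantReadWriteLock if you need strict exclusion.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class UpdateGateSketch {
    static class UpdateGate {
        private final AtomicBoolean bulkUpdateRunning = new AtomicBoolean(false);

        // Runs the bulk update unless one is already in progress.
        boolean runBulkUpdate(Runnable update) {
            if (!bulkUpdateRunning.compareAndSet(false, true)) {
                return false;
            }
            try {
                update.run();
                return true;
            } finally {
                bulkUpdateRunning.set(false);   // always reopen the gate
            }
        }

        // Ordinary writes are rejected while the bulk update is running.
        boolean tryWrite(Runnable write) {
            if (bulkUpdateRunning.get()) {
                return false;
            }
            write.run();
            return true;
        }
    }
}
```

Reads are not touched by the flag at all, so they keep working during the roughly 3 seconds the update takes.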
For multi-entry points into your MongoDB you might try $isolated, but this doesn't work with sharded collections. And I'm not sure if it allows read operations, the documentation isn't very clear.
Is it strictly disallowed to write any data, while the update is in progress? What type of updates do you perform. Can they influence each other? Or would it be possible to have concurrent writes?
I'm reading a hierarchy of objects with ORMLite. It is shaped like a tree: parents have a @ForeignCollection of 0+ children, and every child refers to its parent with @DatabaseField(foreign=true). I'm reading and saving the whole hierarchy at once.
As I'm new to ORM in general, and to ORMLite as well, I didn't know that when objects with the same ID in the database are read, they won't be created as the actually same object with the same Identity, but as several duplicates having the same ID. Meaning, I'm now facing the problem that (let's say "->" stands for "refers to") A -> B -> C != C -> B -> A.
I was thinking of solving the problem by manually reading the objects through the provided DAOs and putting them together by their ID, ensuring that objects with the same ID have the same identity.
Is there an ORMLite-native way of solving this? If yes, what is it? If not, what are common ways of solving this problem? Is this a general problem of ORM? Does it have a name (I'd like to learn more about it)?
Edit:
My hierarchy is so that one building contains several floors, where each floor knows its building, and each floor contains several zones, where every zone knows its floor.
Is this a general problem of ORM? Does it have a name (I'd like to learn more about it)?
It is a general pattern for ORMs and is called “Identity Map”: within a session, no matter where in your code you got a mapped object from the ORM, there will be only one object representing a specific row in the db (i.e. having its primary key).
I love this pattern: you can retrieve something from the db in one part of your code, even make modifications to it, store that object in an instance variable, etc. And in another part of the code, if you get hold of an object for the same “db row” (by whatever means: you got it passed as an argument, you made a bulk query to the db, you created a “new” mapped object with the primary key set to the same value and added it to the session), you will end up with the same object. Even the modifications from before (including unflushed ones) will be there.
(Adding a mapped object to the session may fail because of this, and depending on the ORM and programming language, adding it may give you another object back as “the same”.)
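Stripped of everything an ORM adds around it, the pattern is roughly this (a sketch only; Building, IdentityMap and the loadFromDb function are placeholders, not part of any particular ORM):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class IdentityMapSketch {
    // Placeholder entity with a primary key.
    static class Building {
        final int id;
        String name;

        Building(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Session-scoped identity map: at most one object per primary key.
    static class IdentityMap<K, T> {
        private final Map<K, T> loaded = new HashMap<>();
        private final Function<K, T> loadFromDb;   // stands in for the real query

        IdentityMap(Function<K, T> loadFromDb) {
            this.loadFromDb = loadFromDb;
        }

        // Hits the database only the first time an id is requested.
        T get(K id) {
            return loaded.computeIfAbsent(id, loadFromDb);
        }
    }
}
```

Two lookups of the same id return the very same instance, so a change made through one reference is visible through the other.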
Unfortunately there is no ORMLite-native way of solving this problem. More complex ORM systems (such as Hibernate) have caching layers which are there specifically for this reason. ORMLite does not have a cache layer, so it doesn't know that it just returned an object with the same id "recently". Here's the documentation of Hibernate caching:
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html
However, ORMLite is designed to be Lite and cache layers violate that designation IMO. About the only [unfortunate] solution that I see to your issue in ORMLite is to do what you are doing -- rebuilding the object tree based on the ids. If you give more details about your hierarchy we may be able to help more specifically.
So after thinking about your case a bit more, @riwi, it occurred to me that if you have a Building that contains a collection of Floors, there is no reason why the Building object on each of the Floors in the collection cannot be set to the parent Building object. Duh. ORMLite has all of the information it needs to make this happen. I implemented this behavior and it was released in version 4.24.
Edit:
As of ORMLite version 4.26 we added an initial take on an object cache that can support the requested features. Here are the docs:
http://ormlite.com/docs/object-cache
In a Java web application, I would like to know whether it is a proper (or "standard"?) approach to load all the essential data, such as config data, message data, code maintenance data, dropdown option data, etc. (assuming none of it is updated frequently), from the database into "static" variables when the server starts up. Or is it preferable to retrieve the data by querying the DB per request?
Thanks for all your advice here.
It is perfectly valid to pull all the data that is not going to be modified during the application life-cycle out of the database and keep it in memory, as a singleton or something similar.
This is a good idea because it saves DB hits and retrieval is faster. A lot of environment specific settings and other data can also be pulled once and kept in an immutable hashmap for any future request.
In a common web app you generally do not have so much config or option data that it eats up a lot of memory and causes an OOM. But if you have a table with hundreds of thousands of config rows, it is better to pull objects as and when they are requested. And if you do want to keep them in memory, think of putting them in a key-value store like Memcached.
We used DB to store config values and ehcache to avoid a lot of DB hits. This way you don't need to worry about memory consumption (it will use whatever memory you have).
Ehcache is one of many available DB cache solutions and can be configured on top of JPA etc.
You can configure ehcache (or many other cache providers) to deem the tables read-only, in which case it will only go to the DB if it's explicitly told to invalidate the cache. This performs pretty well. The overhead becomes visible though when the read occurs very frequently (like 100/sec), but usually storing the config value in a local variable and avoiding reading inside loops, passing it on through the method stack during the invocation mitigates this well enough.
Storing values in a singleton as Java objects performs best, but if you want to modify them without restarting the app, it becomes a little bit involved.
Here is a simple way to achieve dynamic configuration with Java objects:
private volatile ImmutableMap<String, Object> param_value; // ImmutableMap is from Guava
Basically you'll have to start thinking about multi-threaded access, and memory issues (while it's quite unlikely that you'll run out of memory because of configuration values, unless you have binary data as config values etc.).
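Fleshed out a little, the idea might look like this. It is a sketch that swaps in the JDK's unmodifiable wrapper around a private copy instead of Guava's ImmutableMap, and DynamicConfig and reload() are names I made up:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class DynamicConfigSketch {
    static class DynamicConfig {
        // Readers see either the old or the new snapshot, never a half-updated map.
        private volatile Map<String, Object> paramValue = Collections.emptyMap();

        Object get(String key) {
            return paramValue.get(key);
        }

        // Build a fresh snapshot (e.g. from a DB query) and publish it with one volatile write.
        void reload(Map<String, Object> freshValues) {
            paramValue = Collections.unmodifiableMap(new HashMap<>(freshValues));
        }
    }
}
```

Because the map itself is never mutated after publication, readers need no locking; only the reference is replaced.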
In essence, I'd recommend using the DB and some cache provider unless that part of code really needs high-performance.