Fine grained vs coarse grained domain model In Memory Data Grid - java

I am wondering which approach is better. Should we store fine-grained entities on the grid and later construct functionally rich domain objects out of them?
Or should we construct coarse-grained domain objects and store them directly on the grid, using the fine-grained entities only for persistence?
Edit: I don't think this question has been answered completely yet. So far we have comments on Hazelcast, GemFire and Ignite. We are still missing Infinispan, Coherence, ... That is just for completeness' sake. :)

I agree with Valentin, it mainly depends on the system you want to use. Normally I would consider storing enriched domain objects directly; however, if you have only very few objects and their size is massive, you end up with bad distribution and unequal memory usage across the nodes. If your domain objects are "normally" sized and you have plenty of them, you shouldn't worry.
In Hazelcast it is better to store those objects directly, but make sure to use a good serialization mechanism, as plain Java serialization is slow. If you want to query on properties inside your domain objects, you should also consider adding indexes.
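For illustration, here is a minimal sketch of storing a domain object directly in a Hazelcast IMap and indexing a property you intend to query on, written against the Hazelcast 3.x API; the Customer class, the field and the map name are made up for this example:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.io.Serializable;

public class CustomerGridExample {

    // Illustrative domain object; in production prefer a faster serialization
    // mechanism than plain java.io.Serializable (e.g. IdentifiedDataSerializable).
    public static class Customer implements Serializable {
        public final long id;
        public final String lastName;
        public Customer(long id, String lastName) {
            this.id = id;
            this.lastName = lastName;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // The IMap distributes entries across the cluster members
        IMap<Long, Customer> customers = hz.getMap("customers");

        // Index the property we intend to query on (3.x signature: attribute name, ordered flag)
        customers.addIndex("lastName", true);

        customers.put(42L, new Customer(42L, "Smith"));
    }
}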

I believe it can differ from one data grid to another. I'm more familiar with Apache Ignite, and in that case the fine-grained approach works much better, because it's more flexible and in many cases gives better data distribution and therefore better scalability. Ignite also provides rich SQL capabilities [1] that allow you to join different entities and execute indexed searches, so you will not lose performance with a fine-grained model.
[1] https://apacheignite.readme.io/docs/sql-queries
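As a rough illustration of that fine-grained approach (a sketch only; the entities, cache names and SQL below are invented for this example, written against the Ignite Java API):

import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteFineGrainedExample {

    // Deliberately fine-grained entities; names are illustrative only
    public static class Person {
        @QuerySqlField(index = true) Long companyId;
        @QuerySqlField String name;
        Person(Long companyId, String name) { this.companyId = companyId; this.name = name; }
    }

    public static class Company {
        @QuerySqlField(index = true) Long id;
        @QuerySqlField String name;
        Company(Long id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, Person> persons = ignite.getOrCreateCache(
                new CacheConfiguration<Long, Person>("persons").setIndexedTypes(Long.class, Person.class));
            IgniteCache<Long, Company> companies = ignite.getOrCreateCache(
                new CacheConfiguration<Long, Company>("companies").setIndexedTypes(Long.class, Company.class));

            companies.put(1L, new Company(1L, "Acme"));
            persons.put(10L, new Person(1L, "Alice"));

            // Join two fine-grained entities with an indexed lookup
            SqlFieldsQuery qry = new SqlFieldsQuery(
                "select p.name, c.name from Person p, \"companies\".Company c where p.companyId = c.id");
            for (List<?> row : persons.query(qry).getAll()) {
                System.out.println(row);
            }
        }
    }
}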

One advantage of a coarse-grained object is data consistency. Everything in that object gets saved atomically. But if you split that object up into 4 small objects, you run the risk that 3 objects save and 1 fails (for whatever reason).
We use GemFire, and tend to favor coarse-grained objects...up to a point. For example our Customer object contains a list of Addresses. An alternative design would be to create one GemFire region for "Customer" and a separate GemFire region for "CustomerAddresses" and then hope you can keep those regions in sync.
The downside is that every time someone updates an Address, we re-write the entire Customer object. That's not very efficient, but our traffic patterns show that address changes are very rare (compared to all the other activity), so this works out fine.
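As a sketch of what that coarse-grained layout looks like in code (written against the Apache Geode client API, the open-source GemFire; the locator address, region name and classes are assumptions for illustration):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

public class CoarseGrainedCustomerExample {

    // Coarse-grained aggregate: the whole Customer, addresses included, is one region value
    public static class Address implements Serializable {
        String street; String city;
        Address(String street, String city) { this.street = street; this.city = city; }
    }

    public static class Customer implements Serializable {
        final String id;
        final List<Address> addresses = new ArrayList<>();
        Customer(String id) { this.id = id; }
    }

    public static void main(String[] args) {
        // Assumes a locator on localhost:10334 and a server-side region named "Customer"
        ClientCache cache = new ClientCacheFactory().addPoolLocator("localhost", 10334).create();
        Region<String, Customer> customers = cache
            .<String, Customer>createClientRegionFactory(ClientRegionShortcut.PROXY)
            .create("Customer");

        Customer customer = new Customer("c-1");
        customer.addresses.add(new Address("1 Main St", "Springfield"));

        // Updating an address means re-putting the whole aggregate
        customers.put(customer.id, customer);

        cache.close();
    }
}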
One experience we've had, though, is the downside of using Java serialization for long-term data storage. We avoid it now because of all the problems caused by object compatibility as classes change over time. Not to mention it becomes a headache for .NET clients to read the objects. :)

Related

Hibernate second-level cache with Redis - will it improve performance?

I am currently developing an application using Spring MVC 4 and Hibernate 4. I have implemented the Hibernate second-level cache for a performance improvement. If I use Redis, which is an in-memory data structure store used as a database, cache, etc., the performance will increase, but will it be a drastic change?
You may expect drastic differences if you cache what is worth caching and avoid caching data that should not be cached at all. Like beauty, performance is in the eye of the beholder. Here are several aspects you should keep in mind when using a data grid as the Hibernate second-level cache provider:
No custom serialization - memory intensive
If you use second-level caching, you will not be able to use fast serialization frameworks such as Kryo and will have to stick to Java serialization, which sucks.
On top of this, for each entity type you will have a separate region, and within each region an entry for each key of each entity.
In terms of memory, this is inefficient.
Lacks the ability to store and distribute rich objects
Most modern caches also provide compute grid functionality, and having your objects fragmented into many small pieces decreases your ability to execute distributed tasks with guaranteed data co-location. That depends a bit on the grid provider, but for many it would be a limitation.
Sub-optimal performance
Depending on how much performance you need and what type of application you have, using the Hibernate second-level cache may be a good or a bad choice: good in the sense that it is plug and play ("kind of..."), bad because you will never squeeze out the performance you could have gained. Also, designing rich models means more up-front work and more OOP.
Limited querying capabilities on the cache itself
That depends on the cache provider, but some providers are really not good at doing JOINs or WHERE clauses on anything other than the ID. If you try to build an in-memory index for such a query on Hazelcast, for example, you will see what I mean.
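For contrast, this is roughly what a WHERE-style query on a non-ID attribute looks like when the objects live directly in a Hazelcast IMap rather than in second-level cache regions (a sketch against the Hazelcast 3.x API; the Product class and field names are made up):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;
import java.io.Serializable;
import java.util.Collection;

public class GridQueryExample {

    public static class Product implements Serializable {
        public final long id;
        public final String category;
        public Product(long id, String category) { this.id = id; this.category = category; }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, Product> products = hz.getMap("products");

        // Unordered index on a non-ID field
        products.addIndex("category", false);
        products.put(1L, new Product(1L, "books"));

        // WHERE-style query on an attribute other than the key
        Collection<Product> books = products.values(Predicates.equal("category", "books"));
        System.out.println(books.size());
    }
}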
Yes, if you use Redis, it will improve your performance.
No, it will not be a drastic change. :)
https://memorynotfound.com/spring-redis-application-configuration-example/
http://www.baeldung.com/spring-data-redis-tutorial
The above links will help you find a way to integrate Redis with your project.
It depends on the traffic.
If you have 1000 or more requests per second and you are low on RAM, then yes, use Redis nodes on another machine to take some of the load. It will greatly improve your RAM usage and request speed.
But if that is not the case, then do not use it.
Remember that you can adopt this approach later, once you see what the RAM and database connection pool usage actually is.
Your question was already discussed here. Check this link: Application cache v.s. hibernate second level cache, which to use?
This was the most accepted answer, which I agree with:
It really depends on your application querying model and the traffic demands.
Using Redis/Hazelcast may yield the best performance since there won't be any round-trip to the DB anymore, but you end up having normalized data in the DB and a denormalized copy in your cache, which will put pressure on your cache update policies. So you gain the best performance at the cost of implementing the cache update whenever the persisted data changes.
Using the 2nd level cache is easier to set up, but it only stores entities by id. There is also a query cache, storing ids returned by a given query. So the 2nd level cache is a two-step process that you need to fine-tune to get the best performance. When you execute projection queries, the 2nd level object cache won't help you, since it only operates on entity loads. The main advantage of the 2nd level cache is that it's easier to keep it in sync whenever data changes, especially if all your data is persisted by Hibernate.
So, if you need ultimate performance and you don't mind implementing your cache update logic that ensures a minimum eventual consistency window, then go with an external cache.
If you only need to cache entities (that usually don't change that frequently) and you mostly access them through Hibernate entity loading, then the 2nd level cache can help you.
Hope it helps!
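To make the "stores entities by id" point concrete, here is a minimal sketch of enabling second-level caching for one entity, assuming Hibernate 4/5 with an Ehcache-backed region factory; the entity and field names are invented for illustration:

// Assumed Hibernate properties (e.g. in hibernate.cfg.xml or persistence.xml):
//   hibernate.cache.use_second_level_cache=true
//   hibernate.cache.region.factory_class=org.hibernate.cache.ehcache.EhCacheRegionFactory

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Entities are cached by id in the second-level cache; projection queries bypass it
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Customer {

    @Id
    private Long id;

    private String name;

    // getters/setters omitted for brevity
}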

Always 'Entity first' approach, when designing java apps from scratch?

I'm just reading the book here: http://www.amazon.com/Java-Architects-Handbook-Second-Edition/dp/0972954880/ trying to find a strategy for how to efficiently design a (generic) medium to large application (200 tables or more) - for instance a classic, multi-layered corporate intranet. I'm trying to adapt my past experience (as a database designer, but also in OOAD) to architecting such a Java application. From what I've read, if you define your entities first, there is no recommended way to infer your database directly (automatically).
The book says that you build the entity/object model first (OOAD) and THEN it is the DB admin's/developer's(?) job to build/infer the database (schema, normalization, etc.) based on the entity model already built. If this is the case, I'm afraid the architect/developer could lose control over important aspects - normalization, entity-attribute-value modeling, etc.
Perhaps like many older developers (back-end developers, architects, etc.) I feel more comfortable defining the database schema first - and spending a good amount of time on aspects like normalization. While this is certainly still possible nowadays, I'm asking myself whether it will become (pretty soon, if not already) the 'old-fashioned way' rather than the norm as the classic/recommended approach when designing applications from scratch.
I know Entity Framework (.NET) already has these approaches explicitly defined - 'entities first', 'database first', 'code first' - and they can be mixed if necessary. I know they recommend 'entity first' for newly designed apps, and 'database first' if you have an already defined database schema (which is the case for many older applications, when migrating, etc.). I'm just asking whether there is something similar in the Java world.
So, the questions are (although I know there is no silver bullet, etc.):
1) 'Entities first' for newly built apps - is this the norm nowadays?
2) What tools do you use (if any) to assist with the schema-inference process? Your experience, pros & cons with concrete UML tools, etc.
3) What if you have parts of an older/sub-domain database schema (which you'd mainly want to preserve)? In such a case, would you infer the entity model from the database and then refactor the model using your preferred UML tool?
4) From a labor-force perspective (say, for a DB of 200-500 tables): what is the best approach - for instance, to have two different people involved in designing the OOAD/entities and the database respectively, working together with an architect?
As you expect - my answer is it depends.
The problem is that there are so many possible flavours and dimensions to a good design you really need to take the widest view possible first.
Ask yourself some of the big questions:
Where is the core of the system? Is the database really the core, or is it actually just a persistence layer for the code? It could also be that the database is the core and the code is really just a snazzy UI on the data. There can also be a mix - where some of the tables are core along with some of the entities.
What do you see in the future? Remember that there are developments going on as we speak that are moving database technology rapidly forward. There are some databases that are entirely in RAM. Some are designed for a distributed architecture. Some are primarily cloud-based. If you build your schema first, you risk locking yourself into a certain technology.
What scale do you want to achieve? By insisting on a specific database you may be closing doors to, say, a hand-held presence.
I generally find entity-first the best initial approach, because you can always derive a schema from the entities and some metadata. It is certainly possible to go schema-first and grow the entities out of the schema, but that way you generally find the database influences the design too much.
1) I've done database-first in the past, but now I usually do entity-first, mainly because of the tools I'm using to create the applications. Entity-first has a few good advantages over trying to match your entities to an already defined schema later, and you're not locking yourself too tightly to your schema. What your application is for matters a lot as well: is it just a basic CRUD application (write once, read many), or does it actually 'do' something? That will inform your choice of how to architect it.
2) I use Hibernate a lot, which encourages creating your model first: design all your entities and then generate the schema from that. Hibernate can generate your whole schema from the models you've created (though you may need to tweak the results to make sure they're not crazy); see the sketch after this answer. If you have 200 entities in your model then you probably want to do a significant amount of UML modelling ahead of time to ensure your model is consistent.
3) If you're working with a partially legacy database then it can sometimes be good to fall into line with its schema design, so that your entities and schema are consistent. It can be a bit of a pain, but then so is trying to explain why part of your app is just different from the other parts. So yes, I would probably infer my entities from the schema in that case. But again, if the schema were totally crazy, it may be better to write some very specific DAO code to hide that part of the schema from the app and pretend it's not there.
4) I can't really give you a good answer on this, as I'm not sure what you're driving at. Once you have the design standards for your schema, it's just a case of turning the handle to crank it out.
So after all that, my answer is 'it depends'.
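Regarding point 2, a minimal sketch of the model-first flow: an annotated entity from which Hibernate can emit the DDL. The entity, property values and class names are illustrative assumptions, not a prescribed setup:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// With e.g. hibernate.hbm2ddl.auto=create (or the JPA 2.1 property
// javax.persistence.schema-generation.database.action=create),
// starting the SessionFactory/EntityManagerFactory derives and emits the table DDL
// from this mapping. The generated schema usually still needs review
// (column types, indexes, constraints) - that is the "tweaking" mentioned above.
@Entity
public class Book {

    @Id
    @GeneratedValue
    private Long id;

    private String title;

    // getters/setters omitted
}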
While the answers already posted cover a lot of points - and ultimately, all answers probably have to all sum up to "it depends" - I'd like to expand on a point that's been touched on already.
My focus is on data - I'm a business intelligence and data warehousing developer, and I deal with issues like data quality, data governance, having a set of master data, etc. To this end, I have to pull data from other systems - data which is in varying conditions.
When considering whether the core of your system is really the database or the front end (as suggested by OldCurmudgeon), I strongly suggest thinking outside of your own area. I have seen and heard about many systems where it's clear that the database has been treated as an afterthought (sometimes created via an entity-first model, but also sometimes hand-built), despite the fact that most of the business value is in the data. More and more companies are of course realising that their data is valuable and are adopting tools to make use of it - but it's difficult to do if poor transactional databases mean that data has been lost, was never saved in the first place, has been overwritten when a history is needed, or is inconsistent.
While I don't want to do myself and others with similar roles out of a job: if the data that a system you're working on holds is or might be valuable, and if there's any reason it might be accessed by anything other than the front end you're creating, then it is worth the time and effort to create a sound data model to hold it. If the system is for an organisation or is going to be sold to organisations, there's a decent chance they'll want to report out of it, will want to run output from it into a data warehouse or other data stores, and will want to carry out analysis on the data it creates and holds.
I don't know enough about tools like Hibernate to know if it's possible to both use them to work in an entity-first manner and still create a good quality database, but I know that I have come across some problematic databases created in this manner. At the very least, as has been suggested, if you are going to work that way, make sure it is producing something sane and perhaps adjust it where necessary to maintain data integrity. If data integrity is a key requirement and you cannot get such a tool to create a suitable database that will ensure data integrity, then perhaps consider going back to doing things the "old fashioned" way.
I would also suggest that there's real value in developers working alongside any data specialists, analysts, architects, etc. they may have as colleagues to do some up-front modelling, even if the system they then produce uses entity-first and even if it veers away from the more conceptual models produced early on for technical reasons. I have seen many baked-in problems in systems which have been caused by a lack of understanding of the wider business entities and relationships, and which could have been avoided if time had been spent understanding the overall structure in this way. I've been personally responsible for building those problems when I was an application developer myself, so this shouldn't be read as criticism of front-end developers - just a vote in favour of cross-functional and collaborative analysis and modelling before development approaches and designs are decided.

Java Map vs Backend database. Which is better for speed and for multithreading for relations?

My algorithm will likely not be used on the web. The object I describe may be used by multiple threads, however.
The original object I had designed emulated pointers.
Reduced, a symbol would map to multiple pointers, and each unique pointer would map to a single symbol.
When I was finally finished and had a working algorithm, it turns out I actually needed six maps in total (these maps are called tens of thousands of times).
Initial testing with a very very small sample set of symbols showed the program to be working very efficiently. However, I'm afraid that once I increase the number of symbols by a few thousand-fold it will become sluggish.
Once the program completes and closes, the pointers do not need to persist.
I was wondering if I should re-implement my algorithm using a database as a backend. Would this be better than using all of these maps?
The maps are stored in memory. The database would be stored on a hard drive (I have an SSD, so I'm afraid there will be a large difference in performance on my machine vs. a machine using SATA/PATA). The maps should also be O(1). The maps might also become very ugly once multithreading is introduced, unless I use thread-safe maps, which would slow the program down. A database would handle these tasks efficiently.
I've formally written out the proper relations, and I'm sure I can implement it in a database if that was the best option. Which is the better option?
If you do not need to persist that data structure, do not try to back it with a database. In your place, I would run some load tests with a realistic amount of data against the data structure you already have, and refine it from there if the performance is not what you expected.
Anyway, the current trend is to use relational databases on disk for persistence and to cache frequently queried data in "big hashtables" in memory for performance, so I doubt falling back to a database would improve your performance.
If your data structures fit in memory, I would be shocked if using a database would be faster (not even considering the complexity of using a database implementation). By throwing away all the assumptions, features, safety and consistency that a database must maintain, you will gain performance. Even the best DB implementation, assuming enough memory to cache everything, pretty much has a ConcurrentHashMap as an upper bound on performance. As a practical matter, you won't get CHM performance even with great caching, because a DB API will require defensive copies or cache invalidations that you can avoid with your in-memory structure.
Apart from the likely performance boost simply from using an in-memory hashmap, you may also get additional performance by tuning your structure based on your specific use case. For example, perhaps the initial lookup is multi-threaded, but individual values are only accessed by a single thread. In that case, you can avoid locking those values.
Hard drives, even fast ones, are several orders of magnitude slower than your memory. So if your goal is performance, you should stay in memory and use maps. For thread safety you can just use a ConcurrentHashMap, which uses a lock-free algorithm, so the synchronisation penalty in a multi-threaded environment should be minimal.
You should also check whether a single thread provides enough performance - multiple threads always introduce some overhead, and they need to bring enough gains to offset it.
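A rough sketch of the in-memory, thread-safe variant suggested above; the class and method names are invented to match the question's symbol/pointer description:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SymbolTable {

    // symbol -> set of pointers (one-to-many), pointer -> symbol (one-to-one)
    private final ConcurrentHashMap<String, Set<Long>> symbolToPointers = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Long, String> pointerToSymbol = new ConcurrentHashMap<>();

    public void link(String symbol, long pointer) {
        // computeIfAbsent creates the concurrent set atomically if it is missing
        symbolToPointers
                .computeIfAbsent(symbol, s -> ConcurrentHashMap.newKeySet())
                .add(pointer);
        pointerToSymbol.put(pointer, symbol);
    }

    public Set<Long> pointersOf(String symbol) {
        return symbolToPointers.getOrDefault(symbol, Collections.emptySet());
    }

    public String symbolOf(long pointer) {
        return pointerToSymbol.get(pointer);
    }
}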
You may also want to check in-memory DBs such as HyperSQL or H2 Database.

Which NoSQL Implementation is Most Appropriate?

I'm new to NoSQL, and I'm scratching my head trying to figure out the most appropriate NoSQL implementation for the application I'm trying to build.
My Java application needs to have an in-memory hashmap containing millions to billions of entries as it models a single-layer neural network. Right now we're using Trove in order to be able to use primitives as keys and values to reduce the size of the map and increase the access speed. The map is a map of maps where the outer map's keys are longs and the inner maps have long/float key/values.
We need to be able to read the saved state from disk to the map of maps when the application starts up. The changes to the map of maps need also to be saved to disk either continuously or according to some scheduled interval.
I was at first drawn towards OrientDB because of their document and object DBs, although I'm still not sure at this point what would be better. Then I came across Redis, which is a key value store and works with an in-memory dataset that can be dumped to disk, including master-slave replication. However, it doesn't look like the values of the map can be anything other than Strings.
Am I looking in the right places for a solution to my needs? Right now, I like the in-memory and master-slave aspect of Redis, but I like the object/document capabilities of OrientDB as my data structures are more complicated than simple Strings and being able to use Trove with the primitive key/value types is very advantageous. It would be better if reading was cheap and writing was expensive rather than the other way around.
Thoughts?
Why not just serialize the Trove data structures directly to disk? There appears to be some sort of support for that judging by the documentation (http://trove4j.sourceforge.net/javadocs/serialized-form.html), but it's hard to tell because it's all auto-generated cruft instead of lovingly-made tutorials. Still, for your use case it's not obvious why you need a proper database, so perhaps KISS applies.
OrientDB has the most flexible engine, with indexes, graphs, transactions and complex documents as JSON. Why not give it a try?
Check out Java-Chronicle. It's a low latency persistence library. I think you may find it offers excellent performance for this type of data.
If you'd like to use Redis for this, you'd likely be best served by using either ZSETs or HASHes as the underlying structures (Redis supports rich structures, not just string values). Unless you need to fetch parts of your maps based on the values/sorted order of the values, HASHes would probably be best (in terms of memory and speed).
So you would probably want to use long -> {long: float, ...}, that is, longs mapping to long/float maps. You can then fetch individual entries in the map with HGET, multiple entries with HMGET, or the full map with HGETALL. You can see the command reference at http://redis.io/commands
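A small sketch of that layout using the Jedis client (the client choice, key naming and values are assumptions; the answer itself only specifies the Redis commands):

import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisHashExample {
    public static void main(String[] args) {
        // Assumes a Redis server on localhost:6379 and the Jedis client on the classpath
        try (Jedis jedis = new Jedis("localhost", 6379)) {

            String outerKey = "weights:42";          // the outer long key, namespaced

            // HSET: one field per inner long key, the float stored as a string
            jedis.hset(outerKey, "1001", "0.25");
            jedis.hset(outerKey, "1002", "-0.75");

            // HGET a single entry
            String w = jedis.hget(outerKey, "1001");

            // HGETALL the full inner map
            Map<String, String> all = jedis.hgetAll(outerKey);

            System.out.println(w + " " + all);
        }
    }
}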
On the space-saving side of things, depending on the expected size of your HASHes, you may be able to tune them to use less space with limited or no negative effect on performance.
On the persistence side of things, you can either run Redis with snapshots or use incremental saving with append-only files. You can see the persistence documentation here: http://redis.io/topics/persistence
If you'd like to ask more pointed questions, you should head over to the mailing list https://groups.google.com/forum/?fromgroups=#!topic/redis-db/33ZYReULius
Redis supports more complex data structures than simple strings, such as lists, (sorted) sets and hashes, which might come in handy for your domain model. On the other hand, your neural network could leverage the rich graph capabilities of OrientDB, depending on its structure.

hibernate object vs database physical model

Is there any real issue - such as performance - when the Hibernate object model and the database physical model no longer match? Any concerns? Should they be kept in sync?
Our current system was originally designed for a low number of users, so not much effort was made to keep the physical and object models in sync. The developers went about their tasks and the architects did not monitor. Now that we are in the process of rewriting/importing the legacy system into the new system, a concern has been raised that the legacy system handles a lot of user volume and might bring the new system to its knees.
Update 20090331
From Pete's comments below - the concern was about table/data relationships in the data layer vs. the object layer. If there are no dependencies between the two, then there are no performance hits when these relationships do not match? Is that correct?
The concern, from my point of view, is that the development team spends a lot of time "tuning" the Hibernate queries/objects but does nothing at the database layer to improve the performance of the application. I would have assumed that they would tune at both layers.
Could these issues simply stem from a poor initial design of the database, with Hibernate being used to cover up or make up the difference?
(I am new to this project so playing catchup)
Update: in response to comment: It is CRUCIAL that the database be optimized in addition to the Hibernate usage. When you think about it, after all the work Hibernate does, in the end it is just querying the database. If the database doesn't perform well (wrong or missing indexes, poorly set-up tablespaces, etc.), it doesn't matter how much you tune Hibernate. On the flip side, if your database is set up well but Hibernate isn't (perhaps the caching is not configured properly and you are going back to the database a lot more than you need to), then performance will suffer as well. It is always important to tune the system end to end, but start at the foundation (the database) and work up.
End Update
I'm curious what you mean by 'don't match' - do you mean columns have been added to tables that aren't represented in the Hibernate data objects? Tables have been added? I don't think anything like that would affect performance (more likely data integrity, if you are not inserting/updating all columns).
In general, the goal of the object model should NOT be to match the database schema verbatim. You want to abstract away the underlying data complexity / joins / normalization; that is the whole point of using something like Hibernate.
So, for example, let's say you have (keeping things very simple) 'orders' and 'order items'.
Your application code should be able to do something like
order.getItems()
without having to know that underneath it is a one-to-many relationship. The details in your Hibernate mapping control how the load is done (lazy, caching, etc.).
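For illustration, a minimal sketch of the mapping behind order.getItems(), using standard JPA annotations (the table and column names are assumptions):

import java.util.ArrayList;
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import javax.persistence.Table;

@Entity
@Table(name = "orders") // "order" is a reserved word in SQL, so the table is named explicitly
public class Order {

    @Id
    private Long id;

    // The one-to-many detail lives in the mapping; callers just see order.getItems()
    @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
    private List<OrderItem> items = new ArrayList<>();

    public List<OrderItem> getItems() {
        return items;
    }
}

@Entity
@Table(name = "order_items")
class OrderItem {

    @Id
    private Long id;

    @ManyToOne
    @JoinColumn(name = "order_id")
    private Order order;
}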
If that doesn't answer your question then please provide more detail
You could of course code your abstraction layer in asm - it "might" (an awful word for a developer) be faster.
But that is premature optimization - and it may break a clean project layout.
As the Hibernate manual says, optimization can take different forms - hand-coding some parts "might" be one of them.
It's certainly possible that the changes you describe could cause performance problems.
I would have thought that this should have been part of the design spec, so that when you're coding it, you bear the performance criteria in mind.
The only way to really know, though, is to load the data into a test environment and run some tests.
This should definitely be done before going live, as it might produce some quite interesting results.
