I would like to get an idea about which data storage solution to use for a medium-sized application hosted on App Engine and written in Java (JSF 2.1).
I want to use a few tables, around 15, with a lot of interactions. Should I use the usual Datastore with JPA 2 but without many-to-many relations, or should I use an external database storage?
Google Cloud SQL seems to be the best solution with JPA 2 to persist the data; nevertheless, it's not free.
With the Datastore and JPA 2, we can't create many-to-many relations, but can't we do that with two "one to many" relations?
For example:
A plane and some passengers. A plane can carry many passengers, and a passenger can use many planes.
We can translate it to the relation: PLANE many to many PASSENGER. And in the Datastore we should store it like PLANE one to many TICKET many to one PASSENGER.
Thanks a lot for your answers :))
If you prefer a free Java AppEngine data storage solution, seriously consider AppEngine Datastore. And if you're open to opinions (sorry, not really on-topic on StackOverflow), my impression of JDO and JPA as abstraction layers on top of the Low Level Datastore API is that they are not worth using, especially as application complexity increases. Read previous StackOverflow questions about these APIs to get an impression of the kinds of problems you might encounter. At least the Low Level Datastore API gives you contact with the code that is going to eventually do the work anyway.
The hardest work in my opinion would be to redesign your application features that are based on relational assumptions (again see relevant SO questions). You are clear that many to many relations are not feasible with Datastore. The NoSQL approach might cause duplication of data (redundancy instead of normal form) and eventual consistency. That might clash with your SQL expectations but don't worry, the change is worth it. Your example of a ticket being related to a passenger and also a flight would be fine with three Kinds in the Datastore. Move the relational constraints into the application code, for example do not delete a passenger if a ticket for that passenger exists.
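To make the three-Kinds idea a bit more concrete, here is a minimal in-memory sketch in plain Java. It does not use the Datastore API at all: the class `FlightStore` and its methods are hypothetical names, and the maps merely stand in for the Plane, Passenger and Ticket kinds. It shows a ticket storing only the keys of the two entities it links, and the "do not delete a passenger with tickets" rule living in application code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for three kinds: Plane, Passenger and Ticket (hypothetical names). */
final class FlightStore {
    /** A ticket holds only the keys of the plane and the passenger it links. */
    static final class Ticket {
        final long planeId;
        final long passengerId;

        Ticket(long planeId, long passengerId) {
            this.planeId = planeId;
            this.passengerId = passengerId;
        }
    }

    private final Map<Long, String> planes = new HashMap<>();
    private final Map<Long, String> passengers = new HashMap<>();
    private final List<Ticket> tickets = new ArrayList<>();

    void addPlane(long id, String name) { planes.put(id, name); }

    void addPassenger(long id, String name) { passengers.put(id, name); }

    void addTicket(long planeId, long passengerId) {
        // Referential check done by the application, not by the store.
        if (!planes.containsKey(planeId) || !passengers.containsKey(passengerId)) {
            throw new IllegalArgumentException("unknown plane or passenger");
        }
        tickets.add(new Ticket(planeId, passengerId));
    }

    /** Many-to-many read: follow the tickets from a plane to its passengers. */
    List<String> passengersOf(long planeId) {
        List<String> names = new ArrayList<>();
        for (Ticket t : tickets) {
            if (t.planeId == planeId) {
                names.add(passengers.get(t.passengerId));
            }
        }
        return names;
    }

    /** Refuses to delete a passenger that is still referenced by a ticket. */
    boolean deletePassenger(long passengerId) {
        for (Ticket t : tickets) {
            if (t.passengerId == passengerId) {
                return false;
            }
        }
        return passengers.remove(passengerId) != null;
    }

    public static void main(String[] args) {
        FlightStore store = new FlightStore();
        store.addPlane(1L, "A320");
        store.addPassenger(10L, "Alice");
        store.addPassenger(11L, "Bob");
        store.addTicket(1L, 10L);
        System.out.println(store.passengersOf(1L));      // [Alice]
        System.out.println(store.deletePassenger(10L));  // false: Alice still has a ticket
        System.out.println(store.deletePassenger(11L));  // true: Bob has no ticket
    }
}
```

In a real Datastore application the loops would be replaced by queries filtering Ticket entities on the stored key, but the division of responsibilities would be the same.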
I'm just reading the book here: http://www.amazon.com/Java-Architects-Handbook-Second-Edition/dp/0972954880/ trying to find a strategy about how to efficiently design a (generic) medium to large application (200 tables or more) - for instance a classic, multi-layered, corporate intranet. I'm trying to adapt my past experience (as a database designer, but also OOAD) in order to architect such a java application. From what I've read, if you define your entities first, there is no recommended way to infer your database directly (automatically).
The book says that you would build the entity/object model first (OOAD) and THEN there is the db admin/dev.(?) job to build/infer the database (schema, normalization etc.) based on the entity model already built. If this is the case, I'm afraid the architect/developer could lose control over important aspects - normalization, entity-attribute-value modeling etc.
Perhaps like many older developers (back-end developers, architects etc) I feel more comfortable defining the database schema first - and spending a good amount of time on aspects like normalization etc. While this would be certainly possible nowadays, I'm asking myself if this would become (pretty soon, if not already) the 'old fashioned way' and not the norm - as a classic/recommended approach when designing applications from scratch.
I know Entity Framework (.NET) already has these approaches explicitly defined - 'entities first', 'database first', 'code first' - and these could be mixed, if necessary. I surely know that they recommend 'entity first' for newly designed apps, and 'database first' if you have an already defined database schema (which is the case for many older applications, when migrating etc.). I'm just asking if there is something similar for the Java world.
So, the questions are: (although I know there is no silver bullet etc.)
'Entities first' for newly built apps - this is the norm nowadays?
What tools do you use (if any) to assist the process of inferring the db schema? - your experience, pros & cons with concrete UML tools etc.
What if you have parts/older/sub-domain database schema (which you'd want to preserve, mainly)? In such a case, would you infer the entities model from the database and then refactor the model using your preferred UML tool?
From a labor force perspective (let's say for a db of 200-500 tables): what is the best approach? For instance, to have 2 different people involved in designing OOAD/entities and database respectively, working together with an architect?
As you expect - my answer is it depends.
The problem is that there are so many possible flavours and dimensions to a good design you really need to take the widest view possible first.
Ask yourself some of the big questions:
Where is the core of the system? Is the database really the core or is it actually just a persistence layer for the code? It could also perhaps be that the database is the core and the code is really just a snazzy UI on the data. There can also be a mix - where some of the tables are core along with some of the entities.
What do you see in the future? Remember that there are developments going on as we speak that are moving database technology rapidly forward. There are some databases that are all in-ram. Some are designed for a distributed architecture. Some are primarily cloud. If you build your schema first you risk locking yourself in to a certain technology.
What scale do you want to achieve? By insisting on a specific database you may be closing doors to perhaps hand-held presence.
I generally find entity first as the best initial approach because you can always derive a schema from the entities and some meta-data. It is certainly possible to go schema first and grow the entities out of the schema but that way you generally find the database influences the design too much.
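As a toy illustration of deriving a schema from entities plus some meta-data, the sketch below builds a `CREATE TABLE` statement from a plain Java class via reflection. This is not how any particular ORM does it: the class `SchemaSketch`, the type mapping, and the convention that a field named `id` is the primary key are all assumptions made for the example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SchemaSketch {
    /** Example entity; the class and its fields are purely illustrative. */
    static final class Passenger {
        long id;
        String name;
        int age;
    }

    /** Maps a Java field type to an SQL column type (simplified assumption). */
    static String sqlType(Class<?> type) {
        if (type == long.class || type == Long.class) return "BIGINT";
        if (type == int.class || type == Integer.class) return "INTEGER";
        if (type == String.class) return "VARCHAR(255)";
        throw new IllegalArgumentException("unmapped type: " + type);
    }

    /** Builds a CREATE TABLE statement from the entity's instance fields. */
    static String createTable(Class<?> entity) {
        Field[] fields = entity.getDeclaredFields();
        // Reflection does not guarantee field order, so sort by name for a stable result.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        List<String> columns = new ArrayList<>();
        for (Field f : fields) {
            // Skip compiler-generated and static fields; only instance state is mapped.
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) continue;
            String column = f.getName() + " " + sqlType(f.getType());
            // Convention assumed here: a field named "id" is the primary key.
            if (f.getName().equals("id")) column += " PRIMARY KEY";
            columns.add(column);
        }
        return "CREATE TABLE " + entity.getSimpleName().toUpperCase()
                + " (" + String.join(", ", columns) + ")";
    }

    public static void main(String[] args) {
        // Prints: CREATE TABLE PASSENGER (age INTEGER, id BIGINT PRIMARY KEY, name VARCHAR(255))
        System.out.println(createTable(Passenger.class));
    }
}
```

The point is only that the entity model carries enough information to generate a first schema; normalization decisions, constraints and indexes would still need review by whoever owns the database.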
1) I've done database first in the past but now I usually do entity first, mainly because of the tools I'm using to create the applications. Entity first has a few good advantages over trying to match your entities to your defined schema later. You're also not locking yourself too tightly to your schema. What your application is for matters a lot as well: whether it's just a basic CRUD application, write once read many, or whether it actually 'does' something will inform your choice of how to architect your application.
2) I use Hibernate a lot, which encourages creating your model first, designing all your entities etc. and then generating the schema from that. Hibernate can generate your whole schema from the models you've created (though you may need to tweak them to make sure they're not crazy). If you have 200 entities in your model then you probably want to do a significant amount of UML modelling ahead of time to ensure your model is consistent.
3) If you're working with a partially legacy database then it can sometimes be good to fall into line with the schema design for that, so your entities and schema are consistent. It can be a bit of a pain, but then so is trying to explain why part of your app is just different to other parts. So yes, I would probably infer my entities from the schema in that case. But again, if it was totally crazy then it may be better to write some very specific DAO code to hide that part of the schema from the app and pretend it's not there.
4) I can't really give you a good answer on this as I'm not sure what you're driving at really. Once you have the design standards for your schema it's turning the handle to crank it out.
So after all that my answer is 'It depends'
While the answers already posted cover a lot of points - and ultimately, all answers probably have to all sum up to "it depends" - I'd like to expand on a point that's been touched on already.
My focus is on data - I'm a business intelligence and data warehousing developer, and I deal with issues like data quality, data governance, having a set of master data, etc. To this end, I have to pull data from other systems - data which is in varying conditions.
When considering whether the core of your system is really the database or the front end (as suggested by OldCurmudgeon), I strongly suggest thinking outside of your own area. I have seen and heard about many systems where it's clear that the database has been treated as an afterthought (sometimes created via an entity-first model, but also sometimes hand-built), despite the fact that most of the business value is in the data. More and more companies are of course realising that their data is valuable and are adopting tools to make use of it - but it's difficult to do if poor transactional databases mean that data has been lost, was never saved in the first place, has been overwritten when a history is needed, or is inconsistent.
While I don't want to do myself and others with similar roles out of a job: If the data in a system you're working on is or might be valuable, if there's any reason it might be accessed by anything other than the front end you're creating, then it is worth the time and effort to create a sound data model to hold it. If the system is for an organisation or is going to be sold to organisations, there's a decent chance they'll want to report out of it, will want to run output from it into a data warehouse or other data stores, and will want to carry out analysis on the data it creates and holds.
I don't know enough about tools like Hibernate to know if it's possible to both use them to work in an entity-first manner and still create a good quality database, but I know that I have come across some problematic databases created in this manner. At the very least, as has been suggested, if you are going to work that way, make sure it is producing something sane and perhaps adjust it where necessary to maintain data integrity. If data integrity is a key requirement and you cannot get such a tool to create a suitable database that will ensure data integrity, then perhaps consider going back to doing things the "old fashioned" way.
I would also suggest that there's real value in developers working alongside any data specialists, analysts, architects, etc. they may have as colleagues to do some up-front modelling, even if the system they then produce uses entity-first and even if it veers away from the more conceptual models produced early on for technical reasons. I have seen many baked-in problems in systems which have been caused by a lack of understanding of the wider business entities and relationships, and which could have been avoided if time had been spent understanding the overall structure in this way. I've been personally responsible for building those problems when I was an application developer myself, so this shouldn't be read as criticism of front-end developers - just a vote in favour of cross-functional and collaborative analysis and modelling before development approaches and designs are decided.
I'm starting to build a new Spring-based multi-user document management application and I would like to venture into the world of NoSQL/MongoDB. Coming from a RDBMS background, I have several concerns with MongoDB, primarily:
Lack of transactions
More focused on performance/scalability than data integrity
Lack of a JPA standard
To start with, I do not expect high loads or massive reads vs writes. I suspect the ratio of reads to writes will be about 10 to 1.
1) From what I can tell, there is no easy way to do multi-collection transactions. Where in a RDBMS I can easily have a per-user document ID counter maintained in a separate table, there does not seem to be a way to do this reliably in MongoDB given that it would be in a separate collection/document. Consequently, I'm not sure if/how one resolves this problem.
2) Additionally, from what I have read, NoSQL is great where data integrity isn't the primary concern (ex: blog comments, etc). However, I'm not sure how this translates to being the primary data store for an application. Does this mean that one can update a document and have it fail? I ran across an older unaccredited rant which discusses failed commits etc., which further fuels these concerns.
3) The seeming lack of a JPA-like standard for NoSQL would imply that I have to choose my DB and stick with it. Unlike JPA, where I can easily swap one DB vendor for another JPQL/SQL compliant vendor, I have to code with MongoDB in mind and redesign my structure/queries if ever I wanted to switch to another NoSQL DB. I've seen Hibernate OGM, but it seems that it is still very much in its infancy and only provides rudimentary support. Definitely not something that would avoid MongoDB-specific queries.
Are these concerns easily mitigated? Being new to the NoSQL world, I'm still having trouble understanding the right business case when to use NoSQL.
These are good questions. Here's my 2 cents about MongoDB and some references to help you learn more. I won't speak about any other NoSQL thingies as there are a lot out there and there's no real unifying principle to NoSQL other than "it doesn't use SQL", except sometimes people make it work with SQL, so, yeah.
MongoDB does not do joins. Period. MongoDB does not have transactions - whether within one collection or involving multiple collections. The unit of atomicity is the document. How does this work in an application? Via schema design and some techniques for recovering parts of ACID semantics if necessary, for example using two-phase commits. In relational databases, schema design is straightforward and is based on the structure of the data and not its use case. Joins and transactions fill in the gap between the abstract, normalized data representation and the concrete ways the data will be used. The data modeling intro already linked explains the situation for MongoDB, for contrast:
The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
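As a rough illustration of what "the unit of atomicity is the document" means for the per-user ID counter from the question: if each user's counter lives in a single document, one atomic update on that document is enough and no multi-collection transaction is needed. In MongoDB this is commonly done with an atomic find-and-modify using the `$inc` operator. The sketch below only imitates that behavior in memory with plain Java; `DocumentIdCounters` and `nextDocumentId` are hypothetical names and no MongoDB driver is involved.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a "counters" collection with one document per user. */
final class DocumentIdCounters {
    private final ConcurrentMap<String, Long> counters = new ConcurrentHashMap<>();

    /** Atomically increments the user's counter and returns the new value. */
    long nextDocumentId(String userId) {
        // merge() is applied atomically per key, playing the role of a single-document update.
        return counters.merge(userId, 1L, Long::sum);
    }

    public static void main(String[] args) {
        DocumentIdCounters ids = new DocumentIdCounters();
        System.out.println(ids.nextDocumentId("alice")); // 1
        System.out.println(ids.nextDocumentId("alice")); // 2
        System.out.println(ids.nextDocumentId("bob"));   // 1
    }
}
```

What this does not cover is making the counter increment and the insert of the new document succeed or fail together; that is where schema design or a two-phase commit style technique comes in.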
That specific "rant" is clearly very old as it talks about writes being unacknowledged by default. This isn't the case anymore. Given any distributed computer system operating over a network, it's pretty easy to come up with a way for it to behave poorly. The MongoDB blog covered a lot of this stuff in a series on consistency. I'd suggest touring the docs about journaling, replication, and write concern and see if that makes you feel better about MongoDB as a primary data store.
Yup. This comes with the NoSQL territory. What doesn't exist is common data access languages or standards because everything is new and trying to be different. Check back in 30 years.
I would like to know -an example is highly appreciated-
How to model relationships in Google App Engine for Java?
-One to Many
-Many to Many
I searched all over the web and found nothing about Java; all guides and tutorials are about Python.
I understood from this article that in Python the relationships are modeled using ReferenceProperty. However, I found nothing about this class in the Javadoc reference.
Furthermore, in this article they discussed the following:
there's currently a shortage of tools for Java users, largely due to the relative newness of the Java platform for App Engine.
However, that was written in 2009.
In the end, I ended up modeling the relationships using the ancestor path of each entity. I discovered afterwards that this approach has problems and limits the scalability of the app.
Can you please guide me to the equivalent Java class to the Python's ReferenceProperty class? Or can you please give me an example of how to model the relationships in AppEngine using the java datastore low-level API.
Thanks in advance for your help.
Creating relationships between entities in GAE/J depends on db API that you are using:
JDO: entity relationships.
JPA: see docs.
Objectify: single-value relationships.
Low-level API: add a Key of one Entity as a property to another Entity: see property types.
Just a tip. When defining your data model think in terms of end-user queries and define your data model accordingly.
For example, let's take the example of a store renting books. In a traditional application, you would have three main entities :
--> Book
--> Client
--> Rent (to solve the many-to-many)
To display a report with which client is renting which book, you would issue a query joining on the Rent table, Book table and client table.
However, in GAE that won't work because the join operation is not supported.
The solution I found (there may be others) is to model with the same three tables but to embed the book and client definitions in the Rent table.
This way, displaying the list of books being rented, and by whom, is extremely fast and inexpensive. The only drawback is that if, for example, the title of a book changes, I have to go through all the embedded objects. However, how often does that happen compared to read-only queries?
As a summary, think in terms of end-user queries
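The trade-off described above can be sketched in plain Java. This is an in-memory illustration only, not Datastore code: `RentalStore`, `Rent` and the method names are hypothetical. Each Rent row carries its own copy of the book title and client name, so the report needs no join, while a title change has to touch every embedded copy.

```java
import java.util.ArrayList;
import java.util.List;

final class RentalStore {
    /** A Rent row with embedded copies of the book and client data it needs. */
    static final class Rent {
        final long bookId;
        String bookTitle;         // embedded copy of the book's title
        final String clientName;  // embedded copy of the client's name

        Rent(long bookId, String bookTitle, String clientName) {
            this.bookId = bookId;
            this.bookTitle = bookTitle;
            this.clientName = clientName;
        }
    }

    private final List<Rent> rents = new ArrayList<>();

    void rent(long bookId, String bookTitle, String clientName) {
        rents.add(new Rent(bookId, bookTitle, clientName));
    }

    /** Report built from the Rent rows alone; no join is needed. */
    List<String> report() {
        List<String> lines = new ArrayList<>();
        for (Rent r : rents) {
            lines.add(r.clientName + " rents " + r.bookTitle);
        }
        return lines;
    }

    /** Cost of denormalization: a title change must update every embedded copy. */
    int renameBook(long bookId, String newTitle) {
        int updated = 0;
        for (Rent r : rents) {
            if (r.bookId == bookId) {
                r.bookTitle = newTitle;
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        RentalStore store = new RentalStore();
        store.rent(1L, "Dune", "Ann");
        store.rent(1L, "Dune", "Bob");
        store.rent(2L, "Emma", "Ann");
        System.out.println(store.report());
        System.out.println(store.renameBook(1L, "Dune (2nd ed.)")); // 2 rows had to be rewritten
    }
}
```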
Giving an analogy: a Twitter-like scenario wherein a person can be followed by a huge number of people (one-to-many).
A few options I could think of:
Use some OR mapping tool with lazy loading. But when you access the "followers" side of the relation, it will still load all the data, even though lazily. So not a suitable option.
Do not maintain one-to-many relation (or not use any OR mapping) . Fetch the "Followers" side in separate call and handle the paging etc programmatically.
Offload Fetching of large data to some search stack (Lucene/Solr) which can better handle large data. But this will introduce some latency between database update and index update.
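To make the second option (fetching the "followers" side in a separate call and paging programmatically) more concrete, here is a minimal sketch of keyset paging in plain Java. The sorted set stands in for an indexed column; with MySQL the equivalent would be something like `SELECT follower_id FROM followers WHERE user_id = ? AND follower_id > ? ORDER BY follower_id LIMIT ?`, where the table and column names are assumptions. `FollowerPager` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class FollowerPager {
    // Sorted follower ids of one user; stands in for an indexed follower_id column.
    private final NavigableSet<Long> followerIds = new TreeSet<>();

    void addFollower(long id) {
        followerIds.add(id);
    }

    /** Returns up to pageSize follower ids strictly greater than afterId. */
    List<Long> page(long afterId, int pageSize) {
        List<Long> result = new ArrayList<>();
        for (Long id : followerIds.tailSet(afterId, false)) {
            if (result.size() == pageSize) {
                break;
            }
            result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        FollowerPager pager = new FollowerPager();
        for (long id = 1; id <= 5; id++) {
            pager.addFollower(id);
        }
        long cursor = 0;
        List<Long> page;
        // Walk through all followers two at a time, using the last id seen as the cursor.
        while (!(page = pager.page(cursor, 2)).isEmpty()) {
            System.out.println(page);
            cursor = page.get(page.size() - 1);
        }
    }
}
```

Using the last id seen as the cursor, rather than an offset, keeps each page cheap no matter how deep into the list the client has paged.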
Please share your thoughts/suggestions and any possible tools library. Stack consists of Java , MySQL.
Millions should not be a problem for an RDBMS as it is designed for those situations.
Sometimes it is also recommended to denormalize rather than normalize to optimize the performance of your application. This is specifically for applications that have very high read and very low write statistics.
I was asked to have a look at a legacy EJB3 application with significant performance problems. The original author is not available anymore so all I've got is the source code and some user comments regarding the unacceptable performance. My personal EJB3 skills are pretty basic; I can read and understand the annotated code but that's all for now.
The server has a database, several EJB3 beans (JPA) and a few stateless beans just to allow CRUD on 4..5 domain objects for remote clients. The client itself is a java application. Just a few are connected to the server in parallel. From the user comments I learned that
the client/server app performed well in a LAN
the app was practically unusable on a WAN (1MBit or more) because read and update operations took much too long (up to several minutes)
I've seen one potential problem - on all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations, is it advisable to start tuning with the fetching strategies?
But that would not explain performance issues on update operations, or would it? Update is handled by an EntityManager, the client just passes the domain object to the manager bean and persisting is done with nothing but manager.persist(obj). Maybe the domain objects that are sent to the server are just too big (maybe a side effect of the EAGER strategy).
So my actual theory is that too many bytes are sent over a rather slow network and I should look at reducing the size of result sets.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
On all EJB, all relations have been defined with the fetching strategy FetchType.EAGER. Would that explain the performance issues for read operations?
Depending on the relations between classes, you might be fetching much more (the whole database?) than actually wanted when retrieving entities.
is it advisable to start tuning with the fetching strategies?
I can't say that making all relations EAGER is a very standard approach. In my experience, you usually keep them lazy and use "Fetch Joins" (a type of join that fetches an association) when you want to eager load an association for a given use case.
But that would not explain performance issues on update operations, or would it?
It could. I mean, if the app is retrieving a big fat object graph when reading and then sending the same fat object graph back to update just the root entity, there might be a performance penalty. But it's kinda weird that the code is using em.persist(Object) to update entities.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
The obvious ones include:
Retrieving more data than required
N+1 requests problems (bad fetching strategy)
Poorly written JPQL queries
Inappropriate inheritance strategies
Unnecessary database hits (i.e. lack of caching)
I would start with writing some integration tests or functional tests before touching anything to guarantee you won't change the functional behavior. Then, I would activate SQL logging and start to look at the generated SQL for the major use cases and work on the above points.
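The N+1 item in the list above is the kind of thing that shows up immediately once SQL logging is on. The sketch below imitates it without JPA: a fake database counts the queries it receives, first when orders are loaded one customer at a time (as a lazily loaded association touched in a loop would do), then when everything arrives in a single fetch (as a fetch join would do). All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NPlusOneDemo {
    /** Fake database that counts every query it receives. */
    static final class FakeDb {
        int queryCount = 0;
        private final Map<Long, List<String>> ordersByCustomer = new HashMap<>();

        FakeDb() {
            ordersByCustomer.put(1L, Arrays.asList("order-a", "order-b"));
            ordersByCustomer.put(2L, Arrays.asList("order-c"));
            ordersByCustomer.put(3L, Arrays.asList("order-d"));
        }

        /** One query returning only the customer ids. */
        List<Long> findCustomerIds() {
            queryCount++;
            return new ArrayList<>(ordersByCustomer.keySet());
        }

        /** One query per customer, as a lazy association accessed in a loop would issue. */
        List<String> findOrders(long customerId) {
            queryCount++;
            return ordersByCustomer.get(customerId);
        }

        /** One query returning customers together with their orders. */
        Map<Long, List<String>> findCustomersWithOrders() {
            queryCount++;
            return new HashMap<>(ordersByCustomer);
        }
    }

    /** N+1 pattern: 1 query for the ids, then 1 more per customer. */
    static int countOrdersOneByOne(FakeDb db) {
        int total = 0;
        for (long id : db.findCustomerIds()) {
            total += db.findOrders(id).size();
        }
        return total;
    }

    /** Single-fetch pattern: everything arrives in one query. */
    static int countOrdersInOneFetch(FakeDb db) {
        int total = 0;
        for (List<String> orders : db.findCustomersWithOrders().values()) {
            total += orders.size();
        }
        return total;
    }

    public static void main(String[] args) {
        FakeDb first = new FakeDb();
        countOrdersOneByOne(first);
        System.out.println(first.queryCount);  // 4: one for the ids plus one per customer

        FakeDb second = new FakeDb();
        countOrdersInOneFetch(second);
        System.out.println(second.queryCount); // 1
    }
}
```

Over a WAN every extra query also pays a network round trip, which is one reason the same code can feel fine on a LAN and unusable remotely.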
From DBA position.
From your experience, what are the typical and most common coding errors that lead to performance issues on CRUD operations, where should I start investigating/optimizing?
Turn off caching
Enable SQL logging. EJB3/Hibernate generates by default a lot of extremely stupid queries.
Now You see what I mean.
Change FetchType.EAGER to FetchType.LAZY
Say "no" for big business logic between em.find em.persist
Use ehcache http://ehcache.org/
Turn on entity cache
If You can, make primary keys immutable (@Column(updatable = false, ...))
Turn on query cache
Never ever use Hibernate if You want big performance:
In my case a similar performance problem did not depend on the fetch strategy. Or let's say it was not really possible to change the business logic in the existing fetch strategies. In my case the solution was simply adding indices.
When your JPA object model has a lot of relationships (OneToOne, OneToMany, ...) you will typically use JPQL statements with a lot of joins. This can result in complex SQL translations. When you take a look at the data model (generated by JPA) you will recognize that there are no indices for any of your table columns.
For example, if you have a Customer and an Address object with a OneToOne relationship, everything will work well at first glance. Customer and Address have a foreign key. But if you do selections like this
Select c from Customer as c where c.address.zip='8888'
you need to take care of the column 'zip' in the table ADDRESS. JPA will not create such an index for you during deployment. So in my case I was able to speed up the database performance by simply adding indices.
An SQL Statement in your database looks like this:
ALTER TABLE `mydatabase`.`ADDRESS` ADD INDEX `zip_index`(`IZIP`);
In the question, and in the other answers, I'm hearing a lot of "might"s and "maybe"s.
First find out what's going on. If you haven't done that, we're all just poking in the dark.
I'm no expert on this kind of system, but this method works on any language or OS.
When you find out what's making it take too long, why don't you summarize it here?
I'm especially interested to know if it was something that might have been guessed.