Fast and safe way to save and retrieve data in Java [closed]

In order to add data persistence to a billing application, I wonder what is the best way to save and retrieve data in my situation.
I work with a JavaFX TableView populated by custom objects (with many Strings, ints, booleans, ...), each one representing one bill. The user must be able to add, read and edit data on the fly. Everything is stored locally; there is no need for a cloud service or anything similar.
I usually use serialization to write my objects, but is that a safe and fast way to store around 10,000 custom objects?
Should I use XML, serialization, a local database (JavaDB?), or something else?
By fast, I mean that the user can write and edit data responsively. I have no problem with a small loading time when the app is launched.
By safe, I don't mean encrypted; I mean safe in the "data won't get lost or corrupted" sense.
Finally, if there are multiple solutions, why choose one over another?

Any persistence mechanism (flat file, relational DB, NoSQL) can be safe if used as designed, or unsafe if abused or misunderstood. Your question is very open-ended and can get very involved or stay very light; it all depends.
Typically the choices come down to:
flat file (say binary serialisation, CSV, JSON or XML). A very simple mechanism, but it takes effort to scale to large files, and care must be taken when making changes to the code base, as changes could prevent older files from being readable. One also has to bear in mind when the data is written relative to changes coming in from the user, and the possibility of a machine crash: there are no transactions, so data can become corrupted, which is not a simple topic in its own right. As for which format is best, many a religious war has been fought over that, but a textual format (JSON, XML or CSV) has the advantage of being human readable, which helps debugging and maintenance tasks. XML and JSON support nested structures, which is an advantage over CSV. As for performance, text parsing typically slows things down by about 10x compared to a binary format. However, there are fast implementations and slow ones, and for 10k objects you are unlikely to notice the difference. (See the sketch at the end of this answer for a crash-safe flat-file approach.)
relational database. Very useful for apps that benefit from relational queries (SQL), and a lot of effort has gone into making them transactional and robust against machine crashes. They are generally the persistence mechanism of choice for large businesses, and they require some knowledge to set up and maintain. H2 is a very simple, low-cost entry point, and Oracle is at the other extreme of the spectrum. Relational databases suffer from a domain mismatch: object design and SQL design do not map onto each other without some effort from the developer. They also typically suffer from scaling problems, as they are not usually clusterable; not a problem for 10k rows.
NoSQL databases (e.g. Redis, Cassandra, Mongo, Couch, Neo4j). Generally not transactional, but they are often faster than relational DBs and offer clustering from the get-go, making them very robust. They also support different data-modelling paradigms such as graph, list and document, making the NoSQL landscape much richer than the relational SQL one.
I assume that you are not working on a professional project and lack a mentor, so I will wrap up by suggesting that you focus on flat files first and then pick a DB product of some kind to experiment with (H2 is very good for learning relational products, and Mongo or Redis for ease of learning one of the NoSQL products).
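To make the flat-file route concrete, here is a minimal sketch of crash-safe Java serialisation. The Bill class is a stand-in for your own Serializable bill object; the trick is writing to a temporary file and renaming it atomically, so a crash mid-write never corrupts the previous save:
import java.io.*;
import java.nio.file.*;
import java.util.List;

public class BillStore {
    // Write the whole list to a temp file, then atomically replace the old
    // file, so a crash mid-write cannot corrupt the existing data.
    public static void save(List<Bill> bills, Path target) throws IOException {
        Path tmp = Files.createTempFile(target.getParent(), "bills", ".tmp");
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(bills);
        }
        Files.move(tmp, target,
                   StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    @SuppressWarnings("unchecked")
    public static List<Bill> load(Path target)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(target))) {
            return (List<Bill>) in.readObject();
        }
    }
}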

You can use the H2 database (http://www.h2database.com/); it's a really convenient way to store data, and you can use it as an embedded database, which looks like this:
import java.sql.Connection;
import java.sql.DriverManager;

Class.forName("org.h2.Driver"); // optional with JDBC 4+ drivers
Connection conn = DriverManager.getConnection("jdbc:h2:~/test", "sa", "");
// add application code here
conn.close();
H2 creates a database file in the user's home directory (e.g. test.mv.db, or test.h2.db with older versions).

Object serialization is safe. It's not particularly fast, and you have to be very careful about how you make changes to the class to ensure that you can consistently deserialize old data. In my opinion, this is the biggest disadvantage of object serialization.
XML (or JSON) isn't bad either. There are binding technologies like JAXB, Jackson or Gson which allow you to seamlessly map objects to XML or JSON. Permissive binding makes these formats easier to use than object serialization, with the additional benefit of being human readable and editable, but at the cost of being more verbose (consider file compression). If your storage format is a giant XML file, you can also search for records using XPath.
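For illustration, a minimal Jackson sketch; the Bill class and the file name are assumptions, and Jackson will map any bean with getters and setters:
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.List;

ObjectMapper mapper = new ObjectMapper();

// Write the table's backing list as one JSON array.
mapper.writeValue(new File("bills.json"), bills);

// Read it back; the explicit collection type preserves the element class.
List<Bill> loaded = mapper.readValue(new File("bills.json"),
        mapper.getTypeFactory().constructCollectionType(List.class, Bill.class));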
JavaDB (or H2 or SQLite) is good in that it implements a relational data store, so you can perform SQL queries on the data. Managing lots of records is much more straightforward with a proper database. You could probably save on disk space, too. I would recommend this approach.
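As a rough sketch of what SQL buys you over scanning objects by hand (the table and column names here are invented for illustration):
import java.sql.*;

try (Connection conn = DriverManager.getConnection("jdbc:h2:~/billing", "sa", "");
     Statement st = conn.createStatement()) {
    st.execute("CREATE TABLE IF NOT EXISTS bill(" +
               "id IDENTITY PRIMARY KEY, customer VARCHAR(255), " +
               "amount DECIMAL(10,2), paid BOOLEAN)");

    // One query instead of iterating 10,000 in-memory objects.
    try (ResultSet rs = st.executeQuery(
             "SELECT customer, SUM(amount) FROM bill " +
             "WHERE paid = FALSE GROUP BY customer")) {
        while (rs.next()) {
            System.out.println(rs.getString(1) + " owes " + rs.getBigDecimal(2));
        }
    }
}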
Will there be multiple clients reading these files? In that case, for safety, you would have to implement some kind of file locking scheme to prevent data corruption. You can get around this with some kind of out-of-process data management, using a lightweight server such as H2 or one of the NoSQL datastores like Mongo or Cassandra.

Related

What is the correct way to structure this kind of data in Firestore?

I have seen the videos and read the documentation for Cloud Firestore, Google Firebase's service, but I can't figure this out coming from the Realtime Database.
I have in mind a web app in which I want to store my providers for different categories of products. I want to perform a search query through all my products to find which providers I have for a given product, and eventually access that provider's info.
I am planning to use this structure for this purpose:
Providers (collection)
    Provider 1 (document)
        Name
        City
        Categories
    Provider 2
        Name
        City
Products (collection)
    Product 1 (document)
        Name
        Description
        Category
        Provider ID
    Product 2
        Name
        Description
        Category
        Provider ID
So my question is: is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the Realtime Database: using the provider ID, I could search for that provider in the providers section. But with Firestore I am not sure if it's possible, or if this is the right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "best" or "correct" solution for structuring a Cloud Firestore database. The best and correct solution is the one that fits your needs and makes your job easier. Bear in mind also that there is no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app may be insufficient for another; there is no correct solution for everyone. An effective structure for a NoSQL database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first one would be to keep a reference of the provider in the product object (as you already do) or to copy the entire provider object within the product document. This last technique is called denormalization and is a quite common practice when it comes to Firebase. So we often duplicate data in NoSQL databases, to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.
Also, when you are duplicating data, there is one thing to keep in mind: in the same way you add data, you need to maintain it. In other words, if you want to update or delete a provider object, you need to do it in every place where it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to duplicate or simply keep as references:
Is the data static or will it change over time?
If it does change, do you need to update every duplicated instance of the data so they all stay in sync? This is what I mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync at the same time, then you might have a hard time in the future keeping all those duplicates up to date. It might also mean you spend a lot of money keeping all those documents fresh, as each change requires a read and a write for every document holding a copy. In this case, keeping only references is the winning option.
With the reference approach, you write very little duplicated data (pretty much just the Provider ID). That means your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load documents from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but it definitely requires more code and more API calls.
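A rough sketch of that two-step read with the Android Firestore SDK; the collection and field names follow the structure above, the "paint" category is invented, and error handling is omitted:
import android.util.Log;
import com.google.firebase.firestore.DocumentSnapshot;
import com.google.firebase.firestore.FirebaseFirestore;

FirebaseFirestore db = FirebaseFirestore.getInstance();

// First read: find products in a category.
db.collection("Products")
  .whereEqualTo("Category", "paint")
  .get()
  .addOnSuccessListener(products -> {
      for (DocumentSnapshot product : products) {
          String providerId = product.getString("Provider ID");
          // Second read: follow the reference to the provider document.
          db.collection("Providers").document(providerId)
            .get()
            .addOnSuccessListener(provider ->
                Log.d("TAG", provider.getString("Name")));
      }
  });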
If you need your queries to be very fast, you may prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. You may also be able to depend on the local client cache to make this cheaper, depending on the data the client has to read.
With the duplication approach, you copy all of a provider's data into each product document. This means that the code to write this data is more complex, and you're definitely storing more data: one extra provider object for each product document. You'll also need to figure out if and how to keep it up to date in each document. But on the other hand, reading a product document now gives you all the information about the provider in a single read.
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
Whether or not to duplicate some data is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches, and neither is inherently better than the other. It all depends on what your use-cases are and how comfortable you are with this technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or the Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance; unfortunately, in return you get more complex updates and higher storage/memory usage. Note, though, that extra calls are not expensive in the Firebase Realtime Database, while in Firestore they are billed per document read. How much duplicated data versus how many extra database calls is optimal for you depends on your needs and your willingness to let go of the "single point of definition" mindset, which is admittedly subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also, that some database structures are easier to be protected with some security rules. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.

Concerns with NoSQL/MongoDB

I'm starting to build a new Spring-based multi-user document management application and I would like to venture into the world of NoSQL/MongoDB. Coming from a RDBMS background, I have several concerns with MongoDB, primarily:
Lack of transactions
More focused on performance/scalability than data integrity
Lack of a JPA standard
To start with, I do not expect high loads or massive reads vs writes. I suspect reads to writes will be about 10 to 1. Additionally, I do not expect very high loads - especially to start.
1) From what I can tell, there is no easy way to do multi-collection transactions. Whereas in an RDBMS I can easily have a per-user document ID counter maintained in a separate table, there does not seem to be a way to do this reliably in MongoDB, given that it would live in a separate collection/document. Consequently, I'm not sure if/how one resolves this problem.
2) Additionally, from what I have read, NoSQL is great where data integrity isn't the primary concern (e.g. blog comments). However, I'm not sure how this translates to being the primary data store for an application. Does this mean that one can update a document and have it fail? I ran across an older, unattributed rant which discusses failed commits and the like, which further fuels these concerns.
3) The seeming lack of a JPA-like standard for NoSQL implies that I have to choose my DB and stick with it. Unlike JPA, where I can easily swap one DB vendor for another JPQL/SQL-compliant vendor, I would have to code with MongoDB in mind and redesign my structures/queries if I ever wanted to switch to another NoSQL DB. I've seen Hibernate OGM, but it seems to be very much in its infancy and to provide only rudimentary support; definitely not something that would avoid MongoDB-specific queries.
Are these concerns easily mitigated? Being new to the NoSQL world, I'm still having trouble understanding the right business case when to use NoSQL.
These are good questions. Here's my 2 cents about MongoDB and some references to help you learn more. I won't speak about any other NoSQL thingies as there are a lot out there and there's no real unifying principle to NoSQL other than "it doesn't use SQL", except sometimes people make it work with SQL, so, yeah.
MongoDB does not do joins. Period. MongoDB does not have transactions, whether within one collection or involving multiple collections. The unit of atomicity is the document. How does this work in an application? Via schema design, and some techniques for recovering parts of ACID semantics if necessary, for example using two-phase commits. In relational databases, schema design is straightforward and is based on the structure of the data rather than its use case; joins and transactions fill in the gap between the abstract, normalized data representation and the concrete ways the data will be used. The data modeling intro already linked explains the situation for MongoDB, for contrast:
The key challenge in data modeling is balancing the needs of the application, the performance characteristics of the database engine, and the data retrieval patterns. When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
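For instance, the per-user ID counter from concern 1 is usually handled with a single-document atomic update, which needs no multi-document transaction. A hedged sketch using the modern MongoDB Java driver (database, collection and field names are made up):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

MongoClient client = MongoClients.create("mongodb://localhost");
MongoCollection<Document> counters =
        client.getDatabase("app").getCollection("counters");

// $inc on a single document is atomic, so concurrent callers
// always receive distinct sequence numbers.
Document counter = counters.findOneAndUpdate(
        eq("_id", "user42"),
        Updates.inc("seq", 1L),
        new FindOneAndUpdateOptions()
                .upsert(true)
                .returnDocument(ReturnDocument.AFTER));
long nextId = counter.getLong("seq");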
That specific "rant" is clearly very old as it talks about writes being unacknowledged by default. This isn't the case anymore. Given any distributed computer system operating over a network, it's pretty easy to come up with a way for it to behave poorly . The MongoDB blog covered a lot of this stuff in a series on consistency. I'd suggest touring the docs about journaling, replication, and write concern and see if that makes you feel better about MongoDB as a primary data store.
Yup. This comes with the NoSQL territory. What doesn't exist is common data access languages or standards because everything is new and trying to be different. Check back in 30 years.

Disadvantages of Object Relational Mapping

I am a fan of ORM (Object Relational Mapping) and I have been using it with Rails for the past year and a half. Before that, I used to write raw queries using JDBC and have the database do the heavy lifting via stored procedures. With ORM, I was initially happy to do things like coach.manager and manager.coaches, which were very simple and easy to read.
But as time went by, innumerable associations crept in, and I ended up doing a.b.c.d, which fired queries in all directions behind the scenes. With Rails and Ruby, the garbage collector went nuts and took an insane amount of time to load a very complex page that involved relatively little data. I had to replace this ORM-style code with a simple stored procedure, and the difference I saw was enormous: a page that took 50 seconds to load now takes only 2 seconds.
With this huge difference, should I continue using ORM? It is very clear it has severe overheads compared to a raw query.
In general, what are the general pitfalls of using an ORM framework like Hibernate, ActiveRecord?
An ORM is only a tool. If you don't use it correctly, you'll have bad results.
Nothing stops you from using dedicated HQL/criteria queries, with fetch joins or projections, to return the information that your page must display in as few queries as possible. This will take more or less the same time as dedicated SQL queries.
But of course, if you just get everything by ID and navigate through your objects without realizing how many queries it generates, it will lead to long loading times. The key is to know exactly what the ORM does behind the scene, and decide if it's appropriate or if another strategy must be adopted.
I think you've already identified the major tradeoff associated with ORM software. Every time you add a new layer of abstraction that tries to provide a generalized implementation of something that you used to do by hand there is going to be some loss of performance/efficiency.
As you noted, traversing multiple relationships such as a.b.c.d can be inefficient, because most ORM software will be doing an independent database query for each . along the way. But I'm not sure that means you should eliminate ORM altogether. Most ORM solutions (or at least, certainly Hibernate) allow you to specify custom queries where you can bring back exactly what you want in a single database operation. This should be about as fast as your dedicated SQL.
Really the issue is about understanding how the ORM layer is working behind the scenes, and realizing that while something like a.b.c.d is simple to write, what it causes the ORM layer to do as it is evaluated is not. As a general rule I always go with the simplest possible approach to begin, and then write optimized queries in areas where it makes sense/where it is obvious that the simple approach will not scale.
I'd say, one should use the appropriate tool for different tasks.
E.g., for CRUD operations, ORM frameworks like Hibernate can speed up development and it will perform well enough. Sometimes you need to do some necessary tweaks to achieve acceptable performance. I'm not sure, your task (what took 50 sec with Hibernate) could not be done properly with Hibernate, because you did not provide us with the details.
On the other hand, for example bulk operations involving hundreds of thousands of records is not the type of task you'd expect Hibernate will do without significant performance penalty.
As was mentioned already, an ORM is only a tool, and you can use it either well or badly.
One of the most typical performance problems with ORMs is the 1+N queries problem: one additional query is issued for each object in a list, usually because a one-to-many relation is fetched eagerly for each element. The remedies are using HQL queries, specifying fields in a projection, or marking the one-to-many relations as lazy; see the sketch below.
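A minimal sketch of the fetch-join remedy in JPA/Hibernate; Order and items are placeholder names, and em is assumed to be an EntityManager:
// Naive version: 1 query for the orders, then N more queries,
// one per order, when order.getItems() is first touched.
List<Order> orders = em.createQuery(
        "select o from Order o", Order.class).getResultList();

// Fetch-join version: a single query loads orders and their items together.
List<Order> ordersWithItems = em.createQuery(
        "select distinct o from Order o join fetch o.items", Order.class)
        .getResultList();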
At all times, you must know exactly what the ORM is doing in order to achieve good performance. Not understanding what operations are done in the background is a recipe for disaster (slow, buggy and hard-to-analyze code because of unnecessary and badly written work-arounds).
I'm with Petar from your comments regarding the lazy fetching. Say you have an HTML table filled with fields from object a.b.c.d; you could find your framework round-tripping the database thousands of times (possibly many more). The disadvantage of ORM in this case is that you have to read the documentation thoroughly. Most frameworks support disabling lazy fetching, and many even support adding your own processing logic to bind the data set.
The net result is that almost any ORM is almost undoubtedly better than anything you are going to write yourself. You will find yourself saddled with maintaining huge libraries of boilerplate or, worse, writing the same code over and over again.
We are currently investigating a switch from our own data store layer, with a clean separation of transfer objects and data access objects, to JPA. We used a generator to create the TOs, the DAOs and the SQL DDL from documentation in DocBook format. That way our documentation, the database structure and the generated Java classes were always in sync, with good documentation of the database itself.
What we discovered so far by using JPA:
Foreign key references cannot be used for imports, some special queries and so on, because they must not be placed in a managed entity; JPA only allows the target class there.
Access to some user session scope is difficult to impossible. We still have no clue how to get the user's ID into the column 'userWhoLastMadeAnUpdate' in some PrePersist method.
Something expected to be quite easy with an ORM, namely "class mapping", does not work at all. We are using HalDateTime (http://sourceforge.net/projects/haldatetime/) internally, especially in the client. Mapping it with JPA directly is not possible, although HalDateTime supports it. Due to JPA restrictions we have to use two fields in the entity.
JPA can use an XML file to describe the mapping, so you have to look into at least two files to even understand the relationship between the Java class and the database, and the XML file becomes huge for large applications. Alternatively, ORMs provide annotations in the Java class itself, which is easier to learn and understand, but forces you to see all that database stuff in the client layer (which completely breaks proper layering).
You will have to restrict yourself to staying as close to a clean database structure as possible; otherwise you will surely end up with a mess of queries and statements generated by the ORM.
Use an ORM which provides a query language that is close to SQL itself (JPA seems quite acceptable here); an ORM-specific language makes supporting a large application really expensive.

Raw resources versus SQLite database

I'm creating an application that will use a lot of data which is, for all intents and purposes, static. I had assumed it'd make most sense to use a SQLite database to handle that data. I'm wondering if it makes sense to just use an XML file(s) and then access it as a raw resource. Bear in mind that there's likely going to be a LOT of data, to the order of hundreds of separate pieces.
Am I right to assume SQLite is best, both in terms of memory management and overall design considerations, or does SQLite not make sense if the data is basically static?
In fact, SQLite may seem unnecessary if the data is static. However, if you are going to manipulate a lot of data, you should use it:
It will be easier to:
retrieve data
filter data
sort data
Using XML files will cause some performance problems because of the way SAX or DOM parses XML.
It will also be easier for you to update that set of data in the future (imagine you want to add more data in the next release); see the sketch after this list.
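As a rough illustration of the query convenience on Android (table and column names are invented; db is assumed to be a SQLiteDatabase obtained from a SQLiteOpenHelper):
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

// Filtering and sorting are one call each, instead of hand-written XML parsing.
Cursor c = db.query(
        "items",                          // hypothetical table
        new String[] {"name", "value"},   // columns to return
        "category = ?",                   // WHERE clause
        new String[] {"history"},         // WHERE arguments
        null, null,                       // no GROUP BY / HAVING
        "name ASC");                      // ORDER BY
while (c.moveToNext()) {
    String name = c.getString(0);
    // use the row...
}
c.close();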
Cristian is right. A database gives you better access times and allows you to modify data in a very convenient way. XML might be a better idea in the case of tree-like data structures.
In my opinion there are two questions here:
What kind of data are you storing?
Do you allow the user to modify this data (for example in the application, or using Notepad)?
There is also one big disadvantage of XML: it is ultimately plain text, so anyone can read it. To prevent that, you would have to encrypt the data (which means additional effort). In the case of XML, using marshalling techniques (JiBX, Castor, JAXB) might be convenient and might also lower memory consumption.
Please describe what kind of data you are storing in the DB, so we can come up with a better answer.
Have you thought about your data being stolen (from the SQLite database)? With a SQLite database, anybody with root access can just pull the DB file and use it.

XML vs. object trees

In my current project (an order management system built from scratch), we are handling orders in the form of XML documents which are saved in a relational database.
I would outline the requirements like this:
Selecting various details from anywhere in the order
Updating / enriching data (e.g. from the CRM system)
Keeping a record of the changes (invalidating old data, inserting new values)
Details of orders should be easily selectable by SQL queries (for 2nd level support)
What we did:
The serialization is done with proprietary code, disassembling the order into tables like customer, address, phone_number, order_position etc.
Whenever an order is processed a bit further (e.g. due to an incoming event), it is read completely from the database and assembled back into an XML document.
Selection of data is done by XPath (scattered over code).
Most updates are done directly in the database (the order will then be reloaded for the next step).
The problems we face:
The order structure (XSD) evolves with every release, so the XPaths and the custom persistence often break and produce bugs.
We ended up with a mixture of working with the document and working with the database (because the persistence layer cannot persist changes made in the document).
Performance is not really an issue (yet), since it is an offline system and orders are often intentionally delayed by days.
I do not expect free consultancy here, but I am a little confused on how the approach could be improved (next time, basically).
What would you think is a good solution for handling these requirements?
Would working with an object graph, using something like JXPath or OGNL plus an OR mapper, be a better approach? Or using the XML support of, say, the Oracle database?
If your schema changes often, I would advise against using any kind of object-mapping. You'd keep changing boilerplate code just for the heck of it.
Instead, use the declarative schema definition to validate data changes and access.
Consider an order as a single datum, expressed as an XML document.
Use a document-oriented store like MongoDB, Cassandra or one of the many XML databases to manipulate the document directly. Don't bother with cutting it into pieces to store it in a relational db.
Making the data accessible via reporting tools in a relational database might be considered secondary. A simple map-reduce job on a MongoDB, for example, could populate the required order details into a relational database whenever required, separating the two use cases quite naturally.
The standard Java EE approach is to represent your data as POJOs and use JPA for the database access and JAXB to convert the objects to/from XML.
JPA
Object-to-Relational standard
Supported by all the application server vendors.
Multiple available implementations: EclipseLink, Hibernate, etc.
Powerful query language JPQL (that is very similar to SQL)
Handles query optimization for you.
JAXB
Object-to-XML standard
Supported by all the application server vendors.
Multiple implementations available: EclipseLink MOXy, Metro, Apache JaxMe, etc.
Example
http://bdoughan.blogspot.com/2010/08/creating-restful-web-service-part-15.html
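A minimal sketch of what that combination looks like on a single POJO; the class and fields are invented for illustration:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.xml.bind.annotation.XmlRootElement;

// One class serves both worlds: JPA maps it to a database table,
// JAXB maps it to/from an XML document.
@Entity
@XmlRootElement
public class PurchaseOrder {
    @Id
    private Long id;
    private String customerName;

    // getters and setters omitted for brevity
}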
