In my tiny little standalone Java application I want to store information.
My requirements:
read and write Java objects (I do not want to use SQL, and querying is not required)
easy to use
easy to setup
minimal external dependencies
I therefore want to use JAXB to store all the information in a simple XML file in the filesystem. My example application looks like this (copy all the code into a file called Application.java and compile, no additional requirements!):
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.transform.stream.StreamSource;

@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD) // map fields directly, no getters/setters needed
class DataStorage {
    String emailAddress;
    List<String> familyMembers = new ArrayList<>(); // initialized so add() works on a fresh store
    // List<Address> addresses;
}

public class Application {
    private static JAXBContext jc;
    private static File storageLocation = new File("data.xml");

    public static void main(String[] args) throws Exception {
        jc = JAXBContext.newInstance(DataStorage.class);
        DataStorage dataStorage = load();
        // the main application will be executed here
        // data manipulation like this:
        dataStorage.emailAddress = "me@example.com";
        dataStorage.familyMembers.add("Mike");
        save(dataStorage);
    }

    protected static DataStorage load() throws JAXBException {
        if (storageLocation.exists()) {
            StreamSource source = new StreamSource(storageLocation);
            return (DataStorage) jc.createUnmarshaller().unmarshal(source);
        }
        return new DataStorage();
    }

    protected static void save(DataStorage dataStorage) throws JAXBException {
        jc.createMarshaller().marshal(dataStorage, storageLocation);
    }
}
How can I overcome these downsides?
Starting the application multiple times could lead to inconsistencies
Several users could run the application on a network drive and experience concurrency issues
Aborting the write process might lead to corrupted data or losing all data
Seeing your requirements:
Starting the application multiple times
Several users could run the application on a network drive
Protection against data corruption
I believe that an XML file on the filesystem will not be sufficient. If you consider a proper relational database overkill, you could still go for an H2 database. This is a super-lightweight database that would solve all the problems above (even if not perfectly, but surely much better than a handwritten XML store), and it is still very easy to set up and maintain.
You can configure it to persist your changes to disk, run it as a standalone server accepting multiple connections, or run it as part of your application in embedded mode.
Regarding the "How do you save the data" part:
In case you do not want to use any advanced ORM library (like Hibernate or any other JPA implementation) you can still use plain old JDBC. Or at least some Spring-JDBC, which is very lightweight and easy to use.
"What do you save"
H2 is a relational database, so whatever you save will end up in columns. But! If you really do not plan to query your data (nor apply migration scripts to it), saving your already XML-serialized objects is an option. You can easily define a table with an ID plus a "data" column, and save your XML there. H2 does not impose a practical limit on the length of such a column (a CLOB type is the safe choice for large documents).
Note: Saving XML in a relational database is generally not a good idea. I am only advising you to evaluate this option, because you seem confident that you only need a certain set of features from what an SQL implementation can provide.
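For illustration, here is a minimal sketch of that option using plain JDBC in embedded mode. It assumes only that the H2 driver jar is on the classpath; the class, table, and file names are invented for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class XmlInH2 {
        public static void main(String[] args) throws Exception {
            // embedded, file-based database: H2 creates the data file next to the app
            try (Connection con = DriverManager.getConnection("jdbc:h2:./data")) {
                try (Statement st = con.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS storage(id BIGINT PRIMARY KEY, data CLOB)");
                }
                // insert-or-update the single XML document under a fixed id
                try (PreparedStatement ps = con.prepareStatement(
                        "MERGE INTO storage(id, data) KEY(id) VALUES (?, ?)")) {
                    ps.setLong(1, 1L);
                    ps.setString(2, "<dataStorage>...</dataStorage>"); // the JAXB-marshalled XML
                    ps.executeUpdate();
                }
                // read it back; unmarshal with JAXB as in the question
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery("SELECT data FROM storage WHERE id = 1")) {
                    if (rs.next()) {
                        String xml = rs.getString(1);
                        System.out.println(xml);
                    }
                }
            }
        }
    }

The database itself then takes care of locking and crash recovery, which is the whole point of the suggestion.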
Inconsistencies and concurrency are handled in two ways:
by locking
by versioning
Corrupted writes cannot be handled very well at the application level. The file system should support journaling, which mitigates this to some extent. You can also do it yourself by
making your own journaling file (i.e. a short-lived separate file containing changes to be committed to the real data file).
All of these features are available even in the simplest relational databases, e.g. H2 or SQLite, and even a web page can use such features in HTML5. It is overkill to reimplement them from scratch, and a proper implementation of the data storage layer will actually make your simple needs quite complicated.
But, just for the record:
Concurrency handling with locks
prior to changing the XML, use a file lock to gain exclusive access to the file, see also How can I lock a file using java (if possible)
once the update is done and you have successfully closed the file, release the lock
Consistency (atomicity) handling with locks
other application instances may still try to read the file while one of the apps is writing it. This can cause inconsistency (aka dirty reads). Ensure that during writing the writer process holds an exclusive lock on the file. If it is not possible to gain an exclusive lock, the writer has to wait a bit and retry.
an application reading the file should read it (if it can gain access, i.e. no other instance holds an exclusive lock), then close the file. If reading is not possible (because another app holds the lock), wait and retry.
an external application (e.g. Notepad) can still change the XML. You may prefer an exclusive lock even while reading the file; see the sketch after this list.
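As a rough illustration of the wait-and-retry locking described above, here is a minimal sketch built on java.nio file locks. The class and method names are invented for the example, and note that such locks are advisory on POSIX systems, so an external editor is not forced to honor them:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.function.Consumer;

    public class ExclusiveFileAccess {

        // Runs `work` while holding an exclusive lock on `file`; waits and
        // retries if another process currently holds the lock, as described above.
        static void withExclusiveLock(Path file, Consumer<FileChannel> work)
                throws IOException, InterruptedException {
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                FileLock lock = channel.tryLock();
                while (lock == null) {      // held by another process: wait and retry
                    Thread.sleep(100);
                    lock = channel.tryLock();
                }
                try {
                    work.accept(channel);   // read/write through the locked channel
                } finally {
                    lock.release();
                }
            }
        }
    }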
Basic journaling
The idea here is that if you need to do a lot of writes (or if you might later want to roll back your writes), you don't want to touch the real file. Instead:
writes go, as change records, to a separate journaling file, created and locked by your app instance
your app instance does not lock the main file, it locks only the journaling file
once all the writes are good to go, your app opens the real file with an exclusive write lock, commits every change from the journaling file, then closes the file. A sketch of this follows below.
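A bare-bones sketch of such a journal, under the assumption that changes can be represented as one-line records (the file name and record format are invented for the example):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.List;

    public class Journal {
        private final Path journalFile = Path.of("data.xml.journal");

        // while working, only the journal is written (and locked) by this instance
        void log(String change) throws IOException {
            Files.writeString(journalFile, change + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // on commit: take the exclusive lock on the real file (see the locking
        // sketch above), replay the journal into it, then clear the journal
        void commit() throws IOException {
            List<String> changes = Files.readAllLines(journalFile);
            for (String change : changes) {
                // ... apply `change` to the unmarshalled DataStorage and marshal it back ...
            }
            Files.delete(journalFile);
        }
    }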
As you can see, the solution with locks turns the file into a shared resource, which is protected by locks and which only one application can access at a time. This solves the concurrency issues, but it also makes file access a bottleneck. Therefore modern databases such as Oracle use versioning instead of locking. Versioning means that both the old and the new version of the file are available at the same time. Readers are served from the old, complete file. Once writing of the new version is finished, it is merged into the old version, and the new data becomes available all at once. This is trickier to implement, but since it allows reading all the time for all applications in parallel, it scales much better.
To answer your three issues you mentioned:
Starting the application multiple times could lead to inconsistencies
Why would it lead to inconsistencies? If what you mean is that multiple concurrent edits will lead to inconsistencies, you just have to lock the file before editing. The easiest way is to create a lock file beside the data file. Before starting an edit, just check whether a lock file exists.
If you want to make it more fault tolerant, you could also put a timeout on the lock, e.g. a lock file is valid for 10 minutes. You could write a randomly generated UUID into the lock file and, before saving, check whether the UUID still matches. A sketch of this follows below.
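A rough sketch of that lock-file idea. The file name and timeout are illustrative, and the check-then-create sequence is not atomic across processes, which is exactly why the token re-check before saving matters:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.UUID;

    public class LockFile {
        private static final Path LOCK = Path.of("data.xml.lock");
        private static final Duration TIMEOUT = Duration.ofMinutes(10);

        // acquire the lock: fail if a non-expired lock file exists, return our token otherwise
        static String acquire() throws IOException {
            if (Files.exists(LOCK) && Files.getLastModifiedTime(LOCK).toInstant()
                    .isAfter(Instant.now().minus(TIMEOUT))) {
                throw new IllegalStateException("locked by another instance");
            }
            String token = UUID.randomUUID().toString();
            Files.writeString(LOCK, token); // check-then-write race is possible, hence the token
            return token;
        }

        // before saving, verify the lock file still contains our token
        static boolean stillHeld(String token) throws IOException {
            return Files.exists(LOCK) && Files.readString(LOCK).equals(token);
        }
    }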
Several users could run the application on a network drive and experience concurrency issues
I think this is the same as number 1.
Aborting the write process might lead to corrupted data or loosing all data
This can be solved by making the write atomic or the files immutable. To make it atomic, instead of editing the file directly, copy the file and edit the copy. After the copy is saved, rename it over the original (see the sketch below). If you want to be on the safer side, you could instead append a timestamp to the file name and never edit or delete a file: every time an edit is made, you create a new copy with a newer timestamp appended, and for reading you always read the newest one.
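A minimal sketch of the atomic variant, assuming the platform supports an atomic rename (java.nio throws AtomicMoveNotSupportedException where it does not; the .tmp suffix is made up):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class AtomicSave {

        // Write the new content to a sibling temp file, then rename it over the
        // real file. On most platforms the rename is atomic, so readers see
        // either the old or the new file, never a half-written one.
        static void save(Path target, byte[] newContent) throws IOException {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            Files.write(tmp, newContent);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }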
Note that this simple answer won't handle concurrent writes by different instances. If two instances make changes and save, simply picking the newest one will end up losing the changes from the other instance. As mentioned in other answers, you should probably use file locking for this.
a relatively simple solution:
use a separate lock file for writing, "data.xml.lck", and lock it when writing the file
as mentioned in my comment, write to a temp file first, "data.xml.tmp", then rename to the final name "data.xml" when the write is complete. This gives a reasonable assurance that anyone reading the file will get a complete file.
even with the file locking, you still have to handle the "merge" problem (one instance reads, another writes, then the first wants to write). In order to handle this you should have a version number in the file content. When an instance wants to write, it first acquires the lock. Then it checks its local version number against the file version number. If it is out of date, it needs to merge what is in the file with the local changes. Then it can write a new version; see the sketch after this list.
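To make the version check concrete, here is a hedged sketch. For brevity it keeps the version number on the first line of the file rather than inside the XML, which is an invented simplification:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class VersionedStore {

        // Write only if our copy was based on the version currently on disk;
        // otherwise the caller has to merge first. Assumes the lock is already held.
        static void write(Path file, long expectedVersion, String payload) throws IOException {
            long onDisk = Files.exists(file)
                    ? Long.parseLong(Files.readAllLines(file).get(0))
                    : 0L;
            if (onDisk != expectedVersion) {
                throw new IllegalStateException("stale copy: merge required before writing");
            }
            Files.writeString(file, (onDisk + 1) + System.lineSeparator() + payload);
        }
    }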
After thinking about it for a while, I would want to try to implement it like this:
Open the data.<timestamp>.xml file with the latest timestamp.
Only use readonly mode.
Make changes.
Save the file as data.<timestamp>.xml; do not overwrite, and check that no file with a newer timestamp exists.
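A small sketch of how that could look. The file name pattern and helper names are invented, and a full implementation would re-check for a newer timestamp right before saving, as step 4 demands:

    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Optional;

    public class TimestampedStore {

        // Find the data.<timestamp>.xml with the highest timestamp. Lexicographic
        // comparison is enough while the timestamp is fixed-width, e.g.
        // System.currentTimeMillis().
        static Optional<File> latest(File dir) {
            File[] files = dir.listFiles((d, name) -> name.matches("data\\.\\d+\\.xml"));
            return Arrays.stream(files == null ? new File[0] : files)
                    .max(Comparator.comparing(File::getName));
        }

        // Create data.<now>.xml for writing; createNewFile() fails if it already
        // exists, which guards against two instances picking the same timestamp.
        static File createNext(File dir) throws IOException {
            File next = new File(dir, "data." + System.currentTimeMillis() + ".xml");
            if (!next.createNewFile()) {
                throw new IOException("another instance already created " + next);
            }
            return next;
        }
    }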
Related
I am a beginner in programming, so I am trying to learn with projects. My newest project is to create an agenda/calendar that is accessible from different computers (like a family calendar) so mom or dad can put up their events and everyone can see everyone's plans.
For a program that can store a family's agenda so they can go back to it at any time, I assume some sort of database or server is needed to store their information. How could I do this?
I apologize if my question is vague. I am relatively new to programming, but am so eager to keep learning.
You have several options.
The easiest is Serialization. Serialization takes an object and writes it to a stream using an ObjectOutputStream. You can read it back with an ObjectInputStream.
It's trivial because without error checking, it's just a few lines of code.
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;

// write the whole calendar object graph (YourCalendar must implement Serializable)
FileOutputStream fos = new FileOutputStream("calendar.dat");
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.writeObject(yourCalendar);
oos.close();
Similarly:
import java.io.FileInputStream;
import java.io.ObjectInputStream;

// read the calendar back and cast it to its original type
FileInputStream fis = new FileInputStream("calendar.dat");
ObjectInputStream ois = new ObjectInputStream(fis);
YourCalendar yourCalendar = (YourCalendar) ois.readObject();
ois.close();
Where yourCalendar is the master object containing your entire calendar and the appointments, etc.
Since you're not dealing with large amounts of information, it's perfectly adequate.
Now, that said, it's also fraught with danger. The file format is opaque (you can't just open it up in an editor and look at it). It can also be quite brittle: if you change the underlying classes that you're serializing, you may not be able to read a file back in. There are also potential security implications (likely not germane in your case, but they're still there).
Much of that can be mitigated, at the cost of complexity.
Similarly, you can use one of the JSON or XML libraries to serialize your objects out in to one of those text based formats. These are human readable, and can be a bit less sensitive to change than the binary format.
Of course, with all of these, it's "all or nothing". In this case, you're writing out the entire object and all of its embedded objects. That means you can't individually access the data, nor can you use a 3rd-party tool to access the data (like an SQL toolset). But, again, you don't have much data, so having this kind of access is likely not a big deal.
You wouldn't want to use this in a multi-user scenario, as it cannot be incrementally updated (again, all or nothing).
But, all that said, for getting up and running, for simple persistence, and being cognizant of its limitations, it will do the job for you and let you check this box on your project as you strive to work on the other aspects of it. It's easy enough to start with this and then, later, move to a more robust persistence mechanism.
Memory is volatile. For storing data persistently you need to write it either in files or in databases.
Since this is an opinion-based question, I am offering my opinion.
You may begin by learning to read from and write to files (text as well as binary).
When writing to and reading from files, you need to decide which format to store the data in: JSON, YAML, XML, or comma-separated values; alternatively, you can serialize your objects and store them in a file. The choice is yours.
When reading, you need to write your own logic to search through the data. So while files are a good and easy way to store data, you need to either write your own search mechanism or use a document search engine like Elasticsearch.
Another option is to use a database, which gives you the power of SQL (if using a relational database) to query your data. In order to use a database, you should learn about databases, reading from and writing to them, and making a database connection in Java.
In my opinion,
you should begin with the database approach, as you can easily query on a date to get all the events for that date. You not only want to store the events, you also want to go to a particular date and list the events planned for it, so you need to store your data in a way that is easy for you to search and read.
Also, I advise you to use the Spring framework and Maven, which can take care of the dependencies and the database connection with minimal configuration.
You may use the H2 database; it is similar to SQLite and easy to use. Use a file-based database; you need not run a server as of now.
Edit
Also, as suggested by @springe, you can use an ORM like Hibernate to deal with the database, which is a secure and recommended way used even in industrial code. Basically, it is good practice to use JPA/Hibernate when performing CRUD operations.
However, since you are new to programming, gain mastery of plain SQL as well as learning good practices like using an ORM.
For reference:
You can refer to Baeldung: just google "how to do this and that in Java Baeldung" and you will get a pretty cool, short guide on how to do it.
You will find the Spring configuration to connect to an H2 database, and the Maven dependencies for Spring, there at Baeldung. Everything is standard, and you just need to copy-paste while also learning how things work.
Keep learning, I loved your spirit. :)
The webapp in my project, which provides CSV file download functionality based on a search by the end user, is doing the following:
A file "download.csv" is opened (not using File.createTempFile(String prefix, String suffix, File directory), but always just "download.csv"), rows of data from a SQL recordset are written to it, and FileUtils is then used to copy that file's content to the servlet's OutputStream.
The recordset is based on search criteria, like 1st Jan to 30th March.
Can this lead to a case where the file contains the contents of two users who pick different date ranges/other filters and submit at the same time, so that the JVM processes the requests concurrently?
Right now we are in dev and there is very little data.
I know we can write automated tests to test this, but wanted to know the theory.
I suggested using the OutputStream of the HTTP response (pass that to the service layer as a vanilla OutputStream and write directly to it, or wrap it in a BufferedWriter and write to that).
The only downside is that the data will be written more slowly than with the file copy, since if there is more data in the recordset it will take time to iterate through it.
But the total time of the request should be less? (The time to write to the output stream is the same as the time to write to the file, and we save the copy from the file to the servlet output stream.)
Anyone done testing around this and have test cases or solutions to share?
Well that is a tricky question if you really would like to go into the depth of both parts.
Concurrency
As you wrote, this "same name" thing could lead to a race condition if you are working on a multi-threaded system (almost all systems are like that nowadays). I have seen code written like this and it can cause a lot of trouble. The resulting file could contain not only lines from both searches but merged characters as well.
Examples:
Thread 1 wants to write: 123456789\n
Thread 2 wants to write: abcdefghi\n
The output could vary in ways like these:
1st case:
123456789
abcdefghi
2nd case:
1234abcd56789
efghi
I would definitely use at least unique (UUID.randomUUID()) names to "hot-fix" the problem.
Disk IO
Having disk IO is a tricky thing if you go in-depth. The speeds can vary across a wide range. In the JVM you can have blocking and non-blocking IO as well. A blocking call may wait until the data is really on the disk, while the other kind does some "magic" to flush the file later. There is a good read on this here.
TL;DR: As a rule of thumb, it is better to keep things in memory (if they fit) and not bother with the disk. If you use per-thread memory for this purpose, you avoid the concurrency problem as well. So in your case it would be better to rewrite the given part to use memory only and write straight to the output, as sketched below.
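For illustration, a minimal servlet-style sketch of writing the CSV straight to the response stream, with no shared temp file. It assumes the javax.servlet API is available; the method and parameter names are invented, and real code would also escape delimiters:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServletResponse;

    public class CsvDownload {

        // stream rows straight to the client; each request has its own response
        // stream, so concurrent requests can no longer mix their data
        static void writeCsv(HttpServletResponse response, Iterable<String[]> rows)
                throws IOException {
            response.setContentType("text/csv");
            response.setHeader("Content-Disposition", "attachment; filename=\"download.csv\"");
            PrintWriter out = response.getWriter();
            for (String[] row : rows) {
                out.println(String.join(",", row)); // naive join: no quoting/escaping
            }
            out.flush();
        }
    }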
I want to store my blobs outside of the database in files, however they are just random blobs of data and aren't directly linked to a file.
So for example I have a table called Data with the following columns:
id
name
comments
...
I can't just include a column called fileLink or something like that because the blob is just raw data. I do however want to store it outside of the database. I would love to create a file called 3.dat, where 3 is the id number for that row entry. The only problem with this setup is that the main folder will quickly accumulate a huge number of files, since the ids form a flat folder structure, and there will be OS file issues. And no, the data is not grouped or structured, it's one massive list.
Is there a Java framework or library that will allow me to store and manage the blobs so that I can just do something like MyBlobAPI.saveBlob(id, data); and then do MyBlobAPI.getBlob(id) and so on? In other words something where all the File IO is handled for me?
Simply use an appropriate database which implements blobs as you described, and use JDBC. You really are not looking for another API but for a specific implementation. It's up to the DB to take care of storing blobs efficiently.
I think a home-rolled solution would include something like a fileLink column in your table, and your API would create files on the first save and then write to that file on update.
I don't know of any code base that will do this for you. There are a bunch that provide an in-memory file system for Java, but it's only a few lines of code to write something that reads and writes Java objects to a file.
You'll have to handle any file system limitations yourself. Though I doubt you'll ever burn through the limitations of modern file systems like btrfs or zfs. FAT32 is limited to 65K files per directory. But even last generation file systems support something on the order of 4 billion files per directory.
So by all means, write a class with two functions: one to serialize an object to a file, giving it a unique key as a name, and another to deserialize the object by that key. If you are using a modern file system, you'll never run out of resources. A sketch of such a class follows.
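A bare-bones sketch of the saveBlob/getBlob API from the question, with an invented class name and no sharding; a comment shows where you could shard into subdirectories if the flat folder ever becomes a problem:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class BlobStore {
        private final Path root;

        public BlobStore(Path root) throws IOException {
            this.root = Files.createDirectories(root);
        }

        // saveBlob(id, data): one file per id, named <id>.dat as described above.
        // To avoid one huge flat directory, you could shard by a prefix instead,
        // e.g. root.resolve(String.valueOf(id % 256)).resolve(id + ".dat")
        public void saveBlob(long id, byte[] data) throws IOException {
            Files.write(root.resolve(id + ".dat"), data);
        }

        // getBlob(id): read the raw bytes back
        public byte[] getBlob(long id) throws IOException {
            return Files.readAllBytes(root.resolve(id + ".dat"));
        }
    }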
As far as I can tell there is no framework for this. The closest I could find was Hadoop's HDFS.
That being said, the advice of just putting the BLOBs into the database, as per the answers below, is not always advisable. Sometimes it's good and sometimes it's not; it really depends on your situation. Here are a few links to such discussions:
Storing Images in DB - Yea or Nay?
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
I did find some additional really good links, but I can't remember them offhand. There was one in particular on Stack Overflow, but I can't find it. If you believe you know the link, please add it in the comments so that I can confirm it's the right one.
We have seen a lot of applications working with JSON files, but I have a case study for which I want to get a solution.
Let us see ...
An app works with a JSON file, gets requests from millions of users, and completes thousands of requests every second.
The JSON file is updated from an admin panel every minute or second or at some specific time frame.
What will the behaviour of the JSON file be when a request to access it is received while it is open for update from the admin at the same time? (I have read that the JSON file will be fetched in readable mode.)
Say the JSON file is being written by some script and the process takes a third of a second; what will the behaviour be while 50% of the file has been updated?
Will the file be served with the newly written content only once the process is completed, or also while it is partially updated?
Don't bother with locking, just use rename().
Assuming you're running on an OS where a rename() is an atomic operation, create a new file, say "/data/file/name.json.new", then when that's complete, rename the file. In C that would look like this:
rename( "/data/file/name.json.new", "/data/file/name.json" );
This way, any process opening "/data/file/name.json" will always see a consistent data file.
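For a Java application, the equivalent of that C call would presumably be java.nio's atomic move (paths as in the example above):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class PublishJson {
        public static void main(String[] args) throws Exception {
            // atomically replace the live file with the fully written new one;
            // throws AtomicMoveNotSupportedException if the filesystem cannot do it
            Files.move(Path.of("/data/file/name.json.new"), Path.of("/data/file/name.json"),
                    StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
        }
    }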
Practically, by what you describe, you want a service that applies operations on a file server-side.
You should, though, avoid taking on the responsibility of Creating, Reading, Updating and Deleting (CRUD) yourself, as you will have trouble preserving principles such as Atomicity, Consistency, Isolation and Durability (ACID), while there are systems that do that for you: Database Management Systems.
In simple words, scenarios like what you describe should be a responsibility of a DBMS and not yours.
You probably need a NoSQL DBMS, which is responsible for the CRUD operations on your database (which can be file-based, in JSON format or other forms) while preserving ACID always (or almost always, but this is probably something you will learn more about as you research it). MongoDB is a great example of such a system.
Because you mentioned JSON, please take into consideration that it is one thing to transfer the data and another to store it. I suggest that you use the JSON format for requests & responses, but explore other options for storage. For instance, even a relational DBMS that uses SQL can be good for you; it always depends on your needs. You might just need to form (encode & decode) the data in JSON format wherever it is received from or sent to each client.
Take a look here for more info.
Tim Bray's article "Saving Data Safely" left me with open questions. Today, it's over a month old and I haven't seen any follow-up on it, so I decided to address the topic here.
One point of the article is that FileDescriptor.sync() should be called to be on the safe side when using FileOutputStream. At first, I was very irritated, because I have never seen any Java code doing a sync during the 12 years I have done Java. Especially since dealing with files is a pretty basic thing. Also, the standard JavaDoc of FileOutputStream never hinted at syncing (Java 1.0 - 6). After some research, I figured ext4 may actually be the first mainstream file system requiring syncing. (Are there other file systems where explicit syncing is advised?)
I appreciate some general thoughts on the matter, but I also have some specific questions:
When will Android do the sync to the file system? This could be periodic and additionally based on life cycle events (e.g. an app's process goes to the background).
Does FileDescriptor.sync() take care of syncing the meta data? That is syncing the directory of the changed file. Compare to FileChannel.force().
Usually, one does not directly write into the FileOutputStream. Here's my solution (do you agree?):
FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
    out.flush();
    fileOut.getFD().sync();
} finally {
    out.close();
}
Android will do the sync when it needs to -- such as when the screen turns off or the device shuts down. If you are just looking at "normal" operation, explicit sync by applications is never needed.
The problem comes when the user pulls the battery out of their device (or does a hard reset of the kernel), and you want to ensure you don't lose any data.
So the first thing to realize: the issue is when power is suddenly lost, so a clean shutdown can not happen, and the question of what is going to happen in persistent storage at that point.
If you are just writing a single independent new file, it doesn't really matter what you do. The user could have pulled the battery while you were in the middle of writing, right before you started writing, etc. If you don't sync, it just means there is some longer time from when you are done writing during which pulling the battery will lose the data.
The big concern here is when you want to update a file. In that case, when you next read the file you want to have either the previous contents, or the new contents. You don't want to get something half-way written, or lose the data.
This is often done by writing the data to a new file and then switching to that from the old file. Prior to ext4 you knew that, once you had finished writing a file, further operations on other files would not hit the disk before the writes to that file, so you could safely delete the previous file or otherwise perform operations that depend on your new file being fully written.
However now if you write the new file, then delete the old one, and the battery is pulled, when you next boot you may see that the old file is deleted and new file created but the contents of the new file is not complete. By doing the sync, you ensure that the new file is completely written at that point so can do further changes (such as deleting the old file) that depend on that state.
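A hedged sketch of that write-new-then-switch pattern (file names are invented; on Android the FileOutputStream would come from Context.openFileOutput as in the question):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class SafeUpdate {

        // Write the new version, force it to disk, and only then replace the
        // old file, as described in the answer above.
        static void update(File dir, byte[] newContent) throws IOException {
            File newFile = new File(dir, "state.xml.new");
            FileOutputStream out = new FileOutputStream(newFile);
            try {
                out.write(newContent);
                out.flush();
                out.getFD().sync(); // ensure the bytes are on disk before we switch files
            } finally {
                out.close();
            }
            File target = new File(dir, "state.xml");
            if (!newFile.renameTo(target)) { // the "switch" from old to new contents
                throw new IOException("could not rename " + newFile + " to " + target);
            }
        }
    }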
fileOut.getFD().sync(); should be in the finally clause, before the close().
sync() is way more important than close() considering durability.
So, every time you want to 'finish' working on a file, you should sync() it before close()ing it.
POSIX does not guarantee that pending writes will be written to disk when you issue a close().
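Applied to the snippet from the question, this answer's suggestion would look roughly like the following (same hypothetical ctx, file, and something as above; note the flush must come before the sync, since the buffered stream holds the bytes until then):

    FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
    BufferedOutputStream out = new BufferedOutputStream(fileOut);
    try {
        out.write(something);
    } finally {
        out.flush();            // push buffered bytes down to the FileOutputStream
        fileOut.getFD().sync(); // force them to stable storage before closing
        out.close();
    }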