ext4/fsync situation unclear in Android (Java)

Tim Bray's article "Saving Data Safely" left me with open questions. It is now over a month old and I haven't seen any follow-up on it, so I decided to address the topic here.
One point of the article is that FileDescriptor.sync() should be called to be on the safe side when using FileOutputStream. At first I was very irritated, because I have never seen any Java code doing a sync in the 12 years I have been working with Java, especially since dealing with files is a pretty basic thing. Also, the standard JavaDoc of FileOutputStream never hinted at syncing (Java 1.0 - 6). After some research, I figured ext4 may actually be the first mainstream file system requiring syncing. (Are there other file systems where explicit syncing is advised?)
I appreciate some general thoughts on the matter, but I also have some specific questions:
When will Android do the sync to the file system? This could be periodic, and additionally based on life-cycle events (e.g. an app's process going to the background).
Does FileDescriptor.sync() take care of syncing the metadata? That is, syncing the directory entry of the changed file. Compare FileChannel.force().
Usually, one does not directly write into the FileOutputStream. Here's my solution (do you agree?):
FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
    out.flush();
    fileOut.getFD().sync();
} finally {
    out.close();
}

Android will do the sync when it needs to -- such as when the screen turns off, when the device shuts down, etc. If you are just looking at "normal" operation, an explicit sync by applications is never needed.
The problem comes when the user pulls the battery out of their device (or does a hard reset of the kernel), and you want to ensure you don't lose any data.
So the first thing to realize: the issue is when power is suddenly lost, so a clean shutdown can not happen, and the question is what state persistent storage will be in at that point.
If you are just writing a single independent new file, it doesn't really matter what you do. The user could have pulled the battery while you were in the middle of writing, right before you started writing, etc. If you don't sync, it just means there is a longer window after you are done writing during which pulling the battery will lose the data.
The big concern here is when you want to update a file. In that case, when you next read the file you want to have either the previous contents, or the new contents. You don't want to get something half-way written, or lose the data.
This is often done by writing the data to a new file and then switching to that from the old file. Prior to ext4 you knew that, once you had finished writing a file, further operations on other files would not go to disk until the operations on that file had, so you could safely delete the previous file or otherwise do operations that depend on your new file being fully written.
However, now if you write the new file, then delete the old one, and the battery is pulled, when you next boot you may see that the old file is deleted and the new file created, but the contents of the new file are not complete. By doing the sync, you ensure that the new file is completely written at that point, so you can do further changes (such as deleting the old file) that depend on that state.
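A minimal sketch of that write-new-then-replace pattern (the file names and the byte[] payload are mine, not from the answer):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Writes the new contents to a temporary file, syncs it, then replaces the old file. */
static void replaceFileSafely(File dir, byte[] newContents) throws IOException {
    File newFile = new File(dir, "data.xml.tmp");   // hypothetical names
    File oldFile = new File(dir, "data.xml");

    FileOutputStream out = new FileOutputStream(newFile);
    try {
        out.write(newContents);
        out.getFD().sync();        // make sure the new file is on stable storage ...
    } finally {
        out.close();
    }
    // ... before doing anything that depends on it, such as replacing the old file.
    if (!newFile.renameTo(oldFile)) {
        throw new IOException("rename failed: " + newFile + " -> " + oldFile);
    }
}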

fileOut.getFD().sync(); should be in the finally clause, before the close().
sync() is far more important than close() as far as durability is concerned.
So, every time you want to 'finish' working on a file, you should sync() it before close()ing it.
POSIX does not guarantee that pending writes will be written to disk when you issue a close().
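Applied to the snippet from the question, that ordering would look roughly like this (same ctx, file and something placeholders as in the question; this is just the answer's advice spelled out, not a canonical recipe):

FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
} finally {
    out.flush();                // push buffered bytes down to the FileOutputStream
    fileOut.getFD().sync();     // then force them to stable storage
    out.close();                // finally release the descriptor
}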

Related

How file manipulations behave during a power outage

Linux machine, Java standalone application
I have the following situation:
I have:
a consecutive file write (which creates the destination file and writes some content to it) and a file move.
I also have a power-outage problem, which instantly cuts off the power of the computer during these operations.
As a result, I find that the file was created and it was moved as well, but the file content is empty.
The question is: what, under the hood, can be causing this exact outcome? Considering the time sensitivity, maybe the hard drive is disabled before the processor and RAM during the cut-off, but in that case how is it possible that the file is created and moved afterwards while the write before the move does not succeed?
I tried catching and logging the exception and debug information, but the problem is that the power outage disables the logging abilities (I/O) as well.
try {
    FileUtils.writeStringToFile(file, JsonUtils.toJson(object));
} finally {
    if (file.exists()) {
        FileUtils.moveFileToDirectory(file, new File(path), true);
    }
}
Linux file systems don't necessarily write things to disk immediately, or in exactly the order that you wrote them. That includes both file content and file / directory metadata.
So if you get a power failure at the wrong time, you may find that the file data and metadata are inconsistent.
Normally this doesn't matter. (If the power fails and you don't have a UPS, the applications go away without getting a chance to finish what they were doing.)
However, if it does matter, you can force the file to "sync" before you move it:
FileOutputStream fos = ...
// write to file
fos.getFD().sync();
fos.close();
// now move it
You need to read the javadoc for sync() carefully to understand what the method actually does.
You also need to read the javadoc for the method you are using to move the file regarding atomicity.
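For the move itself, one option (my suggestion, not something the answer prescribes) is java.nio.file.Files.move with the ATOMIC_MOVE option, combined with a sync before the move. A rough sketch, assuming Java 7+ and that source and target live on the same file system:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

static void writeAndMove(byte[] data) throws IOException {
    Path tmp  = Paths.get("/data/out/result.json.tmp");   // made-up paths; both must be on
    Path dest = Paths.get("/data/out/result.json");       // the same file system for ATOMIC_MOVE

    FileOutputStream fos = new FileOutputStream(tmp.toFile());
    try {
        fos.write(data);
        fos.getFD().sync();                                // content is on disk before the move
    } finally {
        fos.close();
    }
    Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE); // throws if an atomic move is impossible
}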

See the size and content of a file (the newly added strings) while writing to it in Java, after each refresh

I have used PrintWriter to write to a text file with Java. The process takes time, because there is an iteration process; at the end of each iteration a string is appended to the file. If I wait until the process ends, I can open the file and see its content without problems. But how can I watch the file being updated while it is still being written to, e.g. so that when I refresh inside the folder I see the size become larger and larger with each refresh click?
You can't see the progress of your stuff being written into the file, since it might not actually be written yet.
First: Java streams usually optimize access to the hard drive, to access it as little as possible, since doing so is slow. To force Java to do what it thinks is writing to disk, call PrintWriter.flush() after you have appended something.
Second: even that might not do the trick, since most operating systems apply the same optimisation with the same result, and you can't simply force your OS to flush().
As already mentioned, the best you can do is PrintWriter.flush().
As @Elliott mentioned, you should do pw.flush() after each append to see the size change immediately.
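A small sketch of that flush-after-each-append idea (the file name, loop and message are made up):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

PrintWriter pw = new PrintWriter(new FileWriter("progress.txt", true));  // append mode
try {
    for (int i = 0; i < 1000; i++) {
        // ... one iteration of your real work would happen here ...
        pw.println("iteration " + i + " done");
        pw.flush();   // hand the buffered characters to the OS so the file size grows visibly
    }
} finally {
    pw.close();
}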

Minimal code to reliably store java object in a file

In my tiny little standalone Java application I want to store information.
My requirements:
read and write Java objects (I do not want to use SQL, and querying is not required either)
easy to use
easy to setup
minimal external dependencies
I therefore want to use JAXB to store all the information in a simple XML file in the filesystem. My example application looks like this (copy all the code into a file called Application.java and compile, no additional requirements!):
@XmlRootElement
class DataStorage {
    String emailAddress;
    List<String> familyMembers;
    // List<Address> addresses;
}

public class Application {
    private static JAXBContext jc;
    private static File storageLocation = new File("data.xml");

    public static void main(String[] args) throws Exception {
        jc = JAXBContext.newInstance(DataStorage.class);
        DataStorage dataStorage = load();
        // the main application will be executed here
        // data manipulation like this:
        dataStorage.emailAddress = "me@example.com";
        dataStorage.familyMembers.add("Mike");
        save(dataStorage);
    }

    protected static DataStorage load() throws JAXBException {
        if (storageLocation.exists()) {
            StreamSource source = new StreamSource(storageLocation);
            return (DataStorage) jc.createUnmarshaller().unmarshal(source);
        }
        return new DataStorage();
    }

    protected static void save(DataStorage dataStorage) throws JAXBException {
        jc.createMarshaller().marshal(dataStorage, storageLocation);
    }
}
How can I overcome these downsides?
Starting the application multiple times could lead to inconsistencies: Several users could run the application on a network drive and experience concurrency issues
Aborting the write process might lead to corrupted data or losing all data
Seeing your requirements:
Starting the application multiple times
Several users could run the application on a network drive
Protection against data corruption
I believe that an XML-file-based storage will not be sufficient. If you consider a proper relational database overkill, you could still go for an H2 db. This is a super-lightweight db that would solve all these problems above (even if not perfectly, but surely much better than a handwritten XML db), and it is still very easy to set up and maintain.
You can configure it to persist your changes to disk; it can be configured to run as a standalone server and accept multiple connections, or it can run as part of your application in embedded mode too.
Regarding the "How do you save the data" part:
In case you do not want to use any advanced ORM library (like Hibernate or any other JPA implementation) you can still use plain old JDBC. Or at least some Spring-JDBC, which is very lightweight and easy to use.
"What do you save"
H2 is a relational database, so whatever you save will end up in columns. But! If you really do not plan to query your data (nor apply migration scripts to it), saving your already XML-serialized objects is an option. You can easily define a table with an ID plus a "data" varchar column and save your XML there. There is no limit on data length in H2.
Note: Saving XML in a relational database is generally not a good idea. I am only advising you to evaluate this option, because you seem confident that you only need a certain set of features from what an SQL implementation can provide.
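A minimal sketch of that embedded-mode, ID-plus-data-column idea (assumes the H2 driver is on the classpath; the JDBC URL, table name and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void saveXml(String id, String xml) throws SQLException {
    // jdbc:h2:./data creates/opens an embedded database file in the working directory
    Connection con = DriverManager.getConnection("jdbc:h2:./data", "sa", "");
    try {
        con.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS storage (id VARCHAR PRIMARY KEY, data VARCHAR)");
        PreparedStatement ps = con.prepareStatement(
            "MERGE INTO storage (id, data) VALUES (?, ?)");   // insert-or-update by primary key
        ps.setString(1, id);
        ps.setString(2, xml);
        ps.executeUpdate();
    } finally {
        con.close();
    }
}

MERGE INTO acts as an insert-or-update keyed on the primary key, so repeated saves of the same ID overwrite the previous XML.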
Inconsistencies and concurrency are handled in two ways:
by locking
by versioning
Corrupted writes cannot be handled very well at the application level. The file system should support journaling, which tries to fix that up to some extent. You can also do this by
making your own journaling file (i.e. a short-lived separate file containing the changes to be committed to the real data file).
All of these features are available even in the simplest relational databases, e.g. H2 and SQLite, and even a web page can use such features in HTML5. It is overkill to reimplement these from scratch, and a proper implementation of the data storage layer will actually make your simple needs quite complicated.
But, just for the records:
Concurrency handling with locks
prior to starting to change the XML, use a file lock to gain exclusive access to the file (a FileLock sketch follows after this list); see also How can I lock a file using java (if possible)
once the update is done and you have successfully closed the file, release the lock
Consistency (atomicity) handling with locks
other application instances may still try to read the file while one of the apps is writing it. This can cause inconsistency (aka dirty reads). Ensure that during writing the writer process has an exclusive lock on the file. If it is not possible to gain an exclusive lock, the writer has to wait a bit and retry.
an application reading the file shall read it (if it can gain access, i.e. no other instance holds an exclusive lock), then close the file. If reading is not possible (because another app holds the lock), wait and retry.
still, an external application (e.g. Notepad) can change the XML. You may prefer an exclusive read lock while reading the file.
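Here is a hedged sketch of such a lock using java.nio.channels.FileLock (the file name and the truncate-and-rewrite strategy are my assumptions; note that on Linux these locks are advisory, so they only protect against other cooperating processes):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

static void updateWithLock(File dataFile, byte[] newContents) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(dataFile, "rw");
    try {
        FileChannel channel = raf.getChannel();
        FileLock lock = channel.lock();          // blocks until the exclusive lock is granted
        try {
            raf.setLength(0);                    // truncate and rewrite under the lock
            raf.write(newContents);
            channel.force(true);                 // flush content and metadata to disk
        } finally {
            lock.release();
        }
    } finally {
        raf.close();
    }
}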
Basic journaling
Here the idea is that if you need to do a lot of writes (or if you might later want to roll back your writes), you don't want to touch the real file. Instead:
changes are written to a separate journaling file, created and locked by your app instance
your app instance does not lock the main file, it locks only the journaling file
once all the writes are good to go, your app opens the real file with an exclusive write lock, commits every change from the journaling file, then closes the file.
As you can see, the solution with locks makes the file a shared resource that is protected by locks, and only one application can access the file at a time. This solves the concurrency issues, but it also makes file access a bottleneck. Therefore modern databases such as Oracle use versioning instead of locking. Versioning means that both the old and the new version of the file are available at the same time. Readers are served from the old, complete version. Once writing of the new version has finished, it is merged into the old version and the new data becomes available at once. This is trickier to implement, but since it allows all applications to read in parallel all the time, it scales much better.
To answer your three issues you mentioned:
Starting the application multiple times could lead to inconsistencies
Why would it lead to inconsistencies? If what you mean is that multiple concurrent edits will lead to inconsistencies, you just have to lock the file before editing. The easiest way is to create a lock file beside the file. Before starting an edit, just check whether a lock file exists.
If you want to make it more fault tolerant, you could also put a timeout on the lock file, e.g. a lock file is valid for 10 minutes. You could write a randomly generated UUID into the lock file, and before saving you could check whether the UUID still matches (a sketch follows below).
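A rough sketch of that lock-file-with-timeout idea (the file name, the 10-minute limit and the helper method are made up):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Tries to take the lock; returns false if someone else holds a lock younger than 10 minutes.
static boolean tryLock(File lockFile, String myUuid) throws IOException {
    if (lockFile.exists()
            && System.currentTimeMillis() - lockFile.lastModified() < 10 * 60 * 1000) {
        return false;
    }
    Files.write(lockFile.toPath(), myUuid.getBytes(StandardCharsets.UTF_8));
    // Read it back before saving data: if it still contains our UUID, the lock is ours.
    byte[] back = Files.readAllBytes(lockFile.toPath());
    return myUuid.equals(new String(back, StandardCharsets.UTF_8));
}

A caller would pass something like new File("data.xml.lck") and UUID.randomUUID().toString(), and check the UUID once more right before saving.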
Several users could run the application on a network drive and experience concurrency issues
I think this is the same as number 1.
Aborting the write process might lead to corrupted data or loosing all data
This can be solved by making the write atomic or the file immutable. To make it atomic, instead of editing the file directly, just copy the file and edit the copy. After the copy is saved, just rename the files. But if you want to be on the safer side, you could always do things like appending a timestamp to the file name and never editing or deleting a file. So every time an edit is made, you create a copy of it with a newer timestamp appended to the name. And for reading, you always read the newest one.
Note that this simple approach won't handle concurrent writes by different instances. If two instances make changes and save, simply picking the newest one will lose the changes from the other instance. As mentioned by other answers, you should probably try to use file locking for this.
A relatively simple solution:
use a separate lock file for writing, "data.xml.lck"; lock this when writing the file
as mentioned in my comment, write to a temp file first ("data.xml.tmp"), then rename it to the final name ("data.xml") when the write is complete. This gives a reasonable assurance that anyone reading the file will get a complete file.
even with the file locking, you still have to handle the "merge" problem (one instance reads, another writes, then the first wants to write). To handle this, you should have a version number in the file content. When an instance wants to write, it first acquires the lock, then checks its local version number against the file's version number. If it is out of date, it needs to merge what is in the file with its local changes; then it can write a new version.
After thinking about it for a while, I would want to try to implement it like this:
Open the data.<timestamp>.xml file with the latest timestamp.
Only use readonly mode.
Make changes.
Save the file as data.<timestamp>.xml - do not overwrite, and check that no file with a newer timestamp exists.
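A hedged sketch of those steps (the data.<timestamp>.xml naming follows the list above; the helper methods and the concurrency check are my assumptions):

import java.io.File;
import java.io.IOException;

// Picks the newest data.<timestamp>.xml in the directory (null if none exists yet).
static File pickLatest(File dir) {
    File[] files = dir.listFiles();
    File latest = null;
    if (files != null) {
        for (File f : files) {
            String n = f.getName();
            if (n.startsWith("data.") && n.endsWith(".xml")
                    && (latest == null || n.compareTo(latest.getName()) > 0)) {
                latest = f;   // timestamps compare correctly as strings while they have equal length
            }
        }
    }
    return latest;
}

// Refuses to produce a save target if a newer version appeared since we read ours.
static File fileForSave(File dir, String timestampWeReadFrom) throws IOException {
    File latest = pickLatest(dir);
    if (latest != null && !latest.getName().contains(timestampWeReadFrom)) {
        throw new IOException("newer version appeared: " + latest.getName());
    }
    return new File(dir, "data." + System.currentTimeMillis() + ".xml");
}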

Java File Marker API

I have a need for my application to be able to read large (very large, 100GB+) text files and process the content in these files potentially at different times. For instance, it might run for an hour and finish processing a few GBs, and then I shut it down and come back to it a few days later to resume processing the same file.
To do this I will need to read the files in memory-friendly chunks; each chunk/page/block/etc will be read in, one at a time, and processed, before the next chunk is read into memory.
I need the program to be able to mark where it is inside the input file, so if it shuts down, or if I need to "replay" the last chunk being processed, I can jump right to the point in the file where I am and continue processing. Specifically, I need to be able to do the following things:
When the processing begins, scan the file for a "MARKER" (some marker that indicates where we left off processing the last time)
If the MARKER exists, jump to it and begin processing from that point
Else, if the MARKER doesn't exist, then place a MARKER after the first chunk (for now, let's say that a "chunk" is just a line-of-text, as BufferedReader#readLine() would read in) and begin processing the first chunk/line
For each chunk/line processed, move the MARKER after the next chunk (thus, advancing the MARKER further down the file)
If we reach a point where there are no more chunks/lines after the current MARKER, we've finished processing the file
I tried coding this up myself and noticed that BufferedReader has some interesting methods on it that sound like they're suited for this very purpose: mark(), reset(), etc. But the Javadocs on them are kind of vague and I'm not sure that these "File Marker" methods will accomplish all the things I need to be able to do. I'm also completely open to a 3rd-party JAR/lib that has this capability built into it, but Google didn't turn anything up.
Any ideas here?
Forget about markers. You cannot "insert" text without rewriting the whole file.
Use a RandomAccessFile and store the current position you are reading. When you need to open the file again, just use seek to find the position.
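A minimal sketch of that approach (the file name is made up, and process() and savePosition() are placeholders for the real per-chunk work and for persisting progress, not a real API):

import java.io.IOException;
import java.io.RandomAccessFile;

static void processFrom(long savedPosition) throws IOException {
    RandomAccessFile raf = new RandomAccessFile("huge-input.txt", "r");
    try {
        raf.seek(savedPosition);                  // jump straight to where we left off
        String line;
        while ((line = raf.readLine()) != null) {
            process(line);                        // placeholder for the real per-chunk work
            long positionAfterChunk = raf.getFilePointer();
            savePosition(positionAfterChunk);     // persist progress, e.g. to a small properties file
        }
    } finally {
        raf.close();
    }
}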
A Reader's "mark" is not persistent; it solely forms part of the state of the Reader itself.
I suggest that you not store the state information in the text file itself; instead, have a file alongside which stores the byte offset of the most recently processed chunk. That'll eliminate the obvious problems involving overwriting data in the original text file.
The marker of the buffered reader is not persisted across different runs of your application. I would also not change the content of that huge file to mark a position, since this can lead to significant I/O and/or filesystem fragmentation, depending on your OS.
I would use a properties file to store the configuration of the program externally. Have a look at the documentation; the API is straightforward:
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
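A small sketch of persisting the byte offset with java.util.Properties (the file name "progress.properties" and the key "offset" are made up):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

static long loadPosition() throws IOException {
    File f = new File("progress.properties");
    if (!f.exists()) {
        return 0L;                                 // nothing processed yet
    }
    Properties props = new Properties();
    FileInputStream in = new FileInputStream(f);
    try {
        props.load(in);
    } finally {
        in.close();
    }
    return Long.parseLong(props.getProperty("offset", "0"));
}

static void savePosition(long offset) throws IOException {
    Properties props = new Properties();
    props.setProperty("offset", Long.toString(offset));
    FileOutputStream out = new FileOutputStream("progress.properties");
    try {
        props.store(out, "processing progress");
    } finally {
        out.close();
    }
}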

How to manage the creation and deletion of temporary files

I'm adding code to a large JSP web application, integrating functionality to convert CGM files to PDFs (or PDFs to CGMs) to display to the user.
It looks like I can create the converted files and store them in the directory designated by System.getProperty("java.io.tmpdir"). How do I manage their deletion, though? The program resides on a Linux-based server. Will the OS automatically delete from /tmp or will I need to come up with functionality myself? If it's the latter scenario, what are good ways to go about doing it?
EDIT: I see I can use deleteOnExit() (relevant answer elsewhere), but I think the JVM runs more or less continuously in the background, so I'm not sure the exits would be frequent enough.
I don't think I need to cache any converted files--just convert a file anew every time it's needed.
You can do this:
File file = File.createTempFile("base_name", ".tmp", new File(temporaryFolderPath));
file.deleteOnExit();
The file will be deleted when the virtual machine terminates.
Edit:
If you want to delete it after the job is done, just do it:
File file = null;
try {
    file = File.createTempFile("webdav", ".tmp", new File(temporaryFolderPath));
    // do something with the file
} finally {
    if (file != null) {   // guard against createTempFile having failed
        file.delete();
    }
}
There are ways to have the JVM delete files when it exits using deleteOnExit(), but I think there are known memory leaks with that method. Here is a blog post explaining the leak: http://www.pongasoft.com/blog/yan/java/2011/05/17/file-dot-deleteOnExit-is-evil/
A better solution would be either to delete old files using a cron job or, if you know you aren't going to use the file again, to just delete it after processing.
From your comment :
Also, could I just create something that checks to see if the size of my files exceeds a certain amount, and then deletes the oldest ones if that's true? Or am I overthinking it?
You could create a class that keeps track of the created files with a size limit. When the total size of the created files goes over the limit after creating a new one, it deletes the oldest one. Beware that this may delete a file that still needs to exist even if it is the oldest one; you might need a way to know which files still need to be kept and delete only those that are not needed anymore.
You could have a timer in the class that checks periodically instead of checking after each creation. This solution is tied to your application, while using a cron job isn't.
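A rough sketch of the core of such a size-based cleanup (the size limit and the assumption that the oldest files are no longer being served are mine):

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

// Deletes the oldest files in the directory until the total size is under the limit.
static void trimTempDir(File tmpDir, long maxTotalBytes) {
    File[] files = tmpDir.listFiles();
    if (files == null) {
        return;
    }
    Arrays.sort(files, new Comparator<File>() {   // oldest first
        public int compare(File a, File b) {
            return Long.compare(a.lastModified(), b.lastModified());
        }
    });
    long total = 0;
    for (File f : files) {
        total += f.length();
    }
    for (File f : files) {
        if (total <= maxTotalBytes) {
            break;
        }
        total -= f.length();
        f.delete();                               // assumes nobody is still reading this file
    }
}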
