Hadoop S3A filesystem, abort object upload? - java

I have code like
ParquetWriter<Record> writer = getParquetWriter("s3a://my_bucket/my_object_path.snappy.parquet");
for (Record r : someIterable) {
    validate(r);
    writer.write(r);
}
writer.close();
If validate throws an exception, I want to release all resources associated with the writer, but I don't want to create any objects in S3 in that case. Is this achievable?
If I close the writer it will conclude the s3 multipart upload and create an object in the cloud. If I don't close it, the parts written so far will remain in the disk buffer, clogging up the works.

Yes, it is a problem. It's been discussed in HADOOP-16906, "Add some Abortable.abort() interface for streams etc which can be terminated".
The problem here is that it's not enough to add this to the S3ABlockOutputStream class; we'd need to pass it through FSDataOutputStream etc., specify it in the FS APIs, define the semantics if the passthrough doesn't work, commit to maintaining it, and so on. A lot of effort. If you do want to do that, though, patches welcome...
Keep an eye on HDFS-13934, the multipart upload API. That will let you do the upload and then commit/abort it, but it doesn't quite fit your workflow.
I'm afraid you will have to go with the upload. Do remember to set a lifecycle rule for the bucket to delete old uploads, and look at the hadoop s3guard uploads command to list/abort them too.
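If you do end up going with the upload, one workaround is to let close() complete the multipart upload and then delete the unwanted object when validation failed. This is only a sketch, not something from the answer above; getParquetWriter, Record and someIterable are the placeholders from the question, and Path, FileSystem and Configuration come from org.apache.hadoop:
Path dest = new Path("s3a://my_bucket/my_object_path.snappy.parquet");
ParquetWriter<Record> writer = getParquetWriter(dest.toString());
boolean ok = false;
try {
    for (Record r : someIterable) {
        validate(r);
        writer.write(r);
    }
    ok = true;
} finally {
    writer.close();                                              // completes the multipart upload
    if (!ok) {
        FileSystem fs = dest.getFileSystem(new Configuration());
        fs.delete(dest, false);                                  // remove the object we never wanted
    }
}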

Related

How do file manipulations behave during a power outage

Linux machine, Java standalone application
I have the following situation:
I have:
a consecutive file write (which creates the destination file and writes some content to it) followed by a file move.
I also have a power outage problem, which instantly cuts off the computer's power during these operations.
As a result, I find that the file was created and moved, but its content is empty.
The question is: what, under the hood, can cause this exact outcome? Considering the timing, maybe the hard drive loses power before the processor and RAM during the cut-off, but in that case, how is it possible that the file is created and then moved, while the write that happens before the move is not successful?
I tried catching and logging the exception and debug information, but the problem is that the power outage disables logging (I/O) as well.
try {
    FileUtils.writeStringToFile(file, JsonUtils.toJson(object));
} finally {
    if (file.exists()) {
        FileUtils.moveFileToDirectory(file, new File(path), true);
    }
}
Linux file systems don't necessarily write things to disk immediately, or in exactly the order that you wrote them. That includes both file content and file / directory metadata.
So if you get a power failure at the wrong time, you may find that the file data and metadata is inconsistent.
Normally this doesn't matter. (If the power fails and you don't have a UPS, the applications go away without getting a chance to finish what they were doing.)
However, if it does matter, you can force the file to "sync" before you move it:
FileOutputStream fos = ...
// write to the file
fos.getFD().sync();
fos.close();
// now move it
You need to read the javadoc for sync() carefully to understand what the method actually does.
You also need to read the javadoc for the method you are using to move the file regarding atomicity.
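A minimal sketch of the whole pattern (write, sync, then move atomically); the class, method and path parameters are invented for illustration, and ATOMIC_MOVE makes Files.move fail rather than silently fall back to a copy-and-delete if the filesystem cannot rename atomically:
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class DurableWrite {
    // Write the content to a temporary file, force it to disk, then move it into place.
    static void writeAndMove(String content, Path tmp, Path dest) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(tmp.toFile())) {
            fos.write(content.getBytes(StandardCharsets.UTF_8));
            fos.getFD().sync();   // ask the OS to push the data to the storage device
        }
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}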

Minimal code to reliably store java object in a file

In my tiny little standalone Java application I want to store information.
My requirements:
read and write java objects (I do not want to use SQL, and also querying is not required)
easy to use
easy to setup
minimal external dependencies
I therefore want to use JAXB to store all the information in a simple XML file in the filesystem. My example application looks like this (copy all the code into a file called Application.java and compile, no additional requirements!):
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.transform.stream.StreamSource;

@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD) // bind the package-private fields directly
class DataStorage {
    String emailAddress;
    List<String> familyMembers = new ArrayList<>();
    // List<Address> addresses;
}

public class Application {
    private static JAXBContext jc;
    private static File storageLocation = new File("data.xml");

    public static void main(String[] args) throws Exception {
        jc = JAXBContext.newInstance(DataStorage.class);
        DataStorage dataStorage = load();
        // the main application will be executed here
        // data manipulation like this:
        dataStorage.emailAddress = "me@example.com";
        dataStorage.familyMembers.add("Mike");
        save(dataStorage);
    }

    protected static DataStorage load() throws JAXBException {
        if (storageLocation.exists()) {
            StreamSource source = new StreamSource(storageLocation);
            return (DataStorage) jc.createUnmarshaller().unmarshal(source);
        }
        return new DataStorage();
    }

    protected static void save(DataStorage dataStorage) throws JAXBException {
        jc.createMarshaller().marshal(dataStorage, storageLocation);
    }
}
How can I overcome these downsides?
Starting the application multiple times could lead to inconsistencies: several users could run the application on a network drive and experience concurrency issues.
Aborting the write process might lead to corrupted data or losing all data.
Looking at your requirements:
Starting the application multiple times
Several users could run the application on a network drive
Protection against data corruption
I believe that an XML-based file store will not be sufficient. If you consider a proper relational database overkill, you could still go for H2. This is a super-lightweight database that would solve all of the problems above (even if not perfectly, but surely much better than a handwritten XML store), and it is still very easy to set up and maintain.
You can configure it to persist your changes to disk, run it as a standalone server accepting multiple connections, or run it as part of your application in embedded mode.
Regarding the "How do you save the data" part:
In case you do not want to use any advanced ORM library (like Hibernate or any other JPA implementation) you can still use plain old JDBC. Or at least some Spring-JDBC, which is very lightweight and easy to use.
"What do you save"
H2 is a relational database, so whatever you save will end up in columns. But! If you really do not plan to query your data (nor apply migration scripts to it), saving your already XML-serialized objects is an option. You can easily define a table with an ID plus a "data" varchar column and save your XML there. There is no limit on data length in H2.
Note: Saving XML in a relational database is generally not a good idea. I am only advising you to evaluate this option, because you seem confident that you only need a certain set of features from what an SQL implementation can provide.
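Just to illustrate the ID + data-column idea, here is a minimal embedded-mode sketch (the table name, database file name and credentials are invented; it only needs the h2 jar on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

class H2Example {
    public static void main(String[] args) throws Exception {
        // Embedded mode: the database lives in ./data.mv.db next to the application.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./data", "sa", "")) {
            try (PreparedStatement st = con.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS storage(id BIGINT PRIMARY KEY, data CLOB)")) {
                st.execute();
            }
            // MERGE acts as insert-or-update keyed on the id column.
            try (PreparedStatement st = con.prepareStatement(
                    "MERGE INTO storage KEY(id) VALUES (?, ?)")) {
                st.setLong(1, 1L);
                st.setString(2, "<dataStorage>...</dataStorage>"); // the serialized XML
                st.executeUpdate();
            }
        }
    }
}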
Inconsistencies and concurrency are handled in two ways:
by locking
by versioning
Corrupted writes cannot be handled very well at the application level. The file system should support journaling, which mitigates this to some extent. You can also do it yourself by
making your own journaling file (i.e. a short-lived separate file containing changes to be committed to the real data file).
All of these features are available even in the simplest relational databases, e.g. H2 or SQLite, and even a web page can use such features in HTML5. It is quite an overkill to reimplement them from scratch, and a proper implementation of the data storage layer will actually make your simple needs quite complicated.
But, just for the record:
Concurrency handling with locks
prior to starting to change the XML, use a file lock to gain exclusive access to the file; see also How can I lock a file using java (if possible)
once the update is done and you have successfully closed the file, release the lock
Consistency (atomicity) handling with locks
other application instances may still try to read the file while one of the apps is writing it. This can cause inconsistency (a.k.a. dirty reads). Ensure that during writing the writer process has an exclusive lock on the file. If it is not possible to gain an exclusive lock, the writer has to wait a bit and retry.
an application reading the file should read it (if it can gain access, i.e. no other instance holds an exclusive lock), then close the file. If reading is not possible (because another app holds the lock), wait and retry.
an external application (e.g. a text editor) can still change the XML. You may prefer an exclusive read lock while reading the file.
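As a rough sketch of the exclusive-lock idea with java.nio (class and method names invented; note that FileLock is advisory, so it only protects against other processes that also use FileLock):
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

class LockedUpdate {
    // Hold an exclusive lock on the data file for the duration of the update.
    static void updateWithLock(File file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.lock()) {   // blocks until the exclusive lock is granted
            // read the current XML through the channel, apply the changes, write it back
        }
    }
}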
Basic journaling
The idea here is that if you need to do a lot of writes (or if you might later want to roll back your writes), you don't want to touch the real file. Instead:
changes are written to a separate journaling file, created and locked by your app instance
your app instance does not lock the main file, it locks only the journaling file
once all the writes are good to go, your app opens the real file with an exclusive write lock, commits every change from the journaling file, then closes the file.
As you can see, the solution with locks makes the file a shared resource protected by locks, which only one application can access at a time. This solves the concurrency issues, but it also makes file access a bottleneck. Therefore modern databases such as Oracle use versioning instead of locking. Versioning means that both the old and the new version of the file are available at the same time. Readers are served by the old, most complete file. Once writing of the new version is finished, it is merged into the old version, and the new data becomes available at once. This is trickier to implement, but since it allows all applications to read in parallel all the time, it scales much better.
To answer your three issues you mentioned:
Starting the application multiple times could lead to inconsistencies
Why would it lead to inconsistencies? If what you mean is that multiple concurrent edits would lead to inconsistencies, you just have to lock the file before editing. The easiest way is to create a lock file beside the file. Before starting an edit, just check whether a lock file exists.
If you want to make it more fault-tolerant, you could also put a timeout on the lock, e.g. a lock file is valid for 10 minutes. You could write a randomly generated UUID into the lock file and, before saving, check whether the UUID still matches.
Several users could run the application on a network drive and experience concurrency issues
I think this is the same as number 1.
Aborting the write process might lead to corrupted data or loosing all data
This can be solved by making the write atomic or the file immutable. To make it atomic, instead of editing the file directly, just copy the file and edit the copy. After the copy is saved, rename it over the original. If you want to be on the safer side, you could instead append a timestamp to the file name and never edit or delete a file: every time an edit is made, you create a copy with a newer timestamp in its name, and for reading you always pick the newest one.
Note that your simple answer won't handle concurrent writes by different instances. If two instances make changes and save, simply picking the newest file will lose the changes from the other instance. As mentioned in other answers, you should probably use file locking for this.
A relatively simple solution:
Use a separate lock file for writing, "data.xml.lck". Lock this when writing the file.
As mentioned in my comment, write to a temp file first, "data.xml.tmp", then rename it to the final name, "data.xml", when the write is complete. This gives a reasonable assurance that anyone reading the file will get a complete file.
Even with the file locking, you still have to handle the "merge" problem (one instance reads, another writes, then the first wants to write). To handle this you should have a version number in the file content. When an instance wants to write, it first acquires the lock, then checks its local version number against the file's version number. If it is out of date, it needs to merge what is in the file with its local changes. Then it can write a new version.
After thinking about it for a while, I would want to try to implement it like this:
Open the data.<timestamp>.xml file with the latest timestamp.
Only open it in read-only mode.
Make changes.
Save the file as data.<timestamp>.xml with a new timestamp; do not overwrite an existing file, and check that no file with a newer timestamp exists.
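A small sketch of the bookkeeping for that scheme (helper names are invented, and the race between checking for a newer file and writing your own is not handled here):
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

class TimestampedStore {
    // Find the data.<timestamp>.xml file with the largest timestamp.
    static Optional<File> latest(File dir) {
        File[] versions = dir.listFiles((d, name) -> name.matches("data\\.\\d+\\.xml"));
        if (versions == null) return Optional.empty();
        return Arrays.stream(versions)
                .max(Comparator.comparingLong(f -> timestampOf(f.getName())));
    }

    static long timestampOf(String name) {
        return Long.parseLong(name.substring("data.".length(), name.length() - ".xml".length()));
    }

    // New versions get a fresh name; existing files are never overwritten.
    static File nextVersion(File dir) {
        return new File(dir, "data." + System.currentTimeMillis() + ".xml");
    }
}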

Is there a more efficient way of sending an MP4 file to the user?

I am using Spring-MVC and I need to send an MP4 file back to the user. The MP4 files are, of course, very large in size (> 2 GB).
I found this SO thread Downloading a file from spring controllers, which shows how to stream back a binary file, which should theoretically work for my case. However, what I am concerned about is efficiency.
In one case, an answer suggests loading all the bytes into memory:
byte[] data = SomeFileUtil.loadBytes(new File("somefile.mp4"));
In another case, an answer suggests using IOUtils:
InputStream is = new FileInputStream(new File("somefile.mp4"));
OutputStream os = response.getOutputStream();
IOUtils.copy(is, os);
I wonder if either of these is more memory-efficient than simply defining a resource mapping?
<resources mapping="/videos/**" location="/path/to/videos/"/>
The resource mapping may work, except that I need to protect all requests to these videos, and I do not think resource mapping will lend itself to logic that protects the content.
Is there another way to stream back binary data (namely, MP4)? I'd like something that's memory efficient.
I would think that defining a resource mapping would be the cleanest way of handling it. With regard to protecting access, you can simply add /videos/** to your security configuration and define what access you allow for it via something like
<security:intercept-url pattern="/videos/**" access="ROLE_USER, ROLE_ADMIN"/>
or whatever access you desire.
Also, you might consider saving these large MP4s to cloud storage and/or a CDN such as Amazon S3 (with or without CloudFront).
Then you can generate unique URLs which last as long as you want them to. The download is then handled by Amazon rather than using the computing power, disk space, and memory of your web server to serve up the large resource files. Also, if you use something like CloudFront, you can configure it for streaming rather than download.
Loading the entire file into memory is worse: it uses more memory, doesn't scale, and you don't transmit any data until you've loaded it all, which adds latency.
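For comparison, a rough sketch of the copy-in-chunks approach inside a controller (the mapping, file path and buffer size are invented); only the small buffer ever sits in memory, which is essentially what IOUtils.copy does for you:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class VideoController {

    @RequestMapping("/videos/somefile")
    public void video(HttpServletResponse response) throws Exception {
        File file = new File("/path/to/videos/somefile.mp4");
        response.setContentType("video/mp4");
        response.setHeader("Content-Length", String.valueOf(file.length()));
        byte[] buffer = new byte[8192];
        try (InputStream in = new FileInputStream(file);
             OutputStream out = response.getOutputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}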

How to create a file that streams to http response

I'm writing a web application and want the user to be able to click a link and get a file download.
I have an interface in a third-party library that I can't alter:
writeFancyData(File file, Data data);
Is there an easy way to create a File object that I can pass to this method so that, when written to, it streams to the HTTP response?
Notes:
Obviously I could just write a temporary file, then read it back in and write it to the output stream of the HTTP response. However, what I'm looking for is a way to avoid the file system I/O, ideally by creating a fake file that, when written to, instead writes to the output stream of the HTTP response.
e.g.
writeFancyData(new OutputStreamBackedFile(response.getOutputStream()), data);
I need to use the writeFancyData method as it writes a file in a very specific format that I can't reproduce.
Assuming writeFancyData is a black box, it's not possible. As a thought experiment, consider an implementation of writeFancyData that did something like this:
public void writeFancyData(File file, Data data) {
    File localFile = new File(file.getPath());
    ...
    // process data from the file
    ...
}
Given that the only thing you can return from any extended version of File is the path name, you're just not going to be able to get the data you want into that method. If the signature included some sort of stream you would be in a much better position, but since all you can pass in is a File, this can't be done.
In practice the implementation probably uses one of the FileInputStream or FileReader classes, which use the File object really just for the name and then call out to native methods to get a file descriptor and handle the actual I/O.
As dlawrence writes, given only the API it is impossible to determine what it is doing with the File.
A non-Java approach is to create a named pipe. You could establish a reader for the pipe in your program, create a File on that path and pass it to the API.
Before doing anything so fancy, I would recommend analyzing performance and verify that disk i/o is indeed a bottleneck.
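A Linux-only sketch of that idea (the FIFO path and class are invented, writeFancyData and Data are the question's API, and error handling is omitted):
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

class NamedPipeSketch {
    // Create a FIFO, drain it into the response stream on another thread,
    // then hand the FIFO path to the black-box API as a plain File.
    static void streamViaFifo(OutputStream httpOut, Data data) throws Exception {
        File fifo = new File("/tmp/fancy-data.fifo");
        new ProcessBuilder("mkfifo", fifo.getAbsolutePath()).start().waitFor();

        Thread drainer = new Thread(() -> {
            try (InputStream in = new FileInputStream(fifo)) { // blocks until a writer opens the FIFO
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    httpOut.write(buf, 0, n);
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        drainer.start();

        writeFancyData(fifo, data);   // the third-party call from the question
        drainer.join();
        fifo.delete();
    }
}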
Given that API, the best you can do is to give it the File for a file in a RAM disk filesystem.
And lodge a bug / defect report against the API asking for an overload that takes a Writer or OutputStream argument.

ext4/fsync situation unclear in Android (Java)

Tim Bray's article "Saving Data Safely" left me with open questions. Today, it's over a month old and I haven't seen any follow-up on it, so I decided to address the topic here.
One point of the article is that FileDescriptor.sync() should be called to be on the safe side when using FileOutputStream. At first, I was very irritated, because I have never seen any Java code doing a sync in the 12 years I have been doing Java, especially since dealing with files is a pretty basic thing. Also, the standard JavaDoc of FileOutputStream never hinted at syncing (Java 1.0 - 6). After some research, I figured ext4 may actually be the first mainstream file system requiring syncing. (Are there other file systems where explicit syncing is advised?)
I appreciate some general thoughts on the matter, but I also have some specific questions:
When will Android do the sync to the file system? This could be periodic and additionally based on life cycle events (e.g. an app's process goes to the background).
Does FileDescriptor.sync() take care of syncing the metadata, i.e. syncing the directory entry of the changed file? Compare with FileChannel.force().
Usually, one does not directly write into the FileOutputStream. Here's my solution (do you agree?):
FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
    out.flush();
    fileOut.getFD().sync();
} finally {
    out.close();
}
Android will do the sync when it needs to -- such as when the screen turns off, shutting down the device, etc. If you are just looking at "normal" operation, explicit sync by applications is never needed.
The problem comes when the user pulls the battery out of their device (or does a hard reset of the kernel), and you want to ensure you don't lose any data.
So the first thing to realize: the issue is when power is suddenly lost, so a clean shutdown can not happen, and the question of what is going to happen in persistent storage at that point.
If you are just writing a single independent new file, it doesn't really matter what you do. The user could have pulled the battery while you were in the middle of writing, right before you started writing, etc. If you don't sync, it just means there is a longer window after you finish writing during which pulling the battery will lose the data.
The big concern here is when you want to update a file. In that case, when you next read the file you want to have either the previous contents, or the new contents. You don't want to get something half-way written, or lose the data.
This is often done by writing the data in a new file, and then switching to that from the old file. Prior to ext4 you knew that, once you had finished writing a file, further operations on other files would not go on disk until the ones on that file, so you could safely delete the previous file or otherwise do operations that depend on your new file being fully written.
However, now if you write the new file, then delete the old one, and the battery is pulled, when you next boot you may see that the old file is deleted and the new file created, but the contents of the new file are not complete. By doing the sync, you ensure that the new file is completely written at that point, so you can make further changes (such as deleting the old file) that depend on that state.
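A bare sketch of that update pattern (file names and variables are placeholders): write the new file, sync it, and only then remove the old one.
File newFile = new File(dir, "state.new");
File oldFile = new File(dir, "state");
try (FileOutputStream out = new FileOutputStream(newFile)) {
    out.write(newContents);
    out.getFD().sync();     // make sure the new file is fully on disk...
}
oldFile.delete();           // ...before removing the old one
newFile.renameTo(oldFile);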
fileOut.getFD().sync(); should be in the finally clause, before the close().
sync() is far more important than close() where durability is concerned.
So, every time you want to 'finish' working on a file, you should sync() it before close()ing it.
POSIX does not guarantee that pending writes will be written to disk when you issue a close().
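Applied to the snippet from the question (ctx, file and something as above), that ordering would look roughly like this:
FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
} finally {
    out.flush();              // push the buffered bytes down to the FileOutputStream
    fileOut.getFD().sync();   // force them to stable storage
    out.close();
}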
