What is the recommended way to append to files on HDFS? - java

I'm having trouble figuring out a safe way to append to files in HDFS.
I'm using a small, 3-node Hadoop cluster (CDH v5.3.9, to be specific). Our process is a multi-threaded (8 threads) data pipeliner, and it has a stage which appends lines of delimited text to files in a dedicated directory on HDFS. I'm using locks to synchronize the threads' access to the buffered writers which append the data.
My first issue is deciding on the approach generally.
Approach A is to open the file, append to it, then close it for every line appended. This seems slow and seems likely to create too many small blocks; at least, I see that sentiment in various posts.
Approach B is to cache the writers but periodically refresh them to make sure the list of writers doesn't grow unbounded (currently, it's one writer per input file processed by the pipeliner). This seems like a more efficient approach, but I imagine that keeping streams open over a period of time, however controlled, may be an issue, especially for readers of the output files.
Beyond this, my real issues are the following two exceptions. I am using the Hadoop FileSystem Java API to do the appending, and I intermittently get these:
org.apache.hadoop.ipc.RemoteException: failed to create file /output/acme_20160524_1.txt for DFSClient_NONMAPREDUCE_271210261_1 for client XXX.XX.XXX.XX because current leaseholder is trying to recreate file.
org.apache.hadoop.ipc.RemoteException: BP-1999982165-XXX.XX.XXX.XX-1463070000410:blk_1073760252_54540 does not exist or is not under Construction blk_1073760252_54540{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[[DISK]DS-ccdf4e55-234b-4e17-955f-daaed1afdd92:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-1f66db61-759f-4c5d-bb3b-f78c260e338f:NORMAL|RBW]]}
Anyone have any ideas on either of those?
For the first problem, I've tried implementing the logic discussed in this post, but it didn't seem to help.
I'm also interested in the role of the dfs.support.append property, if at all applicable.
My code for getting the file system:
userGroupInfo = UserGroupInformation.createRemoteUser("hdfs");
final Configuration conf = new Configuration();
conf.set(key1, val1);
...
conf.set(keyN, valN);
fileSystem = userGroupInfo.doAs(new PrivilegedExceptionAction<FileSystem>() {
    public FileSystem run() throws Exception {
        return FileSystem.get(conf);
    }
});
My code for getting the OutputStream:
org.apache.hadoop.fs.Path file = ...

public OutputStream getOutputStream(boolean append) throws IOException {
    OutputStream os = null;
    synchronized (file) {
        if (isFile()) {
            os = (append) ? fs.append(file) : fs.create(file, true);
        } else if (append) {
            // Create the file first, to avoid a "failed to append to non-existent file" exception
            FSDataOutputStream dos = fs.create(file);
            dos.close();
            // or, this can be: fs.createNewFile(file);
            os = fs.append(file);
        } else {
            // Creating a new file
            os = fs.create(file);
        }
    }
    return os;
}

I got file appending working with CDH 5.3 / HDFS 2.5.0. My conclusions so far are as follows:
We cannot have one dedicated appending thread per file, nor multiple threads writing to multiple files, whether we write the data via one and the same instance of the HDFS FileSystem API or via different instances.
Cannot refresh (i.e. close and reopen) the writers; they must stay open.
Keeping the writers open leads to an occasional, relatively rare ClosedChannelException, which appears to be recoverable (by retrying the append).
We use a single-thread executor service with a blocking queue (one queue for appends to all files); one writer per file, and the writers stay open until the end of processing, when they are closed. A minimal sketch of this setup is shown below.
When we upgrade to a CDH newer than 5.3, we'll want to revisit this and see which threading strategy makes sense: one and only one thread, one thread per file, or multiple threads writing to multiple files. Additionally, we'll want to see whether writers can be (or need to be) periodically closed and reopened.
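Here is a rough sketch of that single-appender-thread setup (assumed names; fs is the FileSystem instance obtained as above, and the retry logic is elided):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleThreadAppender {
    private final FileSystem fs;                              // obtained via doAs() as above
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final Map<String, FSDataOutputStream> writers = new HashMap<String, FSDataOutputStream>();

    public SingleThreadAppender(FileSystem fs) {
        this.fs = fs;
    }

    // Called from any pipeline thread; the executor's queue serializes all appends.
    public void appendLine(final String pathStr, final String line) {
        executor.submit(new Runnable() {
            public void run() {
                try {
                    FSDataOutputStream out = writers.get(pathStr);
                    if (out == null) {
                        Path path = new Path(pathStr);
                        out = fs.exists(path) ? fs.append(path) : fs.create(path);
                        writers.put(pathStr, out);            // writer stays open until shutdown
                    }
                    out.write((line + "\n").getBytes("UTF-8"));
                } catch (IOException e) {
                    // a rare ClosedChannelException here is recoverable by retrying the append
                }
            }
        });
    }

    public void shutdown() throws InterruptedException, IOException {
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        for (FSDataOutputStream out : writers.values()) {
            out.close();                                      // writers are closed only at the end
        }
    }
}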
In addition, I have seen the following error as well, and was able to make it go away by setting 'dfs.client.block.write.replace-datanode-on-failure.policy' to 'NEVER' on the client side.
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[XXX.XX.XXX.XX:50010, XXX.XX.XXX.XX:50010], original=[XXX.XX.XXX.XX:50010, XXX.XX.XXX.XX:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969) ~[hadoop-hdfs-2.5.0.jar:?]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035) ~[hadoop-hdfs-2.5.0.jar:?]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184) ~[hadoop-hdfs-2.5.0.jar:?]
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:532) ~[hadoop-hdfs-2.5.0.jar:?]
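For reference, applying that setting on the client Configuration looks like this (a sketch using the standard HDFS client property named in the error message; on a 3-node cluster with replication factor 3 there is no spare datanode to swap into a failed pipeline anyway):

Configuration conf = new Configuration();
// Do not try to replace a failed datanode in the write pipeline.
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");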

Related

Delete file after starting connection using FileInputStream

I have a temporary file which I want to send to the client from the controller in the Play Framework. Can I delete the file after opening a connection using FileInputStream? For example, can I do something like this:
File file = getFile();
InputStream is = new FileInputStream(file);
file.delete();
renderBinary(is, "name.txt");
What if the file is a large file? If I delete the file, will subsequent read()s on the InputStream give an error? I have tried with files of around 1 MB and I don't get an error.
Sorry if this is a very naive question, but I could not find anything related to this and I am pretty new to Java.
I just encountered this exact same scenario in some code I was asked to work on. The programmer was creating a temp file, getting an input stream on it, deleting the temp file and then calling renderBinary. It seems to work fine even for very large files, even into the gigabytes.
I was surprised by this and am still looking for some documentation that indicates why this works.
UPDATE: We did finally encounter a file that caused this thing to bomb. I think it was over 3 GB. At that point, it became necessary NOT to delete the file while the rendering was in progress. I actually ended up using the Amazon Queue service to queue up messages for these files. The messages are then retrieved by a scheduled deletion job. It works out nicely, even with clustered servers behind a load balancer.
It seems counter-intuitive that the FileInputStream can still read after the file is removed.
DiskLruCache, a popular library in the Android world originating from the libcore of the Android platform, even relies on this "feature", as follows:
// Open all streams eagerly to guarantee that we see a single published
// snapshot. If we opened streams lazily then the streams could come
// from different edits.
InputStream[] ins = new InputStream[valueCount];
try {
    for (int i = 0; i < valueCount; i++) {
        ins[i] = new FileInputStream(entry.getCleanFile(i));
    }
} catch (FileNotFoundException e) {
    ....
As @EJP pointed out in his comment on a similar question, "That's how Unix and Linux behave. Deleting a file is really deleting its name from the directory: the inode and the data persist while any processes have it open."
But I don't think it is a good idea to rely on it.
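For completeness, here is a tiny demo of the behaviour (hypothetical temp file). It holds on Unix-like systems; on Windows, the delete typically fails while the stream is open:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ReadAfterDelete {
    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("demo", ".txt");
        Files.write(file.toPath(), "hello".getBytes(StandardCharsets.UTF_8));

        InputStream is = new FileInputStream(file);
        System.out.println("deleted: " + file.delete()); // unlinks the directory entry only

        // The open stream can still read the data because the inode is kept
        // alive by the open file descriptor (Unix/Linux semantics).
        System.out.println(new BufferedReader(new InputStreamReader(is)).readLine());
        is.close();
    }
}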

File issues with threading in tomcat

I have a Tomcat server and a controller which writes the data coming in the request to a file. My question is whether multiple threads within the server can write to the same file at the same time and cause issues.
My requirement is that all requests append data to the same file. I am not using any threading on my end.
My code is as follows:
File file = new File(fileName);
try {
    if (!file.exists()) {
        file.createNewFile();
    }
    InputStream inputStream = request.getInputStream();
    FileWriter fileWriter = new FileWriter(fileName, true);
    BufferedWriter bufferWriter = new BufferedWriter(fileWriter);
    bufferWriter.write(IOUtils.toString(inputStream));
    bufferWriter.flush();
    bufferWriter.close();
} catch (IOException e) {
    e.printStackTrace();
}
There is a standard solution for this kind of issue.
You have to create a singleton class which is shared between all threads.
This singleton holds a BlockingQueue (e.g. a LinkedBlockingQueue) into which all threads put the messages they want written to the single file.
The singleton is itself also a Thread, and inside its run() method it continuously takes values from the queue and writes them sequentially to the target file.
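A rough sketch of such a singleton (assumed class name and file path; error handling kept minimal):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleFileWriter extends Thread {
    // Hypothetical target file; in the question this would be the shared fileName.
    private static final SingleFileWriter INSTANCE = new SingleFileWriter("/var/data/requests.log");
    static {
        INSTANCE.setDaemon(true);
        INSTANCE.start();
    }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final String fileName;

    private SingleFileWriter(String fileName) {
        this.fileName = fileName;
    }

    public static SingleFileWriter getInstance() {
        return INSTANCE;
    }

    // Called from any request-handling thread; enqueueing never touches the file.
    public void append(String data) {
        queue.offer(data);
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            try {
                String data = queue.take();  // blocks until a message is available
                BufferedWriter out = new BufferedWriter(new FileWriter(fileName, true));
                out.write(data);
                out.newLine();
                out.close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                // log and keep going; the single thread serializes all writes
            }
        }
    }
}

A controller would then just call SingleFileWriter.getInstance().append(IOUtils.toString(request.getInputStream())) and never touch the file directly.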
My requirement is that all requests appends data to the same file
Doing a task for each request (like logging, or in your case appending text to a file) is best implemented using a filter (javax.servlet.Filter). You then don't have to create a singleton manually, and you can turn the filter on or off depending on whether you need its functionality.
However, you still need to synchronize concurrent access to your file. As Andremoniy pointed out, you can do this using a dedicated Thread, so that your filter does not block the request/response.
EDIT
One thing about the shared object used to write to the file: It is better to store an instance of this object in the javax.servlet.ServletContext rather than creating a singleton object. This is the standard way to go if you need to have an object accessible by all other components in a Java web application using servlets.
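A hypothetical sketch of that registration, done from a ServletContextListener (names are illustrative, not from the question):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class SharedWriterInitializer implements ServletContextListener {
    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // e.g. the queue-backed writer object sketched in the other answer
        Object sharedWriter = createSharedWriter();
        sce.getServletContext().setAttribute("sharedFileWriter", sharedWriter);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // close/stop the shared writer here
    }

    private Object createSharedWriter() {
        return new Object(); // placeholder for the real writer
    }
}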

Reading a file and editing it in Java

What I am doing is reading in an HTML file and looking for a specific location in the HTML where I can insert some text.
So I am using a BufferedReader to read in the HTML file and split it by the </HEAD> tag. I want to insert some text just before this tag, but I am not sure how to do this. The HTML would then read along the lines of ...(newText)</HEAD>.
Would I need a PrintWriter to the same file, and if so, how would I tell it to write in the correct location?
I am not sure which way would be the most efficient for something like this.
Please help. Thanks in advance.
Here is part of my java code:
File f = new File("newFile.html");
try {
    FileOutputStream fos = new FileOutputStream(f);
    PrintWriter pw = new PrintWriter(fos);
    BufferedReader read = new BufferedReader(new FileReader("file.html"));
    String str;
    int i = 0;
    boolean found = false;
    while ((str = read.readLine()) != null) {
        String[] data = str.split("</HEAD>");
        if (found == false) {
            pw.write(data[0]);
            System.out.println(data[0]);
            pw.write("</script>");
            found = true;
        }
        if (i < 1) {
            pw.write(data[1]);
            System.out.println(data[1]);
            i++;
        }
        pw.write(str);
        System.out.println(str);
    }
}
catch (Exception e) {
    e.printStackTrace();
}
When I do this it gets to a point in the file and I get these errors:
FATAL ERROR: MERLIN: Unable to connect to EDG API,
Cannot find .edg_properties file.,
java.lang.OutOfMemoryError: unable to create new native thread,
Cannot truncate table,
EXCEPTION:Cannot open connection to server: SQLExceptio,
Caught IOException: java.io.IOException: JZ0C0: Connection is already closed, ...
I'm not sure why I get these or what they all mean.
Please help.
Should be pretty easy:
Read file into a String
Split into before/after chunks
Open a temp file for writing
Write before chunk, your text, after chunk
Close up, and move temp file to original
Sounds like you are wondering about the last couple steps in particular. Here is the essential code:
File htmlFile = ...;
...
File tempFile = File.createTempFile("foo", ".html");
FileWriter writer = new FileWriter(tempFile);
writer.write(before);
writer.write(yourText);
writer.write(after);
writer.close();
tempFile.renameTo(htmlFile);
Most people suggest writing to a temporary file and then copying the temporary file over the original on successful completion.
The forum thread has some ideas of how to do it.
GL.
For reading and writing you can use FileReaders/FileWriters or the corresponding IO stream classes.
For the editing, I'd suggest using an HTML parser to handle the document. It can read the HTML document into an internal data structure, which simplifies searching for content and applying modifications. (Most?) parsers can serialize the document back to HTML.
At the very least, you can be sure not to corrupt the HTML document structure.
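For example, with jsoup (one HTML parser among several; a sketch rather than a drop-in solution, reusing the file names from the question):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class InsertBeforeHead {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.parse(new File("file.html"), "UTF-8");
        doc.head().append("<script>/* newText */</script>");  // lands just before </head>
        Files.write(new File("newFile.html").toPath(),
                    doc.outerHtml().getBytes(StandardCharsets.UTF_8));
    }
}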
Following up on the list of errors in your edit, a lot of that possibly stems from the OutOfMemoryError. That means you simply ran out of memory in the JVM, so Java was unable to allocate objects. This may be caused by a memory leak in your application, or it could simply be that the work you're trying to do does need more memory transiently than you have allocated it.
You can increase the amount of memory that the JVM starts up with by providing the -Xmx argument to the java executable, e.g.:
-Xmx1024m
would set the maximum heap size to 1024 megabytes.
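For example (hypothetical jar name):
java -Xmx1024m -jar yourApp.jar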
The other issues might possibly be caused by this; when objects can't reliably be created or modified, lots of weird things tend to happen. That said, there are a few things you can take action on. In particular, whatever MERLIN is, it looks like it can't do its work because it needs a properties file for EDG, which it's unable to find in the location where it's looking. You'll probably need to either put a config file there or tell it to look at another location.
The other IOExceptions are fairly self-explanatory. Your program could not establish a connection to the server because of a SQLException (the underlying exception itself will probably be found in the logs); and some other part of the program tried to communicate to a remote machine using a closed connection.
I'd look at fixing the properties file (if it's not a benign error) and the memory issues first, and then seeing if any of the remaining problems still manifest.

Java file locking on a network

This is perhaps similar to previous posts, but I want to be specific about the use of locking on a network, rather than locally. I want to write a file to a shared location, so it may well go on a network (certainly a Windows network, maybe Mac). I want to prevent other people from reading any part of this file whilst it is being written. This will not be a highly concurrent process, and the files will typically be less than 10 MB.
I've read the FileLock documentation and File documentation and am left somewhat confused, as to what is safe and what is not. I want to lock the entire file, rather than portions of it.
Can I use FileChannel.tryLock()? Is it safe on a network, or does it depend on the type of network? Will it work on a standard Windows network (if there is such a thing)?
If this does not work, is the best approach to create a zero-byte file or directory as a lock file, and then write out the main file? Why does the File.createNewFile() documentation say not to use it for file locking? I appreciate this is subject to race conditions and is not ideal.
This can't be reliably done on a network file system. As long as your application is the only application that accesses the file, it's best to implement some kind of cooperative locking process (perhaps writing a lock file to the network filesystem when you open the file). The reason that is not recommended, however, is that if your process crashes or the network goes down or any other number of issues happen, your application gets into a nasty, dirty state.
You can have an empty token file lying on the server you want to write to.
When you want to write to the server, you first acquire ("catch") the token. Only while you hold the token should you write to any file lying on the server.
When you are done with your file operations, or an exception was thrown, you have to release the token.
The helper class can look like this:
// TOKEN_FILE, CANT_CATCH_TOKEN, CANT_RELEASE_TOKEN and TokenException are assumed to be defined elsewhere.
public class SLTokenLock {
    private FileLock lock;
    private File tokenFile;

    public SLTokenLock(String serverDirectory) {
        String tokenFilePath = serverDirectory + File.separator + TOKEN_FILE;
        tokenFile = new File(tokenFilePath);
    }

    public void catchCommitToken() throws TokenException {
        RandomAccessFile raf;
        try {
            raf = new RandomAccessFile(tokenFile, "rw"); //$NON-NLS-1$
            FileChannel channel = raf.getChannel();
            lock = channel.tryLock();
            if (lock == null) {
                throw new TokenException(CANT_CATCH_TOKEN);
            }
        } catch (Exception e) {
            throw new TokenException(CANT_CATCH_TOKEN, e);
        }
    }

    public void releaseCommitToken() throws TokenException {
        try {
            if (lock != null && lock.isValid()) {
                lock.release();
            }
        } catch (Exception e) {
            throw new TokenException(CANT_RELEASE_TOKEN, e);
        }
    }
}
Your operations should then look like this:
try {
    token.catchCommitToken();
    // WRITE or READ files inside the directory
} finally {
    token.releaseCommitToken();
}
I found this bug report which describes why the note about file locking was added to the File.createNewFile documentation.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4676183
It states:
If you mark the file as deleteOnExit before invoking createNewFile but the file already exists, you run the risk of deleting a file you didn't create and dropping someone elses lock! On the other hand, if you mark the file after creating it, you lose atomicity: if the program exits before the file is marked, it won't get deleted and the lock will be "wedged".
So it looks like the main reason locking is discouraged with File.createNewFile() is that you can end up with orphaned lock files if the JVM unexpectedly terminates before you have a chance to delete them. If you can deal with orphaned lock files then it could be used as a simple locking mechanism. However, I wouldn't recommend the method suggested in the comments of the bug report, as it has race conditions around reading/writing the timestamp value and reclaiming the expired lock.
Rather than implementing a locking strategy which will, in all likelihood, rely on readers to adhere to your convention but will not force them to, perhaps you can write the file out to a hidden or obscurely named file where it will be effectively invisible to readers. When the write operation is complete, rename the file to the expected public name.
The downside is that hiding and/or renaming without additional IO may require you to use native OS commands, but the procedure to do so should be fairly simple and deterministic.
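A sketch of that write-then-rename idea using java.nio.file (hypothetical UNC paths; whether the final rename is atomic depends on the network file system):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class WriteThenRename {
    public static void main(String[] args) throws Exception {
        Path temp = Paths.get("\\\\server\\share\\.report.tmp");   // obscure "hidden" name
        Path finalName = Paths.get("\\\\server\\share\\report.dat");

        Files.write(temp, "example content".getBytes(StandardCharsets.UTF_8));
        try {
            // Readers only ever see the complete file under its public name.
            Files.move(temp, finalName, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            Files.move(temp, finalName, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}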

How to handle incomplete files? Getting exception

I need to create a Java program which creates a thread to search for a file in a particular folder (the source folder) and, once it finds the file, picks it up immediately for processing (converting it into CSV format). The problem I am facing is that the file arriving in the source folder is large (an FTP tool is used to copy it from the server to the source folder), and the thread picks the file up before it has been fully copied, throwing an exception. How do I make the thread wait until the file has been completely copied into the source folder? It should pick the file up for processing only after the copy has finished.
The safest way is to download the file to a different location and then move it to the target folder.
Another variation mentioned by Bombe is to change the file name to some other extension after downloading and look only for files with that extension.
I only read files which are not in write mode; this is safest, as it means no other process is writing to the file. You can check whether a file is in write mode by using the canWrite method of the File class.
This solution works fine for me, as I have the exact same scenario you are facing.
You could try different things:
Repeatedly check the last modification date and the size of the file until it doesn’t change anymore for a given amount of time, then process it. (As pointed out by qbeuek this is neither safe nor deterministic.)
Only process files with names that match certain criteria (e.g. *.dat). Change the FTP upload/download process to transfer files under a different name (e.g. *.dat.temp) and rename them once they are complete (see the sketch after this list).
Download the files to a different location and move them to your processing directory once they’re complete.
As Vinegar said, if it doesn’t work the first time, try again later. :)
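A hypothetical sketch of the second option above (pick up only files that have already been renamed to *.dat):

import java.io.File;
import java.io.FilenameFilter;

public class ReadyFileScanner {
    // The uploader writes "*.dat.temp" and renames to "*.dat" only when the transfer is complete.
    static File[] readyFiles(File sourceDir) {
        return sourceDir.listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".dat");   // ".dat.temp" uploads are still in progress
            }
        });
    }
}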
If you have some control over the process that does the FTP, you could potentially have it create a "flag file" in the source directory immediately AFTER the FTP transfer of the big file has finished.
Then your Java thread has to check for the presence of this flag file; if it's present, there is a file ready to be processed in the source directory. Before processing the big file, the thread should remove the flag file.
Flag file can be anything (even an empty file).
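A minimal sketch of that flag-file handshake (hypothetical paths; the polling interval is arbitrary):

import java.io.File;

public class FlagFilePoller {
    public static void main(String[] args) throws InterruptedException {
        // The FTP job creates "bigfile.dat.done" only after "bigfile.dat" is fully copied.
        File dataFile = new File("/data/incoming/bigfile.dat");
        File flagFile = new File("/data/incoming/bigfile.dat.done");

        while (!flagFile.exists()) {
            Thread.sleep(5000);            // poll until the uploader signals completion
        }
        flagFile.delete();                 // remove the flag before processing
        System.out.println("Ready to process " + dataFile);  // convert to CSV here
    }
}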
Assuming you have no control over FTP process...
Let it be like this: when you get the exception, try to process the file again next time. Repeat this until the file gets processed. It's good to keep a few attributes around in case of an exception, to check later, such as name, last-modified, and size.
Check the exact exception before deciding to retry later; the exception might occur for some other reason.
If your OS is Linux, and your kernel > 2.6.13, you could use the filesystem event notification API named inotify.
There's a Java implementation here : https://bitbucket.org/nbargnesi/inotify-java.
Here's a sample code (heavily inspired from the website).
try {
    Inotify i = new Inotify();
    InotifyEventListener e = new InotifyEventListener() {
        @Override
        public void filesystemEventOccurred(InotifyEvent e) {
            System.out.println("inotify event occurred!");
        }

        @Override
        public void queueFull(EventQueueFull e) {
            System.out.println("inotify event queue: " + e.getSource() + " is full!");
        }
    };
    i.addInotifyEventListener(e);
    i.addWatch(System.getProperty("user.home"), Constants.IN_CLOSE_WRITE);
} catch (UnsatisfiedLinkError e) {
    System.err.println("unsatisfied link error");
} catch (UserLimitException e) {
    System.err.println("user limit exception");
} catch (SystemLimitException e) {
    System.err.println("system limit exception");
} catch (InsufficientKernelMemoryException e) {
    System.err.println("insufficient kernel memory exception");
}
This is in Grails, and I am using the FileUtils library from Apache Commons IO. The sizeOf function returns the size in bytes.
def fileModified = sourceFile.lastModified()
def fileSize = FileUtils.sizeOf(sourceFile)

Thread.sleep(3000) // sleep so a size/timestamp change shows up if the file is still being copied

if ((fileSize != FileUtils.sizeOf(sourceFile)) && (fileModified != sourceFile.lastModified())) {
    // the file is still being copied, so return
    if (log.infoEnabled)
        log.info("File is getting copied!")
    return
}

Thread.sleep(1000) // breather before picking up the file just copied
Please note that this also depends on what utility or OS you are using to transfer the files.
The safest bet is to copy the file that is being (or has been) transferred to a different file or directory. The copy process is robust and assures you that the file is present once copying completes. The one I am using is from the Commons IO API:
FileUtils.copyFileToDirectory(File srcFile, File destDir)
If you are copying a huge file which is still in the process of being transferred, be aware that this will take time; you might want to start it in a parallel thread, or better, have a separate application dedicated to the transfer process.
