Time Complexity of Java's NIO StandardOpenOption.Append operation - java

The current documentation for StandardOpenOption.Append says:
If the file is opened for WRITE access then bytes will be written to the end of the file rather than the beginning.
However, I can't seem to find any further information on how this works internally.
My use case involves appending data to a huge file. I currently use BufferedWriter, but my understanding is that if I maintain a pointer to the end of the file, I can append to it directly, without first traversing from the start of the file to the end.
So my question is: does StandardOpenOption.APPEND actually work in a similar way? Or does it also, internally, move to the end of the file before performing the append?
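For what it's worth, file systems track a file's length in metadata, so positioning a write at the end is a constant-time operation; nothing is read or "traversed" to get there. A minimal sketch (file name and contents invented for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("append-demo", ".log");
        Files.writeString(file, "first line\n");
        // APPEND asks the OS to position every write at the current end of
        // file; no existing bytes are read to find that position.
        try (var out = Files.newBufferedWriter(file,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            out.write("second line\n");
        }
        System.out.println(Files.readString(file));
        Files.delete(file);
    }
}
```

On POSIX systems this maps to the O_APPEND open flag, where the kernel atomically positions each write at end of file.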

Related

Will positioned read or seek() from HDFS file load and ignore whole content of the file?

I want to read sub-content of a big file from some offset/position.
For example I have a file of 1M lines and I want to read 50 lines starting from 100th. (line no: 101 to 150 - both inclusive)
I think I should be using PositionalReadable.
https://issues.apache.org/jira/browse/HADOOP-519
I see that FSInputStream.readFully actually uses seek() method of Seekable.
When I check the underlying implementation of seek() I see that it uses BlockReader.skip()
Wouldn't blockReader.skip() read all the data up to the position in order to skip those bytes? The question is: would HDFS load the first 100 lines as well just to get to the 101st line?
How can I position the stream at any desired offset in the file, such as the 10,000th line, without loading the rest of the content? Something like what S3 offers with its range/offset headers.
Here is a similar question I found: How to read files with an offset from Hadoop using Java. It suggests using seek(), but the comments argue that seek() is an expensive operation that should be used sparingly. I guess that is correct, because seek() seems to read all the data in order to skip to the position.
The short answer: seek() may or may not read as much data as skip(n), depending on which BlockReader implementation is used.
As you said, seek() internally calls BlockReader.skip(). BlockReader is an interface type and is created via BlockReaderFactory. The BlockReader implementation that is created is either BlockReaderRemote or BlockReaderLocal. (Strictly speaking, ExternalBlockReader is also possible, but it is excluded here because it is a special case.)
BlockReaderRemote is used when a client reads data from a remote DataNode over the network via RPC over TCP. In this case, if you analyze the skip() method, you can see that readNextPacket() is called repeatedly until the n bytes to skip have been consumed. That is, it actually reads the data being skipped.
BlockReaderLocal is used when the client is on the same machine as the DataNode where the block is stored. In this case, the client can read the block file directly, and skip() simply changes dataPos so that the next read operation performs an offset-based read; the skipped bytes are never loaded.
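For contrast, on a plain local file (which is ultimately what BlockReaderLocal reads) a seek is just a pointer move and no intervening bytes are touched. A small local-filesystem sketch, not HDFS-specific:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalSeekDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("seek-demo", ".dat");
        Files.write(f, new byte[1_000_000]);   // 1 MB of zero bytes
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "rw")) {
            raf.seek(999_990);                 // O(1): only moves the file pointer
            raf.write("the end".getBytes());
            raf.seek(999_990);                 // jump straight back, reading nothing
            byte[] buf = new byte[7];
            raf.readFully(buf);                // reads only these 7 bytes
            System.out.println(new String(buf));
        }
        Files.delete(f);
    }
}
```

Over the BlockReaderRemote path, by contrast, the skipped range may actually be read from the network, as described above.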
+Additional information (2023.01.19)
The above applies to both Hadoop 3.x.x and 2.x.x, but the path and name of the implementation have been changed from version 2.8.0 due to a change in the project structure.
< 2.8.0
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
>= 2.8.0
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderRemote.java
Related Jira issues
https://issues.apache.org/jira/browse/HDFS-8057
https://issues.apache.org/jira/browse/HDFS-8925
I recommend looking at the SequenceFile format; it may suit your needs.
We use seek() to read from an arbitrary place in a file.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/SequenceFile.Reader.html#seek(long)

How to append and delete data in the middle of a text file, shifting the remaining file content using java code?

I am using RandomAccessFile, but writeBytes() overwrites the existing content. Ideally I want to implement this without using any new temporary file; at the least, method names, clues, or techniques will do.
To insert without the use of any temporary file, you'll have to find a way to read in and shift all subsequent lines down by one when you insert a new line.
The sequential storage of lines of text poses the same kind of issues that the use of a standard array (vs. a LinkedList) does: You can't just unhook a pointer, plop in some text, and then hook up the pointer to next item to point to a subsequent line in the file. You have to perform the entire shift.
So I'd guess that you'd want to go to the end of the file, shift the last line down by one, then move up a line at a time performing the same shift until you reach the position at which you want to insert the new line, at which point you'll have cleared a space for it.
This seems very performance inefficient.
Hopefully someone knows of a better way, but this would be my approach, were a temporary file not an option.
(Alternately, you could also always just read the whole thing into a StringBuffer if it were small enough, perform the insert within that, and then write the file back out from scratch, but I imagine that you've considered that option already.)
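A minimal sketch of the in-place shift, buffering the whole tail in memory (fine for modest files; for a huge file you would shift in fixed-size chunks from the end backwards instead). The file contents and offsets are invented for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class InsertInPlace {
    // Insert `insert` at byte offset `pos` without a temporary file, by
    // buffering the tail in memory and rewriting it after the new bytes.
    static void insertAt(Path file, long pos, byte[] insert) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            byte[] tail = new byte[(int) (raf.length() - pos)];
            raf.seek(pos);
            raf.readFully(tail);   // save everything after the insertion point
            raf.seek(pos);
            raf.write(insert);     // overwrite with the new content...
            raf.write(tail);       // ...then put the tail back, shifted down
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("insert-demo", ".txt");
        Files.writeString(f, "line1\nline3\n");
        insertAt(f, "line1\n".length(), "line2\n".getBytes());
        System.out.print(Files.readString(f));
        Files.delete(f);
    }
}
```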

How to create an InputStream of files that have a certain extension in Java?

I have a lot of files in a directory but I only want to read the ones with a certain extension (say .txt). I want these files added to the same BufferedInputStream so that I can read them in one go. When I call read() at the end of a file, the next one should begin.
It really feels like there should be an obvious answer to this but I had no luck finding it.
You might want to take a look at SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other
input streams. It starts out with an ordered collection of input
streams and reads from the first one until end of file is reached,
whereupon it reads from the second one, and so on, until end of file
is reached on the last of the contained input streams.
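Assembling such a concatenated stream for one extension might look like this (directory layout and file names are invented; paths are sorted so the read order is deterministic, since directory listing order is unspecified):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ConcatTxtFiles {
    // Build one InputStream that reads every .txt file in `dir` back to back.
    static InputStream concat(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : ds) files.add(p);
        }
        Collections.sort(files);
        List<InputStream> streams = new ArrayList<>();
        for (Path f : files) streams.add(Files.newInputStream(f));
        // SequenceInputStream moves to the next stream at each end-of-file.
        return new BufferedInputStream(
                new SequenceInputStream(Collections.enumeration(streams)));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("txt-demo");
        Files.writeString(dir.resolve("a.txt"), "hello ");
        Files.writeString(dir.resolve("b.txt"), "world");
        Files.writeString(dir.resolve("c.md"), "ignored");  // filtered out
        try (InputStream in = concat(dir)) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}
```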
To me the "obvious answer" is:
Just iterate through all the files in the directory using a proper filter. For each file create a FileInputStream, read it and close it.
I don't think there is an obvious answer to this question.
Probably you need to create a Wrapper InputStream with a list of files you want to read from. Internally you will open/close streams as needed, namely when a file is completely read.
It is not obvious but should not be difficult. This way you can work 'only' with one InputStream for all files.

Accessing a file incrementally in java while it is being dynamically updated?

I want to read a file incrementally in java while the file is being modified/written by some other process. So suppose Process "A" is writing/logging a file "X" and another process "B" wants to incrementally read the file "X", say every 1 sec (or even continuously) to find a particular pattern. What's the best way to do this in java? I know I can use RandomAccessFile's 'seek' method but will that interfere with the writing of the file? Is there a better way to do this?
Poll the last-modified date of the file. Opening it for reading can prevent other programs from writing to the file at the same time.
If you're able to use Java 7, you could take advantage of the WatchService ... but it doesn't solve having to parse the whole file.
The only thing I can think of is maintaining some kind of "marker" that indicates the last position you read up to. The next time you come to read the file, you can skip straight to this point and read from there (updating the marker when you're done).
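The marker idea can be sketched like this: seek() jumps straight to the saved offset (it does not interfere with the writer), and only the newly appended bytes are read. The class and method names are made up for illustration; the demo simulates the writer by appending between two polls:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TailPoller {
    private long marker = 0;  // last byte offset already consumed

    // Return whatever was appended since the previous call.
    String readNew(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() <= marker) return "";   // nothing new yet
            raf.seek(marker);                        // skip straight to the marker
            byte[] buf = new byte[(int) (raf.length() - marker)];
            raf.readFully(buf);
            marker = raf.length();                   // remember where we stopped
            return new String(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("tail-demo", ".log");
        TailPoller poller = new TailPoller();
        Files.writeString(log, "first\n");
        System.out.print(poller.readNew(log));       // sees "first\n"
        Files.writeString(log, "second\n", StandardOpenOption.APPEND);
        System.out.print(poller.readNew(log));       // sees only "second\n"
        Files.delete(log);
    }
}
```

In a real poller you would call readNew() on a timer (or from a WatchService callback) rather than back to back.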

How to edit a specific attribute in file?

So say you have a file that is written in XML or some other markup language of that sort. Is it possible to rewrite just one line, rather than reading the entire file into a string, changing the line, and then writing the whole string back out to the file?
In general, no. File systems don't usually support the idea of inserting or modifying data in the middle of a file.
If your data file is in a fixed-size record format then you can edit a record without overwriting the rest of the file. For something like XML, you could in theory overwrite one value with a shorter one by inserting semantically-irrelevant whitespace, but you wouldn't be able to write a larger value.
In most cases it's simpler to just rewrite the whole file - either by reading and writing in a streaming fashion if you can (and if the file is too large to read into memory in one go) or just by loading the whole file into some in-memory data structure (e.g. XDocument), making the changes, and then saving the file again. (You may want to consider saving to a different file then moving the files around to avoid losing data if the save operation fails for some reason.)
If all of this ends up being too expensive, you should consider using multiple files (so each one is smaller) or a database.
If the line you want to replace is longer than the new line you want to replace it with, then it is possible, as long as it is acceptable to have some kind of padding (for example whitespace characters ' ') which will not affect your application.
If, on the other hand, the new content is larger than the content being replaced, you will need to shift all subsequent data down, so you have to rewrite the file, or at least everything from the replaced line onwards.
Since you mention XML, it might be you are approaching your problem in the wrong way. Could it be that what you need is to replace a specific XML node? In which case you might consider using DOM to read the XML into a hierarchy of nodes and adding/updating/removing in there before writing the XML tree back to the file.
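The padding approach can be sketched like this. The `value=...` layout and offsets are an invented example; the key point is that only the targeted bytes are rewritten, and the replacement is space-padded to exactly the old length so the rest of the file never moves:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PadOverwrite {
    // Overwrite the bytes at [pos, pos+oldLen) with `replacement`,
    // space-padded to exactly oldLen bytes. Only works when the new
    // value is no longer than the old one.
    static void overwrite(Path file, long pos, int oldLen, String replacement)
            throws IOException {
        byte[] newBytes = replacement.getBytes();
        if (newBytes.length > oldLen)
            throw new IllegalArgumentException("replacement too long");
        byte[] padded = new byte[oldLen];
        Arrays.fill(padded, (byte) ' ');
        System.arraycopy(newBytes, 0, padded, 0, newBytes.length);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(pos);
            raf.write(padded);   // the rest of the file is untouched
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("attr-demo", ".txt");
        Files.writeString(f, "value=oldvalue\n");
        overwrite(f, "value=".length(), "oldvalue".length(), "new");
        System.out.print(Files.readString(f));
        Files.delete(f);
    }
}
```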
