I'm writing save logic for an application and part of it will save a dynamic list of "chunks" of data to a single file. Some of those chunks might have been provided by a plugin though (which would have included logic to read it back), so I need to find a way to properly skip unrecognized chunks of data if the plugin which created it has been removed.
My current solution is to write a length (int32) before each "chunk" so if there's an error the reader can skip past it and continue reading the next chunk.
However, this requires calculating the length of the data before writing any of it, and since our system is somewhat dynamic and allows nested data types, I'd rather avoid the overhead of caching everything just to measure it.
I'm considering using file markers somehow - I could scan the file for a very specific byte sequence that separates chunks. That could be written after each chunk rather than before.
Are there other options I'm not thinking of? My goal is to find a way to write the data immediately, without the need to cache and measure it first.
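For reference, here is a minimal sketch of the length-prefix approach described above (the chunk-id scheme and helper names are made up for illustration), showing how a reader can skip a chunk it does not recognize:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of a length-prefixed chunk stream (hypothetical layout):
//   [int32 chunkId][int32 length][length bytes of payload]
class ChunkIo {
    static void writeChunk(DataOutputStream out, int chunkId, byte[] payload) throws IOException {
        out.writeInt(chunkId);
        out.writeInt(payload.length); // the size must be known before writing
        out.write(payload);
    }

    static void readChunks(DataInputStream in) throws IOException {
        while (in.available() > 0) { // simplification: assumes a plain file-backed stream
            int chunkId = in.readInt();
            int length = in.readInt();
            if (isKnownChunk(chunkId)) {
                byte[] payload = new byte[length];
                in.readFully(payload);
                handleChunk(chunkId, payload);
            } else {
                in.skipBytes(length); // unknown plugin data: skip it and keep reading
            }
        }
    }

    static boolean isKnownChunk(int chunkId) { return chunkId == 1; } // placeholder
    static void handleChunk(int chunkId, byte[] payload) { /* dispatch to the owning plugin */ }
}
```

The downside is exactly the one described: payload.length has to be known before the chunk can be written.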
Related
I want to read sub-content of a big file from some offset/position.
For example, I have a file of 1M lines and I want to read 50 lines starting after the 100th line (line numbers 101 to 150, both inclusive).
I think I should be using PositionalReadable.
https://issues.apache.org/jira/browse/HADOOP-519
I see that FSInputStream.readFully actually uses seek() method of Seekable.
When I check the underlying implementation of seek() I see that it uses BlockReader.skip()
Wouldn't blockReader.skip() read all the data up to the target position in order to skip those bytes? The question is: would HDFS load the first 100 lines as well just to get to the 101st line?
How can I position the stream at any desired offset in the file, like the 10000th line, without loading the rest of the content? Something like what S3 offers with range offsets in request headers.
Here is a similar question I found: How to read files with an offset from Hadoop using Java, but it suggests using seek(), and the comments argue that seek() is an expensive operation that should be used sparingly. Which I guess is correct, because seek seems to read all the data in order to skip to the position.
The short answer is that seek() may or may not read as much data as skip(n) would.
As you said, seek() internally calls BlockReader.skip(). BlockReader is an interface type and is created via BlockReaderFactory. The BlockReader implementation that gets created is either BlockReaderRemote or BlockReaderLocal. (Strictly speaking, ExternalBlockReader is also possible, but it is excluded here because it is a special case.)
BlockReaderRemote is used when a client reads data from a remote DataNode over the network via RPC over TCP. In this case, if you analyze the skip() method code, you can see that readNextPacket is called repeatedly until n bytes have been skipped. That is, it actually reads the data that is being skipped.
BlockReaderLocal is used when the client is on the same machine as the DataNode where the block is stored. In this case, the client can read the block file directly, and skip() just changes dataPos, so the offset-based skip takes effect on the next read operation without reading the skipped bytes.
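For completeness, a minimal sketch of seeking with the public FileSystem API (the path and offset are placeholders); whether the seek is cheap then depends on which BlockReader implementation is used, as described above. Note that seek() works on byte offsets, not line numbers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: position an HDFS stream at a byte offset and read from there.
public class HdfsOffsetRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/bigfile.txt"); // placeholder path
        long offset = 10_000L;                     // placeholder byte offset

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(path)) {
            in.seek(offset); // internally ends up in BlockReader.skip(), as discussed
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes starting at offset " + offset);
        }
    }
}
```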
+Additional information (2023.01.19)
The above applies to both Hadoop 3.x.x and 2.x.x, but the paths and names of the implementation classes changed in version 2.8.0 due to a change in the project structure.
< 2.8.0
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
>= 2.8.0
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderRemote.java
Related Jira issues
https://issues.apache.org/jira/browse/HDFS-8057
https://issues.apache.org/jira/browse/HDFS-8925
I recommend you look at the SequenceFile format; maybe it will suit your needs.
We use seek() to read from an arbitrary place in a file.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/SequenceFile.Reader.html#seek(long)
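A minimal sketch of that, assuming a SequenceFile with LongWritable/Text pairs (the path, position, and key/value types are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: read a SequenceFile starting from an arbitrary position.
public class SequenceFileSeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/records.seq"); // placeholder path

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            LongWritable key = new LongWritable();  // must match the file's key class
            Text value = new Text();                // must match the file's value class

            // Jump near a byte offset, then align to the next record boundary.
            reader.sync(10_000L);
            while (reader.next(key, value)) {
                // process (key, value), stop whenever you have read enough
                break;
            }
        }
    }
}
```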
My problem is this. I have a snappy compressed avro file of 2GB with about 1000 avro records stored on HDFS. I know I can write code to "open up this avro file" and print out each avro record. My question is, is there a way in java to say, open up this avro file, iterate through each record and output into a text file the "start position" and "end position" of each record within that avro file such that... I could have a java function call "readRecord(startposition, endposition)" that could take the startposition and endposition to quickly read out one specific avro record without having to iterate through the whole file?
I don't have time to provide you an off-the-shelf implementation but I think that I can provide you some hints.
Let's start with the Avro Specification: Object Container Files
Basically, an Avro file is a sequence of self-contained blocks containing one or more records (you can configure the block size, and a record will never be split across two blocks). Each block consists of:
A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.
The documentation explicitly states "Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity.".
You cannot directly seek to a specific record, but you can seek to a given block and then iterate over its objects. It is not exactly what you need, but it seems close enough. I believe you won't be able to do much better than that with Avro containers. You can still tweak the block size to bound the maximum number of records to iterate over within a block. Since compression is applied at the block level, it won't be an issue.
I believe that such a reader can be implemented using only the public Avro API (DataFileReader provides seek and sync methods, etc.).
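A minimal sketch of such a reader, assuming the block's start position was recorded earlier (for example via DataFileReader.previousSync() during one indexing pass over the file):

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Sketch: jump to a previously recorded block position in an Avro container
// file and iterate only over the records of that block.
public class AvroBlockSeek {
    public static void main(String[] args) throws Exception {
        File file = new File("data.avro"); // placeholder path
        long blockStart = 123_456L;        // assumed: recorded earlier via previousSync()

        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            reader.seek(blockStart); // position exactly at a known block boundary
            while (reader.hasNext() && !reader.pastSync(blockStart)) {
                GenericRecord record = reader.next();
                // process record ...
            }
        }
    }
}
```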
You could compress each record individually. This won't give you as good a compression ratio, but it would give you random access.
I suggest using a ZIP or JAR format:
Give each record a notional file name; it could be just a number.
Write the serialized data as the contents of that entry in the JAR.
When you want random access:
Open the JAR.
Look up the entry by name.
Read it and deserialize.
This will compress the data for each entry in the most efficient manner possible.
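A minimal sketch of that idea using the standard java.util.zip classes (entry names and payloads are placeholders):

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Sketch: one ZIP entry per record, each compressed individually, so a single
// record can be looked up by name without decompressing the rest.
public class RecordZip {
    public static void main(String[] args) throws Exception {
        // Write: each record becomes its own entry, named by its number.
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("records.zip"))) {
            for (int i = 0; i < 3; i++) {
                zos.putNextEntry(new ZipEntry(String.valueOf(i)));
                zos.write(("record payload " + i).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }

        // Read: random access to one record by its entry name.
        try (ZipFile zip = new ZipFile("records.zip")) {
            ZipEntry entry = zip.getEntry("1");
            try (InputStream in = zip.getInputStream(entry)) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```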
I am writing software which has a part dealing with read and write operations. I am wondering how costly these operations are on a CSV file. Are there any other file formats that take less time? I have to read and write CSV files at the end of every cycle.
Read and write operations depend on the file system, hardware, software configuration, memory setup, and the size of the file being read, but not on the format. A separate, related concern is the cost of parsing the file, which should be relatively low since CSV is a very simple format.
The point is that CSV is a good format for tables of data but not for nested data. If your data has a lot of nested information, you can separate it into different CSV files, or you will have some information redundancy that will penalize your performance. But other formats might have other kinds of redundancy.
And do not optimize prematurely. If you are reading and writing the file very frequently, it will surely be kept in RAM. JSON or a zipped file might save space and be read faster, but would have a higher parsing time and could end up being even slower. Parsing time also depends on the library implementation (Gson vs. Jackson) and version.
It would be nice to know the reasons behind your problem in order to give better answers.
The cost of reading / writing to a CSV file, and whether it is suitable for your application, depend on the details of your use case. Specifically, if you are simply reading from the beginning of the file and writing to the end of the file, then the CSV format is likely to work fine. However, if you need to access particular records in the middle of your file then you probably wish to choose another format.
The main issue with a CSV file is that it is not a good choice for random access: each record (row) is of variable size, so you cannot simply seek to a particular record offset in the file and instead need to read every row (well, you could still jump and sample, but you cannot seek directly by record offset). Other formats with fixed-size records would allow you to seek directly to a particular record in the file, making it possible to update an entry in the middle of the file without re-reading and re-writing the entire file.
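To make the fixed-size-record point concrete, a minimal sketch (the record width and file name are assumptions): with a known record width you can compute the byte offset of record i directly and seek to it.

```java
import java.io.RandomAccessFile;

// Sketch: random access to record i when every record is exactly RECORD_SIZE bytes.
public class FixedRecordAccess {
    static final int RECORD_SIZE = 64; // assumed fixed record width

    public static void main(String[] args) throws Exception {
        long i = 1000; // record index to access; assumes the file holds at least i+1 records
        try (RandomAccessFile raf = new RandomAccessFile("records.dat", "rw")) {
            raf.seek(i * RECORD_SIZE);       // jump straight to record i
            byte[] record = new byte[RECORD_SIZE];
            raf.readFully(record);           // read just that record

            raf.seek(i * RECORD_SIZE);
            raf.write(record);               // or overwrite it in place
        }
    }
}
```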
Say you have a file that is written in XML or some other markup language of that sort. Is it possible to rewrite just one line, rather than reading the entire file into a string, changing the line, and then writing the whole string back to the file?
In general, no. File systems don't usually support the idea of inserting or modifying data in the middle of a file.
If your data file is in a fixed-size record format then you can edit a record without overwriting the rest of the file. For something like XML, you could in theory overwrite one value with a shorter one by inserting semantically-irrelevant whitespace, but you wouldn't be able to write a larger value.
In most cases it's simpler to just rewrite the whole file - either by reading and writing in a streaming fashion if you can (and if the file is too large to read into memory in one go) or just by loading the whole file into some in-memory data structure (e.g. XDocument), making the changes, and then saving the file again. (You may want to consider saving to a different file then moving the files around to avoid losing data if the save operation fails for some reason.)
If all of this ends up being too expensive, you should consider using multiple files (so each one is smaller) or a database.
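A minimal sketch of the save-to-a-temporary-file-then-move approach mentioned above, using java.nio.file (the file name and content are placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: write the new content to a temp file in the same directory, then
// replace the original in one move so a failed save never leaves a half-written file.
public class SafeRewrite {
    public static void main(String[] args) throws Exception {
        Path target = Path.of("config.xml");                      // placeholder file
        String newContent = "<config><value>42</value></config>"; // placeholder content

        Path temp = Files.createTempFile(target.toAbsolutePath().getParent(), "config", ".tmp");
        Files.writeString(temp, newContent, StandardCharsets.UTF_8);

        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        // On file systems that support it, StandardCopyOption.ATOMIC_MOVE can be added
        // to make the swap atomic.
    }
}
```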
If the line you want to replace is larger than the new line you want to replace it with, then it is possible, as long as it is acceptable to have some kind of padding (for example whitespace characters ' ') which will not affect your application.
If, on the other hand, the new content is larger than the content being replaced, you will need to shift all subsequent data down, so you have to rewrite the file, or at least everything from the replaced line onwards.
Since you mention XML, it might be that you are approaching your problem the wrong way. Could it be that what you really need is to replace a specific XML node? In that case, you might consider using DOM to read the XML into a hierarchy of nodes, add/update/remove nodes there, and then write the XML tree back to the file.
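A minimal sketch of the overwrite-with-padding idea from the first paragraph (the offset, length, and replacement text are assumptions); it only works when the padded replacement is no longer than the original:

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch: overwrite a region of the file in place, padding the new content with
// spaces so the file length and all later offsets stay unchanged.
public class InPlaceReplace {
    public static void main(String[] args) throws Exception {
        long lineOffset = 120;                   // assumed byte offset of the old line
        int oldLength = 32;                      // assumed byte length of the old line
        String replacement = "<value>7</value>"; // must fit within oldLength bytes

        byte[] newBytes = replacement.getBytes(StandardCharsets.UTF_8);
        if (newBytes.length > oldLength) {
            throw new IllegalArgumentException("replacement is larger than the original line");
        }

        try (RandomAccessFile raf = new RandomAccessFile("data.xml", "rw")) {
            raf.seek(lineOffset);
            raf.write(newBytes);
            for (int i = newBytes.length; i < oldLength; i++) {
                raf.write(' '); // pad the remainder with whitespace
            }
        }
    }
}
```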
I am developing an application that writes data to a file. Let's assume that while it is writing the data, the battery is pulled out. What will happen to the file? Will it be half-written (corrupted), empty, or the same as before we wrote to it? My guess is that it will be corrupted. How can I check whether it is corrupted after the phone restarts, given that the file was storing an ArrayList of objects? Will Java throw some kind of corrupted-file exception, report that the read ArrayList is null, or complain about an unknown object?
PS: maybe create another file that keeps an MD5 checksum of the data file? Whenever I write to the data file I would first produce its checksum, and when I read from the data file I would produce a checksum and compare it with the stored one. That would indicate whether my data is intact, but it won't allow me to roll back to a previous (pre-corruption) state. I would like a method that is as lightweight as possible; I am already using the CPU too much by reading/writing changes to my storage on every attribute change of a set of thousands of objects. Probably a database would have been a better idea.
I can't say how Java will read in a corrupted serialized array, but for safety let's assume that there's no error detection.
In that case, you have two easy options:
Store a checksum of your data inside your data structure, before you serialize it.
Compute the checksum of the final serialized file.
Either approach will work the same way, though the first option might be a bit faster since you compute the checksum before you've written anything to disk (and therefore avoid an extra round of file I/O).
As you mentioned, MD5 would be fine for this. (Even a basic CRC would probably be fine -- you don't need a cryptographic hash here.)
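A minimal sketch of the second option (a checksum of the final serialized bytes), using CRC32 rather than MD5 since a cryptographic hash is not needed; keeping the checksum in a separate file is an assumption, as in your PS:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Sketch: store a CRC32 alongside the serialized data and verify it on load,
// so a half-written file is detected instead of being deserialized blindly.
public class ChecksummedFile {
    static void save(Path dataFile, Path checksumFile, byte[] serialized) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(serialized);
        Files.write(dataFile, serialized);
        Files.writeString(checksumFile, Long.toString(crc.getValue()));
    }

    static byte[] load(Path dataFile, Path checksumFile) throws IOException {
        byte[] serialized = Files.readAllBytes(dataFile);
        long expected = Long.parseLong(Files.readString(checksumFile).trim());
        CRC32 crc = new CRC32();
        crc.update(serialized);
        if (crc.getValue() != expected) {
            throw new IOException("data file is corrupt (checksum mismatch)");
        }
        return serialized;
    }
}
```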
If you want to allow rolling back to a previous version -- I'd just store each version as a separate file and then have a pointer to the most recent one. (If you update the pointer as the last step of your write operation, this will also provide an extra level of protection against corrupt data being input to your app -- though you'll have to prepare for this pointer to be corrupt as well. Since this is essentially a commit step, you could interpret a corrupt pointer as "use the last version".)
And yes, at this point you might want to just use the SQLite functionality built into Android. :)