How to override InputFormat and OutputFormat in a Hadoop application - Java

I have an application which needs to read a file that is the serialized result of an ArrayList (ArrayList<String>, 50,000 records, about 20 MB).
I don't know exactly how to read the data into the Hadoop platform. I only have a sense that I need to override InputFormat and OutputFormat.
I'm a beginner on the Hadoop platform. Could you give me some advice?
Thanks,
Zheng.

To start with you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source for the LineRecordReader and try to relate that to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? That is, can you seek to an offset in the file and recover the stream from there? (Text files can, as they just scan forward to the end of the current line and then start from there.) If your file format uses compression, you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default FileInputFormat.isSplitable will return true, which you may want to initially override to be false. If you do stick with 'unsplittable' then note that your file will be processed by a single mapper (no matter its size).
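To make this more concrete, here is a minimal sketch of such an InputFormat, assuming the file was written with a plain ObjectOutputStream.writeObject(arrayList) and is small enough to deserialize in one go (as a 20 MB list is). The class names SerializedArrayListInputFormat and ArrayListRecordReader are invented for illustration; only the Hadoop types are real.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SerializedArrayListInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a Java-serialized ArrayList cannot be re-synchronized at an arbitrary offset
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ArrayListRecordReader();
    }

    static class ArrayListRecordReader extends RecordReader<LongWritable, Text> {
        private List<String> records;
        private int index = -1;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        @SuppressWarnings("unchecked")
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            Path path = ((FileSplit) split).getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            // Deserialize the whole list up front; 20 MB fits comfortably in a mapper's heap.
            try (FSDataInputStream in = fs.open(path);
                 ObjectInputStream ois = new ObjectInputStream(in)) {
                records = (ArrayList<String>) ois.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }

        @Override
        public boolean nextKeyValue() {
            if (records == null || ++index >= records.size()) {
                return false;
            }
            key.set(index);                 // key = position in the list
            value.set(records.get(index));  // value = the list element itself
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return records == null ? 0f : (float) (index + 1) / records.size();
        }

        @Override
        public void close() { }
    }
}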

Before processing data on Hadoop you should upload the data to HDFS or another supported file system, if of course it wasn't uploaded there by something else. If you control the uploading process you can convert the data at the upload stage into something you can easily process, for example:
a simple text file (one line per array item)
a SequenceFile if the array items can contain '\n'
This is the simplest solution, since you don't have to interfere with Hadoop's internals.
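As a rough sketch of the second option: the snippet below deserializes the local ArrayList<String> and rewrites it as a SequenceFile with the element index as the key. The class name, method name and paths are invented for illustration, and it assumes the local file was written with ObjectOutputStream.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ArrayListUploader {

    // Deserialize the local ArrayList<String> and write it out as a SequenceFile,
    // one record per list element, so standard input formats can read it later.
    @SuppressWarnings("unchecked")
    public static void upload(String localFile, String targetPath) throws IOException, ClassNotFoundException {
        List<String> items;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(localFile))) {
            items = (List<String>) in.readObject();
        }
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(targetPath)),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            int i = 0;
            for (String item : items) {
                writer.append(new IntWritable(i++), new Text(item));
            }
        }
    }
}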

Related

Will a positioned read or seek() on an HDFS file load and then ignore the whole content of the file?

I want to read sub-content of a big file from some offset/position.
For example, I have a file of 1M lines and I want to read 50 lines starting from the 100th (line numbers 101 to 150, both inclusive).
I think I should be using PositionedReadable.
https://issues.apache.org/jira/browse/HADOOP-519
I see that FSInputStream.readFully actually uses the seek() method of Seekable.
When I check the underlying implementation of seek() I see that it uses BlockReader.skip()
Wouldn't BlockReader.skip() read all the data up to the position in order to skip the bytes? The question is: would HDFS load the first 100 lines as well in order to get to the 101st line?
How can I position at any desired offset in the file, like the 10000th line, without loading the rest of the content? Something like what S3 offers with header offsets.
Here is a similar question I found: How to read files with an offset from Hadoop using Java, but it suggests using seek(), and it is argued in the comments that seek() is an expensive operation that should be used sparingly. Which I guess is correct, because seek() seems to read all the data in order to skip to the position.
The short answer: it may or may not read as much data as skip(n), depending on which BlockReader is used.
As you said, seek() internally calls BlockReader.skip(). BlockReader is an interface and the concrete reader is created via BlockReaderFactory. The implementation that gets created is either BlockReaderRemote or BlockReaderLocal. (Strictly speaking, ExternalBlockReader is also possible, but it is excluded here because it is a special case.)
BlockReaderRemote is used when a client reads data from a remote DataNode over the network via RPC over TCP. In this case, if you analyze the skip() method code, you can see that readNextPacket is called repeatedly until the n bytes to skip have been read. That is, it actually reads the data to be skipped.
BlockReaderLocal is used when the client is on the same machine as the DataNode where the block is stored. In this case, the client can read the block file directly, and skip() just changes dataPos, so the offset-based skip takes effect on the next read operation.
+Additional information (2023.01.19)
The above applies to both Hadoop 3.x.x and 2.x.x, but the path and name of the implementations changed as of version 2.8.0 due to a change in the project structure.
< 2.8.0
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/RemoteBlockReader.java
>= 2.8.0
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderLocal.java
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderRemote.java
Related Jira issues
https://issues.apache.org/jira/browse/HDFS-8057
https://issues.apache.org/jira/browse/HDFS-8925
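For reference, the PositionedReadable path the question mentions is exposed on FSDataInputStream. Below is a minimal sketch (the path and offset are hypothetical); note that it works in byte offsets, so it still cannot jump to "line 10,000" unless you already know that line's byte offset.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("/data/big-file.txt"); // hypothetical file
        FileSystem fs = path.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[64 * 1024];
            // PositionedReadable: read up to buf.length bytes starting at byte offset 10,000,000
            // without moving the stream's current position.
            int n = in.read(10_000_000L, buf, 0, buf.length);
            System.out.println(new String(buf, 0, Math.max(n, 0), StandardCharsets.UTF_8));
        }
    }
}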
I recommend you look at the SequenceFile format; maybe it will suit your needs.
We use seek() to read from an arbitrary place in a file.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/io/SequenceFile.Reader.html#seek(long)
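A minimal sketch of that approach, with a hypothetical file path and offset: sync(position) re-aligns the reader on the next record boundary after a raw byte offset, whereas seek(long) expects an exact record start position.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileSeekExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("/data/example.seq"); // hypothetical file
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Jump near a byte offset of interest, then re-align on the next sync marker
            // so reading resumes at a record boundary.
            reader.sync(1_000_000L);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}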

How to get the start and end of each Avro record in a compressed Avro file?

My problem is this: I have a snappy-compressed Avro file of 2 GB with about 1000 Avro records stored on HDFS. I know I can write code to open up this Avro file and print out each record. My question is: is there a way in Java to open up this Avro file, iterate through each record, and output into a text file the start position and end position of each record within the file, such that I could have a Java function readRecord(startposition, endposition) that takes those positions and quickly reads out one specific Avro record without having to iterate through the whole file?
I don't have time to provide you with an off-the-shelf implementation, but I think I can give you some hints.
Let's start with the Avro Specification: Object Container Files
Basically an Avro file is a sequence of self-contained blocks containing one or more records (you can configure the block size, and a record will never be split across two blocks). Each block consists of:
A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.
The documentation explicitly states "Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity.".
You cannot directly seek to a specific record, but you can seek to a given block and then iterate over its objects. It is not exactly what you need, but it seems close enough. I believe that you won't be able to do much better than that with Avro containers. You can still tweak the block size to bound the maximum number of iterations within a block. When compression is used, it is applied at the block level, so it won't be an issue.
I believe that such a reader can be implemented using only the public Avro API (DataFileReader provides seek and sync methods, etc.).
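For illustration, here is a sketch of that block-level pattern using DataFileReader, the same sync/pastSync idiom Avro's MapReduce support uses for splits. The byte range is hypothetical; you would record the interesting block offsets (e.g. via previousSync()) on a first pass.

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroBlockRange {

    // Read only the records in blocks that start inside the [start, end) byte range.
    public static void readRange(File avroFile, long start, long end) throws IOException {
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            reader.sync(start); // skip straight to the first block boundary after 'start'
            while (reader.hasNext() && !reader.pastSync(end)) {
                GenericRecord record = reader.next();
                System.out.println(record); // process the record
            }
        }
    }
}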
You could compress each record individually. This won't give you as good a compression ratio, but it would give you random access.
I suggest using a ZIP or JAR format.
give each record a notional file name; it could be just a number.
write the serialized data as the contents of the file to the JAR.
When you want random access
open the JAR
look up the entry by name.
read it and deserialize.
This will compress the data in the most efficient manner possible for each entry.
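A minimal sketch of that idea with java.util.zip (class and method names are invented; records are assumed to be already-serialized byte arrays):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipRecordStore {

    // Write each already-serialized record as its own compressed entry, named by record number.
    public static void write(String zipPath, List<byte[]> records) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipPath))) {
            for (int i = 0; i < records.size(); i++) {
                zos.putNextEntry(new ZipEntry(Integer.toString(i)));
                zos.write(records.get(i));
                zos.closeEntry();
            }
        }
    }

    // Random access: open the archive and read exactly one entry by its name.
    public static byte[] read(String zipPath, int recordNumber) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(Integer.toString(recordNumber));
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes(); // Java 9+; deserialize the record from these bytes as needed
            }
        }
    }
}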

How costly (in time) are read and write operations on a CSV file in Java?

I am writing software which has a part dealing with read and write operations. I am wondering how costly these operations are on a CSV file. Are there any other file formats that consume less time? I have to read and write CSV files at the end of every cycle.
Read and write operations depend on the file system, hardware, software configuration, memory setup, and the size of the file to read, but not on the format. A related problem is the cost of parsing the file, which should be relatively low since CSV is very simple.
The point is that CSV is a good format for tables of data but not for nested data. If your data has a lot of nested information you can separate it into different CSV files, or you will have some information redundancy that will penalize your performance. But other formats might have other kinds of redundancy.
And do not optimize prematurely. If you are reading and writing the file very frequently, it will surely be kept in RAM by the OS cache. JSON or a zipped file might save space and be read faster, but would have a higher parsing time and could end up even slower. Parsing time also depends on the library implementation (Gson vs. Jackson) and its version.
It would be nice to know the reasons behind your problem in order to give better answers.
The cost of reading / writing to a CSV file, and whether it is suitable for your application, depend on the details of your use case. Specifically, if you are simply reading from the beginning of the file and writing to the end of the file, then the CSV format is likely to work fine. However, if you need to access particular records in the middle of your file then you probably wish to choose another format.
The main issue with a CSV file is that it is not a good format choice for random access, since each record (row) is of variable size, so you cannot simply seek to a particular record offset in the file, and instead need to read every row (well, you could still jump and sample, but you cannot seek directly by record offset). Other formats with fixed-size records would allow you to seek directly to a particular record in the file, making it possible to update an entry in the middle of the file without re-reading and re-writing the entire file.
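A minimal sketch of the fixed-size-record idea with RandomAccessFile (the record width and class name are arbitrary choices for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FixedWidthRecords {
    static final int RECORD_SIZE = 128; // hypothetical fixed record width in bytes

    // With fixed-size records, record i always starts at byte i * RECORD_SIZE,
    // so reading or updating one record needs no scan of the rest of the file.
    static String readRecord(RandomAccessFile file, long index) throws IOException {
        byte[] buf = new byte[RECORD_SIZE];
        file.seek(index * RECORD_SIZE);
        file.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8).trim();
    }

    static void writeRecord(RandomAccessFile file, long index, String value) throws IOException {
        byte[] buf = new byte[RECORD_SIZE]; // pad or truncate to the fixed width
        byte[] src = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, buf, 0, Math.min(src.length, RECORD_SIZE));
        file.seek(index * RECORD_SIZE);
        file.write(buf);
    }
}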

Rapidly changing configuration/status file? Java

I need some way to store a configuration/status file that needs to be changed rapidly. The status of each key-value pair is stored in that file. The status needs to change very rapidly, following the status of a communication hardware device (digital multimedia broadcasting).
What is the best way to go about creating such a file? INI? XML? Is there any off-the-shelf file writer in Java? I can't use databases.
It sounds like you need random access to update parts of the file frequently without re-writing the entire file. Design a binary file format and use the RandomAccessFile API to read/write it. You are going to want to use a fixed number of bytes for the key and for the value, so that you can index into the middle of the file and update a value without having to re-write all of the following records. Basically, you would be re-implementing how a database stores a table.
Another alternative is to store only a single key-value pair per file, so that the cost of re-writing the file is minor. Maybe you can think of a way to use the file name as the key and only store the value in the file content.
I'd be inclined to try the second option unless you are dealing with more than a few thousand records.
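A minimal sketch of that second option (the class name and layout are my own; each key becomes a tiny file in a status directory):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneFilePerKeyStore {
    private final Path dir;

    public OneFilePerKeyStore(Path dir) {
        this.dir = dir;
    }

    // The file name is the key and the file content is the value,
    // so updating one key only rewrites one tiny file.
    public void put(String key, String value) throws IOException {
        Files.write(dir.resolve(key), value.getBytes(StandardCharsets.UTF_8));
    }

    public String get(String key) throws IOException {
        return new String(Files.readAllBytes(dir.resolve(key)), StandardCharsets.UTF_8);
    }
}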
The obvious solution would be to put the "configuration" information into a Properties object, and then use Properties.store(...) or Properties.storeToXML(...) to save to a file output stream or writer.
You also need to do something to ensure that whatever is reading the file will see a consistent snapshot. For instance, you could write to a new file each time and do a delete/rename dance to replace the old with the new.
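A minimal sketch of that write-then-rename pattern with java.util.Properties (the class and method names are my own, and it assumes the temp file and target live on the same file system so the rename can be atomic):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

public class AtomicPropertiesStore {

    // Write to a temp file first, then rename it over the old file so readers
    // always see a complete, consistent snapshot.
    public static void save(Properties props, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, "status snapshot");
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }
}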
But if the update rate for the file is too high, you are going to create a lot of disc traffic, and you are bound to slow down your application. This is going to apply (eventually) no matter what file format / API you use. So, you may want to consider not writing to a file at all.
At some point, configuration that changes too rapidly becomes "program state" and not configuration. If it is changing so rapidly, why do you have confidence that you can meaningfully write it to, and then read it from, a filesystem?
Say more about what the status is and who the consumer of the data is...

Java: Where can I find advanced file manipulation source/libraries?

I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code/libraries that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumption is that these are extremely common tasks, yet I couldn't find them in the Java API or in my Google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
As far as I know, Java has primarily lower-level functions for manipulating files directly. Here is the best I've come up with:
The actions you describe are standard in the Swing world, and for text come down to manipulating a Document object. These act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds the end of lines automatically, but other classes in java.io and java.nio do.
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
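For the specific task in the question (overwriting bytes at a given offset), the FileChannel mentioned above supports positioned writes directly. A minimal sketch, with hypothetical names:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelOverwrite {

    // Overwrite 'signature' in place at 'offset' using FileChannel's positioned write,
    // without touching the rest of the file.
    public static void overwrite(Path file, byte[] signature, long offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(signature), offset);
        }
    }
}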
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate it as you like, you can search the data in the files, you can overwrite files, create new, append to an existing....
I am not sure this is exactly what you are asking for but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly; however, you can output the contents of the file, "copy" the part of it you want, and "paste" it into a new file or append it to an existing one.
While writing a byte array to a file is a common task, writing a 32-byte array to a given file at a given offset, just once, is just not something you are going to find ready-made in java.io :)
To get started, would the method and comments below look reasonable to you? I bet someone here, maybe even myself, could whip it up quickly.
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) throws IOException {
    // open the file for reading and writing
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
        // move to the offset
        raf.seek(offset);
        // write the bytes; the file is closed automatically by try-with-resources
        raf.write(bytes);
    }
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.
