Add Entire File's Text as Map Key in Hadoop - java

I'm looking for a way to load an entire file's text into my map, not a single line at a time as TextInputFormat does, so that when I call value.toString() in my map it gives me the entire input to work with.

You can collect every line into a StringBuilder until you've reached the end of the file, or you can write your own RecordReader that provides this functionality. But I would not recommend this.
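For what it's worth, here is a minimal sketch of the RecordReader route, modelled on the WholeFileRecordReader from Hadoop: The Definitive Guide and assuming the newer org.apache.hadoop.mapreduce API (the class name and the NullWritable key are arbitrary choices):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits exactly one record per file: (no key, entire file contents as Text).
public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {
    private FileSplit split;
    private Configuration conf;
    private final Text value = new Text();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the whole split, which is the whole file since it is never split.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FSDataInputStream in = file.getFileSystem(conf).open(file);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
}

This reader must be paired with an input format whose isSplitable returns false, so that each file arrives as a single split.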

I would pass the name of the file to the mapper, which would then be free to load it entirely or do some kind of streaming processing.

Related

MapReduce with filename as Key, contents as Values, many small files

I've looked at FileInputFormat where filename is KEY and text contents are VALUE, How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?, and Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job, but I'm having trouble getting off the ground. Not having done anything with Hadoop before, I'm wary of starting down the wrong path if someone else can see that I'm making a mistake.
I have a directory containing something like 100K small files containing HTML, and I want to create an inverted index using Amazon Elastic MapReduce, implemented in Java. Once I have the file contents, I know what I want my map and reduce functions to do.
After looking here, my understanding is I need to subclass FileInputFormat and override isSplitable. However, my filenames are related to the URLs from which the HTML came, so I want to keep them. Is replacing NullWritable with Text all I need to do? Any other advice?
You should use a WholeFileInputFormat to pass the whole file to your mapper. Note that this is not a stock Hadoop class; you have to write it yourself (the WholeFileInputFormat in Hadoop: The Definitive Guide is a common starting point):
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
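If you also want the filename as the key (as asked above), replacing NullWritable with Text in the reader works; a lower-effort alternative, sketched here with the newer mapreduce API and an invented mapper name, is to recover the path from the input split inside the mapper:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<NullWritable, Text, Text, Text> {
    @Override
    protected void map(NullWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // FileInputFormat subclasses produce FileSplits, so this cast is safe here.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // 'value' holds the whole HTML file; emit (term, filename) pairs from it.
        for (String term : value.toString().split("\\s+")) {
            context.write(new Text(term), new Text(fileName));
        }
    }
}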

Parsing XML file from the end of file

I want to use XML for storing some data, but I do not want to read the full file when I want to get the last data that was inserted there, and I also do not want to rewrite the full file when adding new data. Is there a standard way in Java to parse an XML file not from the beginning but from the end, so that, for example, a SAX or StAX parser would first encounter the last closing root tag and then the last tag? Or, if I want to do this, should I read and write everything as if I were reading/writing a regular text file?
Fundamentally, XML is a poor choice of representation for this: the format is inherently "contained" (a single root element wraps everything), and I haven't seen any APIs which encourage you to fight against that.
Options:
Choose a different format entirely (e.g. use a database)
Create lots of small XML files instead - each one self-contained. When you want the whole of the data, read all the files
Just swallow the hit and read/write the whole file each time.
I found a good topic on this with example solutions for what I want.
This link: http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
It seems that XML is not a good file format for achieving what I want: there is no standard parser that can parse XML from the end instead of the beginning.
Probably the best solution for me will be storing all the XML data in one file that is a composition of many XML documents, with the contents of a separate XML document stored on each line. The file itself is not well-formed XML, but each line contains well-formed XML that I can parse using a standard XML parser (StAX).
This way I will be able to read just the lines from the end of the file and append new data to the end of the file. When I need the whole data, or only part of it, I will read all the lines or some of them. I can probably also implement pagination from the end of the file, because the file can be big.
Why XML on each line? I think it is easier to use an API to parse it, and storing data as XML is more human-readable than just separating values on a line with some symbol.
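A rough sketch of that scheme (class and method names invented for illustration): appending a record is an ordinary file append, and fetching the latest entry only needs the last line, found by scanning backwards from the end of the file:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;

public class XmlLineLog {
    // Append one well-formed XML document as a single line.
    public static void append(File log, String xmlRecord) throws IOException {
        Files.write(log.toPath(),
                (xmlRecord + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Read only the last line by scanning backwards; safe for UTF-8,
    // where the byte 0x0A can only ever be a newline.
    public static String lastEntry(File log) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
            long pos = raf.length() - 1;
            while (pos >= 0) {                 // skip trailing newline bytes
                raf.seek(pos);
                int b = raf.read();
                if (b != '\n' && b != '\r') break;
                pos--;
            }
            long end = pos;
            while (pos >= 0) {                 // find the previous newline
                raf.seek(pos);
                if (raf.read() == '\n') break;
                pos--;
            }
            byte[] line = new byte[(int) (end - pos)];
            raf.seek(pos + 1);
            raf.readFully(line);
            // Each line is well-formed XML, parseable on its own with StAX.
            return new String(line, StandardCharsets.UTF_8);
        }
    }
}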
Why not use SAX/StAX and simply process only your last entry? Yes, it will still need to open and go through the whole file, but at least that's fairly efficient compared to loading the whole DOM tree.
Short of doing that, I don't think you can do what you're asking using XML as a source.
Another alternative, apart from the ones provided by Jon Skeet in his answer, would be to keep the same format but insert the latest entries first, and stop processing the file as soon as you've read your entry.

How to extract specific line from a text file without sequentially travelling over lines

I want to read a specific line in a text file without sequentially going over each line and maintaining a line counter or something like that.
Basically, I want to know if there is any class in the core library that gives random access to any line in a text file.
There is no standard functionality that implements your requirement, but you might want to have a look at the Random Access Files tutorial for an alternative approach:
http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
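If the same file is read repeatedly, one common pattern on top of RandomAccessFile (the class below is illustrative, not a library API) is to pay for a single sequential pass once, record each line's byte offset, and then seek directly to any line afterwards:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final List<Long> offsets = new ArrayList<>();
    private final RandomAccessFile raf;

    public LineIndex(File file) throws IOException {
        raf = new RandomAccessFile(file, "r");
        offsets.add(0L);                        // line 0 starts at byte 0
        while (raf.readLine() != null) {        // one sequential pass, done once
            offsets.add(raf.getFilePointer());  // byte offset where the next line starts
        }
    }

    // Random access afterwards: seek straight to the recorded offset.
    // Note: RandomAccessFile.readLine maps one byte to one char, so this
    // sketch is only suitable for ASCII/ISO-8859-1 text.
    public String line(int n) throws IOException {
        raf.seek(offsets.get(n));
        return raf.readLine();
    }
}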

How to Override InputFormat and OutputFormat In hadoop Application

I have an application which needs to read a file which is a serialized ArrayList (ArrayList<String>, 50,000 records in the list, size: 20 MB).
I don't know exactly how to read the data into the Hadoop platform. I only have some sense that I need to override InputFormat and OutputFormat.
I'm a beginner in hadoop platform. Could you give me some advise?
Thanks,
Zheng.
To start with you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source for the LineRecordReader and try and relate that to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? I.e., can you seek to an offset in the file and recover the stream from there? (Text files can, as they just scan forward to the end of the current line and then start from there.) If your file format uses compression you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default FileInputFormat.isSplitable will return true, which you may want to override to return false initially. If you do stick with 'unsplittable', note that your file will be processed by a single mapper, no matter its size.
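To make that concrete, here is a hedged sketch (all class names invented) that marks the format unsplittable and deserializes the whole list in the reader; it assumes the file was written with a plain ObjectOutputStream:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SerializedListInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // cannot seek into the middle of Java-serialized data
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private Iterator<String> items;
            private int size;
            private long index = -1;
            private final Text value = new Text();

            @Override
            @SuppressWarnings("unchecked")
            public void initialize(InputSplit s, TaskAttemptContext ctx) throws IOException {
                Path path = ((FileSplit) s).getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                try (ObjectInputStream in = new ObjectInputStream(fs.open(path))) {
                    List<String> list = (List<String>) in.readObject(); // whole 20 MB list
                    size = list.size();
                    items = list.iterator();
                } catch (ClassNotFoundException e) {
                    throw new IOException(e);
                }
            }

            @Override
            public boolean nextKeyValue() {
                if (items == null || !items.hasNext()) return false;
                index++;
                value.set(items.next());   // one (position, string) record per list item
                return true;
            }

            @Override public LongWritable getCurrentKey() { return new LongWritable(index); }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return size == 0 ? 1.0f : (index + 1.0f) / size; }
            @Override public void close() { }
        };
    }
}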
Before processing data on Hadoop you should upload the data to HDFS (or another supported file system), if it wasn't put there by something else already. If you control the uploading process, you can convert the data at upload time into something you can easily process, like:
a simple text file (one line per array item)
a SequenceFile, if the array items can contain '\n'
This is the simplest solution, since you don't have to touch Hadoop's internals.
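A minimal sketch of the first option, run locally before uploading (file names are placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.PrintWriter;
import java.util.List;

// Deserialize the ArrayList<String> once, locally, and write a plain text
// file with one item per line; that file is then trivial for TextInputFormat.
public class ListToTextFile {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("list.ser"));
             PrintWriter out = new PrintWriter("list.txt", "UTF-8")) {
            for (String item : (List<String>) in.readObject()) {
                out.println(item);   // assumes items contain no '\n'; else use a SequenceFile
            }
        }
    }
}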

In Java, how do I edit 1 line of a text file?

OK, so I know the value of the line, but I don't have the line number. How would I edit only one line? It's a config file, i.e.
x=y
and I want a command to change x=y to x=y,z, or even to x=z.
In Java you can use the Properties class:
app.config file:
x=y
java:
public void writeConfig() throws Exception {
    Properties tempProp = new Properties();
    try (FileInputStream in = new FileInputStream("app.config")) {
        tempProp.load(in);               // read the existing key=value pairs
    }
    tempProp.setProperty("x", "y,z");    // change x=y to x=y,z
    try (FileOutputStream out = new FileOutputStream("app.config")) {
        tempProp.store(out, null);       // write everything back
    }
}
If you are using that configuration format, you might want to use the java.util.Properties class to read and write that file.
But if you just want to edit it by hand, you can just read the file line by line and match the variable you want to change.
One way to do it is to:
Read the file into memory; e.g. as an array of Strings representing the lines of the file.
Locate the String/line you want to change.
Use a regex (or whatever) to modify the String/line
Write a new version of the file from the in memory version.
There are many variations on this. You also need to take care when you write the new version of the file to guard against losing everything if something goes wrong during the write. (Typically you write the new version to a temporary file, rename the old version out of the way (e.g. as a backup) and rename the new version in place of the old one.)
Unfortunately, there is no way to add or remove characters in the middle of a regular text file without rewriting a large part of the file. This "problem" is not specific to Java. It is fundamental to the way that text files are modelled / represented on most mainstream operating systems.
Unless the new line has the exact same length as the old one, your best bet is to
Open a temporary output file
Read the config file, line by line
Search for your key
If you can't find it, just write the line you just read to the output file
If you can find it, write the new value to the temporary file instead
Until you hit EOF
Delete old file
Rename new file to the old file
If your config file is small, you can also do the whole parsing/modification step in memory and then write the final result back to the config file; that way you skip the temporary file (although a temporary file is a good way to prevent corruption if something breaks while you write the file).
If this is not what you're looking for, you should edit your question to be a lot more clear. I'm just guessing what you're asking for.
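For concreteness, here is a minimal sketch of the temporary-file-and-rename recipe described in the two answers above (the key name and file paths are invented for the example):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class ConfigEditor {
    public static void replaceValue(Path config, String key, String newValue) throws IOException {
        List<String> out = new ArrayList<>();
        for (String line : Files.readAllLines(config, StandardCharsets.UTF_8)) {
            // Lines that don't start with "key=" are copied through unchanged.
            out.add(line.startsWith(key + "=") ? key + "=" + newValue : line);
        }
        // Write to a temp file first, then atomically move it over the original,
        // so a crash mid-write cannot corrupt the config.
        Path dir = config.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "config", ".tmp");
        Files.write(tmp, out, StandardCharsets.UTF_8);
        Files.move(tmp, config, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        replaceValue(Paths.get("app.config"), "x", "y,z");  // turns x=y into x=y,z
    }
}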
If your data is all key and value pairs, for example ...
key1=value1
key2=value2
... then load them into a Properties object. Off the top of my head, you'll need a FileInputStream to load the properties, modify with myProperties.put(key, value) and then save the properties with the use of a FileOutputStream.
Hope this helps!
rh
