I am trying to write a Hadoop MapReduce program in Java for which the input is an array and the output is also an array. So far I have only seen people use files as the inputs and outputs. I was just wondering whether MapReduce can use any other input and output formats.
Thanks
Hadoop supports a wide variety of input and output formats. Check the subclasses of InputFormat and OutputFormat, and extend InputFormat and OutputFormat yourself if a custom format is required. Also check this article from Cloudera on the DB input/output formats.
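For reference, a minimal driver sketch of where such formats plug in. MyArrayInputFormat and MyArrayOutputFormat are hypothetical classes you would write yourself by extending InputFormat and OutputFormat; everything else is the standard Hadoop mapreduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CustomFormatDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "custom-format-job");
    job.setJarByClass(CustomFormatDriver.class);

    // Plug in custom formats instead of the default file-based ones.
    job.setInputFormatClass(MyArrayInputFormat.class);    // hypothetical
    job.setOutputFormatClass(MyArrayOutputFormat.class);  // hypothetical

    // Mapper/Reducer classes and key/value types would be set here as usual.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same pattern applies to the DB formats from the Cloudera article: swap in DBInputFormat/DBOutputFormat and configure them instead.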
Hadoop is built around a distributed file system (HDFS), and the point of MapReduce is to tackle amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
The Hadoop MapReduce classes come with support for reading the different file types supported by Hadoop (text files, sequence files, etc.). You can also write your own sources; for example, HBase comes with a MapReduce wrapper that reads its own file format. I haven't tried that myself, but as the article Praveen pointed to demonstrates, you can read from other sources.
Output is even easier: since you're writing Java code, you can do whatever you like in your reduce phase, so if you want to, say, put messages onto a queue in the reduce phase, just do that.
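To illustrate, here is a rough reducer sketch. QueueClient is a hypothetical stand-in for whatever messaging client you would actually use (a JMS or Kafka producer, say); the Hadoop classes are the standard mapreduce API:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class QueueingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private QueueClient queue;  // hypothetical messaging client, not a real API

  @Override
  protected void setup(Context context) {
    // Hypothetical: open a connection using a URL stored in the job configuration.
    queue = QueueClient.connect(context.getConfiguration().get("queue.url"));
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    queue.send(key.toString() + "=" + sum);   // side effect: publish to the queue
    context.write(key, new IntWritable(sum)); // optionally still write normal output
  }

  @Override
  protected void cleanup(Context context) {
    queue.close();
  }
}
```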
Related
I was looking at the WordCount example from Apache Beam, and when I tried to run this example locally, it wrote the counts into multiple files. I created a test project to read and write data from a file, and even that write operation split the output across multiple files. How do I get the result in just a single file? I am using the direct runner.
That is happening for performance reasons. You should be able to force a single file by using TextIO.Write.withoutSharding
withoutSharding
public TextIO.Write withoutSharding()
Forces a single file as output and empty shard name template. This
option is only compatible with unwindowed writes.
For unwindowed writes, constraining the number of shards is likely to
reduce the performance of a pipeline. Setting this value is not
recommended unless you require a specific number of output files.
This is equivalent to .withNumShards(1).withShardNameTemplate("")
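For example, a minimal pipeline sketch (this assumes the Beam 2.x Java SDK style, where TextIO.write() is used; the file names are just placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SingleFileWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("input.txt"))
        // ...word-count transforms would normally go here...
        .apply("WriteCounts", TextIO.write()
            .to("wordcounts")       // output file prefix
            .withoutSharding());    // force a single output file (unwindowed writes only)

    p.run().waitUntilFinish();
  }
}
```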
A colleague of mine was telling me that he didn't see any issues with using tarballs to hold the data sent to a MapReduce job. My understanding of how Hadoop and MR/Spark work together is that the preferred Hadoop storage formats are designed so that data files can be split along the HDFS block size and fanned out to MR mappers or Spark workers (to be partitioned). Tar strikes me as a really poor format for this because, AFAIK, tar was not designed in any way to accommodate how Hadoop and its various job engines work. Am I missing something here?
I want to get better performance for data processing using Hadoop MapReduce. Do I need to use it along with HDFS, or can MapReduce be used with other types of distributed storage? Show me the way, please.
Hadoop is a framework that includes the MapReduce programming model for computation and HDFS for storage.
HDFS stands for Hadoop Distributed File System, which is inspired by the Google File System. The overall Hadoop project is based on research papers published by Google:
research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce.html
Using the MapReduce programming model, data is processed in parallel on different nodes across the cluster, which decreases the processing time.
You need to use HDFS or HBase to store your data in the cluster to get high performance. If you choose a normal file system, there will not be much difference. Once the data goes into the distributed file system, it is automatically divided into blocks and replicated (3 times by default) to provide fault tolerance. None of this is possible with a normal file system.
Hope this helps!
First, your premise is off: the performance of Hadoop MapReduce is not directly related to the performance of HDFS. MapReduce is considered slow because of its architecture:
It processes data with Java. Each mapper and reducer runs in a separate JVM instance, which needs to be started, and that takes time.
It puts intermediate data on disk many times. At a minimum, mappers write their results to disk (one), reducers read and merge them, spilling the merged set to disk (two), and the reducer results are written back to your filesystem, usually HDFS (three). You can find more details on the process here: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/.
Second, Hadoop is an open framework and it supports many different filesystems. You can read data from FTP, S3, a local filesystem (an NFS share, for instance), MapR-FS, IBM GPFS, GlusterFS by Red Hat, etc., so you are free to choose the one you like. The main idea for MapReduce is to specify an InputFormat and OutputFormat that are able to work with your filesystem.
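As a small illustration of that last point, the same job can read and write different filesystems just by changing the Path URIs. The bucket and paths below are placeholders, and the s3a scheme assumes the hadoop-aws module and credentials are configured:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiFsPaths {
  public static void configurePaths(Job job) throws Exception {
    // Read from S3 (or any other filesystem scheme Hadoop knows about)...
    FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
    // ...and write to the local filesystem instead of HDFS.
    FileOutputFormat.setOutputPath(job, new Path("file:///tmp/job-output"));
  }
}
```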
Spark is currently considered a faster replacement for Hadoop MapReduce, as it keeps much of the computation in memory. But whether to use it really depends on your case.
Assume you have a Java application which processes some input (ranging from 1 to 5 GB) and saves the output (hundreds of MBs) to a file in an append-only environment like HDFS.
The basic structure of the file is as follows:
a set of values (most of the data)
a set of keys
some metadata
Keys and values are concepts similar to those in the MapReduce paradigm.
Since the amount of data written to the file is huge, it is better to dump chunks of the file to disk when possible. What are good ways of designing such a file format so that it stays flexible for later releases? How do we maintain versions of the file format in Java?
Any good resources/links would be helpful too! I am trying to understand best practices for creating your own custom file format with the above constraints.
Thanks!
Have you considered Apache Avro?
http://avro.apache.org/docs/1.3.0/index.html
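A minimal sketch of what that could look like with Avro's generic API (this assumes a reasonably recent Avro release; the record/field names and the "format.version" metadata key are just made up for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDumpExample {
  // Hypothetical schema: one record per key/value pair; file-level metadata
  // (including a version marker) goes into the container file header.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"bytes\"}]}";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setMeta("format.version", "1");        // simple versioning hook
    writer.create(schema, new File("output.avro"));

    GenericRecord entry = new GenericData.Record(schema);
    entry.put("key", "example-key");
    entry.put("value", ByteBuffer.wrap(new byte[] {1, 2, 3}));
    writer.append(entry);                         // records are flushed to disk in blocks

    writer.close();
  }
}
```

Avro container files are splittable, carry their schema with them, and Avro's schema evolution rules give you a path for changing the format in later releases.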
I am working on a project to integrate Apache Avro into my MapReduce program. However, I am very confused by the usage of the new mapreduce package compared to mapred. The latter has detailed instructions on how to use it in different situations, while much less information is given for the new one. What I do know is that they correspond to the new and old APIs of Hadoop. Does anyone have any experience or examples of using the mapreduce interfaces for jobs whose input is non-Avro data (such as a TextInputFormat file) and whose output is an Avro file?
The two packages provide the input/output formats and the mapper and reducer base classes for the corresponding Hadoop mapred and mapreduce APIs.
So if your job uses the old (mapred) package APIs, then you should use the corresponding mapred avro package classes.
Avro has an example word count adaptation that uses the Avro output format, which should be easy to modify for the newer mapreduce API:
http://svn.apache.org/viewvc/avro/trunk/doc/examples/mr-example/src/main/java/example/AvroWordCount.java?view=markup
Here's some gist with the modifications: https://gist.github.com/chriswhite199/6755242
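For what it's worth, here is a rough, independent sketch (not the code from the gist above) of a word count that reads plain text and writes an Avro container file, assuming the newer org.apache.avro.mapreduce classes from the avro-mapred artifact (the hadoop2 flavor):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextToAvroWordCount {

  // Standard word-count mapper over plain text input.
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer that emits Avro-wrapped key/value pairs for AvroKeyValueOutputFormat.
  public static class SumReducer
      extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(new AvroKey<CharSequence>(key.toString()),
                    new AvroValue<Integer>(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "text-to-avro-wordcount");
    job.setJarByClass(TextToAvroWordCount.class);

    // Non-Avro input.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(TokenizeMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Avro container-file output: declare schemas for the output key/value.
    job.setReducerClass(SumReducer.class);
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```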