I am working on a project to integrate Apache Avro into my MapR program. However, I am quite confused by the usage of the new mapreduce packages compared to mapred. The latter comes with detailed instructions on how to use it in different situations, while much less information is given for the new one. What I do know is that they correspond to the new and old Hadoop interfaces.
Does anyone have any experience or examples of using the mapreduce interfaces for jobs whose input is non-Avro data (such as a TextInputFormat file) and whose output is an Avro file?
The two packages contain the input/output formats and the mapper and reducer base classes for the corresponding Hadoop mapred and mapreduce APIs. So if your job uses the old (mapred) package APIs, you should use the corresponding mapred Avro package classes.
Avro has an example word-count adaptation that uses an Avro output format, which should be easy to modify for the newer mapreduce API:
http://svn.apache.org/viewvc/avro/trunk/doc/examples/mr-example/src/main/java/example/AvroWordCount.java?view=markup
Here's a gist with the modifications: https://gist.github.com/chriswhite199/6755242
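For the text-in / Avro-out case specifically, here is a rough sketch (not the exact code from the gist above) using the newer org.apache.avro.mapreduce classes; class names and paths are placeholders, and it assumes the avro and avro-mapred artifacts are on the classpath:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextToAvroWordCount {

  // Reads plain text lines and emits (word, 1) pairs - nothing Avro-specific here.
  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts and writes Avro key/value records.
  public static class AvroSumReducer
      extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(new AvroKey<CharSequence>(key.toString()), new AvroValue<Integer>(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "text-to-avro-wordcount");
    job.setJarByClass(TextToAvroWordCount.class);

    // Non-Avro input: ordinary text files.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(TokenizeMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Avro output: declare key/value schemas and use the mapreduce-API output format.
    job.setReducerClass(AvroSumReducer.class);
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}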
Related
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't really find any good/working examples or any useful documentation for Java (only for Python), and some examples don't provide all of the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.
Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.
Also, please provide Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
It's somewhat overkill, but you can use Spark.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
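For example, a minimal sketch (assuming the spark-sql artifact for your Scala version, e.g. spark-sql_2.12, is on the classpath) that reads a local Parquet file without any cluster or HDFS:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadParquetLocally {
  public static void main(String[] args) {
    // Local, in-process Spark - no cluster required.
    SparkSession spark = SparkSession.builder()
        .appName("read-parquet")
        .master("local[*]")
        .getOrCreate();

    // A plain local path works; no Hadoop file system involved.
    Dataset<Row> df = spark.read().parquet("example.parquet");
    df.printSchema();
    df.show();

    spark.stop();
  }
}

One caveat for Windows: Spark may still complain unless winutils.exe is available via HADOOP_HOME, even though no Hadoop file system is actually used.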
A colleague of mine was telling me that he didn't see any issues with using tarballs to hold the data sent to a MapReduce job. My understanding of how Hadoop and MR/Spark work together is that the preferred Hadoop storage formats are designed so that data files can be split along the Hadoop block size and fanned out to MR mappers or Spark workers (to be partitioned). Tar strikes me as a really poor format for this because, as far as I know, tar is not in any way designed to accommodate the way Hadoop and its various job engines work. Am I missing something here?
When you download Apache Crunch from its website, it comes as source code, without the related MapReduce classes it's based on. Two questions:
1- How is this possible? Apache Crunch is an abstraction on top of MapReduce. How come it isn't packaged with the MapReduce classes?
2- What do I need to do to develop using Apache Crunch? Do I need to download Crunch and MapReduce separately? If so, how can I know which MapReduce version I need to match the Crunch version?
I just looked for MapReduce classes within Apache Crunch.
I performed a random check, and none of those classes seem to extend Hadoop classes; rather, they are described like below:
"Static functions for working with legacy Mappers and Reducers that live under the org.apache.hadoop.mapred.* package as part of Crunch pipelines."
Are you using version 0.6 or earlier?
I was wondering if it is possible to define a hierarchical MapReduce job.
In other words, I would like to have a MapReduce job that, in the mapper phase, calls a different MapReduce job. Is it possible? Do you have any recommendations on how to do it?
I want to do it in order to have an additional level of parallelism/distribution in my program.
Thanks,
Arik.
The book Hadoop: The Definitive Guide contains a lot of recipes related to MapReduce job chaining, including sample code and detailed explanations; see especially the chapter called something like 'Advanced API usage'.
I personally succeeded in replacing a complex map-reduce job that used several HBase tables as sources with a handmade TableInputFormat extension. The result was an input format that combined the source data with minimal reduction, so the job was transformed into a single mapper step. So I recommend you look in this direction too.
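For straightforward chaining of dependent jobs from one driver, a minimal sketch using JobControl/ControlledJob from org.apache.hadoop.mapreduce.lib.jobcontrol (job names and per-job configuration are placeholders, the actual mapper/reducer setup is elided) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job first = Job.getInstance(conf, "first-stage");   // configure mapper/reducer, paths...
    Job second = Job.getInstance(conf, "second-stage"); // reads the output of the first stage

    ControlledJob firstStage = new ControlledJob(first, null);
    ControlledJob secondStage = new ControlledJob(second, null);
    secondStage.addDependingJob(firstStage);            // second runs only after first succeeds

    JobControl control = new JobControl("chained-pipeline");
    control.addJob(firstStage);
    control.addJob(secondStage);

    // JobControl.run() blocks, so run it in its own thread and poll until done.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(5000);
    }
    control.stop();
    System.exit(control.getFailedJobList().isEmpty() ? 0 : 1);
  }
}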
You should try Cascading. It allows you to define pretty complex jobs with multiple steps.
I guess you need the Oozie tool. Oozie helps in defining workflows using an XML file.
I am trying to write a Hadoop MapReduce program in Java for which the input is an array and the output is also an array. But so far I have only seen people use files as inputs and outputs for it, so I was wondering whether MapReduce can have any other input and output formats.
Thanks
A wide variety of input and output formats are supported by Hadoop; check the subclasses of InputFormat and OutputFormat. Extend InputFormat and OutputFormat if any custom formats are required. Check this article from Cloudera on the DB input/output formats.
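As a rough illustration (not taken from that article), this is roughly what wiring up the mapreduce-API DBInputFormat looks like; the JDBC URL, the "users" table, its columns and the UserRecord class are all made-up placeholders, and the rest of the job setup is elided:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbInputExample {

  // Each row of the (hypothetical) "users" table is deserialized into this class.
  public static class UserRecord implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      name = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JDBC driver and connection string are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/mydb", "user", "password");

    Job job = Job.getInstance(conf, "db-input-example");
    job.setJarByClass(DbInputExample.class);
    job.setInputFormatClass(DBInputFormat.class);
    // Read two columns from the "users" table, ordered by id.
    DBInputFormat.setInput(job, UserRecord.class, "users", null, "id", "id", "name");
    // ...then set the mapper/reducer and an output format as usual and submit the job...
  }
}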
Hadoop is a file system, and the point of map-reduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
The Hadoop mapreduce classes come with support for reading the different types of files supported by Hadoop (text files, sequence files). You can also write your own sources; e.g. HBase comes with a map-reduce wrapper that reads its own file format. I haven't tried that, but you can, as the article Praveen pointed to demonstrates, read from other sources.
Output is even easier: since you're writing Java code, you can do whatever you want in your reduce phase. So if you want to, say, put messages into a queue in the reduce phase, just do that.
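As a rough sketch of that idea (QueueClient is a made-up stand-in for whatever messaging client you actually use, e.g. a JMS or Kafka producer):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class QueuePublishingReducer
    extends Reducer<Text, IntWritable, NullWritable, NullWritable> {

  // Hypothetical messaging client; replace with a real producer.
  private QueueClient queue;

  @Override
  protected void setup(Context context) {
    queue = new QueueClient(context.getConfiguration().get("queue.endpoint"));
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Send the result somewhere other than the job's output format.
    queue.send(key.toString() + "=" + sum);
  }

  @Override
  protected void cleanup(Context context) {
    queue.close();
  }

  // Stub so the sketch compiles on its own; stand-in for a real client.
  static class QueueClient {
    QueueClient(String endpoint) {}
    void send(String message) {}
    void close() {}
  }
}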