When you download Apache Crunch from its website (it ships as source code), it comes without the MapReduce classes it is built on. Two questions:
1- How is this possible? Apache Crunch is an abstraction on top of MapReduce, so how come it isn't packaged with the MapReduce classes?
2- What do I need to do to develop with Apache Crunch? Do I need to download Crunch and MapReduce separately? If so, how do I know which MapReduce version matches my Crunch version?
I just looked for MapReduce classes within Apache Crunch.
I performed a spot check, and none of those classes seem to extend Hadoop classes; instead they are described like this:
Static functions for working with legacy Mappers and Reducers that live under the org.apache.hadoop.mapred.* package as part of Crunch pipelines.
Are you using version 0.6 or earlier?
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't really find any good, working examples or any useful documentation for Java (only for Python), and some examples don't provide all the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.
Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.
Also, please provide Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
It's somewhat overkill, but you can use Spark:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
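For reference, here is a minimal sketch of reading a local Parquet file with Spark's Java API. It assumes the spark-sql artifact (for example org.apache.spark:spark-sql_2.12) is on the Maven classpath; the class name and file path are illustrative. You get a Spark Dataset&lt;Row&gt; rather than an Arrow Table, but no Hadoop cluster or HDFS setup is required.

// Minimal local-mode Spark program that reads a Parquet file from the local filesystem.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadParquetWithSpark {
    public static void main(String[] args) {
        // Run Spark locally so no cluster or HDFS is needed.
        SparkSession spark = SparkSession.builder()
                .appName("read-parquet")
                .master("local[*]")
                .getOrCreate();

        // Load the Parquet file into a DataFrame (Dataset<Row>).
        Dataset<Row> df = spark.read().parquet("example.parquet");

        df.printSchema();    // column names/types taken from the Parquet footer
        df.show(10, false);  // first 10 rows, untruncated

        spark.stop();
    }
}

Note that on Windows, Spark typically complains about a missing winutils.exe / HADOOP_HOME; depending on the Spark version, reading local files may still work, but you may need to point hadoop.home.dir at a directory containing winutils.exe.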
A colleague of mine was telling me that he didn't see any issues with using tarballs to hold the data sent to a MapReduce job. My understanding of how Hadoop and MR/Spark work together is that the preferred Hadoop storage formats are designed so that the data files can be split along the Hadoop block size and fanned out to MR mappers or Spark workers (to be partitioned). Tar strikes me as a really terrible format for this because, AFAIK, tar is not in any way designed to accommodate the way Hadoop and its various job engines work. Am I missing something here?
I am working on a project to integrate Apache Avro into my MapR program. However, I am very confused by the usage of the new mapreduce packages compared to mapred. The latter comes with detailed instructions on how to use it in different situations, while less information is given for the new one. What I do know is that they correspond to the new and old interfaces of Hadoop. Does anyone have any experience or examples of using the mapreduce interfaces for jobs whose input is non-Avro data (such as a TextInputFormat file) and whose output is an Avro file?
The two packages represent input / output formats, mapper and reducer base classes for the corresponding Hadoop mapred and mapreduce APIs.
So if your job uses the old (mapred) package APIs, then you should use the corresponding mapred avro package classes.
Avro has an example word count adaptation that uses Avro output format, which should be easy to modify for the newer mapreduce API:
http://svn.apache.org/viewvc/avro/trunk/doc/examples/mr-example/src/main/java/example/AvroWordCount.java?view=markup
Here's a gist with the modifications: https://gist.github.com/chriswhite199/6755242
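To make that concrete, here is a rough sketch of such an adaptation against the newer mapreduce API, assuming the avro-mapred artifact (hadoop2 flavor) and avro are on the classpath. The input comes in through the ordinary TextInputFormat, and only the final output is Avro, written via AvroKeyValueOutputFormat; class names are illustrative.

// Word count: plain-text input, Avro key/value output, using the new mapreduce API.
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvroWordCountNewApi {

  // Mapper works with plain Writables; only the job output is Avro.
  public static class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer emits AvroKey/AvroValue pairs, which AvroKeyValueOutputFormat serializes.
  public static class SumReducer
      extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(new AvroKey<CharSequence>(key.toString()), new AvroValue<Integer>(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-wordcount-new-api");
    job.setJarByClass(AvroWordCountNewApi.class);

    // Non-Avro input via the new-API TextInputFormat.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setMapperClass(TokenizeMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Avro output: declare the schemas for the output key and value.
    job.setReducerClass(SumReducer.class);
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}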
I was wondering if it is possible to define a hierarchical MapReduce job.
In other words I would like to have a map-reduce job, that in the mapper phase will call a different MapReduce job. Is it possible? Do you have any recommendations how to do it?
I want to do it in order to have additional level of parallelism/distribution in my program.
Thanks,
Arik.
The Hadoop: The Definitive Guide book contains a lot of recipes related to MapReduce job chaining, including sample code and detailed explanations; see especially the chapter called something like 'Advanced API usage'.
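For what it's worth, the simplest form of chaining is just a driver that runs the jobs sequentially, feeding the output path of one in as the input path of the next. Here is a bare-bones sketch with illustrative class and path names (the mapper/reducer classes are left as the defaults):

// Driver that chains two MapReduce jobs: job 2 consumes the output of job 1.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]); // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // First job: the "inner" computation you wanted to launch per mapper.
    Job first = Job.getInstance(conf, "step-1");
    first.setJarByClass(ChainedJobsDriver.class);
    // first.setMapperClass(...); first.setReducerClass(...);
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) {
      System.exit(1); // stop the chain if step 1 fails
    }

    // Second job consumes the intermediate output of the first.
    Job second = Job.getInstance(conf, "step-2");
    second.setJarByClass(ChainedJobsDriver.class);
    // second.setMapperClass(...); second.setReducerClass(...);
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}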
I personally succeeded in replacing a complex map-reduce job that used several HBase tables as sources with a handmade TableInputFormat extension. The result was an input format that combines the source data with minimal reduction, so the job was transformed into a single mapper step. So I recommend you look in this direction too.
You should try Cascading. It allows you to define pretty complex jobs with multiple steps.
I guess you need the Oozie tool. Oozie helps in defining workflows using an XML file.
I am trying to build a recommendation engine. For that I am thinking of using Apache Mahout, but I am unable to make out whether Mahout processes the data in real time or pre-processes the data when the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside from an older project that are essentially real time for moderate scale. There are also Hadoop-based implementations which are all offline. The two are not related.
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project, Myrrix (http://myrrix.com).