Hello, I'm trying to run MapReduce jobs on Git repositories. I wanted to use a map job to first clone all repositories to HDFS concurrently, then run further MapReduce jobs on the files. I'm running into a problem in that I'm not sure how to write the repository files to HDFS. I have seen examples that write individual files, but those were outside the mapper and only wrote single files. The JGit API only exposes a FileRepository structure that inherits from File, but HDFS uses Paths written as DataOutputStreams. Is there a good way to convert between the two, or any examples that do something similar?
Thanks
The input data to a Hadoop mapper must be on HDFS, not on your local machine or anywhere else. MapReduce jobs are not meant for migrating data from one place to another; they are used to process huge volumes of data already present on HDFS. I am sure your repository data is not on HDFS, and if it were, you wouldn't have needed to perform any operation in the first place. So please keep in mind that MapReduce jobs are for processing large volumes of data already present on HDFS (the Hadoop file system).
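If the real goal is simply to stage the cloned repositories on HDFS before running any jobs, that copy is an ordinary tree walk plus a stream write. A pure-JDK sketch of the traversal follows (all names here are hypothetical; in real use you would replace the local Files.copy with a write to a stream from FileSystem.create(new Path(...)) in the HDFS client API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class StageToHdfs {
    // Walk a cloned repository's working tree and copy every regular file
    // to a target root, preserving relative paths. With the HDFS client,
    // the Files.copy call would become a write to fs.create(new Path(...)).
    static int copyTree(Path srcRoot, Path dstRoot) throws IOException {
        int copied = 0;
        try (Stream<Path> walk = Files.walk(srcRoot)) {
            for (Path src : (Iterable<Path>) walk::iterator) {
                if (!Files.isRegularFile(src)) continue;
                Path dst = dstRoot.resolve(srcRoot.relativize(src).toString());
                Files.createDirectories(dst.getParent());
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("repo");
        Files.write(src.resolve("README"), "hello".getBytes());
        Path dst = Files.createTempDirectory("staged");
        System.out.println(copyTree(src, dst) + " file(s) copied"); // prints: 1 file(s) copied
    }
}
```

The same shape works from inside a mapper, since JGit's FileRepository ultimately gives you a directory on the local disk that you can walk like any other.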
Related
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished with Sqoop, but that is not my task.
Can someone please explain a possible way to do this.
I did attempt it. What I did was copy data from a PostgreSQL server using the JDBC driver and then store it in CSV format on HDFS. Is this the right way to go about this?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to run Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run sequential fetches against the RDBMS), and can lead to data inconsistencies (the live RDBMS is updated while this long Sqoop process is always lagging behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change-data-capture streams that listen to the RDBMS. An example project is Debezium. You can then build a real-time ETL pipeline that synchronizes the RDBMS with the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (e.g. PostgreSQL), reads the table's metadata, and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you control the parallelism yourself for performance), do the following:
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it to a file on HDFS.
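Step 2 can be sketched without a cluster. This is a minimal, hypothetical example: in real use the rows would come from a JDBC ResultSet and the OutputStream from Hadoop's FileSystem.create(path); here plain strings and a ByteArrayOutputStream stand in:

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class CsvRowWriter {
    // Quote a field if it contains a comma, quote, or newline (RFC 4180 style).
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join one row (e.g. the columns of one JDBC ResultSet row) into a CSV line.
    static String toCsvLine(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    // Write rows to any OutputStream; with Hadoop this stream would come from
    // FileSystem.get(conf).create(new Path("/staging/table.csv")).
    static void writeRows(OutputStream out, String[][] rows) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String[] row : rows) {
                w.write(toCsvLine(row));
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeRows(buf, new String[][] { { "1", "O'Brien", "a,b" } });
        System.out.print(buf.toString("UTF-8")); // prints: 1,O'Brien,"a,b"
    }
}
```

Because the writer only depends on OutputStream, the same class works unchanged against a local file, a test buffer, or an HDFS stream.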
Another way of doing this:
Create a MapReduce job using DBInputFormat, pass the number of input splits, and use TextOutputFormat with an HDFS output directory.
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop- and Spark-related questions.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because if you go down the path of building it yourself, you may end up doing exactly what Sqoop already does.
Either way, conceptually, you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimize the calls on the source database. The complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. This results in every mapper issuing the same query, along with a sort of the dataset, so that it can pick out its own portion.
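The slicing described above can be sketched in plain Java. This is a simplified, hypothetical version of the arithmetic only, ignoring DBInputFormat's actual classes: the row count is divided evenly across the splits and the last split takes the remainder, then each mapper runs the same ORDER BY query with its own LIMIT/OFFSET:

```java
public class SplitPlanner {
    // Divide `count` rows into `chunks` splits of (start offset, length);
    // the last split absorbs the remainder. Each mapper then runs the
    // same ordered query with LIMIT length OFFSET start.
    static long[][] splits(long count, int chunks) {
        long size = count / chunks;
        long[][] out = new long[chunks][2];
        for (int i = 0; i < chunks; i++) {
            out[i][0] = i * size;
            out[i][1] = (i == chunks - 1) ? count - i * size : size;
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] s : splits(10, 3)) {
            System.out.println("OFFSET " + s[0] + " LIMIT " + s[1]);
        }
    }
}
```

This makes the cost visible: every split still forces the database to sort (or at least scan) the full result before skipping to its offset, which is why a partitioned source table helps so much.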
This class doesn't take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats like AVRO, ORC and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases, where only a few columns out of a large set need to be selected), go with AVRO.
The way you are trying to do it is not a good one, because you will waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind is Sqoop (SQL-to-Hadoop).
I want to store some Java/Scala objects as records in Parquet format, and I'm currently using parquet-avro and the AvroParquetWriter class for this purpose. This works fine, but it is very coupled to Hadoop and its file system implementation(s). Instead, I would like to somehow get the raw binary data of the files (preferably, but not necessarily, in a streaming fashion) and handle the writing of the files "manually", due to the nature of the framework I'm integrating with. Has anyone been able to achieve something like this?
I've been parsing log files using MapReduce, but it always outputs a text file named "part-00000" to store my results, and I then have to import part-00000 into MySQL manually.
Is there an easy way to store MapReduce results directly in MySQL? For example, how might I store the results of the classic "Word Count" MapReduce program in MySQL directly?
I'm using Hadoop 1.2.1 and the mapred libraries (i.e. org.apache.hadoop.mapred.* instead of org.apache.hadoop.mapreduce.*; the two are not compatible as far as I'm aware). I don't have access to Sqoop.
By using DBOutputFormat, we can write MapReduce output directly to a database.
Here is an example; go through it.
Personally, I suggest Sqoop for data imports (from DB to HDFS) and exports (from HDFS to DB).
My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.
My problem, however, is that I don't understand the problem they are both solving in the first place!
Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data.
I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?
What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?
Apache Pig and Apache Hive load data from HDFS, unless you run them locally, in which case they load it locally. How do they get the data from a DB? They do not. You need another framework, such as Sqoop, to export the data from your traditional DB into HDFS.
Once you have the data in your HDFS, you can start working with Pig and Hive. They never query a DB. In Apache Pig, for example, you could load your data using a Pig loader:
A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');
As for Hive, you need to create a table and then load the data into the table:
LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
Again, the data must be in the HDFS.
As to how it works, it depends. Traditionally it has always worked with a MapReduce execution engine. Both Hive and Pig parse the statements you write in Pig Latin or HiveQL and translate them into an execution plan consisting of a certain number of MapReduce jobs, depending on the plan. However, they can now also translate to Tez, a new execution engine which is perhaps too new to work reliably.
Why the need for Pig or Hive? Well, you don't strictly need these frameworks. Everything they can do, you can also do by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is a single line of code in Pig or Hive.
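To see why the MapReduce version balloons, here is a toy sketch (plain Java, no Hadoop; all names are hypothetical) of just the reducer's share of a classic reduce-side join, after the mappers have already tagged each record with its source relation and the framework has grouped records by join key:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyReduceSideJoin {
    // What a reduce-side join's reducer must do for one join key:
    // separate the tagged records by source relation ("L" or "R"),
    // then emit the cross product of the two sides.
    static List<String> joinOneKey(List<String[]> tagged) {
        List<String> left = new ArrayList<>(), right = new ArrayList<>();
        for (String[] rec : tagged) {
            if (rec[0].equals("L")) left.add(rec[1]); else right.add(rec[1]);
        }
        List<String> out = new ArrayList<>();
        for (String l : left)
            for (String r : right)
                out.add(l + "|" + r);
        return out;
    }

    public static void main(String[] args) {
        // Records grouped under one key, tagged with their source relation.
        List<String[]> group = Arrays.asList(
            new String[]{"L", "alice"},
            new String[]{"R", "order-1"},
            new String[]{"R", "order-2"});
        System.out.println(joinOneKey(group)); // prints: [alice|order-1, alice|order-2]
    }
}
```

And this is only the core logic: a real job adds the tagging mappers, composite keys, serialization, and the driver. In Pig the whole thing is roughly `C = JOIN A BY id, B BY id;`.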
I don't think you can query any data with Hive/Pig without first loading it into them. So first you need to load the data. This data can come from any place; you just give the path where the data should be picked up, or add it directly to the tables. Once the data is in place, queries fetch data only from those tables.
Underneath, they use MapReduce as the tool that does the processing. If you just have ad hoc data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive is mostly the SQL front end: you get querying features similar to SQL, and at the back end MapReduce does the job. Hope this info helps.
I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow, stored on HDFS, as a directed acyclic graph. These are their main goals, setting other uses aside. Here I want to draw the distinction between:
Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year?
Managing a data flow (the main purpose of Pig), which means taking your data from an initial state to a different final state through transformations, for example: data in location A, filtered by criteria c, joined with data in location B, stored in location C.
Smeeb, Pig and Hive do the same kind of thing, I mean processing data that comes in files or whatever format.
Here, if you want to process data present in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + Hadoop).
Hive uses HQL, an SQL-like language, for processing; Pig uses a data-flow style with the help of Pig Latin.
Hive stores all input data in table format, so before loading data into Hive, first create a Hive table; that structure (metadata) will be stored in an RDBMS such as MySQL. Then load with: LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
I want to get better performance for data processing using Hadoop MapReduce. So, do I need to use it along with HDFS? Or can MapReduce be used with other kinds of distributed data? Show me the way, please.
Hadoop is a framework which includes the MapReduce programming model for computation and HDFS for storage.
HDFS stands for Hadoop Distributed File System, which is inspired by the Google File System. The overall Hadoop project is based on research papers published by Google.
research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce.html
Using the MapReduce programming model, data is computed in parallel across different nodes of the cluster, which decreases the processing time.
You need to use HDFS or HBase to store your data in the cluster to get high performance. If you choose a normal file system, there will not be much difference. Once the data is in the distributed file system, it is automatically divided into blocks and replicated (3 times by default) for fault tolerance. None of this is possible with a normal file system.
Hope this helps!
First, your premise is wrong: the performance of Hadoop MapReduce is not directly related to the performance of HDFS. It is considered slow because of its architecture:
It processes data with Java. Each separate mapper and reducer is a separate JVM instance, which needs to be started, and that takes some time.
It puts intermediate data on the disks many times. At minimum, mappers write their results (one); reducers read and merge them, writing the merged set to disk (two); and reducer results are written back to your filesystem, usually HDFS (three). You can find more details on the process here: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/.
Second, Hadoop is an open framework and supports many different filesystems. You can read data from FTP, S3, a local filesystem (an NFS share, for instance), MapR-FS, IBM GPFS, GlusterFS by Red Hat, etc. So you are free to choose the one you like. The main idea is that for MapReduce you specify an InputFormat and an OutputFormat that can work with your filesystem.
Spark is at the moment considered a faster replacement for Hadoop MapReduce, as it keeps much of the computation in memory. But whether to use it really depends on your case.