I've been parsing log files using MapReduce, but it always outputs a text file named "part-00000" to store my results, and I then have to import part-00000 into MySQL manually.
Is there an easy way to store MapReduce results directly in MySQL? For example, how might I store the results of the classic "Word Count" MapReduce program in MySQL directly?
I'm using Hadoop 1.2.1 and the mapred libraries (i.e., org.apache.hadoop.mapred.* instead of org.apache.hadoop.mapreduce.*; the two are not compatible as far as I'm aware). I don't have access to Sqoop.
By using DBOutputFormat, you can write MapReduce output directly to a database. Here is an example to go through.
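Below is a minimal, untested sketch of the classic Word Count writing its results straight into MySQL with the old mapred API (Hadoop 1.x). The table wordcount(word VARCHAR(255), count INT), the JDBC URL and the credentials are placeholders to adapt, and the MySQL JDBC driver jar must be available to the job (for example via -libjars or the job's lib directory).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class WordCountToMySQL {

  // One result row; DBOutputFormat turns each emitted key into an INSERT.
  public static class WordCountRecord implements Writable, DBWritable {
    private String word;
    private int count;

    public WordCountRecord() { }
    public WordCountRecord(String word, int count) { this.word = word; this.count = count; }

    public void write(DataOutput out) throws IOException { out.writeUTF(word); out.writeInt(count); }
    public void readFields(DataInput in) throws IOException { word = in.readUTF(); count = in.readInt(); }

    // Binds this record to "INSERT INTO wordcount (word, count) VALUES (?, ?)".
    public void write(PreparedStatement stmt) throws SQLException { stmt.setString(1, word); stmt.setInt(2, count); }
    public void readFields(ResultSet rs) throws SQLException { word = rs.getString(1); count = rs.getInt(2); }
  }

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, WordCountRecord, NullWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<WordCountRecord, NullWritable> out, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) { sum += values.next().get(); }
      out.collect(new WordCountRecord(key.toString(), sum), NullWritable.get());
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountToMySQL.class);
    conf.setJobName("wordcount-to-mysql");

    // Placeholder connection settings: adjust host, database, user, password.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:3306/mydb", "user", "password");
    // Sets DBOutputFormat as the output format and maps the record's two
    // fields onto the existing wordcount(word, count) table.
    DBOutputFormat.setOutput(conf, "wordcount", "word", "count");

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(WordCountRecord.class);
    conf.setOutputValueClass(NullWritable.class);

    JobClient.runJob(conf);
  }
}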
Personally, I suggest Sqoop for data imports (from a DB to HDFS) and exports (from HDFS to a DB).
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished with Sqoop, but that is not my task.
Can someone please explain a possible way to do this?
I did attempt it. What I did was copy data from a PostgreSQL server using the JDBC driver and then store it in CSV format in HDFS. Is this the right way to go about it?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run sequential fetches against the RDBMS), and can lead to data inconsistencies (the live RDBMS keeps being updated while this long Sqoop process is always running behind).
Alternative paradigms (push ETL) exist and are maturing. The idea is to build change data capture streams that listen to the RDBMS; an example project is Debezium. You can then build a real-time ETL pipeline that synchronizes the RDBMS with the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table's metadata, and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you control the parallelism yourself for performance), do the following (a combined sketch is shown below):
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it to a file on HDFS.
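A rough combined sketch of those two steps follows; the PostgreSQL connection settings, the customers table and its columns, and the HDFS target path are illustrative assumptions, and the PostgreSQL JDBC driver must be on the classpath.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PgTableToHdfs {
  public static void main(String[] args) throws Exception {
    // Step 2 target: a CSV file on HDFS (settings come from core-site.xml / hdfs-site.xml).
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/data/customers/customers.csv");

    // Step 1: connect to the RDBMS over JDBC and stream the result set out.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT id, name, city FROM customers");
         BufferedWriter out = new BufferedWriter(
             new OutputStreamWriter(fs.create(target, true), StandardCharsets.UTF_8))) {
      while (rs.next()) {
        out.write(rs.getInt(1) + "," + rs.getString(2) + "," + rs.getString(3));
        out.newLine();
      }
    }
  }
}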
Another way of doing this:
Create a MapReduce job that uses DBInputFormat (passing the desired number of input splits) and TextOutputFormat writing to an output directory on HDFS.
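A compact sketch of that approach with the newer mapreduce API follows; the customers table, its columns, the connection settings and the split count are assumptions to adapt.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbToHdfsJob {

  // One row of the source table.
  public static class CustomerRecord implements Writable, DBWritable {
    int id;
    String name;
    public void readFields(ResultSet rs) throws SQLException { id = rs.getInt(1); name = rs.getString(2); }
    public void write(PreparedStatement ps) throws SQLException { ps.setInt(1, id); ps.setString(2, name); }
    public void readFields(DataInput in) throws IOException { id = in.readInt(); name = in.readUTF(); }
    public void write(DataOutput out) throws IOException { out.writeInt(id); out.writeUTF(name); }
  }

  // Turns each row into a line of text that TextOutputFormat writes to HDFS.
  public static class RowMapper extends Mapper<LongWritable, CustomerRecord, Text, NullWritable> {
    protected void map(LongWritable key, CustomerRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(row.id + "," + row.name), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
    conf.setInt("mapred.map.tasks", 4); // requested number of input splits / parallel queries

    Job job = new Job(conf, "db-to-hdfs");
    job.setJarByClass(DbToHdfsJob.class);
    job.setMapperClass(RowMapper.class);
    job.setNumReduceTasks(0); // map-only: rows go straight to the output files

    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, CustomerRecord.class,
        "customers", null, "id", "id", "name"); // table, conditions, orderBy, fields...

    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/customers"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}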
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop- and Spark-related questions.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because you may end up re-implementing exactly what Sqoop already does if you go down the path of building it yourself.
Either way, conceptually, you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls on the source database. The complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper then executes the same query and loads only the portion of the data corresponding to its split. This results in every mapper issuing the same query, along with a sort of the dataset, so that it can pick out its own portion of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats such as Avro, ORC and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases, where only a few columns out of a large set of columns need to be selected), go with Avro.
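As an illustration, here is a small sketch of writing records to an Avro data file on HDFS with the Avro Java library; the schema, the sample record and the target path are made up.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroToHdfs {
  public static void main(String[] args) throws Exception {
    // A made-up record schema with two fields.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");

    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/data/customers/customers.avro"), true);

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, out); // writes the schema into the file header

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 1);
    rec.put("name", "Alice");
    writer.append(rec);

    writer.close(); // flushes and closes the underlying HDFS stream
  }
}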
The way you are trying to do it is not a good one, because you are going to waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind here is Sqoop (SQL-to-Hadoop).
What is the best way to create and populate Parquet files in HDFS using Java without the support of Hive or Impala libraries?
My goal is to write a simple csv record (String) to a Parquet file located in HDFS.
All the questions/answers previously asked are confusing.
It seems like parquet-mr is the way to go. It provides implementations for Thrift and Avro. Your own implementation should be based on ParquetOutputFormat and might look similar to AvroParquetOutputFormat and AvroWriteSupport, which does the actual conversion.
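As a rough sketch of that route, the parquet-avro module can turn a CSV record (a String) into an Avro GenericRecord and write it to a Parquet file on HDFS through AvroParquetWriter, which uses AvroWriteSupport underneath. The schema and path below are assumptions, and older parquet-mr releases use the parquet.* package prefix instead of org.apache.parquet.*.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class CsvRecordToParquet {
  public static void main(String[] args) throws Exception {
    // A made-up schema matching a two-column CSV record.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Line\",\"fields\":["
      + "{\"name\":\"col1\",\"type\":\"string\"},{\"name\":\"col2\",\"type\":\"string\"}]}");

    // The Configuration should carry your HDFS settings (fs.defaultFS etc.).
    ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/data/out/records.parquet"))
        .withConf(new Configuration())
        .withSchema(schema)
        .build();

    // Turn one CSV record (a String) into an Avro record and write it.
    String csv = "value1,value2";
    String[] parts = csv.split(",");
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("col1", parts[0]);
    rec.put("col2", parts[1]);
    writer.write(rec);

    writer.close();
  }
}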
My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.
My problem, however, is that I don't understand the problem they are both solving in the first place!
Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data.
I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?
What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?
Apache Pig and Apache Hive load data from HDFS, unless you run them in local mode, in which case they load it from the local file system. How do they get the data from a DB? They don't. You need another framework, such as Sqoop, to export the data from your traditional DB into HDFS.
Once you have the data in your HDFS, you can start working with Pig and Hive. They never query a DB. In Apache Pig, for example, you could load your data using a Pig loader:
A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');
As for Hive, you need to create a table and then load the data into the table:
LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
Again, the data must be in the HDFS.
As to how it works, it depends. Traditionally it has always worked with the MapReduce execution engine. Both Hive and Pig parse the statements you write in HiveQL or Pig Latin and translate them into an execution plan consisting of some number of MapReduce jobs, depending on the plan. However, they can now also translate the plan into Tez, a newer execution engine which is perhaps still too new to work reliably.
Why the need for Pig or Hive? Well, you don't really need these frameworks. Everything they can do, you could also do by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is a single line of code in Pig or Hive.
I don't think you can query any data with Hive/Pig without actually adding it to them first. So first you need to add the data; it can come from any place, and you either give the path from which the data should be picked up or add it directly. Once the data is in place, queries fetch data only from those tables.
Underneath, they use MapReduce as the tool that does the processing. If you just have ad hoc data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive is mostly on the SQL front, so you get querying features similar to SQL, and at the backend MapReduce does the job. Hope this info helps.
I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow, stored on HDFS, as a directed acyclic graph. These are their main goals, and we don't care about other uses. Here I want to distinguish between:
Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year?
Managing a data flow (the main purpose of Pig), which means taking your data from its initial state to a different final state through transformations, for example: data in location A, filtered by criterion c, joined with data in location B, stored in location C.
Smeeb, Pig and Hive do the same thing: they process data that comes in files or whatever format.
Here, if you want to process data that lives in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + Hadoop).
Hive uses HQL (which is SQL-like) for processing; Pig uses a data-flow style with the help of Pig Latin.
Hive stores all input data in table format, so before loading data into Hive, first create a Hive table; that structure (the metadata) will be stored in an RDBMS such as MySQL. Then load it with LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
Hello, I'm trying to run MapReduce jobs on Git repositories. I wanted to use a map job to first clone all repositories to HDFS concurrently and then run further MapReduce jobs on the files. The problem is that I'm not sure how to write the repository files to HDFS. I have seen examples that write individual files, but those were outside the mapper and only wrote single files. The JGit API only exposes a FileRepository structure, which inherits from File, but HDFS uses Paths written through DataOutputStreams. Is there a good way to convert between the two, or are there any examples that do something similar?
Thanks
The input data to a Hadoop mapper must be on HDFS, not on your local machine or anywhere other than HDFS. MapReduce jobs are not meant for migrating data from one place to another; they are used to process huge volumes of data already present on HDFS. I am sure your repository data is not in HDFS, and if it were, you would not have needed to perform any operation in the first place. So please keep in mind that MapReduce jobs are used for processing large volumes of data already present on HDFS (the Hadoop file system).
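That said, if the missing piece is only getting a locally cloned repository into HDFS, the standard FileSystem API can copy a whole directory in one call. A minimal sketch with placeholder paths (the clone itself could be done beforehand with JGit's CloneCommand):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RepoToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Recursively copies the local clone (an ordinary directory of files) into HDFS.
    fs.copyFromLocalFile(new Path("/tmp/cloned-repo"),
        new Path("/user/me/repos/cloned-repo"));

    fs.close();
  }
}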
I'm looking for tools/libraries which allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml, etc. In Hadoop Hue, via Beeswax for Hive, we can create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries which allow doing such things fast and easily; I want to avoid coding it manually, from reading the file to inserting into the DB via JDBC.
You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built-in COPY command. COPY is fast and simple. The source file has to reside on the same machine as the database server for that. If not, you can use the very similar \copy meta-command of psql.
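If the file sits on the client machine and you still want to avoid hand-written INSERTs, the PostgreSQL JDBC driver exposes COPY FROM STDIN through its CopyManager, which is roughly what \copy does. A minimal sketch, with placeholder table name, file path and connection details:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvCopyLoader {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "user", "password");
         FileReader reader = new FileReader("/path/to/data.csv")) {
      // Direct cast works when the connection is not wrapped by a pool.
      CopyManager copy = ((PGConnection) conn).getCopyAPI();
      long rows = copy.copyIn("COPY my_table FROM STDIN WITH CSV HEADER", reader);
      System.out.println("Loaded " + rows + " rows");
    }
  }
}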
For .xml files (or any format, really) you can use the built-in pg_read_file() inside a plpgsql function. However, I quote:
Only files within the database cluster directory and the log_directory can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
There is a kick-start on parsing XML in this blog post by Scott Bailey.