Unable to connect Hive with MongoDB using mongo-hadoop connector - java

I am trying to install and configure Hive with mongo-hadoop-core 2.0.2 for the first time. I have installed Hadoop 2.8.0, Hive 2.1.1 and MongoDB 3.4.6, and everything works fine when run individually.
My problem is that I am not able to connect MongoDB with Hive. I am using the mongo-hadoop connector for this, as described here: https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage
The required JARs are added to the Hadoop and Hive lib directories. I have even added them in hive.sh and at runtime from the Hive console.
I am getting an error while executing the CREATE TABLE query.
My query is:
CREATE EXTERNAL TABLE testHive
(
id STRING,
name STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/hiveDb.testHive');
And I get the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/hadoop/io/BSONWritable
hive> ERROR hive.ql.exec.DDLTask - java.lang.NoClassDefFoundError: com/mongodb/hadoop/io/BSONWritable
at com.mongodb.hadoop.hive.BSONSerDe.initialize(BSONSerDe.java:132)
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:537)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:424)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:411)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:279)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:261)
This says that the class com/mongodb/hadoop/io/BSONWritable is not on the classpath, but I have added the required JAR (mongo-hadoop-core.jar), and the class is present in that JAR (see the quick check below).
The JAR versions I am using:
mongo-hadoop-core 2.0.2,
mongo-hadoop-hive 2.0.2,
mongo-java-driver 3.0.2
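For what it's worth, one way to double-check that the class really is inside the JAR (a quick sanity check, assuming the JDK's jar tool is on your PATH):
jar tf mongo-hadoop-core-2.0.2.jar | grep BSONWritable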
Thanks

You need to register the JARs explicitly. In your Hive script, use ADD JAR commands to include these JARs (core, hive, and the Java driver), e.g., ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;.
If you are running from the Hive shell, use it like this:
hive> ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;
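With the versions from the question, the full set of commands would look something like this (the paths are placeholders; point them at wherever the JARs live on your system):
hive> ADD JAR /path-to/mongo-hadoop-core-2.0.2.jar;
hive> ADD JAR /path-to/mongo-hadoop-hive-2.0.2.jar;
hive> ADD JAR /path-to/mongo-java-driver-3.0.2.jar;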
Then execute your query.

Related

Hive 3.1.2 UDAFs not working in Spark 3.0.0

pyspark.sql.utils.AnalysisException: No handler for UDF/UDAF/UDTF 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFHistogramNumeric': java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.udf.generic.SimpleGenericUDAFParameterInfo.<init>([Lorg.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;, boolean, boolean); line 4 pos 29
I get the above error when I try to use histogram_numeric from Hive in Spark SQL.
I've included the relevant hive-exec jar, enabled hive support and I'm starting to wonder if this isn't supported at the moment.
Hive version: 3.1.2
Spark version: 3.0.0
If someone has a simple snippet that works for them when registering Hive UDAFs in Spark 3.0.0, that would be super useful too.
I tried to register the Hive UDAF via hiveCtx.udf.registerJavaUDAF, but no luck.
hiveCtx.udf.registerJavaUDAF("histogram_numeric", "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFHistogramNumeric")
The Hive class which implements "histogram_numeric" was there, but it doesn't conform to Spark's Java UDAF interface.
But I found that the same call works through the DataFrame's selectExpr. I don't know why.
users_spark_df.selectExpr('histogram_numeric(age, 2)')
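For reference, a minimal sketch of the selectExpr route (assuming a Hive-enabled SparkSession; the toy DataFrame here is made up for illustration):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("histogram-numeric-demo")
         .enableHiveSupport()  # make Hive built-ins resolvable
         .getOrCreate())

# A stand-in for users_spark_df
users_spark_df = spark.createDataFrame([(23,), (31,), (35,), (47,)], ["age"])

# Invoke the Hive UDAF through SQL expression syntax
users_spark_df.selectExpr("histogram_numeric(age, 2)").show(truncate=False)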
See also: Making histogram with Spark DataFrame column

How to import mongo data in hive?

I am facing an issue: when I try to import Mongo data into Hive using the command below, it gives me an error.
CREATE EXTERNAL TABLE gok
(
id STRING,
name STRING,
state STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","state":"state"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/gokul_test.play_test');
Note:
The versions of the tools used are below:
Java JDK 8
Hadoop: 2.8.4
Hive: 2.3.3
MongoDB: 4.2
The following JARs have been moved to HADOOP_HOME/lib and HIVE_HOME/lib:
mongo-hadoop-core-2.0.2.jar
mongo-hadoop-hive-2.0.2.jar
mongo-java-driver-2.13.2.jar
The error is:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
I have tried manually adding the JARs in Hive; the error I received then is below.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/hadoop/hive/BSONSerDe
The two errors are different.
Let me know if you know of any resolution or need more details.
You should add the jars to your hive session.
Which hive client are you using?
If you are using "beeline", you can add the full path of the JARs before trying to create the table. First connect:
beeline
!connect jdbc:hive2://localhost:10000 "" ""
As soon as your session is created, add the JARs using "add jar" and the full path of each jar file:
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongo-hadoop-hive-1.5.0-SNAPSHOT.jar;
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongo-hadoop-core-1.5.0-SNAPSHOT.jar;
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongodb-driver-3.0.4.jar;
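To confirm the JARs were registered in the current session, you can list them (a quick sanity check):
list jars;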
The next step is to drop/create the table:
DROP TABLE IF EXISTS bars;
CREATE EXTERNAL TABLE bars
(
objectid STRING,
Symbol STRING,
TS STRING,
Day INT,
Open DOUBLE,
High DOUBLE,
Low DOUBLE,
Close DOUBLE,
Volume INT
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"objectid":"_id",
"Symbol":"Symbol", "TS":"Timestamp", "Day":"Day", "Open":"Open", "High":"High", "Low":"Low", "Close":"Close", "Volume":"Volume"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/marketdata.minibars');
source: https://community.cloudera.com/t5/Support-Questions/Mongodb-with-hive-Error-return-code-1-from-org-apache-hadoop/td-p/138161
It looks like mongo-hadoop-hive-<version>.jar has not been correctly added to Hive.
Try adding the MongoDB JAR using the command below:
ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;
More info: https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage
Alternatively, you could also try to ingest the MongoDB BSON data into Hive in Avro format and then build tables in Hive. It's a long process, but it will get your job done. You will need to build a new connector for reading from Mongo and converting the data to Avro format.

How to Connect Teradata using Pyspark

I am trying to connect to a Teradata server through PySpark.
My CLI code is below:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Teradata connect") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .options(url="jdbc:teradata://xy/",
             driver="com.teradata.jdbc.TeraDriver",
             dbtable="dbname.tablename",
             user="user1", password="***") \
    .load()
This gives the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o159.load.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
To resolve this, I think I need to add the JARs terajdbc4.jar and tdgssconfig.jar.
In Scala, to add a JAR we can use
sc.addJar("<path>/jar-name.jar")
If I use the same in PySpark, I get an error:
AttributeError: 'SparkContext' object has no attribute 'addJar'.
or
AttributeError: 'SparkSession' object has no attribute 'addJar'
How can I add jar terajdbc4.jar and tdgssconfig.jar?
Try following this post, which explains how to add JDBC drivers to PySpark:
How to add jdbc drivers to classpath when using PySpark?
The above example is for postgres and docker, but the answer should work for your scenario.
Note, you are correct about the driver files. Most JDBC drivers are in a single file, but Teradata splits it out into two parts. I think one is the actual driver and the other (tdgss) has security stuff in it. Both files must be added to the classpath for it to work.
Alternatively, simply google "how to add jdbc drivers to pyspark".
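As a concrete sketch (the paths are placeholders), both Teradata JARs can be passed either on the command line when launching PySpark:
pyspark --jars /path/to/terajdbc4.jar,/path/to/tdgssconfig.jar
or, equivalently, via spark.jars when building the session:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Teradata connect") \
    .config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig.jar") \
    .getOrCreate()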

MySQL to PostgreSQL migration: mysql connector

I am trying to migrate from MySQL to PostgreSQL and I have a Java-related problem that I am not able to fix. Full disclosure: I know little or nothing about Java, but the migration uses a Java-based script, so for me it becomes a configuration problem.
Short version of the problem:
The migration tool throws this exception:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
mysql-connector-java-5.0.8-bin.jar is already in the "JAVA_HOME\jre\lib\ext" directory, and I don't know how to solve this dependency problem.
Long version of the problem:
I was trying to migrate from MySQL to PostgreSQL. I checked the official PostgreSQL documentation and I chose the free tool from EnterpriseDB (which can be downloaded here) to start the migration.
From the installation readme, they tell you that the mysql connector is not installed by default, but they also tell you the steps to solve this problem:
To enable MySQL connectivity, download MySQL's freely available JDBC driver from:
http://www.enterprisedb.com/downloads/third-party-jdbc-drivers
Place the mysql-connector-java-5.0.8-bin.jar file in the "JAVA_HOME\jre\lib\ext" directory (in my case: "C:\Program Files\Java\jre1.8.0_60\lib\ext\mysql-connector-java-5.0.8-bin.jar").
After configuring the tool properly and executing the .bat, this is the error I get:
Connecting with source MySQL database server...
MTK-11009: Error Connecting Database "MySQL Server"
DB-null: java.sql.SQLException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
Stack Trace:
com.edb.MTKException: MTK-11009: Error Connecting Database "MySQL Server"
at com.edb.dbhandler.mysql.MySQLConnection.<init>(MySQLConnection.java:48)
at com.edb.common.MTKFactory.createMTKConnection(MTKFactory.java:250)
at com.edb.MigrationToolkit.createNewSourceConnection(MigrationToolkit.java:5982)
at com.edb.MigrationToolkit.initToolkit(MigrationToolkit.java:3346)
at com.edb.MigrationToolkit.main(MigrationToolkit.java:1700)
Caused by: java.sql.SQLException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at com.edb.Utility.processException(Utility.java:327)
at com.edb.dbhandler.mysql.MySQLConnection.<init>(MySQLConnection.java:47)
... 4 more
...which, to my understanding, probably means that mysql-connector-java-5.0.8-bin.jar is not found.
All the links I've found online regarding the error are specific for Eclipse or other IDEs, so I have not yet been able to solve this dependency problem.
SOLUTION
With the help of a friend who knows Java well, this is the solution he arrived at:
To start looking for the problem, we opened the runMTK.bat. The execution line reads:
cscript //nologo "..\etc\sysconfig\runJavaApplication.vbs" "..\etc\sysconfig\edbmtk-49.config" "-Dprop=..\etc\toolkit.properties -classpath -jar edb-migrationtoolkit.jar %*"
Then we opened runJavaApplication.vbs and, to find out which JAVA_EXECUTABLE_PATH the program was using, we added this line to the script:
Wscript.Echo "JAVA_EXECUTABLE_PATH = " & JAVA_EXECUTABLE_PATH
With that info, we discovered that the script was using the Java folder under C:\Program Files (x86), instead of the one under C:\Program Files (where I had dropped the MySQL JAR). So we copied mysql-connector-java-5.0.8-bin.jar into the \ext folder of the x86 installation, and now the script works.
Word of advice: the script is throwing errors in half of the exported tables, so all the hassle may not be worth it. BUT if anyone is interested in making this migration script work from A to Z (which has been quite a challenge), here are the details:
HOW TO
Free tool (from EnterpriseDB):
http://www.enterprisedb.com/downloads/postgres-postgresql-downloads
Extract the files from the zip and run the installer (ppasmeta-9.5.0.5-windows-x64.exe) as administrator.
To enable MySQL connectivity, download MySQL's freely available JDBC driver from:
http://www.enterprisedb.com/downloads/third-party-jdbc-drivers
Place the mysql-connector-java-5.0.8-bin.jar file in the "JAVA_HOME\jre\lib\ext" directory (in my case: "C:\Program Files\Java\jre1.8.0_60\lib\ext\mysql-connector-java-5.0.8-bin.jar").
The Migration Toolkit documentation can be found:
here (online doc): https://www.enterprisedb.com/docs/en/9.4/migrate/toc.html
or here (pdf doc): http://get.enterprisedb.com/docs/Postgres_Plus_Migration_Guide_v9.5.pdf
First: modify C:\Program Files\PostgresPlus\edbmtk\etc\toolkit.properties (Info here):
SRC_DB_URL=jdbc:mysql://SOURCE-HOST-NAME/SOURCE-DB-NAME
SRC_DB_USER=********
SRC_DB_PASSWORD=********
TARGET_DB_URL=jdbc:edb://localhost:5444/DESTINATION-DB-NAME
TARGET_DB_USER=enterprisedb
TARGET_DB_PASSWORD=********
Then: execute C:\Program Files\PostgresPlus\edbmtk\bin\runMTK.bat (Info here).
runMTK.bat -sourcedbtype mysql -targetdbtype enterprisedb -allTables YOUR_DB_SCHEMA
// ...or with a limited subset of tables:
runMTK.bat -sourcedbtype mysql -targetdbtype enterprisedb -tables TABLE1,TABLE2,TABLE3 YOUR_DB_SCHEMA
In order to get this subset of tables from MySQL:
SELECT
GROUP_CONCAT(TABLE_NAME)
FROM
information_schema.tables
WHERE
TABLE_SCHEMA = 'your_db_name'

Cannot validate serde : org.openx.data.jsonserde.jsonserde

I have written this query to create a table in Hive. My data is initially in JSON format, so I have downloaded and built the SerDe and added all the JARs required for it to run. But I am getting the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerDe
QUERY:
create table tip(type string,
text string,
business_id string,
user_id string,
date date,
likes int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("date.mapping"="date")
STORED AS TEXTFILE;
I too encountered this problem. In my case, I managed to fix it by adding json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar at the Hive command prompt, as shown below:
hive> ADD JAR /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
Below are the steps I followed on Ubuntu 14.04:
1. Fire up a Linux terminal and cd /usr/local
2. sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git
3. sudo mvn -Pcdh5 clean package
4. The SerDe jar will be in /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar
5. Go to the hive prompt and ADD JAR the file as shown in step 6.
6. hive> ADD JAR /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
7. Now create the hive table from the hive> prompt. At this stage, the Hive table should be created successfully without any error.
Hive Version: 1.2.1
Hadoop Version: 2.7.1
Reference: Hive-JSON-Serde
You have to build the cloned project using Maven:
mvn install
in the directory /path/directory/Hive-JSON-Serde (here it is /usr/local/Hive-JSON-Serde).
