How to import mongo data in hive? - java

I am facing an issue: when I try to import MongoDB data into Hive using the command below, I get an error.
CREATE EXTERNAL TABLE gok
(
id STRING,
name STRING,
state STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","state":"state"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/gokul_test.play_test');
Note:
The versions of the tools used are below:
Java JDK 8
Hadoop: 2.8.4
Hive: 2.3.3
MongoDB: 4.2
The following jars have been copied to HADOOP_HOME/lib and HIVE_HOME/lib:
mongo-hadoop-core-2.0.2.jar
mongo-hadoop-hive-2.0.2.jar
mongo-java-driver-2.13.2.jar
The error is:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
I also tried adding the jars manually in Hive; the error I received then is below.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/hadoop/hive/BSONSerDe
The two errors are different.
Let me know if you know of a resolution or need more details.

You should add the jars to your Hive session.
Which Hive client are you using?
If you are using Beeline, you can add the full path of the jars before trying to create the table:
beeline !connect jdbc:hive2://localhost:10000 "" ""
As soon as your session is created, you must add the jars, using ADD JAR and the full path of each jar file:
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongo-hadoop-hive-1.5.0-SNAPSHOT.jar;
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongo-hadoop-core-1.5.0-SNAPSHOT.jar;
add jar hdfs://sandbox.hortonworks.com:8020/tmp/udfs/mongodb-driver-3.0.4.jar;
The next step is to drop and re-create the table:
DROP TABLE IF EXISTS bars;
CREATE EXTERNAL TABLE bars
(
objectid STRING,
Symbol STRING,
TS STRING,
Day INT,
Open DOUBLE,
High DOUBLE,
Low DOUBLE,
Close DOUBLE,
Volume INT
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"objectid":"_id",
"Symbol":"Symbol", "TS":"Timestamp", "Day":"Day", "Open":"Open", "High":"High", "Low":"Low", "Close":"Close", "Volume":"Volume"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/marketdata.minibars');
source: https://community.cloudera.com/t5/Support-Questions/Mongodb-with-hive-Error-return-code-1-from-org-apache-hadoop/td-p/138161

It looks like mongo-hadoop-hive-<version>.jar has not been correctly added to the Hive classpath.
Try adding the MongoDB jar using the command below:
ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;
More info: https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage
Alternatively, you could ingest the MongoDB BSON data into Hive in Avro format and then build Hive tables on top of it. It is a longer process, but it will get the job done. You would need to build a small connector that reads from MongoDB and converts the documents to Avro, as sketched below.
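For what it's worth, here is a minimal sketch of such a converter in Python, assuming the pymongo and fastavro packages are available; the Avro schema mirrors the columns in the question, and the output path is an illustrative placeholder rather than anything the mongo-hadoop project prescribes:
# mongo_to_avro.py - minimal sketch: dump a MongoDB collection to an Avro file
# (assumes pymongo and fastavro are installed; db/collection names come from the question)
from pymongo import MongoClient
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "play_test",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "name", "type": ["null", "string"], "default": None},
        {"name": "state", "type": ["null", "string"], "default": None},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

client = MongoClient("mongodb://localhost:27017")
docs = client["gokul_test"]["play_test"].find()

# Flatten each BSON document into a dict that matches the Avro schema.
records = ({"id": str(d["_id"]),
            "name": d.get("name"),
            "state": d.get("state"),
            "email": d.get("email")} for d in docs)

with open("play_test.avro", "wb") as out:
    writer(out, schema, records)
The resulting play_test.avro file can then be copied into HDFS and exposed through a Hive table declared STORED AS AVRO.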

How to deploy compiled PL/Java code straight to Postgres database from application?

I have set up a Postgres database server with PL/Java binary installed on it.
The usual process for getting example PL/Java code installed and running on the database starts with moving the compiled .jar file from the application server to the database server via a file transfer, and then calling sqlj.install_jar('file::<path>', 'name', true); to load the .jar into the database server.
I am looking for a different way to load compiled PL/Java code without resorting to the file-transfer method described above. Looking through the PL/Java documentation, I see that the sqlj.install_jar function also supports pulling a .jar from the web. Theoretically, I could have the application server briefly spin up an HTTP file server to serve the .jar file and invoke sqlj.install_jar to pull the .jar from that ad-hoc web server. However, this may be difficult if the hostname of the application server is not known (i.e. it is not localhost, or it sits behind a firewall/private network).
I am wondering, however, if there is a better way: a way that lets the application server push the implementation inside the .jar directly over the existing connection to the Postgres server, without resorting to the "hacks" described above.
Does something like this already exist in PL/Java?
If you do this in psql:
\df sqlj.install_jar
you will see there are two versions of the function:
Schema | Name | Result data type | Argument data types | Type
--------+-------------+------------------+------------------------------------------------------------------------+------
sqlj | install_jar | void | image bytea, jarname character varying, deploy boolean | func
sqlj | install_jar | void | urlstring character varying, jarname character varying, deploy boolean | func
The one that takes a urlstring is the one that is specified by the SQL/JRT standard. The one that takes a bytea is a PL/Java nonstandard extension, but it can be useful for this case. If the jar file is available on the client machine you are running psql on, you can do:
postgres=# \lo_import foo.jar
lo_import 16725
postgres=# select sqlj.install_jar(lo_get(16725), 'foo', true);
 install_jar
-------------

(1 row)

postgres=# \lo_unlink 16725
lo_unlink 16725
That is, you can use psql's \lo_import command, which opens a local file and saves it as a "large object" on the server, and gives you an Oid number to refer to it (the 16725 in my example might be a different number for you).
Once the large object is there, the SQL function lo_get(16725) returns its contents as a bytea, so you can pass it to the bytea flavor of install_jar. Once that's done, you just use \lo_unlink to remove the large object from the server.
If you are using JDBC or some other programmatic API to connect to the server, you can simply bind the contents of your local jar file as the first parameter of select sqlj.install_jar(?::bytea,?,?); as in the sketch below.
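As an illustration, here is a minimal sketch of the same idea from Python with psycopg2 rather than JDBC (psycopg2 uses %s placeholders instead of ?; the connection string and jar name are placeholders):
# push_jar.py - minimal sketch: push a local jar into PL/Java over an existing
# connection, using the bytea flavour of sqlj.install_jar
# (assumes psycopg2; DSN and file name are examples)
import psycopg2

with psycopg2.connect("host=localhost dbname=postgres user=postgres") as conn:
    with conn.cursor() as cur:
        with open("foo.jar", "rb") as f:
            jar_bytes = f.read()
        # psycopg2 sends Python bytes as bytea, which selects the nonstandard overload
        cur.execute("select sqlj.install_jar(%s::bytea, %s, %s)",
                    (jar_bytes, "foo", True))
# leaving the with-block commits the transaction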
Unfortunately, the short answer is no. PL/Java is not packaged with the control files and .sql files needed to deploy it as a Postgres extension, even though its official site describes it as an extension for PG.
To have an extension installed the Postgres way, you need a control file for it; since PL/Java is written in Java and ships without control files, it has to be built in its own way. Normally PGXS is what helps compile extensions for Postgres, with the help of pg_config.
Note: their site clearly states that "PL/Java can be downloaded, then built using Maven".
Just wanted to share a little of what I'm aware of :). Hope it helps.

How to Connect Teradata using Pyspark

I am trying to connect to a Teradata server through PySpark.
My CLI code is as below:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Teradata connect") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .options(url="jdbc:teradata://xy/",
             driver="com.teradata.jdbc.TeraDriver",
             dbtable="dbname.tablename",
             user="user1", password="***") \
    .load()
This gives the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o159.load.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
To resolve this, I think I need to add the jars terajdbc4.jar and tdgssconfig.jar.
In Scala, to add a jar we can use
sc.addJar("<path>/jar-name.jar")
If I try the same in PySpark, I get the error
AttributeError: 'SparkContext' object has no attribute 'addJar'.
or
AttributeError: 'SparkSession' object has no attribute 'addJar'
How can I add the jars terajdbc4.jar and tdgssconfig.jar?
Try following this post, which explains how to add JDBC drivers to PySpark:
How to add jdbc drivers to classpath when using PySpark?
The example there is for Postgres and Docker, but the answer should work for your scenario.
Note that you are correct about the driver files. Most JDBC drivers ship as a single jar, but Teradata splits its driver into two parts: one is the actual driver and the other (tdgss) contains the security pieces. Both files must be added to the classpath for it to work; see the sketch below.
Alternatively, simply google "how to add jdbc drivers to pyspark".
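For completeness, a minimal sketch of what that can look like in PySpark, assuming the two jars sit somewhere on the driver machine; the jar paths, JDBC URL, table and credentials below are placeholders:
# teradata_read.py - minimal sketch: put the Teradata JDBC jars on the Spark
# classpath via spark.jars before the SparkSession is created
# (all paths, the JDBC URL and the credentials are placeholders)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Teradata connect")
         .config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig.jar")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://xy/")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "dbname.tablename")
      .option("user", "user1")
      .option("password", "***")
      .load())
The same jars can also be passed on the command line with spark-submit --jars /path/to/terajdbc4.jar,/path/to/tdgssconfig.jar. If the driver class is still not found, spark.driver.extraClassPath (and spark.executor.extraClassPath) is the other commonly used route; either way, the jars must be in place before the SparkSession is created.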

Unable to connect Hive with MongoDB using mongo-hadoop connector

I am trying to install and configure Hive with mongo-hadoop-core 2.0.2 for the first time. I have installed Hadoop 2.8.0, Hive 2.1.1, and MongoDB 3.4.6, and everything works fine when run individually.
My problem is that I am not able to connect MongoDB with Hive. I am using the mongo-hadoop connector for this, as described at https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage
The required jars are added to the Hadoop and Hive lib directories; I have even added them in hive.sh and at runtime from the Hive console.
I am getting an error while executing the CREATE TABLE query.
My query is:
CREATE EXTERNAL TABLE testHive
(
id STRING,
name STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/hiveDb.testHive');
I get the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/hadoop/io/BSONWritable
hive> ERROR hive.ql.exec.DDLTask - java.lang.NoClassDefFoundError: com/mongodb/hadoop/io/BSONWritable
at com.mongodb.hadoop.hive.BSONSerDe.initialize(BSONSerDe.java:132)
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:537)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:424)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:411)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:279)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:261)
This shows that the com/mongodb/hadoop/io/BSONWritable class is not on the classpath, but I have added the required jar (mongo-hadoop-core.jar) and the class is present in that jar.
The versions of the jars I am using:
mongo-hadoop-core 2.0.2,
mongo-hadoop-hive 2.0.2,
mongo-java-driver 3.0.2
Thanks
You need to register the jars explicitly. In your Hive script, use ADD JAR commands to include these jars (core, hive, and the Java driver), e.g., ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;.
If you are running from the Hive shell, use it like this:
hive> ADD JAR /path-to/mongo-hadoop-hive-<version>.jar;
Then execute your query.

MySQL to PostgreSQL migration: mysql connector

I am trying to migrate from MySQL to PostgreSQL and I have a Java-related problem that I am not able to fix. Full disclosure: I know little or nothing about Java, but the migration uses a Java-based script, so for me it becomes a configuration problem.
Short version of the problem:
The migration tool throws this exception:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
mysql-connector-java-5.0.8-bin.jar is already in the "JAVA_HOME\jre\lib\ext" directory, and I don't know how to solve this dependency problem.
Long version of the problem:
I was trying to migrate from MySQL to PostgreSQL. I checked the official PostgreSQL documentation and chose the free tool from EnterpriseDB (which can be downloaded here) to start the migration.
The installation readme tells you that the MySQL connector is not installed by default, but it also gives the steps to solve this problem:
To enable MySQL connectivity, download MySQL's freely available JDBC driver from:
http://www.enterprisedb.com/downloads/third-party-jdbc-drivers
Place the mysql-connector-java-5.0.8-bin.jar file in the "JAVA_HOME\jre\lib\ext" directory (in my case: "C:\Program Files\Java\jre1.8.0_60\lib\ext\mysql-connector-java-5.0.8-bin.jar").
After configuring the tool properly and executing the .bat, this is the error I get:
Connecting with source MySQL database server...
MTK-11009: Error Connecting Database "MySQL Server"
DB-null: java.sql.SQLException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
Stack Trace:
com.edb.MTKException: MTK-11009: Error Connecting Database "MySQL Server"
at com.edb.dbhandler.mysql.MySQLConnection.<init>(MySQLConnection.java:48)
at com.edb.common.MTKFactory.createMTKConnection(MTKFactory.java:250)
at com.edb.MigrationToolkit.createNewSourceConnection(MigrationToolkit.java:5982)
at com.edb.MigrationToolkit.initToolkit(MigrationToolkit.java:3346)
at com.edb.MigrationToolkit.main(MigrationToolkit.java:1700)
Caused by: java.sql.SQLException: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at com.edb.Utility.processException(Utility.java:327)
at com.edb.dbhandler.mysql.MySQLConnection.<init>(MySQLConnection.java:47)
... 4 more
...which, to my understanding, probably means that mysql-connector-java-5.0.8-bin.jar is not found.
All the links I've found online regarding the error are specific to Eclipse or other IDEs, so I have not yet been able to solve this dependency problem.
SOLUTION
With the help of a friend who knows Java well, this is the solution he arrived at:
To start looking for the problem, we opened runMTK.bat. The execution line reads:
cscript //nologo "..\etc\sysconfig\runJavaApplication.vbs" "..\etc\sysconfig\edbmtk-49.config" "-Dprop=..\etc\toolkit.properties -classpath -jar edb-migrationtoolkit.jar %*"
Then we opened runJavaApplication.vbs and, to find out which JAVA_EXECUTABLE_PATH the program was using, we added this line to the script:
Wscript.Echo "JAVA_EXECUTABLE_PATH = " & JAVA_EXECUTABLE_PATH
With that info, we discovered that the script was using the Java folder under C:\Program Files (x86) instead of the one under C:\Program Files (where I had dropped the MySQL jar). So we copied mysql-connector-java-5.0.8-bin.jar into the \ext folder of the x86 Java installation, and now the script works.
Word of advice: the script throws errors on half of the exported tables, so all the hassle may not be worth it. BUT if anyone is interested in making this migration script work from A to Z (which has been quite a challenge), here are the details:
HOW TO
Free tool (from EnterpriseDB):
http://www.enterprisedb.com/downloads/postgres-postgresql-downloads
Extract the files from the zip and run the installer (ppasmeta-9.5.0.5-windows-x64.exe) as administrator.
To enable MySQL connectivity, download MySQL's freely available JDBC driver from:
http://www.enterprisedb.com/downloads/third-party-jdbc-drivers
Place the mysql-connector-java-5.0.8-bin.jar file in the "JAVA_HOME\jre\lib\ext" directory (in my case: "C:\Program Files\Java\jre1.8.0_60\lib\ext\mysql-connector-java-5.0.8-bin.jar").
The Migration Toolkit documentation can be found:
here (online doc): https://www.enterprisedb.com/docs/en/9.4/migrate/toc.html
or here (pdf doc): http://get.enterprisedb.com/docs/Postgres_Plus_Migration_Guide_v9.5.pdf
First: modify C:\Program Files\PostgresPlus\edbmtk\etc\toolkit.properties (Info here):
SRC_DB_URL=jdbc:mysql://SOURCE-HOST-NAME/SOURCE-DB-NAME
SRC_DB_USER=********
SRC_DB_PASSWORD=********
TARGET_DB_URL=jdbc:edb://localhost:5444/DESTINATION-DB-NAME
TARGET_DB_USER=enterprisedb
TARGET_DB_PASSWORD=********
Then: execute C:\Program Files\PostgresPlus\edbmtk\bin\runMTK.bat (Info here).
runMTK.bat -sourcedbtype mysql -targetdbtype enterprisedb -allTables YOUR_DB_SCHEMA
// ...or with a limited subset of tables:
runMTK.bat -sourcedbtype mysql -targetdbtype enterprisedb -tables TABLE1,TABLE2,TABLE3 YOUR_DB_SCHEMA
In order to get this subset of tables from MySQL:
SELECT
GROUP_CONCAT(TABLE_NAME)
FROM
information_schema.tables
WHERE
TABLE_SCHEMA = 'your_db_name'

Cannot validate serde : org.openx.data.jsonserde.jsonserde

I have written this query to create a table in Hive. My data is initially in JSON format, so I downloaded and built the SerDe and added all the jars required for it to run, but I am getting the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerDe
QUERY:
create table tip(type string,
text string,
business_id string,
user_id string,
date date,
likes int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("date.mapping"="date")
STORED AS TEXTFILE;
I encountered this problem too. In my case, I managed to fix it by adding json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar at the Hive command prompt, as shown below:
hive> ADD JAR /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
Below are the steps I followed on Ubuntu 14.04:
1. Fire up a Linux terminal and cd /usr/local
2. sudo git clone https://github.com/rcongiu/Hive-JSON-Serde.git
3. sudo mvn -Pcdh5 clean package
4. The SerDe jar will be in /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar
5. Go to the Hive prompt and ADD JAR the file as shown in step 6.
6. hive> ADD JAR /usr/local/Hive-JSON-Serde/json-serde/target/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
7. Now create the Hive table from the hive> prompt. At this stage, the table should be created successfully without any error.
Hive Version: 1.2.1
Hadoop Version: 2.7.1
Reference: Hive-JSON-Serde
You have to build the cloned project using Maven:
mvn install
in the directory /path/directory/Hive-JSON-Serde (in the steps above, that is /usr/local/Hive-JSON-Serde).
