I need to write my Spark dataset to an Oracle database table. I am using the dataset write method with append mode, but I am getting an AnalysisException when the Spark job is triggered on the cluster with the spark2-submit command.
I have read the JSON file, flattened it and loaded it into a dataset named abcDataset.
Spark Version - 2
Oracle Database
JDBC Driver - oracle.jdbc.driver.OracleDriver
Programming Language - Java
Dataset<Row> abcDataset = dataframe.select(col("abc"), ...);  // and the other columns
Properties dbProperties = new Properties();
InputStream is = SparkReader.class.getClassLoader().getResourceAsStream("dbProperties.yaml");
dbProperties.load(is);
String jdbcUrl = dbProperties.getProperty("jdbcUrl");
dbProperties.put("driver","oracle.jdbc.driver.OracleDriver");
String where = "USER123.PERSON";
abcDataset.write().format("org.apache.spark.sql.execution.datasources.jdbc.DefaultSource").option("driver", "oracle.jdbc.driver.OracleDriver").mode("append").jdbc(jdbcUrl, where, dbProperties);
Expected - the data to be written into the database, but I am getting the error below:
org.apache.spark.sql.AnalysisException: Multiple sources found for jdbc (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider, org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
Do we need to set any additional property in the spark-submit command since I am running this on a cluster, or is some step missing?
You need to use either abcDataset.write().jdbc(...) or abcDataset.write().format("jdbc") when you are writing from Spark to an RDBMS over JDBC; do not pass the internal org.apache.spark.sql.execution.datasources.jdbc.DefaultSource class to format() and then also call jdbc(). The exception itself says that Spark's lookup of the "jdbc" source resolves more than one provider class, so either let Spark pick the provider or, as the message suggests, give the fully qualified class name.
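For example, a minimal sketch of the corrected write call, reusing the jdbcUrl, table name and dbProperties from your snippet (for the format("jdbc") variant, user and password have to be passed as options, since the Properties object is not used there):
// Variant 1: plain jdbc() call; the driver is already set in dbProperties
abcDataset.write()
          .mode("append")
          .jdbc(jdbcUrl, "USER123.PERSON", dbProperties);
// Variant 2: the generic writer with the short "jdbc" format name
abcDataset.write()
          .format("jdbc")
          .option("url", jdbcUrl)
          .option("dbtable", "USER123.PERSON")
          .option("driver", "oracle.jdbc.driver.OracleDriver")
          .option("user", dbProperties.getProperty("user"))         // assumes these keys exist
          .option("password", dbProperties.getProperty("password")) // in dbProperties.yaml
          .mode("append")
          .save();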
How do I connect to a MySQL database, execute queries and store those results in a database table using the JSR223 Sampler? Please give a step-by-step process and sample code for this topic.
Download the MySQL JDBC driver and drop it into the "lib" folder of your JMeter installation (or another folder in the JMeter classpath)
Restart JMeter to pick up the .jar
Add Thread Group to your Test Plan
Add JSR223 Sampler to your Thread Group
Put the following code into "Script" area:
import groovy.sql.Sql
// Connection details - replace the placeholders with the real host, port, database and credentials
def url = 'jdbc:mysql://your-database-host:your-database-port/your-database-name'
def user = 'your-username'
def password = 'your-password'
def driver = 'com.mysql.cj.jdbc.Driver'
// Open the connection
def sql = Sql.newInstance(url, user, password, driver)
// Parameterized INSERT statement and the values to bind to it
def query = 'INSERT INTO your-table-name (your-first-column, your-second-column) VALUES (?,?)'
def params = ['your-first-value', 'your-second-value']
sql.executeInsert query, params
// Release the connection
sql.close()
Change your-database-host, your-database-port, etc. to the real IP address, port, credentials, table name, column names, etc.
Enjoy.
More information:
Apache Groovy - Working with a relational database
Apache Groovy - Why and How You Should Use It
P.S. I believe using the JDBC Request sampler would be way faster and easier
The following diagram depicts the simplified ingestion flow we are building to ingest data from different RDBMSs into Hive.
Step 1: Using a JDBC connection to the data source, source data is streamed and saved to a CSV file on HDFS using the HDFS Java API. Basically, we execute a 'SELECT *' query and write each row to the CSV file until the ResultSet is exhausted (a rough sketch of this step is shown below).
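A minimal sketch of that step, assuming a plain JDBC connection and Hadoop's FileSystem API (jdbcUrl, user, password, the query and the HDFS path are placeholders, and the usual java.sql, java.io and org.apache.hadoop imports are in place):
// Stream the ResultSet row by row into a CSV file on HDFS
FileSystem fs = FileSystem.get(new Configuration());
try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
     Statement stmt = con.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT * FROM source_table");
     BufferedWriter out = new BufferedWriter(
             new OutputStreamWriter(fs.create(new Path("/staging/source_table.csv"))))) {
    int columns = rs.getMetaData().getColumnCount();
    while (rs.next()) {                        // one CSV line per row
        StringBuilder line = new StringBuilder();
        for (int i = 1; i <= columns; i++) {
            if (i > 1) line.append(',');
            line.append(rs.getString(i));      // fine for non-binary columns
        }
        out.write(line.toString());
        out.newLine();
    }
}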
Step 2: Using LOAD DATA INPATH command, Hive table is populated using the CSV file created in Step 1.
We use JDBC ResultSet.getString() to get column data.
This works fine for non-binary data.
But for BLOB and CLOB type columns, we cannot write the column data into a text/CSV file.
My question: is it possible to use the ORC or Avro format to handle binary columns? Do these formats support writing row by row?
(Update: we are aware of Sqoop, NiFi, etc.; the reason for implementing our custom ingestion flow is beyond the scope of this question.)
I have a comma-separated file which I want to load into memory and query as if it were a database. I've come across many concepts/names but am not sure which one is right, e.g. embedded DB, in-memory database (Apache Ignite, etc.). How can I achieve that?
I recommend working with Apache Spark: you can load your file and then query it using Spark SQL as follows:
val df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
// Select only the "user_id" column
df.select("user_id").show()
see link for more information.
If you are using Apache Spark 1.6, your code would be:
HiveContext hqlContext = new HiveContext(sparkContext);
DataFrame df = hqlContext.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(csvpath);
df.registerTempTable("table_name");
And then you can query the table:
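For example (a hypothetical query against the temp table registered above; the column name is just a placeholder):
DataFrame result = hqlContext.sql("SELECT user_id FROM table_name");
result.show();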
I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with the spark.sql() one, the following exception was thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset dataset =
spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "wiki");
put("table", "treated_article");
}
}).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you use a table identifier that isn't in the catalogue, it will throw errors like the one you posted. The read command doesn't require a catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
spark.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "wiki");
put("table", "treated_article");
}
}).load();
Then use one of the catalogue registry functions (an example follows the list):
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name.
void createTempView(String viewName)
Creates a local temporary view using the given name.
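For example, a short sketch using the dataset created above (the view name is an arbitrary choice):
dataset.createOrReplaceTempView("treated_article");
// The view can now be referenced in any SQL issued through the same session
Dataset<Row> result = spark.sql("SELECT * FROM treated_article");
result.show();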
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The DataStax Enterprise software (DataStax is my employer) automatically registers all Cassandra tables by placing entries in the Hive metastore that Spark uses as a catalogue. This makes all tables accessible without manual registration.
This method allows SELECT statements to be used without an accompanying CREATE VIEW.
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the place where this would be specified is taken up by the keyspace. The closest documentation for something like this that I can find is in the DataFrames section of the Cassandra connector documentation. You can try to specify a USING clause, but I don't think that will work inside a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.
From within Java code, where I already have a connection to a database, I need to find the default schema of the connection.
I have the following code that gives me a list of all schemas of that connection.
rs = transactionManager.getDataSource().getConnection().getMetaData().getSchemas();
while (rs.next()) {
log.debug("The schema is {} and the catalogue is {} ", rs.getString(1), rs.getString(2));
}
However, I don't want the list of all the schemas. I need the default schema of this connection.
Please help.
Note 1: I am using H2 and DB2 on Windows 7 (dev box) and Red Hat Linux (production box).
Note 2: I finally concluded that it was not possible to use the Connection object in Java to find the default schema of both H2 and DB2 with the same code. I fixed the problem with a configuration file. However, if someone can share a solution, I could go back and refactor the code.
Please use the connection.getMetaData().getURL() method, which returns a String like
jdbc:mysql://localhost:3306/?autoReconnect=true&useUnicode=true&characterEncoding=utf8
We can parse it and get the schema name. This works for most JDBC drivers, although the exact URL layout (and therefore the parsing) is driver-specific.
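A minimal sketch of that parsing approach, assuming a MySQL-style URL of the form jdbc:mysql://host:port/dbname?params (H2 and DB2 URLs are laid out differently, so the rule would need adjusting per driver; schemaFromUrl is a hypothetical helper, not a JDBC API):
// Hypothetical helper: extract the database/schema name from a MySQL-style JDBC URL
static String schemaFromUrl(String url) {
    String afterScheme = url.substring(url.indexOf("//") + 2);    // host:port/dbname?params
    int slash = afterScheme.indexOf('/');
    if (slash < 0) {
        return "";                                                // no database in the URL
    }
    String dbAndParams = afterScheme.substring(slash + 1);
    int q = dbAndParams.indexOf('?');
    return q >= 0 ? dbAndParams.substring(0, q) : dbAndParams;    // strip trailing ?params
}
// Usage:
String defaultSchema = schemaFromUrl(connection.getMetaData().getURL());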