How to insert data into mysql through spark DataFrames using Java - java

I am quite new to Spark and DataFrames, and I am going nuts searching the internet for how to insert data into a MySQL table using DataFrames (Spark-Java). I found lots of material on Scala, but there is very little information for Java.
I followed the steps provided in the link http://www.sparkexpert.com/2015/04/17/save-apache-spark-dataframe-to-database/. It looked pretty simple, but when I tried it myself I ran into issues creating the correct DataFrame schema and inserting data into a table with an auto-increment field.
MySQL table (Person) schema
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| person_id  | int(11)     | NO   | PRI | NULL    | auto_increment |
| first_name | varchar(30) | YES  |     | NULL    |                |
| last_name  | varchar(30) | YES  |     | NULL    |                |
| gender     | char(1)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
Java Code
DataFrame usersDf = sqlContext.jsonFile("data.json");
usersDf.printSchema();
usersDf.insertIntoJDBC(MYSQL_CONNECTION_URL, "person", false);
data.json
{"person_id":null,"first_name":"Judith1","last_name":"knight1","gender":"M"}
{"person_id":null,"first_name":"Judith2","last_name":"knight2","gender":"F"}
{"person_id":null,"first_name":"Judith3","last_name":"knight3","gender":"M"}
{"person_id":null,"first_name":"Judith4","last_name":"knight4","gender":"M"}
When I run the above code, the DataFrame is created with the schema given below:
root
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- last_name: string (nullable = true)
|-- person_id: string (nullable = true)
Whereas the schema should be:
root
|-- person_id: integer (nullable = false)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- gender: string (nullable = true)
So because of the wrongly created schema, I got the following error:
java.sql.SQLException: Incorrect integer value: 'Judith1' for column 'person_id' at row 1
Please let me know how to solve this problem. I know it is probably only a minor issue, but I couldn't find it. Any help would be much appreciated. Thanks in advance.
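In case it helps, here is a minimal sketch of one way around both issues, assuming Spark 1.4+ (the user/password properties are placeholders, since the connection details are not shown in the question): drop the auto-increment column so MySQL assigns person_id itself, and append through the DataFrameWriter JDBC API rather than the deprecated insertIntoJDBC.

import java.util.Properties;
import org.apache.spark.sql.DataFrame;

// Read the JSON with the inferred schema, then drop the auto-increment column
// so MySQL fills in person_id on insert.
DataFrame usersDf = sqlContext.read().json("data.json");

Properties props = new Properties();
props.put("user", "myUser");         // placeholder credentials, not from the question
props.put("password", "myPassword"); // placeholder credentials, not from the question

usersDf.drop("person_id")
       .write()
       .mode("append")
       .jdbc(MYSQL_CONNECTION_URL, "person", props);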

Related

How to resolve com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast... Java Spark

Hi, I am new to Java Spark and have been looking for a solution for a couple of days.
I am working on loading MongoDB data into a Hive table; however, saveAsTable fails with this error:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(oid,StringType,true)) (value: BsonString{value='54d3e8aeda556106feba7fa2'})
I've tried increasing the sampleSize, different mongo-spark-connector versions, ... but none of them worked.
I can't figure out what the root cause is, or what gap in between needs to be addressed.
The most confusing part is that I have similar sets of data going through the same flow without issue.
The MongoDB data schema is a nested struct with arrays:
root
|-- sample: struct (nullable = true)
| |-- parent: struct (nullable = true)
| | |-- expanded: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: integer (nullable = true)
| | | | |-- id: struct (nullable = true)
| | | | | |-- oid: string (nullable = true)
| | | | |-- keys: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- parent_id: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- oid: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- id: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- oid: string (nullable = true)
sample data
"sample": {
"expanded": [
{
"distance": 0,
"type": "domain",
"id": "54d3e17b5cf737074d4065b0",
"parent_id": [
"54d3e1775cf737074d406599"
],
"name": "level2"
},
{
"distance": 1,
"type": "domain",
"id": "54d3e1775cf737074d406599",
"name": "level1"
}
],
"id": [
"54d3e17b5cf737074d4065b0"
]
}
sample code
public static void main(final String[] args) throws InterruptedException {
    // spark session read mongodb
    SparkSession mongo_spark = SparkSession.builder()
            .master("local")
            .appName("MongoSparkConnectorIntro")
            .config("mongo_spark.master", "local")
            .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
            .enableHiveSupport()
            .getOrCreate();

    // Create a JavaSparkContext using the SparkSession's SparkContext object
    JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());

    // Load data and infer schema, disregard toDF() name as it returns Dataset
    Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
    implicitDS.printSchema();
    implicitDS.show();

    // createOrReplaceTempView to see if the data is being read
    // implicitDS.createOrReplaceTempView("my_table");
    // implicitDS.printSchema();
    // implicitDS.show();

    // saveAsTable
    implicitDS.write().saveAsTable("my_table");
    mongo_spark.sql("SELECT * FROM my_table limit 1").show();

    mongo_spark.stop();
}
If anyone has some thoughts I would very much appreciate it.
Thanks
Once I increased the sample size properly, this problem no longer occurred.
How to config Java Spark sparksession samplesize
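For illustration, here is a sketch of how the sample size might be raised when building the session (it assumes the connector honours the spark.mongodb.input.sampleSize read option; check the exact key against your mongo-spark-connector version):

SparkSession mongo_spark = SparkSession.builder()
        .master("local")
        .appName("MongoSparkConnectorIntro")
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
        // assumption: sampleSize controls how many documents are sampled for schema inference
        .config("spark.mongodb.input.sampleSize", 100000)
        .enableHiveSupport()
        .getOrCreate();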
I had the same problem and sampleSize partially fixes this problem, but doesn't solve it if you have a lot of data.
Here is how you can fix this. Use this approach together with an increased sampleSize (in my case it's 100000):
from pyspark.sql.types import (ArrayType, NullType, StringType,
                               StructField, StructType)


def fix_schema(schema: StructType) -> StructType:
    """Fix spark schema due to inconsistent MongoDB schema collection.

    It fixes issues such as:
        Cannot cast STRING into a NullType
        Cannot cast STRING into a StructType

    :param schema: a source schema taken from a Spark DataFrame to be fixed
    """
    if isinstance(schema, StructType):
        return StructType([fix_schema(field) for field in schema.fields])
    if isinstance(schema, ArrayType):
        return ArrayType(fix_schema(schema.elementType))
    if isinstance(schema, StructField) and is_struct_oid_obj(schema):
        return StructField(name=schema.name, dataType=StringType(), nullable=schema.nullable)
    elif isinstance(schema, StructField):
        return StructField(schema.name, fix_schema(schema.dataType), schema.nullable)
    if isinstance(schema, NullType):
        return StringType()
    return schema


def is_struct_oid_obj(struct_field: StructField) -> bool:
    """
    Checks that our schema has a StructType field with a single oid name inside.

    :param struct_field: a StructField from a Spark schema
    :return bool
    """
    return (isinstance(struct_field.dataType, StructType)
            and len(struct_field.dataType.fields) == 1
            and struct_field.dataType.fields[0].name == "oid")

how to flatten complex nested json in spark dataframe using java dynamically

The schema of my input JSON DataFrame looks like the following:
company: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- address: struct (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- county: string (nullable = true)
| | | |-- latitude: string (nullable = true)
| | | |-- line1: string (nullable = true)
| | | |-- line2: string (nullable = true)
| | | |-- longitude: string (nullable = true)
| | | |-- postalCode: long (nullable = true)
| | | |-- state: struct (nullable = true)
| | | | |-- code: int (nullable = true)
| | | | |-- name: string (nullable = true)
| | | |-- stateOtherDescription: string (nullable = true)
| | |-- addressSourceOther: string (nullable = true)
| | |-- addressSourceType: struct (nullable = true)
| | | |-- code: int (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- reasons: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- improve: string (nullable = true)
| | | | |-- far: string (nullable = true)
| | | | |-- home: string (nullable = true)
I want to flatten it dynamically using Spark Java. Can someone help me with this?
Take a look at the de.stefanfrings/parsing/JsonSlurper class in http://stefanfrings.de/bfUtilities/bfUtilities.zip. It reads a JSON document and creates a flat HashMap from the content.
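If you would rather stay inside the Spark DataFrame API, here is a rough sketch of a dynamic flattener (a starting point, not a drop-in solution; the class and method names are illustrative and it assumes Spark 2.2+ for explode_outer): it expands each struct column into its children and explodes each array column, repeating until the schema is flat.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode_outer;

public class Flattener {

    // Repeatedly expands struct columns and explodes array columns until no nested types remain.
    public static Dataset<Row> flatten(Dataset<Row> df) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (StructField field : df.schema().fields()) {
                String name = field.name();
                if (field.dataType() instanceof StructType) {
                    // Replace the struct column by one column per child, prefixed with the parent name.
                    List<Column> cols = new ArrayList<>();
                    for (StructField other : df.schema().fields()) {
                        if (!other.name().equals(name)) {
                            cols.add(col("`" + other.name() + "`"));
                        }
                    }
                    for (StructField child : ((StructType) field.dataType()).fields()) {
                        cols.add(col("`" + name + "`.`" + child.name() + "`")
                                .alias(name + "_" + child.name()));
                    }
                    df = df.select(cols.toArray(new Column[0]));
                    changed = true;
                    break;
                } else if (field.dataType() instanceof ArrayType) {
                    // Explode the array into one row per element; structs inside it are handled on the next pass.
                    df = df.withColumn(name, explode_outer(col("`" + name + "`")));
                    changed = true;
                    break;
                }
            }
        }
        return df;
    }
}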

Dataset Filter working in an unexpected way

Scenario:
I have read two XML files by specifying a schema on load.
In the schema, one of the tags is mandatory. One XML is missing that mandatory tag.
Now, when I do the following, I am expecting the XML with the missing mandatory tag to be filtered out.
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
In the code, when I try to count the rows of the dataset I get the count as 2 (2 input XMLs), but when I try to print the dataset via the show() method I get an NPE.
When I debugged the above line and did the following, I got 0 as the count.
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
Question:
Can anyone please answer the questions/confirm my understanding below?
Why is the Spark Dataset not filtering out the row which does not have the mandatory column?
Why is there no exception in count, but there is one in the show method?
For 2, I believe count just counts the number of rows without looking into the contents.
For show, the iterator actually goes through the struct fields to print their values, and when it does not find the mandatory column, it errors out.
P.S. If I make the mandatory column optional, everything works fine.
Edit:
Providing the reading options as requested.
For loading the data I am executing the following:
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
        .option("header", "true")
        .option("inferSchema", "false")
        .option("rowTag", rowTag) // rowTag is the "body" tag in the XML
        .option("failFast", "true")
        .option("mode", "FAILFAST")
        .schema(schema)
        .load(XMLfilePath);
Providing samples as requested
Schema:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
Sample XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <body>
        <porch account="something">
            <vegetable>
                <now gravity="wide">
                    <field>box</field>
                    <chief further="satisfied">-1889614487</chief>
                </now>
                <leg nose="angle">912658017.229279</leg>
            </vegetable>
            <cast>clear</cast>
        </porch>
        <old beyond="continent">
            <real eat="term">
                <kill top="plates">-1623084908.8669372</kill>
                <tool affect="pond">today</tool>
            </real>
            <lot chose="swung">promised</lot>
        </old>
    </body>
</root>
Schema in JSON format:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
My scenario can be reproduced by marking the element "old" as nullable = false in the schema and removing that tag from the XML.
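For what it's worth, here is a sketch of the workaround hinted at in the P.S. above (the column and variable names are the ones used in this question; not verified against spark-xml internals): keep the column nullable in the schema so the row can be read at all, then filter the incomplete rows out afterwards.

Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
        .option("rowTag", rowTag)
        .option("mode", "FAILFAST")
        .schema(schema) // same schema, but with the mandatory column left as nullable = true
        .load(XMLfilePath)
        .filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
dataset.show();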

Spark UDF: How to write a UDF on each row to extract a specific value in a nested struct?

I'm using Spark in Java to process XML files. The spark-xml package from Databricks is used to read the XML files into a DataFrame.
The example XML files are:
<RowTag>
    <id>1</id>
    <name>john</name>
    <expenses>
        <travel>
            <details>
                <date>20191203</date>
                <amount>400</amount>
            </details>
        </travel>
    </expenses>
</RowTag>
<RowTag>
    <id>2</id>
    <name>joe</name>
    <expenses>
        <food>
            <details>
                <date>20191204</date>
                <amount>500</amount>
            </details>
        </food>
    </expenses>
</RowTag>
The resulting Spark Dataset<Row> df is shown below; each row represents one XML file.
+--+------+----------------+
|id| name |expenses        |
+--+------+----------------+
|1 | john |[[20191203,400]]|
|2 | joe  |[[20191204,500]]|
+--+------+----------------+
df.printSchema() shows the following:
root
|-- id: int (nullable = true)
|-- name: string (nullable = true)
|-- expenses: struct (nullable = true)
| |-- travel: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
| |-- food: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
The desired output DataFrame looks like:
+--+------+-------------+
|id| name |expenses_date|
+--+------+-------------+
|1 | john |20191203     |
|2 | joe  |20191204     |
+--+------+-------------+
Basically, I want a generic solution to get the date from XML with the following structure, in which only the tag <X> will differ.
<RowTag>
    <id>1</id>
    <name>john</name>
    <expenses>
        **<X>**
            <details>
                <date>20191203</date>
                <amount>400</amount>
            </details>
        **</X>**
    </expenses>
</RowTag>
What I have tried:
spark.udf().register("getDate", (UDF1<Row, String>) (Row row) -> {
    return row.getStruct(0).getStruct(0).getAs("date").toString();
}, DataTypes.StringType);

df.select(callUDF("getDate", df.col("expenses")).as("expenses_date")).show();
But it didn't work, because row.getStruct(0) routes to <travel>, and for the row joe there is no <travel> tag under <expenses>, so it returned a java.lang.NullPointerException. What I want is a generic solution such that for each row it can automatically pick up the right tag name, e.g. row.getStruct(0) routes to <travel> for row john and to <food> for row joe.
So my question is: how should I reformulate my UDF to achieve this?
Thanks in advance!! :)
The spark-xml package allows you to access nested fields directly in a select expression. Why are you looking for a UDF?
df.selectExpr("id", "name", "COALESCE(`expenses`.`food`.`details`.`date`, `expenses`.`travel`.`details`.`date`) AS expenses_date").show()
Output:
+---+----+-------------+
| id|name|expenses_date|
+---+----+-------------+
| 1|john| 20191203|
| 2| joe| 20191204|
+---+----+-------------+
EDIT
If the only tag which changes is the one after the expenses struct, then you can list all the fields under expenses and coalesce the columns expenses.X.details.date. Something like this in Spark (Scala):
val expenses_fields = df.select(col("expenses.*")).columns
val date_cols = expenses_fields.map(f => col(s"`expenses`.`$f`.`details`.`date`"))
df.select(col("id"), col("name"), coalesce(date_cols: _*).alias("expenses_date")).show()
Still, you don't need a UDF!
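Since the question asks about Java, here is a rough translation of the Scala snippet above (a sketch; it assumes static imports from org.apache.spark.sql.functions):

import java.util.Arrays;
import org.apache.spark.sql.Column;
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

// List every field under "expenses", build the expenses.<X>.details.date columns,
// and coalesce them into a single expenses_date column.
String[] expensesFields = df.select(col("expenses.*")).columns();
Column[] dateCols = Arrays.stream(expensesFields)
        .map(f -> col("expenses.`" + f + "`.details.date"))
        .toArray(Column[]::new);
df.select(col("id"), col("name"), coalesce(dateCols).alias("expenses_date")).show();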

I am using Apache Spark to parse JSON files. How to get a nested key from JSON files, whether it's an array or a nested key

I have multiple JSON files which keep JSON data in them. The JSON structure looks like this:
{
    "Name": "Vipin Suman",
    "Email": "vpn2330#gmail.com",
    "Designation": "Trainee Programmer",
    "Age": 22,
    "location": {
        "City": {
            "Pin": 324009,
            "City Name": "Ahmedabad"
        },
        "State": "Gujarat"
    },
    "Company": {
        "Company Name": "Elegant",
        "Domain": "Java"
    },
    "Test": ["Test1", "Test2"]
}
I tried this
String jsonFilePath = "/home/vipin/workspace/Smarten/jsonParsing/Employee/Employee-03.json";
String[] jsonFiles = jsonFilePath.split(",");
Dataset<Row> people = sparkSession.read().json(jsonFiles);
The schema I am getting for this is:
root
|-- Age: long (nullable = true)
|-- Company: struct (nullable = true)
| |-- Company Name: string (nullable = true)
| |-- Domain: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Email: string (nullable = true)
|-- Name: string (nullable = true)
|-- Test: array (nullable = true)
| |-- element: string (containsNull = true)
|-- location: struct (nullable = true)
| |-- City: struct (nullable = true)
| | |-- City Name: string (nullable = true)
| | |-- Pin: long (nullable = true)
| |-- State: string (nullable = true)
The view of the table I am getting is:
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age| Company| Designation| Email| Name| Test| location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330#gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
I want the result as:
Age | Company Name | Domain| Designation | Email | Name | Test | City Name | Pin | State |
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 |
How can I get the table in the above format? I have tried everything. I am new to Apache Spark; can anyone help me out?
I suggest you do your work in Scala, which is better supported by Spark. To do your work, you can use the "select" API to select specific columns, use alias to rename a column, and you can refer to this post to see how to select complex data formats (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html).
Based on your expected result, you also need to use the "explode" API (Flattening Rows in Spark).
In Scala it could be done like this:
people.select(
  $"Age",
  $"Company.*",
  $"Designation",
  $"Email",
  $"Name",
  explode($"Test"),
  $"location.City.*",
  $"location.State")
Unfortunately, the following code in Java would fail:
people.select(
    people.col("Age"),
    people.col("Company.*"),
    people.col("Designation"),
    people.col("Email"),
    people.col("Name"),
    explode(people.col("Test")),
    people.col("location.City.*"),
    people.col("location.State"));
You can use selectExpr instead though:
people.selectExpr(
    "Age",
    "Company.*",
    "Designation",
    "Email",
    "Name",
    "EXPLODE(Test) AS Test",
    "location.City.*",
    "location.State");
PS:
You can pass the path of a directory (or directories) instead of the list of JSON files in sparkSession.read().json(jsonFiles);.
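For example (a sketch; the directory path just mirrors the one used above):

// Point json() at the directory; Spark reads every JSON file it contains.
Dataset<Row> people = sparkSession.read().json("/home/vipin/workspace/Smarten/jsonParsing/Employee/");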
