Hi, I am new to Java Spark and have been looking for a solution for a couple of days.
I am working on loading MongoDB data into a Hive table; however, saveAsTable fails with this error:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(oid,StringType,true)) (value: BsonString{value='54d3e8aeda556106feba7fa2'})
I've tried increasing the sampleSize, different mongo-spark-connector versions, ... but none of them worked.
I can't figure out what the root cause is and what gaps in between need to be addressed.
The most confusing part is that I have similar sets of data going through the same flow without issue.
The MongoDB data schema is a nested struct and array:
root
|-- sample: struct (nullable = true)
| |-- parent: struct (nullable = true)
| | |-- expanded: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: integer (nullable = true)
| | | | |-- id: struct (nullable = true)
| | | | | |-- oid: string (nullable = true)
| | | | |-- keys: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- parent_id: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- oid: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- id: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- oid: string (nullable = true)
Sample data:
"sample": {
"expanded": [
{
"distance": 0,
"type": "domain",
"id": "54d3e17b5cf737074d4065b0",
"parent_id": [
"54d3e1775cf737074d406599"
],
"name": "level2"
},
{
"distance": 1,
"type": "domain",
"id": "54d3e1775cf737074d406599",
"name": "level1"
}
],
"id": [
"54d3e17b5cf737074d4065b0"
]
}
Sample code:
public static void main(final String[] args) throws InterruptedException {
// spark session read mongodb
SparkSession mongo_spark = SparkSession.builder()
.master("local")
.appName("MongoSparkConnectorIntro")
.config("mongo_spark.master", "local")
.config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
.enableHiveSupport()
.getOrCreate();
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(mongo_spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
implicitDS.printSchema();
implicitDS.show();
// createOrReplaceTempView to see if the data being read
// implicitDS.createOrReplaceTempView("my_table");
// implicitDS.printSchema();
// implicitDS.show();
// saveAsTable
implicitDS.write().saveAsTable("my_table");
mongo_spark.sql("SELECT * FROM my_table limit 1").show();
mongo_spark.stop();
}
If anyone has any thoughts, I would very much appreciate it.
Thanks
After I increased the sample size appropriately, this problem no longer occurred.
See also: How to config Java Spark sparksession samplesize
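For reference, here is a minimal sketch of raising the connector's sampleSize read option when building the SparkSession. The option name below follows the mongo-spark-connector 2.x convention (spark.mongodb.input.sampleSize), so adjust it for your connector version:
// Sketch: raise the schema-inference sample size (default is 1000 documents).
SparkSession mongo_spark = SparkSession.builder()
        .master("local")
        .appName("MongoSparkConnectorIntro")
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test_db.test_collection")
        .config("spark.mongodb.input.sampleSize", 100000)
        .enableHiveSupport()
        .getOrCreate();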
I had the same problem, and sampleSize partially fixes it, but it doesn't solve it if you have a lot of data.
Here is how you can fix it. Use this approach together with an increased sampleSize (in my case 100000):
from pyspark.sql.types import ArrayType, NullType, StringType, StructField, StructType

def fix_schema(schema: StructType) -> StructType:
"""Fix spark schema due to inconsistent MongoDB schema collection.
It fixes such issues like:
Cannot cast STRING into a NullType
Cannot cast STRING into a StructType
:param schema: a source schema taken from a Spark DataFrame to be fixed
"""
if isinstance(schema, StructType):
return StructType([fix_schema(field) for field in schema.fields])
if isinstance(schema, ArrayType):
return ArrayType(fix_schema(schema.elementType))
if isinstance(schema, StructField) and is_struct_oid_obj(schema):
return StructField(name=schema.name, dataType=StringType(), nullable=schema.nullable)
elif isinstance(schema, StructField):
return StructField(schema.name, fix_schema(schema.dataType), schema.nullable)
if isinstance(schema, NullType):
return StringType()
return schema
def is_struct_oid_obj(struct_field: StructField) -> bool:
"""
Checks that our schema has StructType field with single oid name inside
:param struct_field: a StructField from Spark schema
:return bool
"""
return (isinstance(struct_field.dataType, StructType)
and len(struct_field.dataType.fields) == 1
and struct_field.dataType.fields[0].name == "oid")
I'm using Spark in Java to process XML files. The spark-xml package from Databricks is used to read the XML files into a DataFrame.
The example XML files are:
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<travel>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</travel>
</expenses>
</RowTag>
<RowTag>
<id>2</id>
<name>joe</name>
<expenses>
<food>
<details>
<date>20191204</date>
<amount>500</amount>
</details>
</food>
</expenses>
</RowTag>
The resulting Spark Dataset<Row> df is shown below; each row represents one XML file.
+--+------+----------------+
|id| name |expenses        |
+--+------+----------------+
|1 | john |[[20191203,400]]|
|2 | joe  |[[20191204,500]]|
+--+------+----------------+
df.printSchema(); shows the following:
root
|-- id: int (nullable = true)
|-- name: string (nullable = true)
|-- expenses: struct (nullable = true)
| |-- travel: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
| |-- food: struct (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- date: string (nullable = true)
| | | |-- amount: int (nullable = true)
The desired output DataFrame is like:
+--+------+-------------+
|id| name |expenses_date|
+--+------+-------------+
|1 | john |20191203     |
|2 | joe  |20191204     |
+--+------+-------------+
Basically, I want a generic solution to get the date from an XML file with the following structure, in which only the tag <X> will differ.
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
**<X>**
<details>
<date>20191203</date>
<amount>400</amount>
</details>
**</X>**
</expenses>
</RowTag>
What I have tried:
spark.udf().register("getDate",(UDF1 <Row, String>) (Row row) -> {
return row.getStruct(0).getStruct(0).getAs("date").toString();
}, DataTypes.StringType);
df.select(callUDF("getDate",df.col("expenses")).as("expenses_date")).show();
But it didn't work, because row.getStruct(0) routes to <travel>, and for the row joe there is no <travel> tag under <expenses>, so it threw a java.lang.NullPointerException. What I want is a generic solution such that, for each row, it automatically picks up the next tag name, e.g. row.getStruct(0) routes to <travel> for row john and to <food> for row joe.
So my question is: how should I reformulate my UDF to achieve this?
Thanks in advance!! :)
The spark-xml package allows you to access nested fields directly in the select expression. Why are you looking for a UDF?
df.selectExpr("id", "name", "COALESCE(`expenses`.`food`.`details`.`date`, `expenses`.`travel`.`details`.`date`) AS expenses_date" ).show()
Output:
+---+----+-------------+
| id|name|expenses_date|
+---+----+-------------+
| 1|john| 20191203|
| 2| joe| 20191204|
+---+----+-------------+
EDIT
If the only tag that changes is the one right under the expenses struct, then you can list all the fields under expenses and coalesce the columns expenses.X.details.date. Something like this in Spark (Scala):
val expenses_fields = df.select(col("expenses.*")).columns
val date_cols = expenses_fields.map(f => col(s"`expenses`.`$f`.`details`.`date`"))
df.select(col("id"), col("name"), coalesce(date_cols: _*).alias("expenses_date")).show()
Still, you don't need a UDF!
I have multiple JSON files containing JSON data. The JSON structure looks like this:
{
  "Name": "Vipin Suman",
  "Email": "vpn2330#gmail.com",
  "Designation": "Trainee Programmer",
  "Age": 22,
  "location": {
    "City": {
      "Pin": 324009,
      "City Name": "Ahmedabad"
    },
    "State": "Gujarat"
  },
  "Company": {
    "Company Name": "Elegant",
    "Domain": "Java"
  },
  "Test": ["Test1", "Test2"]
}
I tried this:
String jsonFilePath = "/home/vipin/workspace/Smarten/jsonParsing/Employee/Employee-03.json";
String[] jsonFiles = jsonFilePath.split(",");
Dataset<Row> people = sparkSession.read().json(jsonFiles);
The schema I am getting for this is:
root
|-- Age: long (nullable = true)
|-- Company: struct (nullable = true)
| |-- Company Name: string (nullable = true)
| |-- Domain: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Email: string (nullable = true)
|-- Name: string (nullable = true)
|-- Test: array (nullable = true)
| |-- element: string (containsNull = true)
|-- location: struct (nullable = true)
| |-- City: struct (nullable = true)
| | |-- City Name: string (nullable = true)
| | |-- Pin: long (nullable = true)
| |-- State: string (nullable = true)
I am getting this view of the table:
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age| Company| Designation| Email| Name| Test| location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330#gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
I want the result as:
Age | Company Name | Domain| Designation | Email | Name | Test | City Name | Pin | State |
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 |
How can I get the table in the above format? I have tried everything. I am new to Apache Spark; can anyone help me out?
I suggest you do your work in Scala, which is better supported by Spark. To do your work, you can use the select API to select specific columns, use alias to rename a column, and you can refer to this page to see how to select complex data formats (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html).
Based on your desired result, you also need to use the explode API (Flattening Rows in Spark).
In Scala it could be done like this:
people.select(
$"Age",
$"Company.*",
$"Designation",
$"Email",
$"Name",
explode($"Test"),
$"location.City.*",
$"location.State")
Unfortunately, the following code in Java would fail:
people.select(
people.col("Age"),
people.col("Company.*"),
people.col("Designation"),
people.col("Email"),
people.col("Name"),
explode(people.col("Test")),
people.col("location.City.*"),
people.col("location.State"));
You can use selectExpr instead though:
people.selectExpr(
"Age",
"Company.*",
"Designation",
"Email",
"Name",
"EXPLODE(Test) AS Test",
"location.City.*",
"location.State");
PS:
You can pass the path to the directory or directories instead of the list of JSON files in sparkSession.read().json(jsonFiles);.
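For example, the read could point at the directory itself (the path below just reuses the directory from the question as an illustration):
// Reads every JSON file under the directory instead of listing files one by one.
Dataset<Row> people = sparkSession.read().json("/home/vipin/workspace/Smarten/jsonParsing/Employee/");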
I have the data schema of a LinkedIn account as shown below. I need to query the skills, which are in the form of an array, where the array may contain either JAVA or java or Java or JAVA developer or Java developer.
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people"
+ " WHERE ARRAY_CONTAINS(skills,'Java') "
+ " OR ARRAY_CONTAINS(skills,'JAVA')"
+ " OR ARRAY_CONTAINS(skills,'Java developer') "
+ "AND ARRAY_CONTAINS(experience['description'],'Java developer')" );
The above query is what I have tried; please suggest any better way. Also, how do I make the query case-insensitive?
df.printSchema()
root
|-- skills: array (nullable = true)
| |-- element: string (containsNull = true)
df.show()
+--------------------+
| skills|
+--------------------+
| [Java, java]|
|[Java Developer, ...|
| [dev]|
+--------------------+
Now let's register it as a temp table:
>>> df.registerTempTable("t")
Now, we will explode the array, convert each element to lower case, and query using the LIKE operator:
>>> res = sqlContext.sql("select skills, lower(skill) as skill from (select skills, explode(skills) skill from t) a where lower(skill) like '%java%'")
>>> res.show()
+--------------------+--------------+
| skills| skill|
+--------------------+--------------+
| [Java, java]| java|
| [Java, java]| java|
|[Java Developer, ...|java developer|
|[Java Developer, ...| java dev|
+--------------------+--------------+
Now, you can do a distinct on the skills field.
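Since the question is about Java Spark, a rough Java equivalent of the PySpark snippet above (just a sketch, assuming a SparkSession named spark) would be:
df.createOrReplaceTempView("t");
// Explode the array, lower-case each element, filter with LIKE, and keep distinct skills rows.
Dataset<Row> res = spark.sql(
        "SELECT DISTINCT skills FROM "
        + "(SELECT skills, explode(skills) AS skill FROM t) a "
        + "WHERE lower(skill) LIKE '%java%'");
res.show();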
I am quite new to Spark and DataFrames, and I am going nuts searching the internet for how to insert data into a MySQL table using DataFrames (Spark-Java). I found lots of stuff on Scala, but there is very little information on Java.
I followed the steps provided in the link http://www.sparkexpert.com/2015/04/17/save-apache-spark-dataframe-to-database/. It looked pretty simple, but when I tried it myself, I faced issues in creating the correct DataFrame schema and inserting data into a table with an auto-increment field.
MySQL table (Person) schema:
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| person_id | int(11) | NO | PRI | NULL | auto_increment |
| first_name | varchar(30) | YES | | NULL | |
| last_name | varchar(30) | YES | | NULL | |
| gender | char(1) | YES | | NULL | |
+------------+-------------+------+-----+---------+----------------+
Java Code
DataFrame usersDf= sqlContext.jsonFile("data.json");
usersDf.printSchema();
usersDf.insertIntoJDBC(MYSQL_CONNECTION_URL, "person", false);
data.json
{"person_id":null,"first_name":"Judith1","last_name":"knight1","gender":"M"}
{"person_id":null,"first_name":"Judith2","last_name":"knight2","gender":"F"}
{"person_id":null,"first_name":"Judith3","last_name":"knight3","gender":"M"}
{"person_id":null,"first_name":"Judith4","last_name":"knight4","gender":"M"}
When I run the above code, the DataFrame is created with the schema given below:
root
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- last_name: string (nullable = true)
|-- person_id: string (nullable = true)
Whereas the schema should be:
root
|-- person_id: integer (nullable = false)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- gender: string (nullable = true)
So because of the wrongly created schema, I got the following error:
java.sql.SQLException: Incorrect integer value: 'Judith1' for column 'person_id' at row 1
Please let me know how to solve this problem. I know it is only some minor issue, but I couldn't find it. Any help would be much appreciated. Thanks in advance.
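Not part of the original post, but one possible direction as a sketch: since person_id is auto-increment in MySQL, you could drop it from the DataFrame and append the remaining columns through the JDBC writer (available since Spark 1.4); MYSQL_CONNECTION_URL and the credentials below are placeholders:
import org.apache.spark.sql.SaveMode;

// Sketch: omit the auto-increment column and let MySQL generate it on insert.
DataFrame toInsert = usersDf.select("first_name", "last_name", "gender");

java.util.Properties connectionProperties = new java.util.Properties();
connectionProperties.put("user", "root");          // placeholder credentials
connectionProperties.put("password", "password");  // placeholder credentials

toInsert.write().mode(SaveMode.Append).jdbc(MYSQL_CONNECTION_URL, "person", connectionProperties);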