Stream join example with Apache Kafka? - java

I was looking for an example using Kafka Streams on how to do this sort of thing, i.e. join a customers table with an addresses table and sink the data to ES:
Customers
+------+------------+-----------+-----------------------+
| id   | first_name | last_name | email                 |
+------+------------+-----------+-----------------------+
| 1001 | Sally      | Thomas    | sally.thomas@acme.com |
| 1002 | George     | Bailey    | gbailey@foobar.com    |
| 1003 | Edward     | Davidson  | ed@walker.com         |
| 1004 | Anne       | Kim       | annek@noanswer.org    |
+------+------------+-----------+-----------------------+
Addresses
+----+-------------+---------------------------+------------+--------------+-------+----------+
| id | customer_id | street                    | city       | state        | zip   | type     |
+----+-------------+---------------------------+------------+--------------+-------+----------+
| 10 | 1001        | 3183 Moore Avenue         | Euless     | Texas        | 76036 | SHIPPING |
| 11 | 1001        | 2389 Hidden Valley Road   | Harrisburg | Pennsylvania | 17116 | BILLING  |
| 12 | 1002        | 281 Riverside Drive       | Augusta    | Georgia      | 30901 | BILLING  |
| 13 | 1003        | 3787 Brownton Road        | Columbus   | Mississippi  | 39701 | SHIPPING |
| 14 | 1003        | 2458 Lost Creek Road      | Bethlehem  | Pennsylvania | 18018 | SHIPPING |
| 15 | 1003        | 4800 Simpson Square       | Hillsdale  | Oklahoma     | 73743 | BILLING  |
| 16 | 1004        | 1289 University Hill Road | Canehill   | Arkansas     | 72717 | LIVING   |
+----+-------------+---------------------------+------------+--------------+-------+----------+
Output Elasticsearch index
"hits": [
{
"_index": "customers_with_addresses",
"_type": "_doc",
"_id": "1",
"_score": 1.3278645,
"_source": {
"first_name": "Sally",
"last_name": "Thomas",
"email": "sally.thomas#acme.com",
"addresses": [{
"street": "3183 Moore Avenue",
"city": "Euless",
"state": "Texas",
"zip": "76036",
"type": "SHIPPING"
}, {
"street": "2389 Hidden Valley Road",
"city": "Harrisburg",
"state": "Pennsylvania",
"zip": "17116",
"type": "BILLING"
}],
}
}, ….
The table data is coming from Debezium topics. Am I correct in thinking I need some Java in the middle to join the streams and output the result to a new topic, which is then sunk into ES?
Would anyone have any example code of this?
Thanks.

Yes, you can implement the solution using the Kafka Streams API in Java in the following way.
1. Consume the topics as streams.
2. Aggregate the address stream into a list keyed by customer ID and convert the stream into a table.
3. Join the customer stream with the address table.
Below is an example (assuming the data is consumed in JSON format):
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

// stringSerde and jsonNodeSerde are assumed to be defined elsewhere.
StreamsBuilder builder = new StreamsBuilder();
ObjectMapper mapper = new ObjectMapper();

KStream<String, JsonNode> customers = builder.stream("customer", Consumed.with(stringSerde, jsonNodeSerde));
KStream<String, JsonNode> addresses = builder.stream("address", Consumed.with(stringSerde, jsonNodeSerde));

// Select the customer ID as key in order to join with the addresses.
KStream<String, JsonNode> customerRekeyed = customers.selectKey((key, value) -> value.get("id").asText());

// Select customer_id as key to aggregate the addresses and join with the customer.
KTable<String, JsonNode> addressTable = addresses
    .selectKey((key, value) -> value.get("customer_id").asText())
    .groupByKey()
    .aggregate(
        mapper::createArrayNode,                                        // initializer: empty array per customer
        (key, value, aggregate) -> ((ArrayNode) aggregate).add(value),  // adder: append each address
        Materialized.with(stringSerde, jsonNodeSerde));

// Join the customer stream with the address table.
KStream<String, JsonNode> customerAddressStream = customerRekeyed.leftJoin(addressTable,
    (left, right) -> {
        ObjectNode customerNode = ((ObjectNode) left).deepCopy();
        List<JsonNode> addressList = new ArrayList<>();
        // right can be null in a left join; the aggregate is an ArrayNode of addresses.
        if (right != null) {
            ((ArrayNode) right).elements().forEachRemaining(addressList::add);
        }
        customerNode.putArray("addresses").addAll(addressList);
        return customerNode;
    },
    Joined.keySerde(stringSerde).withValueSerde(jsonNodeSerde));
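To complete the picture from the question (join, write to a new topic, sink to ES), the remaining wiring might look roughly like this; the output topic name and application ID below are just placeholders:
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Produced;

// Write the joined records to an output topic, which a Kafka Connect
// Elasticsearch sink connector can then index into ES.
customerAddressStream.to("customers_with_addresses", Produced.with(stringSerde, jsonNodeSerde));

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-address-joiner");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));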
You can refer to the details about all types of joins here:
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#joining

Depending on how strict your requirement is to nest multiple addresses in one customer node, you can do this in KSQL (which is built on top of Kafka Streams).
Populate some test data into Kafka (which in your case is done already through Debezium):
$ curl -s "https://api.mockaroo.com/api/ffa9ff20?count=10&key=ff7856d0" | kafkacat -b localhost:9092 -t addresses -P
$ curl -s "https://api.mockaroo.com/api/9b868890?count=4&key=ff7856d0" | kafkacat -b localhost:9092 -t customers -P
Fire up KSQL and, to start with, just inspect the data:
ksql> PRINT 'addresses' FROM BEGINNING ;
Format:JSON
{"ROWTIME":1558519823351,"ROWKEY":"null","id":1,"customer_id":1004,"street":"8 Moulton Center","city":"Bronx","state":"New York","zip":"10474","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":2,"customer_id":1001,"street":"5 Hollow Ridge Alley","city":"Washington","state":"District of Columbia","zip":"20016","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":3,"customer_id":1000,"street":"58 Maryland Point","city":"Greensboro","state":"North Carolina","zip":"27404","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":4,"customer_id":1002,"street":"55795 Derek Avenue","city":"Temple","state":"Texas","zip":"76505","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":5,"customer_id":1002,"street":"164 Continental Plaza","city":"Modesto","state":"California","zip":"95354","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":6,"customer_id":1004,"street":"6 Miller Road","city":"Louisville","state":"Kentucky","zip":"40205","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":7,"customer_id":1003,"street":"97 Shasta Place","city":"Pittsburgh","state":"Pennsylvania","zip":"15286","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":8,"customer_id":1000,"street":"36 Warbler Circle","city":"Memphis","state":"Tennessee","zip":"38109","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":9,"customer_id":1001,"street":"890 Eagan Circle","city":"Saint Paul","state":"Minnesota","zip":"55103","type":"SHIPPING"}
{"ROWTIME":1558519823354,"ROWKEY":"null","id":10,"customer_id":1000,"street":"8 Judy Terrace","city":"Washington","state":"District of Columbia","zip":"20456","type":"SHIPPING"}
^C
Topic printing ceased
ksql>
ksql> PRINT 'customers' FROM BEGINNING;
Format:JSON
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1001,"first_name":"Jolee","last_name":"Handasyde","email":"jhandasyde0#nhs.uk"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1002,"first_name":"Rebeca","last_name":"Kerrod","email":"rkerrod1#sourceforge.net"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1003,"first_name":"Bobette","last_name":"Brumble","email":"bbrumble2#cdc.gov"}
{"ROWTIME":1558519852368,"ROWKEY":"null","id":1004,"first_name":"Royal","last_name":"De Biaggi","email":"rdebiaggi3#opera.com"}
Now we declare a STREAM (Kafka topic + schema) on the data so that we can manipulate it further:
ksql> CREATE STREAM addresses_RAW (ID INT, CUSTOMER_ID INT, STREET VARCHAR, CITY VARCHAR, STATE VARCHAR, ZIP VARCHAR, TYPE VARCHAR) WITH (KAFKA_TOPIC='addresses', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
ksql> CREATE STREAM customers_RAW (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
We're going to model the customers as a TABLE, and to do that the Kafka messages need to be keyed correctly (at the moment they have null keys, as can be seen from the "ROWKEY":"null" in the PRINT output above). You can configure Debezium to set the message key, so this step may not be necessary for you in KSQL:
ksql> CREATE STREAM CUSTOMERS_KEYED WITH (PARTITIONS=1) AS SELECT * FROM CUSTOMERS_RAW PARTITION BY ID;
Message
----------------------------
Stream created and running
----------------------------
Now we declare a TABLE (state for a given key, instantiated from a Kafka topic + schema):
ksql> CREATE TABLE CUSTOMER (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='CUSTOMERS_KEYED', VALUE_FORMAT='JSON', KEY='ID');
Message
---------------
Table created
---------------
Now we can join the data:
ksql> CREATE STREAM customers_with_addresses AS
SELECT CUSTOMER_ID,
FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME,
FIRST_NAME,
LAST_NAME,
TYPE AS ADDRESS_TYPE,
STREET,
CITY,
STATE,
ZIP
FROM ADDRESSES_RAW A
INNER JOIN CUSTOMER C
ON A.CUSTOMER_ID = C.ID;
Message
----------------------------
Stream created and running
----------------------------
This creates a new KSQL STREAM which in turn populates a new Kafka topic.
ksql> SHOW STREAMS;
 Stream Name              | Kafka Topic              | Format
------------------------------------------------------------------------------------------
 CUSTOMERS_KEYED          | CUSTOMERS_KEYED          | JSON
 ADDRESSES_RAW            | addresses                | JSON
 CUSTOMERS_RAW            | customers                | JSON
 CUSTOMERS_WITH_ADDRESSES | CUSTOMERS_WITH_ADDRESSES | JSON
The stream has a schema:
ksql> DESCRIBE CUSTOMERS_WITH_ADDRESSES;
Name : CUSTOMERS_WITH_ADDRESSES
 Field        | Type
------------------------------------------
 ROWTIME      | BIGINT           (system)
 ROWKEY       | VARCHAR(STRING)  (system)
 CUSTOMER_ID  | INTEGER          (key)
 FULL_NAME    | VARCHAR(STRING)
 FIRST_NAME   | VARCHAR(STRING)
 ADDRESS_TYPE | VARCHAR(STRING)
 LAST_NAME    | VARCHAR(STRING)
 STREET       | VARCHAR(STRING)
 CITY         | VARCHAR(STRING)
 STATE        | VARCHAR(STRING)
 ZIP          | VARCHAR(STRING)
------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
We can query the stream:
ksql> SELECT * FROM CUSTOMERS_WITH_ADDRESSES WHERE CUSTOMER_ID=1002;
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | LIVING | Kerrod | 55795 Derek Avenue | Temple | Texas | 76505
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | SHIPPING | Kerrod | 164 Continental Plaza | Modesto | California | 95354
We can also stream it to Elasticsearch using Kafka Connect:
curl -i -X POST -H "Accept:application/json" \
  -H "Content-Type:application/json" http://localhost:8083/connectors/ \
  -d '{
    "name": "sink-elastic-customers_with_addresses-00",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "topics": "CUSTOMERS_WITH_ADDRESSES",
      "connection.url": "http://elasticsearch:9200",
      "type.name": "type.name=kafkaconnect",
      "key.ignore": "true",
      "schema.ignore": "true",
      "key.converter": "org.apache.kafka.connect.storage.StringConverter",
      "value.converter": "org.apache.kafka.connect.json.JsonConverter",
      "value.converter.schemas.enable": "false"
    }
  }'
Result:
$ curl -s http://localhost:9200/customers_with_addresses/_search | jq '.hits.hits[0]'
{
  "_index": "customers_with_addresses",
  "_type": "type.name=kafkaconnect",
  "_id": "CUSTOMERS_WITH_ADDRESSES+0+2",
  "_score": 1,
  "_source": {
    "ZIP": "76505",
    "CITY": "Temple",
    "ADDRESS_TYPE": "LIVING",
    "CUSTOMER_ID": 1002,
    "FULL_NAME": "Rebeca Kerrod",
    "STATE": "Texas",
    "STREET": "55795 Derek Avenue",
    "LAST_NAME": "Kerrod",
    "FIRST_NAME": "Rebeca"
  }
}

We built a demo and blog post on this very use case (streaming aggregates to Elasticsearch) a while ago on the Debezium blog.
One issue to keep in mind is that this solution (based on Kafka Streams, but I reckon it's the same for KSQL) is prone to exposing intermediary join results. E.g. assume you insert a customer and 10 addresses in one transaction. The stream join approach might first produce an aggregate of the customer and their first five addresses, and shortly thereafter the complete aggregate with all 10 addresses. This might or might not be desirable for your specific use case. I also remember that handling deletions isn't trivial (e.g. if you delete one of the 10 addresses, you'll have to produce the aggregate again with the remaining 9 addresses, which might themselves have been untouched).
An alternative to consider is the outbox pattern, where you'd essentially produce an explicit event with the precomputed aggregate from within your application itself. I.e. it requires a little help from the application, but it then avoids the subtleties of producing that join result after the fact.
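As a rough, hypothetical illustration of that pattern (not from the original answer): inside the same database transaction that changes customer or address rows, the application writes one row to an "outbox" table carrying the precomputed nested JSON, and Debezium then captures that table and publishes the event. The table and column names here are assumptions:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical outbox write, called inside the same transaction that changes
// the customer/address rows; Debezium then captures the "outbox" table.
void writeOutboxEvent(Connection connection, String customerId, JsonNode customerWithAddresses) throws Exception {
    String payload = new ObjectMapper().writeValueAsString(customerWithAddresses); // precomputed nested aggregate
    try (PreparedStatement ps = connection.prepareStatement(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload) VALUES (?, ?, ?, ?, ?)")) {
        ps.setString(1, UUID.randomUUID().toString());
        ps.setString(2, "customer");
        ps.setString(3, customerId);
        ps.setString(4, "CustomerAddressesChanged");
        ps.setString(5, payload);
        ps.executeUpdate();
    }
}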

Related

MetaException(message:Partition spec is incorrect): proper way to fix case-sensitive insert from Spark into partitioned Hive table

I have a partitioned Hive table described as:
CREATE TABLE IF NOT EXISTS TRADES
(
TRADE_ID STRING,
TRADE_DATE INT,
-- ...
)
PARTITIONED BY (BUSINESS_DATE INT)
STORED AS PARQUET;
When I insert the data from a Spark-based Java application as
try (SparkSession sparkSession = SparkSession.builder()
        .config(new SparkConf())
        .enableHiveSupport()
        // .config("hive.exec.dynamic.partition", "true")
        // .config("hive.exec.dynamic.partition.mode", "nonstrict")
        // .config("spark.sql.hive.convertMetastoreParquet", "false")
        .getOrCreate()) {
    dataset.select(columns(joinedDs, businessDate))
            .write()
            .format("parquet")
            .option("compression", "snappy")
            .mode(SaveMode.Append)
            .insertInto("TRADES");
}
//...
private Column[] columns(Dataset<Row> dataset, LocalDate businessDate) {
    List<Column> columns = new ArrayList<>();
    for (String column : appConfig.getColumns()) {
        columns.add(dataset.col(column));
    }
    columns.add(lit(dateToInteger(businessDate)).as("BUSINESS_DATE"));
    return columns.toArray(new Column[0]);
}
it fails with the following exception:
23/01/27 10:39:26 ERROR metadata.Hive: Exception when loading partition with parameters partPath=hdfs://path-to-trades/.hive-staging_hive_2023-01-27_10-38-29_374_384898966661095068-1/-ext-10000/BUSINESS_DATE=20221230, table=trades, partSpec={business_date=, BUSINESS_DATE=20221230}, replace=false, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {business_date=, BUSINESS_DATE=20221230})
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: MetaException(message:Partition spec is incorrect. {business_date=, BUSINESS_DATE=20221230})
at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
... 5 more
I fixed this issue by specifying lower case for the partitioning column, both in the DDL:
CREATE TABLE IF NOT EXISTS TRADES
(
TRADE_ID STRING,
TRADE_DATE INT,
-- ...
)
PARTITIONED BY (business_date INT) -- <-- lower case!
STORED AS PARQUET;
and in the Java code:
columns.add(lit(dateToInteger(businessDate)).as("business_date")); // <-- lower case!
I suspect there is a way to configure Spark to use Hive's case (or vice versa), or to make them case-insensitive. I tried uncommenting the properties for the SparkSession, but no luck.
Does anyone know how to do this properly? I use Spark 2.4.0 and Hive 2.1.1.
P.S. Output of sparkSession.sql("describe formatted TRADES").show(false):
+--------------------+--------------------+-------+
| col_name| data_type|comment|
+--------------------+--------------------+-------+
| ROW_INDEX| int| null|
| OP_GENESIS_FEED_ID| string| null|
| business_date| int| null|
|# Partition Infor...| | |
| # col_name| data_type|comment|
| business_date| int| null|
| | | |
|# Detailed Table ...| | |
| Database| managed| |
| Table| trades| |
| Owner| managed| |
| Created Time|Mon Jan 30 19:59:...| |
| Last Access|Thu Jan 01 02:00:...| |
| Created By|Spark 2.4.0-cdh6.2.1| |
| Type| MANAGED| |
| Provider| hive| |
| Table Properties|[transient_lastDd...| |
| Location|....................| |
| Serde Library|org.apache.hadoop...| |
| InputFormat|org.apache.hadoop...| |
| OutputFormat|org.apache.hadoop...| |
| Storage Properties|[serialization.fo...| |
| Partition Provider| Catalog| |
+--------------------+--------------------+-------+

update row value using another table in spark java

I have 2 datasets in Java Spark. The first one contains id, name, and age,
and the second one has the same columns. I need to check the values (name and id), and if they match, update the age with the new age from dataset2.
I tried every way I could find, but there aren't many resources for Spark in Java and nothing worked.
This is one I tried:
dataset1.createOrReplaceTempView("updatesTable");
datase2.createOrReplaceTempView("carsTheftsFinal2");
updatesNew.show();
Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
test.show(12);
and this is the error :
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: no viable
alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)
I have a hint that I can use a join to update without using an update statement (Java Spark does not provide UPDATE).
Assume that we have ds1 with this data:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 1| Someone| 18|
| 2| Else| 17|
| 3|SomeoneElse| 14|
+---+-----------+---+
and ds2 with this data:
+---+----------+---+
| id| name|age|
+---+----------+---+
| 1| Someone| 14|
| 2| Else| 18|
| 3|NotSomeone| 14|
+---+----------+---+
According to your expected result, the final table would be:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 3|SomeoneElse| 14| <-- not modified, same as ds1
| 1| Someone| 14| <-- modified, was 18
| 2| Else| 18| <-- modified, was 17
+---+-----------+---+
This is achieved with the following transformations. First, we rename ds2's age column to age2.
val renamedDs2 = ds2.withColumnRenamed("age", "age2")
Then:
// we join main dataset with the renamed ds2, now we have age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
  // we overwrite age, if age2 is not null from ds2, take it, otherwise leave age
  .withColumn("age",
    when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
  )
  // finally, we drop age2
  .drop("age2")
Hope this does what you want!
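Since the question is specifically about Java, a rough Java equivalent of the same idea might look like this (a sketch, assuming ds1 and ds2 are Dataset<Row> instances with the columns id, name, and age):
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Rename ds2's age so it does not clash with ds1's age after the join.
Dataset<Row> renamedDs2 = ds2.withColumnRenamed("age", "age2");

Dataset<Row> result = ds1
    .join(renamedDs2,
          ds1.col("id").equalTo(renamedDs2.col("id"))
              .and(ds1.col("name").equalTo(renamedDs2.col("name"))),
          "left")
    // Take age2 from ds2 when a match exists, otherwise keep the original age.
    .withColumn("age", coalesce(col("age2"), col("age")))
    // Drop the duplicated join columns coming from ds2, plus the helper column.
    .drop(renamedDs2.col("id"))
    .drop(renamedDs2.col("name"))
    .drop("age2");

result.show();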

JDBC pagination in one to many relation and mapping to pojo?

We have a message logs table, and we are using this table to provide a search UI which lets users search messages by id, status, auditor, or date. The audit table looks like below:
+-----------+----------+---------+---------------------+
| messageId | auditor | status | timestamp |
+-----------+----------+---------+---------------------+
| 10 | program1 | Failed | 2020-08-01 10:00:00 |
| 11 | program2 | success | 2020-08-01 10:01:10 |
| 12 | program3 | Failed | 2020-08-01 10:01:15 |
+-----------+----------+---------+---------------------+
Since in a given date range we could have many messages matching the criteria, we added pagination for the query. Now, as a new feature, we are adding another table with a one-to-many relation, which contains tags for the possible reasons for the failure. The failure_tags table will look like below:
+-----------+----------+-------+--------+
| messageId | auditor | type | cause |
+-----------+----------+-------+--------+
| 10 | program1 | type1 | cause1 |
| 10 | program1 | type1 | cause2 |
| 10 | program1 | type2 | cause3 |
+-----------+----------+-------+--------+
Now a general search query for status = 'Failed', using a left join with the other table, will retrieve 4 rows as below:
+-----------+----------+-------+--------+---------------------+
| messageId | auditor | type | cause | timestamp |
+-----------+----------+-------+--------+---------------------+
| 10 | program1 | type1 | cause1 | 2020-08-01 10:00:00 |
| 10 | program1 | type1 | cause2 | 2020-08-01 10:00:00 |
| 10 | program1 | type2 | cause3 | 2020-08-01 10:00:00 |
| 12 | program3 | | | 2020-08-01 10:01:15 |
+-----------+----------+-------+--------+---------------------+
Since the 3 rows for messageId 10 belong to the same message, the requirement is to merge them into 1 element in the JSON response, so the response will have only 2 elements:
[
  {
    "messageId": "10",
    "auditor": "program1",
    "failures": [
      {
        "type": "type1",
        "cause": [
          "cause1",
          "cause2"
        ]
      },
      {
        "type": "type2",
        "cause": [
          "cause3"
        ]
      }
    ],
    "date": "2020-08-01 10:00:00"
  },
  {
    "messageId": "12",
    "auditor": "program3",
    "failures": [],
    "date": "2020-08-01 10:01:15"
  }
]
Because of this merge, a pagination request of 10 elements would, after fetching from the database and merging, result in fewer than 10 results.
The one solution I could think of is: after merging, if the result is smaller than the page size, initiate another search, do the combining again, and take the top 10 elements. Is there any better solution to get all the results in one query instead of going to the DB twice or more?
We use generic Spring JDBC, not JPA.
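One common approach that keeps the page size stable is to paginate over the parent audit rows first, and only then fetch the failure_tags rows for the messageIds on that page, merging them in Java. A minimal sketch with Spring JDBC follows; jdbcTemplate, namedParameterJdbcTemplate, pageSize and offset are assumed to exist, and the SQL uses MySQL-style LIMIT/OFFSET:
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// 1. Page over the parent rows only, so LIMIT/OFFSET counts messages, not joined rows.
List<Map<String, Object>> messages = jdbcTemplate.queryForList(
    "SELECT messageId, auditor, status, timestamp "
  + "FROM audit WHERE status = ? ORDER BY timestamp LIMIT ? OFFSET ?",
    "Failed", pageSize, offset);

// 2. Fetch all tags for just the messageIds on this page, in a single query.
List<Object> ids = messages.stream()
    .map(row -> row.get("messageId"))
    .collect(Collectors.toList());
List<Map<String, Object>> tags = ids.isEmpty()
    ? Collections.emptyList()
    : namedParameterJdbcTemplate.queryForList(
          "SELECT messageId, type, cause FROM failure_tags WHERE messageId IN (:ids)",
          Collections.singletonMap("ids", ids));

// 3. Group the tags by messageId (and then by type) and attach them to each
//    message while building the JSON response.
Map<Object, List<Map<String, Object>>> tagsByMessage = tags.stream()
    .collect(Collectors.groupingBy(row -> row.get("messageId")));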

Only compare YYYY-MM part of date using Springboot and MongoDB

I am facing an issue with MongoDB date fields.
I have one field in a MongoDB collection which is of date type and stores the whole date in ISO format, like below:
2020-05-13T14:00:00.000
Now from my API I am passing 2020-05, i.e. YYYY-MM, and I want to fetch all records with a 2020-05 value from the DB.
I have tried multiple options, such as the one below:
Date mydate = new Date();
Criteria criteria = Criteria
        .where("date").is(mydate);
I have the below data in the DB:
|---------------------|------------------|------------------|
| Name | ID | Date |
|---------------------|------------------|------------------|
| A | 1 |2020-05-13T14:00:00|
|---------------------|------------------|------------------|
| B | 2 |2020-12-31T14:00:00|
|---------------------|------------------|------------------|
| C | 3 |2020-07-17T14:00:00|
|---------------------|------------------|------------------|
| D | 4 |2020-05-29T14:00:00|
|---------------------|------------------|------------------|
And if I pass "2020-05" then it should return:
|---------------------|------------------|--------------------|
| Name | ID | Date |
|---------------------|------------------|--------------------|
| A | 1 |2020-05-13T14:00:00 |
|---------------------|------------------|--------------------|
| D | 4 |2020-05-29T14:00:00 |
|---------------------|------------------|--------------------|
Is there anything in MongoDB like STR_TO_DATE() in MySQL or TO_DATE() in Oracle?
I am trying to implement this in a Spring Boot application.
One solution would be this one, which uses $expr to compare only the year and month parts of the date:
db.collection.find({
$expr: {
$and: [
{ $eq: [{ $year: "$Date" }, 2020] },
{ $eq: [{ $month: "$Date" }, 5] }
]
}
})
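If you want the same thing from Spring Boot, a simpler alternative is to turn the YYYY-MM value into a [start, end) date range and query it with Spring Data's MongoTemplate. A rough sketch; the field name "date", the collection name, and the injected mongoTemplate are assumptions:
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.Date;
import java.util.List;

import org.bson.Document;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// "2020-05" comes from the API; build the first day of that month and of the next one.
YearMonth ym = YearMonth.parse("2020-05");
Date start = Date.from(ym.atDay(1).atStartOfDay(ZoneOffset.UTC).toInstant());
Date end   = Date.from(ym.plusMonths(1).atDay(1).atStartOfDay(ZoneOffset.UTC).toInstant());

// Match every document whose date falls inside [start, end).
Query query = new Query(Criteria.where("date").gte(start).lt(end));
List<Document> docs = mongoTemplate.find(query, Document.class, "myCollection");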

Attempting to count unique users between two categories in Spark

I have a Dataset in Spark with two columns, one called user and the other called category, such that the table looks somewhat like this:
+---------------+---------------+
| user| category|
+---------------+---------------+
| garrett| syncopy|
| garrison| musictheory|
| marta| sheetmusic|
| garrett| orchestration|
| harold| chopin|
| marta| russianmusic|
| niko| piano|
| james| sheetmusic|
| manny| violin|
| charles| gershwin|
| dawson| cello|
| bob| cello|
| george| cello|
| george| americanmusic|
| bob| personalcompos|
| george| sheetmusic|
| fred| sheetmusic|
| bob| sheetmusic|
| garrison| sheetmusic|
| george| musictheory|
+---------------+---------------+
only showing top 20 rows
Each row in the table is unique, but a user and category can appear multiple times. The objective is to count the number of users that two categories share. For example, cello and americanmusic share a user named george, and musictheory and sheetmusic share the users george and garrison. The goal is to get the number of distinct users between n categories, meaning that there are at most n squared edges between categories. I partially understand how to do this operation, but I am struggling a little bit to convert my thoughts to Spark Java.
My thinking is that I need to do a self-join on user to get a table that would be structured like this:
+---------------+---------------+---------------+
| user| category| category|
+---------------+---------------+---------------+
| garrison| musictheory| sheetmusic|
| george| musictheory| sheetmusic|
| garrison| musictheory| musictheory|
| george| musictheory| musicthoery|
| garrison| sheetmusic| musictheory|
| george| sheetmusic| musictheory|
+---------------+---------------+---------------+
The self join operation in Spark (Java code) is not difficult:
Dataset<Row> newDataset = allUsersToCategories.join(allUsersToCategories, "user");
This is getting somewhere, however I get mappings to the same category, as in rows 3 and 4 of the above example, and I get backwards mappings where the categories are reversed, which essentially double counts each user interaction, like in rows 5 and 6 of the above example.
What I believe I need to do is have some sort of conditional in my join, something along the lines of X < Y, so that equal categories and duplicates get thrown away. Finally, I then need to count the number of distinct rows for the n squared combinations, where n is the number of categories.
Could somebody please explain how to do this in Spark and specifically Spark Java since I am a little unfamiliar with the Scala syntax?
Thanks for the help.
I'm not sure if I understand your requirements correctly, but I will try to help.
According to my understanding, the expected result for the above data should look like below. If that's not correct, please let me know and I will try to make the required modifications.
+--------------+--------------+-----+
|_1            |_2            |count|
+--------------+--------------+-----+
|personalcompos|sheetmusic    |1    |
|cello         |musictheory   |1    |
|americanmusic |cello         |1    |
|cello         |sheetmusic    |2    |
|cello         |personalcompos|1    |
|russianmusic  |sheetmusic    |1    |
|americanmusic |sheetmusic    |1    |
|americanmusic |musictheory   |1    |
|musictheory   |sheetmusic    |2    |
|orchestration |syncopy       |1    |
+--------------+--------------+-----+
In this case you can solve your problem with the below Scala code:
allUsersToCategories
  .groupByKey(_.user)
  .flatMapGroups { case (user, userCategories) =>
    val categories = userCategories.map(uc => uc.category).toSeq
    for {
      c1 <- categories
      c2 <- categories
      if c1 < c2
    } yield (c1, c2)
  }
  .groupByKey(x => x)
  .count()
  .show()
If you need a symmetric result, you can just change the if statement in the flatMapGroups transformation to if c1 != c2.
Please note that in the above example I used the Dataset API, which for test purposes was created with the below code:
case class UserCategory(user: String, category: String)
val allUsersToCategories = session.createDataset(Seq(
UserCategory("garrett", "syncopy"),
UserCategory("garrison", "musictheory"),
UserCategory("marta", "sheetmusic"),
UserCategory("garrett", "orchestration"),
UserCategory("harold", "chopin"),
UserCategory("marta", "russianmusic"),
UserCategory("niko", "piano"),
UserCategory("james", "sheetmusic"),
UserCategory("manny", "violin"),
UserCategory("charles", "gershwin"),
UserCategory("dawson", "cello"),
UserCategory("bob", "cello"),
UserCategory("george", "cello"),
UserCategory("george", "americanmusic"),
UserCategory("bob", "personalcompos"),
UserCategory("george", "sheetmusic"),
UserCategory("fred", "sheetmusic"),
UserCategory("bob", "sheetmusic"),
UserCategory("garrison", "sheetmusic"),
UserCategory("george", "musictheory")
))
I was trying to provide an example in Java, but I don't have any experience with Java+Spark and it is too time-consuming for me to migrate the above example from Scala to Java...
I found the answer a couple of hours ago using Spark SQL:
Dataset<Row> connectionsPerSharedUser = spark.sql("SELECT a.user as user, "
    + "a.category as categoryOne, "
    + "b.category as categoryTwo "
    + "FROM allTable as a INNER JOIN allTable as b "
    + "ON a.user = b.user AND a.category < b.category");
This will then create a Dataset with three columns user, categoryOne, and categoryTwo. Each row will be unique and will indicate when the user exists in both categories.
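If you also want the shared-user count per category pair (like the expected-result table in the earlier answer), a short follow-up aggregation along these lines should work; it assumes connectionsPerSharedUser is the Dataset produced by the query above:
import static org.apache.spark.sql.functions.countDistinct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Count the distinct shared users for every (categoryOne, categoryTwo) pair.
Dataset<Row> sharedUserCounts = connectionsPerSharedUser
        .groupBy("categoryOne", "categoryTwo")
        .agg(countDistinct("user").alias("sharedUsers"));

sharedUserCounts.show();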
