We have a messaging logs table, and we use it to provide a search UI that lets users search messages by id, status, auditor, or date. The audit table looks like below:
+-----------+----------+---------+---------------------+
| messageId | auditor | status | timestamp |
+-----------+----------+---------+---------------------+
| 10 | program1 | Failed | 2020-08-01 10:00:00 |
| 11 | program2 | success | 2020-08-01 10:01:10 |
| 12 | program3 | Failed | 2020-08-01 10:01:15 |
+-----------+----------+---------+---------------------+
Since many messages in a given date range could match the criteria, we added pagination to the query. Now, as a new feature, we are adding another table with a one-to-many relation that contains tags for the possible reasons for the failure. The failure_tags table will look like below:
+-----------+----------+-------+--------+
| messageId | auditor | type | cause |
+-----------+----------+-------+--------+
| 10 | program1 | type1 | cause1 |
| 10 | program1 | type1 | cause2 |
| 10 | program1 | type2 | cause3 |
+-----------+----------+-------+--------+
Now a general search query for status = 'Failed' that left joins the other table will retrieve 4 rows, as below:
+-----------+----------+-------+--------+---------------------+
| messageId | auditor | type | cause | timestamp |
+-----------+----------+-------+--------+---------------------+
| 10 | program1 | type1 | cause1 | 2020-08-01 10:00:00 |
| 10 | program1 | type1 | cause2 | 2020-08-01 10:00:00 |
| 10 | program1 | type2 | cause3 | 2020-08-01 10:00:00 |
| 12 | program3 | | | 2020-08-01 10:01:15 |
+-----------+----------+-------+--------+---------------------+
Since the 3 rows for messageId 10 belong to the same message, the requirement is to merge them into 1 element in the JSON response, so the response will have only 2 elements:
[
{
"messageId": "10",
"auditor": "program1",
"failures": [
{
"type": "type1",
"cause": [
"cause1",
"cause2"
]
},
{
"type": "type2",
"cause": [
"cause3"
]
}
],
"date": "2020-08-01 10:00:00"
},
{
"messageId": "12",
"auditor": "program3",
"failures": [],
"date": "2020-08-01 10:01:15"
}
]
Because of this merge, a pagination request for 10 elements can yield fewer than 10 results once the rows fetched from the database are merged.
The one solution I could think of is, after merging, if the result is smaller than the page size, to search again, repeat the combining process, and take the top 10 elements. Is there a better solution that gets all the results in 1 query instead of going to the DB twice or more?
We use plain Spring JDBC, not JPA.
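One way to keep the page size stable is to paginate over the messages first and only then attach the tags, so the limit counts messages rather than joined rows. The sketch below illustrates that idea with in-memory lists standing in for the two tables; every class and method name in it is hypothetical, it matches tags on messageId only (the real tables also share auditor), and it is not tied to any particular Spring JDBC call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; the lists stand in for the audit and failure_tags tables.
class FailedMessagePager {

    static class AuditRow {
        final int messageId;
        final String auditor;
        final String timestamp;
        AuditRow(int messageId, String auditor, String timestamp) {
            this.messageId = messageId;
            this.auditor = auditor;
            this.timestamp = timestamp;
        }
    }

    static class TagRow {
        final int messageId;
        final String type;
        final String cause;
        TagRow(int messageId, String type, String cause) {
            this.messageId = messageId;
            this.type = type;
            this.cause = cause;
        }
    }

    /** One element of the JSON response: failure type -> causes, in first-seen order. */
    static class MessageView {
        final AuditRow message;
        final Map<String, List<String>> failures = new LinkedHashMap<>();
        MessageView(AuditRow message) {
            this.message = message;
        }
    }

    /** Stands in for a paged query over the audit table only (LIMIT/OFFSET on messages). */
    static List<AuditRow> pageOfMessages(List<AuditRow> failed, int offset, int pageSize) {
        int from = Math.min(offset, failed.size());
        int to = Math.min(from + pageSize, failed.size());
        return new ArrayList<>(failed.subList(from, to));
    }

    /** Stands in for fetching the tags of the paged messages, then merges them per message. */
    static List<MessageView> attachTags(List<AuditRow> page, List<TagRow> tags) {
        Map<Integer, MessageView> byId = new LinkedHashMap<>();
        for (AuditRow row : page) {
            byId.put(row.messageId, new MessageView(row));
        }
        for (TagRow tag : tags) {
            MessageView view = byId.get(tag.messageId);
            if (view == null) {
                continue; // tag belongs to a message outside this page
            }
            view.failures.computeIfAbsent(tag.type, k -> new ArrayList<>()).add(tag.cause);
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<AuditRow> failed = Arrays.asList(
            new AuditRow(10, "program1", "2020-08-01 10:00:00"),
            new AuditRow(12, "program3", "2020-08-01 10:01:15"));
        List<TagRow> tags = Arrays.asList(
            new TagRow(10, "type1", "cause1"),
            new TagRow(10, "type1", "cause2"),
            new TagRow(10, "type2", "cause3"));
        for (MessageView view : attachTags(pageOfMessages(failed, 0, 10), tags)) {
            System.out.println(view.message.messageId + " " + view.failures);
        }
    }
}
```

Depending on the database, the same idea can probably be expressed as a single statement by applying the limit in a subquery over audit and left joining failure_tags to that subquery; whether that is supported and performs well is something to check for the specific database.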
I was looking for an example using Kafka Streams of how to do this sort of thing, i.e. join a customers table with an addresses table and sink the data to ES:
Customers
+------+------------+----------------+-----------------------+
| id | first_name | last_name | email |
+------+------------+----------------+-----------------------+
| 1001 | Sally | Thomas | sally.thomas@acme.com |
| 1002 | George | Bailey | gbailey@foobar.com |
| 1003 | Edward | Davidson | ed@walker.com |
| 1004 | Anne | Kim | annek@noanswer.org |
+------+------------+----------------+-----------------------+
Addresses
+----+-------------+---------------------------+------------+--------------+-------+----------+
| id | customer_id | street | city | state | zip | type |
+----+-------------+---------------------------+------------+--------------+-------+----------+
| 10 | 1001 | 3183 Moore Avenue | Euless | Texas | 76036 | SHIPPING |
| 11 | 1001 | 2389 Hidden Valley Road | Harrisburg | Pennsylvania | 17116 | BILLING |
| 12 | 1002 | 281 Riverside Drive | Augusta | Georgia | 30901 | BILLING |
| 13 | 1003 | 3787 Brownton Road | Columbus | Mississippi | 39701 | SHIPPING |
| 14 | 1003 | 2458 Lost Creek Road | Bethlehem | Pennsylvania | 18018 | SHIPPING |
| 15 | 1003 | 4800 Simpson Square | Hillsdale | Oklahoma | 73743 | BILLING |
| 16 | 1004 | 1289 University Hill Road | Canehill | Arkansas | 72717 | LIVING |
+----+-------------+---------------------------+------------+--------------+-------+----------+
Output Elasticsearch index
"hits": [
{
"_index": "customers_with_addresses",
"_type": "_doc",
"_id": "1",
"_score": 1.3278645,
"_source": {
"first_name": "Sally",
"last_name": "Thomas",
"email": "sally.thomas@acme.com",
"addresses": [{
"street": "3183 Moore Avenue",
"city": "Euless",
"state": "Texas",
"zip": "76036",
"type": "SHIPPING"
}, {
"street": "2389 Hidden Valley Road",
"city": "Harrisburg",
"state": "Pennsylvania",
"zip": "17116",
"type": "BILLING"
}],
}
}, ….
The table data is coming from Debezium topics. Am I correct in thinking I need some Java in the middle to join the streams and output the result to a new topic, which is then sunk into ES?
Would anyone have any example code for this?
Thanks.
Yes, you can implement the solution using the Kafka Streams API in Java in the following way.
Consume the topics as streams.
Aggregate the address stream into a list keyed by customer ID and convert the stream into a table.
Join the customer stream with the address table.
Below is an example (assuming the data is consumed in JSON format):
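import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

// Assumes builder (StreamsBuilder), stringSerde and jsonNodeSerde are already defined.
ObjectMapper mapper = new ObjectMapper();
KStream<String,JsonNode> customers = builder.stream("customer", Consumed.with(stringSerde, jsonNodeSerde));
KStream<String,JsonNode> addresses = builder.stream("address", Consumed.with(stringSerde, jsonNodeSerde));
// Select the customer ID as key in order to join with address.
KStream<String,JsonNode> customerRekeyed = customers.selectKey((key, value) -> value.get("id").asText());
// Select customer_id as key, then collect each customer's addresses into one JSON array.
KTable<String,JsonNode> addressTable = addresses
    .selectKey((key, value) -> value.get("customer_id").asText())
    // Grouped is available from Kafka 2.1; older versions use Serialized.with(...) instead.
    .groupByKey(Grouped.with(stringSerde, jsonNodeSerde))
    .aggregate(() -> mapper.createArrayNode(), // initializer: start with an empty array
        (key, value, aggregate) -> ((ArrayNode) aggregate).add(value), // adder: append the address
        Materialized.with(stringSerde, jsonNodeSerde)
    );
// Join Customer Stream with Address Table
KStream<String,JsonNode> customerAddressStream = customerRekeyed.leftJoin(addressTable,
    (left, right) -> {
        // Copy the customer so the incoming record is not mutated.
        ObjectNode result = ((ObjectNode) left).deepCopy();
        ArrayNode addressArray = result.putArray("addresses");
        // With a left join, right is null until the customer has at least one address.
        if (right != null) {
            addressArray.addAll((ArrayNode) right);
        }
        return result;
    },Joined.keySerde(stringSerde).withValueSerde(jsonNodeSerde));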
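To see what the aggregate-then-join steps compute independently of Kafka, here is a plain-Java sketch that runs the same two operations over in-memory lists. It is only an illustration under simplifying assumptions: the class and method names are hypothetical, the records are reduced to a couple of fields, and nothing here is streaming or stateful the way a real topology is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Kafka-free illustration of "aggregate addresses, then left join customers".
class AggregateThenJoin {

    static class Address {
        final int customerId;
        final String street;
        Address(int customerId, String street) {
            this.customerId = customerId;
            this.street = street;
        }
    }

    static class Customer {
        final int id;
        final String firstName;
        Customer(int id, String firstName) {
            this.id = id;
            this.firstName = firstName;
        }
    }

    /** Groups addresses by customer id; plays the role of the aggregated address table. */
    static Map<Integer, List<Address>> aggregateByCustomer(List<Address> addresses) {
        Map<Integer, List<Address>> table = new LinkedHashMap<>();
        for (Address address : addresses) {
            table.computeIfAbsent(address.customerId, k -> new ArrayList<>()).add(address);
        }
        return table;
    }

    /** Left join keyed by customer id: a customer without addresses gets an empty list. */
    static Map<Integer, List<Address>> leftJoin(List<Customer> customers,
                                                 Map<Integer, List<Address>> addressTable) {
        Map<Integer, List<Address>> joined = new LinkedHashMap<>();
        for (Customer customer : customers) {
            List<Address> matches = addressTable.get(customer.id);
            joined.put(customer.id, matches == null ? new ArrayList<Address>() : matches);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Address> addresses = Arrays.asList(
            new Address(1001, "3183 Moore Avenue"),
            new Address(1001, "2389 Hidden Valley Road"),
            new Address(1002, "281 Riverside Drive"));
        List<Customer> customers = Arrays.asList(
            new Customer(1001, "Sally"),
            new Customer(1002, "George"));
        Map<Integer, List<Address>> joined = leftJoin(customers, aggregateByCustomer(addresses));
        for (Map.Entry<Integer, List<Address>> entry : joined.entrySet()) {
            System.out.println(entry.getKey() + " has " + entry.getValue().size() + " address(es)");
        }
    }
}
```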
You can find details on all types of joins here:
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#joining
Depending on how strict your requirement is to nest multiple addresses in one customer node, you can do this in KSQL (which is built on top of Kafka Streams).
Populate some test data into Kafka (which in your case is done already through Debezium):
$ curl -s "https://api.mockaroo.com/api/ffa9ff20?count=10&key=ff7856d0" | kafkacat -b localhost:9092 -t addresses -P
$ curl -s "https://api.mockaroo.com/api/9b868890?count=4&key=ff7856d0" | kafkacat -b localhost:9092 -t customers -P
Fire up KSQL and to start with just inspect the data:
ksql> PRINT 'addresses' FROM BEGINNING ;
Format:JSON
{"ROWTIME":1558519823351,"ROWKEY":"null","id":1,"customer_id":1004,"street":"8 Moulton Center","city":"Bronx","state":"New York","zip":"10474","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":2,"customer_id":1001,"street":"5 Hollow Ridge Alley","city":"Washington","state":"District of Columbia","zip":"20016","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":3,"customer_id":1000,"street":"58 Maryland Point","city":"Greensboro","state":"North Carolina","zip":"27404","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":4,"customer_id":1002,"street":"55795 Derek Avenue","city":"Temple","state":"Texas","zip":"76505","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":5,"customer_id":1002,"street":"164 Continental Plaza","city":"Modesto","state":"California","zip":"95354","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":6,"customer_id":1004,"street":"6 Miller Road","city":"Louisville","state":"Kentucky","zip":"40205","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":7,"customer_id":1003,"street":"97 Shasta Place","city":"Pittsburgh","state":"Pennsylvania","zip":"15286","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":8,"customer_id":1000,"street":"36 Warbler Circle","city":"Memphis","state":"Tennessee","zip":"38109","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":9,"customer_id":1001,"street":"890 Eagan Circle","city":"Saint Paul","state":"Minnesota","zip":"55103","type":"SHIPPING"}
{"ROWTIME":1558519823354,"ROWKEY":"null","id":10,"customer_id":1000,"street":"8 Judy Terrace","city":"Washington","state":"District of Columbia","zip":"20456","type":"SHIPPING"}
^C
Topic printing ceased
ksql>
ksql> PRINT 'customers' FROM BEGINNING;
Format:JSON
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1001,"first_name":"Jolee","last_name":"Handasyde","email":"jhandasyde0@nhs.uk"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1002,"first_name":"Rebeca","last_name":"Kerrod","email":"rkerrod1@sourceforge.net"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1003,"first_name":"Bobette","last_name":"Brumble","email":"bbrumble2@cdc.gov"}
{"ROWTIME":1558519852368,"ROWKEY":"null","id":1004,"first_name":"Royal","last_name":"De Biaggi","email":"rdebiaggi3@opera.com"}
Now we declare a STREAM (Kafka topic + schema) on the data so that we can manipulate it further:
ksql> CREATE STREAM addresses_RAW (ID INT, CUSTOMER_ID INT, STREET VARCHAR, CITY VARCHAR, STATE VARCHAR, ZIP VARCHAR, TYPE VARCHAR) WITH (KAFKA_TOPIC='addresses', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
ksql> CREATE STREAM customers_RAW (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
We're going to model the customers as a TABLE, and to do that the Kafka messages need to be keyed correctly (at the moment they have null keys, as can be seen from the "ROWKEY":"null" in the PRINT output above). You can configure Debezium to set the message key, so this step may not be necessary for you in KSQL:
ksql> CREATE STREAM CUSTOMERS_KEYED WITH (PARTITIONS=1) AS SELECT * FROM CUSTOMERS_RAW PARTITION BY ID;
Message
----------------------------
Stream created and running
----------------------------
Now we declare a TABLE (state for a given key, instantiated from a Kafka topic + schema):
ksql> CREATE TABLE CUSTOMER (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='CUSTOMERS_KEYED', VALUE_FORMAT='JSON', KEY='ID');
Message
---------------
Table created
---------------
Now we can join the data:
ksql> CREATE STREAM customers_with_addresses AS
SELECT CUSTOMER_ID,
FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME,
FIRST_NAME,
LAST_NAME,
TYPE AS ADDRESS_TYPE,
STREET,
CITY,
STATE,
ZIP
FROM ADDRESSES_RAW A
INNER JOIN CUSTOMER C
ON A.CUSTOMER_ID = C.ID;
Message
----------------------------
Stream created and running
----------------------------
This creates a new KSQL STREAM which in turn populates a new Kafka topic.
ksql> SHOW STREAMS;
Stream Name | Kafka Topic | Format
------------------------------------------------------------------------------------------
CUSTOMERS_KEYED | CUSTOMERS_KEYED | JSON
ADDRESSES_RAW | addresses | JSON
CUSTOMERS_RAW | customers | JSON
CUSTOMERS_WITH_ADDRESSES | CUSTOMERS_WITH_ADDRESSES | JSON
The stream has a schema:
ksql> DESCRIBE CUSTOMERS_WITH_ADDRESSES;
Name : CUSTOMERS_WITH_ADDRESSES
Field | Type
------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CUSTOMER_ID | INTEGER (key)
FULL_NAME | VARCHAR(STRING)
FIRST_NAME | VARCHAR(STRING)
ADDRESS_TYPE | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
STREET | VARCHAR(STRING)
CITY | VARCHAR(STRING)
STATE | VARCHAR(STRING)
ZIP | VARCHAR(STRING)
------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
We can query the stream:
ksql> SELECT * FROM CUSTOMERS_WITH_ADDRESSES WHERE CUSTOMER_ID=1002;
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | LIVING | Kerrod | 55795 Derek Avenue | Temple | Texas | 76505
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | SHIPPING | Kerrod | 164 Continental Plaza | Modesto | California | 95354
We can also stream it to Elasticsearch using Kafka Connect:
curl -i -X POST -H "Accept:application/json" \
-H "Content-Type:application/json" http://localhost:8083/connectors/ \
-d '{
"name": "sink-elastic-customers_with_addresses-00",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"topics": "CUSTOMERS_WITH_ADDRESSES",
"connection.url": "http://elasticsearch:9200",
"type.name": "type.name=kafkaconnect",
"key.ignore": "true",
"schema.ignore": "true",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"
}
}'
Result:
$ curl -s http://localhost:9200/customers_with_addresses/_search | jq '.hits.hits[0]'
{
"_index": "customers_with_addresses",
"_type": "type.name=kafkaconnect",
"_id": "CUSTOMERS_WITH_ADDRESSES+0+2",
"_score": 1,
"_source": {
"ZIP": "76505",
"CITY": "Temple",
"ADDRESS_TYPE": "LIVING",
"CUSTOMER_ID": 1002,
"FULL_NAME": "Rebeca Kerrod",
"STATE": "Texas",
"STREET": "55795 Derek Avenue",
"LAST_NAME": "Kerrod",
"FIRST_NAME": "Rebeca"
}
}
We built a demo and blog post on this very use case (streaming aggregates to Elasticsearch) a while ago on the Debezium blog.
One issue to keep in mind is that this solution (based on Kafka Streams, but I reckon it's the same for KSQL) is prone to exposing intermediary join results. E.g. assume you insert a customer and 10 addresses in one transaction. The stream join approach might first produce an aggregate of the customer and their first five addresses, and shortly thereafter the complete aggregate with all 10 addresses. This might or might not be desirable for your specific use case. I also remember that handling deletions isn't trivial (e.g. if you delete one of the 10 addresses, you'll have to produce the aggregate again with the remaining 9 addresses, which might have been untouched).
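The caveat about intermediary results can be pictured with a deliberately simplified, Kafka-free model: if the aggregate is updated and emitted once per address event, a downstream consumer can observe every partial state on the way to the full one. The names below are hypothetical, and real Kafka Streams record caching may collapse some of these updates, so treat this only as an illustration of the effect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-event aggregation; not Kafka Streams code.
class IntermediateAggregates {

    /** Applies address inserts one by one and records the aggregate emitted after each event. */
    static List<List<String>> emitPerEvent(List<String> addressInserts) {
        List<String> aggregate = new ArrayList<>();
        List<List<String>> emitted = new ArrayList<>();
        for (String address : addressInserts) {
            aggregate.add(address);
            // Snapshot of the partial aggregate that a downstream consumer could observe.
            emitted.add(new ArrayList<>(aggregate));
        }
        return emitted;
    }

    /** A delete means re-emitting the aggregate, including the addresses that did not change. */
    static List<String> emitAfterDelete(List<String> aggregate, String deletedAddress) {
        List<String> rebuilt = new ArrayList<>(aggregate);
        rebuilt.remove(deletedAddress);
        return rebuilt;
    }

    public static void main(String[] args) {
        List<String> inserts = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            inserts.add("address-" + i);
        }
        List<List<String>> emitted = emitPerEvent(inserts);
        System.out.println("aggregates emitted: " + emitted.size());
        System.out.println("size of the fifth one: " + emitted.get(4).size());
    }
}
```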
An alternative to consider is the outbox pattern, where you'd essentially produce an explicit event with the precomputed aggregate from within your application itself. I.e. it requires a little help from the application, but it avoids the subtleties of producing that join result after the fact.
This is a JSON array of objects (student data). I have loaded this JSON-LD data into a Jena Model.
[
{
"@context" : {
"myvocab" : "http://mywebsite.com/vocab/",
"name" : "myvocab:name",
"firstname" : "myvocab:firstname",
"lastname" : "myvocab:lastname",
"rollNumber" : "myvocab:rollNumber"
},
"name" : {
"firstname" : "Dhannan",
"lastname" : "Chaudhary"
},
"rollNumber" : "26"
},
{
"@context" : {
"myvocab" : "http://mywebsite.com/vocab/",
"name" : "myvocab:name",
"firstname" : "myvocab:firstname",
"lastname" : "myvocab:lastname",
"rollNumber" : "myvocab:rollNumber"
},
"name" : {
"firstname" : "Maakin",
"lastname" : "Dhayaal"
},
"rollNumber" : "69"
}
]
This is my model output for the above example (queried with SPARQL):
-------------------------------------------------------------------
| Subject | Predicate | Object |
===================================================================
| _:b0 | <http://mywebsite.com/vocab/lastname> | "Chaudhary" |
| _:b0 | <http://mywebsite.com/vocab/firstname> | "Dhannan" |
| _:b1 | <http://mywebsite.com/vocab/lastname> | "Dhayaal" |
| _:b1 | <http://mywebsite.com/vocab/firstname> | "Maakin" |
| _:b2 | <http://mywebsite.com/vocab/rollNumber> | "62" |
| _:b2 | <http://mywebsite.com/vocab/name> | _:b1 |
| _:b3 | <http://mywebsite.com/vocab/rollNumber> | "61" |
| _:b3 | <http://mywebsite.com/vocab/name> | _:b0 |
-------------------------------------------------------------------
From this model I want only the subjects (Resources, in Jena terms) of each student; in my case that should be (_:b2, _:b3).
But model.listSubjects() gives an iterator over all subjects (_:b0, _:b1, _:b2, _:b3).
My main goal is to be able to get individual models for student 1 and student 2.
How can I achieve this?
All suggestions are welcome.
First, you can use the rdf:type property to declare a Student class as well as a StudentName class (not sure why you'd need to break them up).
You can then check if the subject has the property that you are looking for. You can see how we do this in Eclipse Lyo Jena provider.
Finally, you can model your domain with Lyo modelling tools and generate the POJOs for your domain that can be converted from/to Jena models in a single method call.
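The property check suggested above can be sketched without Jena as a filter over plain triples: keep the subjects that carry rollNumber, then collect everything reachable from each of them to get a per-student sub-model. The Triple class and both helpers are hypothetical stand-ins, not Jena API; in Jena itself, Model.listSubjectsWithProperty(Property) looks like the natural counterpart of the first helper, but verify that against the version you use.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an RDF model: a flat list of subject/predicate/object strings.
class StudentSubjects {

    static class Triple {
        final String subject;
        final String predicate;
        final String object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Subjects that have at least one triple with the given predicate, in first-seen order. */
    static Set<String> subjectsWithProperty(List<Triple> triples, String predicate) {
        Set<String> subjects = new LinkedHashSet<>();
        for (Triple triple : triples) {
            if (triple.predicate.equals(predicate)) {
                subjects.add(triple.subject);
            }
        }
        return subjects;
    }

    /** All triples reachable from root by following objects that are themselves subjects. */
    static List<Triple> subModel(List<Triple> triples, String root) {
        List<Triple> result = new ArrayList<>();
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            String node = pending.poll();
            if (!visited.add(node)) {
                continue; // already expanded; also guards against cycles
            }
            for (Triple triple : triples) {
                if (triple.subject.equals(node)) {
                    result.add(triple);
                    pending.add(triple.object); // literals simply match no subject
                }
            }
        }
        return result;
    }
}
```

Calling subjectsWithProperty with the rollNumber predicate and then subModel for each returned subject yields one small triple list per student, which is the shape the question asks for.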