Kafka Streams: combinations of messages - Java

I have messages in a topic (a compacted topic):
{id: 1, groupId: 1}
{id: 2, groupId: 1}
{id: 3, groupId: 1}
{id: 4, groupId: 2}
I want to group messages by groupId and then get all possible combinations within each group. For example, for groupId=1 the combinations are:
id:1-id:2, id:1-id:3, id:2-id:3.
How can I do this?
Maybe self-join streams?

Use the map operator to move the groupId over to the record key, then groupByKey.
Alternatively, use the aggregate operator and reduce by the value's groupId.
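For illustration, here is a minimal sketch of that map → groupByKey → aggregate approach. The Message value class, its serde (messageSerde), the List<Integer> serde (idListSerde) and the topic names are all placeholders I've introduced, not anything from the question.

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class GroupCombinations {

    // Minimal stand-in for the message value {id, groupId} from the question.
    public static class Message {
        public final int id;
        public final int groupId;
        public Message(int id, int groupId) { this.id = id; this.groupId = groupId; }
    }

    public static void buildTopology(StreamsBuilder builder,
                                     Serde<Message> messageSerde,
                                     Serde<List<Integer>> idListSerde) {

        KStream<String, Message> messages =
                builder.stream("messages", Consumed.with(Serdes.String(), messageSerde));

        // 1) map: move groupId into the record key, keep only the id as the value.
        KStream<Integer, Integer> idByGroup =
                messages.map((key, msg) -> KeyValue.pair(msg.groupId, msg.id));

        // 2) groupByKey + aggregate: collect all distinct ids seen per group.
        KTable<Integer, List<Integer>> idsPerGroup = idByGroup
                .groupByKey(Grouped.with(Serdes.Integer(), Serdes.Integer()))
                .aggregate(
                        ArrayList::new,
                        (groupId, id, ids) -> { if (!ids.contains(id)) ids.add(id); return ids; },
                        Materialized.with(Serdes.Integer(), idListSerde));

        // 3) Every time a group's id list changes, emit all pairwise combinations.
        //    Note: this re-emits previously seen pairs on each update.
        idsPerGroup.toStream()
                .flatMapValues(ids -> {
                    List<String> pairs = new ArrayList<>();
                    for (int i = 0; i < ids.size(); i++) {
                        for (int j = i + 1; j < ids.size(); j++) {
                            pairs.add("id:" + ids.get(i) + "-id:" + ids.get(j));
                        }
                    }
                    return pairs;
                })
                .to("combinations", Produced.with(Serdes.Integer(), Serdes.String()));
    }
}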

Related

List as key for a key value store

I want to store a key-value pair in a database where the key is a list (or set) of Integers.
My use case has the following steps:
I will get a list of integers.
I will need to check if that list of integers (as a key) is already present in the DB.
If it is present, I will pick up the value from the DB.
There are certain computations I need to do if the list (or set) of integers is not already in the DB; if it is there, I just want to return the value and avoid the computations.
I am thinking of keeping the data in a key-value store, but I want the key to be specifically a list or set of integers.
I have thought about the options below.
Option A
Generate a unique hash for the list of integers and store that hash as the key in the key/value store.
Problem:
I could have hash collisions, which would break my use case. I believe there is no way to generate a hash that is unique 100% of the time, so this will not work.
If there were a way to generate a hash that is unique 100% of the time, that would be the best approach.
Option B
Create an immutable class wrapping the List (or Set) of Integers and store that as the key in my key-value store.
Please share any feasible ways to achieve this.
You don’t need to do anything special:
import java.util.*;

Map<List<Integer>, String> keyValueStore = new HashMap<>();
List<Integer> key = Arrays.asList(1, 2, 3);
keyValueStore.put(key, "foo");
All JDK collections implement sensible equals() and hashCode() methods that are based solely on the contents of the collection.
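A quick demonstration of that point (my own addition, not part of the original answer): a different List instance with the same contents works as the same key, while a different order does not.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListKeyDemo {
    public static void main(String[] args) {
        Map<List<Integer>, String> store = new HashMap<>();
        store.put(Arrays.asList(1, 2, 3), "foo");

        // A separate list with the same contents is an equal key.
        System.out.println(store.get(Arrays.asList(1, 2, 3))); // foo

        // Order matters, because List equality is positional.
        System.out.println(store.get(Arrays.asList(3, 2, 1))); // null
    }
}

One caveat: don't mutate a list while it is being used as a key, since its hashCode() changes with its contents; wrapping the key in an unmodifiable list (or an immutable class, as in your Option B) avoids that.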
Thank you. I would like to share some more findings.
I have now tried the following, further to what I mentioned in my earlier post.
I added the documents below in MongoDB:
db.products.insertMany([
    {
        mapping: [1, 2, 3],
        hashKey: 'ABC123',
        date: Date()
    },
    {
        mapping: [4, 5],
        hashKey: 'ABC45',
        date: Date()
    },
    {
        mapping: [6, 7, 8],
        hashKey: 'ABC678',
        date: Date()
    },
    {
        mapping: [9, 10, 11],
        hashKey: 'ABC91011',
        date: Date()
    },
    {
        mapping: [1, 9, 10],
        hashKey: 'ABC1910',
        date: Date()
    },
    {
        mapping: [1, 3, 4],
        hashKey: 'ABC134',
        date: Date()
    },
    {
        mapping: [4, 5, 6],
        hashKey: 'ABC456',
        date: Date()
    }
]);
When I now try to find a mapping, I get the expected results:
> db.products.find({ mapping: [4, 5] }).pretty();
{
    "_id" : ObjectId("5d4640281be52eaf11b25dfc"),
    "mapping" : [
        4,
        5
    ],
    "hashKey" : "ABC45",
    "date" : "Sat Aug 03 2019 19:17:12 GMT-0700 (PDT)"
}
The above gives the right result, as the mapping [4,5] (insertion order retained) is present in the DB.
> db.products.find({ mapping: [5,4]}).pretty();
The above returns no result, as expected, since the mapping [5,4] is not present in the DB; the insertion order is retained.
So it seems that "mapping" as a List is working as expected.
I then used Spring Data to read from a MongoDB instance running locally.
The format of the document is:
{
    "_id" : 1,
    "hashKey" : "ABC123",
    "mapping" : [
        1,
        2,
        3
    ],
    "_class" : "com.spring.mongodb.document.Mappings"
}
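For reference, here is a hypothetical reconstruction of what the Spring Data side could look like for this document shape. The class and repository below are my assumptions; only the class name com.spring.mongodb.document.Mappings and the field names come from the post.

import java.util.List;
import java.util.Optional;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;

@Document(collection = "mappings")
public class Mappings {

    @Id
    private Integer id;            // "_id"
    private String hashKey;        // "hashKey"
    private List<Integer> mapping; // "mapping"

    // getters and setters omitted
}

// Derived query on the array field; as far as I know this produces an exact,
// order-sensitive array match, i.e. the equivalent of
// db.mappings.find({ mapping: [1, 2, 3] }) shown below.
interface MappingsRepository extends MongoRepository<Mappings, Integer> {
    Optional<Mappings> findByMapping(List<Integer> mapping);
}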
I inserted 1.7 million records into the DB using org.springframework.boot.CommandLineRunner.
Then a query similar to my last example:
db.mappings.find({ mapping: [1,2,3]})
takes on average 1.05 seconds to find the mapping among the 1.7 M records.
Please share any suggestions to make it faster, and how fast I can expect it to run.
I am not sure about create, update, and delete performance yet.

Spark: after collectAsMap() or collect(), every entry has the same value

I need to read a text file and convert it to a Map.
When I build the JavaPairRDD, it works well.
However, when I convert the JavaPairRDD to a Map, every entry has the same value, specifically the last record in the text file.
inputFile:
1 book:3000
2 pencil:1000
3 coffee:2500
To read the text file, I used a custom Hadoop input format.
Using this format, the key is a number and the value is a custom class Expense<content, price>.
// sc is the JavaSparkContext instance, hadoopConf the Hadoop Configuration
JavaPairRDD<Integer, Expense> inputRDD = sc.newAPIHadoopFile(inputFile, ExpenseInputFormat.class, Integer.class, Expense.class, hadoopConf);
inputRDD:
[1, (book,3000)]
[2, (pencil,1000)]
[3, (coffee,2500)]
However, when I do
Map<Integer,Expense> inputMap = new HashMap<Integer,Expense>(inputRDD.collectAsMap());
inputMap:
[1, (coffee,2500)]
[2, (coffee,2500)]
[3, (coffee,2500)]
As you can see, the keys are inserted correctly, but every value is the last value from the input.
I don't know why this happens.
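An aside that is not part of the original post: a likely cause is that Hadoop RecordReaders reuse the same key/value objects for every record, so collecting the RDD directly leaves the map holding many references to the single (last) Expense instance; Spark's documentation recommends copying such records in a map step before caching or collecting them. A sketch of that workaround, where the Expense(Expense) copy constructor is a hypothetical addition to your class:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Deep-copy each value so the collected map no longer shares one reused object.
JavaPairRDD<Integer, Expense> copiedRDD = inputRDD.mapToPair(
        pair -> new Tuple2<>(pair._1(), new Expense(pair._2()))); // hypothetical copy constructor

Map<Integer, Expense> inputMap = new HashMap<>(copiedRDD.collectAsMap());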

SQL query on JSON data containing an array list

Table - ABC
Record 1 -- {"count": 0, "Groups": ["PADDY TRADERS", "WHEET TRADERS", "DAL TRADERS"], "apmcId": "33500150180006", "isSent": 0, "userId": "M0000020PRFREG2015050636624494USS1"}
Record 2 -- {"count": 0, "Groups": ["X TRADERS", "Y TRADERS", "Z TRADERS"], "apmcId": "566565656", "isSent": 0, "userId": "5435435435345435435435"}
These are the records in the ABC table. I am querying as below, expecting the first record to be returned, but I am not able to get it. Please help me query records that contain list data inside.
"SELECT * FROM ABC WHERE data->>'Groups'->'PADDY TRADERS'";
MySQL does not support JSON directly unless you are on version >= 5.7 (see here for a nice blog post concerning JSON in MySQL 5.7).
Therefore, all you can do is fetch the field in which the JSON is stored, interpret it with the JSON library of your choice, and do whatever you need to do.
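A small sketch of that application-side route, using Jackson as the JSON library (my choice; the method and variable names are made up): select the JSON column with a plain query, then check the Groups array in code.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class GroupFilter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns true if the record's JSON contains the wanted entry in its "Groups" array,
    // e.g. hasGroup(recordJson, "PADDY TRADERS").
    static boolean hasGroup(String recordJson, String wanted) throws Exception {
        JsonNode root = MAPPER.readTree(recordJson);
        for (JsonNode group : root.path("Groups")) {
            if (wanted.equals(group.asText())) {
                return true;
            }
        }
        return false;
    }
}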

Elasticsearch Java client concatenating terms

I am building an Elasticsearch TermsFilterBuilder using the Java client so that it has an array of 3 elements with the values [3, 4, 5]. I can see in the debugger that the "values" property is this array of 3 elements.
However, when the query is sent off to Elasticsearch, it concatenates all of the values like this:
{
    "terms" : {
        "offer.accommodation.rating" : [ 345 ]
    }
}
This is causing my query to fail. Why is it doing this?

MongoDB data model to support unique visitors, per event, per date range

I've got multiple websites, where each website has visitors that "trigger" multiple events I want to track. I have a log of those events from all websites; each event record contains the website-id, the event-name, and the user-id that performed the event (for the sake of simplicity, let's say that's it).
The requirements:
Be able to get, per website-id and event-name, how many unique visitors it had.
This should also support a date range (distinct unique visitors over the range).
I was thinking of creating a collection per "website-id" with the following data model (as an example):
collection ev_{websiteId}:
[
    {
        _id: "error",
        dailyStats: [
            {
                _id: 20121005,   <-- (yyyyMMdd int, should be indexed!)
                hits: 5,
                users: [
                    {
                        _id: 1,  <-- should be indexed!
                        hits: 1
                    },
                    {
                        _id: 2,
                        hits: 3
                    },
                    {
                        _id: 3,
                        hits: 1
                    }
                ]
            },
            {
                _id: 20121004,
                hits: 8,
                users: [
                    {
                        _id: 1,
                        hits: 2
                    },
                    {
                        _id: 2,
                        hits: 3
                    },
                    {
                        _id: 3,
                        hits: 3
                    }
                ]
            }
        ]
    },
    {
        _id: "pageViews",
        dailyStats: [
            {
                _id: 20121005,
                hits: 500,
                users: [
                    {
                        _id: 1,
                        hits: 100
                    },
                    {
                        _id: 2,
                        hits: 300
                    },
                    {
                        _id: 3,
                        hits: 100
                    }
                ]
            },
            {
                _id: 20121004,
                hits: 800,
                users: [
                    {
                        _id: 1,
                        hits: 200
                    },
                    {
                        _id: 2,
                        hits: 300
                    },
                    {
                        _id: 3,
                        hits: 300
                    }
                ]
            }
        ]
    }
]
I'm using the _id to hold the event-id.
I'm using dailyStats._id to hold when it happened (an integer in yyyyMMdd format).
I'm using dailyStats.users._id to represent a user's unique-id hash.
In order to get the unique users, I should basically be able to run a distinct count (map-reduce?) of the items in the array(s) for the given date range (I will convert the date range to yyyyMMdd).
My questions:
Does this data model make sense to you? I'm concerned about the scalability of this model over time (if some client has a lot of daily unique visitors, it may produce a huge document).
I was thinking of deleting dailyStats entries by _id < [date as yyyyMMdd]. This way I can keep my document sizes at a sane number, but still, there are limits here.
Is there an easy way to run an "upsert" that will create the dailyStats entry if it doesn't already exist, add the user if not already present, and increment the "hits" property for both?
What about map-reduce? How would you approach it (I need to run a distinct on users._id across all subdocuments in the given date range)? Is there an easier way with the new aggregation framework (see the sketch after this list)?
By the way, another option for counting unique visitors is Redis bitmaps, but I am not sure it's worth maintaining multiple data stores (maintenance-wise).
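For what it's worth, here is a rough sketch (mine, not from the thread) of how that distinct count could be expressed with the aggregation framework via the MongoDB Java driver, run against the proposed ev_{websiteId} collection; the event name and date bounds are just example values.

import static com.mongodb.client.model.Aggregates.count;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Aggregates.unwind;
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gte;
import static com.mongodb.client.model.Filters.lte;

import java.util.Arrays;

import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class UniqueVisitors {

    // Unique visitors for the "error" event between 20121001 and 20121005 (inclusive).
    static int uniqueVisitors(MongoCollection<Document> events) {
        Document result = events.aggregate(Arrays.asList(
                match(eq("_id", "error")),
                unwind("$dailyStats"),
                match(and(gte("dailyStats._id", 20121001), lte("dailyStats._id", 20121005))),
                unwind("$dailyStats.users"),
                group("$dailyStats.users._id"),   // one bucket per distinct user id
                count("uniqueVisitors")           // count the buckets
        )).first();
        return result == null ? 0 : result.getInteger("uniqueVisitors");
    }
}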
A few comments on the architecture above.
I'm slightly worried, as you've pointed out, about the scalability and how much pre-aggregation you're really doing.
Most of the Mongo instances I've worked with for metrics have cases similar to what you pointed out, but you seem to be relying heavily on updating a single document and upserting various parts of it, which is going to slow down and potentially cause a bit of locking.
I might suggest a different path, one that Mongo even suggests when talking with them about doing metrics. Seeing as you already have a structure in mind, I'd create something along the lines of:
{
    "_id": "20121005_siteKey_page",
    "hits": 512,
    "users": [
        {
            "uid": 5,
            "hits": 512
        }
    ]
}
This way you are limiting your document sizes to something that will be reasonable to do quick upserts on. From here you can run map-reduce jobs in batches to further extend what you're looking to see.
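On the upsert question: a rough two-step pattern with the MongoDB Java driver (my own sketch, not something from the original answer) against that per-day/site/page document shape. The first update handles a user already present in the array, the second creates the document and/or the user entry when missing.

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.combine;
import static com.mongodb.client.model.Updates.inc;
import static com.mongodb.client.model.Updates.push;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

public class HitRecorder {

    // e.g. recordHit(events, "20121005_siteKey_page", 5)
    static void recordHit(MongoCollection<Document> events, String docId, int uid) {
        // 1) User already in the array: bump both counters via the positional operator.
        UpdateResult r = events.updateOne(
                and(eq("_id", docId), eq("users.uid", uid)),
                combine(inc("hits", 1), inc("users.$.hits", 1)));

        // 2) Otherwise upsert the document and/or push a fresh user entry.
        if (r.getMatchedCount() == 0) {
            events.updateOne(
                    eq("_id", docId),
                    combine(inc("hits", 1),
                            push("users", new Document("uid", uid).append("hits", 1))),
                    new UpdateOptions().upsert(true));
        }
    }
}

There is a small race between the two statements under heavy concurrency, but for metrics counters that is usually an acceptable trade-off.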
It also depends on your end goal: are you looking to provide real-time metrics? What sort of granularity are you attempting to get? Redis bitmaps may be something you want to at least look at (great article here).
Regardless, it is a fun problem to solve :)
Hope this has helped!
