I have 1,000 records in a topic. I am trying to filter the records from the input topic to the output topic based on the salary.
For example: I want the records of people whose salary is higher than 30,000.
I am trying to use Kafka Streams (KStreams) with Java for this.
The records are in text format (comma-separated), for example:
first_name, last_name, email, gender, ip_address, country, salary
Redacted,Tranfield,user#example.com,Female,45.25.XXX.XXX,Russia,$12345.01
Redacted,Merck,user#example.com,Male,236.224.XXX.XXX,Belarus,$54321.96
Redacted,Kopisch,user#example.com,Male,61.36.XXX.XXX,Morocco,$12345.05
Redacted,Edds,user#example.com,Male,6.87.XXX.XXX,Poland,$54321.72
Redacted,Alston,user#example.com,Female,56.146.XXX.XXX,Indonesia,$12345.16
...
This is my code:
public class StreamsStartApp {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-starter-app");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream from Kafka topic
        KStream<Long, Long> newInput = builder.stream("word-count-input");

        KStream<Long, Long> usersAndColours = newInput
                // step 1 - we ensure that a comma is here as we will split on it
                .filter((key, value) -> value.contains(","))
                // step 2 - we select a key that will be the user id
                .selectKey((key, value) -> value.split(",")[6]);
                // step 3 - got stuck here.
                // .filter(key -> key.value[6] > 30000)
                // .selectKey((new1, value1) -> value1.split(",")[3])
                // .filter((key, value) -> key.greater(10));
                // .filter((key, value) -> key > 10);
                // .filter(key -> key.getkey().intValue() > 10);

        usersAndColours.to("new-output");

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
In the code above, at step 1 I make sure each record contains a ',' so it can be split on it.
At step 2 I select one field, the salary field, as the key.
Now at step 3 I am trying to filter the data on the salary field.
I tried a few approaches, which are commented out above, but nothing worked.
Any ideas would help.
First, both your key and value use String serdes, not Longs, so KStream<Long, Long> is not correct.
Also, value.split(",")[6] is just a String, not a Double (and you can't use a Long, since the values have decimals).
You need to strip the $ from that column and parse the string to a Double; then you can filter on it. And it's not key.value[6], because your key is not an object with a value field.
You should probably also make the email the key rather than the salary, if you even need a key, that is.
Realistically, you can do this in one line (split across two here for readability):
newInput.filter((key, value) -> value.contains(",") &&
        Double.parseDouble(value.split(",")[6].replace("$", "")) > 30000);
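Putting it together, here is a minimal end-to-end sketch (the topic names and the KafkaStreams start/shutdown boilerplate are assumptions based on the question's code, not a confirmed setup):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SalaryFilterApp {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "salary-filter-app");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("word-count-input");

        input
                // keep only well-formed CSV lines with all 7 columns
                .filter((key, value) -> value.split(",").length == 7)
                // strip the leading $ and keep salaries above 30,000
                .filter((key, value) ->
                        Double.parseDouble(value.split(",")[6].replace("$", "")) > 30000)
                .to("new-output");

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}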
I have an input object:
@Getter
class Txn {
    private String hash;
    private String withdrawId;
    private String depositId;
    private Integer amount;
    private String date;
}
and the output object is:
@Builder
@Getter
class UserTxn {
    private String hash;
    private String walletId;
    private String txnType;
    private Integer amount;
}
A Txn object transfers the amount from withdrawId to depositId.
What I am doing is adding up all the transactions (Txn objects) into a single amount, grouped by hash.
But for that I have to create two streams, one grouping by withdrawId and a second grouping by depositId, and then a third stream to merge them.
Grouping by withdrawId:
var withdrawStream = txnList.stream().collect(Collectors.groupingBy(Txn::getHash, LinkedHashMap::new,
Collectors.groupingBy(Txn::getWithdrawId, LinkedHashMap::new, Collectors.toList())))
.entrySet().stream().flatMap(hashEntrySet -> hashEntrySet.getValue().entrySet().stream()
.map(withdrawEntrySet ->
UserTxn.builder()
.hash(hashEntrySet.getKey())
.walletId(withdrawEntrySet.getKey())
.txnType("WITHDRAW")
.amount(withdrawEntrySet.getValue().stream().map(Txn::getAmount).reduce(0, Integer::sum))
.build()
));
Grouping by depositId:
var depositStream = txnList.stream().collect(Collectors.groupingBy(Txn::getHash, LinkedHashMap::new,
        Collectors.groupingBy(Txn::getDepositId, LinkedHashMap::new, Collectors.toList())))
        .entrySet().stream().flatMap(hashEntrySet -> hashEntrySet.getValue().entrySet().stream()
                .map(depositEntrySet ->
                        UserTxn.builder()
                                .hash(hashEntrySet.getKey())
                                .walletId(depositEntrySet.getKey())
                                .txnType("DEPOSIT")
                                .amount(depositEntrySet.getValue().stream().map(Txn::getAmount).reduce(0, Integer::sum))
                                .build()
                ));
Then merging them again, computing deposits minus withdrawals:
var res = Stream.concat(withdrawStream, depositStream).collect(Collectors.groupingBy(UserTxn::getHash, LinkedHashMap::new,
Collectors.groupingBy(UserTxn::getWalletId, LinkedHashMap::new, Collectors.toList())))
.entrySet().stream().flatMap(hashEntrySet -> hashEntrySet.getValue().entrySet().stream()
.map(withdrawEntrySet -> {
var depositAmount = withdrawEntrySet.getValue().stream().filter(userTxn -> userTxn.getTxnType().equals("DEPOSIT")).map(UserTxn::getAmount).reduce(0, Integer::sum);
var withdrawAmount = withdrawEntrySet.getValue().stream().filter(userTxn -> userTxn.getTxnType().equals("WITHDRAW")).map(UserTxn::getAmount).reduce(0, Integer::sum);
var totalAmount = depositAmount - withdrawAmount;
return UserTxn.builder()
.hash(hashEntrySet.getKey())
.walletId(withdrawEntrySet.getKey())
.txnType(totalAmount > 0 ? "DEPOSIT": "WITHDRAW")
.amount(totalAmount)
.build();
}
));
My question is: how can I do this in one stream?
That is, by somehow making the grouping by withdrawId and depositId one grouping.
Something like:
res = txnList.stream()
.collect(Collectors.groupingBy(Txn::getHash,
LinkedHashMap::new,
Collectors.groupingBy(Txn::getWithdrawId && Txn::getDepositId,
LinkedHashMap::new, Collectors.toList())))
.entrySet().stream().flatMap(hashEntrySet -> hashEntrySet.getValue().entrySet().stream()
.map(walletEntrySet ->
{
var totalAmount = walletEntrySet.getValue().stream().map(
txn -> Objects.equals(txn.getDepositId(), walletEntrySet.getKey())
? txn.getAmount() : (-txn.getAmount())).reduce(0, Integer::sum);
return UserTxn.builder()
.hash(hashEntrySet.getKey())
.walletId(walletEntrySet.getKey())
.txnType("WITHDRAW")
.amount(totalAmount)
.build();
}
));
TL;DR
For those who didn't understand the question: the OP wants to generate, from each Txn instance (Txn presumably stands for transaction), two pieces of data: hash and withdrawId + the aggregated amount, and hash and depositId + the aggregated amount.
They then want to merge the two parts together (for that reason they were creating two streams and then concatenating them).
Note: there seems to be a logical flaw in the original code: the same amount gets associated with both withdrawId and depositId, which doesn't reflect that this amount has been taken from one account and transferred to another. Hence, it would make sense to use the amount as is for depositId, and negated (i.e. -1 * amount) for withdrawId.
Collectors.teeing()
You can make use of the Java 12 collector teeing() and internally group the stream elements into two distinct maps:
the first one grouping the stream data by withdrawId and hash,
and the other grouping the data by depositId and hash.
teeing() expects three arguments: two downstream collectors and a function combining the results produced by those collectors.
As the downstreams of teeing() we can use a combination of the collectors groupingBy() and summingInt(); the latter is needed to accumulate the integer amount of the transactions.
Note that there's no need for a nested groupingBy(); instead we can create a custom type that holds the id and hash (with equals/hashCode implemented based on the wrapped id and hash). A Java 16 record fits this role perfectly well:
public record HashWalletId(String hash, String walletId) {}
Instances of HashWalletId will be used as keys in both intermediate maps.
The finisher function of teeing() merges the results of the two maps together.
The only thing left is to generate instances of UserTxn from the map entries.
List<Txn> txnList = // initializing the list
List<UserTxn> result = txnList.stream()
.collect(Collectors.teeing(
Collectors.groupingBy(
txn -> new HashWalletId(txn.getHash(), txn.getWithdrawId()),
Collectors.summingInt(txn -> -1 * txn.getAmount())), // because amount has been withdrawn
Collectors.groupingBy(
txn -> new HashWalletId(txn.getHash(), txn.getDepositId()),
Collectors.summingInt(Txn::getAmount)),
(map1, map2) -> {
map2.forEach((k, v) -> map1.merge(k, v, Integer::sum));
return map1;
}
))
.entrySet().stream()
.map(entry -> UserTxn.builder()
.hash(entry.getKey().hash())
.walletId(entry.getKey().walletId())
.txnType(entry.getValue() > 0 ? "DEPOSIT" : "WITHDRAW")
.amount(entry.getValue())
.build()
)
.toList(); // remove the terminal operation if your goal is to produce a Stream
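For a quick sanity check, here is a hypothetical example (it assumes Txn has an all-args constructor, which the original class doesn't show):
// One transfer of 100 from wallet "A" to wallet "B", recorded under hash "h1"
List<Txn> txnList = List.of(new Txn("h1", "A", "B", 100, "2022-01-01"));

// result then contains two entries:
//   UserTxn(hash=h1, walletId=A, txnType=WITHDRAW, amount=-100)
//   UserTxn(hash=h1, walletId=B, txnType=DEPOSIT, amount=100)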
I wouldn't use this in my own code because I think it's not readable and will be very hard to change and maintain in the future (SOLID).
But in case you still want this:
If I got your design right, hash is unique per user and a transaction will only ever have a deposit or a withdrawal; if so, this will work.
You could triple-group via collector chaining, like you did in your example.
You can then create the output objects via a simple map function; just check which id is null.
Map<String, Map<String, Map<String, List<Txn>>>> groupBy =
txnList.stream()
.collect(Collectors.groupingBy(Txn::getHash, LinkedHashMap::new,
Collectors.groupingBy(Txn::getDepositId, LinkedHashMap::new,
Collectors.groupingBy(Txn::getWithdrawId, LinkedHashMap::new, Collectors.toList()))));
Then use the logic from your example on this stream.
My app gets a string from a web service. It looks like this:
name=Raul&city=Paris&id=167136
I want to get a map from this string:
{name=Raul, city=Paris, id=167136}
Code:
Arrays.stream(input.split("&"))
        .map(sub -> sub.split("="))
        .collect(Collectors.toMap(tokens -> tokens[0], tokens -> tokens[1]));
It's okay and works in most cases, but the app can get a string with duplicate keys, like this:
name=Raul&city=Paris&id=167136&city=Oslo
The app will crash with the following uncaught exception:
Exception in thread "main" java.lang.IllegalStateException: Duplicate key city (attempted merging values Paris and Oslo)
I tried to change the collect method:
.collect(Collectors.toMap(tokens -> tokens[0], tokens -> tokens[1]), (r, strings) -> strings[0]);
But the compiler says no:
Cannot resolve method 'collect(java.util.stream.Collector<T,capture<?>,java.util.Map<K,U>>, <lambda expression>)'
And Array type expected; found: 'T'
I guess it's because I have an array. How do I fix it?
You are misunderstanding the final argument of toMap (the merge function). When it finds a duplicate key, it hands the current value in the map and the new value with the same key to the merge function, which produces the single value to store.
For example, if you want to keep just the first value found, use (s1, s2) -> s1. If you want to comma-separate the values, use (s1, s2) -> s1 + ", " + s2.
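Applied to the code from the question, a sketch keeping the first value for each duplicate key might look like this:
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryStringParser {
    public static void main(String[] args) {
        String input = "name=Raul&city=Paris&id=167136&city=Oslo";

        Map<String, String> result = Arrays.stream(input.split("&"))
                .map(sub -> sub.split("="))
                // the third argument resolves duplicate keys: keep the first value
                .collect(Collectors.toMap(tokens -> tokens[0],
                        tokens -> tokens[1],
                        (first, second) -> first));

        System.out.println(result); // e.g. {name=Raul, city=Paris, id=167136} (HashMap order may vary)
    }
}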
If you want to collect the values of duplicated keys together, grouped by key (since the app can get a string with duplicate keys), then instead of Collectors.toMap() you can use Collectors.groupingBy with a custom collector (Collector.of(...)):
String input = "name=Raul&city=Paris&city=Berlin&id=167136&id=03&id=505";
Map<String, Set<Object>> result = Arrays.stream(input.split("&"))
.map(splitString -> splitString.split("="))
.filter(keyValuePair -> keyValuePair.length == 2)
.collect(
Collectors.groupingBy(array -> array[0], Collector.of(
() -> new HashSet<>(), (set, array) -> set.add(array[1]),
(left, right) -> {
if (left.size() < right.size()) {
right.addAll(left);
return right;
} else {
left.addAll(right);
return left;
}
}, Collector.Characteristics.UNORDERED)
)
);
This way you'll get:
result => size = 3
"city" -> size = 2 ["Berlin", "Paris"]
"name" -> size = 1 ["Raul"]
"id" -> size = 3 ["167136","03","505"]
You can achieve the same result using Kotlin collections:
val res = message
.split("&")
.map {
val entry = it.split("=")
Pair(entry[0], entry[1])
}
println(res)
println(res.toMap()) // toMap() keeps the last value for each duplicate key
The result is
[(name, Raul), (city, Paris), (id, 167136), (city, Oslo)]
{name=Raul, city=Oslo, id=167136}
I have a few problems creating a KTable with a time window in Kafka.
I want to create a table that counts the number of IDs in the stream, like this:
ID (String) | Count (Long)
X | 5
Y | 6
Z | 7
and so forth. I want to be able to get the table using the Kafka REST API, preferably as JSON.
Here's my code at the moment:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> streams = builder.stream(srcTopic);
KTable<Windowed<String>, Long> numCount = streams
.flatMapValues(value -> getID(value))
.groupBy((key, value) -> value)
.windowedBy(TimeWindows.of(windowSizeMs).advanceBy(advanceMs))
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("foo"));
The problem I'm facing right now is that the table isn't created as <String, Long> but as <String, String> instead. That means I can't get the correct count: I receive the correct key, but the counts are corrupted. I've tried to force the value out as a Long using Long.valueOf(value), without success. I don't know how to proceed from here.
Do I need to write the KTable to a new topic? Since I want the table to be queryable via the Kafka REST API, I don't think that's needed, am I right? The Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("foo") should make it queryable as "foo", right?
The KTable creates a changelog topic; is that enough to make it queryable, or do I have to create a new topic for it to write to?
I'm using another KStream to verify the output now.
KStream<String, String> streamOut = builder.stream(srcTopic);
streamOut.foreach((key, value) -> System.out.println(key + " => " + value));
and it outputs:
ID COUNT
2855 => ~
2857 => �
2859 => �
2861 => V(
2863 => �
2874 => �
2877 => J
2880 => �2
2891 => �=
Either way, I don't really want to use a KStream to collect the output; I want to query the KTable. But as mentioned, I don't really understand how the querying works.
Update
Managed to get it to work with:
ReadOnlyWindowStore<String, Long> windowStore =
        kafkaStreams.store("foo", QueryableStoreTypes.windowStore());
long timeFrom = 0;
long timeTo = System.currentTimeMillis(); // now (in processing-time)
WindowStoreIterator<Long> iterator = windowStore.fetch("x", timeFrom, timeTo);
while (iterator.hasNext()) {
KeyValue<Long, Long> next = iterator.next();
long windowTimestamp = next.key;
System.out.println(windowTimestamp + ":" + next.value);
}
Many thanks in advance,
The output type of the KTable is <Windowed<String>, Long> because in Kafka Streams multiple windows are maintained in parallel to allow handling out-of-order data. So there isn't a single window instance, but many window instances in parallel (cf. https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows).
Keeping "older" windows allows updating them when data arrives late. Note that Kafka Streams semantics are based on event time.
You can still query the KTable; you only need to know which window you want to query.
Update
The JavaDoc describe how to query the table: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedKStream.java#L94-L101
KafkaStreams streams = ... // counting words
String queryableStoreName = ... // the name of the store, as defined by the Materialized instance
ReadOnlyWindowStore<String, Long> localWindowStore = streams.store(queryableStoreName, QueryableStoreTypes.<String, Long>windowStore());
String key = "some-word";
long timeFrom = ...;
long timeTo = ...;
WindowStoreIterator<Long> countForWordsForWindows = localWindowStore.fetch(key, timeFrom, timeTo); // the key must be local (application state is sharded over all running Kafka Streams instances)
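As a follow-up note (my addition, not part of the original answer): the corrupted counts in the verification KStream above come from reading Long-serialized values with the default String serde; consuming the count output with Serdes.Long() as the value serde would print proper numbers.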
https://kafka.apache.org/10/documentation/streams/quickstart
I have a question about counting words within a message using Kafka Streams. Essentially, I'd like to count the total number of words, rather than count each instance of a word.
So, instead of
all 1
streams 1
lead 1
to 1
kafka 1
I need
totalWordCount 5
or something similar.
I tried a variety of things in this part of the code:
KTable<String, Long> wordCounts = textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> value)
.count();
such as adding .selectKey((key, value) -> "totalWordCount") in an attempt to change each key (all, streams, etc.) to totalWordCount, thinking it would increment a single count.
I've also tried to edit my code using this to try to achieve the total word count.
I have not succeeded, and after doing some more reading I now think I have been approaching this incorrectly. It seems as if what I need to do is have 3 topics (I've been working with only 2) and 2 producers, where the last producer somehow takes data from the first producer (which shows the count of each word) and adds up the numbers to output the total number of words, but I'm not entirely sure how to approach it. Any help/guidance is greatly appreciated. Thanks.
Where did you put the selectKey()? The idea is basically correct, but note that groupBy() sets the key, too.
KTable<String, Long> wordCounts = textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> "totalWordCount")
.count();
or (using groupByKey() to not change the key before the aggregation)
KTable<String, Long> wordCounts = textLines
.selectKey((key, value) -> "totalWordCount")
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupByKey()
.count();
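For reference, the snippet below is a Spring for Apache Kafka (@EnableKafkaStreams) configuration that uppercases an input stream and maintains a per-word count; it counts each word rather than the total, but shows the same groupByKey().count() pattern wired as a @Bean: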
@Configuration
@EnableKafkaStreams
public class FirstStreamApp {

    @Bean
    public KStream<String, String> process(StreamsBuilder builder) {
        KStream<String, String> inputStream = builder.stream("streamIn", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> upperCaseStream = inputStream.mapValues(value -> value.toUpperCase());
        upperCaseStream.to("outTopic", Produced.with(Serdes.String(), Serdes.String()));

        KTable<String, Long> wordCounts = upperCaseStream
                .flatMapValues(v -> Arrays.asList(v.split(" ")))
                .selectKey((k, v) -> v)
                .groupByKey()
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
        wordCounts.toStream().to("wordCountTopic", Produced.with(Serdes.String(), Serdes.Long()));

        return upperCaseStream;
    }
}
I have an object with city and zip fields, let's call it Record.
public class Record {
private String zip;
private String city;
//getters and setters
}
Now, I have a collection of these objects, and I group them by zip using the following code:
final Collection<Record> records; //populated collection of records
final Map<String, List<Record>> recordsByZip = records.stream()
.collect(Collectors.groupingBy(Record::getZip));
So, now I have a map where the key is the zip and the value is a list of Record objects with that zip.
What I want to get now is the most common city for each zip.
recordsByZip.forEach((zip, records) -> {
final String mostCommonCity = //get most common city for these records
});
I would like to do this entirely with stream operations. For example, I am able to get a map of the frequency of each city by doing this:
recordsByZip.forEach((zip, entries) -> {
final Map<String, Long> frequencyMap = entries.stream()
.map(Record::getCity)
.filter(StringUtils::isNotBlank)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
});
But I would like a single stream operation that just returns the most frequent city.
Are there any Java 8 stream gurus out there who can work some magic on this?
Here is an ideone sandbox if you'd like to play around with it.
You could have the following:
final Map<String, String> mostFrequentCities =
records.stream()
.collect(Collectors.groupingBy(
Record::getZip,
Collectors.collectingAndThen(
Collectors.groupingBy(Record::getCity, Collectors.counting()),
map -> map.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey()
)
));
This groups the records by zip and, within each zip, by city, counting the number of records per city. The resulting map of city counts per zip is then post-processed to keep only the city with the maximum count.
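A quick usage sketch (hypothetical data, assuming for brevity that Record has a (zip, city) constructor, which the original class doesn't show):
List<Record> records = List.of(
        new Record("30301", "Atlanta"),
        new Record("30301", "Atlanta"),
        new Record("30301", "Decatur"),
        new Record("98101", "Seattle"));

// mostFrequentCities => {30301=Atlanta, 98101=Seattle}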
I think a Multiset is a good fit for this kind of question. Here is the code using abacus-common:
Stream.of(records).map(e -> e.getCity()).filter(N::notNullOrEmpty)
.toMultiset().maxOccurrences().get().getKey();
Disclosure: I'm the developer of abacus-common.