Can not modify value in JavaRDD - java

I have a question about how to update JavaRDD values.
I have a JavaRDD<CostedEventMessage> with message objects containing information about to which partition of kafka topic it should be written to.
I'm trying to change the partitionId field of such objects using the following code:
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
where the repartitionEvent logic is:
costedEventMessage.setPartitionId(1);
return costedEventMessage;
But the modification does not happen.
Could you please advice why and how to correctly modify values in a JavaRDD?

Spark is lazy, so from the code you pasted above it's not clear if you actually performed any action on the JavaRDD (like collect or forEach) and how you came to the conclusion that data was not changed.
For example, if you assumed that by running the following code:
List<CostedEventMessage> messagesLst = ...;
JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
Each element in messagesLst would have partition set to 1, you are wrong.
That would hold true if you added for example:
messagesLst = rddToKafka.collect();
For more details refer to documentation

Related

Kafka GroupTable tests generating extra messages when using ProcessorTopologyTestDriver

I've written a stream that takes in messages and sends out a table of the keys that have appeared. If something appears, it will show a count of 1. This is a simplified version of my production code in order to demonstrate the bug. In a live run, a message is sent out for each message received.
However, when I run it in a unit test using ProcessorTopologyTestDriver, I get a different behavior. If a key that has already been seen before is received, I get an extra message.
If I send messages with keys "key1", then "key2", then "key1", I get the following output.
key1 - 1
key2 - 1
key1 - 0
key1 - 1
For some reason, it decrements the value before adding it back in. This only happens when using ProcessorTopologyTestDriver. Is this expected? Is there a work around? Or is this a bug?
Here's my topology:
final StreamsBuilder builder = new StreamsBuilder();
KGroupedTable<String, String> groupedTable
= builder.table(applicationConfig.sourceTopic(), Consumed.with(Serdes.String(), Serdes.String()))
.groupBy((key, value) -> KeyValue.pair(key, value), Serialized.with(Serdes.String(), Serdes.String()));
KTable<String, Long> countTable = groupedTable.count();
KStream<String, Long> countTableAsStream = countTable.toStream();
countTableAsStream.to(applicationConfig.outputTopic(), Produced.with(Serdes.String(), Serdes.Long()));
Here's my unit test code:
TopologyWithGroupedTable top = new TopologyWithGroupedTable(appConfig, map);
Topology topology = top.get();
ProcessorTopologyTestDriver driver = new ProcessorTopologyTestDriver(config, topology);
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key2", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
ProducerRecord<String, Long> outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key2", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value()); //this fails, I get 0. If I pull another message, it shows key1 with a count of 1
Here's a repo of the full code:
https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/
Stream topology: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/main/java/com/nick/kstreams/TopologyWithGroupedTable.java
Test code: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/test/java/com/nick/kstreams/TopologyWithGroupedTableTests.java
It's not a bug, but behavior by design (c.f. explanation below).
The difference in behavior is due to KTable state store caching (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html). When you run the unit test, the cache is flushed after each record, while in your production run, this is not the case. If you disable caching in your production run, I assume that it behaves the same as in your unit test.
Side remark: ProcessorTopologyTestDriver is an internal class and not part of public API. Thus, there is no compatibility guarantee. You should use the official unit-test packages instead: https://docs.confluent.io/current/streams/developer-guide/test-streams.html
Why do you see two records:
In your code, you are using a KTable#groupBy() and in your specific use case, you don't change the key. However, in general, the key might be changed (depending on the value of the input KTable. Thus, if the input KTable is changed, the downstream aggregation needs to remove/subtract the old key-value pair from the aggregation result, and add the new key-value pair to the aggregation result—in general, the key of the old and new pair are different and thus, it's required to generate two records because the subtraction and addition could happen on different instances as different keys might be hashed differently. Does this make sense?
Thus, for each update of the input KTable, two updates two the result KTable on usually two different key-value pairs need to be computed. For you specific case, in which the key does not change, Kafka Stream does the same thing (there is no check/optimization for this case to "merge" both operations into one if the key is actually the same).

Spark reduceByKey function seems not working with single one key

I have a 5 row records in mysql, like
sku:001 seller:A stock:UK margin:10
sku:002 seller:B stock:US margin:5
sku:001 seller:A stock:UK margin:10
sku:001 seller:A stock:UK margin:3
sku:001 seller:A stock:UK margin:7
And I've this rows read into spark and transformed them into
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
Seems like works fine until now.
However, When I used the reduceByKey function to sum the margin as the structure like:
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
the final result got 2 elements
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<margin,xxx>).
JavaPairRDD<Tuple3<String,String,String>, Map>(<sku,seller,stock>, Map<marginSummary, xxx>).
seems like the row2 didn't enter the reduceByKey function body. I was wondering why?
It is expected outcome. func is called only when objects for a single key are merged. If there is only one key, there is no reason to call it.
Unfortunately it looks like you have a bigger problem, which can be inferred from you question. You are trying to change the type of the value in reduceByKey. In general it shouldn't even compile as reduceByKey takes Function2<V,V,V> - input and output types have to be identical.
If you want to change a type, you should use either combineByKey
public <C> JavaPairRDD<K,C> combineByKey(Function<V,C> createCombiner,
Function2<C,V,C> mergeValue,
Function2<C,C,C> mergeCombiners)
or aggregateByKey
public <U> JavaPairRDD<K,U> aggregateByKey(U zeroValue,
Function2<U,V,U> seqFunc,
Function2<U,U,U> combFunc)
Both can change the types and fixed your current problem. Please refer to Java test suite for examples: 1 and 2.

Java 8 Stream API Filtering

I have a collection of objects, with the following type:
{
String action_name; //add or delete
long action_time;
String action_target;
}
Need to get the latest merged operation on each action_target
Sample input data:
[add|1001|item1, add|1002|item2, delete|1003|item1, add|1004|item1]
Expected result:
[add|1002|item2, add|1004|item1]
Sample input data:
[add|1001|item1, add|1002|item2, delete|1003|item1]
Expected result:
[add|1002|item2]
Sample input data:
[delete|1001|item1, add|1002|item2, add|1003|item1]
Expected result:
[add|1002|item2, add|1003|item1]
Is this approachable using Java8 stream APIs? Thanks.
You want to group by one criteria (the action_target) combined with reducing the groups to the maximum of their action_time values:
Map<String,Item> map=items.stream().collect(
Collectors.groupingBy(item->item.action_target,
Collectors.collectingAndThen(
Collectors.maxBy(Comparator.comparing(item->item.action_time)),
Optional::get)));
This returns a Map<String,Item> but, of course, you may call values() on it to get a collection of items.
Beautified with static imports, the code looks like:
Map<String,Item> map=items.stream().collect(groupingBy(item->item.action_target,
collectingAndThen(maxBy(comparing(item->item.action_time)), Optional::get)));
Your additional request of taking care of idempotent "add" and follow-up "delete" actions can be simplified to “remove items whose last action is "delete"” which can be implemented just by doing that after collecting using a mutable map:
HashMap<String,Item> map=items.stream().collect(groupingBy(
item->item.action_target, HashMap::new,
collectingAndThen(maxBy(comparing(item->item.action_time)), Optional::get)));
map.values().removeIf(item->item.action_name.equals("delete"));

hbase: querying for specific value with dynamically created qualifier

Hy,
Hbase allows a column family to have different qualifiers in different rows. In my case a column family has the following specification
abc[cnt] # where cnt is an integer that can be any positive integer
what I want to achieve is to get all the data from a different column family, only if the value of the described qualifier (in a different column family) matches.
for narrowing the Scan down I just add those two families I need for the query. but that is as far as I could get for now.
I already achieved the same behaviour with a SingleColumnValueFilter, but then the qualifier was known in advance. but for this one the qualifier can be abc1, abc2 ... there would be too many options, thus too many SingleColumnValueFilter's.
Then I tried using the ValueFilter, but this filter only returns those columns that match the value, thus the wrong column family.
Can you think of any way to achieve my goal, querying for a value within a dynamically created qualifier in a column family and returning the contents of the column family and another column family (as specified when creating the Scan)? preferably only querying once.
Thanks in advance for any input.
UPDATE: (for clarification as discussed in the comments)
in a more graphical way, a row may have the following:
colfam1:aaa
colfam1:aab
colfam1:aac
colfam2:abc1
colfam2:abc2
whereas I want to get all of the family colfam1 if any value of colfam2 has e.g. the value x, with regard to the fact that colfam2:abc[cnt] is dynamically created with cnt being any positive integer
I see two approaches for this: client-side filtering or server-side filtering.
Client-side filtering is more straightforward. The Scan adds only the two families "colfam1" and "colfam2". Then, for each Result you get from scanner.next(), you must filter according to the qualifiers in "colfam2".
byte[] queryValue = Bytes.toBytes("x");
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1");
scan.addFamily(Bytes.toBytes("colfam2");
ResultScanner scanner = myTable.getScanner(scan);
Result res;
while((res = scanner.next()) != null) {
NavigableMap<byte[],byte[]> colfam2 = res.getFamilyMap(Bytes.toBytes("colfam2"));
boolean foundQueryValue = false;
SearchForQueryValue: while(!colfam2.isEmpty()) {
Entry<byte[], byte[]> cell = colfam2.pollFirstEntry();
if( Bytes.equals(cell.getValue(), queryValue) ) {
foundQueryValue = true;
break SearchForQueryValue;
}
}
if(foundQueryValue) {
NavigableMap<byte[],byte[]> colfam1 = res.getFamilyMap(Bytes.toBytes("colfam1"));
LinkedList<KeyValue> listKV = new LinkedList<KeyValue>();
while(!colfam1.isEmpty()) {
Entry<byte[], byte[]> cell = colfam1.pollFirstEntry();
listKV.add(new KeyValue(res.getRow(), Bytes.toBytes("colfam1"), cell.getKey(), cell.getValue());
}
Result filteredResult = new Result(listKV);
}
}
(This code was not tested)
And then finally filteredResult is what you want. This approach is not elegant and might also give you performance issues if you have a lot of data in those families. If "colfam1" has a lot of data, you don't want to transfer it to the client if it will end up not being used if value "x" is not in a qualifier of "colfam2".
Server-side filtering. This requires you to implement your own Filter class. I believe you cannot use the provided filter types to do this. Implementing your own Filter takes some work, you also need to compile it as a .jar and make it available to all RegionServers. But then, it helps you to avoid sending loads of data of "colfam1" in vain.
It is too much work for me to show you how to custom implement a Filter, so I recommend reading a good book (HBase: The Definitive Guide for example). However, the Filter code will look pretty much like the client-side filtering I showed you, so that's half of the work done.

retrieving the values from the nested hashmap

I have a XML file with many copies of table node structure as below:
<databasetable TblID=”123” TblName=”Department1_mailbox”>
<SelectColumns>
<Slno>dept1_slno</Slno>
<To>dept1_to</To>
<From>dept1_from</From>
<Subject>dept1_sub</Subject>
<Body>dept1_body</Body>
<BCC>dept1_BCC</BCC>
<CC>dept1_CC</CC>
</SelectColumns>
<WhereCondition>MailSentStatus=’New’</WhereCondition>
<UpdateSuccess>
<MailSentStatus>’Yes’</MailSentStatus>
<MailSentFailedReason>’Mail Sent Successfully’</MailSentFailedReason>
</UpdateSuccess>
<UpdateFailure>
<MailSentStatus>’No’</MailSentStatus>
<MailSentFailedReason>’Mail Sending Failed ’</MailSentFailedReason>
</ UpdateFailure>
</databasetable>
As it is not an efficient manner to traverse the file for each time to fetch the details of each node for the queries in the program, I used the nested hashmap concept to store the details while traversing the XML file for the first time. The structure I used is as below:
MapMaster
Key Value
123 MapDetails
Key Value
TblName Department1_mailbox
SelectColumns mapSelect
Key Value
Slno dept1_slno
To dept1_to
From dept1_from
Subject dept1_sub
Body dept1_body
BCC dept1_BCC
CC dept1_CC
WhereCondition MailSentStatus=’New’
UpdateSuccess mapUS
MailSentStatus ’Yes’
MailSentFailedReason ’Mail Sent Successfully’
UpdateFailure mapUF
MailSentStatus ’No’
MailSentFailedReason ’Mail Sending Failed’
But the problem I’m facing now is regarding retrieving the Value part using the nested Keys. For example,
If I need the value of Slno Key, I have to specify TblID, SelectColumns, Slno in nested form like:
Stirng Slno = ((HashMap)((HashMap)mapMaster.get(“123”))mapDetails.get(“SelectColumns”))mapSelect.get(“Slno”);
This is unconvinent to use in a program. Please suggest a solution but don’t tell that iterators are available. As I’ve to fetch the individual value from the map according to the need of my program.
EDIT:my program has to fetch the IDs of the department for which there is privilege to send mails and then these IDs are compared with the IDs in XML file. Only information of those IDs are fetched from XML which returned true in comparison. This is all my program. Please help.
Thanks in advance,
Vishu
Never cast to specific Map implementation. Better use casting to Map interface, i.e.
((Map)one.get("foo")).get("bar")
Do not use casting in your case. You can define collection using generics, so compiler will do work for you:
Map<String, Map> one = new HashMap<String, Map>();
Map<String, Integer> two = new HashMap<String, Integer>();
Now your can say:
int n = one.get("foo").get("bar");
No casting, no problems.
But the better solution is not to use nested tables at all. Create your custom classes like SelectColumns, WhereCondition etc. Each class should have appropriate private fields, getters and setters. Now parse your XML creating instance of these classes. And then use getters to traverse the data structure.
BTW if you wish to use JAXB you do not have to do almost anything! Something like the following:
Unmarshaller u = JAXBContext.newInstance(SelectColumns.class, WhereCondition.class).createUnmarshaller();
SelectColumns[] columns = (SelectColumns[])u.unmarshal(in);
One approach to take would be to generate fully qualified keys that contain the XML path to the element or attribute. These keys would be unique, stored in a single hashmap and get you to the element quickly.
Your code would simply have to generate a unique textual representation of the path and store and retrieve the xml element based on the key.

Categories