I am processing a stream of messages using the Java KStream API. Currently my code emits output every 5 minutes, but I want the output at the top of each 5-minute interval (i.e. 17:10, 17:15, etc.).
Currently the interval depends on the time the program started: if the program starts at 17:08, the data gets collected at 17:13, 17:18, 17:23, and so on.
Is there a way to schedule the emission so the data is emitted at 5-minute intervals whose minutes are multiples of 5?
class WindowedTransformerExample {
public static void main(String[] args) {
final StreamsBuilder builder = new StreamsBuilder();
final String stateStoreName = "stateStore";
final StoreBuilder<KeyValueStore<String, String>> keyValueStoreBuilder =
Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(stateStoreName),
Serdes.String(),
Serdes.String());
builder.addStateStore(keyValueStoreBuilder);
builder.<String, String>stream("topic").transform(new
WindowedTransformer(stateStoreName), stateStoreName)
.filter((k, v) -> k != null && v != null)
// Here's where you do something with the records emitted every 5 minutes
.foreach((k, v) -> System.out.println(k + " -> " + v));
}
static final class WindowedTransformer implements TransformerSupplier<String, String, KeyValue<String, String>> {
private final String storeName;
public WindowedTransformer(final String storeName) {
this.storeName = storeName;
}
@Override
public Transformer<String, String, KeyValue<String, String>> get() {
return new Transformer<String, String, KeyValue<String, String>>() {
private KeyValueStore<String, String> keyValueStore;
private ProcessorContext processorContext;
@Override
public void init(final ProcessorContext context) {
processorContext = context;
keyValueStore = (KeyValueStore<String, String>) context.getStateStore(storeName);
// could change this to PunctuationType.STREAM_TIME if needed
context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, (ts) -> {
try(final KeyValueIterator<String, String> iterator = keyValueStore.all()) {
while (iterator.hasNext()) {
final KeyValue<String, String> keyValue = iterator.next();
processorContext.forward(keyValue.key, keyValue.value);
}
}
});
}
@Override
public KeyValue<String, String> transform(String key, String value) {
if (key != null) {
keyValueStore.put(key, value);
}
return null;
}
@Override
public void close() {
}
};
}
}
}
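To get emissions aligned to wall-clock multiples of five minutes, one option is to punctuate more frequently and only flush when a boundary has just been passed. Below is a minimal sketch of an alternative init(), my own adaptation rather than part of the original code; the one-minute check interval is an assumption and gives output aligned to within a minute of the boundary, not exactly on it:
@Override
public void init(final ProcessorContext context) {
    processorContext = context;
    keyValueStore = (KeyValueStore<String, String>) context.getStateStore(storeName);
    final long fiveMinutesMs = Duration.ofMinutes(5).toMillis();
    final long checkIntervalMs = Duration.ofMinutes(1).toMillis();
    // Punctuate every minute, but only forward when the wall-clock timestamp
    // has just crossed a 5-minute boundary (17:10, 17:15, ...).
    context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, ts -> {
        if (ts % fiveMinutesMs >= checkIntervalMs) {
            return; // not within the first minute after a multiple-of-5 boundary
        }
        try (final KeyValueIterator<String, String> iterator = keyValueStore.all()) {
            while (iterator.hasNext()) {
                final KeyValue<String, String> keyValue = iterator.next();
                processorContext.forward(keyValue.key, keyValue.value);
            }
        }
    });
}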
I'm using Apache Storm to create a topology that initially reads a "stream" of tuples from a file, then splits the tuples and stores them in MongoDB.
I have a cluster on Atlas with a shared replica set. I've already developed the topology, and the solution works properly if I use a single thread.
public static StormTopology build() {
return buildWithSpout();
}
public static StormTopology buildWithSpout() {
Config config = new Config();
TopologyBuilder builder = new TopologyBuilder();
CsvSpout datasetSpout = new CsvSpout("file.txt");
SplitterBolt splitterBolt = new SplitterBolt(",");
PartitionMongoInsertBolt insertPartitionBolt = new PartitionMongoInsertBolt();
builder.setSpout(DATA_SPOUT_ID, datasetSpout, 1);
builder.setBolt(DEPENDENCY_SPLITTER_ID, splitterBolt, 1).shuffleGrouping(DATA_SPOUT_ID);
builder.setBolt(UPDATER_COUNTER_ID, insertPartitionBolt, 1).shuffleGrouping(DEPENDENCY_SPLITTER_ID);
return builder.createTopology();
}
However, when I use parallel processes, my persistor bolt doesn't save all tuples in MongoDB, even though the tuples are correctly emitted by the previous bolt.
builder.setSpout(DATA_SPOUT_ID, datasetSpout, 1);
builder.setBolt(DEPENDENCY_SPLITTER_ID, splitterBolt, 3).shuffleGrouping(DATA_SPOUT_ID);
builder.setBolt(UPDATER_COUNTER_ID, insertPartitionBolt, 3).shuffleGrouping(DEPENDENCY_SPLITTER_ID);
This is my first bolt:
public class SplitterBolt extends BaseBasicBolt {
private String del;
private MongoConnector db = null;
public SplitterBolt(String del) {
this.del = del;
}
public void prepare(Map stormConf, TopologyContext context) {
db = MongoConnector.getInstance();
}
public void execute(Tuple input, BasicOutputCollector collector) {
String tuple = input.getStringByField("tuple");
int idTuple = Integer.parseInt(input.getStringByField("id"));
String opString = "";
String[] data = tuple.split(this.del);
for(int i=0; i < data.length; i++) {
OpenBitSet attrs = new OpenBitSet();
attrs.fastSet(i);
opString = Utility.toStringOpenBitSet(attrs, 5);
collector.emit(new Values(idTuple, opString, data[i]));
}
db.incrementCount();
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("idtuple","binaryattr","value"));
}
}
And this is my persistor bolt that stores all tuples in MongoDB:
public class PartitionMongoInsertBolt extends BaseBasicBolt {
private MongoConnector mongodb = null;
public void prepare(Map stormConf, TopologyContext context) {
//Singleton Instance
mongodb = MongoConnector.getInstance();
}
public void execute(Tuple input, BasicOutputCollector collector) {
mongodb.insertUpdateTuple(input);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}
My only doubt is that I used a singleton pattern for the MongoDB connection class. Could this be a problem?
UPDATE
This is my MongoConnector class:
public class MongoConnector {
private MongoClient mongoClient = null;
private MongoDatabase database = null;
private MongoCollection<Document> partitionCollection = null;
private static MongoConnector mongoInstance = null;
public MongoConnector() {
MongoClientURI uri = new MongoClientURI("connection string");
this.mongoClient = new MongoClient(uri);
this.database = mongoClient.getDatabase("db.database");
this.partitionCollection = database.getCollection("db.collection");
}
public static MongoConnector getInstance() {
if (mongoInstance == null)
mongoInstance = new MongoConnector();
return mongoInstance;
}
public void insertUpdateTuple(Tuple tuple) {
int idTuple = (Integer) tuple.getValue(0);
String attrs = (String) tuple.getValue(1);
String value = (String) tuple.getValue(2);
value = value.replace('.', ',');
Bson query = Filters.eq("_id", attrs);
Document docIterator = this.partitionCollection.find(query).first();
if (docIterator != null) {
Bson newValue = new Document(value, idTuple);
Bson updateDocument = new Document("$push", newValue);
this.partitionCollection.updateOne(docIterator, updateDocument);
} else {
Document document = new Document();
document.put("_id", attrs);
ArrayList<Integer> partition = new ArrayList<Integer>();
partition.add(idTuple);
document.put(value, partition);
this.partitionCollection.insertOne(document);
}
}
}
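As an aside on the singleton doubt: getInstance() as written is not synchronized, so two executor threads in the same worker could race and each create a MongoConnector. This is my own observation rather than part of the original post, and (per the solution update below) it was not the cause of the lost tuples, but a sketch of a thread-safe variant would be:
public static synchronized MongoConnector getInstance() {
    // synchronizing the lazy initialization guarantees a single shared instance
    if (mongoInstance == null) {
        mongoInstance = new MongoConnector();
    }
    return mongoInstance;
}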
SOLUTION UPDATE
I've solved the problem by changing this line:
this.partitionCollection.updateOne(docIterator, updateDocument);
to
this.partitionCollection.findOneAndUpdate(query, updateDocument);
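For reference, here is a sketch of how the whole method could look with an atomic upsert. This is my own variant rather than the exact code from the post, and it assumes the driver's Filters, Updates and FindOneAndUpdateOptions helpers from com.mongodb.client.model are imported:
public void insertUpdateTuple(Tuple tuple) {
    int idTuple = (Integer) tuple.getValue(0);
    String attrs = (String) tuple.getValue(1);
    String value = ((String) tuple.getValue(2)).replace('.', ',');
    Bson query = Filters.eq("_id", attrs);
    Bson update = Updates.push(value, idTuple);
    // findOneAndUpdate with upsert:true is a single atomic round trip, so
    // concurrent bolt executors sharing this connector cannot lose updates
    // the way the separate find() + updateOne() pair could.
    this.partitionCollection.findOneAndUpdate(
            query, update, new FindOneAndUpdateOptions().upsert(true));
}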
When I need to get the top 3 items from a Map, I can write this code:
private static Map<String, Integer> SortMapBasedOnValues(Map<String, Integer> map, int n) {
Map<String, Integer> sortedDecreasingly = map.entrySet().stream()
.sorted(Collections.reverseOrder(Map.Entry.comparingByValue())).limit(n)
.collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e2, LinkedHashMap::new));
return sortedDecreasingly;
}
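For example, a quick usage sketch (the map contents here are made up): calling it with n = 3 keeps only the three largest entries, in descending order of value.
Map<String, Integer> sales = new HashMap<>();
sales.put("apples", 10);
sales.put("pears", 40);
sales.put("plums", 25);
sales.put("figs", 5);
// top3 -> {pears=40, plums=25, apples=10}
Map<String, Integer> top3 = SortMapBasedOnValues(sales, 3);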
I have a memory cache that I use to keep track of some app data,
public class MemoryCache<K, T> {
private long timeToLive;
private LRUMap map;
protected class CacheObject {
public long lastAccessed = System.currentTimeMillis();
public T value;
protected CacheObject(T value) {
this.value = value;
}
}
public MemoryCache(long timeToLive, final long timerInterval, int maxItems) {
this.timeToLive = timeToLive * 1000;
map = new LRUMap(maxItems);
if (this.timeToLive > 0 && timerInterval > 0) {
Thread t = new Thread(new Runnable() {
public void run() {
while (true) {
try {
Thread.sleep(timerInterval * 1000);
} catch (InterruptedException ex) {
}
cleanup();
}
}
});
t.setDaemon(true);
t.start();
}
}
public void put(K key, T value) {
synchronized (map) {
map.put(key, new CacheObject(value));
}
}
@SuppressWarnings("unchecked")
public T get(K key) {
synchronized (map) {
CacheObject c = (CacheObject) map.get(key);
if (c == null)
return null;
else {
c.lastAccessed = System.currentTimeMillis();
return c.value;
}
}
}
public void remove(K key) {
synchronized (map) {
map.remove(key);
}
}
public int size() {
synchronized (map) {
return map.size();
}
}
@SuppressWarnings("unchecked")
public void cleanup() {
long now = System.currentTimeMillis();
ArrayList<K> deleteKey = null;
synchronized (map) {
MapIterator itr = map.mapIterator();
deleteKey = new ArrayList<K>((map.size() / 2) + 1);
K key = null;
CacheObject c = null;
while (itr.hasNext()) {
key = (K) itr.next();
c = (CacheObject) itr.getValue();
if (c != null && (now > (timeToLive + c.lastAccessed))) {
deleteKey.add(key);
}
}
}
for (K key : deleteKey) {
synchronized (map) {
map.remove(key);
}
Thread.yield();
}
}
}
Inside the app, I initialize it,
MemoryCache<String, Integer> cache = new MemoryCache<String, Integer>(200, 500, 100);
Then I can add the data,
cache.put("productId", 500);
I would like to add functionality to the MemoryCache class so that, when called, it returns a HashMap of the top 3 items based on value.
Do you have any advice on how to implement that?
While I don't have a better answer, I convert the MemoryCache to a HashMap with an additional method implemented inside the MemoryCache class, and later use it with the function provided earlier to retrieve the top 3 items based on value.
Here is my updated code:
/**
* convert the cache full of items to regular HashMap with the same
* key and value pair
*
* @return
*/
public Map<Product, Integer> convertToMap() {
synchronized (map) {
Map<Product, Integer> convertedMap = new HashMap<>();
MapIterator iterator = map.mapIterator();
while (iterator.hasNext()) {
Product product = (Product) iterator.next();
CacheObject cacheObject = (CacheObject) iterator.getValue();
int itemsSold = Integer.valueOf(cacheObject.value.toString());
convertedMap.put(product, itemsSold);
}
return convertedMap;
}
}
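With that in place, the top 3 can be pulled out of the cache by combining convertToMap() with the same stream pipeline shown earlier. A rough sketch, assuming the cache is keyed by the Product type used in convertToMap() and that java.util.stream.Collectors is imported:
Map<Product, Integer> snapshot = cache.convertToMap();
Map<Product, Integer> top3 = snapshot.entrySet().stream()
        .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
        .limit(3)
        .collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue,
                (e1, e2) -> e1, LinkedHashMap::new));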
I want to update a broadcast variable every minute, so I used the sample code given by Aastha in this question:
how can I update a broadcast variable in Spark streaming?
But it doesn't work. The function updateAndGet() only runs when the streaming application starts. When I debugged my code, it never entered updateAndGet() a second time, so the broadcast variable is not updated every minute.
Why?
Here is my sample code.
public class BroadcastWrapper {
private Broadcast<List<String>> broadcastVar;
private Date lastUpdatedAt = Calendar.getInstance().getTime();
private static BroadcastWrapper obj = new BroadcastWrapper();
private BroadcastWrapper(){}
public static BroadcastWrapper getInstance() {
return obj;
}
public JavaSparkContext getSparkContext(SparkContext sc) {
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
return jsc;
}
public Broadcast<List<String>> updateAndGet(JavaStreamingContext jsc) {
Date currentDate = Calendar.getInstance().getTime();
long diff = currentDate.getTime()-lastUpdatedAt.getTime();
if (broadcastVar == null || diff > 60000) { // let's say we want to refresh every 1 min = 60000 ms
if (broadcastVar != null)
broadcastVar.unpersist();
lastUpdatedAt = new Date(System.currentTimeMillis());
// Your logic to refresh the reference data goes here, for example:
// List<String> data = getRefData();
List<String> data = new ArrayList<String>();
data.add("tang");
data.add("xiao");
data.add(String.valueOf(System.currentTimeMillis()));
broadcastVar = jsc.sparkContext().broadcast(data);
}
return broadcastVar;
}
}
// Here is the computation code submitted to Spark Streaming.
lines.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
Broadcast<List<String>> blacklist =
BroadcastWrapper.getInstance().updateAndGet(jsc);
@Override
public JavaRDD<String> call(JavaRDD<String> rdd) {
JavaRDD<String> dd=rdd.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String word) {
if (blacklist.getValue().contains(word)) {
return false;
} else {
return true;
}
}
});
return dd;
}});
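One likely explanation, hedged since I cannot run the original job: the field initializer Broadcast<List<String>> blacklist = BroadcastWrapper.getInstance().updateAndGet(jsc); is evaluated exactly once, when the anonymous Function is constructed, not once per batch. Moving the call inside call(), which transform invokes on the driver for every micro-batch, gives the refresh logic a chance to run each time. A sketch of that variant (checkpointing considerations aside):
lines.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaRDD<String> rdd) {
        // Re-evaluated for every batch, so updateAndGet() can rebroadcast
        // once the 60-second threshold has passed.
        final Broadcast<List<String>> blacklist =
                BroadcastWrapper.getInstance().updateAndGet(jsc);
        return rdd.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String word) {
                return !blacklist.getValue().contains(word);
            }
        });
    }
});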
I'm using Flink with Java to build my recommendation system with our own logic.
So i have a dataset:
[user] [item]
100 1
100 2
100 3
100 4
100 5
200 1
200 2
200 3
200 6
300 1
300 6
400 7
So I map everything to a tuple:
DataSet<Tuple3<Long, Long, Integer>> csv = text.flatMap(new LineSplitter()).groupBy(0, 1).reduceGroup(new GroupReduceFunction<Tuple2<Long, Long>, Tuple3<Long, Long, Integer>>() {
@Override
public void reduce(Iterable<Tuple2<Long, Long>> iterable, Collector<Tuple3<Long, Long, Integer>> collector) throws Exception {
Long customerId = 0L;
Long itemId = 0L;
Integer count = 0;
for (Tuple2<Long, Long> item : iterable) {
customerId = item.f0;
itemId = item.f1;
count = count + 1;
}
collector.collect(new Tuple3<>(customerId, itemId, count));
}
});
Then I gather each customer and its items into an ArrayList:
DataSet<CustomerItems> customerItems = csv.groupBy(0).reduceGroup(new GroupReduceFunction<Tuple3<Long, Long, Integer>, CustomerItems>() {
@Override
public void reduce(Iterable<Tuple3<Long, Long, Integer>> iterable, Collector<CustomerItems> collector) throws Exception {
ArrayList<Long> newItems = new ArrayList<>();
Long customerId = 0L;
for (Tuple3<Long, Long, Integer> item : iterable) {
customerId = item.f0;
newItems.add(item.f1);
}
collector.collect(new CustomerItems(customerId, newItems));
}
});
Now I need to get all "similar" customers, but after trying a lot of things nothing works.
The logic would be:
for ci : CustomerItems
    c1 = ci.customerId
    for ci2 : CustomerItems
        c2 = ci2.customerId
        if c1 != c2
            if ci2.getItems() has any item in common with ci.getItems()
                collector.collect(new Tuple2<c1, c2>)
I tried it using reduce, but I can't iterate over the iterator twice (a loop inside a loop).
Can anyone help me?
You can cross the dataset with itself and basically insert your logic 1:1 into a cross function (excluding the two loops, since the cross does that for you).
I solved the problem, but I need to group and reduce after the "cross". I don't know whether it is the best method. Can anyone suggest something?
The result is here:
package org.myorg.quickstart;
import org.apache.flink.api.common.functions.CrossFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;
import java.io.Serializable;
import java.util.ArrayList;
public class UserRecommendation {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// read the dataset file
DataSet<String> text = env.readTextFile("/Users/paulo/Downloads/dataset.csv");
// create tuples of: customer | item | count
DataSet<Tuple3<Long, Long, Integer>> csv = text.flatMap(new LineFieldSplitter()).groupBy(0, 1).reduceGroup(new GroupReduceFunction<Tuple2<Long, Long>, Tuple3<Long, Long, Integer>>() {
@Override
public void reduce(Iterable<Tuple2<Long, Long>> iterable, Collector<Tuple3<Long, Long, Integer>> collector) throws Exception {
Long customerId = 0L;
Long itemId = 0L;
Integer count = 0;
for (Tuple2<Long, Long> item : iterable) {
customerId = item.f0;
itemId = item.f1;
count = count + 1;
}
collector.collect(new Tuple3<>(customerId, itemId, count));
}
});
// group each customer's items into a CustomerItems object
final DataSet<CustomerItems> customerItems = csv.groupBy(0).reduceGroup(new GroupReduceFunction<Tuple3<Long, Long, Integer>, CustomerItems>() {
@Override
public void reduce(Iterable<Tuple3<Long, Long, Integer>> iterable, Collector<CustomerItems> collector) throws Exception {
ArrayList<Long> newItems = new ArrayList<>();
Long customerId = 0L;
for (Tuple3<Long, Long, Integer> item : iterable) {
customerId = item.f0;
newItems.add(item.f1);
}
collector.collect(new CustomerItems(customerId, newItems));
}
});
// collect, for each customer, all items belonging to similar customers
DataSet<CustomerItems> ci = customerItems.cross(customerItems).with(new CrossFunction<CustomerItems, CustomerItems, CustomerItems>() {
@Override
public CustomerItems cross(CustomerItems customerItems, CustomerItems customerItems2) throws Exception {
if (!customerItems.customerId.equals(customerItems2.customerId)) {
boolean has = false;
for (Long item : customerItems2.items) {
if (customerItems.items.contains(item)) {
has = true;
break;
}
}
if (has) {
for (Long item : customerItems2.items) {
if (!customerItems.items.contains(item)) {
customerItems.ritems.add(item);
}
}
}
}
return customerItems;
}
}).groupBy(new KeySelector<CustomerItems, Long>() {
@Override
public Long getKey(CustomerItems customerItems) throws Exception {
return customerItems.customerId;
}
}).reduceGroup(new GroupReduceFunction<CustomerItems, CustomerItems>() {
@Override
public void reduce(Iterable<CustomerItems> iterable, Collector<CustomerItems> collector) throws Exception {
CustomerItems c = new CustomerItems();
for (CustomerItems current : iterable) {
c.customerId = current.customerId;
for (Long item : current.ritems) {
if (!c.ritems.contains(item)) {
c.ritems.add(item);
}
}
}
collector.collect(c);
}
});
ci.first(100).print();
System.out.println(ci.count());
}
public static class CustomerItems implements Serializable {
public Long customerId;
public ArrayList<Long> items = new ArrayList<>();
public ArrayList<Long> ritems = new ArrayList<>();
public CustomerItems() {
}
public CustomerItems(Long customerId, ArrayList<Long> items) {
this.customerId = customerId;
this.items = items;
}
@Override
public String toString() {
StringBuilder itemsData = new StringBuilder();
if (items != null) {
for (Long item : items) {
if (itemsData.length() == 0) {
itemsData.append(item);
} else {
itemsData.append(", ").append(item);
}
}
}
StringBuilder ritemsData = new StringBuilder();
if (ritems != null) {
for (Long item : ritems) {
if (ritemsData.length() == 0) {
ritemsData.append(item);
} else {
ritemsData.append(", ").append(item);
}
}
}
return String.format("[ID: %d, Items: %s, RItems: %s]", customerId, itemsData, ritemsData);
}
}
public static final class LineFieldSplitter implements FlatMapFunction<String, Tuple2<Long, Long>> {
@Override
public void flatMap(String value, Collector<Tuple2<Long, Long>> out) {
// normalize and split the line
String[] tokens = value.split("\t");
if (tokens.length > 1) {
out.collect(new Tuple2<>(Long.valueOf(tokens[0]), Long.valueOf(tokens[1])));
}
}
}
}
Link to the gist:
https://gist.github.com/prsolucoes/b406ae98ea24120436954967e37103f6
I am getting compilation errors in the transform function for Spark Streaming.
Specifically, I seem to be missing something when finalizing the DStream variable, or something similar. I copied this from the AMPLab tutorials, so I'm slightly confused...
Here is the code; the problem is in the transform function towards the end.
Here is the error:
[ERROR] /home/nipun/ngla-stable/online/src/main/java/org/necla/ngla/spark_streaming/Type4ViolationChecker.java:[120,63] error:
no suitable method found for transform(<anonymous Function<JavaPairRDD<Long,Integer>,JavaPairRDD<Long,Integer>>>)
[INFO] 1 error
Code:
public class Type4ViolationChecker {
private static final Pattern NEWSPACE = Pattern.compile("\n");
public static Long generateTSKey(String line) throws ParseException{
JSONObject obj = new JSONObject(line);
String time = obj.getString("mts");
DateFormat formatter = new SimpleDateFormat("yyyy / MM / dd HH : mm : ss");
Date date = (Date)formatter.parse(time);
long since = date.getTime();
long key = (long)(since/10000) * 10000;
return key;
}
public static void main(String[] args) {
Type4ViolationChecker obj = new Type4ViolationChecker();
SparkConf sparkConf = new SparkConf().setAppName("Type4ViolationChecker");
final JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(10000));
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String x) {
return Lists.newArrayList(NEWSPACE.split(x));
}
});
words.persist();
JavaDStream<String> matched = words.filter(new Function<String, Boolean>() {
public Boolean call(String line) {
return line.contains("pattern");
}});
JavaPairDStream<Long, Integer> keyValStream = matched.mapToPair(
new PairFunction<String, Long, Integer>(){
/**
* Here we are converting the string to a key value tuple
* Key -> time bucket calculated using the 1970 GMT date as anchor, and dividing by the polling interval
* Value -> 1, a count for this message
*/
@Override
public Tuple2<Long, Integer> call(String arg0)
throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Long,Integer>(generateTSKey(arg0),1);
}
});
JavaPairDStream<Long, Integer> tsStream = keyValStream.reduceByKey(
new Function2<Integer,Integer,Integer>(){
public Integer call(Integer i1, Integer i2){
return i1+ i2;
}});
JavaPairDStream<Long,Integer> sortedtsStream = tsStream.transform(
new Function<JavaPairRDD<Long, Integer>, JavaPairRDD<Long,Integer>>() {
@Override
public JavaPairRDD<Long, Integer> call(JavaPairRDD<Long, Integer> longIntegerJavaPairRDD) throws Exception {
return longIntegerJavaPairRDD.sortByKey(false);
}
});
//sortedtsStream.print();
ssc.start();
ssc.awaitTermination();
}
}
Thanks to @GaborBakos for providing the answer...
The following seems to work! I had to use transformToPair instead of transform: transform expects a function that returns a plain JavaRDD, while transformToPair accepts one returning a JavaPairRDD, which is what sortByKey produces here.
JavaPairDStream<Long,Integer> sortedtsStream = tsStream.transformToPair(
new Function<JavaPairRDD<Long, Integer>, JavaPairRDD<Long,Integer>>() {
@Override
public JavaPairRDD<Long, Integer> call(JavaPairRDD<Long, Integer> longIntegerJavaPairRDD) throws Exception {
return longIntegerJavaPairRDD.sortByKey(true);
}
});