Background
I have a project where we are using akka-streams with Java.
In this project I have a stream of strings and a graph that does some operations on them.
Objective
In my graph, I want to broadcast that stream to 2 workers. One will replace all characters 'a' with 'A' and send data as it receives it in real time.
The other one will receive the data, and every 3 strings, it will concat those 3 strings and map them to numbers.
It would look like the following:
Obviously Sink 2 will not receive information as fast as Sink 1, but that is expected behavior. The interesting part here is worker 2.
Problem
Worker 1 is easy. The issue here is worker 2. I know Akka has buffers that can hold up to X messages, but then it seems I am forced to choose one of the existing overflow strategies, which usually comes down to deciding which message to drop or whether to keep the stream alive.
All I want is for worker 2, when its buffer reaches its maximum size, to perform the concat and map operations on all the messages it holds and then send the result along (resetting the buffer afterwards).
But even after reading Akka's stream-rate documentation I couldn't find a way to do this, at least not in Java.
Research
I also checked a similar SO question, Selective request-throttling using akka-http stream, but it has been over a year and no one has answered it.
Questions
Using the graph DSL, how would I create the following path?
Source -> bcast -> worker2 -> Sink 2
After your bcast, apply the groupedWithin operator with a generous duration and the number of elements set to 3.
https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html
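To illustrate, here is a minimal sketch of how that path could be wired with the Java graph DSL, assuming the elements are Strings and using a hypothetical toNumber function for the "map to numbers" step. Which groupedWithin overload is available depends on your Akka 2.5.x version: newer releases accept a java.time.Duration, older ones a FiniteDuration.
import java.time.Duration;
import java.util.concurrent.CompletionStage;
import akka.Done;
import akka.NotUsed;
import akka.stream.ClosedShape;
import akka.stream.UniformFanOutShape;
import akka.stream.javadsl.Broadcast;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.GraphDSL;
import akka.stream.javadsl.RunnableGraph;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
public class WorkerGraphSketch {
    public static RunnableGraph<NotUsed> build(Source<String, NotUsed> source) {
        // worker 1: replace 'a' with 'A' and pass elements through as they arrive
        Flow<String, String, NotUsed> worker1 =
            Flow.of(String.class).map(s -> s.replace('a', 'A'));
        // worker 2: group elements 3 by 3 (the duration is only a generous upper bound),
        // then concat each group and map it to a number
        Flow<String, Integer, NotUsed> worker2 =
            Flow.of(String.class)
                .groupedWithin(3, Duration.ofDays(1))
                .map(group -> toNumber(String.join("", group)));
        Sink<String, CompletionStage<Done>> sink1 =
            Sink.foreach(s -> System.out.println("Sink 1: " + s));
        Sink<Integer, CompletionStage<Done>> sink2 =
            Sink.foreach(n -> System.out.println("Sink 2: " + n));
        return RunnableGraph.fromGraph(
            GraphDSL.create(builder -> {
                UniformFanOutShape<String, String> bcast = builder.add(Broadcast.create(2));
                builder.from(builder.add(source).out())
                       .viaFanOut(bcast)
                       .via(builder.add(worker1))
                       .to(builder.add(sink1));
                builder.from(bcast)
                       .via(builder.add(worker2))
                       .to(builder.add(sink2));
                return ClosedShape.getInstance();
            }));
    }
    // hypothetical placeholder for the "concat 3 strings and map them to a number" step
    private static Integer toNumber(String concatenated) {
        return concatenated.length();
    }
}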
You can also do it yourself by adding a stage that stores elements in a List and emits the list every time it reaches 3 elements.
import akka.stream.Attributes;
import akka.stream.FlowShape;
import akka.stream.Inlet;
import akka.stream.Outlet;
import akka.stream.stage.AbstractInHandler;
import akka.stream.stage.GraphStage;
import akka.stream.stage.GraphStageLogic;
import com.google.common.collect.ImmutableList;
import java.util.ArrayList;
import java.util.List;
public class RecordGrouper<T> extends GraphStage<FlowShape<T, List<T>>> {
private final Inlet<T> inlet = Inlet.create("in");
private final Outlet<List<T>> outlet = Outlet.create("out");
private final FlowShape<T, List<T>> shape = new FlowShape<>(inlet, outlet);
@Override
public GraphStageLogic createLogic(Attributes inheritedAttributes) {
return new GraphStageLogic(shape) {
List<T> batch = new ArrayList<>(3);
{
setHandler(
inlet,
new AbstractInHandler() {
@Override
public void onPush() {
T record = grab(inlet);
batch.add(record);
if (batch.size() == 3) {
emit(outlet, ImmutableList.copyOf(batch));
batch.clear();
}
pull(inlet);
}
});
}
@Override
public void preStart() {
pull(inlet);
}
};
}
@Override
public FlowShape<T, List<T>> shape() {
return shape;
}
}
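A minimal usage sketch for this stage, assuming String elements and an ActorMaterializer already in scope as materializer: wrap it with Flow.fromGraph and use it as the worker 2 branch.
import akka.NotUsed;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.Source;
// materializer is assumed to be an existing ActorMaterializer
Flow<String, List<String>, NotUsed> groupByThree = Flow.fromGraph(new RecordGrouper<String>());
Source.range(1, 9)
    .map(i -> i.toString())
    .via(groupByThree)
    .runForeach(System.out::println, materializer); // prints [1, 2, 3], [4, 5, 6], [7, 8, 9]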
As a side note, I don't think the buffer operator will work, as it only kicks in when there is backpressure. So if everything is quiet, elements will still be emitted one by one instead of 3 by 3. https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/buffer.html
Related
I'm working on a project that monitors a micro-service based system.
The mock micro-services I created produce data and upload it to Amazon Kinesis, and I use the code from Amazon below to produce to and consume from Kinesis. But I have failed to understand how I can add more processors (workers) that will work on the same records list (possibly concurrently), meaning I'm trying to figure out where and how to plug my code into the Amazon code I added below.
I'm going to have two processors in my program:
1. One that will save each record to a DB.
2. One that will update a GUI showing monitoring of the system, given that it can compare a current transaction to a valid transaction. My valid transactions will also be stored in a DB, so we will be able to see all of the data flowing through the system and see how each request was handled from end to end.
I would really appreciate some guidance, as this is my first industry project and I'm also kind of new to AWS (though I have read about it a lot).
Thanks!
Here is the code from Amazon, taken from this link:
https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer-sample/src/com/amazonaws/services/kinesis/producer/sample/SampleConsumer.java
/*
* Copyright 2015 Amazon.com, Inc. or its affiliates. All Rights Reserved.
*
* Licensed under the Amazon Software License (the "License").
* You may not use this file except in compliance with the License.
* A copy of the License is located at
*
* http://aws.amazon.com/asl/
*
* or in the "license" file accompanying this file. This file is distributed
* on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
* express or implied. See the License for the specific language governing
* permissions and limitations under the License.
*/
package com.amazonaws.services.kinesis.producer.sample;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;
/**
* If you haven't looked at {@link SampleProducer}, do so first.
*
* <p>
* As mentioned in SampleProducer, we will check that all records are received
* correctly by the KCL by verifying that there are no gaps in the sequence
* numbers.
*
* <p>
* As the consumer runs, it will periodically log a message indicating the
* number of gaps it found in the sequence numbers. A gap is when the difference
* between two consecutive elements in the sorted list of seen sequence numbers
* is greater than 1.
*
* <p>
* Over time the number of gaps should converge to 0. You should also observe
* that the range of sequence numbers seen is equal to the number of records put
* by the SampleProducer.
*
* <p>
* If the stream contains data from multiple runs of SampleProducer, you should
* observe the SampleConsumer detecting this and resetting state to only count
* the latest run.
*
* <p>
* Note if you kill the SampleConsumer halfway and run it again, the number of
* gaps may never converge to 0. This is because checkpoints may have been made
* such that some records from the producer's latest run are not processed
* again. If you observe this, simply run the producer to completion again
* without terminating the consumer.
*
* <p>
* The consumer continues running until manually terminated, even if there are
* no more records to consume.
*
* @see SampleProducer
* @author chaodeng
*
*/
public class SampleConsumer implements IRecordProcessorFactory {
private static final Logger log = LoggerFactory.getLogger(SampleConsumer.class);
// All records from a run of the producer have the same timestamp in their
// partition keys. Since this value increases for each run, we can use it to
// determine which run is the latest and disregard data from earlier runs.
private final AtomicLong largestTimestamp = new AtomicLong(0);
// List of record sequence numbers we have seen so far.
private final List<Long> sequenceNumbers = new ArrayList<>();
// A mutex for largestTimestamp and sequenceNumbers. largestTimestamp is
// nevertheless an AtomicLong because we cannot capture non-final variables
// in the child class.
private final Object lock = new Object();
/**
* One instance of RecordProcessor is created for every shard in the stream.
* All instances of RecordProcessor share state by capturing variables from
* the enclosing SampleConsumer instance. This is a simple way to combine
* the data from multiple shards.
*/
private class RecordProcessor implements IRecordProcessor {
@Override
public void initialize(String shardId) {}
@Override
public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
long timestamp = 0;
List<Long> seqNos = new ArrayList<>();
for (Record r : records) {
// Get the timestamp of this run from the partition key.
timestamp = Math.max(timestamp, Long.parseLong(r.getPartitionKey()));
// Extract the sequence number. It's encoded as a decimal
// string and placed at the beginning of the record data,
// followed by a space. The rest of the record data is padding
// that we will simply discard.
try {
byte[] b = new byte[r.getData().remaining()];
r.getData().get(b);
seqNos.add(Long.parseLong(new String(b, "UTF-8").split(" ")[0]));
} catch (Exception e) {
log.error("Error parsing record", e);
System.exit(1);
}
}
synchronized (lock) {
if (largestTimestamp.get() < timestamp) {
log.info(String.format(
"Found new larger timestamp: %d (was %d), clearing state",
timestamp, largestTimestamp.get()));
largestTimestamp.set(timestamp);
sequenceNumbers.clear();
}
// Only add to the shared list if our data is from the latest run.
if (largestTimestamp.get() == timestamp) {
sequenceNumbers.addAll(seqNos);
Collections.sort(sequenceNumbers);
}
}
try {
checkpointer.checkpoint();
} catch (Exception e) {
log.error("Error while trying to checkpoint during ProcessRecords", e);
}
}
@Override
public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
log.info("Shutting down, reason: " + reason);
try {
checkpointer.checkpoint();
} catch (Exception e) {
log.error("Error while trying to checkpoint during Shutdown", e);
}
}
}
/**
* Log a message indicating the current state.
*/
public void logResults() {
synchronized (lock) {
if (largestTimestamp.get() == 0) {
return;
}
if (sequenceNumbers.size() == 0) {
log.info("No sequence numbers found for current run.");
return;
}
// The producer assigns sequence numbers starting from 1, so we
// start counting from one before that, i.e. 0.
long last = 0;
long gaps = 0;
for (long sn : sequenceNumbers) {
if (sn - last > 1) {
gaps++;
}
last = sn;
}
log.info(String.format(
"Found %d gaps in the sequence numbers. Lowest seen so far is %d, highest is %d",
gaps, sequenceNumbers.get(0), sequenceNumbers.get(sequenceNumbers.size() - 1)));
}
}
@Override
public IRecordProcessor createProcessor() {
return this.new RecordProcessor();
}
public static void main(String[] args) {
KinesisClientLibConfiguration config =
new KinesisClientLibConfiguration(
"KinesisProducerLibSampleConsumer",
SampleProducer.STREAM_NAME,
new DefaultAWSCredentialsProviderChain(),
"KinesisProducerLibSampleConsumer")
.withRegionName(SampleProducer.REGION)
.withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);
final SampleConsumer consumer = new SampleConsumer();
Executors.newScheduledThreadPool(1).scheduleAtFixedRate(new Runnable() {
@Override
public void run() {
consumer.logResults();
}
}, 10, 1, TimeUnit.SECONDS);
new Worker.Builder()
.recordProcessorFactory(consumer)
.config(config)
.build()
.run();
}
}
Your question is very broad, but here are some suggestions on Kinesis consumers that are hopefully relevant to your use case.
Each Kinesis stream is partitioned into one or more shards. There are limits imposed per shard: for example, you can't write more than 1 MiB of data per second into a shard, and you can't issue more than 5 GetRecords requests (which the consumer's processRecords calls under the hood) per second to a single shard. (See the full list of constraints here.) If you are working with amounts of data that come close to or exceed these constraints, you'll want to increase the number of shards in your stream.
When you have only one consumer application and one worker, it takes responsibility for processing all shards of the corresponding stream. If there are multiple workers, they each assume responsibility for some subset of shards, so that each shard is assigned to one and only one worker (if you watch the consumer logs, you will see this referred to as "taking leases" on shards).
If you'd like to have several processors that independently ingest Kinesis traffic and process records, you need to register two separate consumer applications. In the code you referenced above, the application name is the first parameter of the KinesisClientLibConfiguration constructor. Note that even though they are separate consumer apps, the total limit of 5 GetRecords per second still applies.
In other words, you need two separate processes: one will instantiate the consumer that talks to the DB, the other will instantiate the consumer that updates the GUI:
KinesisClientLibConfiguration databaseSaverKclConfig =
new KinesisClientLibConfiguration(
"DatabaseSaverKclApp",
"your-stream",
new DefaultAWSCredentialsProviderChain(),
// I believe worker ids don't need to be unique, but it's a good practice to make them unique so you can easily identify the workers
"unique-worker-id")
.withRegionName(SampleProducer.REGION)
// this only matters the very first time your consumer is launched, subsequent launches will read the checkpoint from the previous runs
.withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);
final IRecordProcessorFactory databaseSaverConsumer = new DatabaseSaverConsumer();
KinesisClientLibConfiguration guiUpdaterKclConfig =
new KinesisClientLibConfiguration(
"GuiUpdaterKclApp",
"your-stream",
new DefaultAWSCredentialsProviderChain(),
"unique-worker-id")
.withRegionName(SampleProducer.REGION)
.withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);
final IRecordProcessorFactory guiUpdaterConsumer = new GuiUpdaterConsumer();
What about the implementation of DatabaseSaverConsumer and GuiUpdaterConsumer? Each of them needs to implement custom logic in the processRecords method. You need to make sure that each of them does the right amount of work inside this method and that the checkpoint logic is sound. Let's decipher these:
Let's say processRecords takes 10 seconds for 100 records, but the corresponding shard receives 500 records in 10 seconds. Every subsequent invocation of processRecords will fall further behind the shard. That means that either some work needs to be moved out of processRecords, or the number of shards needs to be scaled up.
Conversely, if processRecords only takes 0.1 seconds, then processRecords will be called 10 times a second, exceeding the allotted 5 transactions per second per shard. If I understand/remember correctly, there is no way to add a pause between subsequent calls to processRecords in the KCL config, so you have to add a sleep inside your code.
Checkpointing: each worker needs to track its progress, so that if it is unexpectedly interrupted and another worker takes over the same shard, it knows where to continue from. It's usually done in one of two ways: at the beginning of processRecords, or at the end. In the former case, you are saying "I am okay with skipping some records in the stream, but definitely don't want to process them twice"; in the latter, you are saying "I am okay with processing some records twice, but definitely can't lose any of them". (When you need the best of both worlds, i.e. processing records once and only once, you need to keep the state in some datastore outside the workers.) In your case, the database writer most probably needs to checkpoint after processing; I am not so sure about the GUI.
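For concreteness, here is a rough skeleton of what DatabaseSaverConsumer could look like, reusing the same KCL interfaces as the sample above; saveToDatabase is a placeholder for your own persistence code, and the checkpoint is taken after processing, as discussed:
// imports are the same as in the SampleConsumer sample above
public class DatabaseSaverConsumer implements IRecordProcessorFactory {
    private static final Logger log = LoggerFactory.getLogger(DatabaseSaverConsumer.class);
    private class RecordProcessor implements IRecordProcessor {
        @Override
        public void initialize(String shardId) {}
        @Override
        public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
            for (Record record : records) {
                saveToDatabase(record); // placeholder: write the record to your DB
            }
            try {
                // checkpoint after processing: records may be re-processed after a crash,
                // but none are lost
                checkpointer.checkpoint();
            } catch (Exception e) {
                log.error("Error while trying to checkpoint during processRecords", e);
            }
        }
        @Override
        public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
            try {
                checkpointer.checkpoint();
            } catch (Exception e) {
                log.error("Error while trying to checkpoint during shutdown", e);
            }
        }
    }
    private void saveToDatabase(Record record) {
        // placeholder for your persistence logic
    }
    @Override
    public IRecordProcessor createProcessor() {
        return new RecordProcessor();
    }
}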
Speaking of the GUI, what do you use to display data, and why does a Kinesis consumer need to update it, rather than the GUI itself querying the underlying datastores?
Anyway, I hope this helps. Let me know if you have more specific questions.
I have a file of records; each row begins with a timestamp and then a few fields. It implements Iterable.
@SuppressWarnings("unchecked")
@Override
public <E extends MarkedPoint>
Stream<E>
stream()
{
return (Stream<E>) StreamSupport.stream(spliterator(), false);
}
I would like to implement, with lambda expressions / the Streams API, what is essentially not just a filter but a mapping/accumulator that merges neighboring records (stream elements coming from the Iterable interface) that have the same timestamp. I would need an interface that looks something like this:
MarkedPoint prevPoint = null;
void nextPoint(MarkedPoint in, Stream<MarkedPoint> inputStream, Stream<MarkedPoint> outputStream )
{
while ( prevPoint.time == in.time )
{
updatePrevPoint(in);
in = stream.next();
}
outputStream.emit(in);
prevPoint = in;
}
}
That is rough pseudocode of what I imagine is close to how such an API would be used. Can someone please point me towards the most straightforward way of implementing this stream transformation? The resulting stream will necessarily have the same number of elements as the input or fewer, as it is essentially a filter plus an optional transformation applied when records occurring at the same timestamp are encountered.
Thanks in advance
Streams don't work like that; there can be only one terminating method (consumer). What you seem to be asking for is an on-the-fly reduction with possible consumption of the next element(s) within your class. No dice with the standard Stream API.
You could first create a list of un-merged lines, then create an iterator that peeks at the next element(s) and merges them before returning the next merged element.
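A minimal sketch of that approach, under the assumption that MarkedPoint exposes its timestamp as a time field and has a hypothetical merge(MarkedPoint) method that folds one point into the previous one:
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
class TimestampMerger {
    static List<MarkedPoint> mergeByTimestamp(Stream<MarkedPoint> input) {
        List<MarkedPoint> points = input.collect(Collectors.toList()); // un-merged elements first
        List<MarkedPoint> merged = new ArrayList<>();
        for (MarkedPoint p : points) {
            MarkedPoint prev = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (prev != null && prev.time == p.time) {
                prev.merge(p); // hypothetical: fold p into the previous point
            } else {
                merged.add(p);
            }
        }
        return merged; // merged.stream() continues the pipeline with neighbours collapsed
    }
}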
After reading through nested classes, nested lists, and mapping, I'm still having trouble deciding which approach is proper and, even then, how to implement any of those three.
Objective: tracking statistics of multiple services' up time from a log file. Each service status change is on one line and contains the name, the oldStatus, the newStatus, and the timeChanged.
In the end, I'd like to compute the time between these lines, but for now I'm simply trying to organize the data properly.
Currently, I'm going down the road of using a class and here is the building of it:
public class Services {
private List<String> AllServices = new ArrayList<String>();
// Initial thought of adding another list here, containing
// the variables, but how would I associate that to the
// service above?
private List<String> ServiceStatus;
public boolean AddService(String name) {
if (!AllServices.contains(name)) {
AllServices.add(name);
return true;
}
return false;
}
public List<String> GetServices() {
return AllServices;
}
}
That is all fine and good. As the parser discovers a new service, it adds it to the list so there aren't duplicates. Then I moved on to recording each time it finds that a service's status has changed. I can't figure out how to store that data for each service.
I guess I could liken it to using a database with the unique identifier being the service name, holding records for each time the status changes, what it changed to, and the time. Once I get that down, I can start comparing the times between changes.
Initial thought of adding another list here, containing the variables, but how would I associate that to each service?
I considered arrays, but in Java they seem rather static, in that you have to resize them to add more data. I've since forgotten the workaround for this in PHP, but from my limited past experience I recall being able to build up and use arrays such as:
statusChanged = myServices["TheService"][i][date];
statusChangedTo = myServices["TheService"][i][new_status];
statusChanged = myServices["OtherService"][i][date];
statusChangedTo = myServices["OtherService"][i][new_status];
AllServices
|- TheService
| |- TimeDate
| | - New Status
| | - Old Status
|- OtherService
|- TimeDate
| - New Status
| - Old Status
Which led me down the path of using extended classes. I could have Services and also Service. Services would just hold a list of services, but then how does Service associate its underlying variables with the specific services listed in the parent Services class?
Again, I feel like I'm overthinking this after reading too many examples that are similar but not quite what I'm trying to accomplish. Or I'm just completely off the wall entirely and there could be a far better method.
I suggest using a DB to persist the logged events. That is the preferable solution in the case of many services with many events.
Otherwise, here are a few steps for storing them in memory:
First of all, you should divide the data by service name. Use a Map<K,V> for that.
Secondly, I suggest combining oldStatus, newStatus, and timeChanged into a ServiceEvent class:
public class ServiceEvent {
private final String oldStatus;
private final String newStatus;
private final DateTime timeChanged;
// methods...
}
It is convenient to order events by timeChanged with a Comparator, in case of an unordered sequence of events:
public class ServiceEventComparator implements Comparator<ServiceEvent> {
@Override
public int compare(ServiceEvent e1, ServiceEvent e2) {
return e1.getTimeChanged().compareTo(e2.getTimeChanged());
}
}
Thus, your ServiceEventRepository can be like this:
public class ServiceEventRepository {
private Map<String, SortedSet<ServiceEvent>> storage = new HashMap<>();
public void addEvent(String service, ServiceEvent event) {
SortedSet<ServiceEvent> events = storage.get(service);
if (events == null) {
events = new TreeSet<ServiceEvent>(new ServiceEventComparator());
storage.put(service, events);
}
events.add(event);
}
}
Use a simple List<ServiceEvent> if the sequence of events has natural order.
Note that the ServiceEventRepository above is not thread safe.
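A short usage sketch from a hypothetical parser loop, assuming ServiceEvent gets a constructor taking (oldStatus, newStatus, timeChanged) and that logLine is whatever your parser produces for one line:
ServiceEventRepository repository = new ServiceEventRepository();
// for each parsed log line (the logLine accessors below are placeholders for your parser's output):
repository.addEvent(
    logLine.getServiceName(),
    new ServiceEvent(logLine.getOldStatus(), logLine.getNewStatus(), logLine.getTimeChanged()));
// the SortedSet for each service is now ordered by timeChanged, so consecutive
// events can be compared to compute the time a service spent in each status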
I created a list of objects in Java:
public class Client {
private int id;
private String user;
private int age;
public static List<Client> last100ClientList;
}
The list is supposed to be a kind of queue where the last 100 clients are stored. While the list is not yet full I just want to add clients, but once I reach 100 I want to remove the oldest one and add the new one. I could do it manually, but is there a function for that? Or maybe with arrays; I am not forced to use lists.
There is no built-in JDK data structure that does this out of the box; you would have to write a utility method yourself (though third-party libraries provide one, as shown below).
Since you want to keep the last 100 clients, every time you append while the list is already at 100 elements you have to remove the first client. You could try something like this (shown with Strings, but it works the same with Client objects).
import java.util.Queue;
import org.apache.commons.collections4.queue.CircularFifoQueue;
Queue<String> circularQueue = new CircularFifoQueue<String>(2);
circularQueue.add("Bob");
circularQueue.add("Doe");
circularQueue.add("Joe");
then
System.out.println(circularQueue);
outputs ["Doe", "Joe"];
You can also do this with:
com.google.common.collect.EvictingQueue (a short sketch follows below)
MinMaxPriorityQueue, also from Guava
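For example, a short sketch with Guava's EvictingQueue; Client is the class from the question and someClient is a placeholder. Once the queue holds 100 elements, adding a new one silently evicts the oldest:
import java.util.Queue;
import com.google.common.collect.EvictingQueue;
Queue<Client> last100Clients = EvictingQueue.create(100);
last100Clients.add(someClient); // when 100 clients are already stored, the oldest one is dropped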
Based on our experiments, we see that stateful Spark Streaming's internal processing costs take a significant amount of time when the state grows beyond a million objects. As a result latency suffers, because we have to increase the batch interval to avoid unstable behavior (processing time > batch interval).
It has nothing to do with the specifics of our app, since it can be reproduced by the code below.
What exactly are those Spark internal processing/infrastructure costs that take so much time to handle user state? Are there any options to decrease the processing time besides simply increasing the batch interval?
We planned to use state extensively: at least 100 MB or so on each of a few nodes, to keep all data in memory and only dump it once an hour.
Increasing batch interval helps, but we want to keep batch interval minimal.
The reason is probably not the space occupied by the state, but rather the large object graph, because when we changed the list to a large array of primitives, the problem went away.
Just a guess: it might have something to do with org.apache.spark.util.SizeEstimator used internally by Spark, because it shows up from time to time while profiling.
Here is a simple demo to reproduce the picture above on a modern Core i7:
less than 15 MB of state
no stream input at all
quickest possible (dummy) 'updateStateByKey' function
batch interval 1 second
checkpoint (required by Spark, must have) to local disk
tested both locally and on YARN
Code:
package spark;
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.util.SizeEstimator;
import scala.Tuple2;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
public class SlowSparkStreamingUpdateStateDemo {
// Very simple state model
static class State implements Serializable {
final List<String> data;
State(List<String> data) {
this.data = data;
}
}
public static void main(String[] args) {
SparkConf conf = new SparkConf()
// Tried KryoSerializer, but it does not seem to help much
//.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local[*]")
.setAppName(SlowSparkStreamingUpdateStateDemo.class.getName());
JavaStreamingContext javaStreamingContext = new JavaStreamingContext(conf, Durations.seconds(1));
javaStreamingContext.checkpoint("checkpoint"); // a must (if you have stateful operation)
List<Tuple2<String, State>> initialRddGeneratedData = prepareInitialRddData();
System.out.println("Estimated size, bytes: " + SizeEstimator.estimate(initialRddGeneratedData));
JavaPairRDD<String, State> initialRdd = javaStreamingContext.sparkContext().parallelizePairs(initialRddGeneratedData);
JavaPairDStream<String, State> stream = javaStreamingContext
.textFileStream(".") // fake: effectively, no input at all
.mapToPair(input -> (Tuple2<String, State>) null) // fake to get JavaPairDStream
.updateStateByKey(
(inputs, maybeState) -> maybeState, // simplest possible dummy function
new HashPartitioner(javaStreamingContext.sparkContext().defaultParallelism()),
initialRdd); // set generated state
stream.foreachRDD(rdd -> { // simplest possible action (required by Spark)
System.out.println("Is empty: " + rdd.isEmpty());
return null;
});
javaStreamingContext.start();
javaStreamingContext.awaitTermination();
}
private static List<Tuple2<String, State>> prepareInitialRddData() {
// 'stateCount' tuples with value = list of size 'dataListSize' of strings of length 'elementDataSize'
int stateCount = 1000;
int dataListSize = 200;
int elementDataSize = 10;
List<Tuple2<String, State>> initialRddInput = new ArrayList<>(stateCount);
for (int stateIdx = 0; stateIdx < stateCount; stateIdx++) {
List<String> stateData = new ArrayList<>(dataListSize);
for (int dataIdx = 0; dataIdx < dataListSize; dataIdx++) {
stateData.add(RandomStringUtils.randomAlphanumeric(elementDataSize));
}
initialRddInput.add(new Tuple2<>("state" + stateIdx, new State(stateData)));
}
return initialRddInput;
}
}
State management has been improved in Spark 1.6.
Please refer to SPARK-2629, Improved state management for Spark Streaming;
And in the detailed design spec:
Improved state management in Spark Streaming
One performance drawback is mentioned below:
Need for more optimized state management that does not scan every key
Currently updateStateByKey scans every key in every batch interval, even if there is no data for that key. While this semantics is useful in some workloads, most workloads require only scanning and updating the state for which there is new data, and only a small percentage of all the state needs to be touched for that in every batch interval. The cogroup-based implementation of updateStateByKey is not designed for this; cogroup scans all the keys every time. In fact, this causes the batch processing times of updateStateByKey to increase with the number of keys in the state, even if the data rate stays fixed.
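For reference, a hedged sketch of the mapWithState API that SPARK-2629 introduced in Spark 1.6, which only touches keys that receive new data in the current batch. Here pairs stands for a JavaPairDStream<String, Integer> built from your input, the state is a running Long sum, and the Optional import differs by version (Guava's Optional in 1.6.x, org.apache.spark.api.java.Optional in 2.x):
import org.apache.spark.api.java.Optional; // com.google.common.base.Optional on Spark 1.6.x
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
JavaMapWithStateDStream<String, Integer, Long, String> stateful =
    pairs.mapWithState(
        StateSpec.<String, Integer, Long, String>function(
            (key, value, state) -> {
                // only keys with new data in this batch reach this function
                long sum = (state.exists() ? state.get() : 0L) + value.orElse(0);
                state.update(sum);
                return key + " -> " + sum;
            }));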