kinesis getting data from multiple shards - java

I am trying to build a simple application that reads data from AWS Kinesis. I have managed to read data using a single shard but I want to get data from 4 different shards.
Problem is, I have a while loop which iterates as long as the shard is active which prevents me from reading data from different shards. So far I couldn't find an alternative algorithm nor was able to implement a KCL-based solution.
Many thanks in advance
public static void DoSomething() {
    AmazonKinesisClient client = new AmazonKinesisClient();
    //noinspection deprecation
    client.setEndpoint(endpoint, serviceName, regionId);
    /** get shards from the stream using describe stream method*/
    DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest();
    describeStreamRequest.setStreamName(streamName);
    List<Shard> shards = new ArrayList<>();
    String exclusiveStartShardId = null;
    do {
        describeStreamRequest.setExclusiveStartShardId(exclusiveStartShardId);
        DescribeStreamResult describeStreamResult = client.describeStream(describeStreamRequest);
        shards.addAll(describeStreamResult.getStreamDescription().getShards());
        if (describeStreamResult.getStreamDescription().getHasMoreShards() && shards.size() > 0) {
            exclusiveStartShardId = shards.get(shards.size() - 1).getShardId();
        } else {
            exclusiveStartShardId = null;
        }
    } while (exclusiveStartShardId != null);
    /** shards obtained */
    String shardIterator;
    GetShardIteratorRequest getShardIteratorRequest = new GetShardIteratorRequest();
    getShardIteratorRequest.setStreamName(streamName);
    // only the first shard is read
    getShardIteratorRequest.setShardId(shards.get(0).getShardId());
    getShardIteratorRequest.setShardIteratorType("LATEST");
    GetShardIteratorResult getShardIteratorResult = client.getShardIterator(getShardIteratorRequest);
    shardIterator = getShardIteratorResult.getShardIterator();
    GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
    // getNextShardIterator() returns null once the shard is closed
    while (shardIterator != null) {
        getRecordsRequest.setShardIterator(shardIterator);
        getRecordsRequest.setLimit(250);
        GetRecordsResult getRecordsResult = client.getRecords(getRecordsRequest);
        List<Record> records = getRecordsResult.getRecords();
        shardIterator = getRecordsResult.getNextShardIterator();
        if (records.size() != 0) {
            for (Record r : records) {
                System.out.println(r.getPartitionKey());
            }
        }
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // restore the interrupt flag and stop polling
            Thread.currentThread().interrupt();
            break;
        }
    }
}

Reading from multiple shards in a single process/worker is not recommended. First, as you can see, it adds complexity to your code; more importantly, you will have problems scaling up.
The "secret" of scalability is to have small, independent workers or similar units. You can see this design in Hadoop, DynamoDB, and Kinesis in AWS. It allows you to build small systems (micro-services) that can easily scale up and down as needed: you can add more units of work/data as your service becomes more successful, or as its usage otherwise fluctuates.
As these AWS services show, you sometimes get this scalability automatically, as in DynamoDB, and sometimes you need to add shards yourself, as with Kinesis streams. For your own application, though, you need to control the scaling yourself.
In the case of Kinesis, you can scale up and down using AWS Lambda or the Kinesis Client Library (KCL). Both track the state of your stream (number of shards and events) and use it to add or remove workers and deliver events to them for processing. With either solution, you build a worker that works against a single shard.
If you need to align events from multiple shards, you can do that using a state service such as Redis or DynamoDB.
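To make the "one worker per shard" idea concrete, here is a minimal sketch in plain Java. It is not KCL or Lambda code: `ShardSource` is a hypothetical stand-in for the per-shard `getShardIterator`/`getRecords` calls from the question, and `ShardWorkers` is an illustrative name. Each shard gets its own polling loop, so one shard's `while` loop no longer blocks the others.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical stand-in for the Kinesis calls made against ONE shard. */
interface ShardSource {
    /** Returns the next batch of records, or null once the shard is closed. */
    List<String> nextBatch();
}

final class ShardWorkers {
    /** Runs one independent polling loop per shard and returns when every shard is closed. */
    static void runAll(List<ShardSource> shards, Consumer<String> handler) {
        if (shards.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        for (ShardSource shard : shards) {
            pool.execute(() -> {
                List<String> batch;
                // A real poller would also pause between empty batches.
                while ((batch = shard.nextBatch()) != null) {
                    batch.forEach(handler);
                }
            });
        }
        pool.shutdown(); // no new tasks; running workers continue until their shard closes
        try {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```

The handler is called from several threads at once, so it has to be thread-safe. This only removes the blocking loop inside one process; it does not give you the lease handling, checkpointing, or resharding support that the KCL provides.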

For a simpler and neater solution, where you only have to worry about providing your own message-processing code, I would recommend using the KCL.
Quoting from the documentation:
The KCL acts as an intermediary between your record processing logic
and Kinesis Data Streams. The KCL performs the following tasks:
Connects to the data stream
Enumerates the shards within the data stream
Uses leases to coordinate shard associations with its workers
Instantiates a record processor for every shard it manages
Pulls data records from the data stream
Pushes the records to the corresponding record processor
Checkpoints processed records
Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)


Spring Batch Partitioning with rate limit

Background
I'm using Spring Batch to fetch data from our customer sites through an HTTP API. The process consists of 2 main steps:
Fetch the total number of documents from the API, then calculate the total number of pages using a configurable page size. Each page is assigned to one partition step using a custom Partitioner.
A partition step sends a request to fetch a page of data (a list of documents), processes it, and writes it to our storage.
Customer sites might be "fragile". They might enforce a rate limit, or stop responding after some heavy requests.
What I have done so far
I'm using spring-retry to retry a request that failed because of a rate limit or server error. For example:
// the partition step's item reader
@StepScope
public class CustomItemReader implements ItemReader<Object> {
    private List<Object> items;
    @Override
    public Object read() {
        if (Objects.isNull(items)) {
            this.items = ImportService.getPage(pageId);
        }
        if (Objects.nonNull(items) && !items.isEmpty()) {
            return items.remove(0);
        }
        return null;
    }
}
// config retry for fetching function
public class ImportService {
    @Retryable(
        value = RetryableException.class,
        maxAttempts = 3,
        backoff = @Backoff(
            delay = 1000
        )
    )
    public static List<Object> getPage(String pageId) throws RetryableException {
        return ...;
    }
}
The retry config contains a Backoff policy with a delay of 1000 ms. I use this Retryable to handle both retries and rate limiting.
Problem
Retryable repeatedly waits and re-executes the function, which holds the thread for the whole time. The instance might crash when the workload grows.
Because each customer has its own rate limit, using Retryable with Backoff is not an ideal way to control the rate. Even though I configure core_pool_size for each customer site, core_pool_size=1 is not enough for some.
Question
Is there a proper way to throttle the execution rate of Spring Batch, especially with partitioning? For example, I want to configure it to send 2 requests in 10 seconds, and this cannot be achieved by sleeping in a step listener.
I have used scrapy for some crawlers, and it has pretty cool retry and rate limit features. With RetryMiddleware, it enqueues the failed pages and has a RETRY_LIMIT in settings. With AutoThrottle, it can automatically throttle speed based on the load on the server. Is there any way to achieve those kinds of features in Spring Batch? Or do I have to rewrite my project with scrapy?
Thank you very much!
Spring Batch does not provide such features. But you can use any rate limiting library where appropriate during the step (i.e., before/after reading data, before/after processing or writing data, etc.).
This should help: Spring batch writer throttling.
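As an illustration of the "2 requests in 10 seconds" requirement, here is a minimal hand-rolled limiter in plain Java. It is a sketch, not a Spring Batch feature or a specific library's API: the class name `SlidingWindowLimiter` and its methods are made up for this example, and a maintained rate-limiting library may well be the better choice in practice.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Allows at most maxCalls call starts in any window of windowMillis. */
final class SlidingWindowLimiter {
    private final int maxCalls;
    private final long windowMillis;
    // Scheduled start times of the most recent maxCalls calls, oldest first.
    private final Deque<Long> starts = new ArrayDeque<>();

    SlidingWindowLimiter(int maxCalls, long windowMillis) {
        if (maxCalls < 1 || windowMillis < 0) {
            throw new IllegalArgumentException("maxCalls must be >= 1 and windowMillis >= 0");
        }
        this.maxCalls = maxCalls;
        this.windowMillis = windowMillis;
    }

    /** Reserves the next slot and returns how many ms the caller must wait before starting. */
    synchronized long reserve(long nowMillis) {
        long startAt = nowMillis;
        if (starts.size() == maxCalls) {
            // The call made maxCalls calls ago must be a full window behind this one.
            startAt = Math.max(nowMillis, starts.pollFirst() + windowMillis);
        }
        starts.addLast(startAt);
        return startAt - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot starts. */
    void acquire() throws InterruptedException {
        long waitMillis = reserve(System.nanoTime() / 1_000_000);
        if (waitMillis > 0) {
            Thread.sleep(waitMillis);
        }
    }
}
```

Assuming one limiter instance per customer site, `new SlidingWindowLimiter(2, 10_000)`, the reader would call `acquire()` right before fetching a page. Note that `acquire()` still blocks the calling thread while it waits, so this controls the request rate but does not free up threads.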

How to limit the number of active Spring WebClient calls

I have a requirement where I read a bunch of rows (thousands) from a SQL DB using Spring Batch and call a REST Service to enrich content before writing them on a Kafka topic.
When using the Spring Reactive webClient, how do I limit the number of active non-blocking service calls? Should I somehow introduce a Flux in the loop after I read data using Spring Batch?
(I understand the usage of delayElements and that it serves a different purpose, such as when a single GET service call brings in a lot of data and you want the server to slow down. My use case is a bit different: I have many WebClient calls to make and would like to limit the number of concurrent calls to avoid out-of-memory issues, while still gaining the advantages of non-blocking invocations.)
Very interesting question. I pondered it and came up with a couple of ideas on how this could be done. I will share my thoughts, and hopefully some of them help with your investigation.
Unfortunately, I'm not familiar with Spring Batch. However, this sounds like a problem of rate limiting, or the classical producer-consumer problem.
So, we have a producer that produces so many messages that our consumer cannot keep up, and the buffering in the middle becomes unbearable.
The problem I see is that your Spring Batch process, as you describe it, is not working as a stream or pipeline, but your reactive Web client is.
So, if we were able to read the data as a stream, then as records start getting into the pipeline those would get processed by the reactive web client and, using back-pressure, we could control the flow of the stream from producer/database side.
The Producer Side
So, the first thing I would change is how records get extracted from the database. We need to control how many records are read from the database at a time, either by paging our data retrieval or by controlling the fetch size, and then, with back pressure, control how many of those are sent downstream through the reactive pipeline.
So, consider the following (rudimentary) database data retrieval, wrapped in a Flux.
Flux<String> getData(DataSource ds) {
    return Flux.create(sink -> {
        try {
            Connection con = ds.getConnection();
            con.setAutoCommit(false);
            // the result set type needs the three-argument overload;
            // prepareStatement(String, int) expects an auto-generated-keys flag
            PreparedStatement stm = con.prepareStatement("SELECT order_number FROM orders WHERE order_date >= '2018-08-12'", ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stm.setFetchSize(1000);
            ResultSet rs = stm.executeQuery();
            sink.onRequest(batchSize -> {
                try {
                    // batchSize is a long and may be Long.MAX_VALUE for unbounded demand
                    for (long i = 0; i < batchSize; i++) {
                        if (!rs.next()) {
                            //no more data, close resources!
                            rs.close();
                            stm.close();
                            con.close();
                            sink.complete();
                            break;
                        }
                        sink.next(rs.getString(1));
                    }
                } catch (SQLException e) {
                    //TODO: close resources here
                    sink.error(e);
                }
            });
        }
        catch (SQLException e) {
            //TODO: close resources here
            sink.error(e);
        }
    });
}
In the example above:
I control the number of records we read per batch to be 1000 by setting a fetch size.
The sink sends the number of records requested by the subscriber (i.e. batchSize) and then waits for it to request more using back pressure.
When there are no more records in the result set, we complete the sink and close resources.
If an error occurs at any point, we send back the error and close resources.
Alternatively, I could have used paging to read the data, which would probably simplify resource handling at the cost of reissuing a query on every request cycle.
You may also consider doing something if the subscription is cancelled or disposed (sink.onCancel, sink.onDispose), since closing the connection and other resources is fundamental here.
The Consumer Side
On the consumer side, you register a subscriber that requests only 1000 messages at a time and requests more only once it has processed that batch.
getData(source).subscribe(new BaseSubscriber<String>() {
    private int messages = 0;
    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(1000);
    }
    @Override
    protected void hookOnNext(String value) {
        //make http request
        System.out.println(value);
        messages++;
        if (messages % 1000 == 0) {
            //when we're done with a batch
            //then we're ready to request for more
            upstream().request(1000);
        }
    }
});
In the example above, when the subscription starts, it requests the first batch of 1000 messages. In onNext we process that first batch, making HTTP requests using the web client.
Once the batch is complete, we request another batch of 1000 from the publisher, and so on.
And there you have it! Using back pressure, you control how many open HTTP requests you have at a time.
My example is very rudimentary and will require some extra work to make it production ready, but I hope it offers some ideas that can be adapted to your Spring Batch scenario.
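To show the "limit the number of open calls" idea in isolation, here is a minimal sketch that uses a plain `java.util.concurrent.Semaphore` with `CompletableFuture` instead of Reactor. It swaps in a different technique from the back-pressure subscriber above, and `BoundedAsync.mapBounded` is a made-up helper name. A permit is taken before each call starts and is given back only when that call completes, so no more than `maxInFlight` calls are ever open. In a Reactor pipeline, the `flatMap` overload that takes a concurrency argument is the operator I would look at for the same effect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class BoundedAsync {
    /** Applies an async call to every input while keeping at most maxInFlight calls open. */
    static <T, R> List<R> mapBounded(Iterable<T> inputs, int maxInFlight,
                                     Function<T, CompletableFuture<R>> call) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> futures = new ArrayList<>();
        for (T input : inputs) {
            // Blocks the producer loop while maxInFlight calls are still open.
            permits.acquireUninterruptibly();
            CompletableFuture<R> future;
            try {
                future = call.apply(input);
            } catch (RuntimeException e) {
                permits.release(); // the call never started, so give the permit back
                throw e;
            }
            // Release the permit when the call finishes, successfully or not.
            future.whenComplete((result, error) -> permits.release());
            futures.add(future);
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : futures) {
            results.add(future.join()); // throws CompletionException if the call failed
        }
        return results;
    }
}
```

The calls themselves stay non-blocking; only the loop that hands out work waits. For a very large input you would process the results as they complete rather than collect them all in a list.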

Spark Stream new Job after stream start

I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I receive updates (if any) from Kafka through it.
The issue arises when I get a new request to stream a new topic. Since there can be only one Spark StreamingContext per JVM, I cannot create a new context for every new request.
The approaches I came up with are:
Once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDStream (for a new topic2) does not return a stream and further processing stops. Streaming continues for the first request (say topic1).
Second, I thought of stopping the stream, creating the DStream, and then starting streaming again. I cannot use the same streaming context (it throws an exception saying that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old stream topic (topic1) is lost and only the new one is streamed.
Here is the code:
JavaStreamingContext javaStreamingContext;
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
    .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
    .foreachRDD(impl);
if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl; it is a custom class that implements VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming context; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error. The subsequent steps are not executed.
I have tried many things and read multiple articles, and I believe this is a very common production-level issue: streaming usually has to cover multiple different topics, each of which is handled differently.
Please help
The thing is that the Spark master sends code out to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: Every time you want to subscribe to/stream from a different topic, start a new job instead of touching the already running one. You can supply the parameters or topic name in your API body. If you want to stop streaming from a specific topic, just stop the respective job. This gives you a lot of flexibility and control over resources.
[Theoretical] Topic Filter: Subscribe to all topics you think you will want; when records are pulled for a batch duration, filter them against a LIST of topics. Manipulate this list of topics through an API to widen or narrow your scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all.
Another workaround is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it, and to stop the relay when you don't.

Kafka Streams: Store is not ready

We recently upgraded Kafka to v1.1 and Confluent to v4.0. Since upgrading, we have encountered a persistent problem with state stores. Our application starts a collection of streams, and we check whether the state stores are ready, killing the application after 100 tries. After the upgrade, there is at least one stream that reports Store is not ready : the state store, <your stream>, may have migrated to another instance
The stream itself has RUNNING state and the messages will flow through but the state of the store still shows up as not ready. So I have no idea as to what may be happening.
Should I not check for store state?
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
Should we not do a hard restart -- currently we run it as a service on linux
We are running Kafka in a cluster with 3 brokers. Below is a sample stream (not the entire code):
public BaseStream createStreamInstance() {
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
    MessagePayLoadParser<Note> noteParser = new MessagePayLoadParser<Note>(Note.class);
    GenericJsonSerde<Note> noteSerde = new GenericJsonSerde<Note>(Note.class);
    StreamsBuilder builder = new StreamsBuilder();
    //below reducer will use sets to combine
    //value1 in the reducer is what is already present in the store.
    //value2 is the incoming message and for notes should have max 1 item in its list (since it's 1 attachment 1 tag per row, but multiple rows per note)
    Reducer<Note> reducer = new Reducer<Note>() {
        @Override
        public Note apply(Note value1, Note value2) {
            value1.merge(value2);
            return value1;
        }
    };
    KTable<Long, Note> noteTable = builder
        .stream(this.subTopic, Consumed.with(jsonSerde, jsonSerde))
        .map(noteParser::parse)
        .groupByKey(Serialized.with(Serdes.Long(), noteSerde))
        .reduce(reducer);
    noteTable.toStream().to(this.pubTopic, Produced.with(Serdes.Long(), noteSerde));
    this.stream = new KafkaStreams(builder.build(), this.properties);
    return this;
}
There are some open questions here, like the ones Matthias raised in the comments, but I will try to answer your actual questions:
Should I not check for store state?
Rebalancing is usually the cause here. But in that case, you should not see that partition's thread keep consuming; processing should be transferred to another thread that took over. Verify whether it is really the same thread that keeps processing that partition, and not a new one. Use the kafka-consumer-groups utility to follow the consumers (threads).
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
No, rebalancing is automatic.
Should we not do a hard restart -- currently we run it as a service on linux
Are you keeping your state stores in a specific, non-default directory? You should configure your state store directory properly and make sure it is accessible and survives application restarts. I'm unsure how you perform your hard restart, but some exception-handling code should guard against it by closing your streams application.
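As a small illustration of the state directory advice, the sketch below builds the streams properties with an explicit `state.dir`. The keys `application.id`, `bootstrap.servers`, and `state.dir` are Kafka Streams configuration names; the helper class `StreamsProps` and every value passed to it are hypothetical and need to be adapted to your setup. As far as I know, the default state directory lives under /tmp, which may be cleaned up between restarts.

```java
import java.util.Properties;

final class StreamsProps {
    /** Builds Kafka Streams properties that pin the state stores to an explicit directory. */
    static Properties build(String applicationId, String bootstrapServers, String stateDir) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Directory must exist (or be creatable) and be writable by the service user.
        props.put("state.dir", stateDir);
        return props;
    }
}
```

For example, `StreamsProps.build("notes-app", "broker1:9092", "/var/lib/notes-app/kafka-streams")` (all three values are placeholders) would produce the properties object to pass as the second argument of `new KafkaStreams(...)`.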

Apache Flink: Correctly make async webservice calls within MapReduce()

I've a program with the following mapPartition function:
public void mapPartition(Iterable<Tuple> values, Collector<Tuple2<Integer, String>> out)
I collect batches of 100 from the inputted values & send them to a web-service for conversion. The result I add back to the out collection.
To speed up the process, I made the web-service calls async through the use of Executors. This created issues: I get either the taskManager released exception or an AskTimeoutException. I increased memory and timeouts, but it didn't help. There's quite a lot of input data, and I believe this resulted in a lot of jobs being queued up in the ExecutorService and hence taking up lots of memory.
What would be the best approach for this?
I was also looking at the taskManager vs taskSlot configuration, but got a little confused on the differences between the two (I guess they're similar to process vs threads?). Wasn't sure at what point do I increase the taskManagers vs taskSlots? e.g. if I've got three machines with 4cpus per machine, so then should my taskManager=3 while my taskSlot=4?
I was also considering increasing the mapPartition's parallelism alone to say 10 to get more threads hitting the web-service. Comments or suggestions?
You should check out Flink's Async I/O, which lets you query your web service asynchronously in your streaming application.
One thing to note is that the Async I/O function is not called from multiple threads; it is called sequentially, once per record per partition, so your web service needs to return reliably, and ideally quickly, so that the job is not held up.
Also, a higher number of partitions could help in your case, but again your web service needs to fulfil those requests fast enough.
Sample code block from Flink's website:
// This example implements the asynchronous request and callback with Futures that have the
// interface of Java 8's futures (which is the same one followed by Flink's Future)
/**
 * An implementation of the 'AsyncFunction' that sends requests and sets the callback.
 */
class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {
    /** The database specific client that can issue concurrent requests with callbacks */
    private transient DatabaseClient client;
    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, post, credentials);
    }
    @Override
    public void close() throws Exception {
        client.close();
    }
    @Override
    public void asyncInvoke(final String str, final AsyncCollector<Tuple2<String, String>> asyncCollector) throws Exception {
        // issue the asynchronous request, receive a future for result
        Future<String> resultFuture = client.query(str);
        // set the callback to be executed once the request by the client is complete
        // the callback simply forwards the result to the collector
        resultFuture.thenAccept( (String result) -> {
            asyncCollector.collect(Collections.singleton(new Tuple2<>(str, result)));
        });
    }
}
// create the original stream (In your case the stream you are mappartitioning)
DataStream<String> stream = ...;
// apply the async I/O transformation
DataStream<Tuple2<String, String>> resultStream =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
Edit:
Since you want batches of 100 and Async I/O is currently specific to the streaming API, the best way would be to create count windows of size 100.
Also, to purge the last window, which might not have 100 events, custom triggers could be used that combine count triggers and time-based triggers, so that the trigger fires after a given number of elements or after every few minutes.
A good follow-up is available here on the Flink mailing list, where the user "Kostya" created a custom trigger, which is available here
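The batching requirement itself (groups of 100, plus a final flush for whatever is left over) can be sketched in plain Java, independent of Flink's window and trigger APIs. `Batcher.forEachBatch` below is a made-up helper, not part of Flink. It only illustrates the behaviour the count window plus time-based trigger is meant to give you: full batches are emitted as soon as they fill up, and the last partial batch is not lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class Batcher {
    /** Groups values into batches of the given size and hands each batch to the sink. */
    static <T> void forEachBatch(Iterable<T> values, int size, Consumer<List<T>> sink) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<T> batch = new ArrayList<>(size);
        for (T value : values) {
            batch.add(value);
            if (batch.size() == size) {
                sink.accept(batch);
                batch = new ArrayList<>(size); // fresh list, so the sink may keep the old one
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, possibly smaller, batch
        }
    }
}
```

Inside the `mapPartition` from the question, the sink would be the place where one web-service request is sent per batch.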
