Spring Batch Partitioning with rate limit - java

Background
I'm using Spring Batch to fetch data from our customer sites through an HTTP API. The process consists of two main steps:
Fetch the total number of documents from the API, then calculate the total number of pages using a configurable page size. Each page is assigned to one partition step using a custom Partitioner.
A partition step sends a request to fetch a page of data (a list of documents), then processes it and writes it to our storage.
Customer sites might be "fragile". They could have rate limits, or they might stop responding after some heavy requests.
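For reference, the custom Partitioner mentioned above looks roughly like this (a simplified sketch: the class name, the "pageId" key, and the page-count wiring are illustrative, not the exact production code):

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// one ExecutionContext per page; each partition step later reads its pageId
public class PagePartitioner implements Partitioner {

    private final int totalPages; // totalDocuments / pageSize, computed in step 1

    public PagePartitioner(int totalPages) {
        this.totalPages = totalPages;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int page = 0; page < totalPages; page++) {
            ExecutionContext context = new ExecutionContext();
            context.putString("pageId", String.valueOf(page));
            partitions.put("partition-" + page, context);
        }
        return partitions;
    }
}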
What I have done so far
I'm using spring-retry to re-run a request that failed because of a rate limit or a server error. For example:
// the partition step's item reader
@StepScope
public class CustomItemReader implements ItemReader<Object> {

    private String pageId; // injected from the step execution context
    private List<Object> items;

    @Override
    public Object read() throws Exception {
        if (Objects.isNull(items)) {
            this.items = ImportService.getPage(pageId);
        }
        if (Objects.nonNull(items) && !items.isEmpty()) {
            return items.remove(0);
        }
        return null;
    }
}
// retry configuration for the fetching function
public class ImportService {

    @Retryable(
        value = RetryableException.class,
        maxAttempts = 3,
        backoff = @Backoff(delay = 1000)
    )
    public static List<Object> getPage(String pageId) throws RetryableException {
        return ...;
    }
}
The retry config contains a Backoff policy with a 1000 ms delay. I used this Retryable to handle both retries and rate limiting.
Problem
Retryable repeatedly waits and re-executes the function, which holds the thread the whole time. The instance might crash as the workload grows.
Because each customer has its own rate limit, using Retryable with Backoff is not an ideal way to control the rate. Even though I configure core_pool_size per customer site, core_pool_size=1 is not enough for some.
Question
Is there a proper way to throttle the execution rate of Spring Batch, especially with partitioning? For example, I want to configure it to send 2 requests every 10 seconds, and this cannot be achieved by sleeping in a step listener.
I have used Scrapy for some crawlers, and it has pretty cool retry and rate-limit features. With RetryMiddleware, it enqueues the failed pages and honors a RETRY_LIMIT from the settings. With AutoThrottle, it can automatically adjust the crawling speed based on the load on the server. Is there any way to achieve those kinds of features in Spring Batch, or do I have to rewrite my project with Scrapy?
Thank you very much!

Spring Batch does not provide such features out of the box, but you can use any rate-limiting library where appropriate during the step (i.e. before/after reading data, before/after processing or writing data, etc.).
This should help: Spring batch writer throttling.
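For example, here is a minimal sketch using Resilience4j's RateLimiter (one option among many; the 2-requests-per-10-seconds figures come from the question, and the class and method names are assumptions) that throttles the page fetch before each HTTP call:

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;

import java.time.Duration;
import java.util.List;

public class ThrottledImportService {

    // 2 permits per 10-second refresh window; a caller blocks up to 60 s for a permit
    private static final RateLimiter LIMITER = RateLimiter.of("customer-site",
            RateLimiterConfig.custom()
                    .limitForPeriod(2)
                    .limitRefreshPeriod(Duration.ofSeconds(10))
                    .timeoutDuration(Duration.ofSeconds(60))
                    .build());

    public static List<Object> getPage(String pageId) {
        // park the calling partition thread until a permit is free, so at
        // most 2 requests per 10 seconds reach the customer site
        if (!LIMITER.acquirePermission()) {
            throw new IllegalStateException("timed out waiting for a rate-limit permit");
        }
        return fetchPageOverHttp(pageId); // hypothetical HTTP call
    }

    private static List<Object> fetchPageOverHttp(String pageId) {
        return null; // placeholder for the actual request
    }
}

Note that this still parks the calling thread while it waits, which Spring Batch's threading model makes hard to avoid, but it removes the retry storm and centralizes the limit. Since each customer has its own rate limit, a RateLimiterRegistry keyed by customer id could hold one limiter per site.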

Related

How to limit concurrent http requests with Mono & Flux

I want to handle a Flux so as to limit the concurrent HTTP requests made by a List of Mono.
When some requests are done (responses received), the service issues further requests until the total count of in-flight requests reaches 15.
A single request returns a list and triggers another request depending on the result.
At this point, I want to send the requests with limited concurrency,
because on the consumer side too many HTTP requests put the remote server in trouble.
I used flatMapMany like below.
public Flux<JsonNode> syncData() {
    return service1
        .getData(param1)
        .flatMapMany(res -> {
            List<Mono<JsonNode>> totalTask = new ArrayList<>();
            Map<String, Object> originData = service2.getDataFromDB(param2);
            res.withArray("data").forEach(row -> {
                String id = row.get("id").asText();
                if (originData.containsKey(id)) {
                    totalTask.add(service1.updateRequest(param3));
                } else {
                    totalTask.add(service1.deleteRequest(param4));
                }
                originData.remove(id);
            });
            // create requests for the entries left in originData
            originData.forEach((id, value) -> totalTask.add(service1.createRequest(param5)));
            return Flux.merge(totalTask);
        });
}
void syncData() {
    syncDataService.syncData().????;
}
I tried chaining .window(15), but it doesn't work: all the requests are sent simultaneously.
How can I handle Flux for my goal?
I am afraid Project Reactor doesn't provide an implementation of either rate or time limiting out of the box.
However, you can find a bunch of 3rd-party libraries that provide such functionality and are compatible with Project Reactor. As far as I know, resilience4j-reactor supports that and is also compatible with the Spring and Spring Boot frameworks.
The RateLimiterOperator checks if a downstream subscriber/observer can acquire a permission to subscribe to an upstream Publisher. If the rate limit would be exceeded, the RateLimiterOperator could either delay requesting data from the upstream or it can emit a RequestNotPermitted error to the downstream subscriber.
RateLimiter rateLimiter = RateLimiter.ofDefaults("name");

Mono.fromCallable(backendService::doSomething)
    .transformDeferred(RateLimiterOperator.of(rateLimiter));
More about RateLimiter module itself here: https://resilience4j.readme.io/docs/ratelimiter
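Applied to the question's code (names taken from the question; a sketch rather than a drop-in replacement), each request Mono could be decorated the same way before being added to totalTask:

// every request must acquire a permit from the shared limiter before it subscribes
totalTask.add(service1.updateRequest(param3)
        .transformDeferred(RateLimiterOperator.of(rateLimiter)));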
You can use limitRate on a Flux. You will probably need to restructure your code a bit, but see the docs here: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#limitRate-int-
flatMap takes a concurrency parameter: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#flatMap-java.util.function.Function-int-
Mono<User> getById(int userId) { ... }
Flux.just(1, 2, 3, 4).flatMap(client::getById, 2)
will limit the number of concurrent requests to 2.
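Applied to the question's syncData (again a sketch with the question's names), bounded concurrency would replace Flux.merge, which subscribes to every inner Mono eagerly:

// subscribe to at most 15 of the update/delete/create Monos at a time
return Flux.fromIterable(totalTask)
           .flatMap(task -> task, 15);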

How to limit the number of active Spring WebClient calls

I have a requirement where I read a bunch of rows (thousands) from a SQL DB using Spring Batch and call a REST service to enrich the content before writing it to a Kafka topic.
When using the Spring reactive WebClient, how do I limit the number of active non-blocking service calls? Should I somehow introduce a Flux into the loop after I read the data using Spring Batch?
(I understand the usage of delayElements, and that it serves a different purpose, as when a single GET call brings in a lot of data and you want the server to slow down. My use case is a bit different: I have many WebClient calls to make and would like to limit their number to avoid out-of-memory issues while still gaining the advantages of non-blocking invocation.)
Very interesting question. I pondered it and came up with a couple of ideas on how this could be done. I will share my thoughts, and hopefully there are some ideas here that help you with your investigation.
Unfortunately, I'm not familiar with Spring Batch. However, this sounds like a problem of rate limiting, or the classical producer-consumer problem.
So, we have a producer that produces so many messages that our consumer cannot keep up, and the buffering in the middle becomes unbearable.
The problem I see is that your Spring Batch process, as you describe it, is not working as a stream or pipeline, but your reactive Web client is.
So, if we were able to read the data as a stream, then as records start getting into the pipeline those would get processed by the reactive web client and, using back-pressure, we could control the flow of the stream from producer/database side.
The Producer Side
So, the first thing I would change is how records get extracted from the database. We need to control how many records are read from the database at a time, either by paging our data retrieval or by controlling the fetch size, and then, with back pressure, control how many of those are sent downstream through the reactive pipeline.
So, consider the following (rudimentary) database data retrieval, wrapped in a Flux.
Flux<String> getData(DataSource ds) {
    return Flux.create(sink -> {
        try {
            Connection con = ds.getConnection();
            con.setAutoCommit(false);
            PreparedStatement stm = con.prepareStatement(
                    "SELECT order_number FROM orders WHERE order_date >= '2018-08-12'",
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stm.setFetchSize(1000);
            ResultSet rs = stm.executeQuery();

            sink.onRequest(batchSize -> {
                try {
                    for (int i = 0; i < batchSize; i++) {
                        if (!rs.next()) {
                            // no more data, close resources!
                            rs.close();
                            stm.close();
                            con.close();
                            sink.complete();
                            break;
                        }
                        sink.next(rs.getString(1));
                    }
                } catch (SQLException e) {
                    // TODO: close resources here
                    sink.error(e);
                }
            });
        } catch (SQLException e) {
            // TODO: close resources here
            sink.error(e);
        }
    });
}
In the example above:
I control the number of records read per batch (1000) by setting the fetch size.
The sink will emit the number of records requested by the subscriber (i.e. batchSize) and then wait for it to request more using back pressure.
When there are no more records in the result set, then we complete the sink and close resources.
If an error occurs at any point, we send back the error and close resources.
Alternatively, I could have used paging to read the data, which would probably simplify the handling of resources by reissuing a query at every request cycle.
You may also consider doing something when the subscription is cancelled or disposed (sink.onCancel, sink.onDispose), since closing the connection and other resources is fundamental here.
The Consumer Side
On the consumer side you register a subscriber that requests messages in batches of 1000 and only requests more once it has processed that batch.
getData(source).subscribe(new BaseSubscriber<String>() {

    private int messages = 0;

    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(1000);
    }

    @Override
    protected void hookOnNext(String value) {
        // make http request
        System.out.println(value);
        messages++;
        if (messages % 1000 == 0) {
            // when we're done with a batch,
            // we're ready to request more
            upstream().request(1000);
        }
    }
});
In the example above, when the subscription starts, it requests a first batch of 1000 messages. In hookOnNext we process that batch, making HTTP requests using the web client.
Once the batch is complete, we request another batch of 1000 from the publisher, and so on.
And there you have it! Using back pressure, you control how many open HTTP requests you have at a time.
My example is very rudimentary and will require some extra work to make it production-ready, but I believe it offers some ideas that can be adapted to your Spring Batch scenario.

Hold thread in spring rest request for long-polling

As the title says, we need one thread to notify or invoke a method of another thread. This implementation is part of a long-polling setup. The following describes and shows my implementation.
The requirements are:
UserX sends a request from the client to the server (a poll action) immediately after he gets the response to the previous one. In the service, an async Spring method is executed in which a thread immediately checks the cache for new data in the database. I know a cache is usually used for methods where a specific input yields a specific output. That is not the case here: I use the cache to reduce database calls, and the output of my method is always different. The cache stores a notification telling me whether I should check the database. This check runs in a while loop that ends when the thread finds a notification in the cache or the time expires.
Assume that the UserX thread (poll action) is currently in the while loop checking the cache.
At that moment, UserY (push action) sends some data to the server; the data is stored in the database by a separate thread, and the userId of the recipient is also stored in the cache.
So when UserX checks the cache, he finds the id of the recipient (which equals his own id in this case), breaks the loop, and fetches the data.
In my implementation I use the Google Guava cache, which allows manual writes.
private static Cache<Long, Long> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();
In the create method I store the id of the user who should read the data.
public void create(Data data) {
    dataRepository.save(data);
    // Guava's Cache has no save() and does not accept null values,
    // so the recipient id is stored as the value as well
    cache.put(data.getRecipient(), data.getRecipient());
    System.out.println("SAVED " + data.getRecipient() + " in " + Thread.currentThread().getName());
}
and here is the method that polls for data:
@Async
public CompletableFuture<List<Data>> pollData(Long previousMessageId, Long userId) throws InterruptedException {
    // check the db first; if there is new data, there is no need to loop and wait
    List<Data> data = findRecent(previousMessageId, userId);
    // data not found, so enter the loop for some time
    if (data.size() == 0) {
        short c = 0;
        while (c < 100) {
            // check whether new data has been added; if yes, break the loop
            if (cache.getIfPresent(userId) != null) {
                break;
            }
            c++;
            Thread.sleep(1000);
            System.out.println("SEQUENCE: " + c + " in " + Thread.currentThread().getName());
        }
        // check the database at the end of the loop or after breaking out of it
        data = findRecent(previousMessageId, userId);
    }
    // clear the cache entry for that recipient and return the result
    cache.invalidate(userId);
    return CompletableFuture.completedFuture(data);
}
After UserX gets the response, he sends the poll request again and the whole process repeats.
Can you tell me whether this application design for long polling in Java (Spring) is correct, or whether a better way exists? The key point is that when a user makes a poll request, the request should be held for some time waiting for new data rather than answered immediately. The solution shown above works, but the question is whether it will also work for many users (1000+). I worry about pausing threads, which would slow down other requests once no threads are available in the pool. Thanks in advance for your effort.
Check out WebSockets. Spring supports them from version 4 onwards. They don't require the client to initiate polling; instead, the server pushes data to the client in real time.
Check the below:
https://spring.io/guides/gs/messaging-stomp-websocket/
http://www.baeldung.com/websockets-spring
Note - WebSockets open a persistent connection between client and server and thus may result in more resource usage with a large number of users. So, if you are not looking for real-time updates and are fine with some delay, then polling might be a better approach. Also, not all browsers support WebSockets.
Web Sockets vs Interval Polling
Longpolling vs Websockets
In what situations would AJAX long/short polling be preferred over HTML5 WebSockets?
In your current approach, if you are concerned about the large number of threads running on the server for multiple users, you can instead trigger the polling from the front end every time. That way, only short-lived request threads are spawned to look for an update in the cache; if there is an update, another call can be made to retrieve the data. However, don't hit the server every second as you are doing now, or you will get high CPU utilization and user request threads may also suffer. You should optimize your timing.
Instead of hitting the cache after a 1-second delay 100 times, you can apply a smarter algorithm by analyzing the pattern of cache/DB updates over a period of time.
Knowing the pattern, you can trigger the polling with exponential back-off so as to hit the cache when an update is most likely. This way you hit the cache less frequently and more accurately.
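A minimal sketch of that idea (the initial delay, cap, and overall budget are assumptions), replacing the fixed one-second sleep in the question's loop:

long delayMs = 500;                     // start with a short wait
final long maxDelayMs = 16_000;         // cap the back-off
final long deadline = System.currentTimeMillis() + 100_000; // overall time budget

while (System.currentTimeMillis() < deadline) {
    if (cache.getIfPresent(userId) != null) {
        break;                          // notification found, go read the database
    }
    Thread.sleep(delayMs);
    delayMs = Math.min(delayMs * 2, maxDelayMs); // double the wait each round
}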

Apache Flink: Correctly make async webservice calls within MapReduce()

I have a program with the following mapPartition function:
public void mapPartition(Iterable<Tuple> values, Collector<Tuple2<Integer, String>> out)
I collect batches of 100 from the input values and send them to a web service for conversion. The results are added back to the out collector.
To speed up the process, I made the web-service calls asynchronous through the use of Executors. This created issues: either I get the taskManager-released exception or an AskTimeoutException. I increased memory and timeouts, but it didn't help. There's quite a lot of input data, and I believe this resulted in a lot of jobs being queued up with the ExecutorService and hence taking up lots of memory.
What would be the best approach for this?
I was also looking at the taskManager vs. taskSlot configuration, but got a little confused about the differences between the two (I guess they're similar to processes vs. threads?). I wasn't sure at what point to increase taskManagers vs. taskSlots. E.g., if I've got three machines with 4 CPUs per machine, should my taskManager be 3 while my taskSlot is 4?
I was also considering increasing the mapPartition's parallelism alone, to say 10, to get more threads hitting the web service. Comments or suggestions?
You should check out Flink's Async I/O, which enables you to query your web service asynchronously in your streaming application.
One thing to note is that the async function is not called in a multithreaded fashion; it is called once per record per partition, sequentially, so your web application needs to return deterministically, and ideally fast, for the job not to be held up.
Also, a higher number of partitions would potentially help your case, but again, your web service needs to fulfil those requests fast enough.
Sample code block from Flink's website:
// This example implements the asynchronous request and callback with Futures that have the
// interface of Java 8's futures (which is the same one followed by Flink's Future)

/**
 * An implementation of the 'AsyncFunction' that sends requests and sets the callback.
 */
class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {

    /** The database-specific client that can issue concurrent requests with callbacks */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, port, credentials);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(final String str, final AsyncCollector<Tuple2<String, String>> asyncCollector) throws Exception {
        // issue the asynchronous request, receive a future for the result
        Future<String> resultFuture = client.query(str);

        // set the callback to be executed once the request by the client is complete;
        // the callback simply forwards the result to the collector
        resultFuture.thenAccept((String result) -> {
            asyncCollector.collect(Collections.singleton(new Tuple2<>(str, result)));
        });
    }
}

// create the original stream (in your case, the stream you are mapPartitioning)
DataStream<String> stream = ...;

// apply the async I/O transformation
DataStream<Tuple2<String, String>> resultStream =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
Edit:
As the user wants to create batches of size 100 and Async I/O is specific to the Streaming API for the moment, the best way would be to create count windows of size 100.
Also, to purge the last window, which might not contain 100 events, custom triggers could be used: a combination of count triggers and time-based triggers, such that the trigger fires after a count of elements or after every few minutes.
A good follow-up is available on the Flink mailing list, where the user "Kostya" created a custom trigger, available here.
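A minimal sketch of the count-window batching described above (assuming String elements and continuing the stream variable from the sample):

// group every 100 elements into a list so each batch can be sent
// to the web service as one request
DataStream<List<String>> batches = stream
    .countWindowAll(100)
    .apply(new AllWindowFunction<String, List<String>, GlobalWindow>() {
        @Override
        public void apply(GlobalWindow window, Iterable<String> values,
                          Collector<List<String>> out) {
            List<String> batch = new ArrayList<>();
            for (String value : values) {
                batch.add(value);
            }
            out.collect(batch);
        }
    });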

kinesis getting data from multiple shards

I am trying to build a simple application that reads data from AWS Kinesis. I have managed to read data using a single shard, but I want to get data from 4 different shards.
The problem is that I have a while loop which iterates as long as the shard is active, and this prevents me from reading data from the other shards. So far I couldn't find an alternative algorithm, nor was I able to implement a KCL-based solution.
Many thanks in advance.
public static void DoSomething() {
    AmazonKinesisClient client = new AmazonKinesisClient();
    //noinspection deprecation
    client.setEndpoint(endpoint, serviceName, regionId);

    /** get the shards from the stream using the describe stream method */
    DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest();
    describeStreamRequest.setStreamName(streamName);
    List<Shard> shards = new ArrayList<>();
    String exclusiveStartShardId = null;
    do {
        describeStreamRequest.setExclusiveStartShardId(exclusiveStartShardId);
        DescribeStreamResult describeStreamResult = client.describeStream(describeStreamRequest);
        shards.addAll(describeStreamResult.getStreamDescription().getShards());
        if (describeStreamResult.getStreamDescription().getHasMoreShards() && shards.size() > 0) {
            exclusiveStartShardId = shards.get(shards.size() - 1).getShardId();
        } else {
            exclusiveStartShardId = null;
        }
    } while (exclusiveStartShardId != null);
    /** shards obtained */

    String shardIterator;
    GetShardIteratorRequest getShardIteratorRequest = new GetShardIteratorRequest();
    getShardIteratorRequest.setStreamName(streamName);
    getShardIteratorRequest.setShardId(shards.get(0).getShardId());
    getShardIteratorRequest.setShardIteratorType("LATEST");
    GetShardIteratorResult getShardIteratorResult = client.getShardIterator(getShardIteratorRequest);
    shardIterator = getShardIteratorResult.getShardIterator();

    GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
    while (shardIterator != null) {
        getRecordsRequest.setShardIterator(shardIterator);
        getRecordsRequest.setLimit(250);
        GetRecordsResult getRecordsResult = client.getRecords(getRecordsRequest);
        List<Record> records = getRecordsResult.getRecords();
        shardIterator = getRecordsResult.getNextShardIterator();
        if (records.size() != 0) {
            for (Record r : records) {
                System.out.println(r.getPartitionKey());
            }
        }
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
It is recommended that you do not read from multiple shards in a single process/worker. First, as you can see, it adds complexity to your code, but more importantly, you will have problems scaling up.
The "secret" of scalability is to have small and independent workers or similar units. You can see such a design in Hadoop, DynamoDB, or Kinesis in AWS. It allows you to build small systems (micro-services) that can easily scale up and down as needed. You can easily add more units of work/data as your service becomes more successful, or to absorb other fluctuations in its usage.
As you can see in these AWS services, you sometimes get this scalability automatically, as in DynamoDB, and sometimes you need to add shards to your Kinesis streams. Either way, you need some way to control the scalability of your application.
In the case of Kinesis, you can scale up and down using AWS Lambda or the Kinesis Client Library (KCL). Both of them listen to the status of your streams (number of shards and events) and use it to add or remove workers and deliver events to them for processing. In both of these solutions you build a worker that works against a single shard.
If you need to align events from multiple shards, you can do so using a state service such as Redis or DynamoDB.
For a simpler and neater solution, where you only have to worry about providing your own message-processing code, I recommend using the KCL.
Quoting from the documentation:
The KCL acts as an intermediary between your record processing logic and Kinesis Data Streams. The KCL performs the following tasks:
Connects to the data stream
Enumerates the shards within the data stream
Uses leases to coordinate shard associations with its workers
Instantiates a record processor for every shard it manages
Pulls data records from the data stream
Pushes the records to the corresponding record processor
Checkpoints processed records
Balances shard-worker associations (leases) when the worker instance count changes or when the data stream is resharded (shards are split or merged)
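For illustration, a minimal KCL (1.x) record processor might look like the sketch below; the Worker wiring, configuration, and checkpoint error handling are omitted. The library instantiates one processor per shard, which removes the manual shard loop from the question entirely.

import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class SimpleRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(String shardId) {
        System.out.println("Starting processor for shard " + shardId);
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            System.out.println(record.getPartitionKey()); // your processing logic here
        }
        try {
            checkpointer.checkpoint(); // mark progress so a restart resumes from here
        } catch (Exception e) {
            // handle throttling/shutdown exceptions as appropriate
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // on a graceful TERMINATE, checkpoint before the lease moves to another worker
    }
}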
