I have a program with the following mapPartition function:
public void mapPartition(Iterable<Tuple> values, Collector<Tuple2<Integer, String>> out)
I collect batches of 100 from the input values and send them to a web service for conversion, adding the results back to the out collector.
To speed up the process, I made the web-service calls asynchronous through the use of Executors. This created issues: either I get the TaskManager-released exception, or an AskTimeoutException. I increased memory and timeouts, but it didn't help. There is quite a lot of input data, and I believe this resulted in a lot of jobs being queued up with the ExecutorService and hence taking up lots of memory.
What would be the best approach for this?
I was also looking at the taskManager vs. taskSlot configuration, but got a little confused about the difference between the two (I guess they're similar to processes vs. threads?). I wasn't sure at what point to increase taskManagers vs. taskSlots. E.g., if I've got three machines with 4 CPUs per machine, should my taskManager be 3 while my taskSlot is 4?
I was also considering increasing only the mapPartition's parallelism to, say, 10 to get more threads hitting the web service. Comments or suggestions?
You should check out Flink's Async I/O, which enables you to query your web service asynchronously in your streaming application.
One thing to note is that the AsyncFunction is not called multi-threaded: it is called once per record, per partition, sequentially. So your web service needs to return deterministically, and ideally fast, for the job not to be held up.
Also, a higher number of partitions would potentially help your case, but again, your web service needs to fulfil those requests fast enough.
Sample code block from Flink's website:
// This example implements the asynchronous request and callback with Futures that have the
// interface of Java 8's futures (which is the same one followed by Flink's Future)
/**
* An implementation of the 'AsyncFunction' that sends requests and sets the callback.
*/
class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {

    /** The database-specific client that can issue concurrent requests with callbacks. */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, port, credentials);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(final String str, final AsyncCollector<Tuple2<String, String>> asyncCollector) throws Exception {
        // issue the asynchronous request, receive a future for the result
        Future<String> resultFuture = client.query(str);

        // set the callback to be executed once the request by the client is complete;
        // the callback simply forwards the result to the collector
        resultFuture.thenAccept((String result) -> {
            asyncCollector.collect(Collections.singleton(new Tuple2<>(str, result)));
        });
    }
}
// create the original stream (in your case, the stream you are mapPartition-ing)
DataStream<String> stream = ...;

// apply the async I/O transformation
DataStream<Tuple2<String, String>> resultStream =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
Edit:
Since the user wants to create batches of size 100, and Async I/O is specific to the Streaming API for the moment, the best way would be to create count windows of size 100.
Also, to purge the last window, which might not have 100 events, custom triggers could be used that combine a count trigger and a time-based trigger, so that the trigger fires after a count of elements or after every few minutes.
A good follow-up is available on the Flink mailing list, where the user "Kostya" created such a custom trigger.
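As a rough illustration, such a count-or-timeout trigger could look like the following minimal sketch (Flink 1.x Trigger API; the class name, the parameters, and the simplified timer bookkeeping are my assumptions, not Kostya's exact code):
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;

public class CountOrTimeoutTrigger<W extends Window> extends Trigger<Object, W> {

    private final long maxCount;
    private final long timeoutMillis;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    public CountOrTimeoutTrigger(long maxCount, long timeoutMillis) {
        this.maxCount = maxCount;
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() == 1) {
            // first element of a new batch: arm the timeout so a partial batch is flushed
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + timeoutMillis);
        }
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE; // full batch of maxCount elements
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        // timeout expired: flush whatever the window holds so the last batch is not stuck
        ctx.getPartitionedState(countDesc).clear();
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }

    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}
It could then be attached with something like stream.windowAll(GlobalWindows.create()).trigger(new CountOrTimeoutTrigger<>(100, 60_000)) to emit batches of up to 100 elements.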
Background
I'm using Spring Batch to fetch data from our customer sites through an HTTP API. The process contains two main steps:
Fetch the total number of documents from the API, then calculate the total pages using a configurable page size. Each page is assigned to one partition step using a custom Partitioner.
A partition step sends a request to fetch a page of data (a list of documents), then processes and writes it to our storage.
Customer sites might be "fragile": they could have rate limits, or their sites might stop responding after some heavy requests.
What I have done so far
I'm using spring-retry to re-run a request that failed because of a rate limit or a server error. For example:
// the partition step's item reader
@StepScope
public class CustomItemReader implements ItemReader<Object> {

    private String pageId; // the page assigned to this partition, e.g. from the step execution context
    private List<Object> items;

    @Override
    public Object read() throws Exception {
        if (Objects.isNull(items)) {
            this.items = ImportService.getPage(pageId);
        }
        if (Objects.nonNull(items) && !items.isEmpty()) {
            return items.remove(0);
        }
        return null;
    }
}
// config retry for the fetching function
public class ImportService {

    @Retryable(
        value = RetryableException.class,
        maxAttempts = 3,
        backoff = @Backoff(delay = 1000)
    )
    public static List<Object> getPage(String pageId) throws RetryableException {
        return ...;
    }
}
The retry config contains a backoff policy with a 1000 ms delay. I used this Retryable to handle both retries and rate limits.
Problem
Retryable will repeatedly wait and re-execute the function, which holds the thread the whole time. The instance might crash when things get bigger.
Because each customer has its own rate limit, using Retryable with Backoff is not an ideal way to control the rate. Even though I configure core_pool_size per customer site, core_pool_size=1 is not enough for some.
Question
Is there any proper way to throttle the execution rate of Spring Batch, especially with partitioning? For example, I want to configure it to send 2 requests every 10 seconds, and this cannot be achieved by using sleep in a step listener.
I have used Scrapy for some crawlers, and it has pretty cool retry and rate-limit features. With RetryMiddleware, it will enqueue the failed pages, and it has a RETRY_LIMIT in settings. With AutoThrottle, it can automatically throttle speed based on the load on the server. Is there any way to achieve those kinds of features in Spring Batch, or do I have to rewrite my project with Scrapy?
Thank you very much!
Spring Batch does not provide such features out of the box, but you can use any rate-limiting library where appropriate during the step (i.e., before/after reading data, before/after processing or writing data, etc.).
This should help: Spring batch writer throttling.
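As a minimal sketch of that idea (Guava on the classpath is an assumption, and ThrottledItemReader is a hypothetical variant of the question's CustomItemReader): RateLimiter.create(0.2) hands out 0.2 permits per second, i.e. roughly 2 requests every 10 seconds as asked, and acquire() blocks the partition step until a permit is available.
import java.util.List;
import java.util.Objects;
import com.google.common.util.concurrent.RateLimiter;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemReader;

@StepScope
public class ThrottledItemReader implements ItemReader<Object> {

    // one limiter shared by all partition steps hitting the same customer site
    private final RateLimiter rateLimiter;
    private final String pageId;
    private List<Object> items;

    public ThrottledItemReader(RateLimiter rateLimiter, String pageId) {
        this.rateLimiter = rateLimiter;
        this.pageId = pageId;
    }

    @Override
    public Object read() throws Exception {
        if (Objects.isNull(items)) {
            rateLimiter.acquire(); // blocks until the rate allows another API call
            this.items = ImportService.getPage(pageId);
        }
        return items.isEmpty() ? null : items.remove(0);
    }
}
Since the limiter is injected, each customer site can get its own instance created with RateLimiter.create(permitsPerSecond) to match its individual rate limit.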
I want to handle a Flux to limit the concurrent HTTP requests made by a List of Mono.
When some requests are done (responses received), the service should request more until the total count of in-flight requests reaches 15.
A single request returns a list and triggers another request depending on the result.
At this point, I want to send the requests with limited concurrency, because on the consumer side, too many HTTP requests put the opposite server in trouble.
I used flatMapMany like below:
public Flux<JsonNode> syncData() {
    return service1
        .getData(param1)
        .flatMapMany(res -> {
            List<Mono<JsonNode>> totalTask = new ArrayList<>();
            Map<String, Object> originData = service2.getDataFromDB(param2);

            res.withArray("data").forEach(row -> {
                String id = row.get("id").asText();
                if (originData.containsKey(id)) {
                    totalTask.add(service1.updateRequest(param3));
                } else {
                    totalTask.add(service1.deleteRequest(param4));
                }
                originData.remove(id);
            });

            // for every entry left in originData, create a new record
            originData.keySet().forEach(key -> totalTask.add(service1.createRequest(param5)));

            return Flux.merge(totalTask);
        });
}
void syncData() {
    syncDataService.syncData().????;
}
I tried chaining .window(15), but it doesn't work: all the requests are sent simultaneously.
How can I handle the Flux to achieve my goal?
I am afraid Project Reactor doesn't provide any built-in implementation of either rate or time limiting.
However, you can find a bunch of third-party libraries that provide such functionality and are compatible with Project Reactor. As far as I know, resilience4j-reactor supports that and is also compatible with the Spring and Spring Boot frameworks.
The RateLimiterOperator checks if a downstream subscriber/observer can acquire a permission to subscribe to an upstream Publisher. If the rate limit would be exceeded, the RateLimiterOperator could either delay requesting data from the upstream or it can emit a RequestNotPermitted error to the downstream subscriber.
RateLimiter rateLimiter = RateLimiter.ofDefaults("name");
Mono.fromCallable(backendService::doSomething)
    .transformDeferred(RateLimiterOperator.of(rateLimiter));
More about RateLimiter module itself here: https://resilience4j.readme.io/docs/ratelimiter
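For completeness, a custom configuration could look like the following minimal sketch (the limiter name and the numbers are assumptions, not from the question). Note that resilience4j's RateLimiter caps the request rate, i.e. how many subscriptions may start per refresh period, rather than the number of simultaneously in-flight requests:
import java.time.Duration;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import io.github.resilience4j.reactor.ratelimiter.operator.RateLimiterOperator;

RateLimiterConfig config = RateLimiterConfig.custom()
        .limitForPeriod(15)                        // at most 15 permits...
        .limitRefreshPeriod(Duration.ofSeconds(1)) // ...per second
        .timeoutDuration(Duration.ofSeconds(5))    // wait up to 5 s for a permit
        .build();
RateLimiter rateLimiter = RateLimiter.of("sync-data", config);

Flux<JsonNode> limited = syncDataService.syncData()
        .transformDeferred(RateLimiterOperator.of(rateLimiter));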
You can use limitRate on a Flux. You probably need to reformat your code a bit, but see the docs here: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#limitRate-int-
flatMap takes a concurrency parameter: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#flatMap-java.util.function.Function-int-
Mono<User> getById(int userId) { ... }
Flux.just(1, 2, 3, 4).flatMap(client::getById, 2)
will limit the number of concurrent requests to 2.
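Applied to the question's syncData() (a minimal sketch; totalTask is the list of Monos built there, and 15 is the desired concurrency), that means replacing the eager Flux.merge(totalTask) with:
// subscribe to at most 15 of the inner Monos at any one time
return Flux.fromIterable(totalTask)
        .flatMap(task -> task, 15);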
Since I don't have the code here, I'll try to be as clear as I can...
I'm developing a REST service in Java that will get some params (number of threads, amount of messages), create the threads (via a loop), and send that number of messages via MQ (I'm passing the number of messages when creating each thread).
So, for example, if someone sends 50 threads and 5,000 messages each, it will send 250,000 messages...
Now my question is how I could create another REST service to monitor all those threads and give me a percentage of completion for the messages sent.
I'm considering calling this service to update a progress bar every 2 seconds via AJAX.
A simplified approach is to create a class to keep track of the statistics the status bar will need to display. For example:
public class MessageCreatorProgress {

    private final int totalMessagesToBeCreated;
    private final AtomicInteger successCount;
    private final AtomicInteger failureCount;

    // constructor to initialize values
    // increment methods
    // get methods
}
In the initial request which starts the threads, construct the threads with a shared instance of a MessageCreatorProgress. For example:
// endpoint method to create a bunch of messages
public String startCreatingMessages(CreateMessagesRequest request) {
    MessageCreatorProgress progress = new MessageCreatorProgress(
            request.getThreadCount() * request.getMessageCountPerThread());
    for (...) {
        new MyMessageCreator(progress, request.getSomeParameter(), ....).start();
    }
    String messageProgressId = UUID.randomUUID().toString(); // some unique value
    // Store MessageCreatorProgress in the session or some other shared memory,
    // so it can be accessed by subsequent calls.
    session.setAttribute(messageProgressId, progress);
    return messageProgressId;
}
Each MyMessageCreator instance would, for example, call progress.incrementSuccess() as a last step, or progress.incrementFailure() on an exception.
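A minimal sketch of such a worker (the class name MyMessageCreator comes from the answer above; the body and constructor arguments are assumptions):
public class MyMessageCreator extends Thread {

    private final MessageCreatorProgress progress;
    private final int messageCount;

    public MyMessageCreator(MessageCreatorProgress progress, int messageCount) {
        this.progress = progress;
        this.messageCount = messageCount;
    }

    @Override
    public void run() {
        for (int i = 0; i < messageCount; i++) {
            try {
                // send one message to MQ here
                progress.incrementSuccess();
            } catch (Exception e) {
                progress.incrementFailure();
            }
        }
    }
}
The AJAX side can then derive a percentage as (successCount + failureCount) * 100 / totalMessagesToBeCreated.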
The AJAX call passes the messageProgressId to the status endpoint which knows how to access the MessageCreatorProgress:
// endpoint method to get the message creation progress
// transform to JSON or whatever
public MessageCreatorProgress getMessageCreationProgress(String messageProgressId) {
    return (MessageCreatorProgress) session.getAttribute(messageProgressId);
}
A more complex approach is to use a database, for example when the AJAX call will not hit the same server that runs the threads creating the messages. When a thread succeeds or hits an exception, it can update a record associated with messageProgressId, and the AJAX endpoint checks the database and constructs a MessageCreatorProgress to return to the client.
As I wrote in the title, we need one thread to notify or execute a method of another thread in our project. This implementation is part of long polling. In the following text I describe and show my implementation.
So the requirements are:
UserX sends a request from the client to the server (poll action) immediately after he gets the response to the previous one. In the service, a Spring async method is executed where a thread immediately checks the cache for new data in the database. I know that a cache is usually used for methods where a specific input is expected to produce a specific output. That is not the case here, because I use the cache to reduce database calls, and the output of my method is always different. So the cache helps me store a notification of whether I should check the database or not. This check runs in a while loop which ends when the thread finds a notification in the cache to read the database, or when the time expires.
Assume that the UserX thread (poll action) is currently in the while loop, checking the cache.
At that moment, UserY (push action) sends some data to the server; the data is stored in the database in a separate thread, and the userId of the recipient is also stored in the cache.
So when UserX checks the cache, he finds the id of the recipient (the recipient's id == his own id in this case), breaks the loop, and fetches the data.
So in my implementation I use a Google Guava cache, which allows manual writes.
private static Cache<Long, Long> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();
In the create method I store the id of the user who should read the data.
public void create(Data data) {
    dataRepository.save(data);
    // Guava's Cache has put(), not save(), and it does not accept null values,
    // so a non-null marker value (here: a timestamp) is stored instead
    cache.put(data.getRecipient(), System.currentTimeMillis());
    System.out.println("SAVED " + data.getRecipient() + " in " + Thread.currentThread().getName());
}
and here is the method that polls for data:
@Async
public CompletableFuture<List<Data>> pollData(Long previousMessageId, Long userId) throws InterruptedException {
    // check the db first; if there are new data, there is no need to go into the loop and wait
    List<Data> data = findRecent(previousMessageId, userId);
    // data not found, so jump into the loop for some time
    if (data.size() == 0) {
        short c = 0;
        while (c < 100) {
            // check whether some new data were added; if yes, break the loop
            if (cache.getIfPresent(userId) != null) {
                break;
            }
            c++;
            Thread.sleep(1000);
            System.out.println("SEQUENCE: " + c + " in " + Thread.currentThread().getName());
        }
        // check the database at the end of the loop or after the break
        data = findRecent(previousMessageId, userId);
    }
    // clear the entry for that recipient and return the result
    // (Guava's Cache uses invalidate(key), not clear(key))
    cache.invalidate(userId);
    return CompletableFuture.completedFuture(data);
}
After UserX gets the response, he sends a poll request again and the whole process repeats.
Can you tell me whether this application design for long polling in Java (Spring) is correct, or whether a better way exists? The key point is that when a user makes a poll request, the request should be held back waiting for new data for some time and not answered immediately. The solution shown above works, but the question is whether it will also work for many users (1000+). I worry about it because the paused threads could slow down other requests once no threads are available in the pool. Thanks in advance for your effort.
Check out WebSockets. Spring supports them from version 4 onwards. They don't require the client to initiate polling; instead, the server pushes the data to the client in real time.
Check the below:
https://spring.io/guides/gs/messaging-stomp-websocket/
http://www.baeldung.com/websockets-spring
Note: web sockets open a persistent connection between client and server and thus may result in more resource usage in the case of a large number of users. So, if you are not looking for real-time updates and are fine with some delay, then polling might be the better approach. Also, not all browsers support web sockets.
Web Sockets vs Interval Polling
Longpolling vs Websockets
In what situations would AJAX long/short polling be preferred over HTML5 WebSockets?
In your current approach, if you are concerned about the large number of threads running on the server for multiple users, you can instead trigger the polling from the front end every time. This way, only short-lived request threads are triggered from the UI, looking for any update in the cache. If there is an update, another call can be made to retrieve the data. However, don't hit the server every other second as you are doing now, otherwise you will have high CPU utilization and user request threads may also suffer. You should do some optimization on your timing.
Instead of hitting the cache after a delay of 1 second, 100 times, you can apply a more intelligent algorithm by analyzing the pattern of cache/DB updates over a period of time.
Knowing the pattern, you can trigger the polling in an exponential back-off manner to hit the cache when an update is most likely. This way you will hit the cache less frequently and more accurately; see the sketch below.
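For example, replacing the fixed Thread.sleep(1000) inside pollData()'s while loop with a back-off loop might look like this minimal sketch (the initial delay, the cap, and the overall deadline are assumptions):
// exponential back-off while waiting for a cache notification
long delay = 250;                                     // start with 250 ms
long deadline = System.currentTimeMillis() + 100_000; // overall budget of ~100 s
while (System.currentTimeMillis() < deadline) {
    if (cache.getIfPresent(userId) != null) {
        break;                                        // notification arrived
    }
    Thread.sleep(delay);
    delay = Math.min(delay * 2, 8_000);               // double the delay, capped at 8 s
}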
I had a need to limit the connection rate (in my servlet) to a certain external service, and I decided to give ScheduledExecutorService a try. The scheduling itself seems to function just fine, but the output gets printed only occasionally; in most cases nothing is output at all. Why does this happen? I'm using Tomcat 7 as a test server.
int waitingtimeinmilliseconds = 5000;

ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
ScheduledFuture<?> scheduledFuture = scheduledExecutorService.schedule(new Runnable() {
    public void run() {
        Fetcher fetcher = new Fetcher(loginname, password);
        List<Item> items = fetcher.fetchItems();
        // do something with the results
        // ServletOutputStream
        out.print("teststring" + items.size());
    }
}, waitingtimeinmilliseconds, TimeUnit.MILLISECONDS);
scheduledExecutorService.shutdown();
You'll find a very exhaustive description of what is causing your problem in: HttpServletResponse seems to periodically send prematurely (also check: starting a new thread in servlet).
Basically, you cannot use external threads to write to the servlet output. Once you leave doGet()/doPost(), the servlet container assumes you are done and discards the output after flushing it to the client. But since you are writing to the stream asynchronously, sometimes the output gets through, while other times it gets discarded.
If you want your rate limiting to be very scalable, consider async servlets (available since Servlet 3.0). If you just want to throttle some clients, RateLimiter from Guava will work for you1.
1 - see RateLimiter - discovering Google Guava on my blog.
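A minimal sketch of the Guava option (the servlet class and field names are assumptions; Fetcher, loginname, and password come from the question and are defined elsewhere): do the fetch synchronously inside doGet(), gated by a shared RateLimiter, so the response stream is still valid when you write to it. RateLimiter.create(0.2) allows about one call every 5 seconds, mirroring the 5000 ms wait above.
import java.io.IOException;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.common.util.concurrent.RateLimiter;

public class FetchServlet extends HttpServlet {

    // shared limiter: roughly one external call every 5 seconds
    private final RateLimiter limiter = RateLimiter.create(0.2);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        limiter.acquire(); // block until we may call the external service
        Fetcher fetcher = new Fetcher(loginname, password); // as in the question
        List<Item> items = fetcher.fetchItems();
        resp.getWriter().print("teststring" + items.size());
    }
}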