I have a requirement where I read a bunch of rows (thousands) from a SQL DB using Spring Batch and call a REST Service to enrich content before writing them on a Kafka topic.
When using the Spring reactive WebClient, how do I limit the number of active non-blocking service calls? Should I somehow introduce a Flux in the loop after I read data using Spring Batch?
(I understand the usage of delayElements and that it serves a different purpose, as when a single GET service call brings in a lot of data and you want the server to slow down -- here though, my use case is a bit different in that I have many WebClient calls to make and would like to limit the number of calls to avoid out-of-memory issues but still gain the advantages of non-blocking invocations).
Very interesting question. I pondered about it and I thought of a couple of ideas on how this could be done. I will share my thoughts on it and hopefully there are some ideas here that perhaps help you with your investigation.
Unfortunately, I'm not familiar with Spring Batch. However, this sounds like a problem of rate limiting, or the classical producer-consumer problem.
So, we have a producer that produces so many messages that our consumer cannot keep up, and the buffering in the middle becomes unbearable.
The problem I see is that your Spring Batch process, as you describe it, is not working as a stream or pipeline, but your reactive Web client is.
So, if we were able to read the data as a stream, then as records start getting into the pipeline those would get processed by the reactive web client and, using back-pressure, we could control the flow of the stream from producer/database side.
The Producer Side
So, the first thing I would change is how records get extracted from the database. We need to control how many records are read from the database at a time, either by paging the data retrieval or by controlling the fetch size, and then use back pressure to control how many of those are sent downstream through the reactive pipeline.
So, consider the following (rudimentary) database data retrieval, wrapped in a Flux.
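import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;
import reactor.core.publisher.Flux;

Flux<String> getData(DataSource ds) {
    return Flux.create(sink -> {
        try {
            Connection con = ds.getConnection();
            con.setAutoCommit(false);
            // Forward-only, read-only cursor. The two-argument overload
            // prepareStatement(String, int) sets autoGeneratedKeys, not the result set type.
            PreparedStatement stm = con.prepareStatement(
                    "SELECT order_number FROM orders WHERE order_date >= '2018-08-12'",
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stm.setFetchSize(1000);
            ResultSet rs = stm.executeQuery();

            sink.onRequest(batchSize -> {
                try {
                    // batchSize is a long (it can be Long.MAX_VALUE), so count with a long
                    for (long i = 0; i < batchSize; i++) {
                        if (!rs.next()) {
                            //no more data, close resources!
                            rs.close();
                            stm.close();
                            con.close();
                            sink.complete();
                            break;
                        }
                        sink.next(rs.getString(1));
                    }
                } catch (SQLException e) {
                    //TODO: close resources here
                    sink.error(e);
                }
            });
        }
        catch (SQLException e) {
            //TODO: close resources here
            sink.error(e);
        }
    });
}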
In the example above:
I control the amount of records we read per batch to be 1000 by setting a fetch size.
The sink will send the amount of records requested by the subscriber (i.e. batchSize) and then wait for it to request more using back pressure.
When there are no more records in the result set, then we complete the sink and close resources.
If an error occurs at any point, we send back the error and close resources.
Alternatively I could have used paging to read the data, probably simplifying the handling of resources by having to reissue a query at every request cycle.
You may consider also doing something if subscription is cancelled or disposed (sink.onCancel, sink.onDispose) since closing the connection and other resources is fundamental here.
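The paging alternative mentioned above can be sketched with plain JDK types. This is only an illustration: `PagedReader` and its `fetchPage` function are hypothetical names, and in a real job `fetchPage` would reissue the query with an offset and a limit.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

/** Lazily iterates over rows, fetching one page at a time (hypothetical helper). */
final class PagedReader<T> implements Iterator<T> {
    // (offset, limit) -> rows; stands in for a paged SQL query
    private final BiFunction<Integer, Integer, List<T>> fetchPage;
    private final int pageSize;
    private int offset = 0;
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false;

    PagedReader(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page only when the current one is used up.
        while (!current.hasNext() && !exhausted) {
            List<T> page = fetchPage.apply(offset, pageSize);
            offset += page.size();
            exhausted = page.size() < pageSize; // a short page means no more rows
            current = page.iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

No connection stays open between pages, which is what simplifies the resource handling, at the price of one query per page.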
The Consumer Side
On the consumer side you register a subscriber that requests messages in batches of 1000, and only requests more once it has processed the current batch.
getData(source).subscribe(new BaseSubscriber<String>() {
    private int messages = 0;

    @Override
    protected void hookOnSubscribe(Subscription subscription) {
        subscription.request(1000);
    }

    @Override
    protected void hookOnNext(String value) {
        //make http request
        System.out.println(value);
        messages++;
        if (messages % 1000 == 0) {
            //when we're done with a batch
            //then we're ready to request for more
            upstream().request(1000);
        }
    }
});
In the example above, when subscription starts it requests the first batch of 1000 messages. In the onNext we process that first batch, making http requests using the Web client.
Once the batch is complete, then we request another batch of 1000 from the publisher, and so on and so on.
And there you have it! Using back pressure, you control how many open HTTP requests you have at a time.
My example is very rudimentary and it will require some extra work to make it production-ready, but I hope it offers some ideas that can be adapted to your Spring Batch scenario.
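If the HTTP calls themselves are non-blocking, another way to cap the number in flight is to hand out a fixed number of permits. The sketch below uses only the JDK: `enrich` is a hypothetical stand-in for a WebClient call, and the counters exist only to make the bound observable. In Reactor, the `flatMap` overload that takes a concurrency argument serves a similar purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

final class BoundedAsyncCalls {
    // Instrumentation only: highest number of calls seen in flight at once.
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger maxInFlight = new AtomicInteger();

    /** Hypothetical stand-in for a non-blocking service call. */
    static CompletableFuture<String> enrich(String orderNumber) {
        return CompletableFuture.supplyAsync(() -> {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(5); // simulated latency
                return orderNumber + "-enriched";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } finally {
                inFlight.decrementAndGet();
            }
        });
    }

    /** Starts calls for all orders, but never more than maxConcurrent at once. */
    static List<String> enrichAll(List<String> orders, int maxConcurrent) {
        Semaphore permits = new Semaphore(maxConcurrent);
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String order : orders) {
            // Only the producing thread waits here; the calls stay asynchronous.
            permits.acquireUninterruptibly();
            pending.add(enrich(order).whenComplete((result, error) -> permits.release()));
        }
        return pending.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }
}
```

The permit is released when the call completes, whether it succeeded or failed, so a failing service does not slowly drain the pool of permits.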
Related
Background
I'm using Spring Batch to fetch data from our customer sites through an HTTP API. The process consists of 2 main steps:
Fetch the total number of documents from the API, then calculate the total number of pages using a configurable page size. Each page is assigned to one partition step using a custom Partitioner.
A partition step sends a request to fetch a page of data (a list of documents), then processes it and writes it to our storage.
Customer sites might be "fragile". They could have a rate limit, or they might stop responding after some heavy requests.
What I have done so far
I'm using spring-retry to re-run a request that failed because of a rate limit or a server error. For example:
// the partition step's item reader
@StepScope
public class CustomItemReader implements ItemReader<Object> { // ItemReader is an interface
    private List<Object> items;

    @Override
    public Object read() {
        if (Objects.isNull(items)) {
            this.items = ImportService.getPage(pageId);
        }
        if (Objects.nonNull(items) && !items.isEmpty()) {
            return items.remove(0);
        }
        return null;
    }
}

// config retry for fetching function
public class ImportService {
    @Retryable(
        value = RetryableException.class,
        maxAttempts = 3,
        backoff = @Backoff(
            delay = 1000
        )
    )
    public static List<Object> getPage(String pageId) throws RetryableException {
        return ...;
    }
}
The retry config contains a Backoff policy with a delay of 1000 ms between attempts. I used this Retryable to handle both retry and rate limit.
Problem
Retryable will repeatedly wait and re-execute the function, which holds the thread the whole time. The instance might crash when things get bigger.
Because each customer has its own rate limit, using Retryable with Backoff is not an ideal way to control the rate. Even though I configure core_pool_size for each customer site, core_pool_size=1 is not enough for some.
Question
Is there a proper way to throttle the execution rate of Spring Batch, especially with partitioning? For example, I want to configure it to send 2 requests in 10 seconds, and this cannot be achieved by sleeping in a step listener.
I have used scrapy for some crawlers, and it has pretty cool retry and rate limit features. With RetryMiddleware, it will enqueue the failed pages and has a RETRY_LIMIT in settings. With AutoThrottle, it can automatically throttle speed based on the load on the server. Is there any way to achieve those kinds of features in Spring Batch? Or do I have to rewrite my project with scrapy?
Thank you very much!
Spring Batch does not provide such features. But you can use any rate limiting library where appropriate during the step (i.e. before/after reading data, before/after processing or writing data, etc.).
This should help: Spring batch writer throttling.
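To make the idea concrete, here is a minimal sliding-window limiter for the "2 requests in 10 seconds" case, using only the JDK. It is a sketch, not a Spring Batch feature: `SlidingWindowLimiter` is a hypothetical class, and where its `acquire()` gets called (for example at the start of the reader's `read()`) is an assumption about your job. A library such as Guava's `RateLimiter` could fill the same role.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Allows at most maxCalls within any window of windowMillis (hypothetical helper). */
final class SlidingWindowLimiter {
    private final int maxCalls;
    private final long windowMillis;
    private final Deque<Long> stamps = new ArrayDeque<>(); // times of recent calls

    SlidingWindowLimiter(int maxCalls, long windowMillis) {
        this.maxCalls = maxCalls;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns 0 if a call may proceed at nowMillis (and records it);
     * otherwise returns how many milliseconds the caller should wait.
     */
    synchronized long tryAcquire(long nowMillis) {
        // Forget calls that have left the window.
        while (!stamps.isEmpty() && nowMillis - stamps.peekFirst() >= windowMillis) {
            stamps.pollFirst();
        }
        if (stamps.size() < maxCalls) {
            stamps.addLast(nowMillis);
            return 0;
        }
        return windowMillis - (nowMillis - stamps.peekFirst());
    }

    /** Blocks the calling thread until a call is allowed. */
    void acquire() {
        long wait;
        while ((wait = tryAcquire(System.currentTimeMillis())) > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a permit", e);
            }
        }
    }
}
```

One limiter instance per customer site, for example `new SlidingWindowLimiter(2, 10_000)`, keeps each site's rate independent of the thread pool size.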
As I wrote in the title, in our project we need one thread to notify another thread, or to trigger a method on it. This implementation is part of long polling. Below I describe and show my implementation.
So the requirements are:
UserX sends a request from the client to the server (poll action) immediately after he gets the response to the previous one. In the service, a Spring async method is executed in which the thread immediately checks the cache to see whether there are new data in the database. I know that a cache is usually used for methods where a specific output is expected for a specific input. That is not the case here: I use the cache to reduce database calls, and the output of my method is always different. So the cache helps me store a notification telling me whether I should check the database or not. This check runs in a while loop, which ends when the thread finds a notification in the cache telling it to read the database, or when the time expires.
Assume that the UserX thread (poll action) is currently in the while loop, checking the cache.
At that moment UserY (push action) sends some data to the server; the data are stored in the database in a separate thread, and the recipient's userId is also stored in the cache.
So when UserX checks the cache, he finds the recipient's id (which equals his own id in this case), breaks out of the loop, and fetches the data.
In my implementation I use the Google Guava cache, which supports manual writes.
private static Cache<Long, Long> cache = CacheBuilder.newBuilder()
.maximumSize(100)
.expireAfterWrite(5, TimeUnit.MINUTES)
.build();
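In the create method I store the id of the user who should read the data.
public void create(Data data) {
    dataRepository.save(data);
    // Guava's Cache has put(), not save(), and rejects null values,
    // so the recipient id itself is stored as the marker
    cache.put(data.getRecipient(), data.getRecipient());
    System.out.println("SAVED " + data.getRecipient() + " in " + Thread.currentThread().getName());
}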
And here is the method that polls for data:
@Async
public CompletableFuture<List<Data>> pollData(Long previousMessageId, Long userId) throws InterruptedException {
    // check the db first; if there are new data there is no need to enter the loop and wait
    List<Data> data = findRecent(previousMessageId, userId);
    // data not found, so wait in the loop for some time
    if (data.isEmpty()) {
        short c = 0;
        while (c < 100) {
            // check whether new data were added; if so, break out of the loop
            if (cache.getIfPresent(userId) != null) {
                break;
            }
            c++;
            Thread.sleep(1000);
            System.out.println("SEQUENCE: " + c + " in " + Thread.currentThread().getName());
        }
        // check the database at the end of the loop or after breaking out of it
        data = findRecent(previousMessageId, userId);
    }
    // clear the notification for that recipient and return the result
    cache.invalidate(userId);
    return CompletableFuture.completedFuture(data);
}
After UserX gets the response he sends the poll request again, and the whole process repeats.
Can you tell me whether this application design for long polling in Java (Spring) is correct, or whether there is a better way? The key point is that when a user sends a poll request, the request should be held for some time waiting for new data rather than answered immediately. The solution shown above works, but the question is whether it will also work for many users (1000+). I worry about it because the paused threads could slow down other requests once no threads are available in the pool. Thanks in advance for your effort.
Check Web Sockets. Spring has supported them from version 4 onwards. They don't require the client to initiate polling; instead the server pushes the data to the client in real time.
Check the below:
https://spring.io/guides/gs/messaging-stomp-websocket/
http://www.baeldung.com/websockets-spring
Note - web sockets open a persistent connection between client and server and thus may result in more resource usage with a large number of users. So, if you are not looking for real-time updates and are fine with some delay, then polling might be a better approach. Also, not all browsers support web sockets.
Web Sockets vs Interval Polling
Longpolling vs Websockets
In what situations would AJAX long/short polling be preferred over HTML5 WebSockets?
In your current approach, if you are having a concern with large number of threads running on server for multiple users then you can trigger the polling from front-end every time instead. This way only short lived request threads will be triggered from UI looking for any update in the cache. If there is an update, another call can be made to retrieve the data. However don't hit the server every other second as you are doing otherwise you will have high CPU utilization and user request threads may also suffer. You should do some optimization on your timing.
Instead of hitting the cache after a delay of 1 sec for 100 times, you can apply an intelligent algorithm by analyzing the pattern of cache/DB update over a period of time.
By knowing the pattern, you can trigger the polling in an exponential back off manner to hit the cache when the update is most likely expected. This way you will be hitting the cache less frequently and more accurately.
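As a rough illustration of the exponential back-off idea, the delay before each cache check can be computed as below. The class name, base delay and cap are all hypothetical; real values would come from the update pattern you observe.

```java
/** Computes capped exponential back-off delays (hypothetical helper). */
final class BackoffSchedule {
    /**
     * Delay before the given 0-based attempt: baseMillis * 2^attempt,
     * never more than maxMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid back-off parameters");
        }
        if (attempt >= 62) {
            return maxMillis; // 2^attempt would overflow a long
        }
        long factor = 1L << attempt;
        // Compare by division so the multiplication below cannot overflow.
        if (baseMillis > maxMillis / factor) {
            return maxMillis;
        }
        return baseMillis * factor;
    }
}
```

With a base of 1000 ms and a cap of 30000 ms the checks are spaced 1 s, 2 s, 4 s, 8 s, 16 s and then 30 s apart, instead of one check every second.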
I have a requirement to process a list of large number of users daily to send them email and SMS notifications based on some scenario. I am using Java EE batch processing model for this. My Job xml is as follows:
<step id="sendNotification">
    <chunk item-count="10" retry-limit="3">
        <reader ref="myItemReader"></reader>
        <processor ref="myItemProcessor"></processor>
        <writer ref="myItemWriter"></writer>
        <retryable-exception-classes>
            <include class="java.lang.IllegalArgumentException"/>
        </retryable-exception-classes>
    </chunk>
</step>
MyItemReader's onOpen method reads all users from database, and readItem() reads one user at a time using list iterator. In myItemProcessor, the actual email notification is sent to user, and then the users are persisted in database in myItemWriter class for that chunk.
@Named
public class MyItemReader extends AbstractItemReader {
    private Iterator<User> iterator = null;
    private User lastUser;

    @Inject
    private MyService service;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        super.open(checkpoint);
        List<User> users = service.getUsers();
        iterator = users.iterator();
        if (checkpoint != null) {
            User checkpointUser = (User) checkpoint;
            System.out.println("Checkpoint Found: " + checkpointUser.getUserId());
            // advance past every user up to and including the checkpointed one
            while (iterator.hasNext() && !iterator.next().getUserId().equals(checkpointUser.getUserId())) {
                System.out.println("skipping already read users ... ");
            }
        }
    }

    @Override
    public Object readItem() throws Exception {
        User user = null;
        if (iterator.hasNext()) {
            user = iterator.next();
            lastUser = user;
        }
        return user;
    }

    @Override
    public Serializable checkpointInfo() throws Exception {
        return lastUser;
    }
}
My problem is that the checkpoint stores the last record that was processed in the previous chunk. If I have a chunk with the next 10 users, and an exception is thrown in myItemProcessor for the 5th user, then on retry the whole chunk will be executed and all 10 users will be processed again. I don't want the notification to be sent again to the users that were already processed.
Is there a way to handle this? How should this be done efficiently?
Any help would be highly appreciated.
Thanks.
I'm going to build on the comments from @cheng. My credit to him here, and hopefully my answer provides additional value in organizing and presenting the options usefully.
Answer: Queue up messages for another MDB to get dispatched to send emails
Background:
As @cheng pointed out, a failure means the entire transaction is rolled back, and the checkpoint doesn't advance.
So how to deal with the fact that your chunk has sent emails to some users but not all? (You might say it rolled back but with "side effects".)
So we could restate your question then as: How to send email from a batch chunk step?
Well, assuming you had a way to send emails through a transactional API (implementing XAResource, etc.) you could use that API.
Assuming you don't, I would do a transactional write to a JMS queue, and then send the emails with a separate MDB (as @cheng suggested in one of his comments).
Suggested Alternative: Use ItemWriter to send messages to JMS queue, then use separate MDB to actually send the emails
With this approach you still gain efficiency by batching the processing and the updates to your DB (you were only sending the emails one at a time anyway), and you can benefit from simple checkpointing and restart without having to write complicated error handling.
This is also likely to be reusable as a pattern across batch jobs and outside of batch even.
Other alternatives
Some other ideas that I don't think are as good, listed for the sake of discussion:
Add batch application logic tracking users emailed (with ItemProcessListener)
You could build your own list of either/both successful/failed emails using the ItemProcessListener methods: afterProcess and onProcessError.
On restart, then, you could know which users had been emailed in the current chunk, which we are re-positioned to since the entire chunk rolled back, even though some emails have already been sent.
This certainly complicates your batch logic, and you also have to persist this success or failure list somehow. Plus this approach is probably highly specific to this job (as opposed to queuing up for an MDB to process).
But it's simpler in that you have a single batch job without the need for a messaging provider and a separate app component.
If you go this route you might want to use a combination of both a skippable and a "no-rollback" retryable exception.
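The bookkeeping behind this option can be sketched as follows. It is a simplified stand-in, not the batch API: `SentTracker` is a hypothetical class, the `mailer` callback represents the actual email send, and the in-memory set would have to be persisted outside the chunk transaction to survive a rollback or restart.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Remembers which users were already emailed (hypothetical helper). */
final class SentTracker {
    // Assumption: a real job would persist this outside the chunk transaction.
    private final Set<String> sent = new HashSet<>();

    /**
     * Emails the user unless that already happened.
     * Returns true if an email was sent by this call.
     */
    boolean sendOnce(String userId, Consumer<String> mailer) {
        if (sent.contains(userId)) {
            return false; // emailed during an earlier, rolled-back attempt
        }
        mailer.accept(userId); // if this throws, the user is not marked as sent
        sent.add(userId);
        return true;
    }
}
```

When the chunk is retried after a failure on the 5th user, the first four calls return false and no duplicate email goes out.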
single-item chunk
If you define your chunk with item-count="1", then you avoid complicated checkpointing and error handling code. You sacrifice efficiency though, so this would only make sense if the other aspects of batch were very compelling: e.g. scheduling and management of jobs through a common interface, or the ability to restart at the failing step within a job.
If you were to go this route, you might want to consider defining socket and timeout exceptions as "no-rollback" exceptions (using ) since there's nothing to be gained from rolling back, and you might want to retry on a network timeout issue.
Since you specifically mentioned efficiency, I'm guessing this is a bad fit for you.
use a Transaction Synchronization
This could work perhaps, but the batch API doesn't especially make this easy, and you still could have a case where the chunk completes but one or more email sends fail.
Your current item processor is doing something outside the chunk transaction scope, which has caused the application state to be out of sync. If your requirement is to send out emails only after all items in a chunk have successfully completed, then you can move the emailing part to an ItemWriteListener.afterWrite(items).
I am using the Play Framework (2.4) for Java and connecting it to Postgres. Play is being used as a RESTful service, and all it does is inserts, updates, reads and deletes using JDBC. On this Play page https://www.playframework.com/documentation/2.3.x/JavaAsync it states clearly that JDBC is blocking and that Play has few threads. For the people who know about this: how limiting could this be, and is there some way I can work around it? My specific app can have a few hundred database calls per second. I will have all the hardware and extra servers, but I do not know how Play can handle this or scale to handle it in the code. My code in Play looks like this:
public static Result myprofile() {
    DynamicForm requestData = Form.form().bindFromRequest();
    Integer id = Integer.parseInt(requestData.get("id"));
    try {
        JSONObject jo = null;
        Connection conn = DB.getConnection();
        ResultSet rs;
        JSONArray ja = new JSONArray();
        PreparedStatement ps = conn.prepareStatement("SELECT p.fullname as fullname, s.post as post,to_char(s.created_on, 'MON DD,YYYY') as created_on,s.last_reply as last_reply,s.id as id,s.comments as comments,s.state as state,s.city as city,s.id as id FROM profiles as p INNER JOIN streams as s ON (s.profile_id=p.id) WHERE s.profile_id=? order by created_on desc");
        ps.setInt(1, id);
        rs = ps.executeQuery();
        while (rs.next()) {
            jo = new JSONObject();
            jo.put("fullname", rs.getString("fullname"));
            jo.put("post", rs.getString("post"));
            jo.put("city", rs.getString("city"));
            jo.put("state", rs.getString("state"));
            jo.put("comments", rs.getInt("comments"));
            jo.put("id", rs.getInt("id"));
            jo.put("last_reply", difference(rs.getInt("last_reply"), rs.getString("created_on")));
            ja.put(jo);
        }
        JSONObject mainObj = new JSONObject();
        mainObj.put("myprofile", ja);
        String total = mainObj.toString();
        System.err.println(total);
        conn.close();
        return ok(total);
    } catch (Exception e) {
        e.getMessage();
    }
    return ok();
}
I also know that I can try to wrap that in a futures promise however the blocking still occurs. As stated before I will have all the servers and the other stuff taken care of, but would the play framework be able to scale to hundreds of requests per second using jdbc? I am asking and learning now to avoid serious mistakes later on.
Play can absolutely handle this load.
The documentation states that blocking code should be avoided inside controller methods - the default configuration is tuned for them to have asynchronous execution. If you stick some blocking calls in there, your controller will now be waiting for that call to finish before it can process another incoming request - this is bad.
You can’t magically turn synchronous IO into asynchronous by wrapping
it in a Promise. If you can’t change the application’s architecture to
avoid blocking operations, at some point that operation will have to
be executed, and that thread is going to block. So in addition to
enclosing the operation in a Promise, it’s necessary to configure it
to run in a separate execution context that has been configured with
enough threads to deal with the expected concurrency. See
Understanding Play thread pools for more information.
https://www.playframework.com/documentation/2.4.x/JavaAsync#Make-controllers-asynchronous
I believe you are aware of this but I wanted to point out the bolded section. Your database has a limited number of threads that are available for applications to make calls on - it may be helpful to track this number down, create a new execution context that is tuned for these threads, and assign that new execution context to a promise that wraps your database call.
Check out this post about application tuning for Play; it should give you an idea of what this looks like. I believe he is using Akka Actors, something that might be out of scope for you, but the idea for thread tuning is the same:
Play 2 is optimized out-of-the-box for HTTP requests which don’t
contain blocking calls (i.e. asynchronous). Most database-driven apps
in Java use synchronous calls via JDBC so Play 2 needs a bit of extra
configuration to tune Akka for these types of requests.
http://www.jamesward.com/2012/06/25/optimizing-play-2-for-database-driven-apps
If you try to execute a massive number of requests on the database without tuning the threads, you run the risk of starving the rest of your application of threads, which will halt your application. For the load you are expecting, the default tuning might be OK, but it is worth doing some additional investigation.
Getting started with thread tuning:
https://www.playframework.com/documentation/2.4.x/ThreadPools
You should update your controller to return Promise and there is also no reason to make it static anymore with Play 2.4. https://www.playframework.com/documentation/2.4.x/Migration24#Routing
Define an execution context in the application.conf with name "jdbc-execution-context"
//reference to context
ExecutionContext jdbcExecutionContext = Akka.system().dispatchers()
        .lookup("jdbc-execution-context");

return promise(() -> {
    //db call
}, jdbcExecutionContext)
        .map(callResult -> ok(callResult));
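The same idea, blocking work confined to its own bounded pool, can be shown with plain JDK classes. This is not Play's API: `JdbcExecutor`, the `jdbc-worker` thread name and the pool size of 10 are assumptions made for the sketch, and in practice the size would be matched to the database connection pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class JdbcExecutor {
    // Hypothetical size; align it with the number of DB connections available.
    static final int POOL_SIZE = 10;

    // Dedicated pool, so blocked JDBC calls cannot starve request-handling threads.
    static final ExecutorService JDBC_POOL = Executors.newFixedThreadPool(POOL_SIZE, runnable -> {
        Thread thread = new Thread(runnable, "jdbc-worker");
        thread.setDaemon(true); // do not keep the JVM alive for idle workers
        return thread;
    });

    /** Runs a blocking call on the dedicated pool and returns a future for its result. */
    static <T> CompletableFuture<T> runBlocking(Supplier<T> blockingCall) {
        return CompletableFuture.supplyAsync(blockingCall, JDBC_POOL);
    }
}
```

The call still blocks a thread, as the quoted documentation says, but it is a thread from a pool sized for that purpose rather than one of the few threads serving requests.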
For some reason we plan to use a kestrel queue in our project. We built some demos, and the main problem is how to fetch data from the queue efficiently and with low CPU utilization. Our current approach is this: if we fail to fetch data from the queue more than 5 times, we sleep the thread for 100 ms to reduce the CPU utilization.
while (running) {
    try {
        LoginLogQueueEntry data = kestrelQueue.fetch();
        if (null != data && data.isLegal()) {
            entryCacheList.add(data); //add the data to the local cache
            resetStatus();
        } else {
            failedCount++;
            //if there is no data in the kestrel and the local cache is not empty, insert the data into mysql database
            if (failedCount == 1 && !entryCacheList.isEmpty()) {
                resetStatus();
                insertLogList(entryCacheList); // insert current data into database
                entryCacheList.clear(); //empty local cache
            }
            if (failedCount >= 5 && entryCacheList.isEmpty()) {
                //fail 5 times. Sleep current thread.
                failedCount = 0;
                Thread.sleep((sleepTime + MIN_SLEEP_TIME) % MAX_SLEEP_TIME);
            }
        }
        //Insert 1000 rows once
        if (entryCacheList.size() >= 1000) {
            insertLogList(entryCacheList);
            entryCacheList.clear();
        }
    } catch (Exception e) {
        logger.warn(e.getMessage());
    }
}
Is there a better way to do this? Ideally, I think, the queue would notify the worker when data is available so that the worker can fetch it.
See the "Blocking Fetches" section at http://robey.lag.net/2008/11/27/scarling-to-kestrel.html
Blocking reads are described here, under "Memcache commands": https://github.com/robey/kestrel/blob/master/docs/guide.md
You can add option flags to a get command by separating them with slashes, so to fetch an item from the "jobs" queue, waiting up to one second:
get jobs/t=1000
If nothing shows up on the queue in one second, you'll get the same empty response, just one second later than you're getting it now. :)
It's important to tune your response timeout when you do this. If you use a blocking read with a timeout of one second, but your client library's response timeout is 500 milliseconds, the library will disconnect from the server before the blocking read is finished. So make sure the response timeout is greater than the timeout you're using in the read request.
You need to use a blocking get. I couldn't track down the API docs, but I found an article suggesting that it's possible in kestrel.
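The difference between a blocking get with a timeout and a fetch-then-sleep loop can be illustrated with the JDK's BlockingQueue. This is an analogy only, not the Kestrel client API; `BlockingBatchFetcher` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BlockingBatchFetcher {
    /**
     * Collects up to maxBatch items. Each fetch waits up to timeoutMillis;
     * when a wait times out, whatever was collected so far is returned.
     */
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch, long timeoutMillis) {
        List<T> batch = new ArrayList<>();
        try {
            while (batch.size() < maxBatch) {
                // The thread is parked while waiting, so an empty queue costs no CPU.
                T item = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (item == null) {
                    break; // timed out: the queue is idle, flush what we have
                }
                batch.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return batch;
    }
}
```

A worker loop would call `nextBatch(queue, 1000, 1000)` and insert the returned list whenever it is non-empty, which removes the failure counter and the manual sleep.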