MongoDB: How to stop and resume change stream - java

How do I stop a MongoDB change stream temporarily and resume it again?
public Flux<Example> watch() {
    final ChangeStreamOptions changeStreamOptions = ChangeStreamOptions.builder()
            .returnFullDocumentOnUpdate()
            .build();
    return reactiveMongoTemplate.changeStream("collection", changeStreamOptions, Example.class)
            .filter(e -> e.getOperationType() != null)
            .mapNotNull(ChangeStreamEvent::getBody);
}
I'm trying to create a REST endpoint that can stop the change stream temporarily while we do some database maintenance, and then be invoked again to resume the stream from where it left off using the resume token.

I found this solution to unsubscribe from/stop the change stream:
Disposable subscription = service.watch()
        .subscribe(exampleService::doSomething);
// cancel the subscription
subscription.dispose();

I am not a MongoDB expert, but this is what I understood from one; I hope I got it right. I am using the plain Java driver API for easier readability:
// 1. Open, consume, close, save token
MongoCursor<ChangeStreamDocument<Document>> cursor = inventoryCollection.watch().iterator();
ChangeStreamDocument<Document> next = cursor.next();
BsonDocument resumeToken = next.getResumeToken();
cursor.close();
// 2. Save the resume token in the database, in case your process goes down for any
//    reason during your pause. Otherwise, you will not know where to start resuming.
...
// 3. When you want to reopen, start again from the resume token saved in the DB
cursor = inventoryCollection.watch().resumeAfter(resumeToken).iterator();
The time window from the moment you receive the event until you save its token should be very small, but the process may still crash before you save the continuation _id. If you have operations that are sensitive to that window, make them idempotent so that replaying an already-received event does not affect your data.
It would have been nice for the Mongo server to keep track of the current offset for all change streams and to uniquely identify clients. That is not possible today, which is why Mongo provides and asks for the resume token.
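Tying that back to the reactive API from the question, here is a minimal sketch of a resumable watch, assuming your Spring Data version exposes resumeToken(...) on the options builder and getResumeToken() on the event (worth verifying for your version); tokenStore is a hypothetical component that persists the token:
public Flux<Example> resumeWatch(BsonValue savedToken) {
    ChangeStreamOptions options = ChangeStreamOptions.builder()
            .returnFullDocumentOnUpdate()
            .resumeToken(savedToken) // continue right after the last processed event
            .build();
    return reactiveMongoTemplate.changeStream("collection", options, Example.class)
            .doOnNext(event -> tokenStore.save(event.getResumeToken())) // hypothetical store
            .mapNotNull(ChangeStreamEvent::getBody);
}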

Related

Spring boot app and what approach to use to download bulk data

I have a Spring Boot application with a basic REST API.
My question is: what should we use to download bulk data? What is the preferred way to download bulk data without running out of memory? Let's suppose we have 10 million records.
Here are some approaches, but I'm not sure about them:
1. download with PipedInputStream, where the data are written with PipedOutputStream in a separate thread. Is this fine, or is it a bad choice?
2. download with ByteArrayOutputStream, where the data are written into a temp file in a separate thread and are ready to download once finished. We can mark this operation with flags for the end user, e.g. DOWNLOAD_ACTIVE, DOWNLOAD_DONE. The user initiates the download, gets back the flag DOWNLOAD_ACTIVE, pings the server until the response flag is DOWNLOAD_DONE, and then sends a request to download the data.
Summary of approach 2:
1. initiate a request to download data - ACTIVE state
2. ping the server; the server returns the current state - ACTIVE or DONE
3. if the state is DONE, the user initiates the final request to download the data
Thanks
You can use the second approach, which prepares the data in the background; once it's ready, you can download it:
Send a request to prepare the data. The server responds with a UUID.
The server starts preparing the file in the background and keeps a Map whose key is the new UUID and whose value is the status ACTIVE.
The client saves the UUID and checks the server at a certain interval by passing the UUID.
Once the server finishes the task, it updates the Map entry for the given UUID to status DONE.
The next status request then returns DONE, and the UI sends another request to download the file.
The above approach only works as long as you don't refresh the page, since a page refresh clears the UUID and you have to start over.
To survive refreshes and cross-logins, use a database table instead of the Map: store the username along with the other information and inform the user once the file is ready.
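A minimal sketch of that flow as a Spring MVC controller; the endpoint paths, the in-memory job map, and the helpers writeRecordsToTempFile/tempFileFor are illustrative assumptions:
import java.nio.file.Path;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.core.io.FileSystemResource;
import org.springframework.core.io.Resource;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
public class BulkDownloadController {

    private final Map<UUID, String> jobs = new ConcurrentHashMap<>(); // UUID -> ACTIVE / DONE
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    // Step 1: client asks the server to prepare the data and receives a UUID.
    @PostMapping("/downloads")
    public UUID initiate() {
        UUID id = UUID.randomUUID();
        jobs.put(id, "ACTIVE");
        executor.submit(() -> {
            writeRecordsToTempFile(id); // hypothetical: stream the 10M records to a temp file
            jobs.put(id, "DONE");
        });
        return id;
    }

    // Step 2: client pings this endpoint until it returns DONE.
    @GetMapping("/downloads/{id}/status")
    public String status(@PathVariable UUID id) {
        return jobs.getOrDefault(id, "UNKNOWN");
    }

    // Step 3: client downloads the prepared file.
    @GetMapping("/downloads/{id}")
    public ResponseEntity<Resource> download(@PathVariable UUID id) {
        Path file = tempFileFor(id); // hypothetical lookup of the prepared temp file
        return ResponseEntity.ok()
                .contentType(MediaType.APPLICATION_OCTET_STREAM)
                .body(new FileSystemResource(file));
    }

    private void writeRecordsToTempFile(UUID id) { /* stream rows to disk in batches */ }

    private Path tempFileFor(UUID id) {
        throw new UnsupportedOperationException("resolve the temp file for this job");
    }
}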

How to scale more than 1 instance and deal with scheduled task in spring?

I have push notifications being sent to Android and iOS applications through Spring Boot every day at 8 AM Europe/Paris.
If I run multiple instances, the notifications will be sent multiple times. I am thinking of recording the sent notifications in the database and checking against them, but I am worried it would still run multiple times. This is what I am doing:
@Component
public class ScheduledTasks {

    private static final Logger log = LoggerFactory.getLogger(ScheduledTasks.class);
    private static final SimpleDateFormat dateFormat = new SimpleDateFormat("HH:mm:ss");

    @Autowired
    private ExpoPushTokenRepository expoPushTokenRepository;

    @Autowired
    private ExpoPushNotificationService expoPushNotificationService;

    @Autowired
    private MessageSource messageSource;

    // TODO: if instances > 1, this will run multiple times; save the sent
    // notifications to the database and prevent multiple sending.
    @Scheduled(cron = "${cron.promotions.notification}", zone = "Europe/Paris")
    public void sendNewPromotionsNotification() {
        List<ExpoPushToken> expoPushTokenList = expoPushTokenRepository.findAll();
        ArrayList<NotifyRequest> notifyRequestList = new ArrayList<>();
        for (ExpoPushToken expoPushToken : expoPushTokenList) {
            NotifyRequest notifyRequest = new NotifyRequest(
                    expoPushToken.getToken(),
                    "This is a test title",
                    "This is a test subtitle",
                    "This is a test body"
            );
            notifyRequestList.add(notifyRequest);
        }
        expoPushNotificationService.sendPushNotificationToList(notifyRequestList);
        log.info("{} Sent push notification to {} users",
                dateFormat.format(new Date()), expoPushTokenList.size());
    }
}
Does anybody have an idea on how I can prevent that safely?
Quartz would be my mostly database-agnostic solution for the task at hand, but it was ruled out, so we are not going to discuss it.
The solution we are going to explore instead makes the following assumptions:
PostgreSQL >= 9.5 is used (because we are going to use SKIP LOCKED, which was introduced in PostgreSQL 9.5).
It is okay to run a native query.
Under these conditions, multiple running instances of the application can retrieve batches of notifications through the following query:
SELECT * FROM expo_push_token FOR UPDATE SKIP LOCKED LIMIT 100;
This will retrieve and lock up to 100 entries from the table expo_push_token. If two instances of the application execute this query simultaneously, the results they receive will be disjoint. 100 is just a sample value; we may want to fine-tune it for our use case. The locks stay active until the current transaction ends.
After an instance has fetched a batch of notifications, it also has to delete the entries it locked from the table, or otherwise mark them as processed (if we go down this route, we have to modify the query above to filter out already-processed entries), and close the current transaction to release the locks. Each instance of the application would then repeat this query until it returns zero entries.
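A minimal sketch of this first variant, assuming Spring's NamedParameterJdbcTemplate and a boolean processed column for marking handled rows (the class and column names are illustrative):
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class NotificationBatchService {

    private final NamedParameterJdbcTemplate jdbc;

    public NotificationBatchService(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Claims up to 100 unprocessed tokens. Because of SKIP LOCKED, two instances
    // running this concurrently receive disjoint rows.
    @Transactional
    public List<String> claimBatch() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT id, token FROM expo_push_token WHERE processed = false "
                        + "LIMIT 100 FOR UPDATE SKIP LOCKED",
                Map.of());
        if (rows.isEmpty()) {
            return List.of();
        }
        List<Long> ids = rows.stream()
                .map(r -> ((Number) r.get("id")).longValue())
                .collect(Collectors.toList());
        // Mark the claimed rows while still holding their locks (short-transaction
        // variant); the caller sends the notifications after this method commits.
        jdbc.update("UPDATE expo_push_token SET processed = true WHERE id IN (:ids)",
                Map.of("ids", ids));
        return rows.stream()
                .map(r -> (String) r.get("token"))
                .collect(Collectors.toList());
    }
}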
There is also an alternative approach: an instance first fetches a batch of notifications to send, keeps the database transaction open (thus continuing to hold the locks), sends out its notifications, and only then deletes/updates the entries and closes the transaction.
The two solutions have different strengths and weaknesses:
the first solution keeps the transaction short, but if the application crashes in the middle of sending out notifications, the part of its batch that was not sent out is lost in this run.
the second solution keeps the transaction open, possibly for a long time. If it crashes in the middle of sending out notifications, all entries will be unlocked and its batch will be re-processed, possibly resulting in some notifications being sent out twice.
For this solution to work, we also need some kind of job that fills the table expo_push_token with the data we need. This job should run beforehand, i.e. its execution should not overlap with the notification-sending process.

Hold thread in spring rest request for long-polling

As the title says, we need one thread to notify or trigger a method on another thread. This is part of a long-polling implementation, which I describe and show below.
The requirements are:
UserX sends a request from client to server (poll action) immediately after he gets the response to the previous one. In the service, a Spring async method is executed, where the thread immediately checks the cache for new data in the database. I know a cache is usually used for methods where a specific input is expected to produce a specific output. That is not the case here: I use the cache to reduce database calls, and the output of my method is always different. The cache stores a notification telling the thread whether it should check the database or not. This check runs in a while loop that ends when the thread finds a notification in the cache to read the database, or when the time expires.
Assume that the UserX thread (poll action) is currently in the while loop, checking the cache.
At that moment UserY (push action) sends some data to the server; the data are stored in the database in a separate thread, and the userId of the recipient is also stored in the cache.
So when UserX checks the cache, he finds the id of the recipient (the recipient's id == his id in this case), breaks the loop, and fetches the data.
In my implementation I use the Google Guava cache, which allows manual writes.
private static Cache<Long, Long> cache = CacheBuilder.newBuilder()
        .maximumSize(100)
        .expireAfterWrite(5, TimeUnit.MINUTES)
        .build();
In the create method I store the id of the user who should read the data.
public void create(Data data) {
    dataRepository.save(data);
    // Guava's Cache has no save method and rejects null values,
    // so store the recipient id as both key and value.
    cache.put(data.getRecipient(), data.getRecipient());
    System.out.println("SAVED " + data.getRecipient() + " in " + Thread.currentThread().getName());
}
and here is the method for polling data:
@Async
public CompletableFuture<List<Data>> pollData(Long previousMessageId, Long userId) throws InterruptedException {
    // check the DB first; if there are new data, there is no need to enter the loop and wait
    List<Data> data = findRecent(previousMessageId, userId);
    // data not found, so wait in the loop for some time
    if (data.size() == 0) {
        short c = 0;
        while (c < 100) {
            // check if some new data were added; if yes, break the loop
            if (cache.getIfPresent(userId) != null) {
                break;
            }
            c++;
            Thread.sleep(1000);
            System.out.println("SEQUENCE: " + c + " in " + Thread.currentThread().getName());
        }
        // check the database at the end of the loop or after breaking out of it
        data = findRecent(previousMessageId, userId);
    }
    // clear the cache entry for that recipient and return the result
    cache.invalidate(userId);
    return CompletableFuture.completedFuture(data);
}
After UserX gets the response, he sends a poll request again and the whole process repeats.
Can you tell me whether this application design for long polling in Java (Spring) is correct, or whether a better way exists? The key point is that when a user sends a poll request, the request should be held waiting for new data for some time rather than answered immediately. The solution shown above works, but the question is whether it will also work for many users (1000+). I worry about the paused threads slowing down other requests once no threads are available in the pool. Thanks in advance for your effort.
Check out WebSockets. Spring supports them from version 4 onwards. They don't require the client to initiate polling; instead the server pushes data to the client in real time.
Check the below:
https://spring.io/guides/gs/messaging-stomp-websocket/
http://www.baeldung.com/websockets-spring
Note - WebSockets open a persistent connection between client and server and thus may result in more resource usage with a large number of users. So if you are not looking for real-time updates and are fine with some delay, then polling might be a better approach. Also, not all browsers support WebSockets.
Web Sockets vs Interval Polling
Longpolling vs Websockets
In what situations would AJAX long/short polling be preferred over HTML5 WebSockets?
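For reference, a minimal Spring STOMP configuration along the lines of those guides might look like this (the endpoint and destination names are illustrative):
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        // Clients connect here; SockJS provides a fallback for browsers without WebSocket support.
        registry.addEndpoint("/ws").withSockJS();
    }

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        registry.enableSimpleBroker("/topic");              // server pushes to /topic/** destinations
        registry.setApplicationDestinationPrefixes("/app"); // client-to-server messages
    }
}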
In your current approach, if you are concerned about the large number of threads running on the server for multiple users, you can instead trigger the polling from the front end every time. This way only short-lived request threads are started from the UI, each checking the cache for an update. If there is an update, another call can be made to retrieve the data. However, don't hit the server every second as you are doing now, otherwise you will have high CPU utilization and user request threads may also suffer. You should optimize your timing.
Instead of hitting the cache after a delay of 1 second 100 times, you can apply a more intelligent algorithm by analyzing the pattern of cache/DB updates over a period of time.
Knowing the pattern, you can trigger the polling with exponential back-off so that you hit the cache when an update is most likely. This way you hit the cache less frequently and more accurately.
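As a sketch of that back-off idea applied to the question's Guava cache loop (the starting delay, cap, and attempt count are made-up tuning values):
// Replace the fixed one-second sleep with exponential back-off.
private void waitWithBackoff(Long userId) throws InterruptedException {
    long delayMs = 250; // illustrative starting delay
    for (int attempt = 0; attempt < 10; attempt++) {
        if (cache.getIfPresent(userId) != null) {
            return; // update spotted, stop waiting
        }
        Thread.sleep(delayMs);
        delayMs = Math.min(delayMs * 2, 8_000); // double each round, capped at 8 seconds
    }
}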

Firestore admin "listens" to all documents again on reboot

TL;DR
Every time my Firestore admin server reboots, my document listener is triggered for all documents, even ones I have already listened to and processed. How do I get around this?
End TL;DR
I'm working on building a backend for my Firestore chat application. The basic idea is that whenever a user enters a chat message through a client app, the backend server listens for new messages and processes them.
The problem I'm running into is that whenever I reboot my app server, the listener is triggered for all of the existing, already-processed chats, so it responds to each chat even though it has already responded previously. I would like the app server to only respond to new chats that it hasn't already responded to.
One idea I have for a workaround is to put a boolean flag on each chat document. When the backend processes a chat document, it sets the flag; the listener then only replies to chats that don't have the flag set.
Is this a sound approach, or is there a better method? One concern is that every time I reboot my app server I will be charged heavily to re-query all of the previous chats. Another concern is that listening seems memory-bound. If my app scales massively, will I have to store all chat documents in memory? That doesn't seem like it will scale well...
// Example listener that processes chats based on whether or not the "hasBeenRepliedTo" flag is set
public void startFirestoreListener() {
    CollectionReference docRef = db.collection("chats");
    docRef.addSnapshotListener(new EventListener<QuerySnapshot>() {
        @Override
        public void onEvent(@javax.annotation.Nullable QuerySnapshot queryDocumentSnapshots,
                            @javax.annotation.Nullable FirestoreException e) {
            if (e != null) {
                logger.error("There was an error listening to changes in the firestore chats collection. E: "
                        + e.getLocalizedMessage());
                e.printStackTrace();
            } else if (queryDocumentSnapshots != null && !queryDocumentSnapshots.isEmpty()) {
                for (ChatDocument chatDoc : queryDocumentSnapshots.toObjects(ChatDocument.class)) {
                    if (!chatDoc.getHasBeenRepliedTo()) {
                        // Do some processing
                        chatDoc.setHasBeenRepliedTo(true); // set replied-to flag
                    } else {
                        // No-op, we've already replied to this chat
                    }
                }
            }
        }
    });
}
Yes, to avoid receiving every document each time, you will have to construct a query that yields only the documents you know have not yet been processed.
No, you are not charged for the query itself. You are charged only for the documents it reads, which happens when the query yields documents.
Yes, you will have to be able to hold all the results of a query in memory.
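To illustrate the first point, here is a sketch of the listener filtered on the question's flag, using the admin SDK's whereEqualTo (error handling trimmed). Updating the flag removes a document from the query, so a reboot only delivers documents that are still unreplied:
Query unreplied = db.collection("chats").whereEqualTo("hasBeenRepliedTo", false);
unreplied.addSnapshotListener((snapshots, e) -> {
    if (e != null) {
        logger.error("Listen failed: " + e.getLocalizedMessage());
        return;
    }
    for (DocumentChange change : snapshots.getDocumentChanges()) {
        if (change.getType() == DocumentChange.Type.ADDED) {
            ChatDocument chatDoc = change.getDocument().toObject(ChatDocument.class);
            // ... reply to the chat ...
            change.getDocument().getReference().update("hasBeenRepliedTo", true);
        }
    }
});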
Your problem will be much easier to solve if you use Cloud Functions to receive an event for each new document in a collection. You won't have to worry about any of the above; instead you just write a Firestore trigger that does what you want with each new document, and pay for those invocations.

To get database updates using servlets or jsp

What I want is to get database updates.
i.e. if any changes occur to the database or a new record is inserted, it should notify the user.
Up to now, what I have implemented uses jQuery, as shown below:
$(document).ready(function() {
    var updateInterval = setInterval(function() {
        $('#chat').load('Db.jsp?elect=<%=emesg%>');
    }, 1000);
});
It worked fine for me, but my teacher told me that it's not a good approach and recommended using comet or long-polling techniques instead.
Can anyone give me examples of getting database updates using comet or long polling in servlets/JSP? I'm using Tomcat as the server.
Just taking a shot in the dark, since I don't know your exact environment... You could have a database trigger fire a call to a servlet each time a row is committed, which would then run some code like the following.
First, get the script sessions that are active for the page we want to update. This eliminates the need to check every reverse-ajax script session running on the site. Once we have the script sessions, the second code block takes some data and updates a table on the client side. All the second section does is send JavaScript to the client to be executed over the open reverse-ajax connection.
String page = ServerContextFactory.get().getContextPath() + "/reverseajax/clock.html";
Browser.withPage(page, new Runnable() {
    public void run() {
        Util.setValue("clockDisplay", output);
    }
});

// Creates a new Person bean.
Person person = new Person(true);
// Creates a multi-dimensional array, containing a row and the row's column data.
String[][] data = {
    {person.getId(), person.getName(), person.getAddress(), person.getAge()+"", person.isSuperhero()+""}
};
// Call DWR's util which adds rows into a table. peopleTable is the id of the tbody
// and data contains the row/column data.
Util.addRows("peopleTable", data);
Note that both of the above sections of code are pulled straight from the documentation examples at http://directwebremoting.org/dwr-demo/. They are only simple examples of how reverse ajax can send data to the client; your exact situation seems to depend more on how you receive the notification than on how you update the client screen.
Without some type of database notification reaching the Java code, I think you will have to poll the system at set intervals. Even when polling, you could make the system a little more efficient by verifying that there are active reverse-ajax script sessions for the page before polling the database.
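As a concrete starting point without DWR, here is a minimal long-polling sketch using the Servlet 3.0 async API, which Tomcat 7+ supports; DbChangeChecker is a hypothetical stub you would replace with a real query against your table:
import java.io.IOException;

import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/poll", asyncSupported = true)
public class LongPollServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();
        ctx.setTimeout(35_000); // hard cap; the client simply re-polls after a timeout
        ctx.start(() -> {
            try {
                // Check the database periodically while the request is parked.
                for (int i = 0; i < 30; i++) {
                    String update = DbChangeChecker.latestUpdate();
                    if (update != null) {
                        ctx.getResponse().getWriter().write(update);
                        break;
                    }
                    Thread.sleep(1_000);
                }
            } catch (IOException | InterruptedException e) {
                // log and fall through to complete the request
            } finally {
                ctx.complete();
            }
        });
    }

    // Hypothetical placeholder: look up whether anything new was written to the DB.
    static class DbChangeChecker {
        static String latestUpdate() {
            return null; // replace with a real query
        }
    }
}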
