Simultaneous Cosmos document creation - java

I am hitting a race condition when two writes to Cosmos DB with the same id and partition key happen simultaneously (here, both are creating the document for the first time), and as a result I lose one of the updates.
The business logic is to update the document if it already exists and create it if it does not, but I lose data when two creations occur at the same time. I know concurrency can be handled with the ETag, but here the issue happens only on the very first update/creation of a document.
For example, consider the following models:
{ "id": 1, "pkey": 2, "value": ["first text"]}
and
{ "id": 1, "pkey": 2, "value": ["second text"]}
I have internal business logic that appends to the value array, so the expected merged document is:
{ "id": 1, "pkey": 2, "value": ["first text", "second text"] }
But I lose one of the values when the two creations hit Cosmos at the same time.
Please feel free to correct the question if you spot any errors.
Has anyone faced a similar issue? I tried setting the ETag in the request options, but the issue still persists.

Based on your description it looks like you are executing concurrent Replaces over existing documents.
The best way to avoid the scenario where a second concurrent Replace removes the data added by the first is to use Optimistic Concurrency.
In a nutshell:
1. Read the document you want to update.
2. From the response, obtain the ETag.
3. Apply the modification locally and send the Replace operation with the IfMatchETag option.
4. If there was a concurrent Replace operation, you will get an HTTP 412 response.
5. Repeat 1-4 until you get a success response.
Full example from the Java SDK samples repo (assuming Java, because your question is tagged so): https://github.com/Azure-Samples/azure-cosmos-java-sql-api-samples/blob/0ead4ca33dac72c223285e1db866c9dc06f5fb47/src/main/java/com/azure/cosmos/examples/documentcrud/async/DocumentCRUDQuickstartAsync.java#L405
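For illustration, here is a minimal sketch of that retry loop using the synchronous v4 Java SDK; the MyDoc model and the method name are placeholders, not taken from the linked sample:

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;
import java.util.List;

public class OptimisticConcurrencyExample {

    // Placeholder model matching the JSON in the question.
    public static class MyDoc {
        public String id;
        public String pkey;
        public List<String> value;
    }

    // Read-modify-replace guarded by the ETag; retried until no concurrent writer interferes.
    static void appendValue(CosmosContainer container, String id, String pkey, String newValue) {
        while (true) {
            // Steps 1+2: read the current document and keep the ETag from the response.
            CosmosItemResponse<MyDoc> read =
                    container.readItem(id, new PartitionKey(pkey), MyDoc.class);
            MyDoc doc = read.getItem();

            // Step 3: apply the modification locally (append to the value array) ...
            doc.value.add(newValue);

            // ... and send the Replace with IfMatchETag set to the ETag we just read.
            CosmosItemRequestOptions options =
                    new CosmosItemRequestOptions().setIfMatchETag(read.getETag());
            try {
                container.replaceItem(doc, id, new PartitionKey(pkey), options);
                return; // success
            } catch (CosmosException e) {
                // Step 4: HTTP 412 means a concurrent Replace won; loop and try again.
                if (e.getStatusCode() != 412) {
                    throw e;
                }
            }
        }
    }
}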

API Design: Making data human readable vs forcing more API calls [closed]

Let's say we have a Product table in our database. This represents products that exist:
productId | brandId (FK) | name | description    | capacity | pricePerNight
1         | 2            | Tent | This is a tent | 4        | 20
Let's say we also have a Brand table in our database. This represents brands and is linked to the Product table via an FK:
id | name
1  | Vango
2  | Quechua
Now let's say I was making a public API which allows clients to query all products. I want the client to (conveniently) be able to see the name of the brand for each product.
My options are: (a) When they GET /product, return an edited version of the Product object with a new human-readable field (brand) in addition to, or in place of, the FK:
{
"productId": 1,
"brand": "Quechua",
"name": "Tent",
"description": "This is a tent",
"capacity": 4,
}
(b) Return the Product object as it exists in the database, thus compelling clients to make two calls if they want to access the brand name (e.g. one call to GET /product and one to GET /brand):
{
"productId": 1,
"brandId": 2,
"name": "Tent",
"description": "This is a tent",
"capacity": 4,
}
I am confronted by this situation quite regularly and I'm wondering what best practice is for a RESTful API. Any opinions very welcome.
EDIT: IRL I am building a public API which needs to be as intuitive and easy to use as possible for our company's integrating partners.
The tables are related to each other, so you don't need two API calls.
What I normally do in this case is expose one API, i.e. /product, which returns the product object with the brand object nested inside it.
This approach is also widely used by well-known frameworks such as Spring Boot (Java) and Laravel (PHP).
{
"productId": 1,
"name": "Tent",
"description": "This is a tent",
"capacity": 4
"brand":{
"id":1,
"name":"brandName"
}
}
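As a rough illustration of that nested shape in Java (class and field names here are placeholders, not from any particular framework), the response could map onto DTOs like these; with Jackson, Spring Boot's default JSON mapper, they serialize to the JSON shown above:

// Hypothetical response DTOs for the nested representation.
public class ProductResponse {
    public long productId;
    public String name;
    public String description;
    public int capacity;
    public BrandResponse brand; // nested brand instead of a bare brandId
}

class BrandResponse {
    public long id;
    public String name;
}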
I am confronted by this situation quite regularly and I'm wondering what best practice is for a RESTful API. Any opinions very welcome.
Heuristic: how would you do it on the web?
With web pages, it is common to have one page that combines all of the information that you need, even though that data is distributed across multiple tables.
The spelling of the URI normally comes not from the tables, but instead by thinking about what the name of this report is, and choosing an identifier that has similar semantics while remaining consistent with your local spelling conventions.
Here, it looks like we've got some form of "product catalog" or "stock list", so that concept would likely be what we are expressing in the URI.
(Note: part of the point of a REST API is that it acts as a facade, hiding the underlying implementation details. You should not feel obligated to expose your table names in your identifiers.)
Starting from the DB layer, you could use a Read-Only View for reading data. This view could combine fields spanned across different tables (like that foreign key relationship in the example).
Also, as already mentioned, your response could include the objects the entity you query refers to.
At the end of the day, almost no one consumes data straight from an API; usually there is some front-end infrastructure in between. As such, you need to strike a balance between keeping the response payload lean and letting the client make as few requests as possible.
PS: This Stack Exchange thread may be relevant.
The best answer I can give is "let the client specify what data it wants", e.g. by using GraphQL.
The problem is similar to the challenges ORM frameworks face - do I load the entire dependency graph (e.g. "brand" belongs to a category and a category has a manager and a manager has a department and a department has a...).
GraphQL changes this dynamic - it lets the API client specify which data it wants.
Alternatives are "load one level of dependency" (similar to Danish's answer), though that very quickly means a client has to build an understanding of your database schema - which creates tight coupling between client and API server.

Should I use save or update to update one or more elements of an embedded document in Mongodb using Morphia

TL;DR
Using the Morphia ORM, should I call save(entity/doc) every time elements (more than one element) of the list (subdocument) change, or use update with update operations to only update the changed elements?
Background/Scenario
I have the following document (Exam) with a qa (QuestionAnswer) subdocument that can grow to between 50 and 100 entries. Take into account the following:
Around 80% of the operations update more than one entry of the qa subdocument at a time
A record (Exam) is specific to a User, hence there are no concurrent updates
A screen displays 5-10 questions; when the user moves to the next/previous page, the questions on the current screen are posted to the server to be updated
QuestionAnswer is specific to the Exam record
document (just to give an idea)
{
"user": "user-record-ref",
"name": "some-name",
"dob": "some-timestamp",
"qa": [
{
"question": "some-question1",
"choices": ["A", "B"],
"answer": ["A"]
...
},
{
"question": "some-question2",
"choices": [],
"answer": ["descriptive-answer"]
...
}
]
}
I have modelled the above using Spring Boot & Morphia in the following manner (just to give an idea)
public class Exam {
    private String name;
    private Date dob;
    @Reference
    private User user;
    @Embedded
    private List<QuestionAnswer> qa;
}
Questions
Using the Morphia ORM, should I call save(entity/doc) every time elements of the list (subdocument) change, or use update with update operations to only update the changed elements?
If I were to use save, taking into account the 80% frequency mentioned above, is it efficient to keep re-saving the entire entity this often?
If I were to use update, is it efficient to do x update calls, where x is the number of questions (elements of the subdocument) to be updated (each question has a different id and hence a different query clause)?
Which is most performant?
Since the qa subdocument will be updated quite often and is specific to the document (not shared), how else can I model this to make updating the elements of the subdocument painless, efficient, and scalable while following good practice?
I see the save method as being about simplicity of coding. It will always be more efficient to send update calls to the backend, since that reduces the amount of traffic going over the wire and the amount of data changing on disk. If you trust your database, you can also set your write concern to not even wait for the write, making the application respond very fast.
However, since you're making so many changes to this particular QA subdocument, I would consider storing the QA in its own collection. This way you can call save on the QA object itself and not have to worry about sending the rest of the Exam data over to the server on every call.
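For reference, here is a rough sketch of what a targeted update (rather than a full save) could look like, assuming a Morphia 1.x-style API; the field paths and the use of the positional "$" operator are assumptions and may need adjusting for your version:

import java.util.Collections;
import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.query.Query;
import org.mongodb.morphia.query.UpdateOperations;

public class ExamUpdater {

    // Updates the answer of a single embedded QuestionAnswer without re-saving the whole Exam.
    public static void updateAnswer(Datastore datastore, ObjectId examId,
                                    String question, String answer) {
        Query<Exam> query = datastore.createQuery(Exam.class)
                .field("_id").equal(examId)
                .field("qa.question").equal(question);

        UpdateOperations<Exam> ops = datastore.createUpdateOperations(Exam.class)
                .disableValidation()                              // allow the positional "$" path
                .set("qa.$.answer", Collections.singletonList(answer));

        datastore.update(query, ops);
    }
}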

App Engine SearchAPI: java.util.concurrent.CancellationException: Task was cancelled

Some of my App Engine Search API queries throw a 'java.util.concurrent.CancellationException: Task was cancelled' exception. The error is reproducible.
I have multiple indexes. On some indexes these queries run fine; on others they fail.
The query is very basic. If I run it from the admin console (https://console.cloud.google.com/appengine/search/index), it gives no problem.
There is nothing special about the query.
The query filters on 2 atom fields: isReliable = "1" AND markedForDelete = "0", and sorts on a number field.
There seems to be nothing wrong with the code, as it runs many such queries with no problem, including ones far more complex than the failing ones.
I've seen such exceptions caused by timeout limits. Check in the logs whether you get them after approximately the same execution time (e.g. 59-60 seconds).
If this is not a user-facing request, you can move it into a task, which has a 10-minute execution limit. If it is a user-facing request, some changes to the data model might be necessary. For example, you may combine some fields into flags for frequently used queries, e.g. isReliable = "1" AND markedForDelete = "0" becomes code = "10" or reliableToDelete = "true".
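If you go the combined-flag route, here is a sketch of what writing and querying the derived atom field could look like with the Search API (the index name, field names, and method names are just examples):

import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Results;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

public class CombinedFlagExample {

    private static final Index INDEX = SearchServiceFactory.getSearchService()
            .getIndex(IndexSpec.newBuilder().setName("myIndex").build());

    // Write time: derive a single atom field from the two original flags.
    static void indexDocument(String docId, String isReliable, String markedForDelete) {
        String reliableToDelete =
                ("1".equals(isReliable) && "0".equals(markedForDelete)) ? "true" : "false";
        Document doc = Document.newBuilder()
                .setId(docId)
                .addField(Field.newBuilder().setName("isReliable").setAtom(isReliable))
                .addField(Field.newBuilder().setName("markedForDelete").setAtom(markedForDelete))
                .addField(Field.newBuilder().setName("reliableToDelete").setAtom(reliableToDelete))
                .build();
        INDEX.put(doc);
    }

    // Query time: a single-field filter instead of the two-field AND.
    static Results<ScoredDocument> findReliableToDelete() {
        return INDEX.search("reliableToDelete: true");
    }
}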

Filter data on the query or with Java stream?

I have a Play 2.3.x Java application connected to a MongoDB database via Morphia.
I enabled slow query profiling in MongoDB and saw that one query comes up often. It looks like this:
"query": {
"field1": null,
"field2": true,
"field3": "a061ee3f-c2be-477c-ad81-32f852b706b5",
"$or": [
{
"field4.VAL.VALUE_1": "EXPECTED_VALUE_1",
"field4.VAL.VALUE_2": "EXPECTED_VALUE_2"
}
]
}
In its current state there is no index, so every time the query is executed the whole collection is scanned. There are only a few documents for now, but I anticipate the database will grow.
So I was wondering which is the best solution:
Remove all the clauses above from the query, retrieve all results (paginated), and filter with the Java Stream API
Keep the query as is and index the fields
If you see another solution, feel free to suggest it :)
Thanks.
Always perform everything that's possible (filtering, sorting etc.) nearest to the source of the data.
Why haven't you indexed the fields? That's what they're meant for.
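To make that concrete, here is a sketch of creating such a compound index with the plain MongoDB Java driver (3.7+ API assumed); with Morphia you could instead declare the same index via its index annotations. The database and collection names are placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class IndexSetup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("mycollection");

            // Compound index covering the equality filters from the slow query above;
            // the fields inside the $or may deserve their own index depending on selectivity.
            collection.createIndex(Indexes.ascending("field3", "field2", "field1"));
        }
    }
}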

Designing HBase schema to best support specific queries

I have an HBase schema-design question. The problem is fairly simple - I am storing "notifications" in HBase, each of which has a status ("new", "seen", and "read"). Here are the APIs I need to provide:
Get all notifications for a user
Get all "new" notifications for a user
Get the count of all "new" notifications for a user
Update status for a notification
Update status for all of a user's notifications
Get all "new" notifications accross the database
Notifications should be scannable in reverse chronological order and allow pagination.
I have a few ideas, and I wanted to see if one of them is clearly best, or if I have missed a good strategy entirely. Common to all three, I think having one row per notification and having the user id in the rowkey is the way to go. To get chronological ordering for pagination, I need to have a reverse timestamp in there, too. I'd like to keep all notifications in one table (so I don't have to merge-sort for the "get all notifications for a user" call) and don't want to write batch jobs for secondary index tables (since updates to the count and status should be in real time).
The simplest way to do it would be (1): the row key is "userId_reverseTimestamp" and status filtering is done on the client side. This seems naive, since we would be sending lots of unnecessary data over the network.
The next possibility is to (2) encode the status into the rowkey as well, i.e. "userId_reverseTimestamp_status", and then do rowkey regex filtering on the scans. The first issue I see is needing to delete a row and copy the notification data to a new row when the status changes (which, presumably, should happen exactly twice per notification). Also, since the status is the last part of the rowkey, for each user we will be scanning lots of extra rows. Is this a big performance hit? Finally, in order to change the status, I will need to know what the previous status was (to build the row key), or else I will need to do another scan.
The last idea I had is to (3) have two column families, one for the static notif data, and one as a flag for the status, i.e. "s:read" or "s:new" with 's' as the cf and the status as the qualifier. There would be exactly one per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/ ColumnPrefixFilter against that cf. Here too, I would have to delete and create columns on status change, but it should be much more lightweight than copying whole rows. My only concern is the warning in the HBase book that HBase doesn't do well with "more than 2 or 3 column families" - perhaps if the system needs to be extended with more querying capabilities, the multi-cf strategy won't scale.
So (1) seems like it would have too much network overhead, (2) seems like it would waste effort copying data, and (3) might cause issues with too many column families. Between (2) and (3), which type of filter should give better performance? In both cases the scan will have to look at every row for a user, most of which will presumably be already-read notifications - so which would perform better? I think I'm leaning towards (3) - are there other options (or tweaks) that I have missed?
You have put a lot of thought into this and I think all three are reasonable!
You want to have your main key be the username concatenated with the time stamp since most of your queries are "by user". This will help with easy pagination with a scan and can fetch user information pretty quickly.
I think the crux of your problem is this changing-status part. In general, something like "read" -> "delete" -> "rewrite" introduces all kinds of concurrency issues. What happens if your task fails in between? Do you end up with data in an invalid state? Will you drop a record?
I suggest you instead treat the table as "append only". Basically, do what you suggest for #3, but instead of removing the old flag, keep it there. If something has been read, it can have the "s:seen" and "s:read" flags present (if it is new, we can just assume the family is empty). You could also be fancy and put a timestamp in each flag's value to show when that event happened. You shouldn't see much of a performance hit from doing this, and you don't have to worry about concurrency, since all operations are write-only and atomic.
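As a rough illustration of that append-only write with the HBase Java client (the table, family, and qualifier names are assumptions):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NotificationStatusWriter {

    // Marks a notification as read by appending a status qualifier to the "s" family,
    // instead of deleting and rewriting the row.
    public static void markRead(String userId, long notifTimestamp) throws IOException {
        // rowkey = userId_reverseTimestamp, as discussed in the question
        byte[] rowKey = Bytes.toBytes(userId + "_" + (Long.MAX_VALUE - notifTimestamp));
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("notifications"))) {
            Put put = new Put(rowKey);
            // Value records when the status change happened.
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("read"),
                          Bytes.toBytes(System.currentTimeMillis()));
            table.put(put);
        }
    }
}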
I hope this is helpful. I'm not sure if I answered everything, since your question was quite broad. Please follow up with additional questions and I'd be happy to elaborate or discuss further.
My solution is:
Don't save the notification status (seen, new) in HBase for each notification. For the notifications, use a simple schema. Key: userid_timestamp - column: notification_message.
Once the client calls the "Get all new notifications" API, save that timestamp ("all new notifications pushed"). Key: userid - column: All_new_notifications_pushed_time
Every notification whose timestamp is lower than "all new notifications pushed" is assumed "seen"; if it is higher, it is assumed "new".
To get all new notifications:
first get the value (timestamp) of All_new_notifications_pushed_time for the userid,
then perform a range scan on the notification_message column by key, from current_timestamp back to All_new_notifications_pushed_time.
This significantly limits the data touched by the scan, and most of it should still be in the memstore.
Count the new notifications on the client.
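A rough sketch of that range scan with the HBase Java client (1.4+/2.x withStartRow/withStopRow API; the table name and key layout are assumptions, and timestamps are assumed fixed-width so lexicographic order matches numeric order):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NewNotificationsScanner {

    // Counts notifications written after the last "all new notifications pushed" timestamp.
    public static int countNew(String userId, long lastPushedTs, long currentTs) throws IOException {
        Scan scan = new Scan()
                .withStartRow(Bytes.toBytes(userId + "_" + (lastPushedTs + 1)))
                .withStopRow(Bytes.toBytes(userId + "_" + (currentTs + 1)));

        int count = 0;
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("notifications"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result ignored : scanner) {
                count++;
            }
        }
        return count;
    }
}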
