Retrieve first document from collection using findOne() MongoDB


I am writing a Java program that treats a MongoDB collection as a queue, so when I dequeue I want the document that was inserted first.
To do that, I have a field called created, which holds the timestamp of the document's creation. My initial idea was to use the aggregation operator $min on the created field to find the oldest document.
However, it occurred to me: why not use findOne() without any argument? It will always return the first document in the collection.
So my question is: should I do that? Is it a good approach to use findOne() to dequeue the first record from the Mongo queue, and what are the drawbacks if I do?
PS: The Mongo queue program serves device requests on a first come, first served basis. A request takes some time to execute, and a device can't accept another request while it is processing one, so to avoid dropping requests I use the queue to process them one by one.

Interesting how many people here commented incorrectly, but you are right in that a raw .findOne() with a blank query or .findOne({}) will return the first document in the collection, that being "the document with the lowest _id value".
Ideally for a queue processing system, you want to remove the document at the same time as doing this. For this purpose the Java API supports a .findAndRemove() method:
DBCollection data = mongoOperation.getCollection("data");
DBObject removed = data.findAndRemove(new BasicDBObject());
So that will return the first document in the collection as described and "remove" it from the collection so that no other operations can find it.
Alternatively, you can call .findAndModify() and set all the options yourself, but if all you are after is "oldest document first", which is what the _id guarantees, then this is all you need.

findOne returns documents in natural order, which is not necessarily the same as insertion order. It is the order in which documents appear on disk. It may look as if documents are retrieved in insertion order, but with deletes and inserts you will start seeing documents appear out of order.
One way to guarantee that documents always appear in insertion order is to use a capped collection. If your application is not affected by its restrictions, a capped collection might be the simplest way to implement a queue.
Capped collections can also be used with a tailable cursor, so that the logic retrieving items from the queue can keep waiting when no items are available to process.
Update: If you cannot use a capped collection, you would have to sort the result by _id if it is an ObjectId, or keep a timestamp-based field in the collection and order the result by that field.
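The difference between storage order and an explicit sort can be illustrated without a database. The following is a minimal in-memory sketch, not MongoDB driver code: QueuedRequest and OldestFirst are hypothetical names, and the list stands in for a collection whose storage order no longer matches insertion order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a queued document with a "created" timestamp.
final class QueuedRequest {
    final String id;
    final long created; // epoch millis, assumed to be set at insert time

    QueuedRequest(String id, long created) {
        this.id = id;
        this.created = created;
    }
}

final class OldestFirst {
    // Order by creation time, then by id so that ties are resolved deterministically.
    static final Comparator<QueuedRequest> BY_CREATED =
            Comparator.<QueuedRequest>comparingLong(r -> r.created)
                      .thenComparing(r -> r.id);

    // Removes and returns the oldest request, regardless of its position in the list.
    static Optional<QueuedRequest> dequeue(List<QueuedRequest> storage) {
        Optional<QueuedRequest> oldest = storage.stream().min(BY_CREATED);
        oldest.ifPresent(r -> storage.remove(r));
        return oldest;
    }
}
```

The dequeue never depends on where an element happens to sit in the list, which is the property the explicit sort gives you in a real collection.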

findOne returns documents in $natural order, as determined by the internal MongoDB B-tree that exists behind the scenes.
The function does not, by default, sort by _id, nor will it pick the lowest _id.
If you find that it regularly returns the lowest _id, that is because of document positioning within the $natural index.
Getting the first document of the collection and the first document of a sorted set are two totally different things.
If you wanted to use findAndModify to grab a document off the pile (personally I would recommend an optimistic lock instead), you would need to use:
findAndModify({
sort: {_id: 1},
remove: true
})
The reason I would not recommend this approach is that if the process crashes, or a server in the distributed worker set goes down, you have lost that data point. Instead you want a temporary (optimistic-type) lock which can be released if the document has not been processed correctly.
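One way to picture such a temporary lock is as a lease: a worker claims the oldest unclaimed item for a limited time, and the item becomes claimable again if the worker never acknowledges it. The following is an in-memory illustration only. LeaseQueue, claim, ack and the lease length are hypothetical, and in MongoDB the claim step would be an atomic update on the document rather than a synchronized method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// In-memory illustration of a lease-based (temporary) lock on queue items.
final class LeaseQueue {
    private static final class Item {
        final String payload; // assumed unique per item in this sketch
        long lockedUntil;     // 0 means "never claimed"

        Item(String payload) {
            this.payload = payload;
        }
    }

    private final List<Item> items = new ArrayList<>(); // kept in insertion order
    private final long leaseMillis;

    LeaseQueue(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    synchronized void enqueue(String payload) {
        items.add(new Item(payload));
    }

    // Claims the oldest item whose lease is absent or expired; the item stays in the queue.
    synchronized Optional<String> claim(long nowMillis) {
        for (Item item : items) {
            if (item.lockedUntil <= nowMillis) {
                item.lockedUntil = nowMillis + leaseMillis;
                return Optional.of(item.payload);
            }
        }
        return Optional.empty();
    }

    // Removes the item only after the worker reports success.
    synchronized boolean ack(String payload) {
        return items.removeIf(item -> item.payload.equals(payload));
    }

    synchronized int size() {
        return items.size();
    }
}
```

If a worker crashes after claim, nothing is lost: once the lease runs out, the next claim call hands the same item to another worker.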

Related

Can I use LinkedHashMap in Hazelcast?

Can I somehow use a LinkedHashMap in Hazelcast (Java, Spring)? I need to get unique records from Hazelcast's shared in-memory cache, but in the order in which I inserted them. I found in the Hazelcast documentation (https://docs.hazelcast.org/docs/latest-dev/manual/html-single/) that it offers distributed implementations of common data structures. But a map doesn't preserve element order, and a list or queue doesn't remove duplicate data. Do you know if I can use LinkedHashMap, or otherwise get unique data while preserving its order?

Ordered or linked storage isn't compatible with the goals of a data grid - highly concurrent and distributed storage.
Ordered retrieval is possible. Hazelcast's Paging Predicate with a comparator would do it. Or, if the volume is not too high, you could retrieve the entry set and sort it yourself.
The catch is, you have to provide the field to order upon.
If your data already has some sort of sequence number or timestamp that is always unique, this is easy.
If not, perhaps something like Atomic Long would do it. A getAndIncrement() would give you a unique number to use for each insert.
Watch out though, this has a race condition if two or more threads insert concurrently. To solve this you'd need some sort of singleton @Service running somewhere to do the "get next seqno; insert" step.
And if you restart the grid, the seqno in the atomic counter will need to be repositioned to the right place.
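The sequence-number idea can be sketched with plain JDK classes. This is not Hazelcast code: AtomicLong and ConcurrentHashMap stand in for the distributed counter and map, and OrderedUniqueStore is a hypothetical name. Each key keeps the sequence number of its first insertion, and reads sort by that number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Unique keys that can be read back in the order of their first insertion.
final class OrderedUniqueStore {
    private final AtomicLong nextSeq = new AtomicLong();
    private final Map<String, Long> seqByKey = new ConcurrentHashMap<>();

    // Returns true if the key was new. A duplicate keeps its original sequence number.
    boolean add(String key) {
        // A number is drawn even for duplicates; the gaps are harmless
        // because only the relative order of the numbers matters.
        long seq = nextSeq.getAndIncrement();
        return seqByKey.putIfAbsent(key, seq) == null;
    }

    // Sorts the entries by sequence number and returns the keys.
    List<String> keysInInsertionOrder() {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(seqByKey.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }
}
```

The sort on read plays the role that the comparator plays in the ordered-retrieval options mentioned above.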

Does the JDBC getGeneratedKeys() method always return keys in the same order as the inserted elements?

I use executeBatch() with JDBC to insert multiple rows, and I want to get the ids of the inserted rows for another insert. I use this code for that purpose:
insertInternalStatement = dbConnection.prepareStatement(INSERT_RECORD, generatedColumns);
for (Foo foo : foosHashSet) {
    insertInternalStatement.setInt(1, foo.getMe());
    insertInternalStatement.setInt(2, foo.getMe2());
    // ..
    insertInternalStatement.addBatch();
}
insertInternalStatement.executeBatch();
// now get inserted ids
try (ResultSet generatedKeys = insertInternalStatement.getGeneratedKeys()) {
    Iterator<Foo> fooIterator = foosHashSet.iterator();
    while (generatedKeys.next() && fooIterator.hasNext()) {
        fooIterator.next().setId(generatedKeys.getLong(1));
    }
}
It works fine and ids are returned. My questions are:
If I iterate over getGeneratedKeys() and foosHashSet, will the ids be returned in the same order, so that each id returned from the database belongs to the corresponding Foo instance?
What happens when the above code runs in multiple threads simultaneously?
Is there any other solution for this? I have two tables, foo1 and foo2, and I want to insert the foo1 records first and then use their primary ids as foreign keys in foo2.

Given that support for getGeneratedKeys for batch execution is not defined in the JDBC specification, the behavior will depend on the driver used. I would expect any driver that supports generated keys for batch execution to return the ids in the order they were added to the batch.
However, the fact that you are using a Set is problematic. Iteration order for most sets is not defined, and could change between iterations (usually only after modification, but in theory you can't assume anything about the order). You need to use something with a guaranteed order, e.g. a List or maybe a LinkedHashSet.
Applying multi-threading here would probably be a bad idea: you should only use a JDBC connection from a single thread at a time. Accounting for multi-threading would require either correct locking or splitting up the workload so it can use separate connections. Whether that would improve or worsen performance is hard to say.
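A minimal sketch of the "guaranteed order" advice, without any JDBC calls: take one List snapshot of the set, use that list both when adding rows to the batch and when reading the keys back, and fail loudly if the counts differ. The Foo class here is a hypothetical stand-in for the question's class, KeyAssigner is an invented helper, and the sketch assumes the driver returns keys in batch order, which is driver-dependent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the question's Foo class.
final class Foo {
    final int me;
    long id; // filled in from the generated key

    Foo(int me) {
        this.me = me;
    }
}

final class KeyAssigner {
    // Take one snapshot of the set; iterate this list for addBatch() and for reading keys.
    static List<Foo> snapshot(Set<Foo> foos) {
        return new ArrayList<>(foos);
    }

    // Pairs keys with rows by position; assumes the driver returns keys in batch order.
    static void assignIds(List<Foo> batchOrder, List<Long> generatedKeys) {
        if (batchOrder.size() != generatedKeys.size()) {
            throw new IllegalStateException("expected " + batchOrder.size()
                    + " keys but got " + generatedKeys.size());
        }
        for (int i = 0; i < batchOrder.size(); i++) {
            batchOrder.get(i).id = generatedKeys.get(i);
        }
    }
}
```

The size check matters because a silent mismatch would attach ids to the wrong rows, and those ids are then used as foreign keys.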

You should be able to iterate through multiple generated keys without problems. They will be returned in the order in which they were inserted.
I don't think adding threads should be a problem here. The only thing I'm pretty sure of is that you would not be able to control the order in which the ids are inserted in both tables without complicating the code.
You could store all the ids from the first insert in a Collection and, after all threads/iterations have finished, insert them into the second table.

The iteration order stays the same as long as fooHashSet is not altered.
One could consider using a LinkedHashSet, which yields the items in insertion order. That would be nice especially when nothing is removed or overwritten.
Concurrent access would be problematic.
Use a LinkedHashSet without removal, only adding new items, and additionally wrap it in Collections.synchronizedSet. For set alterations one would need a Semaphore or similar, as synchronizing such a large code block is a no-go.
An - even better performing - solution might be to make a local copy:
List<Integer> list = fooHashSet.stream().map(Foo::getMe)
    .collect(Collectors.toList());
However this still is a somewhat unsatisfying solution:
a batch for multiple inserts and then per insert several other updates/inserts.
Transition to JPA instead of JDBC would somewhat alleviate the situation.
After some experience however I would pose the question whether a database at that point is still the correct tool (hammer)? If it is a graph, a hierarchical data structure, then storing the entire data structure as XML with JAXB in a single database table, could be the best solution. Faster. Easier development. Verifiable data.
Using the database for main data, and the XML for an edited/processed document.

Yes. The documentation for executeBatch() says:
createFcCouponStatement.executeBatch()
Submits a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts. The int elements of the array that is returned are ordered to correspond to the commands in the batch, which are ordered according to the order in which they were added to the batch. The elements in the array returned by the method executeBatch may be one of the following:
A number greater than or equal to zero -- indicates that the command was processed successfully and is an update count giving the number of rows in the database that was affected by the command's execution
A value of SUCCESS_NO_INFO -- indicates that the command was processed successfully but that the number of rows affected is unknown
If one of the commands in a batch update fails to execute properly, this method throws a BatchUpdateException, and a JDBC driver may or may not continue to process the remaining commands in the batch. However, the driver's behavior must be consistent with a particular DBMS, either always continuing to process commands or never continuing to process commands. If the driver continues processing after a failure, the array returned by the method BatchUpdateException.getUpdateCounts will contain as many elements as there are commands in the batch, and at least one of the elements will be the following:
A value of EXECUTE_FAILED -- indicates that the command failed to execute successfully and occurs only if a driver continues to process commands after a command fails
The possible implementations and return values have been modified in the Java 2 SDK, Standard Edition, version 1.3 to accommodate the option of continuing to process commands in a batch update after a BatchUpdateException object has been thrown.
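As a small illustration of the quoted contract, the update counts array can be inspected with the constants defined on java.sql.Statement. The BatchResult helper below is a hypothetical name, and it only interprets the array; it does not execute anything against a database.

```java
import java.sql.Statement;

// Interprets the int[] returned by executeBatch() or BatchUpdateException.getUpdateCounts().
final class BatchResult {
    // Counts the commands that were processed successfully, with or without a known row count.
    static int succeeded(int[] updateCounts) {
        int ok = 0;
        for (int count : updateCounts) {
            if (count >= 0 || count == Statement.SUCCESS_NO_INFO) {
                ok++;
            }
        }
        return ok;
    }

    // Returns the batch position of the first failed command, or -1 if none failed.
    static int firstFailure(int[] updateCounts) {
        for (int i = 0; i < updateCounts.length; i++) {
            if (updateCounts[i] == Statement.EXECUTE_FAILED) {
                return i;
            }
        }
        return -1;
    }
}
```

Because the array is ordered like the batch, the index returned by firstFailure identifies which added command failed.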

Cache management, how to auto-delete elements?

I'm implementing a cache in Java, but I have one last problem to solve: how to deal with the deletion of elements?
Elements are stored on disk. Each element has a validity period (and therefore an expiration date) and a size; my cache has a maximum size and a maximum number of elements that may be stored.
I imagined three ways of performing the deletion:
1. When a new element is inserted into the cache, a scheduled thread (one per element) is configured to start at the expiration time and delete that element.
2. A thread runs every X minutes to check which elements may be deleted, and deletes them.
3. When a limit (size or number) is reached, the oldest elements are deleted (or elements are deleted at random, which is faster).
Regarding the third point: with this policy the cache will continue to store expired elements as well. Obviously, when one of these is requested, a check is performed to see whether the element is still valid.
What do you think? What's the common behavior when managing a cache? Are there other solutions?
P.S. I'm developing this cache for Android, but I think this is not so important.

Basically you have to know how often your cached elements will be used, and in which order. A cache has to do the same as an OS in order to keep the best data in memory.
Have a look at these strategies and pick the one you need: http://en.wikipedia.org/wiki/Page_replacement_algorithm
A good starting point would be LRU (Least Recently Used). But like all these strategies it has some weaknesses, which may make it unsuitable for your use case.
Implementation tips for LRU:
use a PriorityQueue to store the elements in addition to your map. Keep it being updated with a global counter that gets incremented every time you use one of your elements and reinsert the corresponding element in the PriorityQueue with the current value of the global counter.
If you need to remove an item from the queue, you just have to remove the first or last element from the queue (depending on your implementation of the compareTo(...) method). And remove it from the map as well.
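For comparison, here is a minimal LRU sketch that swaps in a different technique from the PriorityQueue described above: the JDK's LinkedHashMap in access order, with removeEldestEntry enforcing the limit. LruCache and maxEntries are hypothetical names, the limit here counts entries rather than bytes, and the class is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache bounded by entry count, built on LinkedHashMap's access-order mode.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate from least to most recently accessed
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

A get counts as a use, so a recently read element survives the next eviction. For concurrent access the map would need external synchronization, for example Collections.synchronizedMap.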

Avoiding for loops and using collection APIs instead (performance)

I have a piece of code from an old project.
The logic (in a high level) is as follows:
The user sends a series of {id,Xi} where id is the primary key of the object in the database.
The aim is that the database is updated but the series of Xi values is always unique.
I.e. if the user sends {1,X1} and in the database we have {1,X2},{2,X1} the input should be rejected otherwise we end up with duplicates i.e. {1,X1},{2,X1} i.e. we have X1 twice in different rows.
In lower level the user sends a series of custom objects that encapsulate this information.
Currently the implementation for this uses "brute-force" i.e. continuous for-loops over input and jdbc resultset to ensure uniqueness.
I do not like this approach and moreover the actual implementation has subtle bugs but this is another story.
I am searching for a better approach, both in terms of coding and performance.
What I was thinking is the following:
1. Create a Set from the user's input list. If the Set has a different size than the list, then the user's input has duplicates. Stop there.
2. Load the data via JDBC.
3. Create a HashMap<Long,String> from the user's input. The key is the primary key.
4. Loop over the result set. If the HashMap does not contain a key equal to the row's id, add the row to the HashMap.
5. In the end, get the HashMap's values as a List. If it contains duplicates, reject the input.
This is the algorithm I came up with.
Is there a better approach than this? (I assume that there is no error in the algorithm itself.)

Purely from a performance point of view, why not let the database figure out that there are duplicates (like {1,X1},{2,X1})? Have a unique constraint in place on the table, and when the update statement fails by throwing an exception, catch it and handle it as you see fit for these input conditions. You may also want to run this as a single transaction in case you need to roll back any partial updates. Of course, this assumes that you don't have any other business rules driving the updates that you haven't mentioned here.
With your algorithm, you are spending too much time iterating over HashMaps and Lists to remove duplicates, IMHO.

Since you can't change the database, as stated in the comments, I would probably extend your Set idea. Create a HashMap<Long, String> and put all of the items from the database in it, then also create a HashSet<String> with all of the values from your database in it.
Then, as you go through the user input, check the key against the HashMap and see if the values are the same. If they are, great: you don't have to do anything, because that exact input is already in your database.
If they aren't the same, check the value against the HashSet to see if it already exists. If it does, you have a duplicate.
This should perform much better than a loop.
Edit:
For multiple updates, perform all of the updates on the HashMap created from your database, then check whether the size of the Map's value set differs from the size of its key set.
There might be a better way to do this, but this is the best I got.
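The multiple-update variant from the edit can be sketched in a few lines. UniqueValueCheck and isValid are hypothetical names; the sketch assumes the rows loaded from the database are already unique and that the user input contains each id at most once, which is why it is modelled as a Map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

final class UniqueValueCheck {
    // Returns true if applying the updates leaves every value unique across all rows.
    static boolean isValid(Map<Long, String> databaseRows, Map<Long, String> updates) {
        // Apply the updates to a copy, so the map holding the database state is untouched.
        Map<Long, String> merged = new HashMap<>(databaseRows);
        merged.putAll(updates);
        // If two ids share a value, the set of values is smaller than the number of rows.
        return new HashSet<>(merged.values()).size() == merged.size();
    }
}
```

Checking the merged state rather than each item in turn also accepts inputs that swap two values between rows, since the final state is still unique.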

I'd opt for a database-side solution. Assuming a table with the columns id and value, you should make a list with all the "values", and use the following SQL:
select count(*) from tbl where value in (:values);
binding the :values parameter to the list of values however is appropriate for your environment. (Trivial when using Spring JDBC and a database that supports the in operator, less so for lesser setups. As a last resort you can generate the SQL dynamically.) You will get a result set with one row and one column of a numeric type. If it's 0, you can then insert the new data; if it's 1, report a constraint violation. (If it's anything else you have a whole new problem.)
If you need to check for every item in the user input, change the query to:
select value from tbl where value in (:values)
store the result in a set (called e.g. duplicates), and then loop over the user input items and check whether the value of the current item is in duplicates.
This should perform better than snarfing the entire dataset into memory.
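For the "generate the SQL dynamically" fallback, plain JDBC has no named list parameter, so one common approach is to emit one ? placeholder per value and bind each value with setString afterwards. The helper below is a sketch under that assumption; InClause and duplicatesQuery are hypothetical names, while tbl and value are the table and column assumed above.

```java
import java.util.Collections;

final class InClause {
    // Builds the duplicate-lookup query with one positional placeholder per value.
    static String duplicatesQuery(int valueCount) {
        if (valueCount < 1) {
            throw new IllegalArgumentException("need at least one value");
        }
        String placeholders = String.join(", ", Collections.nCopies(valueCount, "?"));
        return "select value from tbl where value in (" + placeholders + ")";
    }
}
```

Only the number of placeholders is generated; the values themselves are still bound as parameters, so user input never becomes part of the SQL text.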

atomic writes to ehcache

Context
I am storing a java.util.List inside ehcache.
Key(String) --> List<UserDetail>
The ordered List contains a Top 10 ranking of my most active users.
Problem
Concurrent third-party clients might be requesting this list.
I have a requirement to be as current as possible with regard to the ranking. Thus, if the ranking changes due to the activities of users, the ordered List in the cache must not be left stale for very long. Once I've recalculated a new List, I want to replace the one in the cache immediately.
Consider a busy scenario in which multiple concurrent clients are requesting the ranking: how can I replace the cache item in such a way that clients can continue to pull a possibly stale snapshot? They should never get a null value.
There will only be 1 server thread that writes to the cache.

I don't see what the problem is. Once you've replaced a cache item, clients will pull that new cache item. Up until that point they will pull the old cache item.
There should never be a time when they return a null cache item, unless you actually remove the item from the cache and then replace it.
If EHCache worked like that I would consider it pretty fundamentally broken, given that it's meant to be thread-safe!

You can simply store the new list in the cache. The next call to get will return it.
All you must make sure is that no one edits the list that is returned from the cache. For example in the server thread, you must copy the list:
List workingCopy = new ArrayList ((List)cache.get(key));
... modify list ...
cache.put (key, workingCopy);
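The copy-then-replace pattern can be sketched with JDK classes only. This is not Ehcache code: a ConcurrentHashMap stands in for the cache, RankingHolder is a hypothetical name, and plain strings stand in for UserDetail. The writer publishes an unmodifiable copy in a single put, so readers see either the previous snapshot or the new one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class RankingHolder {
    private final ConcurrentMap<String, List<String>> cache = new ConcurrentHashMap<>();

    // Writer: copy the freshly computed ranking and replace the old snapshot in one put.
    void publish(String key, List<String> ranking) {
        cache.put(key, Collections.unmodifiableList(new ArrayList<>(ranking)));
    }

    // Readers: get the latest published snapshot, or an empty list before the first publish.
    List<String> current(String key) {
        return cache.getOrDefault(key, Collections.<String>emptyList());
    }
}
```

Making the snapshot unmodifiable enforces the rule that nobody edits a list obtained from the cache: an attempt to modify it fails instead of silently changing what other clients see.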
