I am using Java. I need to publish data to a FIFO queue. This queue will be processed by a separate thread. This way I avoid blocking the main thread.
My use case for publishing data is:
Each data object has a field that identifies it uniquely, so there are 50-odd such 'keys'. The other fields make up the rest of the object's data.
If a new data object comes along, it should not be blindly inserted into the queue; it should replace the old one, but only if their data differs based on a field comparison, otherwise it is simply discarded. Remember, one of the fields is the key; the rest are data and can differ wildly.
The data must be processed on a FIFO basis, so I need something queue-like.
Needless to say, it should be thread safe too.
Does anyone know of a data structure that satisfies these criteria? Thanks.
When I had to do something like this in C#, I created a custom data structure that contained a Dictionary and a Queue. The Dictionary was indexed by the item key, and the value contained the data associated with the key. The queue contained only the keys.
To insert an item into the queue, I did the following (pseudocode)
lock the data structure
if key exists in dictionary
replace old item data in dictionary with new item data
else
add new item to the dictionary, indexed by key
add new item key to the queue
release lock
To dequeue an item:
lock
remove first item key from queue
lookup data for that key in dictionary
remove item from dictionary
release lock
process the item
We did thousands of updates a second across multiple threads, and didn't encounter any performance problems. The locks just aren't held for very long. Your mileage may vary, of course.
You can avoid locks on the main thread by adding a lock-free concurrent queue that the main thread can push updates to. Another thread can service that queue and add items to the hybrid dictionary/queue structure I described above.
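For what it's worth, here is a rough Java sketch of that hybrid structure (class and method names are purely illustrative), using a plain HashMap plus an ArrayDeque behind synchronized methods:
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class KeyedFifoQueue<K, V> {

    private final Map<K, V> latestByKey = new HashMap<>(); // key -> most recent data
    private final Deque<K> order = new ArrayDeque<>();     // keys in FIFO order

    // Insert, replace or discard, keeping the original queue position of an existing key.
    public synchronized void offer(K key, V value) {
        V old = latestByKey.get(key);
        if (old == null) {
            latestByKey.put(key, value);
            order.addLast(key);            // brand-new key joins the back of the queue
        } else if (!old.equals(value)) {
            latestByKey.put(key, value);   // same key, different data: replace in place
        }
        // same key, equal data: silently discarded
    }

    // Remove and return the oldest entry, or null if the queue is empty.
    public synchronized V poll() {
        K key = order.pollFirst();
        return key == null ? null : latestByKey.remove(key);
    }
}
The consumer thread just calls poll() in a loop; because both methods synchronize on the same monitor, the replace-or-discard decision and the queue bookkeeping stay consistent.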
Can I somehow use LinkedHashMap in Hazelcast (Java, Spring)? I need to get unique records from the Hazelcast shared in-memory cache, but in the order in which I inserted them. I found that the Hazelcast documentation (https://docs.hazelcast.org/docs/latest-dev/manual/html-single/) offers distributed implementations of common data structures, but the map doesn't preserve element order, and the list and queue don't remove duplicate data. Do you know if I can use a LinkedHashMap, or somehow get unique data while preserving its order?
Ordered or linked storage isn't compatible with the goals of a data grid - highly concurrent and distributed storage.
Ordered retrieval is possible. Hazelcast's Paging Predicate with a comparator would do it. Or, if the volume is not too high, you could retrieve the entry set and sort it yourself.
The catch is, you have to provide the field to order upon.
If your data already has some sort of sequence number or timestamp that is always unique, this is easy.
If not, perhaps something like an IAtomicLong would do it. A getAndIncrement() would give you a unique number to use for each insert.
Watch out though, this has a race condition if two or more threads insert concurrently. To solve this you'd need some sort of singleton @Service running somewhere to do the "get next seqno; insert" step.
And if you restart the grid, the seqno in the atomic counter will need to be repositioned to the right place.
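For illustration, a minimal sketch of that sequence-number idea, assuming Hazelcast 3.x where getAtomicLong() is still available directly on the HazelcastInstance (in Hazelcast 4+ it moved to the CP subsystem); the map and record names are made up:
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IMap;

import java.util.TreeMap;

public class SequencedInsertExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Cluster-wide counter used as an insertion sequence number.
        IAtomicLong seq = hz.getAtomicLong("record-seq");
        IMap<Long, String> records = hz.getMap("records");

        // Each insert is tagged with the next sequence number.
        records.put(seq.getAndIncrement(), "first record");
        records.put(seq.getAndIncrement(), "second record");

        // Ordered retrieval: sort the entries by their sequence number.
        new TreeMap<>(records).forEach((k, v) -> System.out.println(k + " -> " + v));

        hz.shutdown();
    }
}
If you also need de-duplication, you would instead key the map by the record's own identity, keep the sequence number inside the value, and sort on that field when reading.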
I am creating a Java program to process a MongoDB collection as a queue. So when I dequeue, I want the document that was inserted first.
To do so, I have a field called created, which represents the timestamp of the document's creation, and my initial idea was to use the aggregation $min to find the smallest document using the created field.
However, it occurred to me: why not use findOne() without any arguments? It will always return the first document in the collection.
So my question is: should I do that? Would it be a good approach to use findOne() and dequeue the first record from the Mongo queue? And what are the drawbacks if I do so?
PS: The Mongo queue program is created to serve the requests of devices on a first-come, first-served basis. Since it takes some time to execute a request, and a device can't accept another request while it is processing one, I am using the queue to process requests one by one so that no request is dropped.
Interesting how many people here commented incorrectly, but you are right in that a raw .findOne() with a blank query or .findOne({}) will return the first document in the collection, that being "the document with the lowest _id value".
Ideally for a queue processing system, you want to remove the document at the same time as doing this. For this purpose the Java API supports a .findAndRemove() method:
DBCollection data = mongoOperation.getCollection("data");
// findAndRemove with an empty query returns the first document (natural order) and deletes it.
DBObject removed = data.findAndRemove(new BasicDBObject());
So that will return the first document in the collection as described and "remove" it from the collection so that no other operations can find it.
Alternatively, you can call .findAndModify() and set all the options yourself, but if all you are after is the "oldest document first", which is what the _id guarantees, then this is all you want.
findOne returns documents in natural order. This is not necessarily the same as insertion order; it is the order in which documents appear on disk. It may look like documents are being retrieved in insertion order, but after deletes and re-inserts you will start seeing documents appear out of order.
One of the ways to guarantee that elements always appear in insertion order is to use capped collections. If your application is not affected by their restrictions, they might be the simplest way to implement a queue.
Capped collections can also be used with a tailable cursor, so the logic that retrieves items from the queue can keep waiting for items when none are available to process.
Update: If you cannot use a capped collection, you would have to sort the results by _id (if it is an ObjectId), or keep a timestamp-based field in the collection and order the results by that field.
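For illustration, with the same legacy DBCollection API used in the other answer, making the order explicit rather than trusting natural order might look like this (the collection name is just an example):
DBCollection data = mongoOperation.getCollection("data");

// Oldest document first: sort on _id ascending instead of relying on natural order.
DBCursor cursor = data.find()
        .sort(new BasicDBObject("_id", 1))
        .limit(1);
DBObject oldest = cursor.hasNext() ? cursor.next() : null;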
FindOne returns using the $natural order within the internal MongoDB bTree that exists behind the scenes.
The function does not, by default, sort by _id, nor will it pick the lowest _id.
If you find it returns the lowest _id regularly then that is because of document positioning within the $natural index.
Getting the first document of the collection and the first document of a sorted set are two totally different things.
If you wanted to use findAndModify to grab a document off the pile (though I would personally recommend an optimistic lock instead), you would need to use:
findAndModify({
    sort: {_id: 1},
    remove: true
})
The reason I would not recommend this approach is that if the process crashes or the server goes down in the distributed worker set, you have lost that data point. Instead you want a temporary (optimistic-type) lock which can be released if the document has not been processed correctly.
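As a sketch of that optimistic-lock idea, reusing the data collection handle from the earlier snippet (the inProgress and claimedAt field names are made up):
// Claim the oldest unclaimed document instead of removing it outright.
DBObject query = new BasicDBObject("inProgress", false);
DBObject sort = new BasicDBObject("_id", 1);
DBObject update = new BasicDBObject("$set",
        new BasicDBObject("inProgress", true).append("claimedAt", new java.util.Date()));

// Signature: findAndModify(query, fields, sort, remove, update, returnNew, upsert)
DBObject claimed = data.findAndModify(query, null, sort, false, update, true, false);

// Process 'claimed', then delete it; if the worker dies, a cleanup job can reset
// inProgress on documents whose claimedAt is older than some timeout.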
I'm implementing a cache in Java, but I have one last problem to solve: how should I handle element deletion?
Elements are stored on disk; each element has a validity period (and thus an expiration date) and also a size. My cache obviously has a maximum size and a maximum number of elements that may be stored.
I have imagined three ways of performing element deletion:
When inserting a new element into the cache, a scheduled thread (one per element) is configured to start at the expiration time (in order to delete the element).
Execute a thread every X minutes to check which elements may be deleted (and delete them).
When a limit (size or number) is reached, the oldest elements are deleted (or elements are deleted randomly, which is faster).
Regarding the third point, with this policy the cache will keep storing expired elements as well. Obviously, when one of these is requested, a check is performed to verify that the element is still valid.
What do you think? What is the common behaviour when managing a cache? Are there other solutions?
P.S. I'm developing this cache for Android, but I think this is not so important.
Basically you have to know how often your cached elements will be used, and in which order. A cache has to do the same thing an OS does in order to keep the best data in memory.
Have a look at these strategies and take the one you need: http://en.wikipedia.org/wiki/Page_replacement_algorithm
A good tip would be LRU (Least Recently Used). But like all these strategies it has some drawbacks, which may or may not suit your usage pattern.
Implementation tips for LRU:
Use a PriorityQueue to store the elements in addition to your map. Keep a global counter that is incremented every time you use one of your elements, and reinsert the corresponding element into the PriorityQueue with the current value of the counter.
If you need to remove an item from the queue, you just have to remove the first or last element from the queue (depending on your implementation of the compareTo(...) method), and remove it from the map as well.
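A minimal sketch of that counter-based LRU bookkeeping might look like the following; stale queue entries are removed lazily at eviction time, and all names and the synchronization policy are just illustrative:
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class LruCache<K, V> {

    private static final class Stamp<K> {
        final K key;
        final long counter;
        Stamp(K key, long counter) { this.key = key; this.counter = counter; }
    }

    private final int maxEntries;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> latestStamp = new HashMap<>();
    private final PriorityQueue<Stamp<K>> byAge =
            new PriorityQueue<>(Comparator.comparingLong((Stamp<K> s) -> s.counter));
    private long counter;

    public LruCache(int maxEntries) { this.maxEntries = maxEntries; }

    public synchronized V get(K key) {
        V value = values.get(key);
        if (value != null) touch(key);   // every use refreshes the element's stamp
        return value;
    }

    public synchronized void put(K key, V value) {
        values.put(key, value);
        touch(key);
        while (values.size() > maxEntries) evictOldest();
    }

    private void touch(K key) {
        long stamp = ++counter;
        latestStamp.put(key, stamp);
        byAge.add(new Stamp<>(key, stamp)); // old stamps stay behind; skipped on eviction
    }

    private void evictOldest() {
        Stamp<K> candidate;
        while ((candidate = byAge.poll()) != null) {
            Long latest = latestStamp.get(candidate.key);
            if (latest != null && latest == candidate.counter) { // not a stale stamp
                values.remove(candidate.key);
                latestStamp.remove(candidate.key);
                return;
            }
        }
    }
}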
I have a collection of POJOs in memory, and these POJOs come from another system. I have the following two problems with them:
I want to know which POJOs are duplicates in terms of their property values.
I also need to validate against another collection. For example, I have 200 shops in a city and the shop IDs start at 1 and end at 200. If a shop submits data with 500 as the shop ID, I want to verify whether the data is correct according to my collection of data.
I am currently stuck and don't know how to perform these operations.
I am collecting data for market trends; shops from all over the city are registered with us. We assigned an ID to each store. Store keepers send us their selling details in a plain file format. My task is to collect correct data into the DB. If the shop or goods ID doesn't match my collection, that record is incorrect and I notify the shop keeper that it is invalid. If the file contains the same row two or more times, I also notify them that it is a duplicate.
Thanks,
I think you need to have the equals() method (and a matching hashCode()) implemented for these POJOs to say that one object is a duplicate of another. Then you can keep inserting the POJOs you receive into a java.util.Set, and every time you receive a new one you can check whether it has already been received, using set.contains().
For #2, you can maintain a Map of ID to the object in that other collection to check if a newly arrived POJO is a valid one present in that map.
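A small sketch of both checks, using a made-up ShopRecord POJO whose equals()/hashCode() cover its fields:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class ShopDataValidator {

    // Hypothetical POJO; equals()/hashCode() over its fields is what makes duplicate detection work.
    static final class ShopRecord {
        final int shopId;
        final String goodsId;

        ShopRecord(int shopId, String goodsId) {
            this.shopId = shopId;
            this.goodsId = goodsId;
        }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ShopRecord)) return false;
            ShopRecord other = (ShopRecord) o;
            return shopId == other.shopId && Objects.equals(goodsId, other.goodsId);
        }

        @Override public int hashCode() {
            return Objects.hash(shopId, goodsId);
        }
    }

    public static void main(String[] args) {
        // #2: map of registered shop IDs for validity checks.
        Map<Integer, String> registeredShops = new HashMap<>();
        for (int id = 1; id <= 200; id++) registeredShops.put(id, "Shop " + id);

        // #1: a Set detects duplicate rows via equals()/hashCode().
        Set<ShopRecord> seen = new HashSet<>();
        List<ShopRecord> incoming = Arrays.asList(
                new ShopRecord(1, "G-7"),
                new ShopRecord(1, "G-7"),     // duplicate row
                new ShopRecord(500, "G-9"));  // unknown shop id

        for (ShopRecord record : incoming) {
            if (!registeredShops.containsKey(record.shopId)) {
                System.out.println("Invalid shop id: " + record.shopId);
            } else if (!seen.add(record)) {
                System.out.println("Duplicate record: " + record.shopId + "/" + record.goodsId);
            }
        }
    }
}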
AFAIK, drools requires you to provide a canonical representation of your object to run the rules on the state of the object. The above two validations require you to maintain those data structures irrespective of using drools or any other rule engine.
Context
I am storing a java.util.List inside ehcache.
Key(String) --> List<UserDetail>
The ordered List contains a Top 10 ranking of my most active users.
Problem
Concurrent 3rd-party clients might be requesting this list.
I have a requirement to be as current as possible with regard to the ranking. Thus, if the ranking changes due to the activities of users, the ordered List in the cache must not be left stale for very long. Once I've recalculated a new List, I want to replace the one in the cache immediately.
Consider a busy scenario whereby multiple concurrent clients are requesting the ranking; how can I replace the cache item in a fashion such that clients can continue to pull a (possibly stale) snapshot, and never get a null value?
There will only be 1 server thread that writes to the cache.
I don't see what the problem is. Once you've replaced a cache item, clients will pull that new cache item. Up until that point they will pull the old cache item.
There should never be a time when they return a null cache item, unless you actually remove the item from the cache and then replace it.
If EHCache worked like that I would consider it pretty fundamentally broken, given that it's meant to be thread-safe!
You can simply store the new list in the cache. The next call to get will return it.
All you must make sure of is that no one edits the list that is returned from the cache. For example, in the server thread you must copy the list before modifying it:
// Copy the cached list, modify the copy, then publish it with a single put.
List<UserDetail> workingCopy = new ArrayList<>((List<UserDetail>) cache.get(key));
// ... modify workingCopy ...
cache.put(key, workingCopy);