Suppose I have a hash set of request IDs that I've sent from a client to a server. The server's response returns the request ID that I sent, which I can then remove from the hash set. This will be run in a multithreaded fashion, so multiple threads can be adding to and removing IDs from the hash set. However, since the IDs generated are unique (from a thread safe source, let's say an AtomicInteger for now that gets updated for each new request), does the HashSet need to be a ConcurrentHashSet?
I would think the only case this could cause a problem is if the HashSet encounters collisions that require structural changes to the underlying data structure, but it doesn't seem like that would occur in this use case.
Yes, it does. The underlying array of the hash table may need to be resized, and distinct IDs can still hash into the same bucket, so having unique keys does not help at all.
However, since you know that the IDs are increasing, and if you can put an upper bound on the number of IDs outstanding at any time (let's say 1000), you can work with an upper and lower bound and a fixed-size array indexed by offset from the lowest key, in which case you will not need any mutexes or concurrent data structures. Such a data structure is very fragile, however: if you ever have more than your upper bound outstanding, all hell will break loose. So unless performance is a real concern, just use a concurrent set.
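For reference, the JDK does not actually ship a ConcurrentHashSet class; the usual equivalent is the set view returned by ConcurrentHashMap.newKeySet() (Java 8+). A minimal sketch of the straightforward approach, with illustrative class and method names:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RequestTracker {
        // ConcurrentHashMap.newKeySet() is the usual stand-in for a "ConcurrentHashSet".
        private final Set<Integer> outstanding = ConcurrentHashMap.newKeySet();
        private final AtomicInteger nextId = new AtomicInteger();

        // Called by any sending thread.
        public int registerRequest() {
            int id = nextId.incrementAndGet();
            outstanding.add(id);
            return id;
        }

        // Called by any thread handling a response.
        public boolean completeRequest(int id) {
            return outstanding.remove(id);   // true if the id was still outstanding
        }
    }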
Can I somehow use a LinkedHashMap in Hazelcast (Java, Spring)? I need to get unique records from the Hazelcast shared in-memory cache, but in the order in which I inserted them. I found in the Hazelcast documentation (https://docs.hazelcast.org/docs/latest-dev/manual/html-single/) that it offers distributed implementations of common data structures, but the map doesn't preserve element order, and the list and queue don't remove duplicate data. Do you know if I can use a LinkedHashMap, or somehow get unique data while preserving insertion order?
Ordered or linked storage isn't compatible with the goals of a data grid - highly concurrent and distributed storage.
Ordered retrieval is possible. Hazelcast's PagingPredicate with a comparator would do it. Or, if the volume is not too high, you could retrieve the entry set and sort it yourself.
The catch is, you have to provide the field to order upon.
If your data already has some sort of sequence number or timestamp that is always unique, this is easy.
If not, perhaps something like an IAtomicLong would do it. A getAndIncrement() would give you a unique number to use for each insert.
Watch out though: this has a race condition if two or more threads insert concurrently. To solve it you'd need some sort of singleton @Service running somewhere to do the "get next seqno; insert" step.
And if you restart the grid, the seqno in the atomic counter will need to be repositioned to the right place.
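To make the sequence-number idea concrete, here is a minimal sketch assuming Hazelcast 4.x/5.x packages (IAtomicLong from the CP subsystem, com.hazelcast.map.IMap); the map and counter names are made up. The record itself is the map key, so duplicates collapse, and the value is an insertion sequence number that is sorted on when reading back:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.cp.IAtomicLong;
    import com.hazelcast.map.IMap;

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class OrderedUniqueExample {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // The record itself is the map key, so duplicates collapse automatically;
            // the value is an insertion sequence number from a cluster-wide counter.
            IMap<String, Long> records = hz.getMap("orderedRecords");
            IAtomicLong seq = hz.getCPSubsystem().getAtomicLong("recordSeq");

            insertIfAbsent(records, seq, "alpha");
            insertIfAbsent(records, seq, "beta");
            insertIfAbsent(records, seq, "alpha");   // duplicate, ignored

            // Read everything back in insertion order by sorting on the sequence number.
            List<String> inOrder = records.entrySet().stream()
                    .sorted(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
            System.out.println(inOrder);   // [alpha, beta]

            hz.shutdown();
        }

        private static void insertIfAbsent(IMap<String, Long> map, IAtomicLong seq, String record) {
            // putIfAbsent keeps the first stored sequence number, so re-inserts don't reorder anything.
            map.putIfAbsent(record, seq.getAndIncrement());
        }
    }

Because putIfAbsent decides atomically which sequence number wins, concurrent inserts of the same record don't corrupt the order, which sidesteps the worst of the race mentioned above.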
I have the following issue.
I'm connecting to an external source through an API and getting the data as an InputStream.
The goal is to save the data after removing duplicate lines, where duplication is defined by columns 10, 15 and 22.
I'm fetching the data using several threads.
Currently I first save the data into a CSV file and then remove duplicates, but I want to do it while I'm reading the data.
The volume of the data is about 10 million records.
I have limited memory available: the machine has 32 GB, but other applications are using it as well.
I read here about using hash maps, but I'm not sure I have enough memory for that.
Does anyone have a suggestion for how to solve this?
A HashMap will use up at least as much memory as your raw data. Therefore, it is probably not feasible for the size of your data set (however, you should check, because if it does fit, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is actually identical (since different values may produce the same hash).
The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.
If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
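A minimal sketch of this hash-plus-file-reference idea (class and method names are mine, and the CSV handling is simplified): only an int hash and a byte offset per record are kept in memory, and the original line is re-read from the output file to confirm a real duplicate when two hashes match.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Keeps only an int hash of the dedup key plus a byte offset per record in memory.
    public class HashedDeduplicator {
        private final RandomAccessFile out;
        // hash of the dedup key -> file offsets of lines that produced that hash
        private final Map<Integer, List<Long>> seen = new HashMap<>();

        public HashedDeduplicator(String path) throws IOException {
            this.out = new RandomAccessFile(path, "rw");
        }

        // Returns true if the line was written, false if it duplicates an earlier line.
        public synchronized boolean writeIfNew(String line) throws IOException {
            String key = extractKey(line);
            int hash = key.hashCode();
            List<Long> offsets = seen.get(hash);
            if (offsets != null) {
                for (long offset : offsets) {
                    if (key.equals(extractKey(readLineAt(offset)))) {
                        return false;                      // genuine duplicate
                    }
                }
            }
            long position = out.length();
            out.seek(position);
            out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            seen.computeIfAbsent(hash, h -> new ArrayList<>()).add(position);
            return true;
        }

        private String readLineAt(long offset) throws IOException {
            out.seek(offset);
            return out.readLine();   // fine for ASCII CSV data; not a full UTF-8 reader
        }

        // Dedup key from columns 10, 15 and 22 (1-based); naive CSV splitting for the sketch.
        private String extractKey(String line) {
            String[] cols = line.split(",");
            return cols[9] + "|" + cols[14] + "|" + cols[21];
        }
    }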
The solution depends on how big the data in columns 10, 15 and 22 is.
Assuming it's not too big (say, around 1 KB per key), you can actually implement an in-memory solution.
Implement a Key class that stores the values from columns 10, 15 and 22, and carefully implement its equals and hashCode methods. (Alternatively, a plain List of the three values works as a key, since List already defines value-based equals and hashCode.)
Create a Set which would contain keys of all records you read.
For each record you read, check whether its key is already in that set. If yes, skip the record. If not, write the record to the output and add the key to the set. Make sure you access the set in a thread-safe manner.
In the worst case you'll need (number of records) × (size of key) memory. For 10,000,000 records and the assumed <1 KB per key, that works out to around 10 GB.
If the key size is still too large, you'll probably need a database to store the set of keys.
Another option would be storing hashes of keys instead of full keys. This will require much less memory but you may be getting hash collisions. This may lead to "false positives", i.e. false duplicates which aren't actually duplicates. To completely avoid this you'll need a database.
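A minimal sketch of the Key-class approach, assuming the three columns are read as strings (names like RecordKey and Deduplicator are illustrative); ConcurrentHashMap.newKeySet() provides the thread-safe set:

    import java.util.Objects;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Dedup key built from columns 10, 15 and 22.
    final class RecordKey {
        private final String col10;
        private final String col15;
        private final String col22;

        RecordKey(String col10, String col15, String col22) {
            this.col10 = col10;
            this.col15 = col15;
            this.col22 = col22;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof RecordKey)) return false;
            RecordKey other = (RecordKey) o;
            return col10.equals(other.col10)
                    && col15.equals(other.col15)
                    && col22.equals(other.col22);
        }

        @Override
        public int hashCode() {
            return Objects.hash(col10, col15, col22);
        }
    }

    class Deduplicator {
        // Thread-safe set, so several reader threads can share it.
        private final Set<RecordKey> seen = ConcurrentHashMap.newKeySet();

        // Returns true if the record should be written (first time this key is seen).
        boolean accept(RecordKey key) {
            return seen.add(key);   // add() reports whether the key was actually new
        }
    }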
You can use a concurrent set (e.g. ConcurrentHashMap.newKeySet(); the JDK has no actual ConcurrentHashSet class). A set automatically rejects duplicate elements, and this one is thread safe.
For small amounts of data we store all keys in one bin as a List, but there are limitations on the size of a bin.
The scanAll function with a ScanCallback in the Java client works very slowly in practice, so we cannot afford it in our project. Aerospike is fast when you give it the key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them quickly without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error, it just comes back with no value in the specific index.
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
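A rough sketch of the secondary-index option with the Aerospike Java client (the namespace "test", set "demo", and the bin/index names are placeholders; method names assume a reasonably recent client version):

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.query.Filter;
    import com.aerospike.client.query.IndexType;
    import com.aerospike.client.query.RecordSet;
    import com.aerospike.client.query.Statement;

    public class SetQueryExample {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

            // One-time setup: a secondary index on the extra "setname" bin.
            client.createIndex(null, "test", "demo", "idx_setname", "setname", IndexType.STRING)
                  .waitTillComplete();

            // Every write carries the "setname" bin so the record is reachable via the index.
            Key key = new Key("test", "demo", "some-user-key");
            client.put(null, key, new Bin("setname", "demo"), new Bin("value", 42));

            // Fetch all records of the small set through the index instead of scanAll().
            Statement stmt = new Statement();
            stmt.setNamespace("test");
            stmt.setSetName("demo");
            stmt.setFilter(Filter.equal("setname", "demo"));

            RecordSet rs = client.query(null, stmt);
            try {
                while (rs.next()) {
                    System.out.println(rs.getKey() + " -> " + rs.getRecord());
                }
            } finally {
                rs.close();
            }
            client.close();
        }
    }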
I'm designing an application that maps an IP address to certain info about that IP address. Currently, I have the information stored in a ConcurrentHashMap. The list of keys could change frequently, so I grab the latest copy of the list and update it once every minute.
However, I could possibly be querying this data structure a few thousand times a minute. Does it make sense to use a ConcurrentHashMap? Would there be a significant delay (larger than 1ms) when the list is being updated? There could be up to 1000 items in the list.
Thanks for your help!
If you look at the ConcurrentHashMap documentation, you will see that retrieval operations (such as get) generally do not block, so reads behave much as they do on a plain HashMap, while updates such as put and remove are safe to perform from multiple threads. Since your list of keys can change frequently, yes, I recommend you use ConcurrentHashMap.
Here is the documentation: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html
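As an illustration (class and method names are made up), a sketch of the usual pattern: threads call get() on the ConcurrentHashMap freely while a scheduled task replaces the contents once a minute; readers may briefly see a mix of old and new entries during the refresh, which is normally acceptable for this kind of lookup table.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class IpInfoCache {
        private final ConcurrentHashMap<String, String> ipInfo = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // Refresh once a minute; lookups keep running concurrently and never block on this.
            scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.MINUTES);
        }

        public String lookup(String ip) {
            return ipInfo.get(ip);   // non-blocking retrieval
        }

        private void refresh() {
            Map<String, String> latest = fetchLatestMapping();   // hypothetical data source
            ipInfo.putAll(latest);
            ipInfo.keySet().retainAll(latest.keySet());          // drop keys no longer present
        }

        // Placeholder for wherever the fresh IP list actually comes from.
        private Map<String, String> fetchLatestMapping() {
            return Map.of("10.0.0.1", "internal gateway");
        }
    }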
With so few items, it's very unlikely that there will be any delay. As a rough baseline, with a generous margin for error, a hash map can handle on the order of 10 million operations per second.
Given that you are only querying a few thousand times a minute and a hash map lookup is O(1), there shouldn't be any problem.
I'm implementing a cache in Java, but I have one last problem to solve: how to handle deletion of elements?
Elements are stored on disk; each element has a validity period (hence an expiration date) and a size, and the cache obviously has a maximum total size and a maximum number of elements it can store.
I have thought of three ways of performing deletion:
1. When inserting a new element into the cache, schedule a task (one per element) that runs at the element's expiration time and deletes it.
2. Run a thread every X minutes to check which elements can be deleted, and delete them.
3. When a limit (size or count) is reached, delete the oldest elements (or delete elements at random, which is faster).
Regarding the third option: with this policy the cache will also keep storing expired elements. Obviously, when one of them is requested, a check is performed to verify that the element is still valid.
What do you think about? What's the common behavior when managing a cache? Are there other solutions?
P.S. I'm developing this cache for Android, but I think this is not so important.
Basically you have to know how often your cached elements will be used, and in which order. A cache has to do the same thing an OS does to keep the most useful data in memory.
Have a look at these strategies and pick the one you need: http://en.wikipedia.org/wiki/Page_replacement_algorithm
A good choice would be LRU (Least Recently Used). But like all these strategies it has its drawbacks, which may not suit your particular usage pattern.
Implementation tips for LRU:
Use a PriorityQueue to store the elements in addition to your map. Keep a global counter that gets incremented every time you use one of your elements, and reinsert the corresponding element into the PriorityQueue with the current value of that counter.
When you need to remove an item, just remove the head (or tail, depending on your compareTo(...) implementation) of the queue, and remove it from the map as well.
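A compact sketch of that map-plus-PriorityQueue idea (class names are mine, not from the original post). Stale queue entries left behind by later touches of the same key are simply skipped at eviction time:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    class LruCache<K, V> {
        // Queue entries are (useCounter, key); the head is the least recently used.
        private static final class Entry<K> implements Comparable<Entry<K>> {
            final long counter;
            final K key;
            Entry(long counter, K key) { this.counter = counter; this.key = key; }
            @Override public int compareTo(Entry<K> o) { return Long.compare(counter, o.counter); }
        }

        private final int maxSize;
        private final Map<K, V> values = new HashMap<>();
        private final Map<K, Long> lastUse = new HashMap<>();   // current counter per key
        private final PriorityQueue<Entry<K>> queue = new PriorityQueue<>();
        private long globalCounter = 0;

        LruCache(int maxSize) { this.maxSize = maxSize; }

        synchronized V get(K key) {
            V v = values.get(key);
            if (v != null) touch(key);
            return v;
        }

        synchronized void put(K key, V value) {
            values.put(key, value);
            touch(key);
            evictIfNeeded();
        }

        // Re-insert the key with a fresh counter; the old queue entry becomes stale.
        private void touch(K key) {
            long c = ++globalCounter;
            lastUse.put(key, c);
            queue.add(new Entry<>(c, key));
        }

        private void evictIfNeeded() {
            while (values.size() > maxSize) {
                Entry<K> head = queue.poll();
                if (head == null) return;
                // Skip stale entries left behind by later touches of the same key.
                Long current = lastUse.get(head.key);
                if (current != null && current == head.counter) {
                    values.remove(head.key);
                    lastUse.remove(head.key);
                }
            }
        }
    }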