I'm designing an application that maps an IP address to certain info about that IP address. Currently, I have the information stored in a ConcurrentHashMap. The list of keys could change frequently, so I grab the latest copy of the list and update it once every minute.
However, I could possibly be querying this data structure a few thousand times a minute. Does it make sense to use a ConcurrentHashMap? Would there be a significant delay (larger than 1ms) when the list is being updated? There could be up to 1000 items in the list.
Thanks for your help!
If you look at the ConcurrentHashMap documentation, you will see that retrieval operations (such as get) generally do not block, so they behave much like they do in a plain HashMap, and they can safely overlap with update operations such as put and remove. Since your set of keys changes frequently, I do recommend using a ConcurrentHashMap here.
Here is the documentation: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html
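As an illustration, here is a minimal sketch of that pattern (the class, field and method names are placeholders, not something from your code): lookups read the ConcurrentHashMap directly while a scheduled task swaps in the latest entries once a minute.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IpInfoCache {
    // Reads never block; the once-a-minute refresh below overlaps safely with them.
    private final ConcurrentHashMap<String, String> ipInfo = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Replace the entries with the latest copy of the list once every minute.
        scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.MINUTES);
    }

    public String lookup(String ip) {
        return ipInfo.get(ip); // non-blocking; fine to call thousands of times a minute
    }

    private void refresh() {
        Map<String, String> latest = fetchLatestList();
        ipInfo.putAll(latest);
        ipInfo.keySet().retainAll(latest.keySet()); // drop keys that disappeared from the list
    }

    private Map<String, String> fetchLatestList() {
        // placeholder: load the up-to-date IP -> info mapping from wherever it lives
        return Map.of();
    }
}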
With so few items, it's very unlikely that there will be any noticeable delay. A conservative rule of thumb, with a large margin for error, is on the order of 10 million simple map operations per second.
Given that you are only querying a few thousand times a minute and hash map lookups are O(1), there shouldn't be any problem.
Any suggestion regarding the problem below would be appreciated.
Present situation:
I have an ArrayList of objects. We have already implemented sorting using a Comparator. Each object has hundreds of fields, so a single object in the ArrayList is not small. Going forward, as the ArrayList grows, we feel this will create a problem when sorting because of the overall size of the ArrayList.
Plan:
We will load the objects into a cache.
Instead of taking an ArrayList of objects as input, we plan to take an ArrayList of ids (strings). When an id is compared, we will fetch the corresponding object from the cache.
Problem:
I don't want to load all the objects into the cache, because the cache will be used only during sorting, so I don't want to create a huge cache just for this.
What I was planning to do was load only half of the objects into the cache; if an object is not present, load it from the DB, read it, and put it into the cache (replacing one of the objects already there). I don't want to query the DB for a single object at a time, because that way I would be hitting the DB tens of thousands of times.
I want to do bulk reads from the DB, but I have not been able to work out a strategy for that.
Any suggestion will be appreciated.
You're very confused.
Each object has hundreds of fields.
Irrelevant. Java uses references; that 'ArrayList of objects' you have is backed by an array, and each slot in that array takes about 8 bytes (4 bytes on some VMs, depending on the underlying details). Each slot holds, more or less, the location in memory where the object lives.
Going forward, as the ArrayList grows
...no, that in itself is not a problem. If you put 100,000 entries in this list, the memory load, at least for the list itself, is at most 800,000 bytes, which is less than a megabyte. Put it this way: on modern hardware, that list alone could hold 100 million items and your system wouldn't break a sweat (that would be less than a GB of memory for the references). Now, if you also have 100 million unique objects (as opposed to, say, adding the exact same object 100 million times, or adding null 100 million times), those objects ALSO occupy memory. That could be a problem. But the list is not the relevant part.
we feel this will create a problem when sorting because of the overall size of the ArrayList.
No. When you sort an ArrayList, you're looking at roughly n log n operations. The sorting infrastructure itself (moving entries around in the list) is near enough to zero cost: it's just blitting those 4-to-8-byte references around. Assuming the invocation of .compare() is cheap, even a throwaway $100 computer can sort millions of entries in fractions of a second. That just leaves the ~n log n invocations of .compare(). If that's expensive, okay, you may have a problem. For a list of 1 million entries, you're looking at roughly 20 million invocations of your compare method.
How fast is it?
If calling .compare(a, b) (where a and b are pointers to instances of your 'hundreds of fields' objects) inspects every single one of those hundreds of fields, that could get a little tricky perhaps, but if it just checks a few of them, there's nothing to worry about here. CPUs are FAST. You may go: "MILLIONS? Oh my gosh!", but your CPU laughs at this job.
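For instance, a comparator that only reads one or two fields stays cheap no matter how many fields the object has. Here is a minimal sketch (the Order class and its fields are invented for illustration):

import java.util.Comparator;
import java.util.List;

class Order {
    // imagine hundreds of other fields here; the comparator never touches them
    long timestamp;
    String customerId;

    long getTimestamp() { return timestamp; }
    String getCustomerId() { return customerId; }
}

class SortExample {
    static void sort(List<Order> orders) {
        // Only two fields are read per comparison, so ~n log n compare calls stay cheap.
        Comparator<Order> byTimeThenCustomer =
                Comparator.comparingLong(Order::getTimestamp)
                          .thenComparing(Order::getCustomerId);
        orders.sort(byTimeThenCustomer);
    }
}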
We will load the objects into a cache.
This plan is bad, because of the above reasons.
I want to do bulk reads from the DB
Okay, so you started out with 'we have an ArrayList of objects', but actually you don't have that; you have a DB connection? Which one is it?
Either you have all your data in an ArrayList, or you have your data in a DB. If it's all in an ArrayList, the DB part is irrelevant. If you don't have all your data in an ArrayList, your question is misleading and unclear.
If the data is in a DB, set up proper indices and use the ORDER BY clause.
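For illustration, a minimal JDBC sketch of that approach (the table and column names are invented); with an index on the sort column, the database does the ordering for you:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class SortedFetch {
    // Assumes an index exists on orders(created_at), so the DB can sort via the index.
    static void printSorted(Connection conn) throws SQLException {
        String sql = "SELECT id, created_at FROM orders ORDER BY created_at";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("id") + " " + rs.getTimestamp("created_at"));
            }
        }
    }
}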
For small amounts of data we store all the keys in one bin as a List.
But there are limits on the size of a bin.
The scanAll function with a ScanCallback in the Java client is actually very slow, so we cannot afford to use it in our project. Aerospike is fast when you give it the key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them quickly without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error; it just comes back with no record at that index in the results.
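Here is a rough sketch of that batch-read approach with the Aerospike Java client (the namespace, set name and key pattern are placeholders); keys that don't exist simply come back as null entries in the result array:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

class BatchReadExample {
    static void readKnownKeySpace(AerospikeClient client) {
        // Build candidate keys from whatever pattern the key space follows.
        Key[] keys = new Key[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new Key("test", "myset", "record-" + i);
        }

        // One batch call instead of a scan; null entries mark keys that do not exist.
        Record[] records = client.get(new BatchPolicy(), keys);
        for (int i = 0; i < records.length; i++) {
            if (records[i] != null) {
                System.out.println(keys[i].userKey + " -> " + records[i].bins);
            }
        }
    }
}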
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
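And a rough sketch of the secondary-index variant (the bin name, index name, namespace and set are placeholders): each record carries a bin duplicating its set name, which is indexed once and then queried with an equality filter.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

class SetNameQueryExample {
    static void queryBySetName(AerospikeClient client) {
        // One-time setup: secondary index on the bin that duplicates the set name
        // (assumes each record was written with a "setname" bin holding "myset").
        client.createIndex(new Policy(), "test", "myset", "setname_idx", "setname", IndexType.STRING)
              .waitTillComplete();

        // Query all records WHERE setname = "myset" instead of scanning the set.
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setFilter(Filter.equal("setname", "myset"));

        try (RecordSet rs = client.query(null, stmt)) {
            while (rs.next()) {
                System.out.println(rs.getKey() + " -> " + rs.getRecord().bins);
            }
        }
    }
}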
I am new to NoSQL systems. I want to use Java+Spring+MongoDB (not important).
I am trying to design the right schema for my data. I will have a very large number of log records (roughly 3,000,000,000 per year). A record looks like this:
{
shop: 'shop1',
product: 'product1',
count: '10',
incost: '100',
outcost: '120',
operation: 'sell',
date: '2015-12-12'
}
I have about 1000 shops and about 30000 products.
I need reports with the sum of count, or the sum of count*(outcost-incost), by [shop]+product, split by day or month.
*[shop] means an optional filter. In that case (without a shop filter) performance does not matter.
*Reports older than 1 year may be required, but performance does not matter there.
Can I use a single collection "logs" with indexes on date, shop and product, or should I explicitly split this collection into sub-collections by shop and year?
Sorry if my question is stupid, I am just a beginner...
Regards,
Minas
Unless and until the documents grow further, this works fine. If you want to add more fields to the existing documents, or append to existing fields, and you think a document may grow beyond the 16 MB limit, then it's better to have separate collections.
The indexing also looks fine, given that you have a compound index on the shop, product and date fields.
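For illustration, creating that compound index with the MongoDB Java driver could look roughly like this (the connection string, database and collection names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

class LogIndexSetup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs = client.getDatabase("reports").getCollection("logs");
            // Compound index so shop/product equality filters plus date ranges can use one index.
            logs.createIndex(Indexes.ascending("shop", "product", "date"));
        }
    }
}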
You would also see some performance gain (easier and faster, since fewer disk seeks are needed) if the complete data is retrieved from a single collection rather than fetched from multiple collections.
I would not do much aggregation on the main collection; 3 billion records is quite a lot.
One massive problem I can see with this is that any query will likely be huge, returning a massive number of documents. It is true that you can mitigate most of the negative factors of querying this collection by using sharding to spread out the weight of the data itself; however, the sheer amount of data returned to the mongos will likely be slow and painful to handle.
There comes a time when no amount of indexing will save you, because your collection is just too darn big.
This would not matter if you were just displaying the collection; MongoDB could do that easily. It is aggregation that will not work well.
I would do as you suggest: pre-aggregate into other collections based on data fragments and time buckets.
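A rough sketch of that pre-aggregation with the Java driver (the output collection name and bucket shape are assumptions, and it assumes count/incost/outcost are stored as numbers rather than strings): roll the raw logs up into per-day totals in a much smaller collection, and run the reports off that.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Arrays;

class DailyPreAggregation {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs = client.getDatabase("reports").getCollection("logs");

            // profit per record = count * (outcost - incost)
            Document profit = new Document("$multiply", Arrays.asList(
                    "$count",
                    new Document("$subtract", Arrays.asList("$outcost", "$incost"))));

            logs.aggregate(Arrays.asList(
                    // One bucket per shop + product + day ('date' is already a day string here).
                    Aggregates.group(
                            new Document("shop", "$shop")
                                    .append("product", "$product")
                                    .append("day", "$date"),
                            Accumulators.sum("totalCount", "$count"),
                            Accumulators.sum("totalProfit", profit)),
                    // Write the buckets to a much smaller collection that the reports read from.
                    Aggregates.out("daily_totals")
            )).toCollection(); // forces the pipeline to run
        }
    }
}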
I'm considering CQEngine for a project where I need to handle lots of real time events and execute some queries from time to time. It works well for returning the results but I noticed that the larger the collection gets the slower it becomes to add or remove elements to/from it.
I have a few simple indexes on the collection, so I'm assuming the delay comes from the indexes being updated on every add/remove. I also get an OutOfMemoryError with large numbers of events, which I think is because the indexes grow along with the collection.
So my question is, what is the indexing penalty in CQEngine for a fast changing collection (elements often added and removed from the collection)?
If you have a lot of unique values in attributes that you index, you would probably benefit from IndexQuantization discussed on the site.
This is a way to tune the tradeoff between memory usage and retrieval speed. But it's especially useful to reduce the size of indexes in memory, if you have a large number of unique values.
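As a rough sketch of what that can look like (the Event class, its attribute and the compression factor are invented for illustration), a quantizer makes one index entry cover a range of values instead of one entry per unique value:

import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.navigable.NavigableIndex;
import com.googlecode.cqengine.quantizer.IntegerQuantizer;
import com.googlecode.cqengine.query.option.QueryOptions;

class Event {
    final int timestamp; // many unique values -> large index without quantization
    Event(int timestamp) { this.timestamp = timestamp; }

    static final SimpleAttribute<Event, Integer> TIMESTAMP =
            new SimpleAttribute<Event, Integer>("timestamp") {
                public Integer getValue(Event event, QueryOptions queryOptions) {
                    return event.timestamp;
                }
            };
}

class QuantizedIndexExample {
    public static void main(String[] args) {
        IndexedCollection<Event> events = new ConcurrentIndexedCollection<>();
        // Each index entry now covers 5 adjacent timestamp values, shrinking the index
        // (and the per-add/remove maintenance cost) at the price of slightly slower reads.
        events.addIndex(NavigableIndex.withQuantizerOnAttribute(
                IntegerQuantizer.withCompressionFactor(5), Event.TIMESTAMP));
        events.add(new Event(42));
    }
}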
FYI you can also ask questions in the CQEngine discussion forum.
Hope that helps!
Does anyone know why the Java JDK implementation of Hashtable does not rehash the table upon remove?
What if space usage gets too low? Isn't that a reason to shrink the table and rehash?
Just like the 0.75 load factor that triggers a rehash on put, we could have a lower bound, say 0.25 (of course, analysis could be done to find the best value here), on the density of the table and trigger a rehash when it is crossed, provided the size of the table is greater than the initialCapacity.
Rehashing is an expensive operation, and the Java hash-based data structures try to avoid it. They only rehash when lookup performance would otherwise degrade, because lookup performance is the whole point of this type of data structure.
Here is a quote from the HashMap java docs:
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
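Concretely, following that advice, a map expected to hold a known number of entries can be sized up front so it never rehashes while being filled; a minimal sketch using the default 0.75 load factor:

import java.util.HashMap;
import java.util.Map;

class PresizedMapExample {
    static Map<String, String> createFor(int expectedEntries) {
        // capacity >= expectedEntries / loadFactor means the threshold is never crossed while filling
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        Map<String, String> map = createFor(1_000_000);
        // ...up to a million entries can now be put without ever triggering a rehash
    }
}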
Besides this argument, the Java creators might have reasoned that if your hashtable once held that many elements, the probability that it will hold that many again is quite high, so there is no need to rehash the table twice.
You would have to ask the Sun/Oracle engineers to know why there is no threshold for shrinking the table.
Here's my two cents:
Rehashing the table takes time
Checking on every remove operation takes time
On the other hand:
Probably you won't save much memory (the objects and the nodes within the table use much more space than the table itself)
There may not be many scenarios where you first create some very big hashtables, then empty them, and then need the unused space back
Do you know of any popular implementation that includes that behaviour (shrinking the table)?
In programming, as in life, there are lots of things that could be done. Some are only worthwhile in very specific cases; some are not worth the pain at all.