Hashtable rehash on remove - java

Does anyone know why the Java JDK implementation of Hashtable does not rehash the table upon remove?
What if space usage gets too low? Isn't that a reason to reduce the size and rehash?
Just like the 0.75 load factor that triggers a rehash on put, we could have a lower bound on the density of the table, say 0.25 (the best value could of course be analysed), and trigger a rehash when density falls below it, provided the size of the table is greater than the initialCapacity.
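To make the idea concrete, here is a hypothetical sketch of what such a lower-bound check could look like. Nothing like this exists in the real Hashtable; the class below just wraps a HashMap to illustrate the proposal, and all names are made up.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration only: java.util.Hashtable has no shrink rule like this.
    class ShrinkOnRemoveMap<K, V> {
        private static final double MIN_DENSITY = 0.25;   // the proposed lower bound
        private final int initialCapacity;
        private int capacity;                              // our own notion of the table size
        private Map<K, V> delegate;

        ShrinkOnRemoveMap(int initialCapacity) {
            this.initialCapacity = initialCapacity;
            this.capacity = initialCapacity;
            this.delegate = new HashMap<>(initialCapacity);
        }

        V put(K key, V value) {
            V old = delegate.put(key, value);
            if (delegate.size() > capacity * 0.75) {       // mirror the usual grow-on-put rule
                capacity *= 2;
            }
            return old;
        }

        V remove(K key) {
            V old = delegate.remove(key);
            // The proposed extra rule: if the table grew beyond its initial capacity
            // and density has dropped below the lower bound, shrink and rehash.
            if (capacity > initialCapacity && delegate.size() < capacity * MIN_DENSITY) {
                capacity /= 2;
                delegate = new HashMap<>(delegate);        // copying re-inserts (rehashes) every entry
            }
            return old;
        }
    }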

Rehashing is an expensive operation, and the Java hash-based data structures try to avoid it. They only rehash when lookup performance would otherwise suffer, because lookup performance is the whole point of this type of data structure.
Here is a quote from the HashMap Javadoc:
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
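For example, applying that advice directly (the one-million figure is just an illustration), a map created with capacity of at least expectedEntries / loadFactor is never rehashed while it is being filled:

    import java.util.HashMap;
    import java.util.Map;

    class PresizedMap {
        public static void main(String[] args) {
            int expectedEntries = 1_000_000;               // example figure
            float loadFactor = 0.75f;                      // the default load factor
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
            Map<Long, String> map = new HashMap<>(initialCapacity, loadFactor);
            for (long i = 0; i < expectedEntries; i++) {
                map.put(i, "value");                       // no resize/rehash occurs while filling
            }
            System.out.println(map.size());
        }
    }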
Besides that documented argument, the JDK authors might have reasoned that if you once had that many elements in your hashtable, the probability of reaching that size again is quite high, so there is no point in rehashing the table twice.

You would have to ask the Sun/Oracle engineers to know for sure why there is no threshold for decreasing the size.
Here's my two cents:
Rehashing the table takes time
Checking on every remove operation takes time
On the other hand:
You probably won't save much memory (the objects and the nodes inside the table use far more space than the table array itself)
There may not be many scenarios where you first create some very big hashtables, then empty them, and then crave the unused space
Do you know of any popular implementation that includes that behaviour (decreasing the table size)?
In programming, as in life, there are lots of things that might be done. Some are only worthwhile in very specific cases. Some are not worth the pain at all.

Related

Sort an ArrayList of an object using Cache

Any suggestion regarding the problem below would be appreciated.
Present situation:
I have an ArrayList of objects. We have already implemented sorting using a Comparator. Each object has hundreds of fields, so a single object in the ArrayList is not small. Going forward, as the size of the ArrayList increases, we feel this will create a problem for sorting because of the overall size of the ArrayList.
Plan:
We will load the objects into a cache.
Instead of taking an ArrayList of the objects as input, we are planning to take an ArrayList of ids (strings) as input. When an id is being compared, we plan to fetch the corresponding object from the cache.
Problem:
I don't want to load all the objects into the cache, because the cache will only be used during the sorting, so I don't want to create a huge cache just for this.
What I was planning to do was load only half of the objects into the cache; whenever an object is not present in the cache, load it from the DB, use it, and also put it into the cache (replacing one of the objects already there). I don't want to query the DB one object at a time, because that way I would be hitting the DB tens of thousands of times.
I want to do bulk reads from the DB, but I have not been able to work out a strategy for that.
Any suggestion will be appreciated.
You're very confused.
Each object has hundreds of fields.
Irrelevant. Java uses references: that 'ArrayList of objects' you have is backed by an array, and each slot in that array takes about 8 bytes (depending on VM details it could be 4). Each slot holds, more or less, the location in memory where the object lives.
Going forward, as the size of the ArrayList increases
.... no, it won't create a problem. If you put 100,000 entries in this list, the total memory load, at least for the list itself, is at most 800,000 bytes, which is less than a megabyte. Put it this way: on modern hardware, the list alone can contain 100 million items and your system wouldn't break a sweat (that's less than a GB of memory for the references). Now, if you also have 100 million unique objects (as opposed to, say, adding the exact same object 100 million times, or adding null 100 million times), each of those objects ALSO occupies memory. That could be a problem. But the list is not the relevant part.
we feel this will create a problem for sorting because of the overall size of the ArrayList.
No. When you sort an ArrayList, it takes roughly n log n operations. The sorting infrastructure itself (moving entries around in the list) is close to zero cost: it just shuffles those 4 to 8 byte references around in memory. Assuming the invocation of .compare() is cheap, even a throwaway $100 computer can sort millions of entries in fractions of a second. That just leaves the ~n log n invocations of .compare(). If that is expensive, okay, you may have a problem. So, in a list of 1 million entries, you're looking at something on the order of 20 million invocations of your compare method.
How fast is it?
If calling .compare(a, b) (where a and b are pointers to instances of your 'hundreds of fields' objects) inspects every single one of those hundreds of fields, that could get a little tricky perhaps, but if it just checks a few of them, there's nothing to worry about here. CPUs are FAST. You may go: "MILLIONS? Oh my gosh!", but your CPU laughs at this job.
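A rough illustration of that point (the Item type and its fields are invented for the example): sorting a million references on a single long field typically finishes in a fraction of a second on ordinary hardware, because only the references move and compare() reads one field.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    class SortCostDemo {
        record Item(long id, String name /* imagine hundreds more fields here */) {}

        public static void main(String[] args) {
            Random random = new Random(42);
            List<Item> items = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                items.add(new Item(random.nextLong(), "item" + i));
            }
            long start = System.nanoTime();
            // The sort only shuffles references around; compare() reads a single long field.
            items.sort(Comparator.comparingLong(Item::id));
            System.out.printf("sorted %d items in %d ms%n",
                    items.size(), (System.nanoTime() - start) / 1_000_000);
        }
    }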
We will load the objects into a cache.
This plan is a bad idea, for the reasons above.
I want to do bulk reads from the DB
Okay, so when you started out with 'we have an ArrayList of objects', you actually don't have that, but rather a DB connection? Which one is it?
Either you have all your data in an ArrayList, or you have it in a DB. If it's all in an ArrayList, the DB part is irrelevant. If you don't have it all in an ArrayList, your question is misleading and unclear.
If the data is in a DB, set up proper indices and use the ORDER BY clause.
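If the DB route applies, a minimal JDBC sketch of that advice could look like this; the JDBC URL, table and column names are placeholders, not taken from the question.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class SortedFetch {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/appdb", "app_user", "secret");
                 // With an index on (last_name, first_name) the database can satisfy
                 // the ORDER BY without sorting the whole table in memory.
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, last_name, first_name FROM customer ORDER BY last_name, first_name");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("id"));
                }
            }
        }
    }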

removing duplicates in java on large scale data

I have the following issue.
I'm connecting to some service using an API and getting the data as an InputStream.
The goal is to save the data after removing duplicate lines.
Duplication is defined by columns 10, 15 and 22.
I'm getting the data using several threads.
Currently I first save the data into a CSV file and then remove duplicates.
I want to do it while I'm reading the data.
The volume of data is about 10 million records.
The memory I can use is limited.
The machine has 32 GB of memory, but I am limited since other applications are using it too.
I read here about using hash maps.
But I'm not sure I have enough memory to use one.
Does anyone have a suggestion for how to solve this issue?
A HashMap will use at least as much memory as your raw data. It is therefore probably not feasible for a data set of your size (you should check that, though, because if it is feasible, it's the easiest option).
What I would do is write the data to a file or database, compute a hash value for the fields to be deduplicated, and store the hash values in memory with a suitable reference to the file (e.g. the byte index of where the original value is in the written file). The reference should of course be as small as possible.
When you hit a hash match, look up the original value and check whether it is really identical (since different values may hash to the same hash value).
The question, now, is how many duplicates you expect. If you expect few matches, I would choose a cheap write and expensive read solution, i.e. dumping everything linearly into a flat file and reading back from that file.
If you expect many matches, it's probably the other way round, i.e. having an indexed file or set of files, or even a database (make sure it's a database where write operations are not too expensive).
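Here is a minimal sketch of that hash-in-memory, value-on-disk idea, assuming the records are CSV lines appended to one flat file and the dedup key is columns 10, 15 and 22; the class name and file layout are my own, not from the question.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class HashWithFileLookupDedup {
        private final RandomAccessFile store;
        // hash of the dedup key -> byte offsets of the lines in 'store' that produced it
        private final Map<Integer, List<Long>> seen = new HashMap<>();

        HashWithFileLookupDedup(String path) throws IOException {
            this.store = new RandomAccessFile(path, "rw");
        }

        // Returns true if the line was new and has been appended to the store.
        synchronized boolean writeIfNew(String line) throws IOException {
            String key = dedupKey(line);
            int hash = key.hashCode();
            for (long offset : seen.getOrDefault(hash, List.of())) {
                store.seek(offset);
                String earlier = store.readLine();          // re-read the candidate from disk
                if (earlier != null && dedupKey(earlier).equals(key)) {
                    return false;                           // genuine duplicate, not just a hash collision
                }
            }
            long offset = store.length();
            store.seek(offset);
            store.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            seen.computeIfAbsent(hash, h -> new ArrayList<>()).add(offset);
            return true;
        }

        private static String dedupKey(String csvLine) {
            String[] cols = csvLine.split(",");             // naive CSV split, for illustration only
            return cols[9] + "|" + cols[14] + "|" + cols[21];
        }
    }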
The solution depends on how big the data in columns 10, 15 and 22 is.
Assuming it's not too big (say, around 1 KB per record), you can actually implement an in-memory solution.
Implement a Key class to store the values from columns 10, 15 and 22, and carefully implement its equals and hashCode methods. (You could also use a plain ArrayList of the three values instead, since lists already have value-based equals and hashCode.)
Create a Set that will contain the keys of all records you read.
For each record you read, check whether its key is already in that set. If yes, skip the record. If not, write the record to the output and add the key to the set. Make sure you work with the set in a thread-safe manner (see the sketch below).
In the worst case you'll need (number of records) x (size of a key) of memory. For 10,000,000 records and the assumed <1 KB per key, that is around 10 GB.
If the key size is still too large, you'll probably need a database to store the set of keys.
Another option is to store hashes of the keys instead of the full keys. This requires much less memory, but you may get hash collisions. That can lead to "false positives", i.e. records flagged as duplicates which aren't actually duplicates. To completely avoid this you'll need a database.
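A minimal sketch of the key-plus-set variant, assuming each record arrives as a CSV line and columns 10, 15 and 22 form the dedup key; a List is used as the key so equals and hashCode come for free.

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class InMemoryDedup {
        // Thread-safe set of keys seen so far, backed by a ConcurrentHashMap.
        private final Set<List<String>> seenKeys = ConcurrentHashMap.newKeySet();

        // Returns true the first time a key is seen, false for duplicates.
        boolean isNew(String csvLine) {
            String[] cols = csvLine.split(",");             // naive split, for illustration only
            List<String> key = List.of(cols[9], cols[14], cols[21]);
            return seenKeys.add(key);                       // add() is atomic and reports whether the key was new
        }
    }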
You can use a concurrent set, e.g. the one returned by ConcurrentHashMap.newKeySet(). A Set automatically rejects duplicate elements, and this one is thread-safe.

How to keep keys in Aerospike effectively?

For a not very big amount of data, we store all the keys in one bin as a List.
But there are limits on the size of a bin.
The scanAll function with a ScanCallback in the Java client works very slowly, so we cannot afford it in our project. Aerospike is fast when you give it the Key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them quickly without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error; it just comes back with no value at that index (see the sketch after this answer).
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
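For the first option (batch reads), a sketch with the Aerospike Java client could look roughly like this; the host, namespace, set name and key pattern are placeholders standing in for whatever your real key space looks like.

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.Key;
    import com.aerospike.client.Record;

    class BatchFetch {
        public static void main(String[] args) {
            try (AerospikeClient client = new AerospikeClient("127.0.0.1", 3000)) {
                Key[] keys = new Key[1000];
                for (int i = 0; i < keys.length; i++) {
                    keys[i] = new Key("test", "users", "user-" + i);   // known/guessable key space
                }
                // One batch read; keys that don't exist simply come back as null entries.
                Record[] records = client.get(null, keys);
                for (int i = 0; i < records.length; i++) {
                    if (records[i] != null) {
                        System.out.println(keys[i].userKey + " -> " + records[i].bins);
                    }
                }
            }
        }
    }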

Is ConcurrentHashMap good for rapidly updating concurrent data structures?

I'm designing an application that maps an IP address to certain info about that IP address. Currently, I have the information stored in a ConcurrentHashMap. The list of keys could change frequently, so I grab the latest copy of the list and update it once every minute.
However, I could possibly be querying this data structure a few thousand times a minute. Does it make sense to use a ConcurrentHashMap? Would there be a significant delay (larger than 1ms) when the list is being updated? There could be up to 1000 items in the list.
Thanks for your help!
If you look at the ConcurrentHashMap documentation, you will see that retrieval operations such as get generally do not block, so they behave much like in a normal HashMap and can safely overlap with update operations such as put and remove. You said that the list of keys could change frequently, so yes, I recommend you use a ConcurrentHashMap.
Here is the documentation: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html
With so few items, it's very unlikely that there will be any noticeable delay. A rough baseline, with a generous margin for error, is on the order of 10 million simple map operations per second.
Given that you are only querying a few thousand times a minute and hash map operations are O(1) on average, there shouldn't be any problem.
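A small sketch of the pattern described in the question, where readers call get() freely and a scheduled task refreshes the map once a minute; IpInfo and loadLatestSnapshot() are made-up names standing in for your real data source.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class IpInfoCache {
        record IpInfo(String country, String owner) {}

        private final Map<String, IpInfo> byIp = new ConcurrentHashMap<>();

        IpInfoCache(ScheduledExecutorService scheduler) {
            scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.MINUTES);
        }

        // Readers never block; during a refresh they see either the old or the new entry.
        IpInfo lookup(String ip) {
            return byIp.get(ip);
        }

        private void refresh() {
            Map<String, IpInfo> latest = loadLatestSnapshot();   // hypothetical data source
            byIp.putAll(latest);
            byIp.keySet().retainAll(latest.keySet());            // drop keys that disappeared from the list
        }

        private Map<String, IpInfo> loadLatestSnapshot() {
            // Placeholder: the real application would fetch the current list of ~1000 IPs here.
            return Map.of("192.0.2.1", new IpInfo("ZZ", "example"));
        }

        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            IpInfoCache cache = new IpInfoCache(scheduler);
            Thread.sleep(200);                                   // give the first refresh a moment to run
            System.out.println(cache.lookup("192.0.2.1"));
            scheduler.shutdown();
        }
    }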

What is the indexing penalty in CQEngine for a fast changing collection?

I'm considering CQEngine for a project where I need to handle lots of real-time events and execute some queries from time to time. It works well for returning results, but I noticed that the larger the collection gets, the slower it becomes to add elements to or remove elements from it.
I have added a few simple indexes to the collection, so I'm assuming the delay comes from the indexes being updated on every add/remove. I also get an OutOfMemoryError for large numbers of events, presumably because the indexes grow along with the collection.
So my question is, what is the indexing penalty in CQEngine for a fast changing collection (elements often added and removed from the collection)?
If you have a lot of unique values in attributes that you index, you would probably benefit from IndexQuantization discussed on the site.
This is a way to tune the tradeoff between memory usage and retrieval speed. But it's especially useful to reduce the size of indexes in memory, if you have a large number of unique values.
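A sketch of what that can look like; the Event class, the attribute and the compression factor are made up for the example, so double-check the exact API against the CQEngine version you use.

    import com.googlecode.cqengine.ConcurrentIndexedCollection;
    import com.googlecode.cqengine.IndexedCollection;
    import com.googlecode.cqengine.attribute.Attribute;
    import com.googlecode.cqengine.attribute.SimpleAttribute;
    import com.googlecode.cqengine.index.navigable.NavigableIndex;
    import com.googlecode.cqengine.quantizer.IntegerQuantizer;
    import com.googlecode.cqengine.query.option.QueryOptions;

    class QuantizedIndexExample {
        static class Event {
            final int timestampSeconds;
            Event(int timestampSeconds) { this.timestampSeconds = timestampSeconds; }
        }

        static final Attribute<Event, Integer> TIMESTAMP =
                new SimpleAttribute<Event, Integer>("timestampSeconds") {
                    @Override
                    public Integer getValue(Event event, QueryOptions queryOptions) {
                        return event.timestampSeconds;
                    }
                };

        public static void main(String[] args) {
            IndexedCollection<Event> events = new ConcurrentIndexedCollection<>();
            // Grouping ~60 adjacent values into one index entry trades a little retrieval
            // speed for a much smaller index on a high-cardinality attribute.
            events.addIndex(NavigableIndex.withQuantizerOnAttribute(
                    IntegerQuantizer.withCompressionFactor(60), TIMESTAMP));
            events.add(new Event(1_700_000_000));
            System.out.println(events.size());
        }
    }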
FYI you can also ask questions in the CQEngine discussion forum.
Hope that helps!
