I'm building tree pagination in JSF 1.2 and RichFaces 3.3.2 because I have a lot of tree nodes (something like 80k), and the tree is slow.
So, as a first attempt, I created a HashMap mapping each page to the list of nodes on that page.
But the performance still isn't good enough...
So I was wondering: is there something faster than a HashMap, maybe a List of Lists or something similar?
Does anyone have experience with this? What can I do?
Thanks in advance.
EDIT.
The big problem is that I have to validate the user's permissions on the child nodes of the tree. I know this is the real bottleneck: the validation is slow because I have to go down into the nodes, and I don't have a good way to know whether the user has permission on a 10th-level node without iterating over all of them. On top of that, the same tree is used in other places...
The basic reason I was doing this pagination is that the client side becomes very slow because of the structure RichFaces generates: a lot of tr's and td's, and the browser just goes crazy with it.
So, unfortunately, I have to load all the nodes and paginate only on the client side, and I need to know which of these structures is faster to iterate...
Sorry for my bad English.
A hash map is the fastest data structure if you want to get all nodes for a page. The list of nodes can be fetched in constant time, O(1), while with lists the lookup is O(n) (n = number of pages; faster on sorted lists, but never getting near O(1)).
Which operations on your data structure are too slow? That's what you have to analyse before you start optimizing.
It's probably due more to the fact that JSF is a performance pig than to your choice of data structure. The one attempt I've seen to create a JSF app could be timed with a sundial.
You're making a mistake by guessing about solutions without more knowledge about the root cause. I'd recommend that you profile your app to see where the time is being spent.
The data structure to use always depends on how you need to store the data and how you need to access it. HashMap<K, V> is supposed to give constant-time access to a value given its key: when you call get(key), the hashCode() of key is computed and used to retrieve the associated value. Unless you have many different keys with the same hash code (in which case something is probably wrong, since, while it's not mandatory, different objects should have different hash codes in the vast majority of cases), this is fast.
Searching for an element in a plain list requires scanning the list, which will (almost) always be slower than computing a hash code.
If you need to associate values with keys, a Map is the way to go. And HashMap should be fast enough.
I don't know too much about JSF, but I think that if the data structure and access pattern are the ones a Map is designed for, the problem is not the HashMap itself.
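For concreteness, here is a minimal sketch of the page-to-nodes lookup, with TreeNode standing in for your node class. The lookup itself is a single constant-time hash access, so if this is still slow the time is probably going into the permission checks or the rendering rather than the map:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the page -> nodes idea; TreeNode stands in for your node class.
public class PageCache {
    public static class TreeNode { /* id, children, permissions, ... */ }

    private final Map<Integer, List<TreeNode>> nodesByPage = new HashMap<>();

    public void addToPage(int page, TreeNode node) {
        nodesByPage.computeIfAbsent(page, p -> new ArrayList<>()).add(node);
    }

    /** Fetching a page is one O(1) hash lookup, not a scan over all nodes. */
    public List<TreeNode> getPage(int page) {
        return nodesByPage.getOrDefault(page, Collections.emptyList());
    }
}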
I would solve this with JavaScript/Ajax calls that fetch child nodes on demand.
I am working on a web crawler/spider and I need some way to efficiently store large numbers of strings, as a record of (1) already-stored sites and (2) a queue for my crawler. These storage data structures must be able to hold well beyond millions of string values. I will start with the research I have done and what I have tried so far.
The first method I tried was referenced from this thread
Java: optimize hashset for large-scale duplicate detection
In this thread, the OP talks about optimizing a HashSet and was given a lot of good feedback and warnings. A HashSet was very expensive to use and caused my program to crash very quickly. In the replies, alternatives like Trove were suggested; however, that project has since been discontinued and I believe there are better alternatives.
The second method I tried was to create a queue using MongoDB. I created a collection explicitly for a queue, following FIFO; since Mongo uses locks, it should be thread safe. From what I could tell, it worked very well: my crawler ran smoothly and used very little memory (12-42 MB on average). However, this method soon proved to be very poor, as searching MongoDB this way is O(n). Creating an iterator that checks two collections (the website collection and the queue collection) for every single website to be cached proved very detrimental.
Having followed this thread
Strategies for fast searches of billions of small documents in MongoDB
It improved the search speed slightly, but only mildly offset the problem. Below is simple pseudocode of my web crawler.
while (true) {
    parse();
}

public void parse() {
    String next = // next URL in the queue to be parsed
    Document document = // get the HTML DOM from the next URL
    // store document inside of site storage (mongo collection)
    // grab links from document
    for (all links found) {
        if (next doesn't exist in website collection and next isn't already in queue) {
            add to queue
        }
    }
}
For the check "next doesn't exist in website collection and next isn't already in queue", I have to create an iterator or use mongo.collection.find().limit(1) (which is also an iterator, just behind the scenes) to check whether the next element exists in the currently stored websites or in the queue. So as you can see, as the two collections grow (currently over 100,000 entries in each), it becomes very expensive and slow to constantly check both collections.
Which brings me back to my first method: holding potentially billions of URLs in memory for faster duplicate searches across both stores. Most of what I have read was very useful but is now outdated, so I was wondering what you think the best method for this is.
holding potentially billions of URLs in memory
This is surely something you needn't and should not do.
I have to create an iterator
This is surely something you must not do (unless the iterator runs only over a tiny fraction of your data).
next doesn't exist in website collection and next isn't already in queue
Think about data representation. For searching, lists are far too slow, so you need an indexed search. Something like HashMap or TreeMap, but on the disk.
I know close to nothing about MongoDB, but every database worth its name can do this. I guess it already works for your website collection; just the queue is a problem. The queue is more complicated, as you need both fast search and queue semantics.
This problem can be eliminated trivially by putting every new element into both the queue and the collection, so you only need to check the collection for duplicates (which, if I understand you correctly, you can do pretty fast). Obviously, you need a marker to distinguish elements that have not yet been fetched.
A further optimization would be keeping a cache of recently accessed elements in memory, so that some repeated DB queries can be eliminated. I'd bet a Bloom filter could help, too.
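If the Bloom filter route interests you, here is a minimal sketch using Guava's BloomFilter; the expected-insertion count and false-positive rate are placeholder assumptions you would tune for your crawl:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Minimal Bloom filter sketch; the sizing numbers below are illustrative.
public class SeenUrls {
    private final BloomFilter<String> probablySeen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000,   // expected insertions: rough guess at crawl size
            0.01);        // acceptable false-positive rate

    /** Returns true if the URL is definitely new; false means "check the DB to be sure". */
    public boolean definitelyNew(String url) {
        if (probablySeen.mightContain(url)) {
            return false;          // possibly a duplicate: fall back to a MongoDB lookup
        }
        probablySeen.put(url);     // remember it so future checks can skip the DB
        return true;
    }
}

A false positive only costs you one extra database lookup, so a small error rate is usually acceptable here.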
You can also use a real Map on the disk: https://github.com/OpenHFT/Chronicle-Map
If each of them is guaranteed to have a unique key (generated and enforced by an external keying system), which Map implementation is the correct fit for me? Assume this has to be optimized for concurrent lookup only (the data is initialized once during application startup).
Do these 300 million unique keys have any positive or negative implications for bucketing/collisions?
Any other suggestions?
My map would look something like this:
Map<String, <boolean, boolean, boolean, boolean>>
I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted key array you use binary search to find the position of a key, which gives you the index into data[].
The tricky part will be building up the arrays without running out of memory.
You don't need to consider concurrency because you only read the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: the search time is guaranteed not to exceed O(log n), even in worst cases where the keys cause hash-code collisions.
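A minimal sketch of that idea, assuming the keys have already been converted to long and sorted at load time, with the four booleans packed into one byte per entry (names here are illustrative):

import java.util.Arrays;

// Sorted-index lookup sketch: keys[] is sorted ascending, flags[i] belongs to keys[i].
public final class PackedLookup {
    private final long[] keys;   // sorted keys, built once at startup
    private final byte[] flags;  // four booleans per entry packed into one byte

    public PackedLookup(long[] sortedKeys, byte[] packedFlags) {
        this.keys = sortedKeys;
        this.flags = packedFlags;
    }

    /** Returns the packed flags for key, or -1 if the key is absent. */
    public int lookup(long key) {
        int idx = Arrays.binarySearch(keys, key);   // O(log n), no hashing involved
        return idx >= 0 ? flags[idx] & 0xFF : -1;
    }
}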
Other suggestions? You bet.
Use a proper key-value store, Redis is the first option that comes to mind. Sure it's a separate process and dependency, but you'll win big time when it comes to proper system design.
There should be a very good reason why you would want to couple your business logic with several gigs of data in the same process's memory, even if it's ephemeral. I've tried this several times and was always proved wrong.
It seems to me that you can simply use a TreeMap, because it will give you O(log n) search time due to its sorted structure. Furthermore, it is a suitable choice because, as you said, all the data will be loaded at startup.
If you need to keep everything in memory, then you will need a library meant for this number of elements, such as Huge Collections. On top of that, if the number of writes will be large, you should also consider more sophisticated solutions such as a non-blocking hash map.
I'm designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (the data will be HTTP POSTed).
These requests will then be stored in MySQL.
The key issue that I need guidance on is that there will be a high percentage of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that hashes prop1, prop2, and prop3 to create a unique hash code (the body can be different and the post can still be considered unique).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
There is a higher chance of duplication of posted data within a 24-hour period, so I could purge data from this dictionary every x hours.
Any suggestions on the data structure to store the duplicates? And what about purging, and how many records should I keep, considering 3K requests per second? It will get large very fast.
Note: there are 10K different sources that will be posting, and duplication only occurs within a given source. That means I could have more than one dictionary, maybe one per group of sources, to spread things out. If source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
Interesting question.
I would probably look at some kind of HashMap-of-HashMaps structure here, where the first level of HashMaps uses the sources as keys and the second level contains the actual data (the minimum needed for detecting duplicates), using your hash function for hashing. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources if you need to distribute the load over several machines.
With regard to purging, I think you have to measure the exact behavior with production-like data. You need to learn how quickly the data grows once you successfully eliminate duplicates and how it becomes distributed across the HashMaps. With a good distribution and not-too-quick growth, I can imagine an occasional cleanup is good enough. Otherwise an LRU policy might be good.
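A rough sketch of that two-level structure with ConcurrentHashMap; the names sourceId and propsHash are just illustrative:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Outer map keyed by source, inner map keyed by the hash of prop1/prop2/prop3.
public class DuplicateTracker {
    private final ConcurrentMap<String, ConcurrentMap<String, Boolean>> seenBySource =
            new ConcurrentHashMap<>();

    /** Returns true if this (source, hash) pair has been seen before. */
    public boolean isDuplicate(String sourceId, String propsHash) {
        ConcurrentMap<String, Boolean> perSource =
                seenBySource.computeIfAbsent(sourceId, k -> new ConcurrentHashMap<>());
        // putIfAbsent returns the previous value, so non-null means a duplicate
        return perSource.putIfAbsent(propsHash, Boolean.TRUE) != null;
    }
}

Purging can then work per source, e.g. dropping a whole inner map for a source that has been idle for a while.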
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try a Bloom filter. Be careful: this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of a false positive extremely low if you tweak the parameters carefully.
Edit: OK, so Bloom filters are not acceptable. To still get constant-time lookup (albeit not constant-time insertion), look into cuckoo hashing.
1) Set up your database like this:
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv (a sketch follows after this list)
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
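A rough sketch of step a); the 10-second flush interval, the file naming, and the use of a StringBuilder with explicit synchronization for the swap are assumptions for illustration:

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Buffer rows in memory, swap in a fresh buffer periodically, write the old one to a CSV file.
public class CsvSpooler {
    private StringBuilder buffer = new StringBuilder();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public CsvSpooler() {
        flusher.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
    }

    public synchronized void append(String csvLine) {
        buffer.append(csvLine).append('\n');
    }

    private void flush() {
        StringBuilder old;
        synchronized (this) {                 // swap in a fresh buffer
            if (buffer.length() == 0) return;
            old = buffer;
            buffer = new StringBuilder();
        }
        String path = "/temp/source/" + System.currentTimeMillis() + ".csv";  // illustrative path
        try (FileWriter w = new FileWriter(path)) {
            w.write(old.toString());          // hand the old buffer's contents to disk
        } catch (IOException e) {
            e.printStackTrace();
        }
        // occasionally: Runtime.getRuntime().exec("mysqlimport --replace ... " + path);
    }
}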
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // new key, so send the data on
There are lots of different HashMap implementations available, but you could look at NBHM (the non-blocking hash map). Non-blocking puts and a design aimed at large, scalable problems mean it could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need - but I have no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to avoid resizing the map by making it large enough from the start - otherwise throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data - probably with an idle thread that traverses an iterator and removes old entries.
Use a java.util.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The api docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling 3K requests as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
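For illustration, a tuned construction might look like the following; the capacity and concurrency numbers are placeholder guesses you would replace with figures from your load tests:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Pre-sized ConcurrentHashMap so it never has to resize under load.
public class HashIndex {
    private final ConcurrentMap<String, Boolean> seenHashes =
            new ConcurrentHashMap<>(4_000_000,  // initialCapacity: room for a day of unique hashes
                                    0.75f,      // loadFactor
                                    64);        // concurrencyLevel: expected concurrent writers

    /** putIfAbsent returns null only for a previously unseen hash. */
    public boolean isNew(String hash) {
        return seenHashes.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}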
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the ConcurrentHashMap.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
If you use a strong hash function, such as MD5 or SHA-1, you will not need to store any of the original data at all. The probability of a collision is virtually nil, so if you find the same hash result twice, the second occurrence is a duplicate.
Given that an MD5 hash is 16 bytes and a SHA-1 hash is 20 bytes, this should decrease memory requirements, keep more elements in the CPU cache, and therefore dramatically improve speed.
Storing these keys requires little more than a small hash table, with trees to handle collisions.
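A small sketch of deriving such a fixed-size key with the JDK's MessageDigest; the method and parameter names are assumptions for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Derive a 20-byte dedup key from the three properties instead of storing them.
public final class ContentKeys {
    public static byte[] sha1Key(String prop1, String prop2, String prop3)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(prop1.getBytes(StandardCharsets.UTF_8));
        md.update((byte) 0);                                   // separator so "ab"+"c" != "a"+"bc"
        md.update(prop2.getBytes(StandardCharsets.UTF_8));
        md.update((byte) 0);
        md.update(prop3.getBytes(StandardCharsets.UTF_8));
        return md.digest();                                    // 20 bytes per record
    }
}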
Project Background:
I am writing a map tile overlay class for Java that can use gdal2tile.py tiles. Basically I will end up with thousands of JPG files in a file structure like
"Zoom Level/X coordinate/Y coordinate"
The coordinates are ints but will not necessarily start at 0 or 1.
I will have to search for tiles that are within a certain range to find out which ones I need to render.
My Problem:
I tried iterating using the file structure itself but it is wicked slow (not surprising).
I tried iterating using an ArrayList of strings of the file structure and .contains() but it seems to be even slower (not too surprising).
Optimally I would like to use a data structure that would let me choose a range on multiple dimensions, so that I can call something like:
Tiles.getWhere(zoomLevel, minX, maxX, minY, maxY);
I assume that some sort of Collection or TreeMap would be the right choice but I'm not experienced enough with Java to know for sure and I'd prefer not to have to benchmark a lot of different approaches.
I could use SQLite to do it but that seems like overkill.
My Question:
What is the most efficient way to check for the existence of datasets given multiple dimensional constraints?
Maybe you are looking for a map with multiple keys.
Commons-collections provides a map with multiple lookup keys:
http://commons.apache.org/collections/apidocs/org/apache/commons/collections/map/MultiKeyMap.html
A hash-based map gives O(1) insertion and O(1) lookup times.
Thinking about your problem, I can see three directions you could aim your search at next (this is not a step-by-step guide but rather an out-of-the-box brain opener for the stuck situation you are facing):
1) Use Java's built-in structures. Yes, indeed, a list is about the worst case for searching. A Map, as the name suggests, is far more convenient for maps. It is not only the name: indexing into a Map is significantly less time-consuming than scanning a List. You can picture your data as a cube: with a List you have to handle about half of the points inside it, while with an indexed Map you only touch a narrow layer of it. That is an order-of-magnitude difference. So my answer here: Map is the keyword pointing in the right direction (a rough sketch follows after this list), assuming you want to go this way after reading the rest of my answer.
2) Use a map server solution. This is probably too far from your current approach, but entire frameworks are built to solve your type of question. An example is GeoServer. It has a ready-made solution for the whole problem and is a stable answer to the bigger problem you possibly have on your hands: showing a map to a user from a source.
3) Sticking with the GDAL framework you were already using, you could pick a slightly different .py file, such as gdal_proximity.py, and - wow! - you have a search capability in hand. This particular one searches by a center point and a distance, but it will do what you need =)
That is a starting point for how I would approach it. Could this be of some use?
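For direction 1), here is a rough sketch of a zoom -> x -> y index built from nested TreeMaps, where subMap gives you the range selection; the class and method names are illustrative, not an existing API:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Nested sorted maps: zoom level -> x coordinate -> y coordinate -> tile path.
public final class TileIndex {
    private final TreeMap<Integer, TreeMap<Integer, TreeMap<Integer, String>>> byZoom = new TreeMap<>();

    public void add(int zoom, int x, int y, String path) {
        byZoom.computeIfAbsent(zoom, z -> new TreeMap<>())
              .computeIfAbsent(x, xx -> new TreeMap<>())
              .put(y, path);
    }

    /** All tile paths at the given zoom level with minX <= x <= maxX and minY <= y <= maxY. */
    public List<String> getWhere(int zoom, int minX, int maxX, int minY, int maxY) {
        List<String> result = new ArrayList<>();
        TreeMap<Integer, TreeMap<Integer, String>> byX = byZoom.get(zoom);
        if (byX == null) return result;
        for (TreeMap<Integer, String> byY : byX.subMap(minX, true, maxX, true).values()) {
            result.addAll(byY.subMap(minY, true, maxY, true).values());
        }
        return result;
    }
}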
Sounds to me like you are looking for something like an Interval Tree.
http://en.wikipedia.org/wiki/Interval_tree
I have implemented one of these in the past but only in one dimension. The Wikipedia reference mentions extensions to more dimensions.
Paul
I need a data structure to store users that can be retrieved by id.
I noticed there are several classes that implement the Map interface. Which one should be my default choice? They all seem quite equivalent to me.
It probably depends on how many users you plan to have and whether you need them ordered or just need to fetch single items by id.
HashMap uses hash codes to store things, so you get constant time for put and get operations, but the items are always unordered.
TreeMap instead uses a binary tree, so basic operations take O(log n) time, but the items are kept ordered in the tree.
I would use HashMap because it's the simplest one (remember to give it a suitable initial capacity). Keep in mind that these data structures are not synchronized by default; if you plan to use one from more than one thread, use ConcurrentHashMap.
A middle ground is LinkedHashMap, which uses the same structure as HashMap (hashCode and equals) but also keeps a doubly linked list of the elements inserted in the map, maintaining insertion order. This hybrid gives you ordered items (ordered in the sense of insertion order) without the performance cost of TreeMap.
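As a minimal sketch of the HashMap option (the User type and the capacity hint are assumptions):

import java.util.HashMap;
import java.util.Map;

// Users keyed by id; the initial capacity avoids rehashing while loading the data set.
public class UserStore {
    public static class User {
        final long id;
        final String name;
        User(long id, String name) { this.id = id; this.name = name; }
    }

    private final Map<Long, User> usersById = new HashMap<>(200_000);

    public void add(User u) { usersById.put(u.id, u); }     // average O(1)
    public User byId(long id) { return usersById.get(id); } // average O(1)
}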
No concurrency: use java.util.HashMap
Concurrency: use java.util.concurrent.ConcurrentHashMap
If you want some control on the order used by iterators, use a TreeMap or a LinkedHashMap.
This is covered on the Java Collections Trail, Implementations page.
If they all seem equivalent then you haven't read the documentation. Sun's documentation is pretty much as terse as it gets and provides very important points for making your choices.
Start here.
Your choice could also be influenced by how you intend to use the data structure, and where you would rather have performance - on reads, or on writes?
In a user-login system, my guess is that you'll be doing more reads than writes.
(I know I already answered this once, but I feel this needs saying)
Have you considered using a database to store this information? Even if it's SQLite, it'd probably be easier than storing your user database in the program code or loading the entire dataset into memory each time.