LRU byte Cache java - java

I need to implement a cache in Java with a maximum size, and I would like to bound it by the real size of the cache in memory rather than by the number of elements. The cache will basically have String keys and String values. I have already implemented the cache using Java's LinkedHashMap, but the question is how to know the actual size of the cache so that I can adapt the policy to drop an object when the size gets too big.
I wanted to compute it using getObjectSize() from the instrumentation package, but it does not seem to work as desired.
When I call getObjectSize() on a string, it returns the same size, 32, whatever the length of the string. I guess it only measures the reference or the String object itself and not the content, so I don't know how to solve this problem efficiently.
Do you have any ideas?
Thanks a lot!

You might want to consider using Ehcache with memory based cache sizing.

If your keys and values are both strings, then the calculation is easy: object overhead + 2 bytes per character in the strings. On a 32-bit Sun JVM, 32 bytes for overhead sounds correct.
There are a couple of caveats: first, the Map that you use to hold the cache adds its own overhead. This will depend on the size of the hash table and the number of entries in the map. Personally, I'd just ignore all overheads and base the calculation on the string lengths.
Second, unless you track strings by identity, you may over-count because the same string may be stored with multiple keys. Since tracking strings by identity would add yet more overhead, this is probably not worth doing.
And finally: while memory-limited caches seem like a good idea, they rarely are. If you know your application well enough, you should know the average string length, and can control the cache based on number of entries. And if you don't know your application that well, a simple LRU expiration policy is likely to get you into trouble: a large entry can cause many small entries to be expired. And if that happens, unless the cost to rebuild is proportional to the size, you've just made your cache less effective.
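To make the estimate described above concrete, here is a minimal sketch of a LinkedHashMap-based LRU cache that evicts by estimated bytes instead of entry count. The class name ByteBoundedLruCache and the 32-byte per-String overhead are assumptions for illustration; the real overhead depends on the JVM, and the map's own overhead is ignored, as the answer suggests.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU cache for String keys and values, bounded by an estimated byte size. */
final class ByteBoundedLruCache {
    // Assumed fixed cost per String object; the real figure depends on the JVM.
    private static final long STRING_OVERHEAD = 32;

    private final long maxBytes;
    private long currentBytes;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, String> map =
            new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Estimate: object overhead plus 2 bytes per character. */
    static long estimate(String s) {
        return STRING_OVERHEAD + 2L * s.length();
    }

    synchronized String get(String key) {
        return map.get(key); // also marks the entry as recently used
    }

    /** Stores a non-null value and evicts old entries if the budget is exceeded. */
    synchronized void put(String key, String value) {
        String old = map.put(key, value);
        if (old != null) {
            currentBytes -= estimate(old);   // key was already counted
        } else {
            currentBytes += estimate(key);
        }
        currentBytes += estimate(value);
        // Evict least recently used entries until the estimate fits the budget.
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, String> eldest = it.next();
            currentBytes -= estimate(eldest.getKey()) + estimate(eldest.getValue());
            it.remove();
        }
    }

    synchronized long estimatedBytes() {
        return currentBytes;
    }

    synchronized int size() {
        return map.size();
    }
}
```

Note that an entry larger than the whole budget evicts everything, including itself, which is the "one large entry expires many small ones" effect described above.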

Related

What can I do if I require more memory than there is on the heap in Java?

I have a graph algorithm that generates intermediate results associated with different nodes. Currently, I have solved this by using a ConcurrentHashMap<Node, List<Result>> (I am running multithreaded). So at first I add new results with map.get(node).add(result) and then I consume all results for a node at once with map.get(node).
However, I need to run on a pretty large graph where the number of intermediate results won't fit into memory (good old OutOfMemoryError). So I require some solution to write the results out to disk, because that's where there is still space.
Having looked at a lot of different "off-heap" maps and caches as well as MapDB, I figured none of them is a fit for me. None of them seems to support multimaps (which I guess you can call my map) or mutable values (which the list would be). Additionally, MapDB has been very slow for me when trying to create a new collection for every node (even with a custom serializer based on FST).
I can barely imagine, though, that I am the first and only to have such a problem. All I need is a mapping from a key to a list which I only need to extend or read as a whole. What would an elegant and simple solution look like? Or are there any existing libraries that I can use for this?
Thanks in advance for saving my week :).
EDIT
I have seen many good answers, however, I have two important constraints: I don't want to depend on an external database (e.g. Redis) and I can't influence the heap size.
You can increase the size of the heap. The heap can be configured to be larger than the physical memory of your server, as long as this condition holds:
the size of heap + the size of other applications < the size of physical memory + the size of swap space
For instance, if the physical memory is 4G and the swap space is 4G, the heap size can be configured to 6G. But the program will suffer from page swapping.
You can use a database such as Redis. Redis is a key-value database and has a List structure. I think this is the simplest way to solve your problem.
You can compress the Result instances. First serialize the instance, then compress the resulting bytes, and define a class to hold them:
class CompressResult {
    byte[] result;
    //...
}
Then replace Result with CompressResult. You will have to decompress and deserialize the result whenever you want to use it. This works well if the Result class has many fields and is very complicated.
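The serialize-then-compress step could look roughly like the sketch below. It assumes Result implements java.io.Serializable and uses plain Java serialization with GZIP from the standard library; the helper class name CompressedResults is made up for illustration, and a faster serializer could be swapped in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: turns a Serializable value into compressed bytes and back. */
final class CompressedResults {

    /** Serializes the value and gzips the serialized bytes. */
    static byte[] compress(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the object stream also finishes the GZIP stream.
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Reverses compress(): gunzips and deserializes. */
    static Object decompress(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }
}
```

The byte[] returned by compress() is what the result field of CompressResult would hold. Small objects may not shrink at all, because the serialization and GZIP headers add a fixed cost.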
My recollection is that the JVM runs with a small initial max heap size. If you use the -Xmx10000m you can tell the JVM to run with a 10,000 MB (or whatever number you selected) heap. If your underlying OS resources support a larger heap that might work.

Java - Millions of records, HashMap throws OutOfMemoryError

I'm reading a file and parsing a few of the fields of each record as a reference key and another field as the reference value. These keys and values are looked up by another process.
Hence, I chose a HashMap, so that I can easily get the value for each key.
But each of the files consists of tens of millions of records, so the HashMap throws an OutOfMemoryError. I suspect that increasing the heap memory will not be a good solution if the input file grows in the future.
For similar questions on SO, most answers suggest using a database. I fear I will not be given the option to use a DB. Is there any other way to handle the problem?
EDIT: I need to do this kind of HashMap loading for 4 such files :( I need all four, because if I don't find a matching entry for my input in the first map, I need to look in the second, then the third, and finally the fourth.
Edit 2: The files I have sum up to around 1 GB.
EDIT 3:
034560000010000001750
000234500010000100752
012340000010000300374
I have records like these in a file. I need to have 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys and get the values in another process.
Using a database will not reduce memory cost or runtime by itself.
However, the default hashmaps may not be what you are looking for, depending on your data types. When used with boxed primitive values such as Integer, Java hashmaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses roughly 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in a Java HashMap<Integer, Integer>.
If you encode the integers as String, and we're talking maybe 6 digit numbers, that is likely 24 bytes for the underlying char[] array (16 bit characters; 12 bytes overhead for the array, sizes are a multiple of 8!), plus 16 bytes for the String object around (maybe 24, too). For key and value each. So that is then around 24+40+40, i.e. over 104 bytes per entry.
(Update: as your keys are 17 characters in length, make this 24+62+40, i.e. 136 bytes)
If you used a primitive hashmap such as GNU Trove's TIntIntHashMap, it would only take 8 bytes + unused, so let's estimate 16 bytes per entry, at least 6 times less memory.
(Update: for TLongIntHashMap, estimate 12 bytes per entry, 24 bytes with overhead of unused buckets.)
Now you could also just store everything in a massive sorted list. This will allow you to perform a fast join operation, and you will lose much of the overhead of unused entries, and can probably process twice as many in much shorter time.
Oh, and if you know the valid value range, you can abuse an array as "hashmap".
I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage, and write each entry into the appropriate row. Don't store the key at all - it's the offset in the array.
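As a rough illustration of the "sorted list" idea above, the sketch below keeps the keys in a sorted long[] and the values in a parallel int[], and looks keys up by binary search. It assumes the fixed-width layout from the question (17 key digits followed by the value digits) and non-negative values; the class name SortedRecordTable is made up. For brevity it sorts the record strings in memory, whereas for tens of millions of records you would want the file sorted beforehand.

```java
import java.util.Arrays;

/** Lookup table backed by two parallel primitive arrays (no per-entry objects). */
final class SortedRecordTable {
    private static final int KEY_DIGITS = 17; // assumed from the sample records

    private final long[] keys;   // sorted ascending
    private final int[] values;  // values[i] belongs to keys[i]

    SortedRecordTable(String[] records) {
        // Fixed-width digit strings sort lexicographically in numeric key order.
        String[] sorted = records.clone();
        Arrays.sort(sorted);
        keys = new long[sorted.length];
        values = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            keys[i] = Long.parseLong(sorted[i].substring(0, KEY_DIGITS));
            values[i] = Integer.parseInt(sorted[i].substring(KEY_DIGITS));
        }
    }

    /** Returns the value for the key, or -1 if the key is absent. */
    int lookup(String key) {
        int pos = Arrays.binarySearch(keys, Long.parseLong(key));
        return pos >= 0 ? values[pos] : -1;
    }
}
```

This costs 12 bytes per record (one long plus one int) with no unused buckets, at the price of O(log n) lookups instead of O(1).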
Last but not least, by default the JVM caps the heap at only a fraction of your memory (typically about 25% of physical RAM). You probably want to increase its memory limit.
Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.
Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.
I understand your concerns about having an external component to depend on. Note, however, that a DBMS is not necessarily a server daemon. There are tiny DBMSs which link with your program and keep the whole dataset in a file, like SQLite does.
Such large data collections should be handled with a database. Java programs are limited in memory, varying from device to device. You provided no info about your program, but please remember that if it is run on different devices, some of them may have very little ram and will crash very quickly. DB (be it SQL or file-based) is a must when it comes to large-data programs.
You have to either
a) have enough memory to load the data into memory, or
b) read the data from disk, with an index which is either in memory or not.
Whether you use a database or not the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance if you start randomly accessing the disk.
There are alternatives like Chronicle Map which use off-heap memory and perform well up to double your main memory size, so you won't get an out-of-memory error. However, you still have the problem that you can't store more data in memory than you have main memory.
The memory footprint depends on how you approach the file in Java. A widely used solution is to stream the file using the Apache Commons IO LineIterator. Their recommended usage is:
LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    it.close();
}
It's an optimized approach, but if the file is too big, you can still end up with an OutOfMemoryError.
Since you write that you fear that you will not be given the option to use a database some kind of embedded DB might be the answer. If it is impossible to keep everything in memory it must be stored somewhere else.
I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is lower than if you keep the entire dataset in memory, but they are still fast.
You could try lazy loading it.

'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers for your task out of the box:
Persistence to disk via memory-mapped files (see comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need it or are going to parallelize text processing in the future.
Once your data structure is somewhere between a few hundred MB and the order of your RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets so large that you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap size setting, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can parallelize this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At the first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good tries implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
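A limited-size cache of the most recently used lookups, as suggested above, can be sketched with LinkedHashMap in access order and removeEldestEntry. The class name RecentLookupCache is made up; the capacity would be 10000 for the figure mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps at most maxEntries entries, dropping the least recently used one. */
final class RecentLookupCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    RecentLookupCache(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

A lookup would first try cache.get(word) and only fall back to the database or Solr on a miss, then put the result into the cache. The class is not thread-safe; wrapping it with Collections.synchronizedMap is one option for concurrent use.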

Fastest way to access this object

Let's say I have a list of 1,000,000 users where the unique identifier is the username string. So to compare two User objects I just override the compareTo() method and compare the username members.
Given a username string, I wish to find the User object in a list. What, in the average case, would be the fastest way to do this?
I'm guessing a HashMap, mapping usernames to User objects, but I wondered if there was something else that I didn't know about which would be better.
If you don't need to store them in a database (which is the usual scenario), a HashMap<String, User> would work fine - it has O(1) complexity for lookup.
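For reference, building such a map from an existing list is a few lines. The User class below is a hypothetical stand-in with only a username field, and UserIndex is a made-up helper name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UserIndex {
    /** Hypothetical minimal User type; the real class would carry more fields. */
    static final class User {
        final String username;

        User(String username) {
            this.username = username;
        }
    }

    /** Builds a username -> User map, presized so it does not rehash while loading. */
    static Map<String, User> byUsername(List<User> users) {
        Map<String, User> index = new HashMap<>((int) (users.size() / 0.75f) + 1);
        for (User u : users) {
            index.put(u.username, u);
        }
        return index;
    }
}
```

After the one-time O(n) build, index.get(username) returns the User in expected constant time, or null if no such user exists. Since String already implements hashCode() and equals(), nothing needs to be overridden for the key.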
As noted, the usual scenario is to have them in the database. But in order to get faster results, caching is utilized. You can use EhCache - it is similar to ConcurrentHashMap, but it has time-to-live for elements and the option to be distributed across multiple machines.
You should not dump your whole database in memory, because it will be hard to synchronize. You will face issues with invalidating the entries in the map and keeping them up-to-date. Caching frameworks make all this easier. Also note that the database has its own optimizations, and it is not unlikely that your users will be kept in memory there for faster access.
I'm sure you want a hash map. They're the fastest thing going, and memory efficient. As also noted in other replies, a String works as a great key, so you don't need to override anything. (This is also true of the following.)
The chief alternative is a TreeMap. This is slower and uses a bit more memory. It's a lot more flexible, however. The same map will work great with 5 entries and 5 million entries. You don't need to clue it in in advance. If your list varies wildly in size, the TreeMap will grab memory as it needs and let it go when it doesn't. HashMaps are not so good about letting go, and as I explain below, they can be awkward when grabbing more memory.
TreeMaps work better with garbage collectors. They ask for memory in small, easily found chunks. If you start a hashtable with room for 100,000 entries, when it gets full it will free the 100,000 element (almost a megabyte on a 64 bit machine) array and ask for one that's even larger. If it does this repeatedly, it can get ahead of the GC, which tends to throw an out-of-memory exception rather than spend a lot of time gathering up and concentrating scattered bits of free memory. (It prefers to maintain its reputation for speed at the expense of your machine's reputation for having a lot of memory. You really can manage to run out of memory with 90% of your heap unused because it's fragmented.)
So if you are running your program full tilt, your list of names varies wildly in size--and perhaps you even have several lists of names varying wildly in size--a TreeMap will work a lot better for you.
A hash map will no doubt be just what you need. But when things get really crazy, there's the ConcurrentSkipListMap. This is everything a TreeMap is except it's a bit slower. On the other hand, it allows adds, updates, deletes, and reads from multiple threads willy-nilly, with no synchronization. (I mention it just to be complete.)
In terms of data structures, the HashMap can be a good choice. It favours larger datasets. The time for inserts is considered constant, O(1).
In this case it sounds like you will be carrying out more lookups than inserts. For lookups the average time complexity is O(1 + n/k), where n is the number of entries and k the number of buckets. The key factor here (sorry about the pun) is how effective the hashing algorithm is at evenly distributing the data across the buckets.
The risk here is that the usernames are short and use a small character set such as a-z. In that case there could be a lot of collisions, causing the HashMap to be loaded unevenly and therefore slowing down the lookups. One option to improve this could be to create your own user key object and override the hashCode() method with an algorithm that suits your keys better.
In summary, if you have a large data set, a good/suitable hashing algorithm, and the space to hold it all in memory, then a HashMap can provide a relatively fast lookup.
Given your last post on the ArrayList and its scalability, I think I would take Bozho's suggestion and go for a purpose-built cache such as EhCache. This will allow you to control memory usage and eviction policies. Still a lot faster than db access.
If you don't change your list of users very often then you may want to use Aho-Corasick. You will need a pre-processing step that will take O(T) time and space, where T is the sum of the lengths of all user names. After that you can match user names in O(n) time, where n is the length of the user name you are looking for. Since you will have to look at every character in the user name you are looking for I don't think it's possible to do better than this.

Reducing memory usage of very large HashMap

I have a very large hash map (2+ million entries) that is created by reading in the contents of a CSV file. Some information:
The HashMap maps a String key (which is less than 20 chars) to a String value (which is approximately 50 characters).
This HashMap is initialized with an initial capacity of 3 million so that the load factor is around .66.
The HashMap is only utilized by a single operation, and once that operation is completed, I "clear()" it. (Although it doesn't appear that this clear actually clears up memory, is a separate call to System.gc() necessary?).
One idea I had was to change the HashMap<String, String> to a HashMap<Integer, String> and use the hashCode of the String as the key. This will end up saving a bit of memory but risks issues with collisions if two strings have identical hash codes ... how likely is this for strings that are less than 20 characters long?
Does anyone else have any ideas on what to do here? The CSV file itself is only 100 MB, but java ends up using over 600MB in memory for this HashMap.
Thanks!
It sounds like you have the framework to try this already. Instead of adding the string, add the string.hashCode() and see if you get collisions.
In terms of freeing up memory, the JVM generally doesn't get smaller, but it will garbage collect if it needs to.
Also, it sounds like you might have an algorithm that doesn't need the hash table at all. Could you describe what you're trying to do in a little more detail?
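The collision check suggested above can be sketched as follows; HashCollisionCheck is a made-up name. As a back-of-the-envelope figure, if hash codes were uniformly distributed over 2^32 values, the birthday estimate n^2 / 2^33 would give roughly 470 colliding pairs for n = 2 million keys, so at this scale some collisions should be expected. Short strings can collide too: "Aa" and "BB" both hash to 2112.

```java
import java.util.HashMap;
import java.util.Map;

final class HashCollisionCheck {

    /** Counts keys whose hashCode() equals that of an earlier, different key. */
    static int countCollisions(Iterable<String> keys) {
        // Remembers the first key seen for each hash code.
        Map<Integer, String> firstSeen = new HashMap<>();
        int collisions = 0;
        for (String key : keys) {
            String previous = firstSeen.putIfAbsent(key.hashCode(), key);
            if (previous != null && !previous.equals(key)) {
                collisions++;
            }
        }
        return collisions;
    }
}
```

Running this over the keys from the CSV file gives a direct answer for the actual data. Any result above zero means a HashMap keyed by hashCode would silently overwrite or mix up entries.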
Parse the CSV, and build a Map whose keys are your existing keys, but values are Integer pointers to locations in the files for that key.
When you want the value for a key, find the index in the map, then use a RandomAccessFile to read that line from the file. Keep the RandomAccessFile open during processing, then close it when done.
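A minimal sketch of this offset-index approach is shown below. It assumes a simple "key,value" line format with ASCII content and no quoted commas, and it stores offsets as Long so files over 2 GB are addressable; the class name CsvOffsetIndex is made up. RandomAccessFile.readLine() is unbuffered, so building the index this way is slow for large files and a buffered reader that counts bytes would be faster.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/** Maps each key to the byte offset of its line; values stay on disk. */
final class CsvOffsetIndex implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>();

    CsvOffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long lineStart = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                offsets.put(line.substring(0, comma), lineStart);
            }
            lineStart = file.getFilePointer(); // start of the next line
        }
    }

    /** Seeks to the key's line and returns the text after the first comma, or null. */
    String valueFor(String key) throws IOException {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        String line = file.readLine();
        return line.substring(line.indexOf(',') + 1);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The keys still live on the heap, so the saving comes from the values (about 50 characters each in the question) staying on disk. Each lookup costs one seek and one line read.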
What you are trying to do is exactly a JOIN operation. Consider an in-memory DB like H2: you can achieve this by loading both CSV files into temp tables and then doing a JOIN over them.
In my experience H2 handles load operations well, and this code will certainly be faster and less memory intensive than your manual HashMap-based joining method.
If performance isn't the primary concern, store the entries in a database instead. Then memory isn't a concern, and you have good, if not great, search speed thanks to the database.
