Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I would like to store key-value pairs, where key is an integer and values are ArrayLists of Strings.
I cannot use a database because I have to use code to solve a problem online for a particular contest.
For small amounts of data I am able to work with hashtables without any problem.
But when my data becomes big I run out of heap size. I can not change the heapsize as I have to upload just the code and I cannot provide a working environment.
That is the challenge.
If the strings are repeated often, have natural language frequences, do not use new object instances for the same string.
private Map<String, String> sharedStrings = new HashMap<>().
public void shareString(String s) {
String t = sharedStrings.get(s);
if (t == null) {
t = s;
sharedStrings.put(t, t);
}
return t;
}
A numbering of the strings probably is too slow.
Packing the list of strings in a single one (separator some control character),
and possibly Gzipping the String (GZipOutputStream, GZipInputStream).
Tune the hash map with a sufficient initial capacity. (Sorry if I state the obvious.)
Do your own allocation of all ArrayLists, using huge large String[]:
int count;
String[] allStrings = new String[999999];
Map<Integer, Long> map = new HashMap<>(9999);
void put(int key, List<String> strings) {
int start = count;
for (String s : strings) {
allStrings[count] = s;
++count;
}
// high: start index, low: size
long listDescriptor = (((long)start) << 32) | (count - start);
map.put(key, listDescriptor);
}
There are map implementations using primitives like int and long; the trove library for instance (did not use it myself).
Using a simple array instead of ArrayList may save some additional memory (but not much).
If search performance is not a priority, you may use a Pair<Integer, List<>> and do the search manually.
If the range of integers is limited, just instantiate an array of List[integer_range] and use the array index as key.
Since you are using Strings, you may try to intern() them and make sure there are no repeating values.
Let us know what statistical information about the data you have - what are the keys, whether the values repeat themselves, etc.
Some ideas
If you can write to a file store the data there. You could maybe keep the keys in a set in memory for faster lookup and just write out the values - either to a single file or maybe even a file per entry.
Create your own map implementation that serializes the list of values into a String or byte[] and then compress the serialised data. You will have to deserialise on a read. Every time you do a get/put you will take a big runtime hit for this though. See http://theplateisbad.blogspot.co.uk/2011/04/java-in-memory-compression.html for an example.
Every time the map data is looked up simply calculate the list values every time instead of storing them - if you can.
One possible optimization might be ArrayList.trimToSize which reduces the storage used by ArrayList to minimum.
You could store the ArrayList in serialized (maybe even compressed) ByteBuffers. When you need to access a list, you would need to deserialize it, change/read it, and then store it back.
Operations would be significantly slower, but you could do some caching to keep X Arraylists in the heap and store the rest outside.
If you cannot increase the heap size then you need to limit the size of your hashtable (or any other datastructure you use). I would recommend to try the Apache LRUMap:
LRUMap
An implementation of a Map which has a maximum size and uses a Least Recently Used algorithm to remove items from the Map when
the maximum size is reached and new items are added.
And if you really need a synchronized version then that is also available:
A synchronized version can be obtained with:
Collections.synchronizedMap( theMapToSynchronize ) If it will be
accessed by multiple threads, you must synchronize access to this
Map. Even concurrent get(Object) operations produce indeterminate
behaviour.
And if you don't want to loose using LRU, the data then you need to write an algorithm to keep some data in your datastructer and rest in the persistent storage such as file, etc.
Related
I am reading data from a file of students where each line is a student, I am then turning that data into a student object and I want to return an array of student objects. I am currently doing this by storing each student object in an arraylist then returning it as a standard Student[]. Is it better to use an arraylist to have a dynamic size array then turn it into a standard array for the return or should I first count the number of lines in the file, make a Student[] of that size then just populate that array. Or is there a better way entirely to do this.
Here is the code if it helps:
public Student[] readStudents() {
String[] lineData;
ArrayList<Student> students = new ArrayList<>();
while (scanner.hasNextLine()) {
lineData = scanner.nextLine().split(" ");
students.add(new Student(lineData));
}
return students.toArray(new Student[students.size()]);
}
Which is better depends on what you need and your data set size. Needs could be - simplest code, fastest load, least memory usage, fast iteration over resultind data set... Options could be
For one-off script or small data sets (tens of thousands of elements) probably anything would do.
Maybe do not store elements at all, and process them as you read them? - least memory used, good for very large data sets.
Use pre-allocated array - if you know data set size in advance - guaranteed least memory allocations - but counting elements itself might be expensive.
If unsure - use ArrayList to collect elements. It would work most efficiently if you can estimate upper bound of your data set size in advance, say you know that normally there is not more than 5000 elements. In that case create ArrayList with 5000 elements. It will resize itself if backing array is full.
LinkedList - probably the most conservative - it allocates space as you go but required memory per element is larger and iteration is slower than for arrays or ArrayLists.
Your own data structure optimized for your needs. Usually the effort is not worth it, so use this option only when you already know the problem you want to solve.
Note on ArrayList: it starts with pre-allocating an array with set of slots which are filled afterwards without memory re allocation. As long as backing array is full a new larger one is allocated and all elements are moved into it. New array size is by default twice the size of previous - normally this is not a problem but can cause out of memory if new one cannot get enough contiguous memory block.
Use an array for a fixed size array. For students that is not the case, so an ArrayList is more suited, as you saw on reading. A conversion from ArrayList to array is superfluous.
Then, use the most general type, here the List interface. The implementation, ArrayList or LinkedList then is a technical implementation question. You might later change an other implementation with an other runtime behavior.
But your code can handle all kinds of Lists which is really a powerful generalisation.
Here an incomplete list of useful interfaces with some implementations
List - ArrayList (fast, a tiny bit memory overhead), LinkedList
Set - HashSet (fast), TreeSet (is a SortedSet)
Map - HashMap (fast), TreeMap (is a SortedMap), LinkedHashMap (order of inserts)
So:
public List<Student> readStudents() {
List<Student> students = new ArrayList<>();
while (scanner.hasNextLine()) {
String[] lineData = scanner.nextLine().split(" ");
students.add(new Student(lineData));
}
return students;
}
In a code review one would comment on the constructor Student(String[] lineData) which risks a future change in data.
I have a very large file (10^8 lines) with counts of events as follows,
A 10
B 11
C 23
A 11
I need to accumulate the counts for each event, so that my map contains
A 21
B 11
C 23
My current approach:
Read the lines, maintain a map, and update the counts in the map as follows
updateCount(Map<String, Long> countMap, String key,
Long c) {
if (countMap.containsKey(key)) {
Long val = countMap.get(key);
countMap.put(key, val + c);
} else {
countMap.put(key, c);
}
}
Currently this is the slowest part of the code, (takes around 25 ms).
Note that the map is based on MapDB, but I doubt that updates are slow due to that (are they?)
This is the mapdb configs for the map,
DBMaker.newFileDB(dbFile).freeSpaceReclaimQ(3)
.mmapFileEnablePartial()
.transactionDisable()
.cacheLRUEnable()
.closeOnJvmShutdown();
Are there ways to speed this up?
EDIT:
The number of unique keys is of the order of the pages in wikipedia. The data is actually page traffic data from here.
You might try
class Counter {
long count;
}
void updateCount(Map<String, Counter> countMap, String key, int c) {
Counter counter = countMap.get(key);
if (counter == null) {
counter = new Counter();
countMap.put(key, counter);
counter.count = c;
} else {
counter.count += c;
}
}
This does not create many Long wrappers, but just allocates Counters the number of keys.
Note: do not create Long's. Above I made c an int to not oversee long/Long.
As a starting point, I'd suggest thinking about:
What is yardstick by which you're saying that 25ms is actually an unreasonable amount of time for the amount of data involved and for a generic map implementation? if you quantify that, it might help you work out if there is anything wrong.
How much time is being spent re-hashing the map versus other operations (e.g. calculation of hash codes on each put)?
What do your "events" as you call them consist of? How many unique events-- and hence unique keys-- are there? How are keys to the map being generated, and is there a more efficient way to do so? (In a standard hash map, for example, you create additional objects for each association, and actually store the key objects increasing the memory footprint.)
Depending on the answers to the previous, you could potentially roll a more efficient map structure yourself (see this example that you might be able to adapt). Essentially, you need to look specifically at what is taking the time (e.g. hash code calculation per put / cost of rehashing) and try and optimise that part.
If you are using a TreeMap, there are performance tuning options like
The number of entries in each node.
You could also use specific key and value serializer that will speed up the serialization and de-serilization.
You could use Pump mode to build the tree, which is very very fast. But one caveat is that this is useful when you are building a new map from scratch. You can find the full example here
https://github.com/jankotek/MapDB/blob/master/src/test/java/examples/Huge_Insert.java
I am doing some algorithm problems in Java, and from time to time the problem needs memoization to optimize speed. And often times, the key is an array. What I usually uses is
HashMap<ArrayList<Integer>, Integer> mem;
The main reason here to use ArrayList<Integer> instead of int[] is that the hashCode() of an primitive array is calculated based on the reference, but for ArrayList<Integer> the value of the actual array is compared, which is desired behavior.
However, it is not very efficient and code can be pretty lengthy as well. So I am wondering if there is any best practice for this kind of memoization in Java? Thanks.
UPDATE: As many have pointed this out in the comments: it is a very bad idea to use mutable objects as the key of a HashMap, which I totally agree.
And I am going to clarify the question a little bit more: when I use this type of memoization, I will NOT change the ArrayList<Integer> once it is inserted to the map. Normally the array represents some status, and I need to cache the corresponding value for that status in case it is visited again.
So please do not focus on how bad it is to use a mutable object as the key to a HashMap. Do suggest some better way to do this kind of memoization please.
UPDATE2: So at last I choose the Arrays.toString() approach since I am doing algorithm problems on TopCoder/Codeforces, and it is just dirty and fast to code.
However, I do think HashMap is the more reasonable and readable way to do this.
You can create a new class - Key, put an array with some numbers as a field and implement your own hascode() based on the contents of the array.
It will improve the readability as well:
HashMap<Key, Integer> mem;
If your ArrayList contains usually 3-4 elements,
I would not worry much about performance. Your approach is OK.
But as others pointed out, your key is thus mutable which is
a bad idea.
Another approach is to append all elements of the ArrayList
together using some separator (say #) and thus have this kind
of string for key: 123#555#66678 instead of an ArrayList of
these 3 integers. You can just call Arrays.toString(int[])
and get a decent string key out of an array of integers.
I would choose the second approach.
If the input array is large, the main problem seems to be the efficiency of lookup. On the other hand, your computation is probably much more expensive than that, so you've got same CPU cycles to spare.
Lookup time will depend both on the hashcode calculation and on the brute-force equals needed to pinpoint the key in a hash bucket. This is why the array as a key is out of the question.
The suggestion already given by user:XpressOneUp, creating a class which wraps the array and provides its custom hash code, seems like your best bet and you can optimize hashcode calculation to involve only some array elements. You'll know best which elements are the most salient.
If the values in the array are small integer than here is way to do it efficiently :-
HashMap<String,Integer> Map
public String encode(ArrayList arr) {
String key = "";
for(int i=0;i<arr.size();i++) {
key = key + arr.get(i) + ",";
}
return(key);
}
Use the encode method to convert your array to unique string use to add and lookup the values in HashMap
I'm looking for a way to store a string->int mapping. A HashMap is, of course, a most obvious solution, but as I'm memory constrained and need to store 2 million pairs, 7 characters long keys, I need something that's memory efficient, the retrieval speed is a secondary parameter.
Currently I'm going along the line of:
List<Tuple<String, int>> list = new ArrayList<Tuple<String, int>>();
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.
Edit: I just saw you mentioned the String were UK postcodes so I'm fairly confident you couldn't get very wrong by using a Trove TLongIntHashMap (btw Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and, that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about memory and speed of Trove compared to the default Collections. Here's a snippet:
100000 put operations 100000 contains operations
java collections 1938 ms 203 ms
trove 234 ms 125 ms
pcj 516 ms 94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections oscillates between 6644536 and 7168840 bytes
trove 1853296 bytes
pcj 1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, it's pretty obvious that Trove will save not only memory but will always be much faster.
So our goal now becomes to use Trove (seen that by putting millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7 characters long keys and a String/int mapping.
2 million is really not that much but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap{String,Integer} which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using say only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge altogether objects creation and represent your 7 characters on a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of keys/values pair into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) say you only have at most 256 codepoints/characters used, then it fits on 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into long's (assuming ASCII characters, one byte per character for simplification - 7 bits would be enough):
long encode(final String key) {
final int length = key.length();
if (length > 8) {
throw new IndexOutOfBoundsException(
"key is longer than 8 characters");
}
long result = 0;
for (int i = 0; i < length; i++) {
result += ((long) ((byte) key.charAt(i))) << i * 8;
}
return result;
}
Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.
First of, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time of an element is O(n), so you cannot do efficient binary search on it. If you want to do such approach, you should use an ArrayList, which should give you the beast compromise between performance and space. However, again, I doubt that a HashMap, HashTable or - in particular - a TreeMap would consume that much more memory, but the first two would provide constant access and the tree map logarithmic and provide a nicer interface that a normal list. I would try to do some measurements, how much the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific for strings, such as tries (especially patricia tries), which might reduce the storage space needed for the strings.
What you are looking for is a succinct-trie - a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie classes libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meanwhile, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.
Have you looked at tries. I've not used them but they may fit with what you're doing.
A custom tree would have the same complexity of O(log n), don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.
As Erick writes using the Trove library is a good place to start as you save space in storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
You could write the map to a file and then memory map a portion or all of the file.
You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
maybe you can go with a RadixTree?
Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red black binary search tree and doesn't use more memory than what is required for holding notes containing the elements in the map. No extra buckets, unlike HashMap or Hashtable.
I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.
I'd consider using some cache as these often have the overflow-to-disk ability.
You might create a key class that matches your needs. Perhaps like this:
public class MyKey implements Comparable<MyKey>
{
char[7] keyValue;
public MyKey(String keyValue)
{
... load this.keyValue from the String keyValue.
}
public int compareTo(MyKey rhs)
{
... blah
}
public boolean equals(Object rhs)
{
... blah
}
public int hashCode()
{
... blah
}
}
try this one
OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));
public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
public boolean containsValue(Object value) {
if(value != null)
{
Class<? extends Object> aClass = value.getClass();
if(aClass.isArray())
{
Collection values = this.values();
for(Object val : values)
{
int[] newval = (int[]) val;
int[] newvalue = (int[]) value;
if(newval[0] == newvalue[0])
{
return true;
}
}
}
}
return false;
}
Actually HashMap and List are too general for such specific task as a lookup of int by zipcode. You should use advantage of knowledge which data is used. One of the options is to use a prefix tree with leaves that stores the int value. Also, it could be pruned if (my guess) a lot of codes with same prefixes map to the same integer.
Lookup of the int by zipcode will be linear in such tree and will not grow if number of codes is increased, compare to O(log(N)) in case of binary search.
Since you are intending to use hashing, you can try numerical conversions of the strings based on ASCII values.
the simplest idea will be
int sum=0;
for(int i=0;i<arr.length;i++){
sum+=(int)arr[i];
}
hash "sum" using a well defined hash functions. You would use a hash function based on the expected input patterns.
e.g. if you use division method
public int hasher(int sum){
return sum%(a prime number);
}
selecting a prime number which is not close to an exact power of two improves performances and gives better uniformly hashed distribution of keys.
another method is to weigh the characters based on their respective position.
e.g: if you use the above method, both "abc" and "cab" will be hashed into a same location. but if you need them to be stored in two distinct location give weights for locations like we use the number systems.
int sum=0;
int weight=1;
for(int i=0;i<arr.length;i++){
sum+= (int)arr[i]*weight;
weight=weight*2; // using powers of 2 gives better results. (you know why :))
}
As your sample is quite large, you'd avoid collisions by a chaining mechanism rather than using a probe sequence.
After all,What method you would choose totally depends on the nature of your application.
The problem is objects' memory overhead, but using some tricks you can try to implement your own hashset. Something like this. Like others said strings have quite large overhead so you need to "compress" it somehow. Also try not to use too many arrays(lists) in hashtable (if you do chaining type hashtable) as they are also objects and also have overhead. Better yet do open addressing hashtable.
I'm having this problem: I'm reading 900 files and, after processing the files, my final output will be an HashMap<String, <HashMap<String, Double>>. First string is fileName, second string is word and the double is word frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm using string BufferedReader. The problem is, after processing the first files, the Hash becomes so big that the performance is very low after a while. I would like to hear solution for this. My idea is to create a limited hash, after the limit reached store into a file. do that until everything is processed, mix all the hashs at the end.
Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
By the way, why are you using double for the frequency? Surely it should be an integer value...
The time for a hash map to process shouldn't increase significantly as it grows. It is possible that your map is skewing because of an unsuited hashing function or filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with a lots of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent with the JVM kicking off the garbage collector which takes a while and finds almost every object has a reference.
I suggest profiling your application, and if it is the garbage collector, then increasing heap space and tuning the garbage collector. Also, it will help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see initialCapacity and loadFactor options in the constructor).
I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use Multimap rather then Map<String, Map<String, Integer>>
Multimap<word, frequency, fileName, .some thing else tomorrow>
Now, read one file, construct the Multimap and save it on disk. (similar to Jon's answer)
After reading x files, merge all the Multimaps together: putAll(multimap) if you really need one common map of all the values.
You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the java collections api, but for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
Here is an article that will help you with some more inputs.
http://www.javaspecialists.eu/archive/Issue193.html
Why not use a custom class,
public class CustomData {
private String word;
private double frequency;
//Setters and Getters
}
and use your map as
Map<fileName, List<CustomData>>
this way atleast you will have only 900 keys in your map.
-Ivar