Memory-efficient sparse array in Java

(There are some questions about time-efficient sparse arrays but I am looking for memory efficiency.)
I need the equivalent of a List<T> or Map<Integer,T> which
Can grow on demand just by setting a key larger than any encountered before. (Can assume keys are nonnegative.)
Is about as memory-efficient as an ArrayList<T> in the case that most of the indices are not null, i.e. when the actual data is not very sparse.
When the indices are sparse, consumes space proportional to the number of non-null indices.
Uses less memory than HashMap<Integer,T> (as this autoboxes the keys and probably does not take advantage of the scalar key type).
Can get or set an element in amortized log(N) time where N is the number of entries; it need not be constant time, binary search would be acceptable.
Implemented in a nonviral open-source pure Java library (preferably in Maven Central).
Does anyone know of such a utility class?
I would have expected Commons Collections to have one but it did not seem to.
I came across org.apache.commons.math.util.OpenIntToFieldHashMap which looks almost right except the value type is a FieldElement which seems gratuitous; I just want T extends Object. It looks like it would be easy to edit its source code to be more generic, though I would rather use a binary dependency if one is available.

I would try the Trove collections; its TIntObjectMap should work for your purposes.

I would look at Android's SparseArray implementation for inspiration. You can view the source by downloading AOSP's source code here http://source.android.com/source/downloading.html

I would suggest using OpenIntObjectHashMap from the Colt library. Link

I have saved my test case as jglick/inthashmap. The results:
HashMap size: 1017504
TIntObjectMap size: 853216
IntHashMap size: 846984
OpenIntObjectHashMap size: 760472
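For illustration, here is roughly how one of those maps can be used as a sparse array. This is only a sketch assuming Trove 3.x on the classpath (gnu.trove.map.hash.TIntObjectHashMap); it is not part of the benchmark above:

import gnu.trove.map.hash.TIntObjectHashMap;

public class SparseArrayDemo {
    public static void main(String[] args) {
        // Keys are plain ints, so there is no Integer autoboxing on the key side.
        TIntObjectHashMap<String> sparse = new TIntObjectHashMap<>();

        // "Grows" on demand simply by putting a larger key; unset indices cost nothing.
        sparse.put(3, "three");
        sparse.put(1_000_000, "a million");

        System.out.println(sparse.get(3));   // three
        System.out.println(sparse.get(42));  // null (absent index)
        System.out.println(sparse.size());   // 2 entries, regardless of the key range
    }
}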

Late to this question, but there is IntMap in libgdx which uses cuckoo hashing. If anything it would be interesting to compare with the others.


What is a fast alternative to HashMap for mapping to primitive types?

First of all, let me say that I have read the questions asked before about this (Java HashMap performance optimization / alternative) and I have a similar question.
What I want to do is take a LOT of dependencies from New York Times text, processed by the Stanford parser, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I will increment its score in the HashMap by 1.
The task starts off really quickly, about 10 sentences a second, but slows down quickly. At 30,000 sentences (assuming 10 words in each sentence and about 3-4 dependencies for each word, which I'm storing) there are about 300,000 entries in my HashMap.
How will I be able to increase the performance of my HashMap? What kind of hash key can I use?
Thanks a lot
Martinos
EDIT 1:
OK guys, maybe I phrased my question wrongly: the byte arrays are not used in MY project but in the similar question of another person above. I don't know what they are using them for, hence why I asked.
Secondly, I will not post the code, as I think it would make things very hard to understand, but here is a sample:
With the sentence "i am going to bed" I have the dependencies:
(i , am , -1)
(i, going, -2)
(i,to,-3)
(am, going, -1)
.
.
.
(to,bed,-1)
These dependencies for all sentences (1,000,000 sentences) will be stored in a HashMap.
If I see a dependency twice I will get the score of the existing dependency and add 1.
And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving them) scales down on this line:
dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);
Can anyone tell me why?
Regards
Martinos
Trove has optimized hashmaps for the case where key or value are of primitive type.
However, much will still depend on smart choice of structure and hash code for your keys.
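As a sketch of how that might look for this use case, assuming Trove 3.x: TObjectIntHashMap keeps the int scores unboxed, and adjustOrPutValue does the get-or-insert-then-increment in one call. A String stands in here for your dependency class, which would need proper equals and hashCode:

import gnu.trove.map.hash.TObjectIntHashMap;

public class DependencyCounter {
    public static void main(String[] args) {
        // Values are primitive ints, so no Integer boxing on each update.
        TObjectIntHashMap<String> scores = new TObjectIntHashMap<>();

        String[] seen = { "(i,am,-1)", "(i,going,-2)", "(i,am,-1)" };
        for (String dep : seen) {
            // If the key exists, add 1 to its score; otherwise insert it with score 1.
            scores.adjustOrPutValue(dep, 1, 1);
        }

        System.out.println(scores.get("(i,am,-1)")); // 2
    }
}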
This part of your question is unclear: "The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences ... is about 300 000 entries in my hashmap." You don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hashmaps are O(1) only in theory; in practice you will see some performance changes with size, due to less cache locality and due to occasional jumps caused by rehashing. So put() and get() times will not be constant, but they should still be close to constant. Perhaps you are using the hashmap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size and you can't change that unless you change your algorithm.
Google 'fastutil' and you will find a superior solution for mapping object keys to scores.
Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps They are designed to basically keep a list of things that all map to the same key. That might solve your need.
How will I be able to increase the performance of my hashmap?
If it's taking more than 1 microsecond per get() or put(), you have a bug IMHO. You need to determine why it's taking as long as it is. Even in the worst case, where every object has the same hashCode, you won't have performance this bad.
What kind of hash key can I use?
That depends on the data type of the key. What is it?
And finally, what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?
They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
HashMap has an overloaded constructor which takes an initial capacity as input. The slowdown you see is because of rehashing, during which the HashMap is virtually unusable. To prevent frequent rehashing you need to start with a HashMap of greater initial capacity. You can also set a load factor, which indicates what fraction of the table may be filled before rehashing.
public HashMap(int initialCapacity)
Pass the initial capacity to the HashMap during object construction. It is preferable to set the capacity to almost twice the number of elements you expect to add to the map during the course of execution of your program.
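A minimal sketch of both points together: presizing the map and doing the count update in one call via merge() (Java 8+). The expected-entries figure is just a placeholder:

import java.util.HashMap;
import java.util.Map;

public class PresizedCounter {
    public static void main(String[] args) {
        // Presize so the map does not rehash repeatedly while it grows.
        int expectedEntries = 1_000_000;
        Map<String, Integer> dependencyBank = new HashMap<>(expectedEntries * 2);

        String[] seen = { "(i,am,-1)", "(to,bed,-1)", "(i,am,-1)" };
        for (String dep : seen) {
            // Insert with count 1, or add 1 to the existing count: one lookup instead of get() + put().
            dependencyBank.merge(dep, 1, Integer::sum);
        }

        System.out.println(dependencyBank.get("(i,am,-1)")); // 2
    }
}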

How to store table or matrix in Java?

I used to use matrices in Octave to store data from a data set; how can I do that in Java? Assume I have 10-20 columns and a lot of data. I don't think
int[][] data;
would be the best option. Is a nested map the only solution?
You could create a class Coordinate that takes an X and Y values and properly implement hashCode and equals.
Then create a HashMap<Coordinate, Data> and work with it.
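For example, a sketch of that approach; Coordinate is written out below, and Double stands in for whatever value type your Data is:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CoordinateMapDemo {
    // Immutable key with equals/hashCode so HashMap lookups work correctly.
    static final class Coordinate {
        final int x, y;
        Coordinate(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Coordinate)) return false;
            Coordinate c = (Coordinate) o;
            return x == c.x && y == c.y;
        }
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    public static void main(String[] args) {
        Map<Coordinate, Double> data = new HashMap<>();
        data.put(new Coordinate(2, 3), 1.5);                 // set "cell" (2,3)
        System.out.println(data.get(new Coordinate(2, 3)));  // 1.5
        System.out.println(data.get(new Coordinate(0, 0)));  // null: an empty cell costs nothing
    }
}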
It depends on what you need to do. If you know the size of the lists, then an array is definitely ideal, since it gives you instant access (read/write time) to any position; this is very useful for speed.
Maps are better if you don't know the size and it needs to be able to adapt.
And finally, as I discovered in a previous question, if you have a TON of data and a lot of it will be "0", you might want to also consider using a Sparse Matrix.
This answer merges parts of gnomed's and SJuan76's answers.
At a quick glance, I'd suggest using two-dimensional arrays such as int[][].
It's not a very huge amount of data (we're speaking of ≈500 ints), so it's not a bad idea.
Advantages: it's the simplest, ideal (from the data-structuring side) way to go,
especially if every “slot” of the matrix contains data.
The drawback: you have to know the size of the matrix before constructing it.
Even so, you can resize it later using the Arrays utilities (see the sketch after this answer).
If you want more effective handling of the data, you can use a single point map.
That is, the key of every entry is a java.awt.Point that defines where the value is located.
Advantages: it's more effective than a 2D array,
especially if part of your matrix doesn't contain data.
And it's adaptive; you don't need to know any sizes to construct or resize it.
The drawback: if every “slot” of your matrix contains data,
you'll lose (a lot of) space and performance; a 2D array is more effective in that case.
Want more? If your data is really huge you can use a sparse matrix.
See this question for more details.
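As a sketch of the "resize it later" remark above: rows of an int[][] can be copied into a larger array with java.util.Arrays.copyOf (this allocates and copies, so it is not free):

import java.util.Arrays;

public class ResizeMatrixDemo {
    // Returns a newRows x newCols copy of m; extra cells are left at 0.
    static int[][] resize(int[][] m, int newRows, int newCols) {
        int[][] out = new int[newRows][];
        for (int r = 0; r < newRows; r++) {
            // Copy the existing row (truncated or zero-padded), or start an empty one.
            out[r] = (r < m.length) ? Arrays.copyOf(m[r], newCols) : new int[newCols];
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] data = { {1, 2}, {3, 4} };
        int[][] bigger = resize(data, 3, 4);
        System.out.println(Arrays.deepToString(bigger)); // [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 0, 0]]
    }
}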
I would not discard multidimensional arrays so far: have you tried them? Are you finding specific limitations? IMHO as long as your data fits in memory, arrays can be good.
If your data is very sparse though, you may want to look at maps indeed.
Related question btw: Making a very large Java array
You can use multidimensional arrays, or you can try any key-value structure like a HashMap.
I think multi-dimensional arrays are the best choice! They should serve your purpose. If your data set is only integers, int[][] is an ideal choice.
Well, if your indices are small integers, you can certainly use nested arrays.
In a matrix class, you may want to use a flat one-dimensional array, like so (assuming n is the number of columns):
double get(int i, int j) { return data[i*n + j]; }
For a general table (sparse matrix), you can use nested maps, but consider using com.google.common.collect.Table implementations from the Google Guava library.
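A sketch of the Guava option, assuming Guava is on the classpath (HashBasedTable is one of the Table implementations):

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

public class GuavaTableDemo {
    public static void main(String[] args) {
        // Row key, column key, value: a sparse table that grows as you put cells.
        Table<Integer, Integer, Double> table = HashBasedTable.create();
        table.put(0, 5, 3.14);
        table.put(10, 2, 2.71);

        System.out.println(table.get(0, 5)); // 3.14
        System.out.println(table.get(1, 1)); // null: unset cell
        System.out.println(table.row(0));    // {5=3.14}: a map view of row 0
    }
}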

How to treat file contents as String

I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (from a txt file) into a data structure (Set, List, etc.), is there any built-in Java class that can help me treat the contents of the file as a String?
Specifically, what I want to do is check whether a word made in the game is a valid word of the dictionary, by doing something simple like fileName.contains(word) instead of having a huge list that is memory-inefficient and using list.contains(word).
Do you guys have any idea of what I may be able to do? If the dictionary file has to be in something other than a txt file (e.g. an xml file), I am open to trying that as well.
NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29
This method is not a part of the java API.
HashSet didn't come to mind; I was stuck on the idea that all contains() methods took O(n) time. Thanks to Bozho for clearing that up with me; looks like I will be using a HashSet.
I think your best option is to load them all into memory, in a HashSet. There, contains(word) is O(1).
If you are fine with having it in memory, having it as a String on which to call contains(..) is much less efficient than a HashSet.
And I have to mention another option: there's a data structure to represent dictionaries called a trie. You can't find an implementation in the JDK though.
A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM, which is a few times less than the default memory settings of the JVM. (1 million * 6 letters on average * 2 bytes per letter = 12 million bytes, which is ~12 megabytes.) (Well, perhaps a bit more to store hashes.)
If you really insist on not reading it into memory and you want to scan the file for a given word, you can use a java.util.Scanner and its scanner.findWithinHorizon(..). But that would be inefficient - I assume O(n), plus the I/O overhead.
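A sketch of the in-memory approach, assuming one word per line in a plain-text file; the words.txt file name is just a placeholder:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DictionaryDemo {
    public static void main(String[] args) throws IOException {
        // Load every line of the dictionary file into a HashSet once, at startup.
        Set<String> dictionary = new HashSet<>(
                Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8));

        // Each lookup is then an O(1) hash probe, not a file scan.
        System.out.println(dictionary.contains("scrabble"));
    }
}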
While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data-structures that can be used including a Trie or Heap.
The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.
All of these data-structures (and more) can be implemented to be file-based (use more IO per lookup -- which could actually be less overall -- but save memory) or implemented directly (e.g. use SQLite and let it do its B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification are easy, and "it just works". SQLite is even used in less powerful systems such as Android.
HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:
Dictionary = 5MB.
Loaded into HashSet (assuming lots of overhead) = 20MB.
Memory usage in relation to other things = Minimal (assumes laptop/desktop)
Time to implement with HashSet = 2 Minutes.
I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)
Happy coding.
Links to random data-structure implementations (may or may not be suitable):
TernarySearchTrie Reads in a flat file (must be specially constructed?)
TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
FileHash Hash which uses a file backing.
HashStore Another disk-based hash
WB B-Tree Simple B-tree implementation / "database"
SQLite Small embedded RDBMS.
UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding which is minimum of two bytes/character.)
You need to compress your data to avoid having to store all those words. The way to do so would be a tree in which nodes are letters and leaves mark the end of a word. This way you're not storing repetitive data: for example "the", "there" and "these" all share the same prefix.
There is a way to make this solution even more memory-efficient. (Hint: letter order)
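For illustration, a minimal sketch of such a letter tree (a trie) for lowercase a-z words; a real implementation would add more (deletion, iteration, and the hinted optimization, which this sketch does not include):

public class TrieDemo {
    static final class Node {
        final Node[] children = new Node[26]; // one slot per lowercase letter
        boolean endOfWord;                    // true if a word ends at this node
    }

    private final Node root = new Node();

    // Insert a lowercase word, creating nodes only for letters not already shared.
    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (n.children[i] == null) n.children[i] = new Node();
            n = n.children[i];
        }
        n.endOfWord = true;
    }

    boolean contains(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children[c - 'a'];
            if (n == null) return false;
        }
        return n.endOfWord;
    }

    public static void main(String[] args) {
        TrieDemo trie = new TrieDemo();
        trie.add("the");
        trie.add("there");
        System.out.println(trie.contains("there")); // true
        System.out.println(trie.contains("these")); // false
    }
}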
Use the readLine() of java.io.BufferedReader. That returns a String.
String line = new BufferedReader(new FileReader(file)).readLine();

Programmatic approach in Java for file comparison

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the file's (exe) hex representation into individual groups of N chars (e.g. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file has X% of similarity to the known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset will be graded statistically.
As you can realize, there would be a massive number of comparisons and hence it would be very, very slow. So I thought to ask whether you guys can think of a better approach to do such a comparison, for example implementing different data structures together.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version, while having apparently different structures (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you gave me about what reading to do, I did that, but it somewhat confused me more. I found several distance algorithms that can apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue to do research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it more deeply?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
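For reference, one of those measures is straightforward to sketch: a plain dynamic-programming Levenshtein distance over two signature chunks. This is an illustrative sketch, not the code actually used in the paper; the example chunks are made up:

public class LevenshteinDemo {
    // Classic O(a.length * b.length) edit distance: insertions, deletions, substitutions.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Two 10-char hex chunks differing in one position.
        System.out.println(levenshtein("4D5A900003", "4D5A900103")); // 1
    }
}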
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, when considering polymorphic malware, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximative searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
A number of papers have been published on finding near-duplicate documents in a large corpus in the context of web search. I think you will find them useful. For example, see
this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity with known string-matching and bioinformatics problems might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but one more efficient than Smith-Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they can fit your needs. Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating parameters based on your application domain, possibly based on a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of two genomes to compare). It is also possible, though, that even at default settings one of these algorithms might produce usable results. Best of all would be to have a generative model of how viruses change, which could guide you in an optimal choice of distance and comparison algorithm.

Searching a list of tens or a few hundred short text strings, sorting by relevance

I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out of the box, or using some tiny library that just implements one or two matching algorithms. (In other words, without bringing in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: It seems like Levenshtein is closest to what I need from what has been advised. Only it gets easily fooled when the search query is "John" and the names in the list are significantly longer.
You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman etc. Look up SimMetrics - a F/OSS library that offers a very comprehensive set of string comparison algorithms.
If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
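For example, Apache Commons Codec ships a Soundex implementation whose difference() method gives a rough 0-4 "how much" score. A sketch, assuming commons-codec is on the classpath:

import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) throws EncoderException {
        Soundex soundex = new Soundex();

        // Phonetically similar names map to the same code.
        System.out.println(soundex.soundex("Robert")); // R163
        System.out.println(soundex.soundex("Rupert")); // R163

        // difference() returns 0 (no match) .. 4 (strong match).
        System.out.println(soundex.difference("Robert", "Rupert"));   // 4
        System.out.println(soundex.difference("Robert", "Ashcraft")); // a lower score
    }
}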
Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
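A sketch of using that class from Apache Commons Codec (assuming commons-codec is on the classpath):

import org.apache.commons.codec.language.DoubleMetaphone;

public class DoubleMetaphoneDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();

        // Encodes a word to its primary phonetic key; these two encode to the same key.
        System.out.println(dm.doubleMetaphone("Smith"));
        System.out.println(dm.doubleMetaphone("Smythe"));

        // Convenience check: do two words share a Double Metaphone encoding?
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Smythe")); // true
    }
}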
In my opinion, the Jaro-Winkler algorithm will suit your requirement best.
Here is a short summary of the Jaro-Winkler distance algorithm.
One of the PDFs comparing different algorithms --> Link to PDF
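As a sketch of how that scoring could look in code, assuming a recent Apache Commons Text is on the classpath (its JaroWinklerSimilarity class returns a 0.0-1.0 similarity); the names are made up:

import java.util.Comparator;
import java.util.List;
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class NameRankingDemo {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        List<String> names = List.of("John Smith", "Johnny Cash", "Jane Doe", "Don Johnson");
        String query = "John";

        // Rank the small in-memory list by similarity to the query, best match first.
        names.stream()
             .sorted(Comparator.comparingDouble((String n) -> jw.apply(query, n)).reversed())
             .forEach(n -> System.out.printf("%.3f  %s%n", jw.apply(query, n), n));
    }
}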
