I am using the Mahout API for a Naive Bayes Classifier. One of the functions is SparseVectorsFromSequenceFiles, and although I have tried the old Google search, I still do not understand what a sparse vector is.
The closest to an explanation I have is this site which didn't help me understand tbh.
Conceptually, vectors represent a generalization of arrays, i.e. data structures that allow arbitrary access to their elements using an index. Java's built-in arrays, Vector<T> and ArrayList<T> are examples of data structures implementing a "regular" (dense) vector concept.
Dense vectors provide constant-time access to their elements by translating a vector index into a memory address using a simple formula baseAddress + index * elementSize. This means that the size in memory is proportional to the largest index that the vector needs to support.
This is acceptable when the number of elements that you wish to put in a vector and the highest possible index are relatively close to each other. However, if you wish to use indexes from a wide range to index a relatively small number of elements (say, 1,000 elements scattered across a vector with 100,000 indexes), allocating 100,000 spaces is wasteful. You can save memory at the expense of CPU cycles by implementing a data structure that exposes the interface of a vector, but uses a smaller amount of memory for its internal representation.
The example at your link shows one possible implementation. Other implementations are possible, depending on the distribution of indexes in your data. If the indexes are distributed randomly, you could use a HashMap<Integer,T> as your backing storage for a sparse vector. If indexes are clustered together, you could split your index space by "pages", and allocate a real array only to pages that you need. This implementation would be similar to the way the physical memory is allocated to virtual memory space.
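To make the HashMap idea concrete, here is a minimal sketch of a sparse vector of doubles. The class name SparseVector and its methods are made up for illustration and are not Mahout's API; Mahout's own sparse vector classes may work quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector backed by a HashMap; NOT Mahout's implementation.
class SparseVector {
    private final int size;                                   // logical length of the vector
    private final Map<Integer, Double> entries = new HashMap<>();

    SparseVector(int size) {
        this.size = size;
    }

    // Returns the stored value, or 0.0 for any index that was never set.
    double get(int index) {
        checkIndex(index);
        return entries.getOrDefault(index, 0.0);
    }

    // Stores only non-zero values; writing 0.0 removes the entry again.
    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
    }

    // Logical length, independent of how many entries are stored.
    int size() {
        return size;
    }

    // Number of entries actually held in memory.
    int storedEntries() {
        return entries.size();
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + " not in 0.." + (size - 1));
        }
    }
}
```

A vector created with new SparseVector(100000) that has 1,000 values set holds only those 1,000 map entries. The price is boxing and hashing overhead per entry, which is the memory-for-CPU trade-off described above.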
Related
From my understanding, a 2 dimensional matrix that is used in mathematics can be created in Java using a 2 dimensional array. There are of course things you can do with real math matrixes such as adding and subtracting them. With Java however, you would need to write code to do that and there are libraries available that provide that functionality.
What I would like to know though is whether a Java array is even an optimal way of dealing with matrix data. I can think of cases where a 2d matrix has some of its indexes filled in while many are just left blank due to the nature of the data. For me this raises the question whether this is a waste of memory space, especially if the matrix is very large and has a lot of empty indexes.
Do specialized Java math libraries deal with matrixes differently and don't rely upon a 2d array? Or do they use a regular Java array as well and just live with wasted memory?
A few things:
Never create matrices of 2 dimensional arrays. It's always preferable to have 1 dimensional array with class accessors that take 2 parameters. The reason is purely for performance. A Java 2 dimensional array is an array of separately allocated row arrays, whereas a single contiguous chunk of memory keeps neighbouring elements next to each other, which will minimize cache misses, and hence boost performance.
Matrices with many zeros are called sparse matrices. It's always a trade-off between using sparse matrices and having many zeros in your array.
A contiguous array will allow the compiler to use vector operations, such as SIMD, for addition, subtraction, etc.
A non-contiguous array will be faster if the relative number of zeros is really high, and you implement it in a smart way.
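The first point can be sketched as follows. FlatMatrix is a hypothetical class name, and row-major order is my assumption; the answer does not say which layout to use.

```java
// Hypothetical dense matrix stored row-major in a single 1-D array.
class FlatMatrix {
    private final int rows;
    private final int cols;
    private final double[] data;

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        // multiplyExact throws instead of silently overflowing for huge sizes.
        this.data = new double[Math.multiplyExact(rows, cols)];
    }

    // Accessors take two parameters and hide the 1-D layout.
    double get(int r, int c) {
        return data[index(r, c)];
    }

    void set(int r, int c, double value) {
        data[index(r, c)] = value;
    }

    // Element-wise addition is one linear pass over the backing arrays.
    FlatMatrix plus(FlatMatrix other) {
        if (rows != other.rows || cols != other.cols) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        FlatMatrix result = new FlatMatrix(rows, cols);
        for (int i = 0; i < data.length; i++) {
            result.data[i] = data[i] + other.data[i];
        }
        return result;
    }

    private int index(int r, int c) {
        if (r < 0 || r >= rows || c < 0 || c >= cols) {
            throw new IndexOutOfBoundsException("(" + r + "," + c + ")");
        }
        return r * cols + c;
    }
}
```

Whether the JIT actually vectorizes the loop in plus depends on the JVM and the hardware, so treat any speed-up as something to measure.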
I don't believe Java provides a library for sparse matrices, but I'm sure there are some out there. I'm a C++ dev, and I came to this question because I dealt with matrices a lot during my academic life. A famous C++ library with an easy and high-level interface is Armadillo.
Hope this helps.
Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have a somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering their size the space needed to hold the references alone would be like 80 megabytes, so that would be considered a lower bound for memory consumption.
We further can assume that the structure won't be completely filled but quite densely.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep and we'd either have to accept slower queries or take rebalancing measures. Additionally the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index-maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
[x,null,x,null]
/ \
[A1/B1,null,A2/B2,null] [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys (100 + 5 + 10,000 + 2), so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
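Since the question asks for Java, here is a rough Java sketch of the same idea: one flat array plus one key-to-index map per dimension. The class name Grid4, the use of String keys, and the method names are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical flat 4-D grid: one Object[] plus one key-to-index map per dimension.
class Grid4<V> {
    private final int sizeA, sizeB, sizeC, sizeD;
    private final Object[] cells;
    // Insertion-order dictionaries: each new key receives the next free index.
    private final Map<String, Integer> aIndex = new HashMap<>();
    private final Map<String, Integer> bIndex = new HashMap<>();
    private final Map<String, Integer> cIndex = new HashMap<>();
    private final Map<String, Integer> dIndex = new HashMap<>();

    Grid4(int sizeA, int sizeB, int sizeC, int sizeD) {
        this.sizeA = sizeA;
        this.sizeB = sizeB;
        this.sizeC = sizeC;
        this.sizeD = sizeD;
        long total = (long) sizeA * sizeB * sizeC * sizeD;
        if (total > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("grid too large for a single array");
        }
        this.cells = new Object[(int) total];
    }

    void put(String a, String b, String c, String d, V value) {
        int ia = assign(aIndex, a, sizeA);
        int ib = assign(bIndex, b, sizeB);
        int ic = assign(cIndex, c, sizeC);
        int id = assign(dIndex, d, sizeD);
        cells[offset(ia, ib, ic, id)] = value;
    }

    // Returns null if any key is unknown or the cell was never filled.
    @SuppressWarnings("unchecked")
    V get(String a, String b, String c, String d) {
        Integer ia = aIndex.get(a), ib = bIndex.get(b), ic = cIndex.get(c), id = dIndex.get(d);
        if (ia == null || ib == null || ic == null || id == null) {
            return null;
        }
        return (V) cells[offset(ia, ib, ic, id)];
    }

    // Same arithmetic as the formula above: D varies fastest, A slowest.
    private int offset(int a, int b, int c, int d) {
        return d + sizeD * (c + sizeC * (b + sizeB * a));
    }

    // Looks up a key's index, handing out the next free index for a new key.
    private static int assign(Map<String, Integer> dict, String key, int limit) {
        Integer index = dict.get(key);
        if (index == null) {
            if (dict.size() >= limit) {
                throw new IllegalStateException("dimension already holds " + limit + " keys");
            }
            index = dict.size();
            dict.put(key, index);
        }
        return index;
    }
}
```

With the sizes from the question this would be new Grid4<>(100, 5, 10000, 2). A range query such as "any C" then becomes a loop over the C indices with the other three indices fixed.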
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on if your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*Trees (like R-Tree, but much faster for updates) or PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient, scales better with dimensions and depth is limited by maximum bitwidth of values, i.e. maximum '10000' requires 14 bits, so the depth will not be more than 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because it will on average anyway return 33%-50% of the whole tree.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the different selectivity of your dimensions, which may result in something close to a full scan of the tree. For example when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (with least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).
I'm trying to find a counterexample to the Pólya Conjecture which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check for (it checks all numbers from (2, x)). So, if the counterexample is in the 900 millions, I need an array that will be just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes)
Java arrays are indexed by int, so an array can't hold more than 2^31 - 1 elements (there are no unsigned ints). So, the maximum size of an array is 2147483647, which consumes (for a plain int[]) about 8589934588 bytes (roughly 8GB).
Thus, the int-index is usually not a limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31. This can get tricky, since the "usual" implementation ArrayList (and HashMap) uses arrays internally. You will have to implement a custom data structure; e.g. by using a 2-level array (a list/array). When you are at it, you can also try to pack the bits more tightly.
Java will allow up to 2 billion array entries. It’s your machine (and your limited memory) that cannot handle such a large amount.
900 million 32 bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64 bit JVM (on a machine with at least 8 GB of RAM) or use some disk backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You are probably getting an OutOfMemoryError, so add more memory with the -Xmx command line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
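A rough sketch of such a class follows. Instead of a sqrt(n) by sqrt(n) layout it uses fixed-size chunks of 2^20 ints, which is my own simplification so that the index function is a shift and a mask; the class name ChunkedIntArray is hypothetical.

```java
// Hypothetical int array addressed by a long index, stored as fixed-size chunks.
class ChunkedIntArray {
    private static final int CHUNK_BITS = 20;              // 2^20 ints (4 MiB) per chunk; arbitrary choice
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long length;
    private final int[][] chunks;

    ChunkedIntArray(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length");
        }
        this.length = length;
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new int[chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++) {
            // The last chunk may be shorter than CHUNK_SIZE.
            chunks[i] = new int[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= CHUNK_SIZE;
        }
    }

    int get(long index) {
        check(index);
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, int value) {
        check(index);
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    long length() {
        return length;
    }

    private void check(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

This removes the int index limit but not the memory requirement: 900 million ints still need about 3.35 GiB of heap, so the -Xmx setting has to allow for that.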
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (ie a file or database).
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes found already are used to mark the array before looking for the next one.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
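A simplified variant of this chunked approach is the textbook segmented sieve sketched below. It is not the exact code described above: rather than keeping a table of per-prime offsets, it recomputes each prime's first multiple for every segment. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Segmented Sieve of Eratosthenes: only one segment of flags is in memory at a time.
class SegmentedSieve {
    // Counts the primes <= limit using a flag array of segmentSize entries.
    static long countPrimes(long limit, int segmentSize) {
        if (segmentSize <= 0) {
            throw new IllegalArgumentException("segmentSize must be positive");
        }
        if (limit < 2) {
            return 0;
        }
        // Integer square root of limit, corrected for floating-point rounding.
        int root = (int) Math.sqrt((double) limit);
        while ((long) root * root > limit) root--;
        while ((long) (root + 1) * (root + 1) <= limit) root++;

        // Plain sieve for the base primes up to sqrt(limit).
        boolean[] composite = new boolean[root + 1];
        List<Integer> basePrimes = new ArrayList<>();
        for (int i = 2; i <= root; i++) {
            if (!composite[i]) {
                basePrimes.add(i);
                for (long j = (long) i * i; j <= root; j += i) {
                    composite[(int) j] = true;
                }
            }
        }

        long count = 0;
        boolean[] segment = new boolean[segmentSize];
        for (long low = 2; low <= limit; low += segmentSize) {
            long high = Math.min(low + segmentSize - 1, limit);
            Arrays.fill(segment, false);               // re-initialise the array for this chunk
            for (int p : basePrimes) {
                long square = (long) p * p;
                if (square > high) {
                    break;                             // larger primes cannot mark anything here
                }
                // First multiple of p inside [low, high], but never p itself.
                long start = Math.max(square, ((low + p - 1) / p) * p);
                for (long m = start; m <= high; m += p) {
                    segment[(int) (m - low)] = true;
                }
            }
            for (long n = low; n <= high; n++) {
                if (!segment[(int) (n - low)]) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For example, SegmentedSieve.countPrimes(100, 10) returns 25 while never holding more than ten flags plus the base primes 2, 3, 5 and 7.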
I second @sfossen's idea and @Aaron Digulla's. I'd go for disk access. If your algorithm can take in a List interface rather than a plain array, you could write an adapter from the List to the memory mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
Could you get by with 900 million bits? (Maybe stored as a byte array.)
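If one bit per number is enough, java.util.BitSet already does the packing. The sketch below is a plain Sieve of Eratosthenes, not the asker's algorithm (which apparently needs more information per number), and the class name BitSieve is hypothetical.

```java
import java.util.BitSet;

// Sieve of Eratosthenes that spends one bit per number instead of one int.
class BitSieve {
    // Returns a BitSet in which bit n is set exactly when n is prime, for 0 <= n <= limit.
    static BitSet primesUpTo(int limit) {
        if (limit < 0 || limit == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("limit must be in 0..Integer.MAX_VALUE - 1");
        }
        BitSet prime = new BitSet(limit + 1);
        if (limit < 2) {
            return prime;                       // no primes below 2
        }
        prime.set(2, limit + 1);                // set bits 2..limit (end index is exclusive)
        for (int i = 2; (long) i * i <= limit; i++) {
            if (prime.get(i)) {
                for (long j = (long) i * i; j <= limit; j += i) {
                    prime.clear((int) j);       // j is a proper multiple of i, so not prime
                }
            }
        }
        return prime;
    }
}
```

At 900 million bits the flags take roughly 107 MiB, compared with about 3.35 GiB for the same number of ints.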
You can try splitting it up into multiple arrays.
for (int x = 0; x <= 1000000; x++) {
    myFirstList.add(x);
}
for (int x = 1000001; x <= 2000000; x++) {
    mySecondList.add(x);
}
then iterate over them.
for (int x : myFirstList) {
    for (int y : myFirstList) {
        // Remove multiples
    }
}
// repeat for second list
Use a memory mapped file (the java.nio package, available since Java 1.4) instead. Or move the sieve into a small C library and use Java JNI.
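A minimal sketch of the memory mapped file option, using FileChannel.map from java.nio. The wrapper class MappedIntArray is hypothetical. A single mapping cannot exceed Integer.MAX_VALUE bytes, so this version tops out at roughly 536 million ints; 900 million would need several mappings, which is left out here.

```java
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical int array whose storage is a memory-mapped file instead of the Java heap.
class MappedIntArray implements AutoCloseable {
    private final FileChannel channel;
    private final IntBuffer ints;

    // length * 4 must not exceed Integer.MAX_VALUE, otherwise map() rejects the size.
    MappedIntArray(Path file, int length) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        MappedByteBuffer buffer = channel.map(
                FileChannel.MapMode.READ_WRITE, 0, (long) length * Integer.BYTES);
        ints = buffer.asIntBuffer();            // view the mapped bytes as ints
    }

    int get(int index) {
        return ints.get(index);
    }

    void set(int index, int value) {
        ints.put(index, value);
    }

    // Closes the channel; the mapping itself is released when the buffer is garbage collected.
    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

The operating system pages the file in and out as needed, so performance depends heavily on how sequential the access pattern is.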
I need to store around 200,000 SHA256 hashes in binary form in memory.
My requirements are,
The data structure should be most memory efficient.
I will be reading back the hashes in sorted order (Insertion order is NOT important), so, the data structure supporting lexicographical reading is better.
It would be a plus (though not mandatory) if two structures of the same type can be compared to find the common hashes in them.
Here are the data structures I considered,
Arrays:
Arrays seem to be the simplest and most memory efficient option, but I cannot use arrays because,
I will have to sort the data while reading it. The data structure by itself does not support it.
Since 200K hashes is not a hard limit and can also go higher than that, I won't know the size beforehand to allocate the array length. This means that I may sometimes need to resize the array by copying the whole contents of the array to a new array (having both the old and new ones in memory at the same time).
Compressed Radix Trie (Patricia Trie?)
Compressed Radix Trie seemed to be the most promising DS for my implementation. But a quick google search showed this link: https://news.ycombinator.com/item?id=101987 which said Radix Tries are not very memory optimized,
Quoting from the link:
Radix tries are nice. Use them when
...
(4) you don't care about memory usage that much.
I compared a simple 8-bit radix tree with some standard hash table implementation - the former took roughly ten times more memory. I then changed my radix to be based on 4 bits (each char is just split into 2 parts) and the memory usage improved twice. Now I'm wondering if radix tries have more room for improvement.
Hash Table?
I know hash tables don't support sorted reading like Radix tries do, but are they really so much more memory efficient (10 times better than radix trees)?
I still don't understand/am not convinced, is a Compressed Radix Trie not a memory optimal data structure? If not, which Data structure would best suit my needs?
If Radix trie is the best one known, is there an optimal algorithm which compares 2 Radix tries to find the common hashes in them?
P.S: I found the following similar questions on SO but they didn't solve my problem:
Storing 1 million phone numbers: This didn't have much information so closed as "Not Constructive" and the answers are about finding deltas of the phone numbers. But deltas would not be helpful for hashes, would they?
Most memory efficient way to store 8M+ sha256 hashes: This was about storing a key-value mapping and the answers are asking to use databases.
Graphs are often represented using an adjacency matrix. Various sources indicate it is possible to avoid the O(V^2) cost of initialization (V is the number of vertices) but I have not figured out how.
In Java, simply by declaring the matrix, e.g. boolean adj [][], the runtime will automatically initialize the array with false and this will be at O(V^2) cost (the dimensions of the array).
Do I misunderstand? Is it possible to avoid the quadratic cost of initialization of the adjacency matrix, or is this just something theoretical that depends on the implementation language?
That would be possible by using a sparse matrix representation of an adjacency matrix where only the position of the "ones" is allocated rather than each and every element of the matrix (which might include a large number of zeros). You might find this thread useful as well
The default initialization of the matrix's values is in fact a feature. Without the default initialization, wouldn't you still need to initialize every field yourself so you know what to expect its value to be?
Adjacency matrices have this drawback: they are bad in the sense of memory efficiency (they require O(n^2) memory cells) and as you said their initialization is slower. The initialization, however, is never considered the biggest problem. Believe me, the memory allocation is a lot slower and the needed memory is much more limiting than the initialization time.
In many cases people prefer using adjacency lists instead of the matrix. Such lists require O(m) memory, where m is the number of edges in the graph. This is a lot more efficient, especially for sparse graphs. The only operation for which this graph representation is worse than the adjacency matrix is the query "is there an edge between vertices i and j?": the matrix answers in O(1) time, while the list has to scan the neighbours of i and will be a lot slower.
However many of the typical graph algorithms (like Dijkstra, Bellman-Ford, Prim, Tarjan, BFS and DFS) will only need to iterate all the neighbours of a given vertex. All these algorithms benefit immensely if you use adjacency list instead of matrix.
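As a small illustration, here is a sketch of an undirected graph stored as adjacency lists, with BFS as the example of an algorithm that only iterates neighbours. The class and method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical undirected graph stored as one neighbour list per vertex.
class AdjacencyListGraph {
    private final List<List<Integer>> neighbours = new ArrayList<>();

    // Allocates one empty list per vertex; no V x V matrix is ever created.
    AdjacencyListGraph(int vertexCount) {
        for (int i = 0; i < vertexCount; i++) {
            neighbours.add(new ArrayList<>());
        }
    }

    void addEdge(int u, int v) {
        neighbours.get(u).add(v);
        neighbours.get(v).add(u);
    }

    // The one query that is slower than with a matrix: it scans u's neighbour list.
    boolean hasEdge(int u, int v) {
        return neighbours.get(u).contains(v);
    }

    // BFS from source; returns the number of edges to each vertex, or -1 if unreachable.
    int[] bfsDistances(int source) {
        int[] distance = new int[neighbours.size()];
        Arrays.fill(distance, -1);
        distance[source] = 0;
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.remove();
            for (int v : neighbours.get(u)) {   // only the actual neighbours are visited
                if (distance[v] == -1) {
                    distance[v] = distance[u] + 1;
                    queue.add(v);
                }
            }
        }
        return distance;
    }
}
```

The BFS touches every vertex and edge once, whereas the same traversal over a matrix would have to inspect a full row of V cells for every vertex.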
There is a good deal of confusion and misinformation in this thread. In fact, there is a method of avoiding initialization costs of adjacency matrices (and any array in general). However, it is not possible to use the method with Java primitives since they are initialized with default values under the hood.
Suppose you could create an array data[0..n] that is not auto-initialized. To start, it is filled with junk from whatever was previously in memory. If we don't want to spend O(n) time overwriting it, we need some way to differentiate the good data we add from junk data.
The "trick" is to use an auxiliary stack that tracks cells containing good data. The first time we write to data[i], we add index i to the stack. Since a stack only grows as we add to it, it never contains any junk we need to worry about.
Now whenever we access data[k], we can check if it's junk or not by scanning the stack for k. But that would take O(n) time for each read, defeating the point of an array in the first place!
To solve this, we make another auxiliary array stack_check[0..n] that also starts full of junk. This array contains pointers to elements in the stack. Now when we first write to data[i], we push i onto the stack and set stack_check[i] to point to the new stack element.
If data[k] is good data, then stack_check[k] points to a stack element holding k. If data[k] is junk, then the junk value of stack_check[k] either points outside of the stack or points to some stack element besides k (since k was never put on the stack). Checking this property only takes O(1) time so our array access is still fast.
Bringing it all together, we can create our array and helper structures in O(1) time by letting them be full of junk. On each read and write, we check if the given cell contains junk in O(1) time using our helpers. If we are writing over junk, we update our helper structures to mark the cell as valid data. If we read junk, we can handle it in whatever way is appropriate for the given problem. For example, we could return a default value like 0 (now you can't even tell we didn't initialize it!) or maybe throw an exception.
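The bookkeeping can be sketched in Java, with the caveat already mentioned: Java zero-fills the three arrays on allocation, so the O(1) creation is not actually achieved here. The sketch only shows the validity check, which never relies on that zero-fill, and the O(1) clear it enables. Names follow the description above where possible; LazyIntArray itself is a hypothetical class.

```java
// Illustrates the stack / stack_check trick. Java still zero-fills these arrays on
// allocation, so only the validity check and the O(1) clear() are demonstrated.
class LazyIntArray {
    private final int[] data;        // values; treated as junk unless validated
    private final int[] stack;       // indices of cells written so far, in order of first write
    private final int[] stackCheck;  // stackCheck[i] = position of i in stack, if i is valid
    private final int defaultValue;
    private int stackSize = 0;

    LazyIntArray(int length, int defaultValue) {
        this.data = new int[length];
        this.stack = new int[length];
        this.stackCheck = new int[length];
        this.defaultValue = defaultValue;
    }

    // O(1): cell i is valid only if stackCheck[i] points into the used part of the
    // stack AND that stack slot points back at i.
    boolean isInitialized(int i) {
        int pos = stackCheck[i];
        return pos >= 0 && pos < stackSize && stack[pos] == i;
    }

    int get(int i) {
        return isInitialized(i) ? data[i] : defaultValue;
    }

    void set(int i, int value) {
        if (!isInitialized(i)) {
            stack[stackSize] = i;        // push i onto the stack
            stackCheck[i] = stackSize;   // remember where it was pushed
            stackSize++;
        }
        data[i] = value;
    }

    // Forgets every written cell in O(1) by emptying the stack.
    void clear() {
        stackSize = 0;
    }
}
```

After set(3, 7) on a fresh array, get(0) still returns the default: stackCheck[0] happens to be 0, but stack[0] holds 3 rather than 0, so the cell is recognised as junk.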
I'll elaborate on A_A's answer. He recommends a sparse matrix, which basically means you're back to maintaining adjacency lists.
You have two reasons to use a matrix - if you don't care about performance at all and like the simple code it offers, or if you do care about performance but your matrix is going to be relatively full (let's say at least 20% full, for the sake of this post).
You obviously do care about performance. If your matrix is going to be relatively empty, its initialization overhead can be meaningful, and you're better off using adjacency lists. If it's going to be quite full, initialization becomes negligible - you'll need to fill the right cells in the matrix (which will take more than initializing it), and you need to process them (which, again, will take more time than initializing it).