Fastest way to access a table of data in Java

Basically I am in the middle of a friendly code optimisation battle (to produce the fastest program), and I am trying to find a way to access a dictionary of hard-coded data that is faster than a multidimensional array.
e.g. to get the value for x:
int x = array[v1][v2][v3];
I have read that nested switch statements in a custom lookup method may possibly be faster. Or is there a way I can access memory more directly, similar to pointers in C? Any ideas appreciated!
My 'competitor' is using a truth table, and the idea is to find something faster!

If the array is regular in shape (i.e. MxNxK for some fixed M, N and K), you could try flattening it to achieve better locality of reference:
int array[] = new int[M*N*K];
...
int x = array[v1*N*K + v2*K + v3];
Also, if the entire array doesn't fit in the CPU cache, you might want to examine the patterns in which the array is accessed, to perhaps re-order the indices or change your code to make better use of the caches.
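For instance, if you sweep the whole table, keeping v3 in the innermost loop walks the flattened array sequentially. A minimal sketch of that access pattern (process() is a hypothetical stand-in for whatever you do with each value):
for (int v1 = 0; v1 < M; v1++) {
    for (int v2 = 0; v2 < N; v2++) {
        for (int v3 = 0; v3 < K; v3++) {
            // consecutive v3 values touch consecutive ints, so cache
            // lines and the hardware prefetcher are used efficiently
            process(array[v1 * N * K + v2 * K + v3]);
        }
    }
}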

Multidimensional arrays with different sizes

I just had an idea to test something out and it worked:
String[][] arr = new String[4][4];
arr[2] = new String[5];
for (int i = 0; i < arr.length; i++)
{
    System.out.println(arr[i].length);
}
The output obviously is:
4
4
5
4
So my questions are:
Is this good or bad style of coding?
What could this be good for?
And most of all, is there a way to create such a construct in the declaration itself?
Also... why is it even possible to do?
Is this good or bad style of coding?
Like anything, it depends on the situation. There are situations where jagged arrays (as they are called) are in fact appropriate.
What could this be good for?
Well, for storing data sets with different lengths in one array. For instance, if we had the strings "hello" and "goodbye", we might want to store their character arrays in one structure. These char arrays have different lengths, so we would use a jagged array.
And most of all, is there a way to create such a construct in the declaration itself?
Yes:
char[][] x = {{'h','e','l','l','o'},
{'g','o','o','d','b','y','e'}};
Also... why is it even possible to do?
Because it is allowed by the Java Language Specification, §10.6.
This is a fine style of coding, there's nothing wrong with it. I've created jagged arrays myself for different problems in the past.
This is good because you might need to store data in this way. Storing data like this can save memory, and it is a natural way to map items more efficiently in certain scenarios.
In a single line, without explicitly populating the array? No. This is the closest thing I can think of.
int[][] test = new int[10][];
test[0] = new int[100];
test[1] = new int[500];
This would allow you to populate the rows with arrays of different lengths. I prefer this approach to populating with values like so:
int[][] test = new int[][]{{1,2,3},{4},{5,6,7}};
Because it is more readable, and practical to type when dealing with large ragged arrays.
It's possible to do for the reasons given in 2. People have valid reasons for needing ragged arrays, so the creators of the language gave us a way to do it.
(1) While nothing is technically/functionally/syntactically wrong with it, I would say it is bad coding style because it breaks the assumption provided by the initialization of the object (String[4][4]). This, ultimately, is up to user preference; if you're the only one reading it, and you know exactly what you're doing, it would be fine. If other people share/use your code, it adds confusion.
(2) The only concept I could think of is if you had multiple arrays to be read in, but didn't know the size of them beforehand. However, it would make more sense to use ArrayList<String> in that case, unless the added overhead was a serious matter.
(3) I'm not sure what you're asking about here. Do you mean, can you somehow specify individual array lengths in that initial declaration? The answer to that is no.
(4) It's possible because each row of a 2D array is a separate array object; behind the scenes, you're just allocating and releasing chunks of memory.

How to store table or matrix in Java?

I used to use a matrix in Octave to store data from a data set; how can I do that in Java? Assume I have 10-20 columns and a lot of data. I don't think
int[][] data;
would be the best option. Is a nested map the only solution?
You could create a class Coordinate that takes X and Y values and properly implements hashCode and equals.
Then create a HashMap<Coordinate, Data> and work with it.
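A minimal sketch of that idea (Double stands in for the Data value type; Objects.hash keeps the boilerplate short):
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class Coordinate {
    private final int x;
    private final int y;

    Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Coordinate)) return false;
        Coordinate c = (Coordinate) o;
        return x == c.x && y == c.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // combines both fields into one hash
    }
}

// usage: store and look up cell values by coordinate
Map<Coordinate, Double> table = new HashMap<Coordinate, Double>();
table.put(new Coordinate(3, 7), 42.0);
Double value = table.get(new Coordinate(3, 7));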
Depends on what you need to do. If you know the size of the lists, then an array is definitely ideal, since it gives you instant access (read/write time) to any position in the array; this is very useful for speed.
Maps are better if you don't know the size and it needs to be able to adapt.
And finally, as I discovered in a previous question, if you have a TON of data, and a lot of it will be "0", you might want to also consider using a sparse matrix.
This answer merges some of gnomed's answer and SJuan76's answer contents.
At a quick glance, I'd suggest you use two-dimensional arrays such as int[][].
It's not a huge amount of data (we're speaking of ≈500 ints), so it's not a bad idea.
Advantages: it's the simplest, ideal (from the data-structuring side) way to go,
especially if every “slot” of the matrix contains data.
The drawback: you have to know the size of the matrix before constructing it.
You can, however, resize it later using the Arrays utilities.
If you want more effective handling of the data, you can use a single point-map.
That is, the key of every entry is a java.awt.Point that defines where the value is located.
Advantages: it's more effective than having a 2D array,
especially if part of your matrix doesn't contain data.
And it's adaptive; you don't need to know any sizes to construct or resize it.
The drawback: if every “slot” of your matrix contains data,
you'll lose (a lot of) space and performance; a 2D array is more effective in that case.
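A minimal sketch of the point-map approach (int values assumed; any value type works):
import java.awt.Point;
import java.util.HashMap;
import java.util.Map;

// sparse "matrix": only cells that actually hold data consume memory
Map<Point, Integer> matrix = new HashMap<Point, Integer>();
matrix.put(new Point(2, 5), 17);                       // set cell (2, 5)
Integer v = matrix.get(new Point(2, 5));               // read it back
boolean empty = !matrix.containsKey(new Point(0, 0));  // unset cells are simply absent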
Want more? If your data is really huge you can use a sparse matrix.
See this question for more details.
I would not discard multidimensional arrays so far: have you tried them? Are you finding specific limitations? IMHO as long as your data fits in memory, arrays can be good.
If your data is very sparse though, you may want to look at maps indeed.
Related question btw: Making a very large Java array
You can use multidimensional arrays, or you can try a key/value structure like a HashMap.
I think multi-dimensional arrays are the best choice! They should serve your purpose. If your data set is only integers, int[][] is an ideal choice.
Well, if your indices are small integers, you can certainly use nested arrays.
In a matrix class, you may want to use a plain array, like so: (assuming n is the number of columns)
double get(int i, int j) { return data[i*n + j]; }
For a general table (sparse matrix), you can use nested maps, but consider using com.google.common.collect.Table implementations from the Google Guava library.
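A minimal sketch with Guava's Table (HashBasedTable is one stock implementation; the String row/column keys here are arbitrary examples):
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

Table<String, String, Double> table = HashBasedTable.create();
table.put("row1", "colA", 3.14);                  // set a cell by (row key, column key)
Double v = table.get("row1", "colA");             // read it back
boolean present = table.contains("row2", "colA"); // missing cells are simply absent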

How should I map string keys to values in Java in a memory-efficient way?

I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something memory efficient; retrieval speed is a secondary parameter.
Currently I'm going along the line of:
List<Tuple<String, Integer>> list = new ArrayList<Tuple<String, Integer>>(); // Tuple: a simple pair class
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.
Edit: I just saw you mentioned the Strings were UK postcodes, so I'm fairly confident you couldn't go very wrong by using a Trove TLongIntHashMap (btw, Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) of the memory use and speed of Trove compared to the default Collections. Here's a snippet:
                   100000 put operations   100000 contains operations
java collections   1938 ms                 203 ms
trove               234 ms                 125 ms
pcj                 516 ms                  94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections   oscillates between 6644536 and 7168840 bytes
trove              1853296 bytes
pcj                1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, it's pretty obvious that Trove will save not only memory but will always be much faster.
So our goal now becomes to use Trove (seeing that by putting millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7-character keys, and a String/int mapping.
2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String, Integer>, which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge object creation altogether and represent your 7 characters in a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented that they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of key/value pairs into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) say you only have at most 256 codepoints/characters used; then each character fits in a byte, and 7 characters fit in 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into longs (assuming ASCII characters, one byte per character for simplicity; 7 bits would be enough):
long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // pack character i into byte i of the long; with ASCII input
        // (values 0-127) the byte cast never sign-extends
        result += ((long) ((byte) key.charAt(i))) << (i * 8);
    }
    return result;
}
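A sketch of how the encoded keys would then be used (package path as in Trove 3; TLongIntHashMap stores primitive long keys and int values with no boxing):
import gnu.trove.map.hash.TLongIntHashMap;

TLongIntHashMap map = new TLongIntHashMap(2000000);
map.put(encode("SW1A1AA"), 42);         // postcode -> int, no per-entry objects
int value = map.get(encode("SW1A1AA"));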
Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.
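A sketch of the usage (package path as in Trove 3; keys are Objects, values stay primitive):
import gnu.trove.map.hash.TObjectIntHashMap;

TObjectIntHashMap<String> map = new TObjectIntHashMap<String>();
map.put("SW1A1AA", 42);         // the int value is stored as a primitive, no Integer boxing
int value = map.get("SW1A1AA");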
First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want such an approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory; the first two would provide constant-time access and the TreeMap logarithmic access, and they provide a nicer interface than a plain list. I would try to measure how big the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific to strings, such as tries (especially Patricia tries), which might reduce the storage space needed for the strings.
What you are looking for is a succinct-trie - a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meanwhile, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.
Have you looked at tries? I've not used them, but they may fit with what you're doing.
A custom tree would have the same O(log n) complexity; don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList, because the linked list allocates one extra object per stored value, which amounts to a lot of objects in your case.
As Erick writes, using the Trove library is a good place to start, as you save space by storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit, so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
You could consider transforming each String into a UTF-8 encoded byte array and wrapping it in a class, say MyString (see the sketch after this list). Obviously the trade-off here is the additional time spent performing look-ups.
You could write the map to a file and then memory map a portion or all of the file.
You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
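A minimal sketch of the UTF-8 wrapper idea from above (MyString is a hypothetical name; equals/hashCode delegate to Arrays so instances can serve as map keys):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MyString {
    private final byte[] bytes; // one byte per character for ASCII-range text, vs. two in a char[]

    MyString(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MyString && Arrays.equals(bytes, ((MyString) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8); // decode only when needed
    }
}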
Maybe you can go with a RadixTree?
Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. There are no extra buckets, unlike HashMap or Hashtable.
I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.
I'd consider using some cache as these often have the overflow-to-disk ability.
You might create a key class that matches your needs. Perhaps like this (the elided bodies are filled in here with one straightforward possibility):
public class MyKey implements Comparable<MyKey>
{
    private final char[] keyValue = new char[7];

    public MyKey(String keyValue)
    {
        // load this.keyValue from the String keyValue
        keyValue.getChars(0, 7, this.keyValue, 0);
    }

    public int compareTo(MyKey rhs)
    {
        // lexicographic comparison of the two 7-char arrays
        for (int i = 0; i < 7; i++) {
            int diff = keyValue[i] - rhs.keyValue[i];
            if (diff != 0) return diff;
        }
        return 0;
    }

    public boolean equals(Object rhs)
    {
        return rhs instanceof MyKey && compareTo((MyKey) rhs) == 0;
    }

    public int hashCode()
    {
        return java.util.Arrays.hashCode(keyValue);
    }
}
Try this one:
OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for (int i = 0; i < 2000000; i++)
{
    myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));

import java.util.Collection;
import java.util.HashMap;

public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    @Override
    public boolean containsValue(Object value) {
        if (value != null)
        {
            Class<? extends Object> aClass = value.getClass();
            if (aClass.isArray())
            {
                // compares the first element of each stored int[] against the
                // query's first element, instead of comparing references
                Collection<V> values = this.values();
                for (Object val : values)
                {
                    int[] newval = (int[]) val;
                    int[] newvalue = (int[]) value;
                    if (newval[0] == newvalue[0])
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
Actually, HashMap and List are too general for such a specific task as looking up an int by zipcode. You should take advantage of knowledge of the data being used. One of the options is to use a prefix tree with leaves that store the int value. It could also be pruned if (my guess) a lot of codes with the same prefixes map to the same integer.
Lookup of the int by zipcode will be linear in the length of the code in such a tree and will not grow as the number of codes increases, compared to O(log N) in the case of binary search.
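A minimal sketch of such a prefix tree (the 36-symbol alphabet of digits and upper-case letters is an assumption; extend it if your codes contain spaces, and note there is no pruning here):
final class TrieNode {
    private final TrieNode[] children = new TrieNode[36]; // '0'-'9' and 'A'-'Z'
    private int value;
    private boolean hasValue;

    private static int index(char c) {
        return c <= '9' ? c - '0' : 10 + (c - 'A');
    }

    void put(String key, int val) {
        TrieNode node = this;
        for (int i = 0; i < key.length(); i++) {
            int idx = index(key.charAt(i));
            if (node.children[idx] == null) {
                node.children[idx] = new TrieNode();
            }
            node = node.children[idx];
        }
        node.value = val;
        node.hasValue = true;
    }

    // returns -1 for absent keys; cost depends only on the key length,
    // not on how many keys are stored
    int get(String key) {
        TrieNode node = this;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[index(key.charAt(i))];
            if (node == null) return -1;
        }
        return node.hasValue ? node.value : -1;
    }
}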
Since you are intending to use hashing, you can try numerical conversions of the strings based on ASCII values.
The simplest idea would be:
int sum = 0;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i];
}
hash "sum" using a well defined hash functions. You would use a hash function based on the expected input patterns.
e.g. if you use division method
public int hasher(int sum) {
    return sum % prime; // 'prime' is a well-chosen prime number
}
Selecting a prime number which is not close to an exact power of two improves performance and gives a better, more uniform distribution of hashed keys.
Another method is to weight the characters based on their respective positions.
E.g. with the method above, both "abc" and "cab" will be hashed to the same location; if you need them stored in two distinct locations, give the positions weights, as in positional number systems:
int sum = 0;
int weight = 1;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i] * weight;
    weight = weight * 2; // using powers of 2 gives better results. (you know why :))
}
As your sample is quite large, you'd avoid collisions better with a chaining mechanism than with a probe sequence.
Ultimately, which method you choose depends entirely on the nature of your application.
The problem is the objects' memory overhead, but using some tricks you can try to implement your own hash set. Like others said, Strings have quite a large overhead, so you need to "compress" them somehow. Also try not to use too many arrays (lists) in the hashtable (if you do a chaining-type hashtable), as they are also objects and also have overhead. Better yet, use an open-addressing hashtable.

Starting Size for an ArrayList

I want to use an ArrayList (or some other collection) the way I would use a standard array.
Specifically, I want it to start with an initial size (say, SIZE), and be able to set elements explicitly right off the bat,
e.g.
array[4] = "stuff";
could be written
array.set(4, "stuff");
However, the following code throws an IndexOutOfBoundsException:
ArrayList<Object> array = new ArrayList<Object>(SIZE);
array.set(4, "stuff"); //wah wahhh
I know there are a couple of ways to do this, but I was wondering if there was one that people like, or perhaps a better collection to use. Currently, I'm using code like the following:
ArrayList<Object> array = new ArrayList<Object>(SIZE);
for (int i = 0; i < SIZE; i++) {
    array.add(null);
}
array.set(4, "stuff"); //hooray...
The only reason I even ask is because I am doing this in a loop that could potentially run a bunch of times (tens of thousands). Given that the ArrayList resizing behavior is "not specified," I'd rather it not waste any time resizing itself, or memory on extra, unused spots in the Array that backs it. This may be a moot point, though, since I will be filling the array (almost always every cell in the array) entirely with calls to array.set(), and will never exceed the capacity?
I'd rather just use a normal array, but my specs are requiring me to use a Collection.
The initial capacity is how big the backing array is; it does not mean there are elements in it. So size != capacity.
In fact, you can use an array, and then use Arrays.asList(array) to get a collection.
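Two minimal sketches of that idea; the Collections.nCopies variant is an alternative way to pre-fill a real ArrayList to a given size in one call:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// a fixed-size List view backed by the array: set() works immediately, add() does not
Object[] backing = new Object[SIZE];
List<Object> view = Arrays.asList(backing);
view.set(4, "stuff");

// or: a resizable ArrayList pre-filled with SIZE nulls
List<Object> list = new ArrayList<Object>(Collections.nCopies(SIZE, (Object) null));
list.set(4, "stuff");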
I recommend a HashMap:
HashMap<Integer, String> hash = new HashMap<Integer, String>();
hash.put(4, "Hi");
Considering that your main concern is memory: you could manually do what the Java ArrayList does internally, since ArrayList doesn't let you control by how much it grows. So you can do the following:
1) Create an array.
2) If the array is full, create a new array of the old array's size + as much extra as you want.
3) Copy all items from the old array to the new one.
This way, you will not waste memory.
Or you can implement a List structure (not a Vector); I think Java already has one.
Yes, a hashmap would be a great idea.
Alternatively, you could just start the array with a capacity big enough for your purpose.

Array access optimization

I have a 10x10 array in Java, some of whose items are not used, and I need to traverse all elements as part of a method. Which would be better to do:
Go through all elements with 2 for loops and check for null to avoid errors, e.g.
for (int y = 0; y < 10; y++) {
    for (int x = 0; x < 10; x++) {
        if (array[x][y] != null)
            //perform task here
    }
}
Or would it be better to keep a list of all the used addresses... say, an ArrayList of points?
Something different I haven't mentioned.
I look forward to any answers :)
Any solution you try needs to be tested in controlled conditions resembling as much as possible the production conditions. Because of the nature of Java, you need to exercise your code a bit to get reliable performance stats, but I'm sure you know that already.
This said, there are several things you may try, which I've used to optimize my Java code with success (but not on the Android JVM):
for (int y = 0; y < 10; y++) {
    for (int x = 0; x < 10; x++) {
        if (array[x][y] != null)
            //perform task here
    }
}
should in any case be reworked into
for (int x = 0; x < 10; x++) {
    for (int y = 0; y < 10; y++) {
        if (array[x][y] != null)
            //perform task here
    }
}
Often you will get a performance improvement from caching the row reference. Let us assume the array is of the type Foo[][]:
for (int x = 0; x < 10; x++) {
    final Foo[] row = array[x];
    for (int y = 0; y < 10; y++) {
        if (row[y] != null)
            //perform task here
    }
}
Using final with variables was supposed to help the JVM optimize the code, but I think that modern JIT Java compilers can in many cases figure out on their own whether the variable is changed in the code or not. On the other hand, sometimes this may be more efficient, although it takes us definitely into the realm of micro-optimizations:
Foo[] row;
for (int x = 0; x < 10; x++) {
    row = array[x];
    for (int y = 0; y < 10; y++) {
        if (row[y] != null)
            //perform task here
    }
}
If you don't need to know the element's indices in order to perform the task on it, you can write this as
for (final Foo[] row : array) {
    for (final Foo elem : row) {
        if (elem != null)
            //perform task here
    }
}
Another thing you may try is to flatten the array and store the elements in a Foo[] array, ensuring maximum locality of reference. You have no inner loop to worry about, but you need to do some index arithmetic when referencing particular array elements (as opposed to looping over the whole array). Depending on how often you do it, it may or may not be beneficial.
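A minimal sketch of the flattened layout for the 10x10 case (Foo as above):
static final int N = 10;
Foo[] flat = new Foo[N * N];

// whole-array sweep: a single loop with maximum locality of reference
for (int i = 0; i < flat.length; i++) {
    if (flat[i] != null) {
        //perform task here
    }
}

// random access needs index arithmetic instead of a second dereference:
Foo item = flat[x * N + y];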
Since most of the elements will be not-null, keeping them as a sparse array is not beneficial for you, as you lose locality of reference.
Another problem is the null test. The null test itself doesn't cost much, but the conditional statement following it does, as you get a branch in the code and lose time on wrong branch predictions. What you can do is to use a "null object", on which the task will be possible to perform but will amount to a no-op or something equally benign. Depending on the task you want to perform, it may or may not work for you.
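A sketch of setting that up (NULL_FOO and the Foo(int) constructor are hypothetical; the point is that unused slots hold a harmless real object rather than null, so the loop body needs no branch):
// a real Foo on which the task is a harmless no-op (e.g. contributes 0 to a sum)
static final Foo NULL_FOO = new Foo(0);

Foo[][] array = new Foo[10][10];
for (Foo[] row : array) {
    java.util.Arrays.fill(row, NULL_FOO); // unused slots hold the sentinel, never null
}
// real elements simply overwrite the sentinel:
array[2][3] = new Foo(42);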
Hope this helps.
You're better off using a List than an array, especially since you may not use the whole set of data. This has several advantages.
You're not checking for nulls and may not accidentally try to use a null object.
More memory efficient in that you're not allocating memory which may not be used.
For a hundred elements, it's probably not worth using any of the classic sparse array implementations. However, you don't say how sparse your array is, so profile it and see how much time you spend skipping null items compared to whatever processing you're doing.
(As Tom Hawtin - tackline mentions) when using an array of arrays, you should try to loop over the members of each inner array rather than looping over the same index of different arrays. Not all algorithms allow you to do that, though.
for (int x = 0; x < 10; ++x) {
    for (int y = 0; y < 10; ++y) {
        if (array[x][y] != null)
            //perform task here
    }
}
or
for (Foo[] row : array) {
    for (Foo item : row) {
        if (item != null)
            //perform task here
    }
}
You may also find it better to use a null object rather than testing for null, depending on the complexity of the operation you're performing. Don't use the polymorphic version of the pattern (a polymorphic dispatch will cost at least as much as a test and branch), but if you were summing properties, having an object with a zero value is probably faster on many CPUs.
double sum = 0;
for (Foo[] row : array) {
    for (Foo item : row) {
        sum += item.value();
    }
}
As to what applies to Android, I'm not sure; again, you need to test and profile for any optimisation.
Holding an ArrayList of points would be "over engineering" the problem. You have a multi-dimensional array; the best way to iterate over it is with two nested for loops. Unless you can change the representation of the data, that's roughly as efficient as it gets.
Just make sure you go in row order, not column order.
Depends on how sparse/dense your matrix is.
If it is sparse, you'd better store a list of points; if it is dense, go with the 2D array. If in between, you can have a hybrid solution storing a list of sub-matrices.
This implementation detail should be hidden within a class anyway, so your code can also anytime convert between any of these representations.
I would discourage you from settling on any of these solutions without profiling with your real application.
I agree an array with a null test is the best approach unless you expect sparsely populated arrays.
Reasons for this:
1- More memory efficient for dense arrays (a list needs to store the index as well as the value).
2- More computationally efficient for dense arrays (you need only compare the value you just retrieved to null, instead of also having to fetch the index from memory).
Also, a small suggestion, but in Java especially you are often better off faking a multidimensional array with a 1D array where possible (for square/rectangular arrays in 2D). Bounds checking then happens only once per iteration, instead of twice. Not sure if this still applies in the Android VMs, but it has traditionally been an issue. Regardless, you can ignore it if the loop is not a bottleneck.
