I am currently writing some code in Java meant to be a small framework for a project that revolves around a database with several billion entries. I want to keep it high-level, and the data retrieved from the database should be easy to use for statistical inference. I have decided to use the Map interface in this project.
A core concept is mapping the attributes ("columns in the database") to values ("cells") when handling single datasets (by which I mean a row in the database), for readable code. I use enum constants (of a type named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all Strings (and not very large, 40 characters at most).
There are 15 columns, so there are 15 enum constants, and each map will have at most that many entries.
So it appears that I will at times have a very large number of Map objects floating around, each with a comparatively small payload (15 entries or fewer). My goal is to keep the implementation's memory overhead from exploding relative to the actual payload. (Stretch goal: do the same with CPU usage ;] )
Until now I was not really familiar with the different implementations in the Java Collections Framework. When the problem dawned on me today, I looked into my all-time favorite, HashMap, and was not happy with how much memory overhead it carries. I am sure that, in addition to the standard implementations, there are a number of implementations not shipped with Java. Googling my case did not turn up much, so I am asking you:
Do you know a good implementation of Map for my use case (low entry count, low value size, enumerable keys, ...)?
I hope I made my use case clear, and I am looking forward to your input =)
Thanks a lot!
Stretch answer goal, absolutely optional and only if you got the time and knowledge:
What other implementations of collections are suitable for:
handling attribute (the String things) vectors, and matrices for inference data (counts/probabilities) (Matrices: here I am really clueless for now; I have not done any serious math work in Java to date)
math libraries for statistical inference, see above
Use EnumMap. It is the best map implementation when your keys are enums, for both performance and memory usage.
The trick is that this map implementation is the only one that does not store the keys per entry; it only needs a single array with the values, indexed by each key's ordinal (similar to an ArrayList of the values). There is only a little overhead if some keys are not mapped to a value, but in most cases this won't be a problem because enums usually do not have many constants.
Compared to HashMap, you additionally get a predictable iteration order for free.
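A minimal sketch of what this looks like in practice. The three-constant Attribute enum and the sample values are made up for illustration; the question's real enum has 15 constants.

```java
import java.util.EnumMap;
import java.util.Map;

class EnumMapDemo {
    // Hypothetical stand-in for the question's Attribute enum.
    enum Attribute { NAME, CITY, COUNTRY }

    static Map<Attribute, String> sampleRow() {
        Map<Attribute, String> row = new EnumMap<>(Attribute.class);
        // Inserted out of declaration order on purpose.
        row.put(Attribute.COUNTRY, "NL");
        row.put(Attribute.NAME, "Alice");
        return row;
    }

    public static void main(String[] args) {
        Map<Attribute, String> row = sampleRow();
        // Iteration follows the declaration order of the enum constants,
        // not the insertion order.
        for (Map.Entry<Attribute, String> e : row.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        // An unmapped key simply yields null.
        System.out.println(row.get(Attribute.CITY));
    }
}
```

Code that only depends on the Map interface does not need to change when you swap a HashMap for an EnumMap.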
Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.
Look at
Trove4j : https://bitbucket.org/robeden/trove/
HPPC: http://labs.carrotsearch.com/hppc.html
FastUtil: http://fastutil.di.unimi.it/
When you find a bottleneck, you can switch to using a lower-level API (more efficient).
You'll find many more choices if you look a bit more: What is the most efficient Java Collections library?
EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?
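To illustrate the intern() suggestion, here is a small sketch. The value "Berlin" is made up; new String(...) is only used to force two distinct objects, the way you would get them when reading the same cell value from many rows.

```java
class InternDemo {
    static boolean sameObject(String a, String b) {
        return a == b; // reference comparison on purpose
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal content.
        String a = new String("Berlin");
        String b = new String("Berlin");

        // Without interning, both copies stay on the heap.
        System.out.println(sameObject(a, b));

        // intern() returns one canonical instance per distinct content,
        // so maps holding the interned references share a single String.
        System.out.println(sameObject(a.intern(), b.intern()));
    }
}
```

Whether this pays off depends on how many duplicate values you actually have, so it is worth measuring on real data.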
You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps, that is going to mean insertion and look up speeds become much slower because you have to scan the entire list. However, for small maps it is actually faster this way since you don't have to calculate any hashcodes and only have to look at a small number of entries.
If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java
Related
I have a big HashMap in Java storing mappings from String to Integer. It has 400K records. It runs OK, but I am wondering if there is a better way to optimize it in terms of memory usage. After the Map is initialized, it will only be searched; there are no other update operations.
I vaguely remember that I came across some suggestions to convert the string keys as int, but not sure about that. Please help or share your ideas on this.
Thanks.
I vaguely remember that I came across some suggestions to convert the string keys as int, but not sure about that.
If the strings keys are actually the string representations of integers, then it could make sense to convert them to Integer or Long objects ... using Integer.valueOf(String). You will save some memory, since the primitive wrapper classes use less memory than the corresponding String objects. The space saving is likely to be significant (maybe ~16 bytes versus ~40 bytes per key ... depending on your platform.)
The flip-side of this is that you would need to convert candidate keys from String to the real key type before doing a hashmap lookup. That conversion takes a bit of time, and typically generates a bit of garbage.
But if the String keys do not represent integers, then this is simply not going to work. (Or at least ... I don't know what "conversion" you are referring to ...)
Note also that the key type has to be Integer / Long rather than int / long. Generic type parameters must be reference types.
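A minimal sketch of that preconversion, assuming every key really is the decimal representation of an int. The class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

class IntegerKeyDemo {
    // Builds a copy of the map whose keys are Integer instead of String.
    static Map<Integer, Integer> convertKeys(Map<String, Integer> source) {
        Map<Integer, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            // Throws NumberFormatException if a key is not a decimal int.
            result.put(Integer.valueOf(e.getKey()), e.getValue());
        }
        return result;
    }

    // Candidate keys have to be converted the same way before each lookup.
    static Integer lookup(Map<Integer, Integer> map, String candidateKey) {
        return map.get(Integer.valueOf(candidateKey));
    }
}
```

One caveat: strings such as "7" and "007" collapse into the same Integer key, so this only works if the string form of each key is canonical.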
There may be 3rd-party collection implementations that would help as well, depending on precisely how your data structure works; e.g. Trove, Guava, Fastutil. Try combining them with the String -> Integer preconversion ...
On the suggestion of using a database. If
you don't need the query / update / transactional capabilities of a database, AND
you can afford the memory to hold the data in memory, AND
you can afford the startup cost of loading the data into memory,
then using a database is just a big, unnecessary performance hit on each lookup.
You might want to tune initialCapacity and loadFactor, and also improve hashCode() to avoid collisions if you want a higher read rate. If you have many writes, you might want to benchmark hashCode().
If this is too big for your app, you might consider moving it outside the JVM to a cache (e.g. Redis), or maybe to a database if you can afford the small read/write delay.
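For the initialCapacity / loadFactor tuning, a rough sketch of presizing a HashMap so that it never has to rehash while the 400K records are loaded. The helper names are made up; 0.75 is HashMap's documented default load factor.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapDemo {
    // Smallest capacity for which expectedEntries stays within the
    // resize threshold (capacity * loadFactor).
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    static <K, V> Map<K, V> presized(int expectedEntries) {
        float loadFactor = 0.75f; // HashMap's default
        return new HashMap<>(capacityFor(expectedEntries, loadFactor), loadFactor);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = presized(400_000);
        System.out.println("requested capacity: " + capacityFor(400_000, 0.75f));
        System.out.println("entries so far: " + map.size());
    }
}
```

HashMap rounds the requested capacity up to a power of two internally. A higher load factor trades lookup speed for a smaller table, so benchmark before changing it.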
Writing the data to a database is ultimately the best solution if the data gets too big, but 400k is still doable in memory.
However, Java's built-in HashMap implementation uses separate chaining, and every key-value pair is stored in a separate entry object. I've gotten great (30%) speed improvements and awesome (50%) memory improvements by building a quadratic probing implementation of Map.
I suggest you search around on the internet. There are plenty of good implementations around!
You could use Guava's ImmutableMap -- it's optimized for write-once data, and takes ~15% less memory than HashMap.
I habitually use HashMap in my programs, since I know it is usually the most efficient (if properly used) and can cope with large maps easily. I know about EnumMap which is very useful for enumeration keys, but often I am generating a small map which will never get very big, is likely to be discarded pretty soon, and has no concurrency issues.
Is HashMap<K,V> too complicated for these small, local and temporary uses? Is there another, simple, implementation which I can use in these cases?
I think I'm looking for a Map implementation which is analogous to ArrayList for List. Does it exist?
Added later after responses:
Here is a scenario where a slow but very simple implementation might be better -- when I have many, many of these Maps. Suppose, for example, I have a million or so of these tiny little maps, each with a handful (often less than three) of entries. I have a low reference rate -- perhaps I don't actually reference them before they are discarded most of the time. Is it still the case that HashMap is the best choice for them?
Resource utilisation is more than just speed -- I would like something that doesn't fragment the heap a lot and make GCs take a long time, for example.
It may be that HashMap is the right answer, but this is not a case of premature optimisation (or at least it may not be).
Added much later after some thought:
I decided to hand-code my own SmallMap. It is easy to make one with AbstractMap. I have also added a couple of constructors so that a SmallMap can be constructed from an existing Map.
Along the way I had to decide how to represent Entrys and to implement SmallSet for the entrySet method.
I learned a lot by coding (and unit-testing) this and want to share it, in case anyone else wants one. It is on GitHub here.
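For readers who want to see the general shape of the AbstractMap approach, here is a minimal sketch. It is not the SmallMap from the repository: the TinyMap name, the ArrayList of entries, and the linear scan are assumptions made for illustration. AbstractMap only requires entrySet(), plus put() if the map should be modifiable.

```java
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch; not the author's SmallMap.
class TinyMap<K, V> extends AbstractMap<K, V> {
    private final List<Map.Entry<K, V>> entries = new ArrayList<>(2);

    TinyMap() {}

    // Copy constructor: AbstractMap.putAll calls put for every entry.
    TinyMap(Map<? extends K, ? extends V> source) {
        putAll(source);
    }

    @Override
    public V put(K key, V value) {
        // Linear scan; acceptable for a handful of entries.
        for (Map.Entry<K, V> e : entries) {
            if (Objects.equals(e.getKey(), key)) {
                return e.setValue(value); // returns the previous value
            }
        }
        entries.add(new AbstractMap.SimpleEntry<>(key, value));
        return null;
    }

    @Override
    public Set<Map.Entry<K, V>> entrySet() {
        // get, containsKey, remove, equals, etc. are all inherited from
        // AbstractMap and work through this view's iterator.
        return new AbstractSet<Map.Entry<K, V>>() {
            @Override
            public Iterator<Map.Entry<K, V>> iterator() {
                return entries.iterator();
            }

            @Override
            public int size() {
                return entries.size();
            }
        };
    }
}
```

This version still allocates one entry object per mapping, so it saves less memory than storing keys and values directly in arrays, but it shows how little code a full Map implementation needs.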
There is no standard small implementation of Map in Java. HashMap is one of the best and most flexible Map implementations around, and is hard to beat. However, in the very small requirement area -- where heap usage and speed of construction is paramount -- it is possible to do better.
I have implemented SmallCollections on GitHub to demonstrate how this might be done. I would love some comments on whether I have succeeded. It is by no means certain that I have.
Although the answers offered here were sometimes helpful, they tended, in general, to misunderstand the point. In any case, answering my own question was, in the end, much more useful to me than being given one.
The question here has served its purpose, and that is why I have 'answered it myself'.
I think this is premature optimization. Are you having memory problems? Performance problems from creating too many maps? If not I think HashMap is fine.
Besides, looking at the API, I'm not seeing anything simpler than a HashMap.
If you are having issues, you could roll your own Map implementation with very simple internals. But I doubt you would do better than the default Map implementations, plus you have the overhead of making sure your new class works. In this case there might be a problem with your design.
A HashMap is possibly the most light weight and simple collection.
Sometimes the more efficient solution is to use a POJO. e.g. if your keys are field names and/or your values are primitives.
HashMap is a good choice because it offers average case O(1) puts and gets. It does not guarantee ordering though like SortedMap implementations (i.e. TreeMap O(log n) puts and gets) but if you have no requirement for ordering then HashMap is better.
IdentityHashMap
If you are truly concerned about performance, consider IdentityHashMap.
This class tracks keys by their reference in memory rather than by the content of the key. This avoids having to call the hashCode method on your key, which may or may not be lengthy if not cached internally within its class.
This class provides constant-time performance for the basic operations (get and put).
Read the Javadoc carefully when choosing this implementation of Map.
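A small sketch of why the Javadoc warning matters: IdentityHashMap compares keys with == instead of equals(), so two equal but distinct key objects become two entries. The helper name is made up for this example.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

class IdentityMapDemo {
    static int sizeAfterTwoEqualKeys(Map<String, Integer> map) {
        // Two distinct objects that are equal according to equals().
        String k1 = new String("id");
        String k2 = new String("id");
        map.put(k1, 1);
        map.put(k2, 2);
        return map.size();
    }

    public static void main(String[] args) {
        // HashMap treats the two keys as the same key.
        System.out.println(sizeAfterTwoEqualKeys(new HashMap<>()));
        // IdentityHashMap keeps them apart because k1 != k2 by reference.
        System.out.println(sizeAfterTwoEqualKeys(new IdentityHashMap<>()));
    }
}
```

This is safe for keys that are canonical instances, such as enum constants or Class objects, and surprising for almost everything else.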
Map.of
Java 9 and JEP 269: Convenience Factory Methods for Collections brought us the convenience of literals syntax for concisely defining a map with Map.of (and list with List.of).
If your situation can use an unmodifiable map, use this simple syntax.
Note that the Map.of methods return a Map interface object. The underlying concrete Map class is unknown and may vary depending on your particular code and the version of Java. The Java implementation is free to optimize in its choice of a concrete class. For example, if your Map.of uses an enum for keys, the Map.of command might choose an EnumMap, or it might choose some yet-to-be-devised class optimized for very small maps.
Map< DayOfWeek , Employee > dailyWorker =
    Map.of(
        DayOfWeek.MONDAY , alice ,
        DayOfWeek.TUESDAY , bob ,
        DayOfWeek.WEDNESDAY , bob ,
        DayOfWeek.THURSDAY , alice ,
        DayOfWeek.FRIDAY , carol ,
        DayOfWeek.SATURDAY , carol ,
        DayOfWeek.SUNDAY , carol
    )
;
Premature Optimization
As others noted, you are likely falling into the trap of premature optimization. Your app's performance is not likely to be impacted significantly by your choice of small map implementation.
Other considerations
There are other considerations you should likely be focusing on, such as concurrency, tolerance for NULLs, and iteration-order.
Here is a graphic chart I made listing those aspects of the ten implementations of Map bundled with Java 11.
I agree with @hvgotcodes that it is premature optimization, but it is still good to know all the tools in the toolbox.
If you iterate a lot over the contents of a map, a LinkedHashMap is usually quite a lot faster than a HashMap. If you have a lot of threads working with the Map at the same time, a ConcurrentHashMap is often a better choice. I wouldn't worry about any Map implementation being inefficient for small sets of data. It is typically the other way around: an incorrectly constructed map easily gets inefficient with large amounts of data if you have bad hash values or if something causes it to have too few buckets for its load.
Then of course there are cases when a HashMap makes no sense at all, like if you have three values which you will always index with the keys 0, 1 and 2 but I assume you understand that :-)
HashMap uses more or less memory (when created) depending on how you initialize it: more buckets mean more memory usage, but faster access for large numbers of items; if you need only a small number of items you can initialize it with a small value, which will produce fewer buckets that will still be fast (since each will receive only a few items). There is no waste of memory if you set it correctly (the tradeoff is basically memory usage vs speed).
As for heap fragmentation and GC cycle wasting and whatnot, there is not much that a Map implementation can do about them; it all falls back to how you set it. Understand that this is not about Java's implementation, but the fact that generic (as in, for example, cannot assume anything about key values like EnumMap does) hashtables (not HashTables) are the best possible implementations of a map structure.
Android has an ArrayMap with the intent of minimizing memory. In addition to being in the core, it's in the v4 support library, which, theoretically, should be able to compile for the Oracle or OpenJDK JREs as well. Here is a link to the source of ArrayMap in a fork of the v4 support library on github.
There is an alternative called AirConcurrentMap that is more memory efficient above 1K Entries than any other Map I have found, and is faster than ConcurrentSkipListMap for key-based operations and faster than any Map for iterations, and has an internal thread pool for parallel scans. It is an ordered i.e. NavigableMap and a ConcurrentMap. It is free for non-commercial no-source use, and commercially licensed with or without source. See boilerbay.com for graphs. Full disclosure: I am the author.
AirConcurrentMap conforms to the standards so it is plug-compatible everywhere, even for a regular Map.
Iterators are already very fast especially over 1K Entries. The higher-speed scans use a 'visitor' model with a single visit(k, v) callback that reaches the speed of Java 8 parallel streams. The AirConcurrentMap parallel scan exceeds Java 8 parallel streams by about 4x. The threaded visitor adds split() and merge() methods to the single-thread visitor that remind one of map/reduce:
static class ThreadedSummingVisitor<K> extends ThreadedMapVisitor<K, Long> {
    private long sum;
    // This is idiomatic
    long getSum(VisitableMap<K, Long> map) {
        sum = 0;
        map.getVisitable().visit(this);
        return sum;
    }
    @Override
    public void visit(Object k, Long v) {
        sum += ((Long)v).longValue();
    }
    @Override
    public ThreadedMapVisitor<K, Long> split() {
        return new ThreadedSummingVisitor<K>();
    }
    @Override
    public void merge(ThreadedMapVisitor<K, Long> visitor) {
        sum += ((ThreadedSummingVisitor<K>)visitor).sum;
    }
}
...
// The threaded summer can be re-used in one line now.
long sum = new ThreadedSummingVisitor().getSum((VisitableMap)map);
I also was interested and just for an experiment I created a map which stores keys and values just in fields and allows up to 5 entries. It consumes 4 less memory and works 16 times faster than HashMap
https://github.com/stokito/jsmallmap
I would like to find and reuse (if possible) a map implementation which has the following attributes:
While the number of entries is small, say < 32, the underlying storage should be a single array like this: [key0, val0, key1, val1, ...]. This storage scheme avoids many small Entry objects and provides extremely fast lookups (even though they are sequential scans!) on modern CPUs, because the CPU's cache is not invalidated and there is no pointer indirection into the heap.
The map should maintain insertion order for key/value pairs regardless of the number of entries similar to LinkedHashMap
We are working on an in-memory representations of huge (millions of nodes/edges) graphs in Scala and having such a Map would allow us to store Node/Edge attributes as well as Edges per node in a much more efficient way for 99%+ of Nodes and Edges which have few attributes or neighbors while preserving chronological insertion order for both attributes and edges.
If anyone knows of a Scala or Java map with such characteristics I would be much obliged.
Thanx
While I'm not aware of any implementations that exactly fit your requirements, you may be interested in peeking at Flat3Map (source) in the Jakarta Commons library.
Unfortunately, the Jakarta libraries are rather outdated (e.g., no support for generics in the latest stable release, although it is promising to see that this is changing in trunk) and I usually prefer the Google Collections, but it might be worth your time to see how Apache implemented things.
Flat3Map doesn't preserve the order of the keys, unfortunately, but I do have a suggestion in regard to your original post. Instead of storing the keys and values in a single array like [key0, val0, key1, val1, ...], I recommend using parallel arrays; that is, one array with [key0, key1, ...] and another with [val0, val1, ...]. Normally I am not a proponent of parallel arrays, but at least this way you can have one array of type K, your key type, and another of type V, your value type. At the Java level, this has its own set of warts as you cannot use the syntax K[] keys = new K[32]; instead you'll need to use a bit of typecasting.
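To make the parallel-array idea and the typecasting wart concrete, here is a rough sketch. ParallelArrayMap and its methods are invented for this example; it keeps insertion order, rejects null keys, and leaves out removal.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical sketch: insertion-ordered map on two parallel arrays.
class ParallelArrayMap<K, V> {
    private K[] keys;
    private V[] values;
    private int size;

    @SuppressWarnings("unchecked")
    ParallelArrayMap(int initialCapacity) {
        // "new K[n]" does not compile, so allocate Object[] and cast.
        // This is safe as long as the arrays never leave this class.
        keys = (K[]) new Object[initialCapacity];
        values = (V[]) new Object[initialCapacity];
    }

    V get(Object key) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null;
    }

    void put(K key, V value) {
        Objects.requireNonNull(key, "key");
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value; // overwrite, position is unchanged
                return;
            }
        }
        if (size == keys.length) {
            // Grow both arrays together.
            int newLength = Math.max(1, size * 2);
            keys = Arrays.copyOf(keys, newLength);
            values = Arrays.copyOf(values, newLength);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    // Keys come back in insertion order: index 0 is the oldest key.
    K keyAt(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return keys[index];
    }

    int size() {
        return size;
    }
}
```

Beyond a few dozen entries you would switch to a hashed structure, as the question already suggests.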
Have you measured with a profiler whether LinkedHashMap is too slow for you? Maybe you don't need that new map; premature optimization is the root of all evil.
Anyway, for processing millions or more pieces of data per second, even the best-optimized map can be too slow, because in such cases every method call costs performance as well. Then all you can do is rewrite your algorithms from Java collections to arrays (i.e. int -> object maps).
In Java you can maintain a 2D array (like a spreadsheet). I wrote a program which basically defines a 2D array with 3 columns of data and 3 columns for looking up the data. The three columns are testID, SubtestID and Mode. This allows me to look up a value by testID and Mode or any combination, or to reference it by static position. The table is loaded into memory at startup and referenced by the program. It is infinitely expandable and new values can be added as needed.
If you are interested, I can post a code source example tonight.
Another idea may be to maintain a database in your program. Databases are designed to organize large amounts of data.
Would a hashtable/hashmap use a lot of memory if it only consists of object references and int's?
For a school project we had to map a database to objects (which is what ORMs such as Hibernate do nowadays). Eager to find a good way to avoid storing IDs in the objects just so we could save them again, we thought of putting all the objects we created in a hashmap/hashtable, so we could easily retrieve each object's ID. My question is whether this, in my opinion more elegant, way of solving the problem would cost me performance.
Would a hashtable/hashmap use a lot of memory if it only consists of object references and int's?
"a lot" depends on how many objects you have. For a few hundreds or a few thousands, you're not going to notice.
But typically the default Java collections are really incredibly inefficient when you're working with primitives (because of the constant boxing/unboxing between primitive and wrapper, like int to Integer), from both a performance and a memory standpoint (the two being related but not identical).
If you have a lot of entries, like hundreds of thousands or millions, I suggest using for example the Trove collections.
In your case, you'd use this:
TIntObjectHashMap<SomeJavaClass>
or this:
TObjectIntHashMap<SomeJavaClass>
In any case, that will run circles around the default Java collections performance-wise and CPU-wise (and it will trigger far less GC, etc.).
You're dodging the unnecessary automatic (un)boxing from/to int/Integer, the collections are creating way less garbage, resizing in a much smarter way, etc.
Don't even get me started on the default Java HashMap<Integer,Integer> compared to Trove's TIntIntHashMap or I'll go berserk ;)
Minimally, you'd need an implementation of the Map.Entry interface with a reference to the key object and a reference to the value object. If either the key or the value is a primitive type, such as int, you'll need a wrapper type (e.g. Integer) to wrap it as well. The Map.Entrys are stored in an array and allocated in blocks.
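To show what avoiding boxing means structurally, here is a bare-bones int -> int map on two primitive arrays. This is not Trove's code or algorithm; it is a fixed-capacity sketch with linear probing and invented names, and it reserves Integer.MIN_VALUE as the "empty slot" marker.

```java
import java.util.Arrays;

// Hypothetical sketch, not Trove: no Integer objects, no Entry objects.
class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved key
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    IntIntMap(int expectedEntries) {
        // Power-of-two capacity, at least twice the expected entry count.
        int capacity = 4;
        while (capacity < expectedEntries * 2) {
            capacity <<= 1;
        }
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slot(key);
        // Linear probing: step to the next slot until we find the key
        // or an empty slot.
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (keys[i] == EMPTY) {
            // Always keep one slot empty so that probing terminates.
            if (size == mask) {
                throw new IllegalStateException("map is full");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int missingValue) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return missingValue;
    }

    int size() {
        return size;
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // cheap multiplicative scramble
        return (h ^ (h >>> 16)) & mask;
    }
}
```

A real library adds resizing, removal, and iteration on top of this, which is a good reason to use one rather than maintain your own.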
Take a look at this question for more information on how to measure your memory consumption in Java.
It's impossible to answer this without some figures. How many objects are you looking to store? Don't forget you're storing the objects already, so the key/object reference combination should be fairly small.
The only sensible thing to do is to try this and see if it works for you. Don't forget that the JVM will have a default maximum memory allocation and you can increase this (if you need) via -Xmx
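A crude way to "try this and see" is to compare the used heap before and after building the structure. This is a sketch with made-up names; the numbers are only approximate because the garbage collector can run at any time, so treat the output as a rough indication and prefer a profiler for anything serious.

```java
import java.util.HashMap;
import java.util.Map;

class HeapReport {
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // maxMemory() reflects the limit configured with -Xmx.
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap: " + maxMiB + " MiB");

        long before = usedHeapBytes();
        Map<Object, Integer> ids = new HashMap<>();
        for (int i = 0; i < 1_000_000; i++) {
            ids.put(new Object(), i); // object reference -> boxed int
        }
        long after = usedHeapBytes();

        System.out.println("entries: " + ids.size());
        System.out.println("approx. bytes per entry: " + (after - before) / ids.size());
    }
}
```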
Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
    private final int field1;
    private final int field2;
    private final long field3;
    /* constructor & accessors here */
}
List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:
public class OptimizedRecordStore {
    private final int[] field1;
    private final int[] field2;
    private final long[] field3;
    Record getRecord(int i) { return new Record(field1[i], field2[i], field3[i]); }
    /* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
The answers that give the general "optimise when you have to" advice are unhelpful in this case because, IMHO, programmers (particularly API writers) should always be aware of the performance of different design choices when a choice leads to an order-of-magnitude performance penalty.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.
The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?
I suspect it's the latter in which case the former makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.
If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e. g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; anyway, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability has to be checked (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in the real situation, one should run a benchmark oneself.
Serialization speed: the speed of (de)serializing a large array of primitives on disk is only limited by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
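The memory figures above can be turned into a back-of-the-envelope estimate for 10^6 records. This sketch only multiplies the per-record numbers quoted in this answer (28, 36 and 40 bytes per object plus reference, 16 bytes of payload); it measures nothing itself, and the class and method names are made up.

```java
class RecordFootprint {
    // Payload of one record: int + int + long.
    static final int PAYLOAD_BYTES = 4 + 4 + 8;

    // Second approach: three primitive arrays, so only the payload counts
    // (the three array headers are ignored as negligible).
    static long structureOfArraysBytes(long records) {
        return records * PAYLOAD_BYTES;
    }

    // First approach: one object plus one reference per record.
    static long listOfObjectsBytes(long records, int bytesPerObjectAndReference) {
        return records * bytesPerObjectAndReference;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("arrays of primitives:        " + structureOfArraysBytes(n));
        System.out.println("objects, 32-bit JVM:         " + listOfObjectsBytes(n, 28));
        System.out.println("objects, 64-bit compressed:  " + listOfObjectsBytes(n, 36));
        System.out.println("objects, 64-bit uncompressed: " + listOfObjectsBytes(n, 40));
    }
}
```

At a million records that is roughly 16 MB against 28 to 40 MB, which is where the "about two times more memory" estimate comes from.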
How are you going to access the data? If accesses to the fields are always coupled, use the first option; if you are going to process each field on its own, the second option is better.
See this article in wikipedia: Parallel Array
A good example of when it's more convenient to have separate arrays is a simulation, where the numerical data is packed together in one array and other attributes like name, colour, etc., which are accessed only for presentation of the data, are kept in another array.
I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
Notice that the second approach might have negative impact on caching behaviour. If you want to access a single record at a time, you'd better have that record not scattered all across the place.
Also, the only memory you win in the second approach is (possibly) due to member alignment (and not having to allocate a separate object per record).
Otherwise, they have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO
Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimised the number of objects floating around in your system, as instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C-style, and it was far too slow using objects.
I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to be accessed, but all data is needed "just in case".
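A rough sketch of both possibilities in one class, assuming the record layout from the question. CachedRecordStore, its nested Record, and sumField3 are made-up names: getRecord is the "hybrid" option, sumField3 shows the "extreme" style of working on a column directly.

```java
import java.util.HashMap;
import java.util.Map;

class CachedRecordStore {
    // Immutable record, mirroring the question's Record class.
    static final class Record {
        final int field1;
        final int field2;
        final long field3;

        Record(int field1, int field2, long field3) {
            this.field1 = field1;
            this.field2 = field2;
            this.field3 = field3;
        }
    }

    private final int[] field1;
    private final int[] field2;
    private final long[] field3;
    // Only indexes that were actually requested end up in the cache.
    private final Map<Integer, Record> cache = new HashMap<>();

    CachedRecordStore(int[] field1, int[] field2, long[] field3) {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }

    // Hybrid: build the Record on first access, then reuse it.
    Record getRecord(int i) {
        return cache.computeIfAbsent(i, k -> new Record(field1[k], field2[k], field3[k]));
    }

    // Extreme: operate on the column without creating any objects.
    long sumField3() {
        long sum = 0;
        for (long v : field3) {
            sum += v;
        }
        return sum;
    }
}
```

Note that the cache itself boxes its Integer keys and adds an entry object per cached record, so it only pays off when a small fraction of the records is ever requested.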
(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works and you learn a lot about low level decisions. I did not, however, allow it to impact the higher level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one time -vs- three times... one time will be faster than three. From a higher point of view the one time is simpler to understand than the three times, thus it is more likely to be correct.
Personally I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer then I can find out where the system is slow and replace those parts of it with faster implementations.
Because you are making the int[] fields final, you are stuck with just the one initialization of the array and that is it. Thus, if you wanted 10^6 field1's, Java would need to set aside that much memory for each of those int[], because you cannot change the size of those arrays. With an ArrayList, if you do not know the number of records beforehand and will potentially be removing records, you save a lot of space upfront and then later on as well when you go to remove records.
I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to the values? What is the scenario behind your question?
Edit You could also use a common long[][] matrix.
I don't know how you pass the columns to MATLAB, but I guess you won't gain much speed with column-based storage; more likely you will lose speed in the Java computation.