I'm trying to load large CSV-formatted files (typically 200-600 MB) efficiently with Java (low memory use and fast access). Currently, the program uses a List of String arrays. The same job was previously handled by a Lua program using a table for each CSV row and a table to hold each "row" table.
Below is an example of the memory differences and load times:
CSV file - 232 MB
Lua - 549 MB in memory - 157 seconds to load
Java - 1,378 MB in memory - 12 seconds to load
If I remember correctly, duplicate strings in a Lua table exist as references to a single actual value. I suspect that in the Java version the List is holding a separate copy of each duplicate value, which may explain the larger memory usage.
Below is some background on the data within the CSV files:
Each field consists of a String
Specific fields within each row may include one of a set of Strings (E.g. field 3 could be "red", "green", or "blue").
There are many duplicate Strings within the content.
Below are some examples of what may be required of the loaded data:
Search through all Strings attempting to match with a given String and return the matching Strings
Display matches in a GUI table (sortable by field).
Alter or replace Strings.
My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?
One easy solution: keep a HashMap in which you put a reference to each unique string, and in your ArrayList store only references to the existing unique strings from that HashMap.
Something like :
private HashMap<String, String> hashMap = new HashMap<String, String>();

public String getUniqueString(String ns) {
    String oldValue = hashMap.get(ns);
    if (oldValue != null) { // I suppose there will be no null strings inside the CSV
        return oldValue;
    }
    hashMap.put(ns, ns);
    return ns;
}
Simple usage:
List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    String ns = getUniqueString(er); // the method defined above, called on whatever object holds the hashMap
    finS.add(ns);
}
Maybe this article can be of some help :
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
DAWG
A directed acyclic word graph is the most efficient way to store words (best for memory consumption, anyway).
But it is probably overkill here; as others have said, don't create duplicates, just keep multiple references to the same instance.
To address your memory problem I advise using the Flyweight pattern, especially for fields that have a lot of duplicates.
As a collection you can use a TreeSet or TreeMap.
If you give your LineItem class a good implementation (implement equals, hashCode, and Comparable) you can reduce the memory use a lot.
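As a rough illustration (not a drop-in implementation), here is a minimal sketch of that idea: the field values are deduplicated through a shared pool (the Flyweight part), and the class implements equals, hashCode, and Comparable so it can live in a TreeSet. The field names and the pooling helper are my assumptions, not something from the question.

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class LineItem implements Comparable<LineItem> {
    // Shared pool: equal strings are stored once and reused (Flyweight).
    private static final Map<String, String> POOL = new HashMap<>();

    private static String pooled(String s) {
        return POOL.computeIfAbsent(s, k -> k);
    }

    private final String id;
    private final String color; // e.g. "red", "green" or "blue"

    LineItem(String id, String color) {
        this.id = pooled(id);
        this.color = pooled(color);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LineItem)) return false;
        LineItem other = (LineItem) o;
        return id.equals(other.id) && color.equals(other.color);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, color);
    }

    @Override
    public int compareTo(LineItem other) {
        int c = id.compareTo(other.id);
        return c != 0 ? c : color.compareTo(other.color); // ordering used by TreeSet/TreeMap
    }
}

A TreeSet<LineItem> then keeps the rows sorted for display, while the pool keeps only one copy of each duplicate string.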
Just as a side note on the duplicate strings you suspect: Java interns string literals, so identical literals share one object in memory, but strings read from a file at runtime are separate objects unless you intern or deduplicate them yourself.
I'm not sure how Lua does the job, but in Java you need to handle that deduplication explicitly (as the other answers suggest) for it to be equally efficient.
Related
For one of my school assignments, I have to parse GenBank files using Java. I have to store and retrieve the content of the files together with the extracted information while keeping the time complexity as small as possible. Is there a difference between using HashMaps and storing the data as records? I know that using HashMaps would be O(1), but the readability and immutability of records leads me to prefer them instead. The objects will be stored in an array.
This is my approach now:
public static GenBankRecord parseGenBankFile(File gbFile) throws IOException {
    try (var fileReader = new FileReader(gbFile); var reader = new BufferedReader(fileReader)) {
        String organism = null;
        List<String> contentList = new ArrayList<>();
        while (true) {
            String line = reader.readLine();
            if (line == null) break; // Break out if the end of the file has been reached
            contentList.add(line);
            if (line.startsWith("  ORGANISM  ")) {
                // Organism line found
                organism = line.substring(12); // Select the part of the line after the label
            }
        }
        // Loop ended
        var content = String.join("\n", contentList);
        return new GenBankRecord(gbFile.getName(), organism, content);
    }
}
with GenBankRecord being the following:
record GenBankRecord(String fileName, String organism, String content) {
    @Override
    public String toString() {
        return organism;
    }
}
Is there a difference between using a record and a HashMap, assuming the key-value pairs are the same as the fields of the record?
String current_organism = gbRecordInstance.organism();
and
String current_organism = gbHashMap.get("organism");
I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible.
Firstly, I am somewhat doubtful that your teachers actually stated the requirements like that. It doesn't make a lot of sense to optimize just for time complexity.
Complexity is not efficiency.
Big O complexity is not about the value of the measure (e.g. time taken) itself. It is actually about how the measure (e.g. time taken) changes as some variable gets very large.
For example, HashMap.get(nameStr) and someRecord.name are both O(1) complexity.
But they are not equivalent in terms of efficiency. Using Java 17 record types or regular Java classes with named fields will be orders of magnitude faster than using a HashMap. (And it will use orders of magnitude less memory.)
Assuming that your objects have a fixed number of named fields, the complexity (i.e. how the performance changes as the number of fields grows very large) is not even relevant.
Performance is not everything.
The main differences between a HashMap and a record class are actually in the functionality they provide:
A Map<String, SomeType> provides a set of name/value pairs where:
the number of pairs in the set is not fixed
the names are not fixed
the types of the values are all instances of SomeType or a subtype.
A record (or a classic class) can be viewed as a set of field-name/value pairs where:
the number of pairs is fixed at compile time
the field names are fixed at compile time
the field types don't have to be subtypes of any single given type.
As @Louis Wasserman commented:
Records and HashMap are apples and oranges -- it doesn't really make sense to compare them.
So really, you should be choosing between records and hashmaps by comparing the functionality / constraints that they provide versus what your application actually needs.
(The problem description in your question is not clear enough for us to make that judgement.)
Efficiency concerns may be relevant, but it is a secondary concern. (If the code doesn't meet functional requirements, efficiency is moot.)
Is Complexity relevant to your assignment?
Well ... maybe yes. But not in the area that you are looking at.
My reading of the requirements is that one of them is that you be able to retrieve information from your in-memory data structures efficiently.
But so far you have been thinking about storing individual records. Retrieval implies that you have a collection of records and you have to (efficiently) retrieve a specific record, or maybe a set of records matching some criteria. So that implies you need to consider the data structure to represent the collection.
Suppose you have a collection of N records (or whatever) representing (say) N organisms:
If the collection is a List<SomeRecord>, you need to iterate the list to find the record for (say) "cat". That is O(N).
If the collection is a HashMap<String, SomeRecord> keyed by the organism name, you can find the "cat" record in O(1).
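For example, a small sketch of the second option, reusing the GenBankRecord from the question; keying by the organism name is an assumption about what you need to look up:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Builds an index over already-parsed records; lookup by organism is then O(1)
// instead of an O(N) scan over a List<GenBankRecord>.
static Map<String, GenBankRecord> indexByOrganism(List<GenBankRecord> records) {
    Map<String, GenBankRecord> byOrganism = new HashMap<>();
    for (GenBankRecord r : records) {
        byOrganism.put(r.organism(), r); // last record wins if an organism repeats
    }
    return byOrganism;
}

// Usage: GenBankRecord cat = indexByOrganism(records).get("Felis catus");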
I have an object that is identified by 3 fields. One of them is a String that represents 6 hex bytes; the other two are integers of not more than 1 byte each. Summed up, this is 8 bytes of data, which fits in a 64-bit integer.
I need to map these objects for fast access, and I can think of two approaches:
Use the 3 fields to generate a 64-bit key used to map the objects. This, however, would mean parsing the String to hex for every access (and there will be a lot of accesses, which need to be fast).
Use 3 HashMap levels, each nested inside the next, to represent the 3 identifying fields.
My question is which of these approaches should be the fastest.
Why not use a MultiKeyMap?
This might not be related to your question, but I have a suggestion for you.
Create an object with the 3 attributes that form the key, and use that object as the key, because it will be unique.
Map<ObjectKey,Object> map = new HashMap<>();
Does this make sense for your use case? If you can add a bit more explanation, maybe I can go further in suggesting possible solutions.
EDIT: You can override equals (and hashCode, since the object will be used as a HashMap key) with this kind of logic:
@Override
public boolean equals(Object obj) {
    if (!(obj instanceof ObjectKey))
        return false;
    ObjectKey objectKey = (ObjectKey) obj;
    return this.key1.equals(objectKey.key1) && this.key2.equals(objectKey.key2) &&
           ...
           this.keyN.equals(objectKey.keyN);
}
I would take the following steps:
Write it in the most readable way first, and profile it.
Refactor it to an implementation you think might be faster, then profile it again.
Compare.
Repeat.
Your key fits into a 64-bit value. Assuming you will build the HashMap in one go and then read from it multiple times (using it as a lookup table), my hunch is that using a Long type as the key of your HashMap will be about as fast as you can get.
You are concerned about having to parse the string as a hex number every time you look up a key in the map. What's the alternative? If you use a key containing the three separate fields, you will still have to parse the string to calculate its hash code (or, rather, the Java API implementation will calculate its hash code by parsing the string contents). The HashMap will not only call String.hashCode() but also String.equals(), so your string will be iterated twice. By contrast, calculating a Long and comparing it to the precalculated keys in the HashMap will consist of iterating the string only once.
If you use three levels of HashMap, as per your second suggestion, you will still have to calculate the hash code of your string, as well as having to look up the values of all three fields anyway, so the multi-level map doesn't give you any performance advantage.
You should also experiment with the HashMap constructor arguments to get the most efficiency. These will determine how efficiently your data will get spread into separate buckets.
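For what it's worth, a sketch of the single-long key, assuming the hex string has exactly 12 hex digits (6 bytes) and both integers fit in one byte; the names are illustrative:

// Packs the 6 hex bytes and the two one-byte ints into one 64-bit key:
// 48 bits for the hex id, 8 bits for each of the two small fields.
static long packKey(String hexId, int a, int b) {
    long id = Long.parseLong(hexId, 16); // parse the string once per lookup
    return (id << 16) | ((a & 0xFFL) << 8) | (b & 0xFFL);
}

// Usage as a lookup table (the value type is whatever your object is):
// Map<Long, MyObject> byKey = new HashMap<>();
// byKey.put(packKey("0a1b2c3d4e5f", 17, 3), obj);
// MyObject found = byKey.get(packKey("0a1b2c3d4e5f", 17, 3));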
I want to store huge amounts of Strings in a Map<String, MagicObject>, so that the MagicObjects can be accessed quickly. There are so many entries to this Map that memory is becoming a bottleneck. Assuming the MagicObjects can't be optimized, what is the most efficient type of map I could use for this situation? I am currently using the following:
gnu.trove.map.hash.TCustomHashMap<byte[], MagicObject>
If your keys are long enough and share long enough common prefixes, then you can save memory by using a trie (prefix tree) data structure. Answers to this question point to a couple of Java implementations of tries.
As food for thought, consider Huffman coding to compress your strings before putting them in the map, as long as your strings are fixed (their number and content don't change).
I'm a little late to this party but this question came up in a related search and piqued my interest. I don't usually answer Java questions.
There are so many entries to this Map that memory is becoming a bottleneck.
I doubt it.
For the storage of strings in memory to become a bottleneck you need an awfully large number of unique strings[1]. To put things into perspective, I recently worked with a 1.8m word dictionary (1.8m unique English words) and they took up around 1.6MB in RAM at runtime.
If you used every word in the dictionary as a key you'll still only use 1.6MB of RAM[2] to store the keys, hence memory cannot be your bottleneck.
What I suspect you are experiencing is the O(n^2) performance of string matching. By this I mean that as more keys are added, performance degrades quadratically[3]. This is unavoidable if you are using strings as keys.
If you want to speed things up a bit, store each key into a hashtable that doesn't store duplicates and use the hash key as the key to your map.
NOTES:
[1] I'm assuming the strings are all unique or else you would not attempt to use them as a key into a map.
[2] Even if Java uses 2 bytes per character, it still only comes to 3.2MB of memory, total.
[3] It slows down even more if you choose the wrong data structure, such as an unbalanced binary tree, to store your values. I don't know how the map stores values internally, but an unbalanced binary tree degrades to O(n) lookups - pretty much the worst performance you can find.
I have 2 sets of data.
Let's say one is people and the other is groups.
A person can be in multiple groups while a group can have multiple people.
My operations will basically be CRUD on groups and people,
as well as a method that makes sure a list of people are all in different groups (which gets called a lot).
Right now I'm thinking of making a table of binary 0s and 1s, with a row for each person and a column for each group.
I can perform the check in O(n) time by adding the rows as binary numbers and comparing the result with the bitwise OR of the same rows.
E.g.

Group   A  B  C  D
ppl1    1  0  0  1
ppl2    0  1  1  0
ppl3    0  0  1  0
ppl4    0  1  0  0

check(ppl1, ppl2) = (1001 + 0110) == (1001 | 0110)
                  = 1111 == 1111
                  = true
check(ppl2, ppl3) = (0110 + 0010) == (0110 | 0010)
                  = 1000 == 0110
                  = false
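For illustration, a minimal sketch of this check with one long bitmask per person (this assumes at most 64 groups; the names are made up):

// The sum equals the bitwise OR only if no group bit is set by more than one person.
static boolean inDifferentGroups(long... memberships) {
    long sum = 0, or = 0;
    for (long m : memberships) {
        sum += m;
        or |= m;
    }
    return sum == or;
}

// check(ppl1, ppl2): inDifferentGroups(0b1001L, 0b0110L) -> true
// check(ppl2, ppl3): inDifferentGroups(0b0110L, 0b0010L) -> false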
I'm wondering if there is a data structure that does something similar already so I don't have to write my own and maintain O(n) runtime.
I don't know all of the details of your problem, but my gut instinct is that you may be overthinking things here. How many objects are you planning on storing in this data structure? If you have really large amounts of data to store here, I would recommend that you use an actual database instead of a data structure. The type of operations you are describing here are classical examples of things that relational databases are good at. MySQL and PostgreSQL are examples of large scale relational databases that could do this sort of thing in their sleep. If you'd like something lighter-weight, SQLite would probably be of interest.
If you do not have large amounts of data that you need to store in this data structure, I'd recommend keeping it simple and only optimizing when you are sure it won't be fast enough for what you need to do. As a first shot, I'd just recommend using Java's built-in List interface to store your people and a Map to store groups. You could do something like this:
// Use a list to keep track of People
List<Person> myPeople = new ArrayList<Person>();
Person steve = new Person("Steve");
myPeople.add(steve);
myPeople.add(new Person("Bob"));
// Use a Map to track Groups
Map<String, List<Person>> groups = new HashMap<String, List<Person>>();
groups.put("Everybody", myPeople);
groups.put("Developers", Arrays.asList(steve));
// Does a group contain everybody?
groups.get("Everybody").containsAll(myPeople); // returns true
groups.get("Developers").containsAll(myPeople); // returns false
This definitely isn't the fastest option available, but if you do not have a huge number of People to keep track of, you probably won't even notice any performance issues. If you do have some special conditions that would make the speed of using regular Lists and Maps unfeasible, please post them and we can make suggestions based on those.
EDIT:
After reading your comments, it appears that I misread your issue on the first run through. It looks like you're not so much interested in mapping groups to people, but instead mapping people to groups. What you probably want is something more like this:
Map<Person, List<String>> associations = new HashMap<Person, List<String>>();
Person steve = new Person("Steve");
Person ed = new Person("Ed");
associations.put(steve, Arrays.asList("Everybody", "Developers"));
associations.put(ed, Arrays.asList("Everybody"));
// This is the tricky part
boolean sharesGroups = checkForSharedGroups(associations, Arrays.asList(steve, ed));
So how do you implement the checkForSharedGroups method? In your case, since the numbers surrounding this are pretty low, I'd just try out the naive method and go from there.
public boolean checkForSharedGroups(
        Map<Person, List<String>> associations,
        List<Person> peopleToCheck) {
    List<String> groupsThatHaveMembers = new ArrayList<String>();
    for (Person p : peopleToCheck) {
        List<String> groups = associations.get(p);
        for (String s : groups) {
            if (groupsThatHaveMembers.contains(s)) {
                // We've already seen this group, so we can return
                return false;
            } else {
                groupsThatHaveMembers.add(s);
            }
        }
    }
    // If we've made it to this point, nobody shares any groups.
    return true;
}
This method probably doesn't have great performance on large datasets, but it is very easy to understand. Because it's encapsulated in its own method, it should also be easy to update if it turns out you need better performance. If you do need to increase performance, I would look at overriding the equals and hashCode methods of Person, which would make lookups in the associations map work correctly and quickly. From there you could also look at a custom type instead of String for groups, also with an overridden equals method. This would considerably speed up the contains method used above.
The reason why I'm not too concerned about performance is that the numbers you've mentioned aren't really that big as far as algorithms are concerned. Because this method returns as soon as it finds two matching groups, in the very worst case you will call ArrayList.contains a number of times equal to the number of groups that exist. In the very best case it only needs to be called twice. Performance will likely only be an issue if you call checkForSharedGroups very, very often, in which case you might be better off finding a way to call it less often instead of optimizing the method itself.
Have you considered a hash table? If you know all of the keys you'll be using, it's possible to use a perfect hash function, which will allow you to achieve constant-time lookups.
How about having two separate entities for People and Group? Inside People keep a Set of Group, and vice versa.
class People {
    Set<Group> groups;
    // API for addGroup, getGroup
}

class Group {
    Set<People> people;
    // API for addPeople, getPeople
}
check(People p1, People p2):
1) call getGroup on both p1 and p2
2) check the sizes of both sets
3) iterate over the smaller set and check whether each of its groups is present in the other set (a sketch follows below)
Now you can basically store the People objects in any data structure: preferably a linked list if the size is not fixed, otherwise an array.
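A sketch of that check, assuming getGroup() returns the Set<Group> of a person (the exact accessor signature is an assumption):

import java.util.Set;

// Returns true if p1 and p2 share no group: probe the larger set with the smaller one.
// Group should override equals/hashCode (or rely on identity) for contains to work as expected.
static boolean inDifferentGroups(People p1, People p2) {
    Set<Group> g1 = p1.getGroup();
    Set<Group> g2 = p2.getGroup();
    Set<Group> smaller = g1.size() <= g2.size() ? g1 : g2;
    Set<Group> larger = (smaller == g1) ? g2 : g1;
    for (Group g : smaller) {
        if (larger.contains(g)) {
            return false; // shared group found
        }
    }
    return true;
}

With HashSets this costs roughly O(min(|g1|, |g2|)) per pair of people.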
Say I have a Hashtable<String, Object> with such keys and values:
apple => 1
orange => 2
mossberg => 3
I can use the standard get method to get 1 by "apple", but what I want is to get the same value (or a list of values) by a part of the key, for example "ppl". Of course it may yield several results; in that case I want to be able to process each key-value pair. So basically it's similar to the LIKE '%ppl%' SQL statement, but I don't want to use an (in-memory) database just because I don't want to add unnecessary complexity. What would you recommend?
Update:
Storing data in a Hashtable isn't a requirement. I'm looking for a general approach to solving this.
The obvious brute-force approach would be to iterate through the keys in the map and match them against the char sequence. That could be fine for a small map, but of course it does not scale.
This could be improved by using a second map to cache search results. Whenever you collect a list of keys matching a given char sequence, you can store these in the second map so that next time the lookup is fast. Of course, if the original map is changed often, it may get complicated to update the cache. As always with caches, it works best if the map is read much more often than changed.
Alternatively, if you know the possible char sequences in advance, you could pre-generate the lists of matching strings and pre-fill your cache map.
Update: Hashtable is not recommended anyway - it is synchronized, and thus much slower than it should be. You are better off using HashMap if no concurrency is involved, or ConcurrentHashMap otherwise. The latter outperforms a Hashtable by far.
Apart from that, off the top of my head I can't think of a better collection for this task than maps. Of course, you may experiment with different map implementations to find the one that best suits your specific circumstances and usage patterns. In general, it would thus be:
Map<String, Object> fruits;
Map<String, List<String>> matchingKeys;
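Building on those two fields, a rough sketch of the cached lookup (the method name is mine, and the usual java.util imports are assumed; remember to clear matchingKeys whenever fruits changes):

List<String> keysContaining(String part) {
    return matchingKeys.computeIfAbsent(part, p -> {
        List<String> matches = new ArrayList<>();
        for (String key : fruits.keySet()) {
            if (key.contains(p)) { // brute-force O(n) scan over all keys
                matches.add(key);
            }
        }
        return matches;
    });
}

// keysContaining("ppl") -> ["apple"]; fetch each value with fruits.get(key)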
Not without iterating through explicitly. Hashtable is designed to go from an (exact) key to a value in O(1), nothing more, nothing less. If you will be doing query operations on large amounts of data, I recommend you consider a database. You can use an embedded one like SQLite (see SQLiteJDBC) so no separate process or installation is required. You then have the option of database indexes.
I know of no standard Java collection that can do this type of operation efficiently.
Sounds like you need a trie with references to your data. A trie stores strings and lets you search for strings by prefix. I don't know the Java standard library too well and I have no idea whether it provides an implementation, but one is available here:
http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java
Unfortunately, a trie only lets you search by prefixes. You can work around this by storing every possible suffix of each of your keys:
For 'apple', you'd store the strings
'apple'
'pple'
'ple'
'le'
'e'
This would allow you to search for every prefix of every suffix of your keys, i.e. every substring.
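Generating those suffixes is straightforward; a tiny helper for it (the method name is mine) might look like:

import java.util.ArrayList;
import java.util.List;

// All suffixes of a key, e.g. "apple" -> ["apple", "pple", "ple", "le", "e"].
static List<String> suffixes(String key) {
    List<String> result = new ArrayList<>(key.length());
    for (int i = 0; i < key.length(); i++) {
        result.add(key.substring(i));
    }
    return result;
}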
Admittedly, this is the kind of "solution" that would prompt me to continue looking for other options.
First of all, use HashMap, not Hashtable.
Then you can filter the map using a predicate, with the utilities in Google Guava:
public Collection<Object> getValues() {
    Map<String, Object> filtered = Maps.filterKeys(map, new Predicate<String>() {
        @Override
        public boolean apply(String key) {
            return key.contains("ppl"); // keep keys containing the search fragment
        }
    });
    return filtered.values();
}
Can't be done in a single operation
You may want to try to iterate the keys and use the ones that contain your desired string.
The only solution I can see (I'm not a Java expert) is to iterate over the keys and check them against a regular expression. If a key matches, you put the key-value pair into a new hashtable that will be returned.
If you can somehow reduce the problem to searching by prefix, you might find a NavigableMap helpful.
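For example, a sketch with a TreeMap, using the common trick of bounding the range with the prefix plus a very large character (this only helps if you can reduce the search to a prefix):

import java.util.NavigableMap;
import java.util.TreeMap;

NavigableMap<String, Object> map = new TreeMap<>();
map.put("apple", 1);
map.put("orange", 2);
map.put("mossberg", 3);

// All entries whose key starts with "app": the range ["app", "app\uffff")
NavigableMap<String, Object> matches = map.subMap("app", true, "app" + '\uffff', false);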
It may be interesting for you to look through this question: Fuzzy string search library in Java.
Also take a look at Lucene (the second answer there).