I have an object that is identified by 3 fields. One of them is a String that represents 6 hex bytes; the other two are integers of no more than 1 byte each. That adds up to 8 bytes of data, which fits in a 64-bit integer.
I need to map these objects for fast access, and I can think of two approaches:
Use the 3 fields to generate a 64-bit key used to map the objects. This, however, would mean parsing the String as hex for every access (and there will be a lot of accesses, which need to be fast).
Use 3 HashMap levels, each nested inside the next, to represent the 3 identifying fields.
My question is: which of these approaches should be the fastest?
Why not use a MultiKeyMap?
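For reference, a minimal sketch of what that could look like with Apache Commons Collections (this assumes commons-collections4 is on the classpath; the values are only illustrative):

import org.apache.commons.collections4.map.MultiKeyMap;

public class MultiKeyExample {
    public static void main(String[] args) {
        MultiKeyMap<Object, String> map = new MultiKeyMap<>();
        // The three identifying fields together form the key.
        map.put("0A1B2C3D4E5F", 7, 42, "some object");
        System.out.println(map.get("0A1B2C3D4E5F", 7, 42)); // prints "some object"
    }
}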
This might not be related to your question.
I have a suggestion for you.
Create an object with the 3 attributes that will form the key. Use the object as the key, because it will be unique.
Map<ObjectKey,Object> map = new HashMap<>();
Does this make sense for your use case? If you can add a bit more explanation, maybe I can go further in suggesting possible solutions.
EDIT: You can override equals() and use this kind of logic:
@Override
public boolean equals(Object obj) {
    if (!(obj instanceof ObjectKey))
        return false;
    ObjectKey objectKey = (ObjectKey) obj;
    return this.key1.equals(objectKey.key1) && this.key2.equals(objectKey.key2) &&
           ...
           this.keyN.equals(objectKey.keyN);
}
I would take the following steps:
Write it in the most readable way first, and profile it.
Refactor it to an implementation you think might be faster, then profile it again.
Compare.
Repeat.
Your key fits into a 64-bit value. Assuming you will build the HashMap in one go and then read from it multiple times (using it as a lookup table), my hunch is that using a Long type as the key of your HashMap will be about as fast as you can get.
You are concerned about having to parse the string as a hex number every time you look up a key in the map. What's the alternative? If you use a key containing the three separate fields, you will still have to parse the string to calculate its hash code (or, rather, the Java API implementation will calculate its hash code by parsing the string contents). The HashMap will not only call String.hashCode() but also String.equals(), so your string will be iterated twice. By contrast, calculating a Long and comparing it to the precalculated keys in the HashMap will consist of iterating the string only once.
If you use three levels of HashMap, as per your second suggestion, you will still have to calculate the hash code of your string, as well as having to look up the values of all three fields anyway, so the multi-level map doesn't give you any performance advantage.
You should also experiment with the HashMap constructor arguments to get the most efficiency. These will determine how efficiently your data will get spread into separate buckets.
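As a rough sketch of that idea, and of sizing the map up front, assuming the String holds exactly 12 hex characters (6 bytes) and the two integer fields fit in one byte each (names and figures are illustrative):

import java.util.HashMap;
import java.util.Map;

public class PackedKeyExample {
    // Packs the 12-character hex String (48 bits) and the two one-byte ints into one long.
    static long packKey(String hex6Bytes, int fieldA, int fieldB) {
        long high = Long.parseLong(hex6Bytes, 16); // the string is parsed once per lookup
        return (high << 16) | ((fieldA & 0xFF) << 8) | (fieldB & 0xFF);
    }

    public static void main(String[] args) {
        int expectedEntries = 100_000; // illustrative figure
        // Sizing above expectedEntries / loadFactor avoids rehashing while the map is built.
        Map<Long, String> lookup = new HashMap<>((int) (expectedEntries / 0.75f) + 1);

        lookup.put(packKey("0A1B2C3D4E5F", 7, 42), "some object");
        System.out.println(lookup.get(packKey("0A1B2C3D4E5F", 7, 42))); // prints "some object"
    }
}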
Related
For one of my school assignments, I have to parse GenBank files using Java. I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible. Is there a difference between using HashMaps or storing the data as records? I know that using HashMaps would be O(1), but the readability and immutability of records leads me to prefer them instead. The objects will be stored in an array.
This is my approach now:
public static GenBankRecord parseGenBankFile(File gbFile) throws IOException {
    try (var fileReader = new FileReader(gbFile); var reader = new BufferedReader(fileReader)) {
        String organism = null;
        List<String> contentList = new ArrayList<>();
        while (true) {
            String line = reader.readLine();
            if (line == null) break; // Breaking out if file end has been reached
            contentList.add(line);
            if (line.startsWith(" ORGANISM ")) {
                // Organism type found
                organism = line.substring(12); // Selecting the correct part of the line
            }
        }
        // Loop ended
        var content = String.join("\n", contentList);
        return new GenBankRecord(gbFile.getName(), organism, content);
    }
}
with GenBankRecord being the following:
record GenBankRecord(String fileName, String organism, String content) {
    @Override
    public String toString() {
        return organism;
    }
}
Is there a difference between using a record and a HashMap, assuming the key-value pairs are the same as the fields of the record?
String current_organism = gbRecordInstance.organism();
and
String current_organism = gbHashMap.get("organism");
I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible.
Firstly, I am somewhat doubtful that your teachers actually stated the requirements like that. It doesn't make a lot of sense to optimize just for time complexity.
Complexity is not efficiency.
Big O complexity is not about the value of the measure (e.g. time taken) itself. It is actually about how the measure (e.g. time taken) changes as some variable gets very large.
For example, HashMap.get(nameStr) and someRecord.name are both O(1) complexity.
But they are not equivalent in terms of efficiency. Using Java 17 record types or regular Java classes with named fields will be orders of magnitude faster than using a HashMap. (And it will use orders of magnitude less memory.)
Assuming that your objects have a fixed number of named fields, the complexity (i.e. how the performance changes with an ever-increasing number of fields) is not even relevant.
Performance is not everything.
The main differences between a HashMap and a record class are actually in the functionality that they provide:
A Map<String, SomeType> provides a set of name/value pairs where:
the number of pairs in the set is not fixed
the names are not fixed
the types of the values are all instances of SomeType or a subtype.
A record (or classic class) can be viewed as set of fieldname / value pairs where:
the number of pairs is fixed at compile time
the field names are fixed at compile time
the field types don't have to be subtypes of any single given type.
As @Louis Wasserman commented:
Records and HashMap are apples and oranges -- it doesn't really make sense to compare them.
So really, you should be choosing between records and hashmaps by comparing the functionality / constraints that they provide versus what your application actually needs.
(The problem description in your question is not clear enough for us to make that judgement.)
Efficiency concerns may be relevant, but they are a secondary concern. (If the code doesn't meet functional requirements, efficiency is moot.)
Is Complexity relevant to your assignment?
Well ... maybe yes. But not in the area that you are looking at.
My reading of the requirements is that one of them is that you be able to retrieve information from your in-memory data structures efficiently.
But so far you have been thinking about storing individual records. Retrieval implies that you have a collection of records and you have to (efficiently) retrieve a specific record, or maybe a set of records matching some criteria. So that implies you need to consider the data structure to represent the collection.
Suppose you have a collection of N records (or whatever) representing (say) N organisms:
If the collection is a List<SomeRecord>, you need to iterate the list to find the record for (say) "cat". That is O(N).
If the collection is a HashMap<String, SomeRecord> keyed by the organism name, you can find the "cat" record in O(1).
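As a minimal sketch of that second option, reusing the GenBankRecord and parse method from the question (java.io and java.util imports and the list of files are assumed):

// Builds an organism -> record index, so a specific organism can then be fetched in O(1).
static Map<String, GenBankRecord> buildIndex(List<File> gbFiles) throws IOException {
    Map<String, GenBankRecord> byOrganism = new HashMap<>();
    for (File file : gbFiles) {
        GenBankRecord record = parseGenBankFile(file); // method from the question
        byOrganism.put(record.organism(), record);
    }
    return byOrganism; // e.g. byOrganism.get("cat") is O(1)
}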
I have a
HashMap<String,AnObject>
and I'd like to build the String key from some of the information the AnObject value contains.
Suppose AnObject is made this way:
public class AnObject {
    public String name;
    public String surname;
}
Is it correct to assign the key to:
String.valueOf(o.name.hashcode()+o.surname.hashcode());
? Or is there a better way to compute a String hash code from a value list?
No, absolutely not. hashCode() is not guaranteed to be unique.
The rules of a hash code are simple:
Two equal values must have the same hash code
Two non-equal values will ideally have different hash codes, but can have the same hash code. In particular, there are only 2^32 possible values to return from hashCode(), but more than 2^32 possible strings, making uniqueness impossible.
The hash code of an object should not change unless some equality-sensitive aspect of it changes. Indeed, it's generally a good idea to make types implementing value equality immutable, at least in equality-sensitive aspects. Otherwise you can easily find that you can't look up an entry using the exact same object reference that you previously used for the key!
Hash codes are an optimization technique to make it quick to find a "probably small" set of candidate values equal to some target, which you then iterate through with a rigorous equality check to find whether any of them is actually equal to the target. That's what lets you quickly look something up by key in a hash-based collection. The key isn't the hash itself.
If you need to create a key from two strings, you're going to basically have to make it from those two strings (with some sort of delimiter so you can tell the difference between {"a", "bc"} and {"ab", "c"} - understanding that the delimiter itself might appear in the values if you're not careful).
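As a minimal sketch of that, assuming the NUL character never occurs inside the values (escape it first if it can):

// Combines the two parts into one map key; {"a", "bc"} and {"ab", "c"} now produce different keys.
static String compositeKey(String name, String surname) {
    return name + '\u0000' + surname;
}

A small key class (or record) holding both strings, with proper equals() and hashCode(), is usually the cleaner option.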
See Eric Lippert's blog post on the topic for more information; that's based on .NET rather than Java, but the same principles apply. It's also worth understanding that the semantics of hashCode aren't necessarily the same as those of a cryptographic hash. In particular, it's fine for the result of hashCode() to change if you start a new JVM but create an object with the same fields - no-one should be persisting the results of hashCode(). That's not the case with something like SHA-256, which should be permanently stable for a particular set of data.
The hash code for String is lossy; many String values will result in the same hash code. An integer has 32 bit positions, and each position has two values. There's no way to map even just the 32-character strings, for instance (each character having many possible values), into 32 bits without collisions. They just won't fit.
If you want to use arbitrary precision arithmetic (say, BigInteger), then you can just take each character as an integer and concatenate them all together.
No, hashCode() (by the way, pay attention to the case of the letter C) does not guarantee uniqueness. You can have a lot of objects that produce the same hash code.
If you need a unique identifier, use the class java.util.UUID.
I'm trying to load large CSV-formatted files (typically 200-600 MB) efficiently with Java (using as little memory as possible, with access as fast as possible). Currently, the program uses a List of String arrays. This operation was previously handled with a Lua program, using a table for each CSV row and a table to hold each "row" table.
Below is an example of the memory differences and load times:
CSV File - 232 MB
Lua - 549 MB in memory - 157 seconds to load
Java - 1,378 MB in memory - 12 seconds to load
If I remember correctly, duplicate items in a Lua table exist as a reference to the actual value. I suspect in the Java example, the List is holding separate copies of each duplicate value and that may be related to the larger memory usage.
Below is some background on the data within the CSV files:
Each field consists of a String
Specific fields within each row may include one of a set of Strings (E.g. field 3 could be "red", "green", or "blue").
There are many duplicate Strings within the content.
Below are some examples of what may be required of the loaded data:
Search through all Strings attempting to match with a given String and return the matching Strings
Display matches in a GUI table (sortable by field).
Alter or replace Strings.
My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?
One easy solution: you can have a HashMap where you put references to all unique strings. In the ArrayList you will then just have references to the existing unique strings in the HashMap.
Something like:
private HashMap<String, String> hashMap = new HashMap<String, String>();

public String getUniqueString(String ns) {
    String oldValue = hashMap.get(ns);
    if (oldValue != null) { // I suppose there will be no null strings inside the CSV
        return oldValue;
    }
    hashMap.put(ns, ns);
    return ns;
}
Simple usage:
List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
    String ns = a.getUniqueString(er); // 'a' is an instance of the class holding getUniqueString above
    finS.add(ns);
}
Maybe this article can be of some help:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
DAWG
A directed acyclic word graph is the most efficient way to store words (best for memory consumption anyway).
But it's probably overkill here; as others have said, don't create duplicates, just keep multiple references to the same instance.
To optimise your memory problem I advise using the Flyweight pattern, especially for fields that have a lot of duplicates.
As a collection you can use a TreeSet or TreeMap.
If you give your LineItem class a good implementation (implement equals, hashCode and Comparable) you can optimise the memory use a lot.
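A minimal sketch of what that could look like, with made-up field names; a record supplies equals() and hashCode() automatically, so only the ordering needs to be written out:

import java.util.Comparator;
import java.util.TreeSet;

record LineItem(String field1, String field2, String field3) implements Comparable<LineItem> {
    private static final Comparator<LineItem> ORDER =
            Comparator.comparing(LineItem::field1)
                      .thenComparing(LineItem::field2)
                      .thenComparing(LineItem::field3);

    @Override
    public int compareTo(LineItem other) {
        return ORDER.compare(this, other); // keeps a TreeSet<LineItem> sorted and free of duplicate rows
    }
}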
Just as a side note: for the duplicate string data you mention, Java interns string literals itself, since Strings are immutable, so identical literals reference the same object in memory.
I'm not sure how Lua does the job, but in Java it should also be quite efficient.
All,
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair (as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foo's are not guaranteed to be in the same order in each Set (so they will have to be pre-sorted for the correct order somehow, like a TreeSet)
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks
If you use TreeSets for your Sets can't you just do list.contains(set) since a TreeSet will handle the equals check?
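For what it's worth, the order of the Foos should not matter for that check: Set.equals() is defined by element membership, so any Set implementation works as long as Foo itself implements equals() and hashCode(). A minimal sketch with made-up Foo fields:

import java.util.List;
import java.util.Set;

record Foo(String key, int value) {} // the record supplies equals() and hashCode()

public class ContainsExample {
    public static void main(String[] args) {
        List<Set<Foo>> rows = List.of(
                Set.of(new Foo("A", 1), new Foo("B", 3), new Foo("C", 4)),
                Set.of(new Foo("A", 1), new Foo("B", 2), new Foo("C", 4)));

        Set<Foo> candidate = Set.of(new Foo("C", 4), new Foo("A", 1), new Foo("B", 3));
        System.out.println(rows.contains(candidate)); // true, despite the different insertion order
    }
}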
Also, consider using Guava's Multiset class.
I would recommend you use a less weird data structure. As for finding stuff: Generally Hashes or Sorting + Binary Searching or Trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: if this is not a purely academic question, loop through the list and do the comparison. Most likely, that is acceptably fast. Even 100,000 entries will take a fraction of a second, and therefore won't matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.
Say I have a Hashtable<String, Object> with such keys and values:
apple => 1
orange => 2
mossberg => 3
I can use the standard get method to get 1 by "apple", but what I want is to get the same value (or a list of values) by a part of the key, for example "ppl". Of course it may yield several results; in this case I want to be able to process each key-value pair. So basically something similar to the LIKE '%ppl%' SQL statement, but I don't want to use an (in-memory) database just because I don't want to add unnecessary complexity. What would you recommend?
Update:
Storing the data in a Hashtable isn't a requirement. I'm looking for a general approach to solve this.
The obvious brute-force approach would be to iterate through the keys in the map and match them against the char sequence. That could be fine for a small map, but of course it does not scale.
This could be improved by using a second map to cache search results. Whenever you collect a list of keys matching a given char sequence, you can store these in the second map so that next time the lookup is fast. Of course, if the original map is changed often, it may get complicated to update the cache. As always with caches, it works best if the map is read much more often than changed.
Alternatively, if you know the possible char sequences in advance, you could pre-generate the lists of matching strings and pre-fill your cache map.
Update: Hashtable is not recommended anyway - it is synchronized, and thus much slower than it needs to be. You are better off using HashMap if no concurrency is involved, or ConcurrentHashMap otherwise. The latter outperforms a Hashtable by far.
Apart from that, off the top of my head I can't think of a better type of collection for this task than maps. Of course, you may experiment with different map implementations to find the one which best suits your specific circumstances and usage patterns. In general, it would thus be:
Map<String, Object> fruits;
Map<String, List<String>> matchingKeys;
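A rough sketch of that cached lookup, using the two maps declared above (java.util imports assumed; the cache must be invalidated whenever fruits changes):

List<Object> findBySubstring(String part) {
    List<String> keys = matchingKeys.get(part);
    if (keys == null) { // not cached yet: scan the keys once, then remember the result
        keys = new ArrayList<>();
        for (String key : fruits.keySet()) {
            if (key.contains(part)) {
                keys.add(key);
            }
        }
        matchingKeys.put(part, keys);
    }
    List<Object> values = new ArrayList<>();
    for (String key : keys) {
        values.add(fruits.get(key));
    }
    return values;
}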
Not without iterating through explicitly. Hashtable is designed to go from (exact) key to value in O(1), nothing more, nothing less. If you will be doing query operations over large amounts of data, I recommend you consider a database. You can use an embedded one like SQLite (see SQLiteJDBC) so no separate process or installation is required. You then have the option of database indexes.
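A rough sketch of that route, assuming the xerial sqlite-jdbc driver is on the classpath (table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LikeQueryExample {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:fruits.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS fruits (name TEXT PRIMARY KEY, value INTEGER)");
                st.execute("INSERT OR REPLACE INTO fruits VALUES ('apple', 1), ('orange', 2), ('mossberg', 3)");
            }
            // A LIKE pattern with a leading wildcard still scans the table; an index only helps prefix searches.
            try (PreparedStatement ps = con.prepareStatement("SELECT name, value FROM fruits WHERE name LIKE ?")) {
                ps.setString(1, "%ppl%");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " => " + rs.getInt("value"));
                    }
                }
            }
        }
    }
}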
I know of no standard Java collection that can do this type of operation efficiently.
Sounds like you need a trie with references to your data. A trie stores strings and lets you search for strings by prefix. I don't know the Java standard library too well and I have no idea whether it provides an implementation, but one is available here:
http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java
Unfortunately, a trie only lets you search by prefixes. You can work around this by storing every possible suffix of each of your keys:
For 'apple', you'd store the strings
'apple'
'pple'
'ple'
'le'
'e'
Which would allow you to search for every prefix of every suffix of your keys.
Admittedly, this is the kind of "solution" that would prompt me to continue looking for other options.
First of all, use HashMap, not Hashtable.
Then you can filter the map using a predicate, with the utilities in Google Guava:
public Collection<Object> getValues() {
    Map<String, Object> filtered = Maps.filterKeys(map, new Predicate<String>() {
        @Override
        public boolean apply(String key) {
            return key.contains("ppl"); // the partial key you are searching for
        }
    });
    return filtered.values();
}
Can't be done in a single operation
You may want to try to iterate the keys and use the ones that contain your desired string.
The only solution I can see (I'm not a Java expert) is to iterate over the keys and check them against a regular expression. For each key that matches, you put the key-value pair into the Hashtable that will be returned.
If you can somehow reduce the problem to searching by prefix, you might find a NavigableMap helpful.
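For instance, a rough sketch of prefix lookup with a TreeMap (this only covers prefix matches like "app", not substrings like "ppl"):

import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixLookup {
    public static void main(String[] args) {
        NavigableMap<String, Object> map = new TreeMap<>();
        map.put("apple", 1);
        map.put("apricot", 2);
        map.put("orange", 3);

        String prefix = "ap";
        // Every key starting with "ap" sorts between "ap" (inclusive) and "ap" + '\uffff' (exclusive).
        SortedMap<String, Object> matches = map.subMap(prefix, prefix + Character.MAX_VALUE);
        matches.forEach((k, v) -> System.out.println(k + " => " + v)); // apple, apricot
    }
}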
It might be interesting for you to look through this question: Fuzzy string search library in Java
Also take a look at Lucene (answer number two there).