I ran across some code that was doing something like this:
Map<String,String> fullNameById = buildMap1(dataSource1);
Map<String,String> nameById = buildMap2(dataSource2);
Map<String,String> nameByFullName = new HashMap<String,String>();
Map<String,String> idByName = new HashMap<String,String>();
Set<String> ids = fullNameById.keySet();
for (String nextId : ids) {
String name = nameById.get(nextId);
String fullName = fullNameById.get(nextId);
nameByFullName.put(fullName, name);
idByName.put(name, nextId);
}
I had to stare at it for several minutes to figure out what was going on. All of that amounts to a join operation on IDs and an inversion of one of the original maps. Since Id, FullName and Name are always 1:1:1, it seemed to me that there should be some way to simplify this. I also discovered that the first two maps are never used again, and I find the above code a bit hard to read. So I'm considering replacing it with something like this, which (to me) reads much cleaner:
Table<String, String, String> relations = HashBasedTable.create();
addRelationships1(dataSource1, relations);
addRelationships2(dataSource2, relations);
Map<String,String> idByName = relations.column("hasId");
Map<String,String> nameByFullName = relations.column("hasName");
relations = null; // not used hereafter
In addRelationships1 I do
relations.put(id, "hasFullName", fullname);
And in addRelationships2 where my query yields values for id and name I do
relations.put(relations.remove(id,"hasFullName"), "hasName", name);
relations.put(name, "hasId", id);
So my questions are these:
Is there a lurking inefficiency in what I have done either via processor or memory, or GC load? I don't think so, but I'm not that familiar with the efficiency of Table. I am aware that the Table object won't be GC'd after relations = null, I just want to communicate that it's not used again in the rather lengthy section of code that follows.
Have I gained any efficiency? I keep convincing and unconvincing myself that I have and have not.
Do you find this more readable? Or is this only easy for me to read because I wrote it? I'm a tad worried on that front due to the fact Table is not well known. On the other hand, the top level now pretty clearly says, "gather data from two sources and make these two maps from it." I also like the fact that it doesn't leave you wondering if/where the other two maps are being used (or not).
Do you have an even better, cleaner, faster, simpler way to do it than either of the above?
Please, let's not have the optimize early/late discussion here. I'm well aware of that pitfall. If it improves readability without hurting performance I am satisfied. Performance gain would be a nice bonus.
Note: my variable and method names have been sanitized here to keep the business area from distracting from the discussion, I definitely won't name them addRelationships1 or datasource1! Similarly, the final code will of course use constants not raw strings.
So I did some mini benchmarking myself and came to the conclusion that there is little difference between the two methods in terms of execution time. I kept the total size of the data being processed constant by trading runs for data-set size. I did 4 runs and chose the lowest time for each implementation from among all 4 runs. Reassuringly, both implementations were always fastest on the same run. My code can be found here. Here are my results:
Case                       Maps (ms)   Table (ms)   Table vs Maps
100000 runs of size 10     2931        3035         104%
10000 runs of size 100     2989        3033         101%
1000 runs of size 1000     3129        3160         101%
100 runs of size 10000     4126        4429         107%
10 runs of size 100000     5081        5866         115%
1 run of size 1000000      5489        5160         94%
So using Table seems to be slightly slower for small data sets. Something interesting happens around 100,000 and then by 1 million the table is actually faster. My data will hang out in the 100 to 1000 range, so at least in execution time the performance should be nearly identical.
As for readability, my opinion is that if someone is trying to figure out what is happening nearby and reads the code, it will be significantly easier to see the intent. If they have to actually debug this bit of code it may be a bit harder, since Table is less common and requires some sophistication to understand.
Another thing I am unsure of is whether or not it's more efficient to create the hash maps, or to just query the table directly in the case where all keys of the map will subsequently be iterated. However that's a different question :)
And the comedic ending is that as I analyzed the code further (hundreds of lines), I found that the only significant use of nameByFullName.get() outside of logging (of questionable value) was to pass the result to idByName.get(). So in the end I'll actually be building an idByFullName map and an idByName map instead, with no need for any joining, and dropping the whole table thing anyway. But it made for an interesting SO question I guess.
tl;dr, but I'm afraid that you'd need to make a bigger step away from the original design. Simulating DB tables might be a nice exercise, but for me your code isn't really readable.
Is there a lurking inefficiency in what I have done... No idea.
Have I gained any efficiency? I'm afraid you need to measure it first. Removing some indirections surely helps, but using a more complicated data structure might offset it. And performance in general is simply too complicated.
Do you find this more readable? I'm afraid not.
Do you have an even better, cleaner, faster, simpler way to do it than either of the above? I hope so....
Where I get lost in such code is the use of strings for everything - it's just too easy to pass a wrong string as an argument. So I'd suggest to aggregate them into an object and provide maps for accessing the objects via any part of them. Something as trivial as this should do:
class IdNameAndFullName {
String id, name, fullName;
}
class IdNameAndFullNameMaps {
Map<String, IdNameAndFullName> byId;
Map<String, IdNameAndFullName> byName;
Map<String, IdNameAndFullName> byFullName;
}
You could obviously replace the class IdNameAndFullNameMaps by a Table. However, besides using a nice pre-existing data structure I see no advantages therein. The disadvantages are:
loss of efficiency
loss of readability (I wouldn't use Table here for the very same reason Tuple should be avoided)
use of String keys (your "hasId" and "hasName").
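The aggregate-object suggestion above could be fleshed out roughly as follows. This is only a sketch: the constructor, the add method and the getBy... accessors are illustrative additions that the answer does not spell out, and it assumes id, name and fullName are each unique (the 1:1:1 relationship from the question).

```java
import java.util.HashMap;
import java.util.Map;

/** One object holding the three values that always belong together. */
final class IdNameAndFullName {
    final String id;
    final String name;
    final String fullName;

    IdNameAndFullName(String id, String name, String fullName) {
        this.id = id;
        this.name = name;
        this.fullName = fullName;
    }
}

/** Indexes the same objects by each of their three (assumed unique) fields. */
final class IdNameAndFullNameMaps {
    private final Map<String, IdNameAndFullName> byId = new HashMap<>();
    private final Map<String, IdNameAndFullName> byName = new HashMap<>();
    private final Map<String, IdNameAndFullName> byFullName = new HashMap<>();

    /** Registers one entry under all three keys. */
    void add(IdNameAndFullName entry) {
        byId.put(entry.id, entry);
        byName.put(entry.name, entry);
        byFullName.put(entry.fullName, entry);
    }

    // Each accessor returns null when the key is unknown.
    IdNameAndFullName getById(String id) { return byId.get(id); }
    IdNameAndFullName getByName(String name) { return byName.get(name); }
    IdNameAndFullName getByFullName(String fullName) { return byFullName.get(fullName); }
}
```

With this shape, the original nameByFullName.get(fullName) becomes getByFullName(fullName).name, and no magic column strings are involved.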
Let's imagine I have a lib which contains the following simple method:
private static final String CONSTANT = "Constant";
public static String concatStringWithCondition(String condition) {
return "Some phrase" + condition + CONSTANT;
}
What if someone wants to use my method in a loop? As I understand it, the string optimisation (where + gets replaced with StringBuilder or whatever is more optimal) does not work in that case? Or is this valid for strings initialised outside of the loop?
I'm using java 11 (Dropwizard).
Thanks.
No, this is fine.
The only case that string concatenation can be problematic is when you're using a loop to build one single string. Your method by itself is fine. Callers of your method can, of course, mess things up, but not in a way that's related to your method.
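To make the distinction concrete, here is a small sketch of the one problematic pattern (building a single string with += inside a loop) next to the usual StringBuilder alternative. The class and method names are made up for illustration.

```java
final class ConcatInLoop {
    /** Problematic pattern: every += builds a new String containing everything so far. */
    static String joinWithPlus(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part; // copies the whole accumulated string each iteration
        }
        return result;
    }

    /** Usual fix: append into one growing buffer and build the String once. */
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }
}
```

Calling a method such as concatStringWithCondition inside a loop is a different situation: each call builds its own small, independent string, so nothing accumulates across iterations.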
The code as written should be as efficient as making a StringBuilder and appending these 3 parts to it. There certainly is absolutely no difference at all between a literal ("Some phrase") and an expression that the compiler can treat as a Compile Time Constant (which CONSTANT, here, clearly is, given that CONSTANT is static, final, not null, and of a CTC-able type: all primitives and strings).
However, is that 'efficient'? I doubt it - making a stringbuilder is not particularly cheap either. It's orders of magnitude cheaper than continually making new strings, sure, but there's always a bigger fish:
It doesn't matter
Computers are fast. Really, really fast. It is highly likely that you can write this incredibly badly (performance wise) and it still won't be measurable. You won't even notice. Less than a millisecond slower.
In general, anybody that worries about performance at this level simply lacks perspective and knowledge: If you apply that level of fretting to your java code and you have the knowledge to know what could in theory be non-perfectly-performant, you'll be sweating every 3rd character you ever type. That's no way to program. So, gain that perspective (or take it from me, "just git gud" is not exactly something you can do in a week - take it on faith for now, as you learn you can start verifying) - and don't worry about it. Unless you actually run into an actual situation where the code is slower than it feels like it could be, or slower than it needs to be, and then toss profilers and microbenchmark testing frameworks at it, and THEN, armed with all that information (and not before!), consider optimizing. The reports tell you what to optimize, because literally less than 1% of the code is responsible for 99% of the performance loss, so spending any time on code that isn't in that 1% is an utter waste of time, hence why you must get those reports first, or not start at all.
... or perhaps it does
But if it does matter, and it's really that 1% of the code that is responsible for 99% of the loss, then usually you need to go a little further than just 'optimize the method'. Optimize the entire pipeline.
What is happening with this string? Take that into consideration.
For example, let's say that it, itself, is being appended to a much bigger stringbuilder. In which case, making a tiny stringbuilder here is incredibly inefficient compared to rewriting the method to:
public static void concatStringWithCondition(StringBuilder sb, String condition) {
sb.append("Some phrase").append(condition).append(CONSTANT);
}
Or, perhaps this data is being turned into bytes using UTF_8 and then tossed onto a web socket. In that case:
// Needs java.io.IOException, java.io.OutputStream and java.nio.charset.StandardCharsets.
private static final byte[] PREFIX = "Some phrase".getBytes(StandardCharsets.UTF_8);
// Same text as CONSTANT in the question, encoded once up front.
private static final byte[] SUFFIX = "Constant".getBytes(StandardCharsets.UTF_8);
public void concatStringWithCondition(OutputStream out, String condition) throws IOException {
    out.write(PREFIX);
    out.write(condition.getBytes(StandardCharsets.UTF_8));
    out.write(SUFFIX);
}
and check if that outputstream is buffered. If not, make it buffered, that'll help a ton and would completely dwarf the cost of not using string concatenation. If the 'condition' string can get quite large, the above is no good either, you want a CharsetEncoder that encodes straight to the OutputStream, and may even want to replace all that with some ByteBuffer based approach.
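As a rough illustration of the buffering advice, the sketch below routes many small writes through a BufferedOutputStream. The ByteArrayOutputStream is only a stand-in for whatever socket or response stream the real code has, and the class and method names are hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BufferedWriteSketch {
    private static final byte[] PREFIX = "Some phrase".getBytes(StandardCharsets.UTF_8);
    private static final byte[] SUFFIX = "Constant".getBytes(StandardCharsets.UTF_8);

    /** Writes prefix + condition + suffix straight to the stream, without building a String. */
    static void writePhrase(OutputStream out, String condition) throws IOException {
        out.write(PREFIX);
        out.write(condition.getBytes(StandardCharsets.UTF_8));
        out.write(SUFFIX);
    }

    /** Sends all phrases through one buffer, so the underlying stream sees few large writes. */
    static byte[] writeAll(String[] conditions) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for a socket stream
        try (OutputStream out = new BufferedOutputStream(sink)) {
            for (String condition : conditions) {
                writePhrase(out, condition);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The try-with-resources block has closed (and therefore flushed) the buffer by now.
        return sink.toByteArray();
    }
}
```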
Conclusion
1. Assume performance is never relevant until it is.
2. IF performance truly must be tackled, strap in, it'll take ages to do it right. Doing it 'wrong' (applying dumb rules of thumb that do not work) isn't useful. Either do it right, or don't do it.
3. IF you're still on board, always start with profiler reports and use JMH to gather information.
4. Be prepared to rewrite the pipeline - change the method signatures, in order to optimize.
5. That means that micro-optimizing, which usually sacrifices nice abstracted APIs, is actively bad for performance - because changing pipelines is considerably more difficult if all code is micro-optimized, given that this usually comes at the cost of abstraction.
And now the circle is complete: Point 5 shows why the worrying about performance as you are doing in this question is in fact detrimental: It is far too likely that this worry results in you 'optimizing' some code in a way that doesn't actually run faster (because the JVM is a complex beast), and even if it did, it is irrelevant because the code path this code is on is literally only 0.01% or less of the total runtime expenditure, and in the mean time you've made your APIs worse and lack abstraction which would make any actually useful optimization much harder than it needs to be.
But I really want rules of thumb!
All right, fine. Here are 2 easy rules of thumb to follow that will lead to better performance:
When in Rome...
The JVM is an optimising marvel and will run the craziest code quite quickly anyway. However, it does this primarily by being a giant pattern matching machine: It finds recognizable code snippets and rewrites these to the fastest, most carefully tuned to juuust your combination of hardware machine code it can. However, this pattern machine isn't voodoo magic: It's got limited patterns. Which patterns do JVM makers 'ship' with their JVMs? Why, the common patterns, of course. Why include a pattern for exotic code virtually nobody ever writes? Waste of space.
So, write code the way java programmers tend to write it. Which very much means: Do not write crazy code just because you think it might be faster. It'll likely be slower. Just follow the crowd.
Trivial example:
Which one is faster:
List<String> list = new ArrayList<String>();
for (int i = 0; i < 10000; i++) list.add(someRandomName());
// option 1:
String[] arr = list.toArray(new String[list.size()]);
// option 2:
String[] arr = list.toArray(new String[0]);
You might think, obviously, option 1, right? Option 2 'wastes' a string array, making a 0-length array just to toss it in the garbage right after. But you'd be wrong: Option 2 is in fact faster. If you want an explanation: The JVM recognizes it, and does a hacky move: It makes a new string array that does not need to be initialized with all zeroes first. Normal java code cannot do this (arrays are necessarily initialized blank, to prevent memory corruption issues), but specifically .toArray(new X[0])? Those pattern matching machines I told you about detect this and replace it with code that just blits the refs straight into a patch of memory without wasting time writing zeroes to it first.
It's a subtle difference that is highly unlikely to matter - it just highlights: Your instincts? They will mislead you every time.
Fortunately, .toArray(new X[0]) is common java code. And easier and shorter. So just write nice, convenient code that looks like how other folks write and you'd have gotten the right answer here. Without having to know such crazy esoterics as having to reason out how the JVM needs to waste time zeroing out that array and how hotspot / pattern matching might possibly eliminate this, thus making it faster. That's just one of 5 million things you'd have to know - and nobody can do that. Thus: Just write java code in simple, common styles.
Algorithmic complexity is a thing hotspot can't fix for you
Given an O(n^3) algorithm fighting an O(log(n) * n^2) algorithm, make n large enough and the second algorithm has to win, that's what big O notation means. The JVM can do a lot of magic but it can pretty much never optimize an algorithm into a faster 'class' of algorithmic complexity. You might be surprised at the size n has to be before algorithmic complexity dominates, but it is acceptable to realize that your algorithm can be fundamentally faster and do the work on rewriting it to this more efficient algorithm even without profiler reports and benchmark harnesses and the like.
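A small, made-up illustration of the kind of rewrite meant here: the same question (does an array contain a duplicate?) answered by a quadratic pairwise scan and by a single pass with a HashSet. The JIT will happily make the first version run its loops quickly, but it will not turn it into the second.

```java
import java.util.HashSet;
import java.util.Set;

final class DuplicateCheck {
    /** O(n^2): compares every pair of elements. */
    static boolean hasDuplicateQuadratic(int[] values) {
        for (int i = 0; i < values.length; i++) {
            for (int j = i + 1; j < values.length; j++) {
                if (values[i] == values[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Expected O(n): Set.add returns false if the element was already present. */
    static boolean hasDuplicateLinear(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int value : values) {
            if (!seen.add(value)) {
                return true;
            }
        }
        return false;
    }
}
```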
I wonder whether collecting the conditions in a HashMap and looping over them with a single if statement gives higher performance than writing the if / else if statements one by one.
In my opinion, one-by-one if / else if statements may be faster, because a loop evaluates one extra condition on each iteration (has the counter reached the target number?). So for each if statement, it effectively runs 2 checks. Of course the bodies of the statements differ, but if we talk only about the performance of the statements themselves, I think the one-by-one version would be better?
Edit: this is just sample code; my question is about the performance difference between these two kinds of statement.
Map<String, Integer> words = new HashMap<String, Integer>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);
// Visits every entry and prints only for the matching key.
words.forEach((word, number) -> {
    if (letter.equals(word)) {
        System.out.println(number * n);
    }
});
// Second variant: the same check written as cascading if / else if.
String letter = "d";
int n = 4;
if (letter.equals("a")) {
    System.out.println(n * 1);
} else if (letter.equals("b")) {
    System.out.println(n * 2);
} else if (letter.equals("c")) {
    System.out.println(n * 3);
} else if (letter.equals("d")) {
    System.out.println(n * 4);
} else if (letter.equals("e")) {
    System.out.println(n * 5);
}
For your example, having a HashMap but then doing an iterative lookup seems to be a bad idea. The point of using a HashMap is to be able to do a hash based lookup. That is much faster than doing an iterative lookup.
Also, from your example, cascading if-then tests will definitely be faster, since they will avoid the overhead of the map iterator and extra function calls. Also, they will avoid the overhead of the map iterator skipping empty storage locations in the hash map backing array. A better question is whether the cascading if-thens are faster than iterating across a simple list. That is hard to answer. Cascading if-thens seem likely to be faster, except that if there are a lot of if-thens, then a cost of loading the code should be added.
For string lookups, a list data structure provides adequate behavior up to a limiting value, above which a more sophisticated data structure must be used. What is the limiting value depends on the environment. For string comparisons, I've found the transition between 20 and 100 elements.
For particular lookups, and depending on whether low-level optimizations are available, the transition value may be much larger. For example, when doing integer lookups in "C", which can do direct memory lookups, the transition value is much higher.
Typical data structures are HashMaps, Tries, and sorted arrays. Each fits particular patterns of access. For example, sorted arrays are fastest and most compact, but are expensive to update. HashMaps support dynamic updates, and for good hash functions, provide constant time lookups. But, HashMaps are space inefficient, since they depend on having empty cells between hash values.
For cases which do not involve "very large" data sets, and which are not in critical "hot" code paths, HashMaps are the usual structure which is used.
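As one illustration of the sorted-array option mentioned above, here is a minimal sketch using Arrays.binarySearch over two parallel arrays. The keys and values mirror the question's sample data; the class name and the -1 "not found" convention are arbitrary choices for this example.

```java
import java.util.Arrays;

final class SortedArrayLookup {
    // KEYS must stay sorted in natural String order for binarySearch to work;
    // VALUES[i] is the number belonging to KEYS[i].
    private static final String[] KEYS = {"a", "b", "c", "d", "e"};
    private static final int[] VALUES = {1, 2, 3, 4, 5};

    /** Returns the value stored for key, or -1 if the key is absent. */
    static int lookup(String key) {
        int index = Arrays.binarySearch(KEYS, key);
        return index >= 0 ? VALUES[index] : -1;
    }
}
```

The arrays are compact and lookups take O(log n) comparisons, but inserting a new key means shifting or rebuilding both arrays, which is the update cost the answer refers to.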
If you have a Map and you want to retrieve one letter, I'm not sure why you would loop at all?
Map<String, Integer> words = new HashMap<String, Integer>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);
// Direct hash lookup: no iteration over the entries is needed.
if (words.containsKey(letter)) {
    System.out.println(words.get(letter) * n);
} else {
    System.out.println(letter + " doesn't exist in Map");
}
If you aren't using the benefits of a Map, then why use a Map at all?
A forEach will actually touch every entry in the map. The number of checks in your if/else chain depends on where the letter sits in the chain and how long the chain of available letters is. If the letter you choose is the last one in the chain, every check runs before anything is printed. If it is the first, only one check runs, which is much faster than having to check all of them.
It would be easy for you to write the two examples and run a timer to determine which is actually faster.
https://www.baeldung.com/java-measure-elapsed-time
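A bare-bones version of such a timer, using System.nanoTime, might look like the sketch below. It is deliberately naive: there is no warm-up and the JIT may optimize the two loops differently, so treat the printed numbers as a rough hint at best and use a harness such as JMH for anything you intend to rely on. All names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class LookupTiming {
    static final Map<String, Integer> WORDS = new HashMap<>();
    static {
        WORDS.put("a", 1);
        WORDS.put("b", 2);
        WORDS.put("c", 3);
        WORDS.put("d", 4);
        WORDS.put("e", 5);
    }

    /** Hash lookup; returns -1 for an unknown letter. */
    static int viaMap(String letter, int n) {
        Integer number = WORDS.get(letter);
        return number == null ? -1 : number * n;
    }

    /** Cascading comparisons; returns -1 for an unknown letter. */
    static int viaIfChain(String letter, int n) {
        if (letter.equals("a")) return 1 * n;
        else if (letter.equals("b")) return 2 * n;
        else if (letter.equals("c")) return 3 * n;
        else if (letter.equals("d")) return 4 * n;
        else if (letter.equals("e")) return 5 * n;
        return -1;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        long sink = 0; // accumulate results so the calls cannot be discarded as dead code

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += viaMap("d", 4);
        long mapNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += viaIfChain("d", 4);
        long ifNanos = System.nanoTime() - start;

        System.out.println("map: " + mapNanos / 1_000_000 + " ms, if-chain: "
                + ifNanos / 1_000_000 + " ms (sink=" + sink + ")");
    }
}
```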
There are a lot of wasted calculations if you have to run through 1 million if/else statements and only select one which could be anywhere in the list. This doesn't include typos and the horror of code maintenance. Using a Map with an index would be much quicker. If you are only talking about 100 if/else statements (still too many in my opinion) then you may be able to break even on speed.
I know this might be a kind of "silly" question. I have created software applications before where I initialized basically all of my variables as strings, and saved them in my database as VARCHARs. Then, I would gather them from the database and convert them as needed. Is there any reason this is not an efficient method for initializing variables and saving them in my database?
I know that for extremely large applications, this can cause an issue with computing time, because I am unnecessarily converting variables that could have been initialized as the appropriate type to begin with. But, for smaller applications, is this "okay" to do?
Some reasons to use proper types
1. Least surprise. If developers are going to grab numerical data from your database, they would find it weird that you're storing them as strings.
2. Developer convenience. Another is the nuisance of having to parse the data into the correct type every time. If you just store it as the correct type, then you would save people the trouble of having to put
int age = 0;
try {
age = Integer.parseInt(ageStr);
} catch (NumberFormatException e) {
throw new RuntimeException(e);
}
all over the code.
3. Data quality. The code example above hints at a third problem. Now it's possible for somebody to store "no_age" or "foo" or something in the column, which is a data quality issue. The best way to deal with errors is to make them impossible in the first place.
4. Storage efficiency. Storage efficiency is a factor as well. Different types have different ways of encoding data, and strings are not an efficient way to store numbers, bits, etc.
5. Network efficiency. If you store data in wasteful formats, then that often translates to unnecessary network utilization. This is why binary formats are generally more efficient than text formats like JSON or XML. But web services don't typically treat network efficiency as the driving engineering concern.
6. Processing efficiency. If the data is inherently numeric, then forcing everybody to parse it incurs processing cost.
7. Different types support different rules. In his answer, Hightower makes the good point that different types have special rules for ordering, which impacts ranges and sorts. I like this point because it impacts actual program behavior, whereas the concerns I mention above might be more academic for small apps with a single developer.
An example illustrating the efficiency benefit
Suppose you want to store eight bits. If you were to store that as a string you might have "TFFTFFTF", which under UTF-8 and ASCII would take 64 bits (8 chars x 8 bits per char) to store eight bits of actual information. Relatively speaking that's a big difference.
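The same comparison can be shown in code. This sketch packs the "TFFTFFTF" string from the example into a single byte and reports how many bytes the text form occupies; the class name and the choice of "first character is the highest bit" are assumptions made for the illustration.

```java
import java.nio.charset.StandardCharsets;

final class FlagPacking {
    /** Packs exactly eight 'T'/'F' characters into one byte; the first character becomes the highest bit. */
    static byte pack(String flags) {
        if (flags.length() != 8) {
            throw new IllegalArgumentException("expected exactly 8 flags");
        }
        int bits = 0;
        for (int i = 0; i < 8; i++) {
            char c = flags.charAt(i);
            if (c != 'T' && c != 'F') {
                throw new IllegalArgumentException("unexpected flag: " + c);
            }
            bits = (bits << 1) | (c == 'T' ? 1 : 0);
        }
        return (byte) bits;
    }

    /** Number of bytes the same flags need when stored as UTF-8 text. */
    static int sizeAsText(String flags) {
        return flags.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "TFFTFFTF" the packed form is the single byte 0x92 (binary 10010010), while the text form needs 8 bytes.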
Incidentally, even if your data is numeric, it's not good to just use BIGINT, for example. The different types of integer in a database have different storage requirements and so you should think about the number of bits you actually need, use unsigned representations if appropriate (no reason to waste a sign bit on numbers that can't be negative), etc. Wrong choices tend to add up quickly as you create new foreign keys that have to be BIGINTs now, new rows that all have a bunch of BIGINTs, etc. Your storage and backup requirements end up being needlessly demanding.
So. Is it "OK" to use strings?
These efficiency concerns may not matter at all for something small, which is what you were asking. Or there may be reasons to prefer an inefficient format over one that's more efficient, as my JSON/XML example above suggests. So as far as whether it's "OK", I can't answer that, but hopefully the considerations above give you some tools to make that decision yourself.
Still I'd try to get into the habit of using the right type, and I certainly wouldn't go out of my way to store things as strings without some reason. In bitset cases I could see potentially avoiding having to deal with bit manipulation, which can be tricky til you get the hang of it. (But some databases have special bitset types.) You mention not knowing the type and maybe that's a plausible reason in some cases, though I would lean more on refactoring here.
There are some reasons. For example, think about searching for a time range. This is easy with datetime fields, but not with strings, because you have to do the comparison in your application.
Another point is that sorting on a varchar field differs from sorting on an int field. As a varchar, 10 comes before 2; as an int, it comes after it.
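The ordering difference is easy to reproduce outside the database. This sketch sorts the same three values once as strings and once as integers; it only mimics what a database does with VARCHAR versus INT columns (the exact text ordering there also depends on collation).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SortOrderDemo {
    /** Sorts the values as text: comparison is character by character. */
    static List<String> sortedAsText() {
        List<String> values = new ArrayList<>(Arrays.asList("2", "10", "1"));
        Collections.sort(values);
        return values; // [1, 10, 2]
    }

    /** Sorts the same values as numbers. */
    static List<Integer> sortedAsNumbers() {
        List<Integer> values = new ArrayList<>(Arrays.asList(2, 10, 1));
        Collections.sort(values);
        return values; // [1, 2, 10]
    }
}
```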
I have to process 450 unique strings about 500 million times. Each string has unique integer identifier. There are two options for me to use.
1. I can append the identifier with the string and on arrival of the string I can split the string to get the identifier and use it.
2. I can store the 450 strings in HashMap<String, Integer> and on arrival of the string, I can query HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters, rather than the entire string, that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or first two characters, if they're unique, as a "perfect index" into a 256- or 65K-entry array that points to the identifier.
Also, if your identifier is numeric, it's better to pre-calculate that, rather than convert it on the fly all the time. Text -> Binary is actually rather expensive (Binary -> Text is worse). So it's probably nice to avoid that if possible.
But it behooves you to work the problem. 1 million anything at 1 ms each is roughly 17 minutes of processing. At 500m, every micro-second wasted adds up to 8+ minutes extra of processing. You may well not care, but this just demonstrates that at these scales "every little bit helps".
So, don't take our words for it, test different things to find what gives you the best result for your work set, and then go with that. Also consider excessive object creation, and avoiding that. Normally, I don't give it a second thought. Object creation is fast, but a nano-second is a nano-second.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters of the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few nano-seconds. At this scale, 8 micro-seconds per item adds about an hour to your processing.
So, again, look in all the corners.
Or, don't, write it easy, fire it up, run it over the weekend and be done with it.
If each String has a unique identifier then retrieval is O(1) only in case of hashmaps.
I wouldn't suggest the first method, because you would be splitting a string on every one of the 450*500m arrivals, unless your order is one string 500m times and then on to the next. As Will said, appending a numeric identifier to the strings and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use HashMap<String, Integer>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
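One possible way to read an appended identifier without splitting is sketched below. The wire format ("payload#id" with '#' as delimiter) is purely an assumption for the example, since the question does not say how the identifier is attached. The four-argument Integer.parseInt used here exists from Java 9 onwards.

```java
final class IdSuffixParser {
    /**
     * Extracts the identifier from a message in the hypothetical format
     * "<payload>#<id>", for example "some text#417".
     */
    static int extractId(String message) {
        int delimiter = message.lastIndexOf('#');
        if (delimiter < 0) {
            throw new IllegalArgumentException("no identifier in: " + message);
        }
        // Parses the digits in place (Java 9+), so no substring or String[] is created.
        return Integer.parseInt(message, delimiter + 1, message.length(), 10);
    }
}
```

Whether this beats a HashMap lookup still depends on the string lengths and should be measured with the real data, as suggested above.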
Splitting the string should work faster if you write your code well enough. In fact if you already have the int-id, I see no reason to send only the string and maintain a mapping.
Putting into HashMap would need hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs the code you write to append (prepending might be a bit more tricky) on sending end and to parse on receiving end.
OTOH, only 450 strings aren't a big deal, and if you're into it, writing your own hashing algo/function would actually be the most elegant and performant.
First of all, let me say that I have read the earlier question "Java HashMap performance optimization / alternative", and I have a similar question.
What I want to do is take a LOT of sentences from New York Times text, run them through the Stanford parser to get dependencies, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I increment its score in the HashMap by 1.
The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences( which is assuming 10 words in each sentence and about 3-4 dependences for each word which im storing) is about 300 000 entries in my hashmap.
How will i be able to increase the performance of my hashmap? What kind of hashkey can i use?
Thanks a lot
Martinos
EDIT 1:
OK, maybe I phrased my question wrongly. The byte arrays are not used in MY project but in the similar question from another person mentioned above. I don't know what they are used for, which is why I asked.
Secondly: I will not post code, as I think it would make things very hard to understand, but here is a sample:
With sentence : "i am going to bed" i have dependencies:
(i , am , -1)
(i, going, -2)
(i,to,-3)
(am, going, -1)
.
.
.
(to,bed,-1)
The dependencies of all sentences (1 000 000 sentences) will be stored in a HashMap.
If I see a dependency twice, I get the score of the existing dependency and add 1.
And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving from it) slows down on this line:
dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);
Can anyone tell me why?
Regards
Martinos
Trove has optimized hashmaps for the case where key or value are of primitive type.
However, much will still depend on smart choice of structure and hash code for your keys.
This part of your question is unclear: The task starts off really quickly, about 10 sentences a second but scales off quickly. At 30 000 sentences( which is assuming 10 words in each sentence and about 3-4 dependences for each word which im storing) is about 300 000 entries in my hashmap.. But you don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hashmaps are O(1) only in theory, in practice you will see some performance changes with size, due to less cache locality, and due to occasional jumps caused by rehashing. So, put() and get() times will not be constant, but still they should be close to that. Perhaps you are using the hashmap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size and you can't change that unless you change your algorithm.
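Since the question never shows the key type, the following is only a guess at what a well-behaved key could look like: a small immutable class for the (word, word, offset) triples from the sample, with equals and hashCode covering all three fields. The count update uses Map.merge, which also handles a dependency that has not been seen before. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical key type for triples such as (i, am, -1). */
final class Dependency {
    private final String head;
    private final String dependent;
    private final int offset;

    Dependency(String head, String dependent, int offset) {
        this.head = head;
        this.dependent = dependent;
        this.offset = offset;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dependency)) return false;
        Dependency other = (Dependency) o;
        return offset == other.offset
                && head.equals(other.head)
                && dependent.equals(other.dependent);
    }

    @Override
    public int hashCode() {
        // Uses all three fields so that different triples spread over different buckets.
        return Objects.hash(head, dependent, offset);
    }
}

final class DependencyBank {
    private final Map<Dependency, Integer> scores = new HashMap<>();

    /** Adds 1 to the score, starting from 1 for a dependency not seen before. */
    void record(Dependency dependency) {
        scores.merge(dependency, 1, Integer::sum);
    }

    /** Returns the current score, or 0 if the dependency was never recorded. */
    int score(Dependency dependency) {
        return scores.getOrDefault(dependency, 0);
    }
}
```

A key whose hashCode ignores some fields, or a mutable key that changes after insertion, are two common ways to end up with the kind of slowdown described in the question.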
Google 'fastutil' and you will find a superior solution for mapping object keys to scores.
Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps They are designed to basically keep a list of things that all map to the same key. That might solve your need.
How will i be able to increase the performance of my hashmap?
If it's taking more than 1 micro-second per get() or put(), you have a bug IMHO. You need to determine why it's taking as long as it is. Even in the worst case where every object has the same hashCode, you won't have performance this bad.
What kind of hashkey can i use?
That depends on the data type of the key. What is it?
and finally what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?
They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
A HashMap has an overloaded constructor which takes the initial capacity as input. The slowdown you see is caused by rehashing, during which the HashMap is effectively unusable. To prevent frequent rehashing, start with a HashMap of greater initial capacity. You can also set a load factor, which controls how full the table may get before it is rehashed.
public HashMap(int initialCapacity).
Pass the initial capacity to the HashMap during object construction. It is preferable to set a capacity to almost twice the number of elements you would want to add in the map during the course of execution of your program.
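A minimal sketch of that advice, assuming you can estimate the number of entries in advance (the helper name and the 300 000 figure in the usage note are only examples):

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMap {
    /**
     * Creates a HashMap with room for roughly twice the expected number of entries.
     * With the default load factor of 0.75, the table is only resized once the entry
     * count exceeds 0.75 times the capacity, so this leaves comfortable headroom.
     */
    static <K, V> Map<K, V> create(int expectedEntries) {
        return new HashMap<>(2 * expectedEntries);
    }
}
```

For example, PresizedMap.<String, Integer>create(300_000) should accept the 300 000 entries mentioned in the question without any intermediate resize. If the slowdown persists after this, resizing was probably not the cause and the key's hashCode is the next thing to check.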