Java optimization, gain from hashMap? - java

I've been given some lovely Java code that has a lot of things like this (in a loop that executes about 1.5 million times).
code = getCode();
for (int intCount = 1; intCount < vA.size() + 1; intCount++)
{
    oA = (A) vA.elementAt(intCount - 1);
    if (oA.code.trim().equals(code))
        currentName = oA.name;
}
Would I see a significant increase in speed from switching to something like the following?
code = getCode();
//AMap is a HashMap
strCurrentAAbbreviation = (String)AMap.get(code);
Edit: The size of vA is approximately 50. The trim shouldn't even be necessary, but it would definitely be nicer to call it 50 times instead of 50 * 1.5 million times. The items in vA are unique.
Edit: At the suggestion of several responders, I tested it. Results are at the bottom. Thanks guys.

There's only one way to find out.

Ok, Ok, I tested it.
Results follow for your enlightenment:
Looping: 18391ms
Hash: 218ms
Looping: 18735ms
Hash: 234ms
Looping: 18359ms
Hash: 219ms
I think I will be refactoring that bit ..
The framework:
import java.util.HashMap;
import java.util.Random;
import java.util.Vector;

public class OptimizationTest {
    private static Random r = new Random();

    public static void main(String[] args) {
        final long loopCount = 1000000;
        final int listSize = 55;
        long loopTime = TestByLoop(loopCount, listSize);
        long hashTime = TestByHash(loopCount, listSize);
        System.out.println("Looping: " + loopTime + "ms");
        System.out.println("Hash: " + hashTime + "ms");
    }

    public static long TestByLoop(long loopCount, int listSize) {
        Vector vA = buildVector(listSize);
        A oA;
        StopWatch sw = new StopWatch();
        sw.start();
        for (long i = 0; i < loopCount; i++) {
            String strCurrentStateAbbreviation;
            int j = r.nextInt(listSize);
            for (int intCount = 1; intCount < vA.size() + 1; intCount++) {
                oA = (A) vA.elementAt(intCount - 1);
                if (oA.code.trim().equals(String.valueOf(j)))
                    strCurrentStateAbbreviation = oA.value;
            }
        }
        sw.stop();
        return sw.getElapsedTime();
    }

    public static long TestByHash(long loopCount, int listSize) {
        HashMap hm = getMap(listSize);
        StopWatch sw = new StopWatch();
        sw.start();
        String strCurrentStateAbbreviation;
        for (long i = 0; i < loopCount; i++) {
            int j = r.nextInt(listSize);
            // look up by the String key; hm.get(j) would autobox to Integer and never match
            strCurrentStateAbbreviation = (String) hm.get(String.valueOf(j));
        }
        sw.stop();
        return sw.getElapsedTime();
    }

    private static HashMap getMap(int listSize) {
        HashMap hm = new HashMap();
        for (int i = 0; i < listSize; i++) {
            String code = String.valueOf(i);
            String value = getRandomString(2);
            hm.put(code, value);
        }
        return hm;
    }

    public static Vector buildVector(long listSize) {
        Vector v = new Vector();
        for (int i = 0; i < listSize; i++) {
            A a = new A();
            a.code = String.valueOf(i);
            a.value = getRandomString(2);
            v.add(a);
        }
        return v;
    }

    public static String getRandomString(int length) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < length; i++) {
            sb.append(getChar());
        }
        return sb.toString();
    }

    public static char getChar() {
        final String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        int i = r.nextInt(alphabet.length());
        return alphabet.charAt(i);
    }
}

Eh, there's a good chance that you would, yes. Retrieval from a HashMap is going to be constant time if you have good hash codes.
But the only way you can really find out is by trying it.

This depends on how large your map is, and how good your hashCode implementation is (such that you do not have collisions).

You should really do some real profiling to be sure if any modification is needed, as you may end up spending your time fixing something that is not broken.
What actually stands out to me a bit more than the elementAt call is the string trimming you are doing with each iteration. My gut tells me that might be a bigger bottleneck, but only profiling can really tell.
Good luck

I'd say yes, since the above appears to be a linear search over all vA.size() elements. How big is vA?

Why don't you use something like YourKit (or another profiler) to see just how expensive this part of the loop actually is?

Using a Map would certainly be an improvement that makes the code easier to maintain later on.
Whether you can use a map depends on whether the vector contains unique codes or not. The for loop given remembers the last object in the list with a given code; if codes can repeat, a hash map is not the solution.
For small (stable) list sizes, simply converting the list to an array of objects would show a performance increase on top of better readability.
If none of the above holds, at least use an iterator to inspect the list, which gives better readability and a (probable) performance increase.

Depends. How much memory you got?
I would guess much faster, but profile it.

I think the dominant factor here is how big vA is, since the loop needs to run n times, where n is the size of vA. With the map, there is no loop, no matter how big vA is. So if n is small, the improvement will be small. If it is huge, the improvement will be huge. This is especially true because even after finding the matching element the loop keeps going! So if you find your match at element 1 of a 2 million element list, you still need to check the last 1,999,999 elements!

Yes, it'll almost certainly be faster. Looping an average of 25 times (half-way through your 50) is slower than a hashmap lookup, assuming your vA contents are decently hashable.
However, speaking of your vA contents, you'll have to trim them as you insert them into your aMap, because aMap.get("somekey") will not find an entry whose key is "somekey ".
Actually, you should do that as you insert into vA, even if you don't switch to the hashmap solution.
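For illustration, a rough sketch of building such a map up front (this assumes the A class with code and name fields from the question's snippet, and that the trimmed codes are unique):
Map<String, String> aMap = new HashMap<String, String>();
for (int i = 0; i < vA.size(); i++) {
    A oA = (A) vA.elementAt(i);
    aMap.put(oA.code.trim(), oA.name);  // trim once here, not 1.5 million times
}
// then, inside the hot loop:
String currentName = aMap.get(getCode());
Building the map is a one-time O(n) cost; every lookup afterwards is effectively constant time.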

Related

USA Coding Olympiad 1st Timed Out and/or Too Much Memory

A few days ago, I participated in the USA coding olympiad for the first time and got the same error on all my submissions. I can't figure out why: it told me that I did incredibly well on the first test case, so I don't understand how the other 9 all timed out. Could someone please explain what is wrong with my code?
import java.io.*;

public class milkmeasure {
    private static int[] cows = {7, 7, 7};

    public static void main(String[] args) throws IOException {
        // initialize file I/O
        BufferedReader br = new BufferedReader(new FileReader("measurement.in"));
        PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter("measurement.out")));
        int N = Integer.parseInt(br.readLine());
        String[] entries = new String[N];
        for (int i = 0; i < N; i++) {
            entries[i] = br.readLine();
        }
        int topCow = 1;
        int finalN = 0;
        for (int i = 0; i < N; i++) {
            String lowEntry = entries[lownum(N, entries)];
            String name = lowEntry.substring(2, lowEntry.substring(2).indexOf(" ") + 2);
            int effect = Integer.parseInt(lowEntry.substring(lowEntry.substring(2).indexOf(" ") + 3));
            if (name.equals("Bessie")) { cows[1] += effect; }
            else if (name.equals("Elsie")) { cows[2] += effect; }
            else if (name.equals("Mildred")) { cows[0] += effect; }
            int newTop = findTop();
            if (newTop != topCow) { finalN++; }
            topCow = newTop;
            entries[lownum(N, entries)] = "101 ";
        }
        pw.println(finalN);
        pw.close();
    }

    private static int lownum(int N, String[] entries) {
        int lowNum = 101;
        int returnInt = 0;
        for (int i = 0; i < N; i++) {
            int a = Integer.parseInt(entries[i].substring(0, entries[i].indexOf(" ")));
            if (a < lowNum) {
                lowNum = a;
                returnInt = i;
            }
        }
        return returnInt;
    }

    private static int findTop() {
        int maxval = 0;
        int returnval = 0;
        for (int i = 0; i < 3; i++) {
            if (cows[i] >= maxval) {
                returnval += cows[i] * cows[i];
                maxval = cows[i];
            }
        }
        return returnval;
    }
}
Algorithmic complexity issue
For each entry, your main() method invokes lownum() (twice). lownum() scans all the entries to identify and return the one with the lowest day number. Overall, then, the complexity of your program scales at least as O(N²) in the number of entries.
That lower bound could be reduced to O(N log N) by sorting the entries once and then simply processing them in order.
With a reasonable bound on the maximum day number of the entries, and the given assurance that there is at most one entry per day, it could be reduced further to O(N) by assigning entries to an array or List at positions corresponding to their day numbers, so that no actual sorting is required.
It turns out that this is the main driver of your asymptotic complexity, so improving this lower bound allows you to improve the upper bound, too, all the way to O(N).
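As a rough sketch of that O(N) idea (not your original code; the Scanner-based parse and the field layout are assumptions), each entry is read and parsed exactly once, stored by its day number, and then the days are walked in order:
// Sketch only; assumes each line is "<day> <name> <change>", days run from 1 to 100,
// and there is at most one entry per day.
// Needs: import java.io.*; import java.util.Scanner;
static int countLeaderChanges() throws IOException {
    Scanner in = new Scanner(new File("measurement.in"));
    int n = in.nextInt();
    String[] nameByDay = new String[101];
    int[] changeByDay = new int[101];
    for (int i = 0; i < n; i++) {          // parse each entry exactly once
        int day = in.nextInt();
        nameByDay[day] = in.next();
        changeByDay[day] = in.nextInt();
    }
    int changes = 0;
    for (int day = 1; day <= 100; day++) {  // walk the days in order: no sorting, no rescans
        if (nameByDay[day] == null) continue;
        // apply changeByDay[day] to the named cow here, then
        // recompute the leading cow and increment 'changes' if it differs
    }
    return changes;
}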
General efficiency issues
Since the problem specifies that there will be at most one entry per day for 100 days, however, you are probably in a regime that is still strongly influenced by (in-)efficiencies that affect the cost coefficient. And you in fact have quite a few inefficiencies. Among them:
You parse each entry many times, scanning to split them into fields and converting some of those into integers. That's terribly wasteful. It would be far more efficient to parse each entry just once, and then store the parsed results. In fact, you can get the parsing at input for almost free by using a Scanner.
You invoke the lownum() method twice for each entry. The current implementation of this method is expensive, as discussed above, and nothing changes between the first and second invocation that would affect the result.
(minor) you perform full string comparisons on the cow names, even though it would be sufficient to look only at their first letters
(minor) you invoke separate methods to find the next entry and to compute the new top cow. Method invocation is comparatively expensive, so it is a bit inefficient to make large numbers of invocations of methods that do very little work. That's probably not a significant effect for your particular code, however.

Java - Return random index of specific character in string

So given a string such as: 0100101, I want to return a random single index of one of the positions of a 1 (1, 5, 6).
So far I'm using:
protected int getRandomBirthIndex(String s) {
    ArrayList<Integer> birthIndicies = new ArrayList<Integer>();
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == '1') {
            birthIndicies.add(i);
        }
    }
    return birthIndicies.get(Randomizer.nextInt(birthIndicies.size()));
}
However, it's causing a bottleneck in my code (45% of CPU time is spent in this method), as the strings are over 4000 characters long. Can anyone think of a more efficient way to do this?
If you're interested in a single index of one of the positions with 1, and assuming there is at least one 1 in your input, you can just do this:
String input = "0100101";
final int n = input.length();
Random generator = new Random();
char c = 0;
int i = 0;
do {
    i = generator.nextInt(n);
    c = input.charAt(i);
} while (c != '1');
System.out.println(i);
This solution is fast and does not consume much memory when, for example, 1 and 0 are distributed uniformly. As highlighted by @paxdiablo, it can perform poorly in some cases, for example when 1s are scarce.
You could use String.indexOf(int) to find each 1 (instead of iterating every character). I would also prefer to program to the List interface and to use the diamond operator <>. Something like,
private static Random rand = new Random();

protected int getRandomBirthIndex(String s) {
    List<Integer> birthIndicies = new ArrayList<>();
    int index = s.indexOf('1');
    while (index > -1) {
        birthIndicies.add(index);
        index = s.indexOf('1', index + 1);
    }
    return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Finally, if you need to do this many times, save the List as a field and re-use it (instead of calculating the indices every time). For example with memoization,
private static Random rand = new Random();
private static Map<String, List<Integer>> memo = new HashMap<>();

protected int getRandomBirthIndex(String s) {
    List<Integer> birthIndicies;
    if (!memo.containsKey(s)) {
        birthIndicies = new ArrayList<>();
        int index = s.indexOf('1');
        while (index > -1) {
            birthIndicies.add(index);
            index = s.indexOf('1', index + 1);
        }
        memo.put(s, birthIndicies);
    } else {
        birthIndicies = memo.get(s);
    }
    return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Well, one way would be to remove the creation of the list each time, by caching the list based on the string itself, assuming the strings are used more often than they're changed. If they're not, then caching methods won't help.
The caching method involves, rather than having just a string, having an object consisting of:
current string;
cached string; and
list based on the cached string.
You can provide a function to the clients to create such an object from a given string and it would set the string and the cached string to whatever was passed in, then calculate the list. Another function would be used to change the current string to something else.
The getRandomBirthIndex() function then receives this structure (rather than the string) and follows the rule set:
if the current and cached strings are different, set the cached string to be the same as the current string, then recalculate the list based on that.
in any case, return a random element from the list.
That way, if the list changes rarely, you avoid the expensive recalculation where it's not necessary.
In pseudo-code, something like this should suffice:
# Constructs fastie from string.
# Sets cached string to something other than
# that passed in (lazy list creation).
def fastie.constructor(string s):
me.current = s
me.cached = s + "!"
# Changes current string in fastie. No list update in
# case you change it again before needing an element.
def fastie.changeString(string s):
me.current = s
# Get a random index, will recalculate list first but
# only if necessary. Empty list returns index of -1.
def fastie.getRandomBirthIndex()
me.recalcListFromCached()
if me.list.size() == 0:
return -1
return me.list[random(me.list.size())]
# Recalculates the list from the current string.
# Done on an as-needed basis.
def fastie.recalcListFromCached():
if me.current != me.cached:
me.cached = me.current
me.list = empty
for idx = 0 to me.cached.length() - 1 inclusive:
if me.cached[idx] == '1':
me.list.append(idx)
You also have the option of speeding up the actual search for the 1 character by, for example, using indexOf() to locate occurrences with the underlying Java libraries rather than checking each character individually in your own code (again, pseudo-code):
def fastie.recalcListFromCached():
if me.current != me.cached:
me.cached = me.current
me.list = empty
idx = me.cached.indexOf('1')
while idx != -1:
me.list.append(idx)
idx = me.cached.indexOf('1', idx + 1)
This method can be used even if you don't cache the values. It's likely to be faster using Java's probably-optimised string search code than doing it yourself.
However, you should keep in mind that your supposed problem of spending 45% of time in that code may not be an issue at all. It's not so much the proportion of time spent there as it is the absolute amount of time.
By that, I mean it probably makes no difference what percentage of the time is being spent in that function if it finishes in 0.001 seconds (and you're not wanting to process thousands of strings per second). You should only really become concerned if the effects become noticeable to the user of your software somehow. Otherwise, optimisation is pretty much wasted effort.
You could also try random probing: the best-case complexity is O(1), but the worst case can be O(n) or even unbounded, since it depends purely on the random generator you are using.
If your Strings are very long and you're sure they contain a lot of 1s (or whatever character you're looking for), it's probably faster to randomly "poke around" in the String until you find what you are looking for, saving the time spent iterating over the String:
String s = "0100101";
int index = ThreadLocalRandom.current().nextInt(s.length());
while (s.charAt(index) != '1') {
    System.out.println("got not a 1, trying again");
    index = ThreadLocalRandom.current().nextInt(s.length());
}
System.out.println("found: " + index + " - " + s.charAt(index));
I'm not sure about the statistics, but in rare cases this solution might take much longer than the iterating solution; one such case is a long String with only very few occurrences of the search character.
And if the String doesn't contain the search character at all, this code will run forever!
One possibility is to use a short-circuited Fisher-Yates style shuffle. Create an array of the indices and start shuffling it. As soon as the next shuffled element points to a one, return that index. If you find you've iterated through indices without finding a one, then this string contains only zeros so return -1.
If the length of the strings is always the same, the array indices can be static as shown below, and doesn't need reinitializing on new invocations. If not, you'll have to move the declaration of indices into the method and initialize it each time with the correct index set. The code below was written for strings of length 7, such as your example of 0100101.
// delete this and uncomment below if string lengths vary
private static int[] indices = { 0, 1, 2, 3, 4, 5, 6 };

protected int getRandomBirthIndex(String s) {
    int tmp;
    /*
     * int[] indices = new int[s.length()];
     * for (int i = 0; i < s.length(); ++i) indices[i] = i;
     */
    for (int i = 0; i < s.length(); i++) {
        int j = randomizer.nextInt(indices.length - i) + i;
        if (j != i) { // swap to shuffle
            tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        if (s.charAt(indices[i]) == '1') {
            return indices[i];
        }
    }
    return -1;
}
This approach terminates quickly if 1's are dense, guarantees termination after s.length() iterations even if there aren't any 1's, and the locations returned are uniform across the set of 1's.

JAVA - Out Of Memory - Voxel World Generation

Currently I have the code below, and my game either uses way too much memory when generating (over a GB) or, if I set the memory limit low, it gives an OutOfMemoryError.
WORLD_SIZE_X & WORLD_SIZE_Z = 256;
WORLD_SIZE_Y = 128;
Does anyone know how I could improve this so it doesn't use so much RAM?
Thanks! :)
public void generate() {
    for (int xP = 0; xP < WORLD_SIZE_X; xP++) {
        for (int zP = 0; zP < WORLD_SIZE_Z; zP++) {
            for (int yP = 0; yP < WORLD_SIZE_Y; yP++) {
                try {
                    blocks[xP][yP][zP] = new BlockAir();
                    if (yP == 4) {
                        blocks[xP][yP][zP] = new BlockGrass();
                    }
                    if (yP < 4) {
                        blocks[xP][yP][zP] = new BlockDirt();
                    }
                    if (yP == 0) {
                        blocks[xP][yP][zP] = new BlockUnbreakable();
                    }
                } catch (Exception e) {}
            }
            // Tree Generation :D
            Random rX = new Random();
            Random rZ = new Random();
            if (rX.nextInt(WORLD_SIZE_X) < WORLD_SIZE_X / 6 && rZ.nextInt(WORLD_SIZE_Z) < WORLD_SIZE_Z / 6) {
                for (int j = 0; j < 5; j++) {
                    blocks[xP][5 + j][zP] = new BlockLog();
                }
            }
        }
    }
    generated = true;
}
Delay object creation until you really need to access one of these voxels. You can write a method (I'm assuming Block is the common superclass of all the Block classes):
Block getBlockAt( int x, int y, int z )
using code similar to what you have in your threefold loop, plus a hash map Map<Integer,Block> for storing the random stuff, e.g. trees: from x, y and z compute an integer (x*128 + y)*256 + z and use this as the key.
Also, consider that for all "air", "log", "dirt" blocks you may not need a separate object unless something must be changed at a certain block. Until then, share a single object of a kind.
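A rough sketch of how those two ideas might fit together (the class names follow the question; whether they share a Block superclass, and how the special-block map is filled, are assumptions for illustration):
// Sketch only: shared singletons for the common block types, plus a map for exceptions such as trees.
private static final Block AIR = new BlockAir();
private static final Block GRASS = new BlockGrass();
private static final Block DIRT = new BlockDirt();
private static final Block UNBREAKABLE = new BlockUnbreakable();
private final Map<Integer, Block> specialBlocks = new HashMap<>(); // filled during tree generation

public Block getBlockAt(int x, int y, int z) {
    Block special = specialBlocks.get((x * 128 + y) * 256 + z);
    if (special != null) return special;        // tree log or other override
    if (y == 0) return UNBREAKABLE;             // same rules as the generate() loop
    if (y < 4)  return DIRT;
    if (y == 4) return GRASS;
    return AIR;
}
With this, no 256×128×256 array of objects is ever allocated; the terrain is computed on demand and only the exceptions take memory.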
Since you only gave a small piece of code, I can offer two suggestions:
Compact the object size. It sounds very simple, and it is easy to do: if you have thousands of objects in memory and each one can be compacted to half its size, you save half the memory :).
Only assign values to the array when you actually need them. This doesn't work if you really need a fully assigned array, but otherwise assign as few elements as you can. If you can show me more code, I can help you more.
Are you sure the problem is in this method? Unless Block objects are really big, 256*256*128 ~= 8M objects should not require 1 GB ...
That said, if the blocks do not hold state, it would be more memory efficient to use an enum (or even a byte instead), as we would not need a separate object for each block:
enum Block {
    air, grass, dirt, log, unbreakable;
}
Block[][][] map = ...

count distinct values in big long array (performance issue)

I have this:
long hnds[] = new long[133784560]; // 133 million
Then I quickly fill the array (couple of ms) and then I somehow want to know the number of unique (i.e. distinct) values. Now, I don't even need this realtime, I just need to try out a couple of variations and see how many unique values each gives.
I tried e.g. this:
import org.apache.commons.lang3.ArrayUtils;
....
HashSet<Long> length = new HashSet<Long>(Arrays.asList(ArrayUtils.toObject(hnds)));
System.out.println("size: " + length.size());
and after waiting for half an hour it gives a heap space error (I have Xmx4000m).
I also tried initializing Long[] hnds instead of long[] hnds, but then the initial filling of the array takes forever. I also tried using a Set from the beginning when adding the values, but that takes forever too. Is there any way to count the distinct values of a long[] array without waiting forever? I'd write it to a file if I have to, just some way.
My best suggestion would be to use a library like fastutil (http://fastutil.di.unimi.it/) and then use the custom unboxed hash set:
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
System.out.println(new LongOpenHashSet(hnds).size());
(Also, by the way, if you can accept approximate answers, there are much more efficient algorithms you can try; see e.g. this paper for details.)
Just sort it and count.
int sz = 133784560;
Random randy = new Random();
long[] longs = new long[sz];
for (int i = 0; i < sz; i++) { longs[i] = randy.nextInt(10000000); }
Arrays.sort(longs);
long lastSeen = longs[0];
long count = 1; // the first element always counts as one distinct value
for (int i = 1; i < sz; i++) {
    if (longs[i] != lastSeen) count++;
    lastSeen = longs[i];
}
Takes about 15 seconds on my laptop.
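For reference, on Java 8+ the JDK can express the same count in one line, though LongStream.distinct() is likely to box the values internally, so for 133 million elements the explicit sort-and-count above is probably kinder on memory:
long distinct = java.util.Arrays.stream(hnds).distinct().count();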

Java : linear algorithm but non-linear performance drop, where does it come from?

I am currently having heavy performance issues with an application I'm developing in natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.
For every sentence, it does EXACTLY the same thing. The algorithms applied to gather the statistics do not evolve with previously read data and therefore stay the same.
The issue is that the processing time does not scale linearly at all: 1 min for 10k sentences, 1 hour for 100k, and days for 1M...
I tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, nor by IO waiting, nor by anything else! Java starts to be sluggish as the data grows, and I feel totally helpless.
If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this takes me 3 seconds over 100k sentences and 326 seconds over 1.6M, so a multiplier of about 110 instead of 16. As the data grows, it just gets worse...
Here is a code sample:
Note that I compare strings by reference (for efficiency reasons); this can be done thanks to the String.intern() method, which returns a unique reference per string. And the map is never re-hashed during the whole process for the numbers given above.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataGathering
{
    SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

    private void makeCounts(String path) throws IOException
    {
        BufferedReader file_src = new BufferedReader(new FileReader(path));
        String line_src;
        int n = 0;
        while (file_src.ready())
        {
            n++;
            if (n % 10000 == 0)
                System.out.print(".");
            if (n % 100000 == 0)
                System.out.println("");
            line_src = file_src.readLine();
            String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
            for (int i = 0; i < src_tokens.length; i++)
            {
                String src = src_tokens[i].intern();
                counts.bump(src);
            }
        }
        file_src.close();
    }

    public static void main(String[] args) throws IOException
    {
        String path = "some_big_file.txt";
        long timestamp = System.currentTimeMillis();
        DataGathering dg = new DataGathering();
        dg.makeCounts(path);
        long time = (System.currentTimeMillis() - timestamp) / 1000;
        System.out.println("\nElapsed time: " + time + "s.");
    }
}
public class SimpleRefCounter<K>
{
    static final double GROW_FACTOR = 2;
    static final double LOAD_FACTOR = 0.5;

    private int capacity;
    private Object[] keys;
    private int[] counts;
    private int key_count; // number of distinct keys stored
    private int total;     // sum of all counts

    public SimpleRefCounter()
    {
        this(1000);
    }

    public SimpleRefCounter(int capacity)
    {
        this.capacity = capacity;
        keys = new Object[capacity];
        counts = new int[capacity];
    }

    public synchronized int increase(K key, int n)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
        {
            key_count++;
            keys[id] = key;
            if (key_count > LOAD_FACTOR * capacity)
            {
                resize((int) (GROW_FACTOR * capacity));
            }
        }
        counts[id] += n;
        total += n;
        return counts[id];
    }

    public synchronized void resize(int capacity)
    {
        System.out.println("Resizing counters: " + this);
        this.capacity = capacity;
        Object[] new_keys = new Object[capacity];
        int[] new_counts = new int[capacity];
        for (int i = 0; i < keys.length; i++)
        {
            Object key = keys[i];
            int count = counts[i];
            int id = System.identityHashCode(key) % capacity;
            while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
                id = (id + 1) % capacity;
            new_keys[id] = key;
            new_counts[id] = count;
        }
        this.keys = new_keys;
        this.counts = new_counts;
    }

    public int bump(K key)
    {
        return increase(key, 1);
    }

    public int get(K key)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
            return 0;
        else
            return counts[id];
    }
}
Any explanations? Ideas? Suggestions?
...and, as said in the beginning, it is not for this toy example in particular but for the more general case. This same exploding behavior occurs for no reason in the more complex and larger program.
Rather than feeling helpless, use a profiler! That would tell you where exactly in your code all this time is spent.
Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.
For String.intern you might want to do your own single-threaded implementation.
However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.
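If you try that swap, note that String.hashCode() (unlike the identity hash, which in practice seems not to use the top bit, as the answer above observes) is frequently negative, so the index computation would also need something like a sign-bit mask, e.g.:
int id = (key.hashCode() & 0x7FFFFFFF) % capacity;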
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder if this is also contributing to a ton of objects that need to be garbage collected?
I would create the Pattern once, probably as a static field:
final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");
And then change the split line to this:
String[] src_tokens = TOKEN_PATTERN.split(line_src);
Or if you don't want to create it as a static field, at least only create it once as a local variable at the beginning of the method, before the while loop.
In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.
My advice: if you want a HashMap, just use a HashMap. They got it right for you.
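For the word-count example itself, a plain-HashMap version is only a handful of lines (a sketch assuming Java 8+ for Map.merge, the usual java.io/java.util imports, and a surrounding method that declares throws IOException, with path as in the question):
Map<String, Integer> counts = new HashMap<>();
try (BufferedReader in = new BufferedReader(new FileReader(path))) {
    String line;
    while ((line = in.readLine()) != null) {
        for (String token : line.split("[ ,.;:?!'\"]")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // insert 1, or add 1 to the existing count
            }
        }
    }
}
No intern() calls are needed, because HashMap compares keys with equals() rather than by reference.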
You are filling up the Perm Gen with the string intern. Have you tried viewing the -Xloggc output?
I would guess it's just memory filling up, growing outside the processor cache, memory fragmentation and the garbage collection pauses kicking in. Have you checked memory use at all? Tried to change the heap size the JVM uses?
Try to do it in Python, and run the Python module from Java.
Enter all the keys into a database, and then execute the following query:
select key, count(*)
from keys
group by key
Have you tried only iterating through the keys without doing any calculations? Is it faster? If yes, then go with option (2).
Can't you do this? You would get your answer in no time.
It's me, the original poster; something went wrong during registration, so I'm posting separately. I'll try the various suggestions given.
PS for Tom Hawtin: thanks for the hints; perhaps String.intern() takes more and more time as the vocabulary grows. I'll check that tomorrow, along with everything else.
