Java: fastest way to get matching range

I have a set of integer ranges, which represent lower and upper bounds of classes. For example:
0..500 xsmall
500..1000 small
1000..1500 medium
1500..2500 large
In my case there can be over 500 classes. These classes do not overlap, but they can differ in size.
I can implement finding the matching range as a simple linear search through a list, for example
class Range
{
    int lower;
    int upper;
    String category;

    boolean contains(int val)
    {
        return lower <= val && val < upper;
    }
}

public String getMatchingCategory(int val)
{
    for (Range r : listOfRanges)
    {
        if (r.contains(val))
        {
            return r.category;
        }
    }
    return null;
}
However, this seems slow, as I need on average N/2 look-ups. If the classes were equally sized, I could use division. Is there a standard technique to find the correct range faster?

What you are looking for is a SortedMap and its methods tailMap and firstKey. Check out the documentation for full details.
The advantage of this approach over plain arrays is in the ease of maintaining your ranges: you can insert/remove new boundaries at any point with almost no runtime cost; with arrays it means copying both parallel arrays in full.
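For illustration, here is a minimal sketch of that idea applied to the ranges above (my own sketch, not code from the original answer): the keys are the exclusive upper bounds, and the val + 1 accounts for tailMap being inclusive.
import java.util.SortedMap;
import java.util.TreeMap;

public class CategoryMap
{
    // Keys are the exclusive upper bounds of each range.
    private final SortedMap<Integer, String> byUpperBound = new TreeMap<>();

    public CategoryMap()
    {
        byUpperBound.put(500, "xsmall");   // 0 <= val < 500
        byUpperBound.put(1000, "small");   // 500 <= val < 1000
        byUpperBound.put(1500, "medium");  // 1000 <= val < 1500
        byUpperBound.put(2500, "large");   // 1500 <= val < 2500
    }

    public String getMatchingCategory(int val)
    {
        // tailMap(k) keeps keys >= k; we want the smallest upper bound strictly
        // greater than val, hence val + 1 (assumes val < Integer.MAX_VALUE).
        SortedMap<Integer, String> tail = byUpperBound.tailMap(val + 1);
        return tail.isEmpty() ? null : tail.get(tail.firstKey());
    }
}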
Update
I've written code for both variants and benchmarked it:
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BinarySearch
{
    static final int ARRAY_SIZE = 128, INCREMENT = 1000;
    static final int[] arrayK = new int[ARRAY_SIZE];
    static final String[] arrayV = new String[ARRAY_SIZE];
    static final SortedMap<Integer,String> map = new TreeMap<>();
    static {
        for (int i = 0, j = 0; i < arrayK.length; i++) {
            arrayK[i] = j; arrayV[i] = String.valueOf(j);
            map.put(j, String.valueOf(j));
            j += INCREMENT;
        }
    }
    final Random rnd = new Random();
    int rndInt;

    @Setup(Level.Invocation) public void nextInt() {
        rndInt = rnd.nextInt((ARRAY_SIZE-1)*INCREMENT);
    }

    @GenerateMicroBenchmark
    public String array() {
        final int i = Arrays.binarySearch(arrayK, rndInt);
        return arrayV[i >= 0 ? i : -(i+1)];
    }

    @GenerateMicroBenchmark
    public String sortedMap() {
        return map.tailMap(rndInt).values().iterator().next();
    }
}
Benchmark results:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
array thrpt 1 5 5 10.948 0.033 ops/usec
sortedMap thrpt 1 5 5 5.752 0.070 ops/usec
Interpretation: array search is only about twice as fast, and this factor is quite stable across array sizes: the results above give a factor of 1.9, and at other array sizes I measured a factor of about 2.05.

Here, Arrays.binarySearch is your friend. Simply put all the boundaries in and handle the possible cases. Assuming your ranges leave no holes between them, you only need to put the upper bounds in.
For your example
0..500 xsmall
500..1000 small
1000..1500 medium
1500..2500 large
you'd use
int[] boundaries = {500, 1000, 1500, 2500};
and look up the input. Handle the two cases (found/not found) and you're done. Forget about ranges; they're nice, but they don't fit your problem.
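As an illustration (my own sketch, not the answerer's code), the found/not-found handling could look like this:
import java.util.Arrays;

public class RangeLookup
{
    // Exclusive upper bounds, sorted ascending; categories[i] covers values
    // from the previous bound (inclusive) up to boundaries[i] (exclusive).
    static final int[] boundaries = {500, 1000, 1500, 2500};
    static final String[] categories = {"xsmall", "small", "medium", "large"};

    static String getMatchingCategory(int val)
    {
        int i = Arrays.binarySearch(boundaries, val);
        // Found: val equals an upper bound, so it belongs to the next range.
        // Not found: -(i+1) is the index of the first bound greater than val.
        int idx = (i >= 0) ? i + 1 : -(i + 1);
        return (idx < categories.length) ? categories[idx] : null; // null = out of range
    }

    public static void main(String[] args)
    {
        System.out.println(getMatchingCategory(499));  // xsmall
        System.out.println(getMatchingCategory(500));  // small
        System.out.println(getMatchingCategory(2499)); // large
        System.out.println(getMatchingCategory(3000)); // null
    }
}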
Update
I also wrote a benchmark, and no matter how I try, I'd lose my bet: the ratio is about 3 rather than 5. (Labels like S001024 in my results stand for size 1024.)

Related

How to return a random value within a range, from a Map with Integer and List of Longs in Java

I am trying to find a way to get a random value from a provided list of different ranges using ThreadLocalRandom, and return that one random value from a method. I've been trying different approaches, and not having much luck.
I've tried this:
private static final Long[][] values = {
{ 233L, 333L },
{ 377L, 477L },
{ 610L, 710L }
};
// This isn't correct
long randomValue = ThreadLocalRandom.current().nextLong(values[0][0]);
But I could not figure out how to get a random value out of it for a specific range, so thought I'd try the Map approach, I tried creating a Map of Integers and List of Longs:
private static Map<Integer, List<Long>> mapValues = new HashMap<>();
{{233L, 333L}, {377L, 477L}, {610L, 710L}} // ranges I want
I am not sure how to store those value ranges into the Map.
I've tried adding in values, for example:
// Need to get the other value for the range in here, in this case 333L
map.put(1, 233L);
I am not sure how to add the 333L to the List, I have searched and tried various things but always get errors, such as: found 'long', required List
I want the Integer in the Map to be an id for the associated range, for example, 1 for 233L-333L, so that I can tell it first, get a random Int key from the Map, for example 1, and then use ThreadLocalRandom.current().nextLong(origin, bound) where origin would be 233L and bound would be 333L, and then return a random value within that range of 233L-333L.
I am not sure if this is possible, or I am simply approaching this the wrong way - any guidance/help is appreciated!
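For reference, here is a small sketch (mine, not from the question or the answers) of how such a map could be populated and used:
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class RangeMapSketch
{
    public static void main(String[] args)
    {
        // Each id maps to a two-element list: {origin, bound}.
        Map<Integer, List<Long>> ranges = new HashMap<>();
        ranges.put(1, Arrays.asList(233L, 333L));
        ranges.put(2, Arrays.asList(377L, 477L));
        ranges.put(3, Arrays.asList(610L, 710L));

        // Pick a random id, then a random value inside that range (bound exclusive).
        int id = ThreadLocalRandom.current().nextInt(1, ranges.size() + 1);
        List<Long> range = ranges.get(id);
        long value = ThreadLocalRandom.current().nextLong(range.get(0), range.get(1));
        System.out.println("id " + id + " -> " + value);
    }
}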
It's pretty straightforward. Your long[][] will do fine.
First, select a random index, then select a long between values[index][0] and values[index][1].¹
long[][] values = {
{ 233L, 333L },
{ 377L, 477L },
{ 610L, 710L }
};
// Select a random index
int index = ThreadLocalRandom.current().nextInt(0, values.length);
// Determine lower and upper bounds
long min = values[index][0];
long max = values[index][1];
long rnd = ThreadLocalRandom.current().nextLong(min, max);
Of course, you could also abstract it away into some convenient classes.
Note that, for the distribution of values to be even, all ranges must have the same size (which seems to be the case in your code).
Implementation with even distribution
However, if you want to support different ranges while the distribution has to remain even, another approach is required.
We could generate a single random number whose upper bound is the total number of possible values, and then check which 'bucket' that number falls into.
To test that the distribution is indeed even, I generated such a random number a million times; each value occurred approximately 200,000 times.
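A minimal sketch of that bucket idea (my own illustration, using the ranges from the question):
import java.util.concurrent.ThreadLocalRandom;

public class EvenRangePicker
{
    static final long[][] values = {
        { 233L, 333L },
        { 377L, 477L },
        { 610L, 710L }
    };

    static long randomValue()
    {
        // Total number of candidate values (upper bounds exclusive).
        long total = 0;
        for (long[] range : values)
            total += range[1] - range[0];

        // One draw over the whole space, then map it back into its bucket.
        long r = ThreadLocalRandom.current().nextLong(total);
        for (long[] range : values) {
            long size = range[1] - range[0];
            if (r < size)
                return range[0] + r;
            r -= size;
        }
        throw new AssertionError("unreachable");
    }

    public static void main(String[] args)
    {
        for (int i = 0; i < 5; i++)
            System.out.println(randomValue());
    }
}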
¹ In my examples, the upper bound is exclusive. This is consistent with many methods from the Java standard libraries, like ThreadLocalRandom.nextLong(origin, bound) or LongStream.range(long start, long end).
int range = ThreadLocalRandom.current().nextInt(3);
long randomValue = ThreadLocalRandom.current().nextLong(values[range][0],values[range][1]);
This will work with the array solution you tried first: first you select the range, then you get the random value.
The easiest approach is the most straightforward one.
private static final long[][] values = {
    { 233L, 333L },
    { 377L, 477L },
    { 610L, 710L }
};
public static void main(String[] args) {
for (long v[] : values) {
long low = v[0];
long high = v[1];
System.out.println("Between " + low + " and " + high + " -> "
+ getRandom(low, high));
}
}
public static long getRandom(long low, long high) {
// add 1 to high to make range inclusive
return ThreadLocalRandom.current().nextLong(low, high + 1);
}

Java - Return random index of specific character in string

So given a string such as: 0100101, I want to return a random single index of one of the positions of a 1 (1, 5, 6).
So far I'm using:
protected int getRandomBirthIndex(String s) {
ArrayList<Integer> birthIndicies = new ArrayList<Integer>();
for (int i = 0; i < s.length(); i++) {
if ((s.charAt(i) == '1')) {
birthIndicies.add(i);
}
}
return birthIndicies.get(Randomizer.nextInt(birthIndicies.size()));
}
However, it's causing a bottleneck in my code (45% of the CPU time is spent in this method), as the strings are over 4000 characters long. Can anyone think of a more efficient way to do this?
If you're interested in a single index of one of the positions with 1, and assuming there is at least one 1 in your input, you can just do this:
String input = "0100101";
final int n=input.length();
Random generator = new Random();
char c=0;
int i=0;
do{
i = generator.nextInt(n);
c=input.charAt(i);
}while(c!='1');
System.out.println(i);
This solution is fast and does not consume much memory when 1s and 0s are distributed roughly uniformly. As highlighted by @paxdiablo, it can perform poorly in some cases, for example when 1s are scarce.
You could use String.indexOf(int) to find each 1 (instead of iterating every character). I would also prefer to program to the List interface and to use the diamond operator <>. Something like,
private static Random rand = new Random();
protected int getRandomBirthIndex(String s) {
List<Integer> birthIndicies = new ArrayList<>();
int index = s.indexOf('1');
while (index > -1) {
birthIndicies.add(index);
index = s.indexOf('1', index + 1);
}
return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Finally, if you need to do this many times, save the List as a field and re-use it (instead of calculating the indices every time). For example with memoization,
private static Random rand = new Random();
private static Map<String, List<Integer>> memo = new HashMap<>();
protected int getRandomBirthIndex(String s) {
List<Integer> birthIndicies;
if (!memo.containsKey(s)) {
birthIndicies = new ArrayList<>();
int index = s.indexOf('1');
while (index > -1) {
birthIndicies.add(index);
index = s.indexOf('1', index + 1);
}
memo.put(s, birthIndicies);
} else {
birthIndicies = memo.get(s);
}
return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Well, one way would be to remove the creation of the list each time, by caching the list based on the string itself, assuming the strings are used more often than they're changed. If they're not, then caching methods won't help.
The caching method involves, rather than having just a string, having an object consisting of:
current string;
cached string; and
list based on the cached string.
You can provide a function to the clients to create such an object from a given string and it would set the string and the cached string to whatever was passed in, then calculate the list. Another function would be used to change the current string to something else.
The getRandomBirthIndex() function then receives this structure (rather than the string) and follows the rule set:
if the current and cached strings are different, set the cached string to be the same as the current string, then recalculate the list based on that.
in any case, return a random element from the list.
That way, if the list changes rarely, you avoid the expensive recalculation where it's not necessary.
In pseudo-code, something like this should suffice:
# Constructs fastie from string.
# Sets cached string to something other than
# that passed in (lazy list creation).
def fastie.constructor(string s):
me.current = s
me.cached = s + "!"
# Changes current string in fastie. No list update in
# case you change it again before needing an element.
def fastie.changeString(string s):
me.current = s
# Get a random index, will recalculate list first but
# only if necessary. Empty list returns index of -1.
def fastie.getRandomBirthIndex()
me.recalcListFromCached()
if me.list.size() == 0:
return -1
return me.list[random(me.list.size())]
# Recalculates the list from the current string.
# Done on an as-needed basis.
def fastie.recalcListFromCached():
if me.current != me.cached:
me.cached = me.current
me.list = empty
for idx = 0 to me.cached.length() - 1 inclusive:
if me.cached[idx] == '1':
me.list.append(idx)
You also have the option of speeding up the actual searching for the 1 character by, for example, using indexOf() to locate them with the underlying Java libraries rather than checking each character individually in your own code (again, pseudo-code):
def fastie.recalcListFromCached():
if me.current != me.cached:
me.cached = me.current
me.list = empty
idx = me.cached.indexOf('1')
while idx != -1:
me.list.append(idx)
idx = me.cached.indexOf('1', idx + 1)
This method can be used even if you don't cache the values. It's likely to be faster using Java's probably-optimised string search code than doing it yourself.
However, you should keep in mind that your supposed problem of spending 45% of time in that code may not be an issue at all. It's not so much the proportion of time spent there as it is the absolute amount of time.
By that, I mean it probably makes no difference what percentage of the time being spent in that function if it finishes in 0.001 seconds (and you're not wanting to process thousands of strings per second). You should only really become concerned if the effects become noticeable to the user of your software somehow. Otherwise, optimisation is pretty much wasted effort.
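For completeness, here is a rough Java rendering of the caching wrapper sketched above (names are illustrative, not from the answer):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class OneIndexPicker
{
    private String current;
    private String cached;                        // string the index list was built from
    private final List<Integer> ones = new ArrayList<>();

    OneIndexPicker(String s)
    {
        this.current = s;
        this.cached = null;                       // force lazy list creation on first use
    }

    void changeString(String s)
    {
        this.current = s;                         // list is rebuilt only when next needed
    }

    int getRandomBirthIndex()
    {
        if (!current.equals(cached)) {
            cached = current;
            ones.clear();
            for (int idx = cached.indexOf('1'); idx != -1; idx = cached.indexOf('1', idx + 1))
                ones.add(idx);
        }
        if (ones.isEmpty())
            return -1;
        return ones.get(ThreadLocalRandom.current().nextInt(ones.size()));
    }
}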
You can even try this with best-case complexity O(1); in the worst case it may degrade to O(n), or in the extreme never terminate, as it depends purely on the Randomizer function you are using.
private static Random rand = new Random();
protected int getRandomBirthIndex(String s) {
List<Integer> birthIndicies = new ArrayList<>();
int index = s.indexOf('1');
while (index > -1) {
birthIndicies.add(index);
index = s.indexOf('1', index + 1);
}
return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
If your Strings are very long and you're sure they contain a lot of 1s (or whatever character you're looking for), it's probably faster to randomly "poke around" in the String until you find what you are looking for. That way you save the time of iterating over the String:
String s = "0100101";
int index = ThreadLocalRandom.current().nextInt(s.length());
while(s.charAt(index) != '1') {
System.out.println("got not a 1, trying again");
index = ThreadLocalRandom.current().nextInt(s.length());
}
System.out.println("found: " + index + " - " + s.charAt(index));
I'm not sure about the statistics, but in rare cases this solution might take much longer than the iterating solution. One such case is a long String with only very few occurrences of the search string.
If the source String doesn't contain the search String at all, this code will run forever!
One possibility is to use a short-circuited Fisher-Yates style shuffle. Create an array of the indices and start shuffling it. As soon as the next shuffled element points to a one, return that index. If you find you've iterated through indices without finding a one, then this string contains only zeros so return -1.
If the length of the strings is always the same, the array of indices can be static as shown below and doesn't need reinitializing on new invocations. If not, you'll have to move the declaration of indices into the method and initialize it each time with the correct index set. The code below was written for strings of length 7, such as your example of 0100101.
// delete this and uncomment below if string lengths vary
private static int[] indices = { 0, 1, 2, 3, 4, 5, 6 };
protected int getRandomBirthIndex(String s) {
int tmp;
/*
* int[] indices = new int[s.length()];
* for (int i = 0; i < s.length(); ++i) indices[i] = i;
*/
for (int i = 0; i < s.length(); i++) {
int j = randomizer.nextInt(indices.length - i) + i;
if (j != i) { // swap to shuffle
tmp = indices[i];
indices[i] = indices[j];
indices[j] = tmp;
}
if ((s.charAt(indices[i]) == '1')) {
return indices[i];
}
}
return -1;
}
This approach terminates quickly if 1's are dense, guarantees termination after s.length() iterations even if there aren't any 1's, and the locations returned are uniform across the set of 1's.

Fast real valued random generator in java

java.util.Random.nextDouble() is slow for me and I need something really fast.
I did some Google searching and found only integer-based fast random generators. Is there anything for real numbers from the interval [0, 1)?
If you need something fast and have access to Java 8, I can recommend java.util.SplittableRandom. It is faster (~twice as fast) and has a better statistical distribution.
If you need an even faster or better algorithm, I can recommend one of these specialized XorShift variants:
XorShift128PlusRandom (faster & better)
XorShift1024StarPhiRandom (similar speed, even longer period)
Information on these algorithms and their quality can be found in this big PRNG comparison.
I made an independent performance comparison; you can find the detailed results and the code here: github.com/tobijdc/PRNG-Performance
Furthermore, Apache Commons RNG has a performance test of all their implemented algorithms.
TLDR
Never use java.util.Random, use java.util.SplittableRandom.
If you need a faster or better PRNG, use an XorShift variant.
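A minimal usage sketch (my own, just to show the drop-in call):
import java.util.SplittableRandom;

public class FastDoubles
{
    public static void main(String[] args)
    {
        SplittableRandom rnd = new SplittableRandom();
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++)
            sum += rnd.nextDouble();             // uniform in [0, 1)
        System.out.println("mean ~ " + sum / 1_000_000);
    }
}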
You could modify an integer based RNG to output doubles in the interval [0,1) in the following way:
double randDouble = randInt()/(RAND_INT_MAX + 1.0)
However, if randInt() generates a 32-bit integer this won't fill all the bits of the double, because a double has 53 mantissa bits. You could obviously generate two random integers to fill all mantissa bits. Or you could take a look at the source code of the Random.nextDouble() implementation. It almost surely uses an integer RNG and simply converts the output to a double.
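For example, a common way to turn a 64-bit random value into a double in [0, 1) is to keep the top 53 bits and scale (a sketch; the same construction appears in the LCG benchmark further down):
// Keep the top 53 bits of a random long and scale by 2^-53,
// giving a uniform double in [0, 1).
static double toUnitDouble(long randomBits)
{
    return (randomBits >>> 11) * 0x1.0p-53;
}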
As for performance, the best-performing random number generators are linear congruential generators. Of these, I recommend using the Numerical Recipes generator. You can see more information about LCGs from Wikipedia: http://en.wikipedia.org/wiki/Linear_congruential_generator
However, if you want good randomness and performance is not that important, I think Mersenne Twister is the best choice. It also has a Wikipedia page: http://en.wikipedia.org/wiki/Mersenne_Twister
There is a recent random number generator called PCG, explained in http://www.pcg-random.org/. This is essentially a post-processing step for LCG that improves the randomness of the LCG output. Note that PCG is slower than LCG because it is simply a post-processing step for LCG. Thus, if performance is very important and randomness quality not that important, you want to use LCG instead of PCG.
Note that none of the generators I mentioned are cryptographically secure. If you need to use the values for cryptographic applications, you should use a cryptographically secure algorithm. However, I don't really believe that doubles would be used for cryptography.
Note that all these solutions miss a fundamental fact (that I wasn't aware of up to a few weeks ago): passing from 64 bits to a double using a multiplication is a major loss of time. The implementation of xorshift128+ and xorshift1024+ in the DSI utilities (http://dsiutils.di.unimi.it/) use direct bit manipulation and the results are impressive.
See the benchmarks for nextDouble() at
http://dsiutils.di.unimi.it/docs/it/unimi/dsi/util/package-summary.html#package.description
and the quality reported at
http://prng.di.unimi.it/
Imho you should just accept juhist's answer - here's why.
nextDouble is slow because it makes two calls to next() - it's written right there in the documentation.
So your best options are:
use a fast 64 bit generator, convert that to double (MT, PCG, xorshift*, ISAAC64, ...)
generate doubles directly
Here's an overly long benchmark with java's Random, an LCG (as bad as java.util.Random), and Marsaglia's universal generator (the version generating doubles).
import java.util.*;
public class d01 {
private static long sec(double x)
{
return (long) (x * (1000L*1000*1000));
}
// ns/op: nanoseconds to generate a double
// loop until it takes a second.
public static double ns_op(Random r)
{
long nanos = -1;
int n;
for(n = 1; n < 0x12345678; n *= 2) {
long t0 = System.nanoTime();
for(int i = 0; i < n; i++)
r.nextDouble();
nanos = System.nanoTime() - t0;
if(nanos >= sec(1))
break;
if(nanos < sec(0.1))
n *= 4;
}
return nanos / (double)n;
}
public static void bench(Random r)
{
System.out.println(ns_op(r) + " " + r.toString());
}
public static void main(String[] args)
{
for(int i = 0; i < 3; i++) {
bench(new Random());
bench(new LCG64(new Random().nextLong()));
bench(new UNI_double(new Random().nextLong()));
}
}
}
// straight from wikipedia
class LCG64 extends java.util.Random {
private long x;
public LCG64(long seed) {
this.x = seed;
}
@Override
public long nextLong() {
x = x * 6364136223846793005L + 1442695040888963407L;
return x;
}
@Override
public double nextDouble(){
return (nextLong() >>> 11) * (1.0/9007199254740992.0);
}
@Override
protected int next(int nbits)
{
throw new RuntimeException("TODO");
}
}
class UNI_double extends java.util.Random {
// Marsaglia's UNIversal random generator extended to double precision
// G. Marsaglia, W.W. Tsang / Statistics & Probability Letters 66 (2004) 183 – 187
private final double[] U = new double[98];
static final double r=9007199254740881.0/9007199254740992.;
static final double d=362436069876.0/9007199254740992.0;
private double c=0.;
private int i=97,j=33;
@Override
public double nextDouble(){
double x;
x=U[i]- U[j];
if(x<0.0)
x=x+1.0;
U[i]=x;
if(--i==0) i=97;
if(--j==0) j=97;
c=c-d;
if(c<0.0)
c=c+r;
x=x-c;
if(x<0.)
return x+1.;
return x;
}
//A two-seed function for filling the static array U[98] one bit at a time
private
void fillU(int seed1, int seed2){
double s,t;
int x,y,i,j;
x=seed1;
y=seed2;
for (i=1; i<98; i++){
s= 0.0;
t=0.5;
for (j=1; j<54; j++){
x=(6969*x) % 65543;
// typo in the paper:
//y=(8888*x) % 65579;
//used for the demo on the last page of the paper.
y=(8888*y) % 65579;
if(((x^y)& 32)>0)
s=s+t;
t=.5*t;
}
if(x == 0)
throw new IllegalArgumentException("x");
if(y == 0)
throw new IllegalArgumentException("y");
U[i]=s;
}
}
// Marsaglia's test code is useless because of a typo in fillU():
// x=(6969*x)%65543;
// y=(8888*x)% 65579;
public UNI_double(long seed)
{
Random r = new Random(seed);
for(;;) {
try {
fillU(r.nextInt(), r.nextInt());
break;
} catch(Exception e) {
// loop again
}
}
}
@Override
protected int next(int nbits)
{
throw new RuntimeException("TODO");
}
}
You could create an array of random doubles when you initialize your program and then just cycle through it. This is much faster, but the random values repeat themselves.

Coefficient Correlation Over a Large Binary Image Data-Set - Slow Performance

I am trying to build an OCR by calculating the Coefficient Correlation between characters extracted from an image and every character I have pre-stored in a database. My implementation is based on Java, and the pre-stored characters are loaded into an ArrayList at the start of the application, i.e.
ArrayList<byte []> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for(byte [] extractedCharacter : extractedCharacters)
    for(byte [] storedCharacter : storedCharacters)
    {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > maxCorr)
            maxCorr = corr;
    }
...
...
public double findCorrelation(byte [] extractedCharacter, byte [] storedCharacter)
{
double mag1 = 0, mag2 = 0, corr = 0;
for(int i=0; i < extractedCharacter.length; i++)
{
mag1 += extractedCharacter[i] * extractedCharacter[i];
mag2 += storedCharacter[i] * storedCharacter[i];
corr += extractedCharacter[i] * storedCharacter[i];
} // for
corr /= Math.sqrt(mag1*mag2);
return corr;
}
The number of extractedCharacters are around 100-150 per image but the database has 15600 stored binary characters. Checking the coefficient correlation between every extracted character and every stored character has an impact on the performance as it needs around 15-20 seconds to complete for every image, with an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another approach to building this that brings similar results? (The results produced by comparing every character with such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
    ArrayList<byte []> storedCharacters, extractedCharacters;
    storedCharacters = load_all_characters_from_database();
    extractedCharacters = extract_characters_from_image();
    // Calculate the coefficient between every extracted character
    // and every character in the database.
    computeNorms(storedCharacters, extractedCharacters);
    double maxCorr = -1;
    for (int extCharNo = 0; extCharNo < extractedCharacters.size(); extCharNo++)
        for (int strCharIndex = 0; strCharIndex < storedCharacters.size(); strCharIndex++)
        {
            double corr = findCorrelation(extractedCharacters.get(extCharNo),
                                          storedCharacters.get(strCharIndex),
                                          strCharIndex, extCharNo);
            if (corr > maxCorr)
                maxCorr = corr;
        }
}
private static double[] storedNorms;
private static double[] extractedNorms;
// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo){
final int dotProduct = dotProduct(arr1, arr2);
final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
return corr;
}
public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
storedNorms = computeInvNorms(storedCharacters);
extractedNorms = computeInvNorms(extractedCharacters);
}
private static double[] computeInvNorms(List<byte []> a) {
final double[] result = new double[a.size()];
for (int i=0; i < result.length; ++i)
result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
return result;
}
private static int dotProduct(byte[] arr1, byte[] arr2) {
int dotProduct = 0;
for(int i = 0; i< arr1.length; i++)
dotProduct += arr1[i] * arr2[i];
return dotProduct;
}
Nowadays, it's hard to find a CPU with a single core (even in mobiles). As the tasks are nicely separated, you can parallelize the work across threads with only a few lines. So I'd go for it, though the gain is limited.
In case you really mean cross-correlation, then a transform like DFT or DCT could help. They surely do for big images, but with yours being 12x16, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you actually don't need to compute the correlation; most of the time you only need to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may lead to some optimizations or not, depending on what the images look like.
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that due to the computation of three products in the loop, there's enough instruction level parallelism, so a manual loop unrolling like in my above question is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step there's a single sum computed in the relevant step and the problem linked above applies. And so does its solution (unrolling manually).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the values to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations are computed saves time. Even better is to make use of the fact that this conversion for storedCharacters can be done beforehand.
Manual loop unrolling twice helps but unrolling more doesn't.
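A rough sketch of those two changes (my own illustration, not the answerer's benchmark code): widen the stored characters to int[] once up front, and unroll the dot product by two.
import java.util.List;

class DotProductHelpers
{
    // Widen each byte[] to an int[] once, so the inner loop avoids
    // repeated byte-to-int conversions.
    static int[][] widen(List<byte[]> chars)
    {
        int[][] out = new int[chars.size()][];
        for (int i = 0; i < out.length; i++) {
            byte[] src = chars.get(i);
            int[] dst = new int[src.length];
            for (int j = 0; j < src.length; j++)
                dst[j] = src[j];
            out[i] = dst;
        }
        return out;
    }

    // Dot product unrolled by two; assumes both arrays have the same length.
    static int dotProduct(int[] a, int[] b)
    {
        int s0 = 0, s1 = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
        }
        if (i < a.length)
            s0 += a[i] * b[i];   // odd-length tail
        return s0 + s1;
    }
}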

Java : linear algorithm but non-linear performance drop, where does it come from?

I am currently having heavy performance issues with an application I'm developing in natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.
And for every sentence, it does EXACTLY the same thing. The algorithms applied to gather the statistics do not evolve with previously read data and therefore stay the same.
The issue is that the processing time does not evolve linearly at all: 1 min for 10k sentences, 1 hour for 100k and days for 1M...
I tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, nor by IO waiting, nor by anything! Java starts to be sluggish when the data increases and I feel totally helpless.
If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this takes me 3 seconds over 100k sentences and 326 seconds over 1.6M, so a multiplier of about 110 instead of 16. As the data grows, it just gets worse...
Here is a code sample:
Note that I compare strings by reference (for efficiency reasons); this can be done thanks to the String.intern() method, which returns a unique reference per string. And the map is never re-hashed during the whole process for the numbers given above.
public class DataGathering
{
SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);
private void makeCounts(String path) throws IOException
{
BufferedReader file_src = new BufferedReader(new FileReader(path));
String line_src;
int n = 0;
while (file_src.ready())
{
n++;
if (n % 10000 == 0)
System.out.print(".");
if (n % 100000 == 0)
System.out.println("");
line_src = file_src.readLine();
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
for (int i = 0; i < src_tokens.length; i++)
{
String src = src_tokens[i].intern();
counts.bump(src);
}
}
file_src.close();
}
public static void main(String[] args) throws IOException
{
String path = "some_big_file.txt";
long timestamp = System.currentTimeMillis();
DataGathering dg = new DataGathering();
dg.makeCounts(path);
long time = (System.currentTimeMillis() - timestamp) / 1000;
System.out.println("\nElapsed time: " + time + "s.");
}
}
public class SimpleRefCounter<K>
{
static final double GROW_FACTOR = 2;
static final double LOAD_FACTOR = 0.5;
private int capacity;
private int key_count;   // number of distinct keys stored
private long total;      // running total of all counts
private Object[] keys;
private int[] counts;
public SimpleRefCounter()
{
this(1000);
}
public SimpleRefCounter(int capacity)
{
this.capacity = capacity;
keys = new Object[capacity];
counts = new int[capacity];
}
public synchronized int increase(K key, int n)
{
int id = System.identityHashCode(key) % capacity;
while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
id = (id + 1) % capacity;
if (keys[id] == null)
{
key_count++;
keys[id] = key;
if (key_count > LOAD_FACTOR * capacity)
{
resize((int) (GROW_FACTOR * capacity));
}
}
counts[id] += n;
total += n;
return counts[id];
}
public synchronized void resize(int capacity)
{
System.out.println("Resizing counters: " + this);
this.capacity = capacity;
Object[] new_keys = new Object[capacity];
int[] new_counts = new int[capacity];
for (int i = 0; i < keys.length; i++)
{
Object key = keys[i];
int count = counts[i];
int id = System.identityHashCode(key) % capacity;
while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
id = (id + 1) % capacity;
new_keys[id] = key;
new_counts[id] = count;
}
this.keys = new_keys;
this.counts = new_counts;
}
public int bump(K key)
{
return increase(key, 1);
}
public int get(K key)
{
int id = System.identityHashCode(key) % capacity;
while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
id = (id + 1) % capacity;
if (keys[id] == null)
return 0;
else
return counts[id];
}
}
Any explanations? Ideas? Suggestions?
...and, as said in the beginning, it is not for this toy example in particular but for the more general case. This same exploding behavior occurs for no reason in the more complex and larger program.
Rather than feeling helpless, use a profiler! That would tell you exactly where in your code all this time is spent.
Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.
For String.intern you might want to do your own single-threaded implementation.
However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.
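A sketch of that change (hypothetical; the masking just keeps the slot index non-negative, since String.hashCode can be negative):
// Possible replacement for the probe-start computation in SimpleRefCounter:
// use the content-based hash and mask off the sign bit.
static int startSlot(String key, int capacity)
{
    return (key.hashCode() & 0x7fffffff) % capacity;
}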
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder if this is also contributing to a ton of objects that need to be garbage collected?
I would create the Pattern once, probably as a static field:
final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");
And then change the split line do this:
String[] src_tokens = TOKEN_PATTERN.split(line_src);
Or if you don't want to create it as a static field, at least only create it once as a local variable at the beginning of the method, before the while loop.
In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.
My advice: if you want a HashMap, just use a HashMap. They got it right for you.
You are filling up the Perm Gen with the string intern. Have you tried viewing the -Xloggc output?
I would guess it's just memory filling up, growing outside the processor cache, memory fragmentation and the garbage collection pauses kicking in. Have you checked memory use at all? Tried to change the heap size the JVM uses?
Try to do it in Python, and run the Python module from Java.
Alternatively, enter all the keys into a database, and then execute the following query:
select key, count(*)
from keys
group by key
Have you tried to only iterate through the keys without doing any calculations? Is it faster? If yes, then go with option (2).
Can't you do this? You could get your answer in no time.
It's me, the original poster; something went wrong during registration, so I'm posting separately. I'll try the various suggestions given.
PS for Tom Hawtin: thanks for the hints. Perhaps String.intern() takes more and more time as the vocabulary grows; I'll check that tomorrow, along with everything else.
