I'm working on an empirical analysis of merge sort (sorting strings) for school, and I've run into a strange phenomenon that I can't explain or find an explanation of. When I run my code, I capture the running time using the built-in System.nanoTime() method, and for some reason, at a certain input size, it actually takes less time to execute the sort routine than it does with a smaller input size.
My algorithm is just a basic merge sort, and my test code is simple too:
//Get current system time
long start = System.nanoTime();
//Perform mergesort procedure
a = q.sort(a);
//Calculate total elapsed sort time
long time = System.nanoTime()-start;
The output I got for elapsed time when sorting 900 strings was: 3928492ns
For 1300 strings it was: 3541923ns
Both of those figures are the average of about 20 trials, so the result is quite consistent. After 1300 strings, the execution time continues to grow as expected. I suspect there is some peak input size where this phenomenon is most noticeable.
So, my question: what might be causing this sudden increase in speed? I was thinking there might be some sort of optimization going on for arrays holding larger amounts of data, although 1300 items in an array is hardly large.
Some info:
Compiler: Java version 1.7.0_07
Algorithm: Basic recursive merge sort (using arrays)
Input type: Strings 6-10 characters long, shuffled (random order)
Am I missing anything?
You're trying to write a microbenchmark, but the code you've posted so far doesn't resemble a well-constructed sample. To get reliable numbers, follow the rules stated here: How do I write a correct micro-benchmark in Java?.
The reason your code gets faster is that after some iterations of your method, the JIT compiler kicks in and optimizes it, so subsequent runs execute faster, even when processing larger data.
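You can see this effect yourself with a quick sketch like the one below (illustrative only; it uses Arrays.sort as a stand-in for your merge sort). The first few trials will typically report noticeably higher times than the later ones:

import java.util.Arrays;
import java.util.Random;

public class JitWarmupDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        String[] data = new String[1300];
        for (int i = 0; i < data.length; i++) {
            // Random base-36 strings as stand-ins for the real input
            data[i] = Long.toString(rnd.nextLong(), 36);
        }
        for (int trial = 0; trial < 20; trial++) {
            String[] copy = data.clone();   // fresh unsorted copy each time
            long start = System.nanoTime();
            Arrays.sort(copy);
            long elapsed = System.nanoTime() - start;
            System.out.println("trial " + trial + ": " + elapsed + " ns");
        }
    }
}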
Some recommendations:
Use several array/list inputs of different sizes. Good values for this kind of analysis are 100, 1000 (1k), 10000 (10k), 100000 (100k) and 1000000 (1m), plus random sizes in between. You will get more accurate results from measurements that take longer to run.
Use arrays/lists of different objects. Create a POJO, make it implement the Comparable interface, then execute your sort method on it. As explained above, use several different input sizes.
Not directly related to your question, but execution results depend on the JDK used. Eclipse is just an IDE and can work with different JDK versions; e.g. at my workplace I use JDK 6u30 for company projects, but for personal projects (like proofs of concept) I use JDK 7u40.
Related
I was solving a Codeforces problem yesterday. The problem's URL is this
I will briefly summarize the question below.
Given a binary string, divide it into a minimum number of subsequences in such a way that each character of the string belongs to exactly one subsequence and each subsequence looks like "010101..." or "101010..." (i.e. the subsequence must not contain two adjacent zeros or ones).
Now, for this problem, I submitted a solution yesterday during the contest. This is the solution. It was accepted provisionally, but on the final test cases it got a Time Limit Exceeded status.
So today I submitted another solution, and this one passed all the cases.
In the first solution I used HashSet, and in the second one I used LinkedHashSet. I want to know: why didn't HashSet clear all the cases? Does this mean I should use LinkedHashSet whenever I need a Set implementation? I saw this article and found that HashSet performs better than LinkedHashSet. So why doesn't my code work here?
This question would probably get more replies on Codeforces, but I'll answer it here anyway.
After a contest ends, Codeforces allows other users to "hack" solutions by writing custom inputs to run on other users' programs. If the defending user's program runs slowly on the custom input, the status of their code submission will change from "Accepted" to "Time Limit Exceeded".
The reason why your code, specifically, changed from "Accepted" to "Time Limit Exceeded" is that somebody created an "anti-hash test" (a test on which your hash function results in many collisions) on which your program ran slower than usual. If you're interested in how such tests are generated, you can find several posts on Codeforces, like this one: https://codeforces.com/blog/entry/60442.
As linked by @Photon, there's a post on Codeforces explaining why you should avoid using Java's HashSet and HashMap: https://codeforces.com/blog/entry/4876. It is essentially because of anti-hash tests. Adding the extra log(n) factor from a balanced BST (by using TreeSet or TreeMap) is often not so bad: in many cases it won't make your code time out, and it gives you protection from anti-hash tests.
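Another common defense, if you want to keep O(1) operations, is to salt the hash with a value chosen at runtime, so collisions can't be precomputed. A hypothetical wrapper for int keys might look like this (the mixing constants are from splitmix64; the class itself is just an illustration, not a standard API):

import java.util.concurrent.ThreadLocalRandom;

final class SaltedInt {
    // Random salt chosen at JVM startup, so a hacker can't precompute collisions
    private static final long SALT = ThreadLocalRandom.current().nextLong();
    private final int value;

    SaltedInt(int value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof SaltedInt && ((SaltedInt) o).value == value;
    }

    @Override public int hashCode() {
        long x = value + SALT;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;   // splitmix64-style mixing
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return (int) (x ^ (x >>> 31));
    }
}

You would then store new SaltedInt(x) values in a plain HashSet<SaltedInt> instead of raw Integers.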
How do you determine whether your algorithm is fast enough to afford the log(n) factor? I guess this comes with some experience, but most people suggest doing a rough calculation. Most online judges (including Codeforces) show the time your program is allowed to run on a particular problem (usually somewhere between one and four seconds), and you can use 10^9 constant-time operations per second as a rule of thumb.
I'm trying to write a Java program that generates a million random numbers and then sorts them using Bubble Sort, Insertion Sort and Merge Sort. Finally, I want to display the runtime of each sorting algorithm in nanoseconds. Is there a class in Java that allows me to do this?
You can use System.nanoTime() to measure the time spent in your code. (Use the time difference between the start and end of the code under test.) Be aware, however, that the actual time measurement probably does not have nanosecond resolution.
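A minimal sketch of that pattern is below. Arrays.sort stands in for your own algorithms; you would pass your bubbleSort, insertionSort and mergeSort methods to time() the same way (those names are placeholders for methods you'd write yourself):

import java.util.Arrays;
import java.util.Random;
import java.util.function.Consumer;

public class SortTimer {
    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        time("Arrays.sort (stand-in)", Arrays::sort, data);
        // time("bubble sort", SortTimer::bubbleSort, data);  // your own method
    }

    static void time(String name, Consumer<int[]> sorter, int[] data) {
        int[] copy = data.clone();   // give each algorithm a fresh unsorted copy
        long start = System.nanoTime();
        sorter.accept(copy);
        long elapsed = System.nanoTime() - start;
        System.out.println(name + ": " + elapsed + " ns");
    }
}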
You might want to look into a good Java benchmarking framework for measuring your code's performance. A general web search will turn up quite a few good candidates. Doing timing tests is not at all an easy thing to get right.
I am using the OptaPlanner to optimize a chained planning problem which is similar to the VehicleRoutingExample. My planning entities have a planning variable which is another planning entity.
Now I am testing a huge dataset with ca. 1500 planning entities.
I am using an EasyJavaScoreCalculator to get a HardSoftScore. The score includes several time-related and other factors, which are calculated in loops.
My problem is that the ConstructionHeuristic (FIRST_FIT or FIRST_FIT_DECREASING) takes more than ten minutes to initialize a solution.
I have already reduced the number of constraints and the number of loops with which I am calculating the score, but it did not have a real effect on the running duration.
Is there a way to make the CH take less time? (I thought it would take less time than the LocalSearch stuff, but it doesn't…)
EasyJavaScoreCalculator is very slow and doesn't scale beyond a few hundred entities. Use an IncrementalJavaScoreCalculator or a Drools calculator instead. To see the difference for yourself, take the VRP example and switch between the 3 implementations (easy, incremental and Drools).
Also see the docs section about incremental score calculation, which explains why it's so much faster.
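To give a feel for the shape of an incremental calculator, here is a skeleton (a sketch only: MySolution and the insert/retract bodies are placeholders for your own domain, and the exact interface package and generics vary between OptaPlanner versions):

import org.optaplanner.core.api.score.buildin.hardsoft.HardSoftScore;
import org.optaplanner.core.api.score.calculator.IncrementalScoreCalculator;

public class MyIncrementalScoreCalculator
        implements IncrementalScoreCalculator<MySolution, HardSoftScore> {

    private int hardScore;
    private int softScore;

    @Override
    public void resetWorkingSolution(MySolution solution) {
        hardScore = 0;
        softScore = 0;
        // Score the whole solution once from scratch here; after that,
        // only apply deltas in the methods below.
    }

    @Override public void beforeEntityAdded(Object entity) { }
    @Override public void afterEntityAdded(Object entity) { insert(entity); }

    @Override
    public void beforeVariableChanged(Object entity, String variableName) {
        retract(entity);   // remove this entity's current score contribution
    }

    @Override
    public void afterVariableChanged(Object entity, String variableName) {
        insert(entity);    // add its contribution under the new assignment
    }

    @Override public void beforeEntityRemoved(Object entity) { retract(entity); }
    @Override public void afterEntityRemoved(Object entity) { }

    private void retract(Object entity) { /* subtract matched constraints */ }
    private void insert(Object entity) { /* add matched constraints */ }

    @Override
    public HardSoftScore calculateScore() {
        return HardSoftScore.of(hardScore, softScore);
    }
}

The payoff is that each move only re-scores the entities it touches, instead of looping over all ~1500 entities for every move the solver tries.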
I want to scan through a huge corpus of text and count word frequencies (n-gram frequencies actually for those who are familiar with NLP/IR). I use a Java HashMap for this. So what happens is I process the text line by line. For each line, I extract the words, and for each word, I update the corresponding frequency in the hashmap.
The problem is that this process gets slower and slower. For example, it starts by processing around 100k lines/second, and performance starts falling right away. After about 28 million lines, it has fallen to 16k lines/second, and of course it keeps falling.
The first thing that came to mind was that this was caused by too many entries in the hashmap, which made every put and every get slower over time. So I tried keeping only the most frequent entries (say, 100k) in the hashmap at any time. This was done using a second map that mapped frequencies to words (as in here: Automatically sorted by values map in Java).
This performed a lot faster in general (it started at 56k lines/sec, and by the time it reached 28 million lines, performance had only dropped to 36.5k lines/sec). However, it also kept falling, at a much slower rate, but the fact remains that it kept falling.
Do you have any possible explanation for why this happens when the hashmap's size remains the same? Do you think it has anything to do with the garbage collector? That is, does the fact that I keep putting objects into and removing them from hashmaps fragment the memory somehow? Or could it be a hashing-function problem? Since I'm using strings, the hashing function is Java's default string hashing function.
Here is the part of my code that performs the aforementioned task:
http://pastebin.com/P8S6Sj86
NOTE: I am a Java newbie, so any elaboration in your answers is more than welcome.
I recommend using Java VisualVM to do some profiling. This comes with Java - go to the command line and type jvisualvm to run it. This makes it easy to see if memory churn is your problem, or if particular types of objects are being created hundreds of thousands of times.
If you break up your code into several methods, you'll also be able to tell which methods take too long to run.
I did notice that you are unnecessarily creating lots of objects in inner loops. This certainly won't help performance, although it may not be the main culprit.
For example:
float avg = new Float(sumItems) / new Float(freqMap.size());
should just be
float avg = (float)sumItems / freqMap.size();
Another potentially troublesome piece of your code is:
System.out.println(numItems + " items counted");
Depending on your operating system and IDE, writing hundreds of thousands of lines to the console takes significant time. Instead, print a progress update every 1000 items.
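For example (numItems here is your counter from the pastebin code):

if (numItems % 1000 == 0) {
    System.out.println(numItems + " items counted");
}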
Suggestion:
Try implementing a custom hashCode method for the object you're storing in your hashmap. Here are some links:
Java HashMap performance optimization / alternative
http://www.ibm.com/developerworks/java/library/j-jtp05273/index.html
http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml
Bad idea to use String key in HashMap?
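If you want to experiment along those lines, one sketch is a wrapper key that precomputes an alternative hash such as FNV-1a. (Illustrative only: NGramKey is a made-up name, and note that java.lang.String already caches its own hash code, so measure before adopting something like this.)

final class NGramKey {
    private final String text;
    private final int hash;

    NGramKey(String text) {
        this.text = text;
        int h = 0x811c9dc5;                          // FNV-1a offset basis
        for (int i = 0; i < text.length(); i++) {
            h = (h ^ text.charAt(i)) * 0x01000193;   // FNV-1a prime
        }
        this.hash = h;                               // computed once, cached
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        return o instanceof NGramKey && ((NGramKey) o).text.equals(text);
    }
}

You would then count with a HashMap<NGramKey, Integer> instead of HashMap<String, Integer>.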
Given a hard drive with 120GB, 100GB of which is filled with strings of length 256, and 2GB of RAM, how do I sort those strings in Java most efficiently?
How long will it take?
A1: You probably want to implement some form of merge sort.
A2: Longer than it would if you had 256GB of RAM in your machine.
Edit: stung by criticism, I quote from Wikipedia's article on merge sort:
Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.
For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
Here is how I'd do it:
Phase 1 is to split the 100GB into 50 partitions of 2GB each, read each of the 50 partitions into memory, sort it using quicksort, and write it back out. You want the sorted partitions at the top end of the disc.
Phase 2 is to merge the 50 sorted partitions. This is the tricky bit, because you don't have enough space on the disc to store both the partitions and the final sorted output. So ...
1. Do a 50-way merge to fill the first 20GB at the bottom end of the disc.
2. Slide the remaining data in the 50 partitions to the top, to make another 20GB of free space contiguous with the end of the first 20GB.
3. Repeat steps 1 and 2 until completed.
This does a lot of disc IO, but you can use your 2GB of memory for buffering in the copying and merging steps, improving throughput by minimizing the number of disc seeks and doing large data transfers.
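For concreteness, here is a sketch of the phase-2 k-way merge, assuming each sorted partition was written out as one string per line (file names and buffer sizes are illustrative, and the sliding/truncation bookkeeping described above is omitted):

import java.io.*;
import java.util.*;

public class KWayMerge {

    // Pairs a reader with its current smallest unconsumed line.
    private static final class Source {
        final BufferedReader reader;
        String head;
        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    public static void merge(List<File> partitions, File output) throws IOException {
        PriorityQueue<Source> heap =
                new PriorityQueue<>(Comparator.comparing((Source s) -> s.head));
        for (File f : partitions) {
            Source s = new Source(new BufferedReader(new FileReader(f), 1 << 20));
            if (s.head != null) {
                heap.add(s);
            } else {
                s.reader.close();                  // empty partition
            }
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(output), 1 << 20)) {
            while (!heap.isEmpty()) {
                Source s = heap.poll();            // smallest head overall
                out.write(s.head);
                out.newLine();
                s.head = s.reader.readLine();      // advance that partition
                if (s.head != null) {
                    heap.add(s);                   // still has data
                } else {
                    s.reader.close();              // partition exhausted
                }
            }
        }
    }
}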
EDIT: @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release the disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
The potential downsides of this are increased disk fragmentation, and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux/UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
Finally, I'd like to point out that any theoretical prediction of the time taken by this algorithm (and others) is largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning and benchmarking.
I am basically repeating Krystian's answer, but elaborating:
Yes, you need to do this more or less in place, since you have little RAM available. But naive in-place sorts would be a disaster here, simply because of the cost of moving strings around.
Rather than actually moving strings around, just keep track of which strings should swap with which others, and actually move them, once, at the end, to their final spots. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.
Then, any in-place algorithm like quicksort works pretty well.
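A sketch of the idea (stringAt(i) is a hypothetical accessor that fetches string i, or enough of it to compare; note this orders source indices by slot, the mirror image of the destination array described above):

int n = 1000;                          // number of strings, for illustration
Integer[] order = new Integer[n];
for (int i = 0; i < n; i++) order[i] = i;
// Compare strings, but move only their indices:
java.util.Arrays.sort(order, (a, b) -> stringAt(a).compareTo(stringAt(b)));
// order[k] now holds the index of the string that belongs in slot k;
// a final pass moves each string once to its destination.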
The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100GB of data in reasonably-sized writes. I might assume the drive / controller / OS can move about 100MB/sec for you. So, 1000 seconds or so? 20 minutes?
But does it fit in memory? You have 100GB of strings, each of which is 256 bytes. How many strings? 100 * 2^30 / 2^8, or about 419M strings. You need 419M ints, each is 4 bytes, or about 1.7GB. Voila, fits in your 2GB.
Sounds like a task that calls for an external sorting method. Volume 3 of "The Art of Computer Programming" contains a section with an extensive discussion of external sorting methods.
I think you should use BogoSort. You might have to modify the algorithm a bit to allow for in-place sorting, but that shouldn't be too hard. :)
You could use a trie (a.k.a. a prefix tree) to build a tree-like structure that allows you to walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory. You can build the trie as a tree of directories on your file system (obviously, not the one the data is coming from): an ordered depth-first walk, sorting each directory listing, then visits the strings in sorted order.
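A toy sketch of the directories idea (illustrative only: a real version would have to escape characters that are illegal in file names, count duplicates, and cope with path-depth limits for 256-character strings):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectoryTrie {
    static void insert(Path root, String s) throws IOException {
        Path p = root;
        for (char c : s.toCharArray()) {
            p = p.resolve(String.valueOf(c));   // one directory level per character
        }
        Files.createDirectories(p);
        // Marker file showing a string ends here; a duplicate string would
        // throw FileAlreadyExistsException (see the caveats above).
        Files.createFile(p.resolve("_end"));
    }
}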
AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.