count distinct values in big long array (performance issue) - java

I have this:
long hnds[] = new long[133784560]; // 133 million
Then I quickly fill the array (a couple of ms) and then I somehow want to know the number of unique (i.e. distinct) values. Now, I don't even need this in real time; I just need to try out a couple of variations and see how many unique values each gives.
I tried e.g. this:
import org.apache.commons.lang3.ArrayUtils;
....
HashSet<Long> length = new HashSet<Long>(Arrays.asList(ArrayUtils.toObject(hnds)));
System.out.println("size: " + length.size());
and after waiting for half an hour it gives a heap space error (I run with -Xmx4000m).
I also tried initializing Long[] hnds instead of long[] hnds, but then the initial filling of the array takes forever. I also tried using a Set from the start while adding the values, but that takes forever as well. Is there any way to count the distinct values of a long[] array without waiting forever? I'd write it to a file if I have to, just some way.

My best suggestion would be to use a library like fastutil (http://fastutil.di.unimi.it/) and then use the custom unboxed hash set:
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
System.out.println(new LongOpenHashSet(hnds).size());
(Also, by the way, if you can accept approximate answers, there are much more efficient algorithms you can try; see e.g. this paper for details.)
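For example, here is a minimal sketch of one simple approximate technique (linear counting, not necessarily the algorithm from that paper): hash each value into a fixed-size bitmap and estimate the distinct count from the fraction of bits that stayed zero. The bitmap size and the mixing function below are assumptions, not tuned values.
import java.util.BitSet;
public class ApproxDistinctSketch {
    // Linear counting: estimate = -m * ln(fraction of bits still zero).
    static long approxDistinct(long[] values) {
        final int m = 1 << 28;                      // bitmap size (assumption, ~32 MB of bits); should exceed the true distinct count
        BitSet bitmap = new BitSet(m);
        for (long v : values) {
            bitmap.set((int) (mix(v) & (m - 1)));   // map each value to one bit
        }
        double zeroFraction = (double) (m - bitmap.cardinality()) / m;
        return Math.round(-m * Math.log(zeroFraction));
    }
    // SplitMix64 finalizer so that similar long values spread over the whole bitmap.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
If the bitmap fills up completely the estimate breaks down, so for much larger cardinalities you would switch to something like HyperLogLog.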

Just sort it and count.
int sz = 133784560;
Random randy = new Random();
long[] longs = new long[sz];
for (int i = 0; i < sz; i++) { longs[i] = randy.nextInt(10000000); }
Arrays.sort(longs);
long lastSeen = longs[0];
long count = 1;                     // the first element is always one distinct value
for (int i = 1; i < sz; i++) {
    if (longs[i] != lastSeen) count++;
    lastSeen = longs[i];
}
Takes about 15 seconds on my laptop.

Related

Java Array/ArrayList/LinkedList performance

I used the following code to test the performance between Array/ArrayList/LinkedList
import java.util.ArrayList;
import java.util.LinkedList;

public class Main3 {
    public static void main(String[] args) throws Exception {
        int n = 20000000;
        long bt = 0, et = 0;
        int[] a0 = new int[n];
        ArrayList<Integer> a1 = new ArrayList<>(n);
        LinkedList<Integer> a2 = new LinkedList<>();
        Integer[] a3 = new Integer[n];

        bt = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            a0[i] = i;
        }
        et = System.currentTimeMillis();
        System.out.println("===== loop0 time =======" + (et - bt));

        bt = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            a1.add(i);
        }
        et = System.currentTimeMillis();
        System.out.println("===== loop1 time =======" + (et - bt));

        bt = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            a2.add(i);
        }
        et = System.currentTimeMillis();
        System.out.println("===== loop2 time =======" + (et - bt));

        bt = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            a3[i] = i;
        }
        et = System.currentTimeMillis();
        System.out.println("===== loop3 time =======" + (et - bt));
    }
}
The result is
===== loop0 time =======11
===== loop1 time =======6776
===== loop2 time =======17305
===== loop3 time =======56
Why are ArrayList and LinkedList so much slower than the plain array?
How can I improve the performance?
env:
Java: jdk1.8.0_231
Thanks
There are potential inaccuracies in your benchmark, but the overall ranking of the results is probably correct. You may get faster results for all of the benchmarks if you "warm-up" the code before taking timings to allow the JIT compiler to generate native code and optimise it. Some benchmark results may be closer or even equal.
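For instance, a crude way to warm up, short of pulling in a benchmarking framework like JMH, is to run each timed loop several times and only report the last run. Here is a sketch for the loop0 measurement, reusing n, a0, bt and et from the code above (the run count of 5 is an arbitrary assumption):
// Repeat the measurement and keep only the last run, so the JIT has had a
// chance to compile and optimise the loop before the reported timing is taken.
long lastRunMillis = 0;
for (int run = 0; run < 5; run++) {
    bt = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        a0[i] = i;                  // the work being measured
    }
    et = System.currentTimeMillis();
    lastRunMillis = et - bt;
}
System.out.println("===== loop0 time (after warm-up) =======" + lastRunMillis);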
Iterating over an int array is going to be much faster than iterating over a List of Integer objects. A LinkedList is going to be slowest of all. These statements assume the optimiser doesn't make radical changes.
Let's look at why:
An int array (int[]) is a contiguous area of memory containing your 4 byte ints arranged end-to-end. The loop to iterate over this and set the elements just has to work its way through the block of memory setting each 4 bytes in turn. In principle an index check is required, but in practice the optimiser can realise this isn't necessary and remove it. The JIT compiler is well able to optimise this kind of thing based on native CPU instructions.
An ArrayList of Integer objects contains an array of references which point to individual Integer objects (or are null). Each Integer object will have to be allocated separately (although Java can re-use Integers of small numbers). There is an overhead to allocate new objects and in addition the reference may be 8 bytes instead of 4. Also, if the list size is not specified (though it is in your case) the internal array may need to be reallocated. There is an overhead due to calling the add method instead of assigning to the array directly (the optimizer may remove this though).
Your array of Integer benchmark is similar to the array list but doesn't have the overhead of the list add method call (which has to track the list size). Probably your benchmark overstates the difference between this array and the array list though.
A LinkedList is the worst case. Linked lists are optimised for inserting in the middle. They have references to point to the next item in the list and nodes to hold those references in addition to the Integer object that needs allocating. This is a big memory overhead that also requires some initialisation and you would not use a linked list unless you were expecting to insert a lot of elements into the middle of the list.

Is there any efficient and optimized way to store 500M+ elements in long[] array?

In my exam's first question I am required to store around 500 million+ elements in an array.
However, I am running into a heap space problem. Could you please help me find the most optimal way to store this?
I found BitSet but I don't know how to use it.
Step 1 - Create 3 long[] arrays with very large length (Least 100M+)
Step 2 - Init values should be randomly generated, not sorted, may contain duplicates
Step 3 - Merge the 3 long[] arrays after initialization
Step 4 - Duplicate items should be removed in output
I wrote the following:
package exam1;
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Random;
/**
*
* @author Furkan
*/
//VM OPTIONS -> -Xincgc -Xmx4g -Xms4g
public final class Exam1 {
private static final int LENGTH = 100000000;
private volatile long[] m_testArr1 = null;
private volatile long[] m_testArr2 = null;
private volatile long[] m_testArr3 = null;
private volatile long[] m_merged = null;
private Random m_r = new Random(System.currentTimeMillis());
public static void main(String[] args) {
Exam1 exam = new Exam1();
Instant start1 = Instant.now();
System.out.println("Fill Started");
exam.Fill();
Instant end1 = Instant.now();
System.out.println("Fill Ended : " + Duration.between(start1, end1));
Instant start2 = Instant.now();
System.out.println("Merge Started");
exam.Merge();
Instant end2 = Instant.now();
System.out.println("Merge Ended : " + Duration.between(start1, end1));
Instant start3 = Instant.now();
System.out.println("DupRemove Started");
exam.DupRemove();
Instant end3 = Instant.now();
System.out.println("DupRemove Ended : " + Duration.between(start1, end1));
}
private void Fill(){
this.m_testArr1 = new long[Exam1.LENGTH];
this.m_testArr2 = new long[Exam1.LENGTH];
this.m_testArr3 = new long[Exam1.LENGTH];
for (int i = 0; i < Exam1.LENGTH; i++) {
this.m_testArr1[i] = this.m_r.nextLong();
this.m_testArr2[i] = this.m_r.nextLong();
this.m_testArr3[i] = this.m_r.nextLong();
}
}
private void Merge(){
this.m_merged = this.TryMerge(this.m_testArr1, this.m_testArr2, this.m_testArr3);
}
private void DupRemove(){
this.m_merged = this.RemoveDuplicates(this.m_merged);
}
public long[] TryMerge(long[] arr1, long[] arr2, long[] arr3){
int aLen = arr1.length;
int bLen = arr2.length;
int cLen = arr3.length;
int len = aLen + bLen + cLen;
//TODO: Use BitSize for RAM optimize. IDK how to use...
//OutOfMemory Exception on this line.
long[] mergedArr = new long[len];
this.m_merged = new long[len];
//long[] mergedArr = (long[]) Array.newInstance(long.class, aLen+bLen+cLen);
System.arraycopy(arr1, 0, mergedArr, 0, aLen);
System.arraycopy(arr2, 0, mergedArr, aLen, bLen);
System.arraycopy(arr3, 0, mergedArr, (aLen + bLen), cLen);
return mergedArr;
}
//!!!NOT WORKING!!!
private long[] RemoveDuplicates(long[] arr){
HashSet<Long> set = new HashSet<Long>();
final int len = arr.length;
for(int i = 0; i < len; i++){
set.add(arr[i]);
}
long[] clean = new long[set.size()];
int i = 0;
for (Iterator<Long> it = set.iterator(); it.hasNext();) {
clean[i++] = it.next();
}
return clean;
}
}
UPDATE
Original Question:
-Implement an efficient method to merge 3 very large (length 100M+) long[] arrays.
-Input data is randomly generated, not sorted, and may contain duplicates.
-Duplicate items should be removed in the output.
(I have 8 GB of RAM.)
Run Args: -Xincgc -Xmx4g -Xms4g
Exception : Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at test.
Since you have limited space, and assuming you're allowed to modify the 3 random arrays, I'd suggest the following.
For each of the 3 arrays:
Sort the array, e.g. using Arrays.sort().
Eliminate duplicates by compacting non-repeating numbers to the beginning.
E.g. if you have {1,2,2,3,3}, you compact it to {1,2,3,?,?} with length 3, where ? means the value doesn't matter.
(optional) Move to array of correct size, and discard original array, to free up space for result array(s).
Create result array of size len1 + len2 + len3.
Merge the 3 arrays to the result, eliminating duplicates between the arrays.
E.g. if you have {1,3,5}, {1,2,3}, you end up with {1,2,3,5,?,?} with length 4.
If needed, copy result to new array of correct size.
If low on memory, release 3 original arrays before doing this to free up space.
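A rough sketch of those steps (the class and method names here are made up for illustration; the compaction reuses the input arrays in place, as suggested above):
import java.util.Arrays;
public class MergeDistinctSketch {
    // Sort the array and compact the distinct values to the front,
    // returning how many distinct values there are.
    static int compactDistinct(long[] a) {
        Arrays.sort(a);
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            if (n == 0 || a[i] != a[n - 1]) {
                a[n++] = a[i];
            }
        }
        return n;
    }
    // Merge three sorted, duplicate-free prefixes a[0..na), b[0..nb), c[0..nc)
    // into one sorted, duplicate-free result, trimmed to its real size.
    static long[] mergeDistinct(long[] a, int na, long[] b, int nb, long[] c, int nc) {
        long[] out = new long[na + nb + nc];
        int i = 0, j = 0, k = 0, n = 0;
        while (i < na || j < nb || k < nc) {
            long min = Long.MAX_VALUE;
            if (i < na) min = Math.min(min, a[i]);
            if (j < nb) min = Math.min(min, b[j]);
            if (k < nc) min = Math.min(min, c[k]);
            if (n == 0 || out[n - 1] != min) out[n++] = min;   // skip duplicates across arrays
            if (i < na && a[i] == min) i++;                    // advance every index that pointed at min
            if (j < nb && b[j] == min) j++;
            if (k < nc && c[k] == min) k++;
        }
        return Arrays.copyOf(out, n);
    }
}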
Use a Bloom filter to identify possible duplicates, then use a hash set to weed out the false positives from the set of possible duplicates. I.e.:
for each source array element, check whether it is (possibly) already in the Bloom filter: if so, add it to a hash set of candidate duplicates, otherwise add it to the merged array; then add the element to the Bloom filter. When all source arrays are processed, remove from the hash set every value that is already in the merged array, so it is not emitted twice. Finally, add all remaining elements of the hash set (the false positives, which never made it into the merged array) to the merged array.
Guava has a bloom filter data structure that you can use.
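A rough sketch of that idea with Guava's BloomFilter (the class and method names of the sketch itself are made up, and the 1% false-positive rate is an arbitrary assumption):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
public class BloomMergeSketch {
    static long[] mergeDistinct(long[] arr1, long[] arr2, long[] arr3) {
        int total = arr1.length + arr2.length + arr3.length;
        BloomFilter<Long> seen = BloomFilter.create(Funnels.longFunnel(), total, 0.01);
        Set<Long> maybeDuplicates = new HashSet<>();
        long[] merged = new long[total];
        int n = 0;
        for (long[] src : new long[][] { arr1, arr2, arr3 }) {
            for (long v : src) {
                if (seen.mightContain(v)) {
                    maybeDuplicates.add(v);   // possibly a duplicate, or a Bloom filter false positive
                } else {
                    merged[n++] = v;          // definitely not seen before
                }
                seen.put(v);
            }
        }
        // Values already in the merged array must not be added a second time; whatever
        // remains in the hash set afterwards is a false positive that still has to go in.
        for (int i = 0; i < n; i++) {
            maybeDuplicates.remove(merged[i]);
        }
        for (long v : maybeDuplicates) {
            merged[n++] = v;
        }
        return Arrays.copyOf(merged, n);
    }
}
Note that this still allocates a full-size result array up front, like the question's TryMerge does, so it helps with duplicate detection rather than with the peak memory use.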
If you don't have enough memory to store all the data, you need to change something, based on the business requirements and the real-world situation.
Maybe you should use some built-in collection framework, as others suggested.
Or, if that is not allowed (for whatever reason), you should keep the data somewhere other than memory. E.g.:
sort the arrays
watch the three array with three moving indexes (i, j, k)
always pick the smallest of arr1[i], arr2[j], arr3[k]
ignore if it is a duplicate and move on
write to a file if it is a new value
and increment the corresponding index
do it until the end of each array
Now you have the sorted, duplicate-free merged data in a file, which you can read back if necessary after dropping the original arrays.
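A sketch of that file-based merge (assuming the three arrays have already been sorted with Arrays.sort; the method name and the file path parameter are made up):
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
public class FileMergeSketch {
    // Writes the sorted, duplicate-free merge of three already-sorted arrays to a file.
    static void mergeToFile(long[] a, long[] b, long[] c, String path) throws IOException {
        try (DataOutputStream out =
                 new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)))) {
            int i = 0, j = 0, k = 0;
            boolean first = true;
            long last = 0;
            while (i < a.length || j < b.length || k < c.length) {
                long min = Long.MAX_VALUE;
                if (i < a.length) min = Math.min(min, a[i]);
                if (j < b.length) min = Math.min(min, b[j]);
                if (k < c.length) min = Math.min(min, c[k]);
                if (first || min != last) {               // write only values not written yet
                    out.writeLong(min);
                    last = min;
                    first = false;
                }
                if (i < a.length && a[i] == min) i++;     // move every index that pointed at min
                if (j < b.length && b[j] == min) j++;
                if (k < c.length && c[k] == min) k++;
            }
        }
    }
}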

Java - Improper Checking in For Loop

This is a chunk of code in Java. I'm trying to output random numbers from the tasks array while making sure none of the outputs are repeated, so I put each pick through some extra loops. Say the randomly chosen task is the sixth one, task[5]: it goes through a for loop that checks it against every tCheck element, and while task[5] equals one of the tCheck elements it keeps trying other random picks before going back to the start of the checking for loop. At the end of each overall iteration, tCheck[i] is set to the random task that was settled on.
THE PROBLEM is that, despite supposedly checking each new random task against all tCheck elements, sometimes (not always) a task is repeated in the output. Instead of, say, 2,3,6,1,8,7,5,4, it will output something like 2,3,2,1,8,7,5,4, where "2" is repeated, and not always in the same place; it can also end up like 3,1,4,5,4,6,7,8, where "4" is repeated.
int num = console.nextInt();
String[] tasks = {"1","2","3","4","5","6","7","8"};
String[] tCheck = {"","","","","","","",""};
for(int i = 0; i<= (num-1); i++){
int tNum = rand.nextInt(8);
for(int j = 0; j <=7; j++){
if(tasks[tNum].equals(tCheck[j])){
while(tasks[tNum].equals(tCheck[j])){
tNum = rand.nextInt(8);
}
j = 0;
}
}
tCheck[i] = tasks[tNum];
System.out.println(tasks[tNum]+" & "+tCheck[i]);
}
None of the other chunks of code affect this part (other than setting up the Random and Scanner objects and so on; that is all done correctly). I just want it to print each number in a random order, exactly once, with no repeats. How do I make it do that?
Thanks in advance.
Firstly, don't use arrays. Use collections - they are way more programmer friendly.
Secondly, use the JDK's API to implement this idea:
randomise the order of your elements
then iterate over them linearly
In code:
List<String> tasks = Arrays.asList("1","2","3","4","5","6","7","8");
Collections.shuffle(tasks);
tasks.forEach(System.out::println);
Job done.
You can check whether a value has already been used with this approach:
for (int i = 0; i <= (num - 1); i++) {
    int tNum = rand.nextInt(8);
    // keep generating until the chosen task is not already in tCheck
    while (Arrays.asList(tCheck).contains(tasks[tNum])) {
        tNum = rand.nextInt(8);
    }
    tCheck[i] = tasks[tNum];
    System.out.println(tasks[tNum]);
}
If you were using an ArrayList you could call its contains method directly; since you are using an array, you get a List view of it with Arrays.asList() and then call contains on that. With the help of the while loop it keeps generating random numbers until it produces a value that has not been used yet.
I once created something similar using an ArrayList:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
public class Main {
    public static void main(String[] args) {
        String[] array = { "a", "b", "c", "d", "e" };
        List<String> l = new ArrayList<String>(Arrays.asList(array));
        Random r = new Random();
        while (!l.isEmpty()) {
            String s = l.remove(r.nextInt(l.size()));
            System.out.println(s);
        }
    }
}
I remove a random position from the list until it is empty, with no content checks at all. I believe that is reasonably effective, even though it does create a list.

More efficient to create new array or reset array

I have an array of 6 elements. Is it more efficient (time-wise) to set those elements to null, or to create a new array? I'm going to be using this array hundreds of times.
I would use System.arraycopy. It is native and pretty efficient.
I would keep an array of nulls and copy it into my array with this method each time I want to reset my array. So my advice is not to create a new array each time, but also not to loop (in Java code) setting the elements to null yourself. Just use the native method Java provides.
import java.util.Arrays;

public class Test056 {
    public static void main(String[] args) {
        String[] arrNull = new String[10000];
        String[] arrString = new String[10000];

        long t1 = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
            System.arraycopy(arrNull, 0, arrString, 0, arrNull.length);
        }
        long t2 = System.nanoTime();
        System.out.println(t2 - t1);

        long t3 = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
            Arrays.fill(arrString, null);
        }
        long t4 = System.nanoTime();
        System.out.println(t4 - t3);
    }
}
Mu - You are not asking the right question.
Do not worry about the efficiency of such trivial operations until it is a known performance bottleneck in your application.
You say you will be nulling/creating 6 references 100s of times. So, you will be creating/nulling/looping < 6000 references. Which is trivial in modern programming.
There are likely much better places where you should be spending your development time.
Creating a new array should be more efficient time-wise because it would only allocate empty references.
Setting the elements to null would imply walking through the entire array to set references to null (which is default behavior on array creation) which is more time consuming (even if for an array of 6 elements it's totally negligible).
EDIT: time-wise is emphasised because memory-wise it may not be your best option. Since you will be creating new references, if you instantiate new objects make sure the objects from your previous array are garbage collected properly (again, with only 6 elements they would have to be gigantic objects for any noticeable impact).
The following bit of code demonstrates that it's faster to assign each value to null.
Note that this is valid for arrays with 6 elements. Eventually it may be faster to create a new array.
Object[] array = new Object[6];

long start = System.nanoTime();
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    array[0] = null;
    array[1] = null;
    array[2] = null;
    array[3] = null;
    array[4] = null;
    array[5] = null;
}
System.out.println("elapsed nanoseconds: " + (System.nanoTime() - start));
int length = array.length;

start = System.nanoTime();
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    array = new Object[6];
}
System.out.println("elapsed nanoseconds: " + (System.nanoTime() - start));
length = array.length;
and the output:
elapsed nanoseconds: 264095957
elapsed nanoseconds: 17885568039
The best and safest way is to use a list instead:
List<Object> myList = new ArrayList<Object>(0);
That way you can simply recreate the list whenever you want; you don't need to set anything to null, because the old list is discarded and a new one is created in memory.
You can read more in the Oracle Java documentation.

Java optimization, gain from hashMap?

I've been given some lovely Java code that has a lot of things like this (in a loop that executes about 1.5 million times):
code = getCode();
for (int intCount = 1; intCount < vA.size() + 1; intCount++)
{
    oA = (A) vA.elementAt(intCount - 1);
    if (oA.code.trim().equals(code))
        currentName = oA.name;
}
Would I see significant increases in speed from switching to something like the following?
code = getCode();
//AMap is a HashMap
strCurrentAAbbreviation = (String)AMap.get(code);
Edit: The size of vA is approximately 50. The trim shouldn't even be necessary, but definitely would be nice to call that 50 times instead of 50*1.5 million. The items in vA are unique.
Edit: At the suggestion of several responders, I tested it. Results are at the bottom. Thanks guys.
There's only one way to find out.
Ok, Ok, I tested it.
Results follow for your enlightenment:
Looping: 18391ms
Hash: 218ms
Looping: 18735ms
Hash: 234ms
Looping: 18359ms
Hash: 219ms
I think I will be refactoring that bit ..
The framework:
public class OptimizationTest {
private static Random r = new Random();
public static void main(String[] args){
final long loopCount = 1000000;
final int listSize = 55;
long loopTime = TestByLoop(loopCount, listSize);
long hashTime = TestByHash(loopCount, listSize);
System.out.println("Looping: " + loopTime + "ms");
System.out.println("Hash: " + hashTime + "ms");
}
public static long TestByLoop(long loopCount, int listSize){
Vector vA = buildVector(listSize);
A oA;
StopWatch sw = new StopWatch();
sw.start();
for (long i = 0; i< loopCount; i++){
String strCurrentStateAbbreviation;
int j = r.nextInt(listSize);
for (int intCount = 1; intCount < vA.size() + 1; intCount++){
oA = (A)vA.elementAt(intCount - 1);
if (oA.code.trim().equals(String.valueOf(j)))
strCurrentStateAbbreviation = oA.value;
}
}
sw.stop();
return sw.getElapsedTime();
}
public static long TestByHash(long loopCount, int listSize){
HashMap hm = getMap(listSize);
StopWatch sw = new StopWatch();
sw.start();
String strCurrentStateAbbreviation;
for (long i = 0; i < loopCount; i++){
int j = r.nextInt(listSize);
strCurrentStateAbbreviation = (String) hm.get(String.valueOf(j)); // keys were stored as Strings, so look up with a String key
}
sw.stop();
return sw.getElapsedTime();
}
private static HashMap getMap(int listSize) {
HashMap hm = new HashMap();
for (int i = 0; i < listSize; i++){
String code = String.valueOf(i);
String value = getRandomString(2);
hm.put(code, value);
}
return hm;
}
public static Vector buildVector(long listSize)
{
Vector v = new Vector();
for (int i = 0; i < listSize; i++){
A a = new A();
a.code = String.valueOf(i);
a.value = getRandomString(2);
v.add(a);
}
return v;
}
public static String getRandomString(int length){
StringBuffer sb = new StringBuffer();
for (int i = 0; i< length; i++){
sb.append(getChar());
}
return sb.toString();
}
public static char getChar()
{
final String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int i = r.nextInt(alphabet.length());
return alphabet.charAt(i);
}
}
Eh, there's a good chance that you would, yes. Retrieval from a HashMap is going to be constant time if you have good hash codes.
But the only way you can really find out is by trying it.
This depends on how large your map is, and how good your hashCode implementation is (such that you do not have collisions).
You should really do some real profiling to be sure if any modification is needed, as you may end up spending your time fixing something that is not broken.
What actually stands out to me a bit more than the elementAt call is the string trimming you are doing with each iteration. My gut tells me that might be a bigger bottleneck, but only profiling can really tell.
Good luck
I'd say yes, since the above appears to be a linear search over vA.size(). How big is vA?
Why don't you use something like YourKit (or insert another profiler) to see just how expensive this part of the loop is.
Using a Map would certainly be an improvement that helps with maintaining that code later on.
Whether you can use a map depends on whether the (vector?) contains unique codes or not. The for loop given remembers the last object in the list with a given code, which would mean a hash is not the solution if codes repeat.
For small (stable) list sizes, simply converting the list to an array of objects would show a performance increase, on top of better readability.
If none of the above holds, at least use an iterator to inspect the list, giving better readability and some (probable) performance increase.
Depends. How much memory you got?
I would guess much faster, but profile it.
I think the dominant factor here is how big vA is, since the loop needs to run n times, where n is the size of vA. With the map, there is no loop, no matter how big vA is. So if n is small, the improvement will be small. If it is huge, the improvement will be huge. This is especially true because even after finding the matching element the loop keeps going! So if you find your match at element 1 of a 2 million element list, you still need to check the last 1,999,999 elements!
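For illustration, the early-exit version of the original loop (reusing the question's variables, and assuming the codes in vA are unique as the question's edit says) would be:
for (int intCount = 0; intCount < vA.size(); intCount++) {
    oA = (A) vA.elementAt(intCount);
    if (oA.code.trim().equals(code)) {
        currentName = oA.name;
        break;   // stop scanning once the (unique) match is found
    }
}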
Yes, it'll almost certainly be faster. Looping an average of 25 times (half-way through your 50) is slower than a hashmap lookup, assuming your vA contents are decently hashable.
However, speaking of your vA contents, you'll have to trim them as you insert them into your aMap, because aMap.get("somekey") will not find an entry whose key is "somekey ".
Actually, you should do that as you insert into vA, even if you don't switch to the hashmap solution.
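For example, something along these lines when the map is built (reusing the question's names; aMap here stands for the question's AMap and is assumed to map code to name):
import java.util.HashMap;
import java.util.Map;
// Build the lookup map once, trimming the codes as they go in,
// so later lookups don't depend on stray whitespace in the stored codes.
Map<String, String> aMap = new HashMap<>();
for (int intCount = 0; intCount < vA.size(); intCount++) {
    A oA = (A) vA.elementAt(intCount);
    aMap.put(oA.code.trim(), oA.name);
}
// ... then, inside the 1.5-million-iteration loop:
code = getCode();
currentName = aMap.get(code);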
