I have an array of 6 elements. Is it more efficient (time-wise) to set those elements to null, or to create a new array? I'm going to be using this array hundreds of times.
I would use this method.
System.arraycopy
It is native and pretty efficient.
I would keep an array of nulls and copy it into my array with this method each time I want to reset it. So my advice would be: don't create a new array each time, but also don't loop in Java code setting the elements to null yourself. Just use this native method which Java provides.
import java.util.Arrays;
public class Test056 {
public static void main(String[] args) {
String[] arrNull = new String[10000];
String[] arrString = new String[10000];
long t1 = System.nanoTime();
for (int i=0; i<10000; i++){
System.arraycopy(arrNull, 0, arrString, 0, arrNull.length);
}
long t2 = System.nanoTime();
System.out.println(t2 - t1); // nanoseconds for System.arraycopy from the null array
long t3 = System.nanoTime();
for (int i=0; i<10000; i++){
Arrays.fill(arrString, null);
}
long t4 = System.nanoTime();
System.out.println(t4 - t3); // nanoseconds for Arrays.fill with null
}
}
Mu - You are not asking the right question.
Do not worry about the efficiency of such trivial operations until it is a known performance bottleneck in your application.
You say you will be nulling/creating 6 references hundreds of times, so you will be creating/nulling/looping over fewer than 6,000 references, which is trivial in modern programming.
There are likely much better places where you should be spending your development time.
Creating a new array should be more efficient time-wise, because it only has to allocate a block of empty references.
Setting the elements to null yourself means walking through the entire array to assign each reference (which is what array creation already gives you by default), which is more time-consuming (even if, for an array of 6 elements, it's totally negligible).
EDIT: "time-wise" is emphasized because memory-wise it may not be your best option. Since you'll be creating new references, if you instantiate new objects, make sure the objects from your previous array are garbage collected properly (again, with only 6 elements they would have to be gigantic objects for you to see any bad performance impact).
The following bit of code demonstrates that it's faster to assign each value to null.
Note that this is valid for arrays with 6 elements. At some larger size it may become faster to create a new array.
Object[] array = new Object[6];
long start = System.nanoTime();
for(int i = 0; i < Integer.MAX_VALUE; i++){
array[0] = null;
array[1] = null;
array[2] = null;
array[3] = null;
array[4] = null;
array[5] = null;
}
System.out.println("elapsed nanoseconds: " + (System.nanoTime() - start));
int length = array.length; // use the array so the loop above isn't trivially dead code
start = System.nanoTime();
for(int i = 0; i < Integer.MAX_VALUE; i++){
array = new Object[6];
}
System.out.println("elapsed nanoseconds: " + (System.nanoTime() - start));
length = array.length; // same for the allocation loop
and the output:
elapsed nanoseconds: 264095957
elapsed nanoseconds: 17885568039
A simple and safe way to handle this is to use a list instead of a plain array:
List<Object> myList = new ArrayList<Object>(0);
This way you can recreate the list whenever you want; you don't need to set anything to null. The old list becomes garbage and is collected, and a new one is created in memory.
You can read more in the Oracle Java documentation.
Related
I used the following code to compare the performance of an array, an ArrayList, and a LinkedList:
import java.util.ArrayList;
import java.util.LinkedList;
public class Main3 {
public static void main(String[] args) throws Exception{
int n = 20000000;
long bt = 0, et = 0;
int[] a0 = new int[n];
ArrayList<Integer> a1 = new ArrayList<>(n);
LinkedList<Integer> a2 = new LinkedList<>();
Integer[] a3 = new Integer[n];
bt = System.currentTimeMillis();
for(int i=0; i<n; i++){
a0[i] = i;
}
et = System.currentTimeMillis();
System.out.println("===== loop0 time =======" + (et - bt));
bt = System.currentTimeMillis();
for(int i=0; i<n; i++){
a1.add(i);
}
et = System.currentTimeMillis();
System.out.println("===== loop1 time =======" + (et - bt));
bt = System.currentTimeMillis();
for(int i=0; i<n; i++){
a2.add(i);
}
et = System.currentTimeMillis();
System.out.println("===== loop2 time =======" + (et - bt));
bt = System.currentTimeMillis();
for(int i=0; i<n; i++){
a3[i] = i;
}
et = System.currentTimeMillis();
System.out.println("===== loop3 time =======" + (et - bt));
}
}
The result is
===== loop0 time =======11
===== loop1 time =======6776
===== loop2 time =======17305
===== loop3 time =======56
Why are the ArrayList and LinkedList so much slower than the array?
How can I improve their performance?
env:
Java: jdk1.8.0_231
Thanks
There are potential inaccuracies in your benchmark, but the overall ranking of the results is probably correct. You may get faster results for all of the benchmarks if you "warm up" the code before taking timings, to allow the JIT compiler to generate native code and optimise it. Some benchmark results may then be closer, or even equal.
Iterating over an int array is going to be much faster than iterating over a List of Integer objects. A LinkedList is going to be slowest of all. These statements assume the optimiser doesn't make radical changes.
Let's look at why:
An int array (int[]) is a contiguous area of memory containing your 4 byte ints arranged end-to-end. The loop to iterate over this and set the elements just has to work its way through the block of memory setting each 4 bytes in turn. In principle an index check is required, but in practice the optimiser can realise this isn't necessary and remove it. The JIT compiler is well able to optimise this kind of thing based on native CPU instructions.
An ArrayList of Integer objects contains an array of references which point to individual Integer objects (or are null). Each Integer object has to be allocated separately (although Java caches Integer objects for small values). There is an overhead to allocating new objects, and in addition each reference may be 8 bytes instead of 4. Also, if the list size is not specified (though it is in your case), the internal array may need to be reallocated. There is also an overhead from calling the add method instead of assigning to the array directly (though the optimiser may remove this).
Your array of Integer benchmark is similar to the array list but doesn't have the overhead of the list add method call (which has to track the list size). Probably your benchmark overstates the difference between this array and the array list though.
A LinkedList is the worst case. Linked lists are optimised for inserting in the middle. They have references to point to the next item in the list and nodes to hold those references in addition to the Integer object that needs allocating. This is a big memory overhead that also requires some initialisation and you would not use a linked list unless you were expecting to insert a lot of elements into the middle of the list.
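If you want to reduce the warm-up effect mentioned above without pulling in a full benchmark harness such as JMH, a minimal (and admittedly crude) sketch is to run each workload once before timing it, so the JIT has a chance to compile the hot loop, and only measure the second pass. The class and method names below are made up for illustration; this shows the idea for the int[] case only:
class WarmUpSketch {
    static long timeIntArrayFill(int n) {
        int[] a = new int[n];
        // warm-up pass: result discarded, but the loop gets JIT-compiled
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        // timed pass
        long bt = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return System.currentTimeMillis() - bt;
    }

    public static void main(String[] args) {
        System.out.println("int[] fill: " + timeIntArrayFill(20000000) + " ms");
    }
}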
The first question in my exam: I am working on a small task where I am required to store around 500 million+ elements in an array.
However, I am running into a heap space problem. Could you please help me find the optimal way to store this much data?
I found BitSet, but I don't know how to use it.
Step 1 - Create 3 long[] arrays with very large length (Least 100M+)
Step 2 - Init values should be randomly generated, not sorted, may contain duplicates
Step 3 - Merge the 3 long[] arrays after they have been initialized
Step 4 - Duplicate items should be removed in output
I wrote a few things :
package exam1;
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Random;
/**
 * @author Furkan
 */
//VM OPTIONS -> -Xincgc -Xmx4g -Xms4g
public final class Exam1 {
private static final int LENGTH = 100000000;
private volatile long[] m_testArr1 = null;
private volatile long[] m_testArr2 = null;
private volatile long[] m_testArr3 = null;
private volatile long[] m_merged = null;
private Random m_r = new Random(System.currentTimeMillis());
public static void main(String[] args) {
Exam1 exam = new Exam1();
Instant start1 = Instant.now();
System.out.println("Fill Started");
exam.Fill();
Instant end1 = Instant.now();
System.out.println("Fill Ended : " + Duration.between(start1, end1));
Instant start2 = Instant.now();
System.out.println("Merge Started");
exam.Merge();
Instant end2 = Instant.now();
System.out.println("Merge Ended : " + Duration.between(start1, end1));
Instant start3 = Instant.now();
System.out.println("DupRemove Started");
exam.DupRemove();
Instant end3 = Instant.now();
System.out.println("DupRemove Ended : " + Duration.between(start1, end1));
}
private void Fill(){
this.m_testArr1 = new long[Exam1.LENGTH];
this.m_testArr2 = new long[Exam1.LENGTH];
this.m_testArr3 = new long[Exam1.LENGTH];
for (int i = 0; i < Exam1.LENGTH; i++) {
this.m_testArr1[i] = this.m_r.nextLong();
this.m_testArr2[i] = this.m_r.nextLong();
this.m_testArr3[i] = this.m_r.nextLong();
}
}
private void Merge(){
this.m_merged = this.TryMerge(this.m_testArr1, this.m_testArr2, this.m_testArr3);
}
private void DupRemove(){
this.m_merged = this.RemoveDuplicates(this.m_merged);
}
public long[] TryMerge(long[] arr1, long[] arr2, long[] arr3){
int aLen = arr1.length;
int bLen = arr2.length;
int cLen = arr3.length;
int len = aLen + bLen + cLen;
//TODO: Use BitSet for RAM optimization. IDK how to use it...
//OutOfMemory Exception on this line.
long[] mergedArr = new long[len];
this.m_merged = new long[len];
//long[] mergedArr = (long[]) Array.newInstance(long.class, aLen+bLen+cLen);
System.arraycopy(arr1, 0, mergedArr, 0, aLen);
System.arraycopy(arr2, 0, mergedArr, aLen, bLen);
System.arraycopy(arr3, 0, mergedArr, (aLen + bLen), cLen);
return mergedArr;
}
//!!!NOT WORKING!!!
private long[] RemoveDuplicates(long[] arr){
HashSet<Long> set = new HashSet<Long>();
final int len = arr.length;
for(int i = 0; i < len; i++){
set.add(arr[i]);
}
long[] clean = new long[set.size()];
int i = 0;
for (Iterator<Long> it = set.iterator(); it.hasNext();) {
clean[i++] = it.next();
}
return clean;
}
}
UPDATE
Original question:
-Implement an efficient method to merge 3 very large (length: 100M+) long[] arrays.
-Input data is randomly generated, not sorted, and may contain duplicates.
-Duplicate items should be removed in the output.
(I have 8 GB of RAM.)
Run Args: -Xincgc -Xmx4g -Xms4g
Exception : Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at test.
Since you have limited space, and assuming you're allowed to modify the 3 random arrays, I'd suggest the following.
For each of the 3 arrays:
Sort the array, e.g. using Arrays.sort().
Eliminate duplicates by compacting the distinct numbers to the beginning.
E.g. if you have {1,2,2,3,3}, you compact to {1,2,3,?,?} with length 3, where ? means the values don't matter (a sketch of this sort-and-compact step follows the list).
(optional) Move to array of correct size, and discard original array, to free up space for result array(s).
Create result array of size len1 + len2 + len3.
Merge the 3 arrays to the result, eliminating duplicates between the arrays.
E.g. if you have {1,3,5}, {1,2,3}, you end up with {1,2,3,5,?,?} with length 4.
If needed, copy result to new array of correct size.
If low on memory, release 3 original arrays before doing this to free up space.
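As a rough illustration of the per-array sort-and-compact step above, here is a sketch; the helper name is made up, and the slots past the returned length are left as garbage:
import java.util.Arrays;

class SortAndDedupSketch {
    // Sorts the array, then moves each distinct value to the front.
    // Returns the number of distinct values; slots beyond that are garbage.
    static int sortAndDedup(long[] a) {
        Arrays.sort(a);
        int unique = 0;
        for (int i = 0; i < a.length; i++) {
            if (i == 0 || a[i] != a[unique - 1]) {
                a[unique++] = a[i];
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        long[] a = {3, 1, 2, 2, 3};
        int len = sortAndDedup(a);
        System.out.println(len + " -> " + Arrays.toString(Arrays.copyOf(a, len))); // 3 -> [1, 2, 3]
    }
}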
Use a Bloom filter to identify possible duplicates, then use a hash set to weed out the false positives from that set of candidates, i.e.:
For each source array element, check the Bloom filter first: if the element is (possibly) already in it, add it to a hash set of candidate duplicates; otherwise add it to the Bloom filter and to the merged array. When all source arrays are processed, walk the merged array and remove every element you find there from the hash set; what remains in the hash set are the false positives that were never actually added to the merged array. Finally, add those remaining elements to the merged array.
Guava has a bloom filter data structure that you can use.
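If you go that route, a rough sketch of the first pass could look like the following (BloomFilter, Funnels, mightContain and put are Guava APIs; the class name and the sizing parameters are placeholders, not tuned values):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.HashSet;
import java.util.Set;

class BloomFirstPassSketch {
    // First pass over one source array: values the filter has definitely not
    // seen go straight to the merged output; "possibly seen" values become
    // candidate duplicates to be verified against a hash set later.
    static void firstPass(long[] source, BloomFilter<Long> filter, Set<Long> candidates) {
        for (long value : source) {
            if (filter.mightContain(value)) {
                candidates.add(value);  // maybe a duplicate, maybe a false positive
            } else {
                filter.put(value);      // definitely new so far; also append it to the merged output here
            }
        }
    }

    public static void main(String[] args) {
        // 300M expected insertions at a 1% false-positive rate: placeholder numbers
        BloomFilter<Long> filter = BloomFilter.create(Funnels.longFunnel(), 300_000_000, 0.01);
        Set<Long> candidates = new HashSet<>();
        firstPass(new long[] {1L, 2L, 1L}, filter, candidates);
        System.out.println(candidates); // [1]
    }
}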
If you don't have enough memory to store all the data, you need to change something, by analysing the business requirements and the real-world situation.
Maybe you should use a built-in collection framework, as others suggested.
Or, if that is not allowed (for whatever reason), you should store the data somewhere other than memory. E.g.:
sort the arrays
walk the three arrays with three moving indexes (i, j, k)
always pick the smallest of arr1[i], arr2[j], arr3[k]
ignore if it is a duplicate and move on
write to a file if it is a new value
and increment the corresponding index
do it until the end of each array
Now you have the sorted, duplicate-free merged array in a file, which you can read back if necessary after dropping the originals (a sketch of this merge loop follows).
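A sketch of that merge loop (the arrays are assumed to be already sorted; the class name and the file handling are illustrative, not prescriptive):
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class MergeToFileSketch {
    // Merges three sorted long[] arrays into a file, writing each distinct
    // value exactly once, in ascending order.
    static void mergeToFile(long[] a, long[] b, long[] c, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            int i = 0, j = 0, k = 0;
            while (i < a.length || j < b.length || k < c.length) {
                // pick the smallest value any of the three indexes points at
                long min = Long.MAX_VALUE;
                if (i < a.length) min = Math.min(min, a[i]);
                if (j < b.length) min = Math.min(min, b[j]);
                if (k < c.length) min = Math.min(min, c[k]);
                out.writeLong(min); // each iteration sees a strictly larger value
                // skip over every occurrence of the minimum in all three arrays
                while (i < a.length && a[i] == min) i++;
                while (j < b.length && b[j] == min) j++;
                while (k < c.length && c[k] == min) k++;
            }
        }
    }
}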
I have this:
long hnds[] = new long[133784560]; // 133 million
Then I quickly fill the array (a couple of ms) and then I somehow want to know the number of unique (i.e. distinct) values. Now, I don't even need this in real time, I just need to try out a couple of variations and see how many unique values each gives.
I tried e.g. this:
import org.apache.commons.lang3.ArrayUtils;
....
HashSet<Long> length = new HashSet<Long>(Arrays.asList(ArrayUtils.toObject(hnds)));
System.out.println("size: " + length.size());
and after waiting for half an hour it gives a heap space error (I run with -Xmx4000m).
I also tried initializing Long[] hnds instead of long[] hnds, but then the initial filling of the array takes forever. Or for example use a Set from the beginning when adding the values, but also then it takes forever. Is there any way to count the distinct values of a long[] array without waiting forever? I'd write it to a file if I have to, just some way.
My best suggestion would be to use a library like fastutil (http://fastutil.di.unimi.it/) and then use the custom unboxed hash set:
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
System.out.println(new LongOpenHashSet(hnds).size());
(Also, by the way, if you can accept approximate answers, there are much more efficient algorithms you can try; see e.g. this paper for details.)
Just sort it and count.
int sz = 133784560;
Random randy = new Random();
long[] longs = new long[sz];
for(int i = 0; i < sz; i++) { longs[i] = randy.nextInt(10000000); }
Arrays.sort(longs);
long lastSeen = longs[0];
long count = 1; // the first element is always distinct
for(int i = 1; i < sz; i++) {
if(longs[i] != lastSeen) count++;
lastSeen = longs[i];
}
System.out.println("distinct values: " + count);
Takes about 15 seconds on my laptop.
I'm currently programming something where I'm paying a lot of attention to performance and ram usage.
I came wondering with this problem, and I was trying to make a decision. Imagine this situation:
I need to associate a certain class (Location) and an Integer with a String (say, a name). So a name has an id and a Location....
What would be the best approach to this?
First: Create two hashmaps
HashMap<String, Location> one = new HashMap<String, Location>();
HashMap<String, Integer> two = new HashMap<String, Integer>();
Second: Use only one hashmap and create a new class
HashMap<String, NewClass> one = new HashMap<String, NewClass>();
where NewClass contains:
class NewClass {
Location loc;
Integer id; // "int" is a reserved word, so the field needs a different name
}
If you want every String to be coupled with BOTH the location and integer, use a new class, it will be much easier to debug and maintain, because it makes sense. A String X is connected to both a location and an integer. It ensures you will do less mistakes (like inserting only one of them, or deleting only one), and will be more readable.
If the association is loose, and some strings might need only location, and some only integers - using two maps is probably preferable, as future readers of the code (including you in 3 months) will fail to understand what is this new class and why the String X needs to have a location.
tl;dr:
String->NewClass if each string is always associated with both a location and an integer
String->Integer, String->Location if each string is independently associated with locations and integers.
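To make the tl;dr concrete, here is a minimal illustration of the two options (the class and field names are made up for the example):
import java.util.HashMap;
import java.util.Map;

class Location { }

class NameInfo {                   // the "NewClass" of the question
    final Location loc;
    final int id;
    NameInfo(Location loc, int id) { this.loc = loc; this.id = id; }
}

class MapChoiceSketch {
    public static void main(String[] args) {
        // Option 1: two parallel maps; every insert or remove must touch both.
        Map<String, Location> locations = new HashMap<>();
        Map<String, Integer> ids = new HashMap<>();
        locations.put("alice", new Location());
        ids.put("alice", 42);

        // Option 2: one map to a small value object; one put, one get.
        Map<String, NameInfo> byName = new HashMap<>();
        byName.put("alice", new NameInfo(new Location(), 42));
        System.out.println(byName.get("alice").id); // 42
    }
}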
If you always need to retrieve both the id and the Location, the first approach requires 2 hash lookups while the second requires only 1. In that case, the second approach should have slightly better performance.
To test that I did the simple test below:
// create 2 hashes with 1M entries
for (int i = 0; i < 1000000; i++){
String s = new BigInteger(80, random).toString(32);
hash1.put(s, s);
hash2.put(s, new BigInteger(80, random).intValue());
}
// create 1 hash with 1M entries
for (int i = 0; i < 1000000; i++){
String s = new BigInteger(80, random).toString(32);
NewClass n = new NewClass();
n.i = new BigInteger(80, random).intValue();
n.loc = s;
hash3.put(s, n);
}
// 5M lookups
long start = new Date().getTime();
for (int i = 0; i < 5000000; i++){
String s = "AAA";
hash1.get(s);
hash2.get(s);
}
System.out.println("Approach 1 (2 hashes): " + (new Date().getTime() - start));
// 5M lookups
long start2 = new Date().getTime();
for (int i = 0; i < 5000000; i++){
String s = "BBB";
hash3.get(s);
}
System.out.println("Approach 2 (1 hash): " + (new Date().getTime() - start2));
Running on my computer, the results were:
Approach 1 (2 hashes): 37 ms
Approach 2 (1 hash): 18 ms
The test is super simplistic and, if you are to consider serious performance issues, you should investigate this more deeply, considering other aspects such as memory footprint, the cost of object creation, etc. But, in any case, using 2 hashes will increase the total lookup time.
I've been given some lovely Java code that has a lot of things like this (in a loop that executes about 1.5 million times).
code = getCode();
for (int intCount = 1; intCount < vA.size() + 1; intCount++)
{
oA = (A)vA.elementAt(intCount - 1);
if (oA.code.trim().equals(code))
currentName= oA.name;
}
Would I see significant increases in speed from switching to something like the following?
code = getCode();
//AMap is a HashMap
strCurrentAAbbreviation = (String)AMap.get(code);
Edit: The size of vA is approximately 50. The trim shouldn't even be necessary, but it would definitely be nice to call it 50 times instead of 50 * 1.5 million. The items in vA are unique.
Edit: At the suggestion of several responders, I tested it. Results are at the bottom. Thanks guys.
There's only one way to find out.
Ok, Ok, I tested it.
Results follow for your enlightenment:
Looping: 18391ms
Hash: 218ms
Looping: 18735ms
Hash: 234ms
Looping: 18359ms
Hash: 219ms
I think I will be refactoring that bit...
The framework:
public class OptimizationTest {
private static Random r = new Random();
public static void main(String[] args){
final long loopCount = 1000000;
final int listSize = 55;
long loopTime = TestByLoop(loopCount, listSize);
long hashTime = TestByHash(loopCount, listSize);
System.out.println("Looping: " + loopTime + "ms");
System.out.println("Hash: " + hashTime + "ms");
}
public static long TestByLoop(long loopCount, int listSize){
Vector vA = buildVector(listSize);
A oA;
StopWatch sw = new StopWatch();
sw.start();
for (long i = 0; i< loopCount; i++){
String strCurrentStateAbbreviation;
int j = r.nextInt(listSize);
for (int intCount = 1; intCount < vA.size() + 1; intCount++){
oA = (A)vA.elementAt(intCount - 1);
if (oA.code.trim().equals(String.valueOf(j)))
strCurrentStateAbbreviation = oA.value;
}
}
sw.stop();
return sw.getElapsedTime();
}
public static long TestByHash(long loopCount, int listSize){
HashMap hm = getMap(listSize);
StopWatch sw = new StopWatch();
sw.start();
String strCurrentStateAbbreviation;
for (long i = 0; i < loopCount; i++){
int j = r.nextInt(listSize);
strCurrentStateAbbreviation = (String)hm.get(String.valueOf(j)); // keys are Strings, so convert before the lookup
}
sw.stop();
return sw.getElapsedTime();
}
private static HashMap getMap(int listSize) {
HashMap hm = new HashMap();
for (int i = 0; i < listSize; i++){
String code = String.valueOf(i);
String value = getRandomString(2);
hm.put(code, value);
}
return hm;
}
public static Vector buildVector(long listSize)
{
Vector v = new Vector();
for (int i = 0; i < listSize; i++){
A a = new A();
a.code = String.valueOf(i);
a.value = getRandomString(2);
v.add(a);
}
return v;
}
public static String getRandomString(int length){
StringBuffer sb = new StringBuffer();
for (int i = 0; i< length; i++){
sb.append(getChar());
}
return sb.toString();
}
public static char getChar()
{
final String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int i = r.nextInt(alphabet.length());
return alphabet.charAt(i);
}
}
Eh, there's a good chance that you would, yes. Retrieval from a HashMap is going to be constant time if you have good hash codes.
But the only way you can really find out is by trying it.
This depends on how large your map is, and how good your hashCode implementation is (such that you do not have collisions).
You should really do some real profiling to be sure if any modification is needed, as you may end up spending your time fixing something that is not broken.
What actually stands out to me a bit more than the elementAt call is the string trimming you are doing with each iteration. My gut tells me that might be a bigger bottleneck, but only profiling can really tell.
Good luck
I'd say yes, since the above appears to be a linear search over vA.size(). How big is vA?
Why don't you use something like YourKit (or insert another profiler) to see just how expensive this part of the loop is.
Using a Map would certainly be an improvement that helps maintaining that code later on.
Whether you can use a map depends on whether the vector contains unique codes or not. The for loop given would remember the last object in the list with a given code, which would mean a hash map is not the solution if codes repeat.
For small (stable) list sizes, simply converting the list to an array of objects would show a performance increase on top of better readability.
If none of the above holds, at least use an iterator to inspect the list, giving better readability and some (probable) performance increase.
Depends. How much memory you got?
I would guess much faster, but profile it.
I think the dominant factor here is how big vA is, since the loop needs to run n times, where n is the size of vA. With the map, there is no loop, no matter how big vA is. So if n is small, the improvement will be small. If it is huge, the improvement will be huge. This is especially true because even after finding the matching element the loop keeps going! So if you find your match at element 1 of a 2 million element list, you still need to check the last 1,999,999 elements!
Yes, it'll almost certainly be faster. Looping an average of 25 times (half-way through your 50) is slower than a hashmap lookup, assuming your vA contents are decently hashable.
However, speaking of your vA contents, you'll have to trim them as you insert them into your aMap, because aMap.get("somekey") will not find an entry whose key is "somekey ".
Actually, you should do that as you insert into vA, even if you don't switch to the hashmap solution.
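Something like the following sketch would do it, assuming the question's class A with String fields code and name, a Vector vA, and unique codes (as the edit states), plus the usual java.util imports; the map is built once, outside the hot loop:
// Build the lookup map once, trimming keys at insertion time.
Map<String, String> codeMap = new HashMap<>();
for (int i = 0; i < vA.size(); i++) {
    A oA = (A) vA.elementAt(i);
    codeMap.put(oA.code.trim(), oA.name);
}

// Inside the 1.5-million-iteration loop:
String code = getCode();
String currentName = codeMap.get(code); // no per-iteration trimming or scanning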