How to speed up iteration over a List in Java

I have a List<String> with almost 20,000 records in it (and possibly more).
I need to iterate over this list, and it takes almost 3 minutes.
Here is my block of code:
for (String string : list) {
    response += string;
    response += "/t";
}
I have two questions:
1. Is the long running time caused by the List iteration itself, or by the operation on each item?
2. Depending on the answer to question 1, how can I speed up this operation?

The poor performance is most likely caused by your use of string concatenation: each += copies the whole accumulated string into a new object. Use a StringBuilder instead.
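A minimal sketch of the StringBuilder approach. It assumes the intended separator is a tab character (the question's code appends the literal "/t"); the class and method names are illustrative only.

```java
import java.util.List;

final class JoinWithBuilder {
    // Appends every element followed by a tab, reusing one growable buffer
    // instead of allocating a new String on each iteration.
    static String joinWithTabs(List<String> list) {
        StringBuilder response = new StringBuilder();
        for (String string : list) {
            response.append(string).append('\t');
        }
        return response.toString();
    }
}
```

With repeated += on a String the loop does work proportional to the square of the output length, while the builder version grows roughly linearly.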

Consider using a Map if it is applicable. Here is a link listing very common Java collection classes and the cost of their operations in Big-O notation.
http://objectissues.blogspot.com/2006/11/big-o-notation-and-java-constant-time.html

Related

Is if statement in for loop faster than one by one if statements in Java?

I wonder whether putting the conditions in a HashMap and looping over them with a single if statement would perform better than writing the if / else-if statements one by one.
In my opinion, the one-by-one if / else-if statements may be faster, because the for loop evaluates one extra condition on each pass (has the counter reached the target number?). So for each if statement the loop effectively runs two checks. Of course the bodies of the statements differ, but if we talk purely about the performance of the statements themselves, I think the one-by-one version would be better?
Edit: this is just sample code; my question is about the performance difference between these two approaches.
// Version 1: iterate over the map and test each key
Map<String, Integer> words = new HashMap<>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);
words.forEach((word, number) -> {
    if (letter.equals(word)) {
        System.out.println(number * n);
    }
});
// Version 2: cascading if / else-if
String letter = "d";
int n = 4;
if (letter.equals("a")) {
    System.out.println(n * 1);
} else if (letter.equals("b")) {
    System.out.println(n * 2);
} else if (letter.equals("c")) {
    System.out.println(n * 3);
} else if (letter.equals("d")) {
    System.out.println(n * 4);
} else if (letter.equals("e")) {
    System.out.println(n * 5);
}
For your example, having a HashMap but then doing an iterative lookup is a bad idea. The point of using a HashMap is to do a hash-based lookup, which is much faster than an iterative one.
Also, for your example, cascading if-then tests will definitely be faster, since they avoid the overhead of the map iterator and the extra function calls, as well as the iterator's cost of skipping empty storage locations in the hash map's backing array. A better question is whether cascading if-thens are faster than iterating across a simple list. That is hard to answer. Cascading if-thens seem likely to be faster, except that with a lot of if-thens the cost of loading the code should be added.
For string lookups, a list data structure provides adequate behavior up to a limiting size, above which a more sophisticated data structure must be used. The limiting size depends on the environment. For string comparisons, I've found the transition to lie between 20 and 100 elements.
For particular kinds of lookups, and depending on whether low-level optimizations are available, the transition value may be much larger. For example, for integer lookups in C, which can do direct memory lookups, the transition value is much higher.
Typical data structures are HashMaps, tries, and sorted arrays. Each fits particular patterns of access. For example, sorted arrays are fastest and most compact, but are expensive to update. HashMaps support dynamic updates and, with good hash functions, provide constant-time lookups. But HashMaps are space inefficient, since they depend on having empty cells between hash values.
For cases which do not involve "very large" data sets and which are not in critical "hot" code paths, a HashMap is the usual choice.
If you have a Map and you want to retrieve one letter, I'm not sure why you would loop at all?
Map<String, Integer> words = new HashMap<>();
String letter = "d";
int n = 4;
words.put("a", 1);
words.put("b", 2);
words.put("c", 3);
words.put("d", 4);
words.put("e", 5);
if (words.containsKey(letter)) {
    System.out.println(words.get(letter) * n);
} else {
    System.out.println(letter + " doesn't exist in Map");
}
If you aren't using the benefits of a Map, then why use a Map at all?
A forEach will actually touch every key in the map. The number of checks in your if/else chain depends on where the letter sits in the chain and how long the chain of available letters is. If the letter you choose is the last one, all checks run before printing. If it is the first, only one check runs, which is much faster than running them all.
It would be easy for you to write the two examples and run a timer to determine which is actually faster.
https://www.baeldung.com/java-measure-elapsed-time
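A rough timing sketch along those lines, using System.nanoTime. The helper name timeNanos and the sample workload are hypothetical, and a single run like this is distorted by JIT warm-up, so treat the numbers as indicative at best; a benchmark harness such as JMH is the usual choice for trustworthy results.

```java
import java.util.HashMap;
import java.util.Map;

final class LookupTimer {
    // Runs the task once and returns the elapsed time in nanoseconds.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Map<String, Integer> words = new HashMap<>();
        words.put("a", 1);
        words.put("d", 4);
        String letter = "d";

        // Time one million hash lookups.
        long mapNanos = timeNanos(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                words.get(letter);
            }
        });
        // Time one million passes through a two-branch if / else-if chain.
        long ifNanos = timeNanos(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                if (letter.equals("a")) { /* ... */ } else if (letter.equals("d")) { /* ... */ }
            }
        });
        System.out.println("map lookup: " + mapNanos + " ns, if-else: " + ifNanos + " ns");
    }
}
```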
There are a lot of wasted calculations if you have to run through 1 million if/else statements and only select one which could be anywhere in the list. This doesn't include typos and the horror of code maintenance. Using a Map with an index would be much quicker. If you are only talking about 100 if/else statements (still too many in my opinion) then you may be able to break even on speed.

Why is functional faster than imperative for large amounts of data and slower than imperative for small amounts of data?

I wanted to compare the performance of loops to streams. For this I wrote 2 methods. Both filter out names beginning with 'A' and return them in a String.
If I do this with 50.000 randomly generated names, the imperative way is faster. But if I do this with 500.000 randomly generated names, the functional way is faster.
My question is why is the functional paradigm slower for small data sets and faster for large data sets? Do streams need a lot of time to initialize but after this they are more efficient?
public String imperativeArray() {
    String result = "";
    for (String name : arraytestSet) {
        if (name.startsWith("A")) {
            if (result.isEmpty()) {
                result += name;
            } else {
                result += "," + name;
            }
        }
    }
    return result;
}
public String functionalArray() {
    return Arrays.stream(arraytestSet)
            .filter(e -> e.startsWith("A"))
            .collect(Collectors.joining(","));
}
There shouldn’t be any performance difference between imperative and functional. Functional programming will improve readability of code by using predefined functions. Internally they must do similar type of processing. One of the main advantages of functional style is that you can understand the code with a single glance at it.
Parallel streams can be faster than your imperative code, but they only pay off if you are sure the data will be huge and will benefit from parallel processing. Otherwise a parallel stream will be slower than a sequential one.
Coming to your sample code, I think Collectors.joining gives you a performance benefit over the loop with the if-else check, because the loop uses string concatenation, which creates a new String object every time it appends. Collectors.joining internally uses a StringBuilder, so it works on a single object for every append.
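To illustrate that point, here is a sketch of the imperative version rewritten with java.util.StringJoiner, which avoids building a new String on every append. The class name and the array parameter are assumptions standing in for the question's arraytestSet field.

```java
import java.util.StringJoiner;

final class NameFilter {
    // Collects names starting with "A" into a comma-separated string
    // without concatenating Strings inside the loop.
    static String imperativeJoin(String[] names) {
        StringJoiner joiner = new StringJoiner(",");
        for (String name : names) {
            if (name.startsWith("A")) {
                joiner.add(name);
            }
        }
        return joiner.toString();
    }
}
```

Benchmarking this variant against the stream version would be a fairer comparison of loop versus stream, since both then accumulate the result the same way.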

How to use multi-threading to make my application faster

I am iterating through a List of Strings with about 1,500 entries. In each iteration I iterate through another List of Strings, this time with about 35 million entries. The result of the application is correct, but it takes a long time (2+ hours) to produce it. How should I structure multithreading to make my application faster?
The order of the result List is not important.
Should I divide the big List (35 million entries) into smaller blocks and iterate through them in parallel? (How can I determine the ideal number of blocks?)
Should I start a thread for each iteration over the small List? (This would create 1,500 threads, and I guess a lot of them would run in parallel.)
What are my other options?
Representation of the code:
List<String> result = new ArrayList<String>();
for (Iterator<String> i = data1.iterator(); i.hasNext();) { // 1500 entries
    String val = i.next();
    for (Iterator<String> j = data2.iterator(); j.hasNext();) { // 35 million entries
        String test = j.next();
        if (val.equals(test)) {
            result.add(val);
            break;
        }
    }
}
for (Iterator<String> h = result.iterator(); h.hasNext();) {
    // write to file
}
UPDATE
After restructuring my code and implementing the answer given by JB Nizet my application now runs a lot faster. It now only takes 20 seconds to get to the same result! Without multi-threading!
You could use a parallel stream:
List<String> result =
        data1.parallelStream()
                .filter(data2::contains)
                .collect(Collectors.toList());
But since you call contains() on data2 1500 times, and since contains() is O(N) for a list, transforming it to a HashSet first could make things much faster: contains() on HashSet is O(1). You might not even need multi-threading anymore:
Set<String> data2Set = new HashSet<>(data2);
List<String> result =
        data1.stream()
                .filter(data2Set::contains)
                .collect(Collectors.toList());
I agree with your idea. Here is what you need to do:
First, determine the number of processors in your system.
Split your records based on that number and create exactly that many threads (at most numberOfProcessors * 2; beyond that, context switching between threads will degrade performance).
Do not create an unnecessarily large number of threads; that will not speed up your application. Work out how many threads to create from the number of processors and the amount of memory in the system. Efficient parallel processing also depends on your machine's hardware.
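A sketch of that advice, assuming the work is the membership test from the question. It sizes a fixed thread pool from Runtime.availableProcessors() and splits the small list into about one chunk per thread. The class name, method name, and chunking scheme are illustrative and not taken from the answers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedMatcher {
    // Checks each chunk of data1 against data2 on its own pool thread.
    // Results are gathered in chunk order.
    static List<String> findMatches(List<String> data1, List<String> data2) {
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        Set<String> lookup = new HashSet<>(data2); // only read after construction
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunkSize = Math.max(1, (data1.size() + threads - 1) / threads);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < data1.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, data1.size());
                List<String> chunk = data1.subList(start, end);
                futures.add(pool.submit(() -> {
                    List<String> found = new ArrayList<>();
                    for (String val : chunk) {
                        if (lookup.contains(val)) {
                            found.add(val);
                        }
                    }
                    return found;
                }));
            }
            List<String> result = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For a workload this small, the HashSet conversion does most of the work; the thread pool only pays off when the per-element work is expensive.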

What are the performance differences with these two uses of the map stream function in java 8

Say I have the functions mutateElement(), which does x operations, and mutateElement2(), which does y operations. What is the difference in performance between these two pieces of code?
Piece1:
// Assumes each function returns the transformed element.
List<Object> result = array.stream()
        .map(elem -> mutateElement2(mutateElement(elem)))
        .collect(Collectors.toList());
Piece2:
// Assumes array is an already-declared List<Object>.
array = array.stream()
        .map(elem -> mutateElement(elem))
        .collect(Collectors.toList());
array = array.stream()
        .map(elem -> mutateElement2(elem))
        .collect(Collectors.toList());
Clearly the first implementation is better, as it uses only one iterator whereas the second uses two. But would the difference be noticeable if I had, say, a million elements in the array?
The first implementation is not better simply because it uses only one iterator; it is better because it only collects once.
Nobody can tell you whether the difference would be noticeable if you had a million elements. (And if someone did try to tell you, you should not believe them.) Benchmark it.
Whether you use a stream or an external loop, the problem is the same:
one iteration over the List in the first snippet, and two iterations over the List in the second.
So the execution time of the second snippet is logically higher.
Besides, invoking the terminal operation on the stream,
.collect(Collectors.toList());
twice rather than once also has a cost.
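A sketch of a middle ground: two separate map stages in one pipeline, collected once. Stream stages are applied element by element, so this still makes a single pass and materializes only one list. The two mutate functions here are hypothetical stand-ins, since the question does not show them; they are defined on strings purely for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ChainedMap {
    // Hypothetical stand-ins for the question's mutateElement / mutateElement2.
    static String mutateElement(String s) {
        return s.trim();
    }

    static String mutateElement2(String s) {
        return s.toUpperCase();
    }

    // Both transformations run in one traversal with one terminal collect.
    static List<String> transform(List<String> input) {
        return input.stream()
                .map(ChainedMap::mutateElement)
                .map(ChainedMap::mutateElement2)
                .collect(Collectors.toList());
    }
}
```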
"But would the difference be noticeable if I had say a million elements in the array."
It could be.
The question is hard to answer with a plain yes or no.
It depends on other parameters such as the CPUs, the number of concurrent users and processes, and your definition of "noticeable".

How should I map string keys to values in Java in a memory-efficient way?

I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something memory efficient; retrieval speed is a secondary concern.
Currently I'm going along the lines of:
List<Tuple<String, Integer>> list = new ArrayList<Tuple<String, Integer>>();
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.
Edit: I just saw you mentioned the Strings were UK postcodes, so I'm fairly confident you couldn't go very wrong by using a Trove TLongIntHashMap (btw Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and, that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about memory and speed of Trove compared to the default Collections. Here's a snippet:
                   100000 put operations   100000 contains operations
java collections   1938 ms                 203 ms
trove              234 ms                  125 ms
pcj                516 ms                  94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections   oscillates between 6644536 and 7168840 bytes
trove              1853296 bytes
pcj                1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, these numbers suggest that Trove will not only save memory but also be much faster.
So our goal now becomes to use Trove (given that, with millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7 characters long keys and a String/int mapping.
2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String,Integer>, which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can avoid object creation altogether and represent your 7 characters as a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of keys/values pair into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) say you only have at most 256 codepoints/characters used, then it fits on 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into longs (assuming ASCII characters, one byte per character for simplicity; 7 bits would be enough):
long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // place character i in byte i of the result (lowest byte first)
        result += ((long) ((byte) key.charAt(i))) << i * 8;
    }
    return result;
}
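Trove is a third-party dependency, so as a standard-library-only illustration of the same idea, here is a sketch that stores the encoded keys in a sorted long[] with a parallel int[] and looks them up with Arrays.binarySearch. This is a swapped-in technique (sorted primitive arrays rather than a TLongIntHashMap), and the class name PostcodeTable is hypothetical. It assumes ASCII keys of at most 8 characters, as the encode method above does.

```java
import java.util.Arrays;

final class PostcodeTable {
    private final long[] keys;  // encoded keys, sorted ascending
    private final int[] values; // values[i] belongs to keys[i]

    PostcodeTable(String[] rawKeys, int[] rawValues) {
        int n = rawKeys.length;
        long[] encoded = new long[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            encoded[i] = encode(rawKeys[i]);
            order[i] = i;
        }
        // Sort an index permutation so keys and values stay paired.
        // The boxed index array is temporary and only used during construction.
        Arrays.sort(order, (Integer a, Integer b) -> Long.compare(encoded[a], encoded[b]));
        keys = new long[n];
        values = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = encoded[order[i]];
            values[i] = rawValues[order[i]];
        }
    }

    // Packs up to 8 single-byte characters into one long, lowest byte first.
    static long encode(String key) {
        if (key.length() > 8) {
            throw new IllegalArgumentException("key is longer than 8 characters");
        }
        long result = 0;
        for (int i = 0; i < key.length(); i++) {
            result |= (key.charAt(i) & 0xFFL) << (i * 8);
        }
        return result;
    }

    // Returns the mapped value, or defaultValue if the key is absent.
    int get(String key, int defaultValue) {
        int pos = Arrays.binarySearch(keys, encode(key));
        return pos >= 0 ? values[pos] : defaultValue;
    }
}
```

Once built, the table holds 12 bytes of array payload per entry (8 for the key, 4 for the value) and no per-entry objects, at the price of O(log n) lookups and expensive updates.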
Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.
First off, did you measure that a LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want to take that approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory; the first two would provide constant-time access and the TreeMap logarithmic access, and all provide a nicer interface than a plain list. I would try to measure how large the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific for strings, such as tries (especially patricia tries), which might reduce the storage space needed for the strings.
What you are looking for is a succinct-trie - a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie class libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meantime, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.
Have you looked at tries? I've not used them, but they may fit what you're doing.
A custom tree would have the same complexity of O(log n), don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.
As Erick writes using the Trove library is a good place to start as you save space in storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
You could write the map to a file and then memory map a portion or all of the file.
You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
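The byte-array idea from the list above can be sketched as follows. The class name MyString comes from the answer; its fields and methods are assumptions, since the answer does not show an implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MyString {
    private final byte[] utf8; // UTF-8 bytes; one byte per character for ASCII text

    MyString(String value) {
        this.utf8 = value.getBytes(StandardCharsets.UTF_8);
    }

    // equals and hashCode compare the bytes so instances work as map keys.
    @Override
    public boolean equals(Object other) {
        return other instanceof MyString && Arrays.equals(utf8, ((MyString) other).utf8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(utf8);
    }

    @Override
    public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

How much this saves depends on the JVM: since JDK 9, String itself stores Latin-1 text in one byte per character, so it is worth measuring before adopting a wrapper like this.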
Maybe you can go with a RadixTree?
Use java.util.TreeMap instead of java.util.HashMap. It uses a red-black binary search tree and doesn't use more memory than is required for the nodes holding the elements in the map. No extra buckets, unlike HashMap or Hashtable.
I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.
I'd consider using some cache as these often have the overflow-to-disk ability.
You might create a key class that matches your needs. Perhaps like this:
public class MyKey implements Comparable<MyKey>
{
    char[] keyValue = new char[7];
    public MyKey(String keyValue)
    {
        // ... load this.keyValue from the String keyValue.
    }
    public int compareTo(MyKey rhs)
    {
        // ... blah
    }
    public boolean equals(Object rhs)
    {
        // ... blah
    }
    public int hashCode()
    {
        // ... blah
    }
}
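One possible way to fill in that outline, as a sketch only: the method bodies below are assumptions, not the answerer's code. It stores up to 7 characters in a char[] and bases comparison, equality, and hashing on that array.

```java
import java.util.Arrays;

final class MyKey implements Comparable<MyKey> {
    private final char[] keyValue;

    MyKey(String keyValue) {
        if (keyValue.length() > 7) {
            throw new IllegalArgumentException("key is longer than 7 characters");
        }
        this.keyValue = keyValue.toCharArray();
    }

    // Lexicographic order; on a shared prefix the shorter key sorts first.
    @Override
    public int compareTo(MyKey rhs) {
        int common = Math.min(keyValue.length, rhs.keyValue.length);
        for (int i = 0; i < common; i++) {
            int diff = Character.compare(keyValue[i], rhs.keyValue[i]);
            if (diff != 0) {
                return diff;
            }
        }
        return Integer.compare(keyValue.length, rhs.keyValue.length);
    }

    @Override
    public boolean equals(Object rhs) {
        return rhs instanceof MyKey && Arrays.equals(keyValue, ((MyKey) rhs).keyValue);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(keyValue);
    }
}
```

Because it implements Comparable, a sorted MyKey[] can be searched with Arrays.binarySearch, and the same class also works as a HashMap or TreeMap key.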
try this one
OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));
public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    public boolean containsValue(Object value) {
        if (value != null)
        {
            Class<? extends Object> aClass = value.getClass();
            if (aClass.isArray())
            {
                Collection values = this.values();
                for (Object val : values)
                {
                    // compare the first int of each stored array with the argument's
                    int[] newval = (int[]) val;
                    int[] newvalue = (int[]) value;
                    if (newval[0] == newvalue[0])
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
Actually, HashMap and List are too general for a task as specific as looking up an int by zipcode. You should take advantage of what you know about the data. One option is a prefix tree whose leaves store the int value. It could also be pruned if (my guess) many codes with the same prefix map to the same integer.
Lookup of the int by zipcode in such a tree is linear in the length of the code and does not grow as the number of codes increases, compared to O(log(N)) for binary search.
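A minimal prefix-tree sketch for string-to-int lookup. It assumes keys consist only of uppercase letters and digits (a simplification for postcode-like keys) and uses -1 as a hypothetical "no value" marker. The fixed 36-slot child array per node is chosen for clarity, not compactness; whether a trie actually saves memory depends on how much prefix sharing and pruning the data allows.

```java
final class PostcodeTrie {
    private static final int ALPHABET = 36; // 'A'-'Z' plus '0'-'9'
    private static final int ABSENT = -1;   // assumption: -1 is never a real value

    private static final class Node {
        final Node[] children = new Node[ALPHABET];
        int value = ABSENT;
    }

    private final Node root = new Node();

    // Maps a supported character to its child-array index.
    private static int slot(char c) {
        if (c >= 'A' && c <= 'Z') {
            return c - 'A';
        }
        if (c >= '0' && c <= '9') {
            return 26 + (c - '0');
        }
        throw new IllegalArgumentException("unsupported character: " + c);
    }

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int s = slot(key.charAt(i));
            if (node.children[s] == null) {
                node.children[s] = new Node();
            }
            node = node.children[s];
        }
        node.value = value;
    }

    // Walks one node per character, so the cost depends on key length, not map size.
    int get(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children[slot(key.charAt(i))];
        }
        return node == null ? ABSENT : node.value;
    }
}
```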
Since you are intending to use hashing, you can try numerical conversions of the strings based on ASCII values.
The simplest idea would be:
int sum = 0;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i];
}
Then hash "sum" using a well-defined hash function. You would choose the hash function based on the expected input patterns.
E.g. if you use the division method:
public int hasher(int sum, int prime) {
    return sum % prime; // prime: a prime number
}
Selecting a prime number that is not close to an exact power of two improves performance and gives a more uniform distribution of hashed keys.
Another method is to weight the characters based on their position.
E.g. with the above method, both "abc" and "cab" hash to the same location. If you need them stored in two distinct locations, give weights to the positions, as in positional number systems.
int sum = 0;
int weight = 1;
for (int i = 0; i < arr.length; i++) {
    sum += (int) arr[i] * weight;
    weight = weight * 2; // using powers of 2 gives better results. (you know why :))
}
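To make the "abc" versus "cab" point concrete, here is a sketch of both sums. The class and method names are illustrative.

```java
final class CharSumHash {
    // Plain sum of character codes: any permutation of the same letters collides.
    static int plainSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Position-weighted sum: the weight doubles at each position.
    static int weightedSum(String s) {
        int sum = 0;
        int weight = 1;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) * weight;
            weight *= 2;
        }
        return sum;
    }
}
```

Here plainSum gives 294 for both "abc" and "cab", while weightedSum gives 689 and 685. Weighting reduces collisions but does not eliminate them: "ab" and "ca" both produce 293 with the weighted sum.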
As your sample is quite large, you would handle collisions with a chaining mechanism rather than a probe sequence.
After all, which method you choose depends entirely on the nature of your application.
The problem is the memory overhead of objects, but with some tricks you can try to implement your own hash set. Something like this. As others said, strings have quite a large overhead, so you need to "compress" them somehow. Also try not to use too many arrays (lists) in the hash table (if you build a chaining hash table), as they are also objects and also have overhead. Better yet, build an open-addressing hash table.
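A sketch of an open-addressing table with linear probing, using two primitive arrays so that no per-entry objects are allocated. It assumes keys have already been packed into non-zero long values (for example with an encoding like the one discussed earlier in the thread), that 0 can therefore mark an empty slot, and that the capacity is fixed up front and larger than the number of entries. All names are hypothetical.

```java
final class LongIntOpenMap {
    private static final long EMPTY = 0L; // assumption: 0 is never a real key
    private final long[] keys;
    private final int[] values;
    private final int mask;

    // capacityPow2 must be a power of two and exceed the expected entry count;
    // a full table would make put and get loop forever.
    LongIntOpenMap(int capacityPow2) {
        if (capacityPow2 <= 0 || Integer.bitCount(capacityPow2) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new long[capacityPow2];
        values = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Mixes the key bits and maps them to a starting slot.
    private int firstSlot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 32) & mask;
    }

    void put(long key, int value) {
        int i = firstSlot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing: try the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    // Returns the stored value, or defaultValue if the key is absent.
    int get(long key, int defaultValue) {
        int i = firstSlot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return defaultValue;
    }
}
```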
