Arrays.sort() vs sorting using map - java

I have a requirement where I have to loop through an array holding a list of strings:
String[] arr = {"abc","cda","cka","snd"};
and match the string "bca", ignoring the order of the characters; this should return true, as it's present in the array (as "abc").
To solve this I have two approaches:
1. Use Arrays.sort() to sort both strings, then compare them with Arrays.equals().
2. Create two HashMaps, record the frequency of each letter of each string, and finally compare the two maps with the equals() method.
I read that the complexity of the Arrays.sort() method is higher, so I decided to work on the second approach. But when I run both versions, the first approach takes far less time to execute.
Any suggestions why this is happening?

Time complexity only tells you how an approach will scale with (significantly) larger input. It doesn't tell you which approach is faster.
It's perfectly possible that a solution is faster for small input sizes (string lengths and/or array length) but scales badly for larger sizes, due to its time complexity. It's even possible that you never encounter the point where an algorithm with a better time complexity becomes faster, because natural limits on the input sizes prevent you from ever reaching it.
You didn’t show the code of your approaches, but it’s likely that your first approach calls a method like toCharArray() on the strings, followed by Arrays.sort(char[]). This implies that sort operates on primitive data.
In contrast, when your second approach uses a HashMap<Character,Integer> to record frequencies, it will be subject to boxing overhead, for the characters and the counts, and also use a significantly larger data structure that needs to be processed.
So it’s not surprising that the hash approach is slower for small strings and arrays, as it has a significantly larger fixed overhead and also a size dependent (O(n)) overhead.
So the first approach would have to suffer significantly from its O(n log n) time complexity to turn this result around. But that won't happen: that complexity is a worst case for sorting in general. As explained in this answer, the algorithms specified in the documentation of Arrays.sort should not be taken for granted. When you call Arrays.sort(char[]) and the array size crosses a certain threshold, the implementation switches to Counting Sort, which has O(n) time complexity (but temporarily uses more memory).
So even with large strings, you won't suffer from a worse time complexity. In fact, Counting Sort shares similarities with the frequency map, but is usually more efficient, as it avoids the boxing overhead by using an int[] array instead of a HashMap<Character,Integer>.
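For illustration, here is a minimal sketch of the two approaches described above (the question didn't include code, so the class and method names here are assumptions):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AnagramCheck {

    // Approach 1: sort the characters of both strings and compare.
    static boolean matchesBySorting(String a, String b) {
        if (a.length() != b.length()) return false;
        char[] ca = a.toCharArray();
        char[] cb = b.toCharArray();
        Arrays.sort(ca); // may switch to Counting Sort internally for large arrays
        Arrays.sort(cb);
        return Arrays.equals(ca, cb);
    }

    // Approach 2: build and compare character-frequency maps (boxing overhead).
    static boolean matchesByFrequency(String a, String b) {
        if (a.length() != b.length()) return false;
        return frequencies(a).equals(frequencies(b));
    }

    private static Map<Character, Integer> frequencies(String s) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : s.toCharArray()) {
            freq.merge(c, 1, Integer::sum); // boxes both the char and the count
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] arr = {"abc", "cda", "cka", "snd"};
        boolean found = false;
        for (String s : arr) {
            if (matchesBySorting(s, "bca")) { found = true; break; }
        }
        System.out.println(found); // true: "abc" matches "bca"
    }
}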

Approach 1 will be O(N log N).
Approach 2 will be O(N*M), where M is the length of each string in your array.
You should search linearly in O(N):
static boolean containsExact(String[] arr, String target) {
    for (String str : arr) {
        if (str.equals(target)) return true;
    }
    return false;
}

Let's decompose the problem:
You need a function to sort a string by its chars (bccabc -> abbccc) to be able to compare a given string with the existing ones.
Function<String, String> sortChars = s -> s.chars()
        .sorted()
        .mapToObj(i -> (char) i)
        .map(String::valueOf)
        .collect(Collectors.joining());
Instead of sorting the chars of the given strings every time you compare them, you can precompute the set of unique tokens (the values from your array, with their chars sorted):
Set<String> tokens = Arrays.stream(arr)
        .map(sortChars)
        .collect(Collectors.toSet());
This will result in the values "abc","acd","ack","dns".
Afterwards you can create a function which checks if a given string, when sorted by chars, matches any of the given tokens:
Predicate<String> match = s -> tokens.contains(sortChars.apply(s));
Now you can easily check any given string as follows:
boolean matches = match.test("bca");
Matching will only need to sort the given input and do a hash set lookup to check if it matches, so it's very efficient.
You can of course write the Function and Predicate as methods instead (String sortChars(String s) and boolean matches(String s)) if you're unfamiliar with functional programming.
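For completeness, a sketch of the same logic written with plain methods (the class name and structure are mine, and this variant swaps the stream-based sortChars for the equivalent toCharArray/Arrays.sort form):

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class TokenMatcher {

    private final Set<String> tokens;

    TokenMatcher(String[] arr) {
        // Precompute the canonical (sorted-chars) form of every array entry once.
        this.tokens = Arrays.stream(arr)
                .map(TokenMatcher::sortChars)
                .collect(Collectors.toSet());
    }

    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    boolean matches(String s) {
        // One sort of the input plus one hash-set lookup per query.
        return tokens.contains(sortChars(s));
    }

    public static void main(String[] args) {
        TokenMatcher matcher = new TokenMatcher(new String[]{"abc", "cda", "cka", "snd"});
        System.out.println(matcher.matches("bca")); // true
    }
}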

More of an addendum to the other answers. Of course, your two options have different performance characteristics. But: understand that performance is not necessarily the only factor to make a decision!
Meaning: if you are talking about a search that runs hundreds or thousands of times per minute, on large data sets, then for sure, you should invest a lot of time to come up with a solution that gives you the best performance. Most likely, that includes doing various experiments with actual measurements when processing real data. Time complexity is a theoretical construct; in the real world, there are also elements such as CPU cache sizes, threading issues, IO bottlenecks, and whatnot that can have a significant impact on real numbers.
But: when your code will be doing its work just once a minute, even on a few dozen or a few hundred MB of data ... then it might not be worth focusing on performance.
In other words: the "sort" solution sounds straightforward. It is easy to understand, easy to implement, and hard to get wrong (with some decent test cases). If that solution gets the job done "good enough", then consider using it: the simple solution.
Performance is a luxury problem. You only address it if there is a reason to.

Comparison of these two algorithms?

So I'm presented with a problem that states. "Determine if a string contains all unique characters"
So I wrote up this solution that adds each character to a set, but if the character already exists it returns false.
private static boolean allUniqueCharacters(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);
        } else {
            return false;
        }
    }
    return true;
}
According to the book I am reading this is the "optimal solution"
public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;
    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}
My question is: is my implementation slower than the one presented? I assume it is, but if a hash lookup is O(1), wouldn't they have the same complexity?
Thank you.
As Amadan said in the comments, the two solutions have the same time complexity O(n) because you have a for loop looping through the string, and you do constant time operations in the for loop. This means that the time it takes to run your methods increases linearly with the length of the string.
Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.
For the same string, the "optimal" solution should be faster because sets have some overheads over arrays. Handling arrays is faster than handling sets. However, to actually make the "optimal" solution work, you would need an array of length 2^16. That is how many different char values there are. You would also need to remove the check for a string longer than 128.
This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.
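A sketch of that full-range variant (a hypothetical adaptation of the book's code, not from the book itself):

public static boolean isUniqueCharsFullRange(String str) {
    boolean[] seen = new boolean[1 << 16]; // one flag per possible char value
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (seen[c]) {            // a char indexes the array directly
            return false;
        }
        seen[c] = true;
    }
    return true;
}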
Both algorithms have time complexity of O(N). The difference is in their space complexity.
The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).
The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.
The hashmap is in theory acceptable, but it is wasteful.
A hashmap is built over an array (so it is certainly more costly than an array), and collision resolution requires extra space (at least double the number of elements). In addition, any access requires computing the hash and possibly resolving collisions.
This adds a lot of overhead in terms of space and time, compared to a straight array.
Also note that it is kind of folklore that a hash table has O(1) behavior. The worst case is much poorer: accesses can take up to O(N) time for a table of size N.
As a final remark, the time complexity of this algorithm is O(1), because you conclude false at worst when N > 128.
Your algorithm is also O(1). You can think about complexity as how the algorithm reacts to a change in the number of elements processed. Therefore O(n) and O(2n) are effectively equal.
People here are talking about O notation as a growth rate.
Your solution could indeed be slower than the book's solution. Firstly, a hash lookup ideally has constant-time lookup. But the retrieval of the object will not be constant-time if there are multiple hash collisions. Secondly, even if it is a constant-time lookup, there is usually significant overhead involved in executing the hash code function, compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start to deal with non-ASCII Unicode characters, then you might not want to go with the array approach due to the significant amount of space overhead.
The bottleneck of your implementation is that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity of O(1).
This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (otherwise the reference implementation would be wrong and produce an out-of-bounds error) and can be treated as a constant. This makes the set lookup O(1) as well, just with somewhat bigger constants than the array lookup.
* assuming a sane implementation such as a tree or hashmap. The hashmap time complexity is in general not constant, as filling it up needs log(n) resize operations to avoid an increase in collisions, which would lead to linear lookup time; see e.g. here and here for answers on Stack Overflow.
This article even explains that Java 8 by itself converts a hashmap bucket to a binary tree (O(n log n) for the conversion, O(log n) for lookups) before its lookup time degenerates to O(n) because of too many collisions.

Sorting an array of partially sorted primitive integers in Java

The array in question may hold any integer greater than or equal to zero, and the numbers are unique. The numbers have to be in ascending order.
The array's size will usually be less than 100.
Most of the array is already sorted. By most I mean on average at least 90% of it.
I've found this implementation of TimSort but it is not for primitive values. Autoboxing would cause a lot of overhead.
Performance is most crucial as the sorting-algorithm will be called many times.
Use Arrays.sort:
int[] array = /* something */;
Arrays.sort(array);
Being only one line, this is (obviously) both extremely simple to use and very readable. That should be your #1 priority when writing code. It's also going to be pretty darn fast, because the writers of the standard library have put a lot of effort into performance, particularly relating to sorting algorithms.
The only situation in which you should not use Arrays.sort is if you've almost entirely finished your system, profiled it carefully, and determined that the part of the code that sorts your array is the bottleneck. Even then, you still might not be able to write your own sorting algorithm that performs noticeably better.
Depends what you mean by "almost sorted". Insertion Sort is a very efficient algorithm if the array is nearly sorted (linear complexity when it is sorted), but the performance can vary depending on whether the outliers are close or far from their final sorted position. For example, [1,2,3,4,6,5,7,8,9] will be slightly faster to sort than [1,3,4,5,6,7,8,9,2].
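For reference, a textbook insertion sort for primitive ints is only a few lines (a sketch to experiment with, not a claim that it will beat Arrays.sort without profiling):

// Insertion sort: O(n^2) worst case, but close to O(n) when the
// array is already nearly sorted, as described above.
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        // Shift larger elements one slot to the right.
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}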

fastest way to map a large number of longs

I'm writing a java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance as it is needed very often.
public static Object computeResult(long input) {
    Object result;
    // ... calculate
    return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm at a speed of 4,000,000 transformations per second (using 4 threads).
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I found some difficulties creating such a cache:
public class Cache {
    private static long[] sortedInputs; // 150M length
    private static Object[] results;    // 150M length

    public static Object lookupCachedResult(long input) {
        int index = Arrays.binarySearch(sortedInputs, input);
        return results[index];
    }
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, and it is sorted numerically. The second array holds a reference to one of the 3,000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get the cached result, I do a binary search for the input number on the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results. Not even half as fast: only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say, an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e., it must box each long into a Long on every insert, which can slow it down. Maybe this performance issue is not significant due to the JIT, but I would recommend at least trying the following and measuring performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like
index = key % n. Since you know your 3,000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. That way you circumvent rehashing etc. and have true O(1) performance.
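A sketch of that hand-rolled table (the names are mine; it assumes non-negative keys and an n already verified to be collision-free for your key set):

public class HandRolledCache {
    private final long[] keys;     // keys[i] holds the key hashed to slot i
    private final Object[] values; // values[i] holds its precomputed result
    private final int n;

    HandRolledCache(int n) { // n chosen so that key % n never collides
        this.n = n;
        this.keys = new long[n];
        this.values = new Object[n];
    }

    void put(long key, Object value) {
        int index = (int) (key % n); // trivial hash; use Math.floorMod for negative keys
        keys[index] = key;
        values[index] = value;
    }

    Object get(long key) {
        int index = (int) (key % n);
        return keys[index] == key ? values[index] : null; // null: not cached
    }
}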
Secondly, I would recommend looking at Java numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native Lapack and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector-algebra such that it computes the whole long-array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
With so few values, you should ensure that they get reused (unless they're pretty small objects). For this, an Interner is perfect (though you can run your own).
I tried HashMap and TreeMap; both attempts ended in an OutOfMemoryError.
There's a huge memory overhead for both of them. And there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-object map available; google for "primitive collections". These should use slightly more memory than your two arrays. With hashing being usually O(1) (let's ignore the worst case, as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of 20. Your binary search needs log2(150e6), i.e., about 27 steps, while hashing may need on average maybe two. This depends on how tightly you pack the hash table; this is usually a parameter given when it gets created.
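As an illustration, one such library is Eclipse Collections, whose primitive long-keyed map avoids boxing entirely (naming this particular library and the class below is my addition; the answer only says to google for one):

import org.eclipse.collections.impl.map.mutable.primitive.LongObjectHashMap;

public class PrimitiveCache {
    // Keys stay primitive longs: no Long boxing, one hash step per lookup.
    private final LongObjectHashMap<Object> map = new LongObjectHashMap<>();

    public void put(long key, Object result) {
        map.put(key, result);
    }

    public Object lookup(long key) {
        return map.get(key); // null if the key was never cached
    }
}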
In case you run your own (which you most probably shouldn't), I'd suggest using an array of size 1 << 28, i.e., 268435456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.

Dictionary data structure + fast complexity methods

I'm trying to build from scratch, a data structure that would be able to hold a vast dictionary (of words/characters).
The "words" can be made out of arbitrarily large number of characters.
The dictionary would need standard methods such as search, insert, delete.
I need the methods to have time complexity better than O(log(n)), so between O(log(n)) and O(1), e.g. O(log(log(n)))
where n = dictionary size (number of elements)
I've looked into various tree structures, for example the B-tree, which has O(log(n)) methods (not fast enough), as well as the trie, which seemed most appropriate for the dictionary; but due to the fact that the words can be arbitrarily large, it seemed like its complexity would not be faster than O(log(n)).
If you could please provide an explanation, I would appreciate it.
A trie has significant memory requirements, but the access time is usually faster than O(log n).
If I recall correctly, the access time depends on the length of the word, not on the count of the words in the structure.
The efficiency and memory consumption also depend on exactly what implementation of the trie you chose to use. There are some pretty efficient implementations out there.
For more information on Tries see:
http://en.wikipedia.org/wiki/Trie
http://algs4.cs.princeton.edu/52trie/
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
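To make the word-length point concrete, here is a minimal trie sketch (assuming a lowercase a-z alphabet; real implementations use more compact node representations, as noted above):

// Search and insert cost O(length of the word), independent of
// how many words the structure already holds.
class Trie {
    private static final int R = 26;  // alphabet size: 'a'..'z'
    private Trie[] next = new Trie[R];
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.next[i] == null) node.next[i] = new Trie();
            node = node.next[i];
        }
        node.isWord = true;
    }

    boolean search(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.next[i] == null) return false;
            node = node.next[i];
        }
        return node.isWord;
    }
}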
If your question is how to achieve as few string comparisons as possible, then a hash table is probably a very good answer, as it requires close to O(1) string comparisons. Note that hashing the key value takes time proportional to the string length, as can the time for a string comparison.
But this is nothing new. Can we do better for long strings? To be more precise, we will assume the string length to be bounded by M. We will also assume that the length of every string is known (for long strings, this can make a difference).
First notice that the search time is bounded below by the string length, and is Ω(M) in the worst case: comparing two strings can require comparing all characters, as the strings can differ only in the last characters compared. On the other hand, in the best case, the comparison can conclude immediately, either because the lengths are different or because the strings differ in the first characters compared.
Now you can reason as follows: consider the whole set of strings in the dictionary and find the position of the first character on which they differ. Based on the value of this character, you will decompose the set into a number of subsets. And you can continue this decomposition recursively until you get singletons.
For example,
able
about
above
accept
accident
accompany
is organized as
*bl*
*bou*
*bov*
*c*e**
*c*i****
*c*o*****
where an asterisk stands for a character which is just ignored, and the remaining characters are used for discrimination.
As you can see, in this particular example two or three character comparisons are enough to recognize any word in the dictionary.
This representation can be described as a finite state automaton such that in every state you know which character to check next and what are the possible outcomes, leading to the next states. It has a K-ary tree structure (where K is the size of the alphabet).
For an efficient implementation, every state can be represented by the position of the decision character and an array of links to the next states. Actually, this is a trie structure with path compression. (As said by @peter.petrov, there are many variants of the trie structure.)
How do we use it? There are two situations:
1) the search string is known to be in the dictionary: then a simple traversal of the tree is guaranteed to find it. It will do so after a number of character comparisons equal to the depth D of the corresponding leaf in the tree, i.e., O(D). This can be a very significant saving.
2) the search string may not be in the dictionary: during traversal of the tree you can observe an early rejection; otherwise, in the end you find a single potential match. Then you can't avoid performing an exhaustive comparison, O(1) in the best case, O(M) in the worst. (On average O(M) for random strings, but probably better for real-world distributions.) But you will compare against a single string, never more.
In addition to that device, if your distribution of key lengths is sparse, it may be useful to maintain a hash table of the key lengths, so that immediate rejection of the search string can occur.
As final remarks, notice that this solution's cost is not directly a function of N, and that it is likely that time sublinear in M could be achieved by suitable heuristics taking advantage of the particular distribution of the strings.

Best way to compare - Using Sorting or adding to Set

I have two Strings like
String one = "one, two, three, four";
String two ="two,nine,ten";
Now, if any of the numbers two / nine / ten is present in the first string, I need to return true.
And I split both strings, so splitOne[] and splitTwo[] are present now.
Now, one way would be to compare each element against every other, rather like a bubble sort.
This will give me a complexity of O(n^2).
Will adding the elements to a HashSet get me better complexity?
For adding to a set, I need to iterate through both lists and add each element.
Which of these will require less time? Is there any significant difference?
It really depends on your use-case.
There is no point in trying to sort it yourself. There are much more efficient methods available, one of which is, obviously, using a HashSet.
If you really are working with up to around 30 words, then HashSet is certainly your way to go. However, as the number of strings gets bigger, you are going to start running into space problems. For a start, String.split will eat huge amounts of memory when you get into the thousands of strings, let alone the HashSet.
If you wish to avoid using database then there are solutions such as a Bloom Filter.
At the extreme end you would probably want to use a database of some sort.
You can use a HashMap and maintain the number of occurrences as the value.
Or, instead of splitting both strings, split one string and check whether each of its tokens exists in the other one:
private boolean testArray() {
    String one = "one, two, three,four,nine,ten";
    String two = "two,nine,ten,11";
    String strTwo[] = two.split(",");
    for (String string : strTwo) {
        // contains() does a substring check; this method returns true
        // only if every token of 'two' occurs somewhere in 'one'.
        if (!one.contains(string)) return false;
    }
    return true;
}
Let's say the number of elements in the first set is N and the number of elements in the second set is M.
Using a HashSet will require O(N+M), as O(N) is used for adding while O(M) is used for checking. (Assuming comparisons are O(1).)
The 'bubble sort' way will take O(NM).
Theoretically, O(N+M) with a HashSet is faster in complexity. However, the constant factor of a HashSet is higher, and hence you might not see any improvement for lower values of N and M. A sketch of the set approach follows below.
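A minimal sketch of that set approach (the method name and the trim() calls are my assumptions, since the question's strings mix ", " and "," as separators):

import java.util.HashSet;
import java.util.Set;

// O(N+M): add the N tokens of the first string to a set,
// then check the M tokens of the second string against it.
static boolean anyCommonToken(String one, String two) {
    Set<String> tokens = new HashSet<>();
    for (String t : one.split(",")) {
        tokens.add(t.trim());
    }
    for (String t : two.split(",")) {
        if (tokens.contains(t.trim())) {
            return true; // e.g. "two" appears in both example strings
        }
    }
    return false;
}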
Alternatively, since you are dealing with strings, comparisons between strings aren't O(1). You can create a trie using the first set, taking O(A) time, where A is the number of characters in the first set. And then O(B) in total to traverse the trie to check, where B is the number of characters in the second set. This might give you better performance than a HashSet, as it is independent of any hashing function (and hence of collision checking).
