What is the time complexity of this code execution? - java

I have to print out the number of occurrences of characters inside a string. I have used something like:
import java.util.HashSet;
import org.apache.commons.lang3.StringUtils; // Apache Commons Lang

String str = "This is sample string";
HashSet<Character> hc = new HashSet<Character>();
for (int i = 0; i < str.length(); i++) {
    if (!Character.isSpaceChar(str.charAt(i)) && hc.add(str.charAt(i))) {
        int countMatches = StringUtils.countMatches(str, str.charAt(i));
        System.out.println(str.charAt(i) + " occurs " + countMatches + " times");
    }
}
It is a kind of solution, but how do I analyze the time complexity? I am a beginner, so please guide me through the learning process.

First of all, if you are looking for a decent introduction to complexity analysis, the following one looks pretty good:
A Gentle Introduction to Algorithm Complexity Analysis by Dionysis Zindros.
I recommend that you read it all, carefully, and take the time to do the exercises embedded in the page.
The complexity of your code is not trivial.
On the face of it, the loop will execute N times, where N is the length of the input string. But then if we look at what the loop does, it can do one of three things:
if the character is a space, nothing else is done
if the character is not a space, it is added (or re-added) to the hash set
if the character was actually added (i.e. not seen before), countMatches is called.
The complexity of doing nothing is O(1).
The complexity of adding an entry to the set is O(1).
The complexity of calling countMatches is O(N), because it is looking at every character of the string.
Now, if we think about what the code is doing, we can easily identify the best and worst cases.
The best case occurs when all N characters of the string are spaces. This gives O(N) repetitions of an O(1) loop body, giving a best-case complexity of O(N).
The worst case occurs when all N characters are different. This gives O(N) repetitions of an O(N) loop body, giving a worst-case complexity of O(N^2). (You would think ... but read on!)
What about the average case? That is difficult if we don't know more about the nature of the input strings.
If the characters are randomly chosen, the probability of repeated characters is small, and the probability of space characters is small.
If the characters are alphabetic text, then spaces are more frequent, and so are repetitions. Indeed, for English text the characters are likely to be limited to upper and lowercase Latin letters (52) plus a handful of punctuation characters. So you might expect about 60 set entries for a long string, and performance that converges rapidly to O(N).
Finally, even the worst-case is not really O(N^2). A String is a sequence of char values, and Java char values are restricted to the range 0 to 65535. So after 2^16 distinct characters, all characters must repeat, and thus even the worst-case goes to O(N) as N goes to infinity.
(I did mention that this was non-trivial? 😀 )

What you need to do here is reason about how many steps have to be taken in relation to the length of the String.
For every character in the String it has to call countMatches once. Every call of countMatches has to loop over every character of the String again to count the occurrences.
The other operations (determining the length of the String, adding to the HashSet, retrieving a character from a String by index, checking for whitespace, printing the answers) are assumed to be constant-time and do not matter.
The fact that some of the characters will be skipped (because they are whitespace or already in the HashSet) does not reduce the complexity for an unrestricted String. You can assume the worst case of all characters being different.
So that is O(n^2), where n is the length of the String.
You can improve it to O(n) by changing your HashSet to a HashMap of counters. Then you only need a single pass over the String instead of two nested passes.
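A minimal sketch of that single-pass version, assuming the same sample string (Map.merge is available since Java 8):

import java.util.HashMap;
import java.util.Map;

String str = "This is sample string";
Map<Character, Integer> counts = new HashMap<>();
// Single pass: O(1) expected work per character, O(n) overall.
for (int i = 0; i < str.length(); i++) {
    char c = str.charAt(i);
    if (!Character.isSpaceChar(c)) {
        counts.merge(c, 1, Integer::sum); // increment the counter for c
    }
}
for (Map.Entry<Character, Integer> e : counts.entrySet()) {
    System.out.println(e.getKey() + " occurs " + e.getValue() + " times");
}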

Related

Comparison of these two algorithms?

So I'm presented with a problem that states: "Determine if a string contains all unique characters".
So I wrote up this solution that adds each character to a set, but if the character already exists it returns false.
private static boolean allUniqueCharacters(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);
        } else {
            return false;
        }
    }
    return true;
}
According to the book I am reading this is the "optimal solution"
public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;
    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}
My question is, is my implementation slower than the one presented? I assume it is, but if a hash lookup is O(1), wouldn't they have the same complexity?
Thank you.
As Amadan said in the comments, the two solutions have the same time complexity, O(n), because you have a for loop looping through the string and you only do constant-time operations inside it. This means that the time it takes to run your methods increases linearly with the length of the string.
Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.
For the same string, the "optimal" solution should be faster because sets have some overhead over arrays: handling arrays is faster than handling sets. However, to actually make the "optimal" solution work for arbitrary input, you would need an array of length 2^16, since that is how many different char values there are. You would also need to remove the check for a string longer than 128.
This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.
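For illustration, a hedged sketch of that full-range variant (the method name is mine; Character.MAX_VALUE + 1 is the 2^16 array length mentioned above):

public static boolean isUniqueCharsFullRange(String str) {
    // One flag per possible char value (0..65535): still O(1) per lookup,
    // but a fixed 64 KiB of flags regardless of the input size.
    boolean[] seen = new boolean[Character.MAX_VALUE + 1]; // 2^16 entries
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (seen[c]) {
            return false;
        }
        seen[c] = true;
    }
    return true;
}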
Both algorithms have time complexity of O(N). The difference is in their space complexity.
The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).
The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.
The hash map is in theory acceptable, but it is a waste.
A hash map is built on top of an array (so it is certainly more costly than a plain array), and collision resolution requires extra space (at least double the number of elements). In addition, any access requires computing the hash and possibly resolving collisions.
This adds a lot of overhead in terms of space and time, compared to a straight array.
Also note that it is kind of folklore that a hash table has O(1) behavior. The worst case is much poorer: accesses can take up to O(N) time for a table of size N.
As a final remark, the time complexity of this algorithm is O(1), because you conclude false at worst when N > 128.
Your algorithm is also O(1). You can think of complexity as how an algorithm reacts to a change in the number of elements processed. Therefore O(n) and O(2n) are effectively equal.
People here are talking about O notation as a growth rate.
Your solution could indeed be slower than the book's solution. Firstly, a hash lookup ideally has constant-time lookup, but the retrieval of the object will not be constant if there are multiple hash collisions. Secondly, even with a constant-time lookup, there is usually significant overhead in executing the hash code function compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start to deal with non-ASCII Unicode characters, you might not want the array approach due to its significant space overhead.
The bottleneck of your implementation is that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity of O(1).
This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (otherwise the reference implementation would be wrong and produce an out-of-bounds error) and can therefore be treated as a constant. This makes the set lookup O(1) as well, just with somewhat bigger constants than the array lookup.
* assuming a sane implementation such as a tree or hash map. Hash map time complexity is in general not constant, as filling it up requires log(n) resize operations to avoid an increase in collisions that would lead to linear lookup time; see e.g. here and here for answers on Stack Overflow.
This article even explains that Java 8 converts a hash map's overfull buckets to binary trees (O(n log n) for the conversion, O(log n) for the lookup) before their lookup time degenerates to O(n) because of too many collisions.

Algorithm, Big O notation: Is this function O(n^2) or O(n)?

This is code from an algorithms book, "Data Structures and Algorithms in Java, 6th Edition" by Michael T. Goodrich, Roberto Tamassia, and Michael H. Goldwasser:
public static String repeat1(char c, int n)
{
    String answer = "";
    for (int j = 0; j < n; j++)
    {
        answer += c;
    }
    return answer;
}
According to the authors, the Big O notation of this algorithm is O(n^2) with reason:
"The command, answer += c, is shorthand for answer = (answer + c). This
command does not cause a new character to be added to the existing String
instance; instead it produces a new String with the desired sequence of
characters, and then it reassigns the variable, answer, to refer to that new
string. In terms of efficiency, the problem with this interpretation is that
the creation of a new string as a result of a concatenation, requires time
that is proportional to the length of the resulting string. The first time
through this loop, the result has length 1, the second time through the loop
the result has length 2, and so on, until we reach the final string of length
n."
However, I do not understand how this code can be O(n^2) when the number of primitive operations per iteration is just two, regardless of the value of n (excluding j < n and j++).
The statement answer += c requires two primitive operations each iteration regardless of the value of n, therefore I think the operation count for this function is supposed to be 4n + 3.
Or is the sentence, "In terms of efficiency, the problem with this interpretation is that the creation of a new string as a result of a concatenation requires time that is proportional to the length of the resulting string", simply saying that creating a new string as a result of a concatenation requires time proportional to its length, regardless of the number of primitive operations in the function? In that case the number of primitive operations would not have a big effect on the running time, because the built-in behavior of the String concatenation assignment makes the function run in O(n^2).
How can this function be O(n^2)?
Thank you for your support.
During every iteration of the loop, the statement answer += c; must copy each and every character already in the string answer to a new string.
E.g. n = 5, c = '5'
First loop: answer is an empty string, but it must still create a new string. There is one operation to append the first '5', and answer is now "5".
Second loop: answer will now point to a new string, with the first '5' copied into it and another '5' appended, making "55". Not only is a new String created: one character '5' is copied from the previous string and another '5' is appended. Two characters are written in total.
"n"th loop: answer will now point to a new string, with n - 1 '5' characters copied to a new string, and an additional '5' character appended, to make a string with n 5s in it.
The number of characters copied is 1 + 2 + ... + n = n(n + 1)/2. This is O(n^2).
The efficient way to construct strings like this in a loop in Java is to use a StringBuilder: one mutable object that doesn't need to copy all the existing characters each time a character is appended. Using a StringBuilder brings the cost down to O(n).
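For reference, a minimal sketch of that StringBuilder version (the method name repeat2 is my own, not from the book):

public static String repeat2(char c, int n)
{
    // One mutable buffer, sized up front; each append is amortized O(1),
    // so the whole loop costs O(n) rather than O(n^2).
    StringBuilder answer = new StringBuilder(n);
    for (int j = 0; j < n; j++)
    {
        answer.append(c);
    }
    return answer.toString();
}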
Strings are immutable in Java. I believe this terrible code is O(n^2) for that reason and only that reason: it has to construct a new String on each iteration. I'm unsure whether String concatenation is truly linearly proportional to the number of characters (it seems like it should be a constant-time operation, since Strings have a known length). However, if you take the author's word for it, then iterating n times with each iteration taking time proportional to n gives you n^2. StringBuilder would give you O(n).
I mostly agree with it being O(n^2) in practice, but consider:
Java is SMART. In many cases it uses StringBuilder instead of string for concatenation under the covers. You can't just assume it's going to copy the underlying array every time (although it almost certainly will in this case).
Java gets SMARTER all the time. There is no reason it couldn't optimize that entire loop based on StringBuilder since it can analyze all your code and figure out that you don't use it as a string inside that loop.
Further optimizations can happen. Strings currently use an array AND a length AND a shared flag (and maybe a start location so that splits wouldn't require copying; I forget, but they changed that split implementation anyway), so appending into an oversized array and then returning a new string that references the same underlying array with a higher end index, without mutating the original string, is altogether possible (by design, they already do things like this to a degree)...
So I think the real question is, is it a great idea to calculate O() based on a particular implementation of a language-level construct?
And although I can't say for sure what the answer to that is, I can say it would be a REALLY BAD idea to optimize on the assumption that it is O(n^2) unless you absolutely need to: you could take away Java's ability to speed up your code later by hand-optimizing today.
P.S. This is from experience. I had to optimize some Java code that was the UI for a spectrum analyzer. I saw all sorts of String + operations and figured I'd clean them all up with .append(). It saved NO time, because Java already optimizes String + operations that are not in a loop.
The complexity becomes O(n^2) because on each iteration the string grows in length by one, and creating the new string each time costs time proportional to its length. The outer loop is also O(n) in complexity. So the exact count is (n * (n + 1)) / 2 steps, which is O(n^2).
For example,
For abcdefg
a        // a one-character string object is created, so the cost is 1
ab       // similarly, the cost is 2
abc      // cost 3 here
abcd     // 4 now
abcde    // and so on
abcdef
abcdefg
Now you see the total cost is 1 + 2 + 3 + 4 + ... + n = (n * (n + 1)) / 2. In big O notation it's O(n^2).
Consider the length of the string to be n. Every time we add a character at the end, the whole string has to be rebuilt, so that step is O(n), and we also have the outer for loop, which is another n iterations. So as a result we get O(n^2).
That is because:
answer += c;
is a String concatenation. In Java, Strings are immutable.
That means the concatenated string is created by making a copy of the original string and appending c to it. So a single concatenation operation is O(n) for an n-sized String.
In the first iteration, answer's length is 0; in the second it is 1; in the third it is 2, and so on.
So you're doing these operations every time, i.e.
1 + 2 + 3 + ... + n = O(n^2)
For string manipulation, StringBuilder is the preferred way: it appends a character in amortized O(1) time.

Is time complexity of an algorithm calculated only based on the number of times a loop executes?

I have a big doubt about calculating time complexity. Is it calculated based on the number of times a loop executes? My question stems from the situation below.
I have a class A, which has a String attribute.
class A {
    String name;
}
Now, I have a list of class A instances. This list has different names in it. I need to check whether the name "Pavan" exists in any of the objects in the list.
Scenario 1:
Here the for loop executes listA.size() times, which can be said to be O(n):
public boolean checkName(List<A> listA, String inputName) {
    for (A a : listA) {
        if (a.name.equals(inputName)) {
            return true;
        }
    }
    return false;
}
Scenario 2:
Here the for loop executes about listA.size()/2 + 1 times:
public boolean checkName(List<A> listA, String inputName) {
    int length = listA.size() / 2;
    length = length % 2 == 0 ? length : length + 1;
    for (int i = 0; i < length; i++) {
        if (listA.get(i).name.equals(inputName)
                || listA.get(listA.size() - i - 1).name.equals(inputName)) {
            return true;
        }
    }
    return false;
}
I minimized the number of times for loop executes, but I increased the complexity of the logic.
Can we say this is O(n/2)? If so, can you please explain it to me?
First note that in Big-O notation there is no such thing as O(n/2), since 1/2 is a constant factor, which is ignored in this notation. The complexity remains O(n), so by modifying your code you haven't changed anything regarding complexity.
In general estimating the number of times a loop is executed with respect to input size and the operation that actually is associated with a cost in time is the way to get to the complexity class of the algorithm.
The operation that actually produces cost in your method is String.equals, which, looking at its implementation, produces cost by comparing characters.
In your example the input size is not strictly equal to the size of the list. It also depends on how large the strings contained in that list are and how large the inputName is.
So let's say the largest string in the list is m1 characters long and the inputName is m2 characters long. Then for your original checkName method the complexity is O(n*min(m1,m2)), because String.equals compares at most all the characters of the shorter string.
For most applications the term min(m1,m2) doesn't matter as either one of the compared strings is stored in a fixed size database column for example and therefore this expression is a constant, which is, as said above, ignored.
No. In big O expressions, all constant factors are ignored.
We only care about n, as in O(n^2) or O(log n).
Time and space complexity are calculated based on the number of operations executed and the number of units of memory used, respectively.
Regarding time complexity: all the operations are taken into account and counted. Because it's hard to compare, say, O(2*n^2+5*n+3) with O(3*n^2-3*n+1), equivalence classes are used. That means that for very large values of n, the two previous examples have roughly similar values (more exactly: they have a similar rate of growth). Therefore you reduce the expression to its most basic form, saying that both examples are in the equivalence class O(n^2). Similarly, O(n) and O(n/2) are in the same class, and therefore both are O(n).
Because of this, you can ignore most constant-time operations (such as .size() or .length() on collections, assignments, etc.) as they don't really count in the end. You're therefore left with loop operations and sometimes complex computations (that somewhere lower on the stack use loops themselves).
To get a better understanding of complexity classes, try reading articles on the subject, such as: http://discrete.gr/complexity/
Time complexity is a measure of the theoretical time it will take for an operation to execute.
While normally any improvement in the time required is significant, in time complexity analysis we are only interested in the order of magnitude. That means:
If an operation for N objects requires N time intervals, it has complexity O(N).
If an operation for N objects requires N/2, its complexity is still O(N), though.
The apparent paradox is explained as follows: if you calculate the operation for large N, then the /2 part makes no big difference compared to the N part itself. If the complexity were O(N^2), then O(N) would be negligible for large N. That's why we are only interested in the order of magnitude.
In other words any constant is thrown away when calculating complexity.
As for the question if
Is it calculated based on the number of times a loop executes?
Well, it depends on what the loop contains. But if only basic operations are executed inside the loop, then yes. To give a counterexample: if you have a loop inside which an eigenanalysis is executed on each run, and the eigenanalysis has complexity O(N^3), you cannot say that your overall complexity is simply O(N).
The complexity of an algorithm is measured based on its response to the input size, in terms of processing time or space requirement. I think you are missing the fact that the notations used to express complexity are asymptotic notations.
As per your question, you have reduced the loop execution count, but not the linear relation with the input size.

Sorting string so that there aren't two same characters on adjacent places [duplicate]

It's a bonus school task for which we didn't receive any teaching yet and I'm not looking for a complete code, but some tips to get going would be pretty cool. Going to post what I've done so far in Java when I get home, but here's something I've done already.
So, we have to write a sorting algorithm which, for example, sorts "AAABBB" to ABABAB. The max input size is 10^6, and it all has to happen in under 1 second. If there is more than one answer, the first one in alphabetical order is the right one. I started out testing different algorithms that sort without the alphabetical-order requirement in mind, just to see how things work out.
First version:
Save the ASCII codes in an integer array where the index is the ASCII code and the value is the number of times that character occurs in the char array.
Then I picked the 2 most frequent characters and wrote them to the new character array one after the other, until some other count became higher and I swapped to that character. It worked well, but of course the order wasn't right.
Second version:
Followed the same idea, but stopped picking the most frequent character and just picked the indexes in the order they appeared in my array. Works well until the input is something like CBAYYY. The algorithm sorts it to ABCYYY instead of AYBYCY. Of course I could try to find some free spots for those Y's, but at that point it starts to take too long.
An interesting problem, with an interesting tweak. Yes, this is a permutation or rearrangement rather than a sort. No, the quoted question is not a duplicate.
Algorithm.
Count the character frequencies.
Output alternating characters from the two lowest in alphabetical order.
As each is exhausted, move to the next.
At some point the highest frequency char will be exactly half the remaining chars. At that point switch to outputting all of that char alternating in turn with the other remaining chars in alphabetical order.
Some care is required to avoid off-by-one errors (odd vs even number of input characters). Otherwise, just writing the code and getting it to work right is the challenge.
Note that there is one special case, where the number of characters is odd and the frequency of one character starts at (half plus 1). In this case you need to start with step 4 in the algorithm, outputting all one character alternating with each of the others in turn.
Note also that if one character comprises more than half the input then, apart from this special case, no solution is possible. This situation may be detected in advance by inspecting the frequencies, or during execution when the tail consists of all one character. Detecting this case was not part of the spec.
Since no sort is required the complexity is O(n). Each character is examined twice: once when it is counted and once when it is added to the output. Everything else is amortised.
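As a hedged sketch of that advance detection (the method name is mine): a valid arrangement exists exactly when the most frequent character needs no more than every other position, i.e. 2 * maxCount <= n + 1.

static boolean rearrangementPossible(String s) {
    // Count character frequencies; 256 slots cover the A-Z style
    // inputs in the question (an assumption, not from the original).
    int[] freq = new int[256];
    int max = 0;
    for (char c : s.toCharArray()) {
        max = Math.max(max, ++freq[c]);
    }
    // The most frequent character can occupy at most every other position.
    return 2 * max <= s.length() + 1;
}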
My idea is the following; with the right implementation it can be almost linear.
First establish a function to check whether a solution is even possible. It should be very fast: something like checking whether the most frequent letter makes up more than 1/2 of all letters, taking into consideration whether it can go first.
Then, while there are still letters remaining, take the alphabetically first letter that is not the same as the previous one and still makes a solution possible.
The correct algorithm would be the following:
Build a histogram of the characters in the input string.
Put the CharacterOccurrences in a PriorityQueue / TreeSet where they're ordered by highest occurrence first, then lowest alphabetical order
Have an auxiliary variable of type CharacterOccurrence
Loop while the PQ is not empty
Take the head of the PQ and keep it
Add the character of the head to the output
If the auxiliary variable is set => Re-add it to the PQ
Store the kept head in the auxiliary variable with 1 occurrence less unless the occurrence ends up being 0 (then unset it)
if the size of the output == size of the input, it was possible and you have your answer. Else it was impossible.
Complexity is O(N * log(N))
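A minimal Java sketch of that priority-queue algorithm (class and method names are mine; Map.entry requires Java 9+). As described, it guarantees a valid arrangement when one exists, though the highest-count-first tie-breaking alone does not necessarily yield the alphabetically first answer:

import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class Rearranger {
    // Returns a rearrangement of s with no two equal adjacent characters,
    // or null if no such arrangement exists.
    public static String rearrange(String s) {
        // 1. Build a histogram of the characters in the input string.
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // 2. Order entries by highest occurrence first, lowest alphabetical on ties.
        PriorityQueue<Map.Entry<Character, Integer>> pq = new PriorityQueue<>(
                (a, b) -> a.getValue().equals(b.getValue())
                        ? Character.compare(a.getKey(), b.getKey())
                        : Integer.compare(b.getValue(), a.getValue()));
        pq.addAll(counts.entrySet());

        StringBuilder out = new StringBuilder();
        Map.Entry<Character, Integer> held = null; // the auxiliary variable
        while (!pq.isEmpty()) {
            Map.Entry<Character, Integer> head = pq.poll();
            out.append(head.getKey());      // add the head's character to the output
            if (held != null) {
                pq.add(held);               // re-add the previously held entry
            }
            held = head.getValue() > 1      // keep the head with one occurrence less,
                    ? Map.entry(head.getKey(), head.getValue() - 1)
                    : null;                 // unset it when the count reaches 0
        }
        // If the output is shorter than the input, it was impossible.
        return out.length() == s.length() ? out.toString() : null;
    }
}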
Make a bidirectional table of character frequencies: character->count and count->character. Record an Optional<Character> which stores the last character output (or none if there is none). Store the total number of characters.
If (total number of characters - 1) < 2 * (highest character count), use the character with the highest count (otherwise no solution would be possible). Fail if it is the same as the last character output.
Otherwise, use the alphabetically earliest character that isn't the last character output.
Record the last character output, and decrease both the total and the used character's count.
Loop while we still have characters.
While this question is not quite a duplicate, the part of my answer giving the algorithm for enumerating all permutations with as few adjacent equal letters as possible can readily be adapted to return only the minimum, as its proof of optimality requires that every recursive call yield at least one permutation. The changes outside of the test code amount to trying keys in sorted order and breaking after the first hit is found. The running time of the code below is polynomial (O(n) if I bothered with better data structures), since unlike its ancestor it does not enumerate all possibilities.
david.pfx's answer hints at the logic: greedily take the least letter that doesn't eliminate all possibilities, but, as he notes, the details are subtle.
from collections import Counter
from itertools import permutations
from operator import itemgetter
from random import randrange


def get_mode(count):
    return max(count.items(), key=itemgetter(1))[0]


def enum2(prefix, x, count, total, mode):
    prefix.append(x)
    count_x = count[x]
    if count_x == 1:
        del count[x]
    else:
        count[x] = count_x - 1
    yield from enum1(prefix, count, total - 1, mode)
    count[x] = count_x
    del prefix[-1]


def enum1(prefix, count, total, mode):
    if total == 0:
        yield tuple(prefix)
        return
    if count[mode] * 2 - 1 >= total and [mode] != prefix[-1:]:
        yield from enum2(prefix, mode, count, total, mode)
    else:
        defect_okay = not prefix or count[prefix[-1]] * 2 > total
        mode = get_mode(count)
        for x in sorted(count.keys()):
            if defect_okay or [x] != prefix[-1:]:
                yield from enum2(prefix, x, count, total, mode)
                break


def enum(seq):
    count = Counter(seq)
    if count:
        yield from enum1([], count, sum(count.values()), get_mode(count))
    else:
        yield ()


def defects(lst):
    return sum(lst[i - 1] == lst[i] for i in range(1, len(lst)))


def test(lst):
    perms = set(permutations(lst))
    opt = min(map(defects, perms))
    slow = min(perm for perm in perms if defects(perm) == opt)
    fast = list(enum(lst))
    assert len(fast) == 1
    fast = min(fast)
    print(lst, fast, slow)
    assert slow == fast


for r in range(10000):
    test([randrange(3) for i in range(randrange(6))])
You start by counting the number of each letter you have in your array:
For example, you have 3 A's, 2 B's, 1 C, 4 Y's, 1 Z.
1) Then each time you put the lowest (alphabetically first) letter you can.
so you start by :
A
Then you cannot put A any more, so you put B:
AB
then:
ABABACYZ
This works as long as you still have at least 2 kinds of characters left. But here you will still have 3 Y's.
2) To place the last characters, you go from your first Y and insert one at every other position, working toward the beginning (I don't know if this is the right way to say it in English).
So: ABAYBYAYCYZ.
3) Then you take the subsequence between your Y's, so YBYAYCY, and you sort the letters between the Y's:
BAC => ABC
And you arrive at
ABAYAYBYCYZ
which should be the solution to your problem.
To do all this, I think a LinkedList is the best structure.
I hope it helps :)

Selecting top 10 most frequently occurring strings from an array, java

I have an array of strings from which I want to find the top 10 most frequently occurring strings.
One primitive way of doing this is of course to loop through the array once, build a stack/queue of all the distinct strings, store those distinct strings in an array, then check the number of times each string in this new array occurs in the original array, and finally store the counts in n distinct integers, where n is the number of distinct strings.
Obviously this is a horrible method when it comes to time efficiency, so I was wondering if there is a better way of doing this.
If you don't care about memory, you can build a hash map holding the count of each string: you loop through all your strings and for each one you do
myhash[mystring] += 1
if the string is already present in the hash, or
myhash[mystring] = 1
otherwise.
If you consider that looking up a value in a hash map is done in constant time (which may not be true), then this algorithm is "only" O(n) (but it takes up a lot of memory).
If you care about memory, you can sort the array and then count how many times each string appears easily (each string will appear firstly at position i, i+1, i+2, ..., i+k and nowhere else).
Sorting will take O(n log n), then O(n) for counting the occurrences of strings.
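A hedged Java sketch combining the hash-map count with a top-10 selection by sorting the distinct entries (names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

static List<String> top10(String[] words) {
    // Count each distinct string: O(n) expected time.
    Map<String, Integer> counts = new HashMap<>();
    for (String w : words) {
        counts.merge(w, 1, Integer::sum);
    }
    // Sort the d distinct entries by descending count: O(d log d).
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

    List<String> top = new ArrayList<>();
    for (int i = 0; i < Math.min(10, entries.size()); i++) {
        top.add(entries.get(i).getKey());
    }
    return top;
}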
You could use a Guava Multiset, adding all the strings, then call Multisets.copyHighestCountFirst() and only look at the first 10.
See this question for an example
