Big-O Memory of Array vs String - java

This might sound dumb, but I was wondering: can't you play around with an algorithm and make O(n) memory seem like O(1)?
(Java)
Let's say you have an array of N elements of true or false.
Then that array would result in O(n) memory.
However, if we instead have a String such as "FFFFFTFTFFT", with charAt(i) answering the query for the i-th index of the array, have we used only O(1) memory, or is it still considered O(n) memory since the String itself has size O(n)?
Let's take this further.
If we have an N-array of true and false and convert it to bytes, we use even less memory. Is the byte representation then considered O(1) memory or O(n) memory?
For instance, let's say n = 6. Then the array size is 6 = O(n). But the byte size is just 1 byte, since one byte holds 8 bits and can therefore store 8 true/false flags. So is this O(1), or is this O(n), since for large N we get the following case:
N equals 10000. The array is O(n) memory, but what memory is the byte representation? Because our bytes are O(n/8) = O(n)?

All the cases you've described are O(n). Big-O describes the limiting behavior as n tends towards infinity; mathematically:
f(n) = O(n) as n -> INF means that f(n)/n stays bounded by some constant as n -> INF
So 10*n + 100 = O(n) and 0.1*n = O(n).
And, as you wrote, the next statement is correct too: O(n/8) = O(n) = O(n/const).
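To make the point concrete, here is a minimal Java sketch (the method name is just for illustration) of packing a boolean[] into a byte[]: the packed form uses about n/8 bytes instead of n bytes, but it still grows linearly with n, so it is still O(n) memory.
static byte[] pack(boolean[] flags) {
    byte[] packed = new byte[(flags.length + 7) / 8];   // ceil(n / 8) bytes -- still linear in n
    for (int i = 0; i < flags.length; i++) {
        if (flags[i]) {
            packed[i / 8] |= 1 << (i % 8);               // set bit (i % 8) of byte (i / 8)
        }
    }
    return packed;
}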

I'm not sure you understand the concepts of Big O completely, but you still have N elements in each of the listed cases.
The notation O(N) is an upper bound for a function of N elements, not so much defined by the size of the underlying datatypes, since as noted O(N/8) = O(N).
So, example,
If we have an N-array of true and false and convert this to bytes
You are converting N elements into bytes; that conversion is O(N) time complexity. You have now stored two structures whose sizes both grow linearly with N, which is still O(N) space complexity.
charAt(i)
This operation alone is O(1) time complexity because you are accessing one element. But you still have N elements in the array or string, so it is O(N) space complexity.
I'm not really sure there is a common O(1) space complexity algorithm (outside of simple math operations).

There is another misconception here: in order to truly make a "character container" with that O(1) property (more precisely O(log n), since the required memory still grows with growing data), it would only work for exactly that: strings that contain n characters of one kind and 1 character of another kind.
In such cases, yes, you would only need to remember the index that has the different character. That is similar to defining a super sparse matrix: if only one value is != 0 in a huge matrix, you could store only the corresponding indexes instead of the whole matrix with gazillions of 0 values.
And of course there are libraries that do such things for sparse matrices, to reduce the cost of keeping known 0 values in memory. Why remember something when you can (easily) compute it?!
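As a tiny illustration of that special case (the class is made up for this example): if a flag sequence is known to be all false except for a single true, remembering just the length and that one index is enough, no matter how large n gets.
final class SingleTrueFlags {
    private final int length;      // how many flags are represented
    private final int trueIndex;   // position of the single true flag

    SingleTrueFlags(int length, int trueIndex) {
        this.length = length;
        this.trueIndex = trueIndex;
    }

    int size() { return length; }

    boolean get(int i) { return i == trueIndex; }   // every other position is false
}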

Related

Best way to retrieve K largest elements from large unsorted arrays?

I recently had a coding test during an interview. I was told:
There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?
During this, I was strongly hinted that I needed to sort the array.
So I suggested using the built-in sort(), or maybe a custom implementation if performance really mattered. I was then told that by using a Collection or array to store the k largest and a for-loop, it is possible to achieve approximately O(N). In hindsight, I think it's O(N*k), because each iteration needs to compare against the K-sized array to find the smallest element to replace, while the need to sort the array would make the code at least O(N log N).
I then reviewed this link on SO that suggests priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N). Write a program to find 100 largest numbers out of an array of 1 billion numbers
Is the for-loop method bad? How should I justify the pros and cons of using the for-loop versus the priority-queue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other retrieval method is called on the sorted array, it should be constant time. Is there some performance factor in running the actual code that I didn't consider when theorizing about pseudocode?
Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:
Find the kth largest number x using Quickselect (O(n))
Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
Return your saved elements
(If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)
The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
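For reference, here is a rough sketch of the Quickselect step (the method names are mine, and this is a plain Lomuto partition with a random pivot, so it is O(n) on average and O(n^2) in the worst case). Once you have x, a second O(n) pass collects the elements >= x as described above.
// uses java.util.concurrent.ThreadLocalRandom
static int kthLargest(int[] a, int k) {
    int lo = 0, hi = a.length - 1;
    int target = a.length - k;              // index of the k-th largest in ascending order
    while (true) {
        int p = partition(a, lo, hi);
        if (p == target) return a[p];
        if (p < target) lo = p + 1; else hi = p - 1;
    }
}

// Lomuto partition around a randomly chosen pivot; returns the pivot's final index
static int partition(int[] a, int lo, int hi) {
    int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
    swap(a, r, hi);
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) swap(a, i++, j);
    }
    swap(a, i, hi);
    return i;
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }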
There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using a built-in sort() or maybe a custom implementation
That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).
If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).
Instead, we can maintain a PriorityQueue which stores the result. While iterating over the source array, for each element we check whether the queue has reached size K. If not, the element is added to the queue; otherwise (its size equals K) we compare the next element against the lowest element in the queue: if the next element is smaller or equal we ignore it, and if it is greater, the lowest element is removed and the new element is added.
The time complexity of this approach would be O(n log k), because adding a new element into a PriorityQueue of size k costs O(log k), and in the worst-case scenario this operation is performed n times (since we're iterating over an array of size n).
Note that the best case time complexity would be Ω(n), i.e. linear.
So the difference between sorting and using a PriorityQueue in terms of Big O boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n, this approach gives a significant performance gain.
Here's an implementation:
// uses java.util.Queue, java.util.PriorityQueue, java.util.Collection
public static int[] getHighestK(int[] arr, int k) {
    Queue<Integer> queue = new PriorityQueue<>();   // min-heap: the smallest of the kept elements is at the head
    for (int next : arr) {
        // if the heap already holds k elements and the new one beats the current minimum, evict the minimum
        if (queue.size() == k && queue.peek() < next) queue.remove();
        if (queue.size() < k) queue.add(next);
    }
    return toIntArray(queue);
}

public static int[] toIntArray(Collection<Integer> source) {
    return source.stream().mapToInt(Integer::intValue).toArray();
}
main()
public static void main(String[] args) {
    System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}
Output:
[9, 12, 27]
Sorting in O(n)
We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).
In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
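A small sketch of that histogram variant, assuming (as above) that all values lie in the range [-1000, 1000] and that k <= arr.length; the method name is just illustrative:
static int[] topKByHistogram(int[] arr, int k) {
    final int MIN = -1000, MAX = 1000;            // assumed value range
    int[] counts = new int[MAX - MIN + 1];
    for (int v : arr) counts[v - MIN]++;          // O(n) counting pass
    int[] result = new int[k];
    int filled = 0;
    for (int v = MAX; v >= MIN && filled < k; v--) {            // walk buckets from the highest value down
        for (int c = counts[v - MIN]; c > 0 && filled < k; c--) {
            result[filled++] = v;                 // emit the top k values, duplicates included
        }
    }
    return result;
}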
Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort, which is good at finding sorted runs and will deal with them in linear time.
Timsort is already implemented in Java; it's used to sort objects (not primitives). So we can take advantage of a well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we're given an array of primitives, using the built-in Timsort has an additional cost: we need to copy the contents of the array into a list (or array) of the wrapper type.
This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It can also be solved with quickselect, but, like quicksort, that has poor quadratic worst-case time complexity.
Simply keep a priority queue, implemented as a binary heap, holding at most k values. Walk through the array and insert values into the heap (worst case O(log k)). When the priority queue grows too large, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have discarded the n-k smallest elements, so the k largest elements remain. It's easy to see that the worst-case time complexity is O(n log k), which is faster than O(n log n), at the cost of only O(k) space for the heap.
Here is one idea. Create an int array with the maximum possible size (2147483647, the maximum value of an int). Then, for every number obtained from a for-each over the original array, increment the element at that same index (the number itself) in the newly created array.
So at the end of this for-each I will have something like [1, 0, 2, 0, 3] (the array I created), which represents the numbers [0, 2, 2, 4, 4, 4] (the initial array).
To find the K biggest elements, you can loop backwards over the created array and count down from K to 0 every time you hit an element different from 0. If an element is 2, for example, you have to count that number twice.
The limitation of this approach is that it works only with integers, because of the nature of array indexes...
Also, the range of int in Java is -2147483648 to 2147483647, which means only the non-negative numbers can be used as indexes in the created array.
NOTE: if you know the maximum value that can occur, you can shrink the created array to that size. For example, if the maximum value is 1000, then the array you need to create has size 1000 (or 1001), and this algorithm should perform very fast.
I think you misunderstood what you needed to sort.
You need to keep the K-sized list sorted; you don't need to sort the original N-sized input array. That way, the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list on almost every iteration).
The requirements said that N was very large, but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).
You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).
For the K-sized list, you can take a look at the implementation of Is there a PriorityQueue implementation with fixed capacity and custom comparator? , which uses a PriorityQueue with some additional logic around it.
There is an algorithm to do this with worst-case time complexity O(n*log(k)) and very benign time constants (since there is just one pass through the original array, and the inner part that contributes to the log(k) is only accessed relatively seldom if the input data is well-behaved).
Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
Iterate through the original unsorted array, and for each value v:
If A is not yet full, then
    insert v into A;
else, if v > min(A), then (*)
    insert v into A, and
    remove the lowest value from A.
(*) Note that A can return repeated values if some of the highest k values occur repeatedly in the source set. You can avoid that by a search operation to make sure that v is not yet in A. You'd also want to find a suitable data structure for that (as the priority queue has linear complexity), i.e. a secondary hash table or balanced binary search tree or something like that, both of which are available in java.util.
The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:
this implementation provides O(log(n)) time for the enqueuing and dequeuing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).
Note that, as laid out above, we only ever remove the lowest (first) element from A, so we enjoy the O(log(k)) for that. If you want to avoid duplicates as mentioned above, then you also need to search A for any new value before adding it (O(k) on the priority queue itself), which opens you up to a worst-case overall scenario of O(n*k) instead of O(n*log(k)) for a pre-sorted input array, where every single element v causes the inner loop to fire.
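For completeness, here is a sketch of the duplicate-free variant, using a HashSet for the membership test instead of searching the priority queue itself, so the expected overall cost stays around O(n log k) rather than degrading to O(n*k); the names are illustrative:
// uses java.util.PriorityQueue, java.util.HashSet, java.util.Set
static int[] highestKDistinct(int[] arr, int k) {
    PriorityQueue<Integer> heap = new PriorityQueue<>();   // min-heap of the best (at most k) distinct values
    Set<Integer> inHeap = new HashSet<>();
    for (int v : arr) {
        if (inHeap.contains(v)) continue;                  // O(1) expected: skip values already kept
        if (heap.size() < k) {
            heap.add(v);
            inHeap.add(v);
        } else if (heap.peek() < v) {
            inHeap.remove(heap.poll());                    // evict the current minimum
            heap.add(v);
            inHeap.add(v);
        }
    }
    return heap.stream().mapToInt(Integer::intValue).toArray();
}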

Why do these loop & hashing operations take O(N) time complexity?

Given the array :
int arr[]= {1, 2, 3, 2, 3, 1, 3}
You are asked to find a number within the array that occurs an odd number of times. It's 3 (occurring 3 times). The time complexity should be O(n).
The solution is to use a HashMap. Elements become keys and their counts become values of the HashMap.
// Code belongs to geeksforgeeks.org
// function to find the element occurring odd
    // number of times
    static int getOddOccurrence(int arr[], int n)
    {
        HashMap<Integer,Integer> hmap = new HashMap<>();
        // Putting all elements into the HashMap
        for(int i = 0; i < n; i++)
        {
            if(hmap.containsKey(arr[i]))
            {
                int val = hmap.get(arr[i]);
                // If array element is already present then
                // increase the count of that element.
                hmap.put(arr[i], val + 1); 
            }
            else
                // if array element is not present then put
                // element into the HashMap and initialize 
                // the count to one.
                hmap.put(arr[i], 1); 
        }
        // Checking for odd occurrence of each element present
          // in the HashMap 
        for(Integer a:hmap.keySet())
        {
            if(hmap.get(a) % 2 != 0)
                return a;
        }
        return -1;
    }
I don't get why this overall operation takes O(N) time complexity. If I think about it, the loop alone takes O(N). The hmap.put (an insert operation) and hmap.get (a find operation) calls take O(N) each, and they are nested within the loop. So normally I would think this function takes O(N^2) time. Why does it instead take O(N)?
I don't get why this overall operation takes O(N) time complexity.
You must examine all elements of the array - O(N)
For each element of the array you call contains, get and put on the map. These are O(1) operations. Or, more precisely, they are O(1) on average, amortized over the lifetime of the HashMap. This is due to the fact that a HashMap will grow its hash array when the ratio of the number of elements to the hash array size exceeds the load factor.
O(N) repetitions of 2 or 3 O(1) operations is O(N). QED
Reference:
Is a Java hashmap really O(1)?
Strictly speaking there are a couple of scenarios where a HashMap is not O(1).
If the hash function is poor (or the key distribution is pathological) the hash chains will be unbalanced. With early HashMap implementations, this could lead to (worst case) O(N) operations because operations like get had to search a long hash chain. With recent implementations, HashMap will construct a balanced binary tree for any hash chain that is too long. That leads to worst case O(logN) operations.
HashMap is unable to grow the hash array beyond 2^30 hash buckets. So at that point HashMap complexity starts transitioning to O(log N). However, if you have a map that size, other secondary effects will probably have affected the real performance anyway.
The algorithm first iterates the array of numbers, of size n, to generate the map with counts of occurrences. It should be clear why this is an O(n) operation. Then, after the hashmap has been built, it iterates that map and finds all entries whose counts are odd numbers. The size of this map would in practice be somewhere between 1 (in the case of all input numbers being the same), and n (in the case where all inputs are different). So, this second operation is also bounded by O(n), leaving the entire algorithm O(n).

Java - Space complexity with variable

Is the space complexity here O(n)? Since if k increases by 5, my variable p would also increase by 5.
All this method does right now is get the node at k. For example: 1->5->3, when k = 2, the node is 5.
public ListNode reverseKGroup(ListNode head, int k) {
    int p = 1;                      // the only extra storage: one int, regardless of k
    while (p < k) {
        if (head.next == null) {
            return head;
        }
        head = head.next;
        p++;
    }
    return head;
}
Strictly considering your algorithm, it has space complexity O(1). Your input is the head of a list and a number k, but your algorithm doesn't consume any space beyond a reference head and a number p. In my opinion, the pre-existing list doesn't count toward the space complexity of your method. Your time complexity, however, is O(N).
--- answer to Theo's question in the comment:
p is a number (in this case of primitive type int, so it takes 4 bytes, a constant size). If p increases, that doesn't mean it takes more space, only that a higher number is stored in it. E.g. p = 5 means the bytes are set to "0,0,0,5"; for p = 257, the bytes are "0,0,1,1".
I assume the JVM stores the data in big-endian byte order, so the leading zeros represent the more significant bytes. With little endian, the byte order would be reversed.
Of course, you are right that for very big N, a 32-bit integer is not enough. Therefore, strictly speaking, O(log(N)) bits are necessary to store numbers up to N.
E.g. the number 2^186 needs 187 bits to be stored (a 1 followed by 186 zeros).
But in reality, when working with "usual" data, you don't expect such huge amounts. Just to exceed a 32-bit register (one int), you would need 2^32 data entries (1 entry = 4 bytes for the next reference, at least 4 bytes for the value object reference, and the object itself at least 8 bytes), that is at least 2^36 bytes = 64 gigabytes. Therefore, when a number is used as a counter, it's generally considered constant space. It depends on the task and circumstances.
Depending on whether you consider pre-existing structures part of your space complexity, the space complexity is either O(1) or O(N) where N is the length of the list being operated on since you do not add any new nodes and only reference existing nodes.
k only matters for time complexity.
The only space this algorithm uses is the space for int p, which is constant regardless of input so space complexity is O(1). The time complexity indeed is O(N).

Determining the element that occurred the most in O(n) time and O(1) space

Let me start off by saying that this is not a homework question. I am trying to design a cache whose eviction policy depends on entries that occurred the most in the cache. In software terms, assume we have an array with different elements and we just want to find the element that occurred the most. For example: {1,2,2,5,7,3,2,3} should return 2. Since I am working with hardware, the naive O(n^2) solution would require a tremendous hardware overhead. The smarter solution of using a hash table works well for software because the hash table size can change but in hardware, I will have a fixed size hash table, probably not that big, so collisions will lead to wrong decisions. My question is, in software, can we solve the above problem in O(n) time complexity and O(1) space?
There can't be an O(n) time, O(1) space solution, at least not for the generic case.
As amit points out, by solving this, we find the solution to the element distinctness problem (determining whether all the elements of a list are distinct), which has been proven to take Θ(n log n) time when not using elements to index the computer's memory. If we were to use elements to index the computer's memory, given an unbounded range of values, this requires at least Θ(n) space. Given the reduction of this problem to that one, the bounds for that problem enforces identical bounds on this problem.
However, practically speaking, the range would mostly be bounded, if for no other reason than the type one typically uses to store each element in has a fixed size (e.g. a 32-bit integer). If this is the case, this would allow for an O(n) time, O(1) space solution, albeit possibly too slow and using too much space due to the large constant factors involved (as the time and space complexity would depend on the range of values).
2 options:
Counting sort
Keeping an array of the number of occurrences of each element (the array index being the element), outputting the most frequent.
If you have a bounded range of values, this approach would be O(1) space (and O(n) time). But technically so would the hash table approach, so the constant factors here are presumably too large for this to be acceptable.
Related options are radix sort (has an in-place variant, similar to quicksort) and bucket sort.
Quicksort
Repeatedly partitioning the data based on a selected pivot (through swapping) and recursing on the partitions.
After sorting we can just iterate through the array, keeping track of the longest run of equal consecutive elements.
This would take O(n log n) time and O(1) space.
As you say, the maximum element in your cache may be a very big number, but the following is one possible solution.
Iterate over the array.
Let's say the maximum element that the array holds is m.
For each index i, get the element it holds; let it be array[i].
Now go to the index array[i] and add m to it.
Do the above for all the indexes in the array.
Finally, iterate over the array and return the index holding the maximum element.
TC -> O(N)
SC -> O(1)
It may not be feasible for large m, as in your case. But see if you can optimize or alter this algorithm.
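A minimal sketch of how that in-place counting could look, assuming every value is a non-negative int smaller than the array length (the situation in which the approach is feasible at all); the modulo recovers the original value even after earlier increments, a detail the description above glosses over:
static int mostFrequentInPlace(int[] arr) {
    int n = arr.length;
    for (int i = 0; i < n; i++) {
        arr[arr[i] % n] += n;                // arr[i] % n recovers the original value at position i
    }
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (arr[i] > arr[best]) best = i;    // arr[i] / n would be the count of the value i
    }
    return best;                             // the value that occurred most often
}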
A solution off the top of my head:
As the numbers can be large, I'd consider hashing instead of storing them directly in an array.
Let there be n numbers, 0 to n-1.
Suppose the number occurring the maximum number of times occurs k times.
Let us create n/k buckets, initially all empty.
hash(num) tells whether num is present in any of the buckets.
hash_2(num) stores the number of times num is present in any of the buckets.
for (i = 0 to n-1):
if the number is already present in one of the buckets, increase the count of input[i], something like hash_2(input[i])++
else, if there is an empty bucket, insert input[i] into the first empty bucket and set hash(input[i]) = true
else, if all buckets are full, decrease the count of all numbers in the buckets by 1 and don't add input[i] to any bucket.
If the count of any number becomes zero [see hash_2(number)], set hash(number) = false.
This way you are finally left with at most n/k candidate elements, and the required number is one of them, so you need to traverse the input again, O(N), to find the actual answer.
The space used is O(n/k) (the number of buckets) and the time complexity is O(N), assuming an O(1) hash implementation.
So the performance really depends on k. If k << n, this method performs poorly.
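A rough sketch of that bucket scheme (it is essentially the Misra-Gries heavy-hitters algorithm), using a single HashMap in place of the two hash functions described above; it returns a small candidate set that is guaranteed to contain the most frequent value, and the second O(N) pass over the input then picks the actual answer:
// uses java.util.HashMap, java.util.Map, java.util.Set
static Set<Integer> candidates(int[] input, int k) {
    int buckets = Math.max(1, input.length / k);          // k = occurrences of the most frequent value (assumed known)
    Map<Integer, Integer> counts = new HashMap<>();       // number -> count; never more than `buckets` entries
    for (int v : input) {
        if (counts.containsKey(v)) {
            counts.merge(v, 1, Integer::sum);             // already in a bucket: increase its count
        } else if (counts.size() < buckets) {
            counts.put(v, 1);                             // a free bucket exists: insert the new number
        } else {
            counts.replaceAll((num, c) -> c - 1);         // all buckets full: decrease every count by 1
            counts.values().removeIf(c -> c == 0);        // drop numbers whose count reached zero
        }
    }
    return counts.keySet();
}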
I don't think this answers the question as stated in the title, but actually you can implement a cache with the Least-Frequently-Used eviction policy having constant average time for put, get and remove operations. If you maintain your data structure properly, there's no need to scan all items in order to find the item to evict.
The idea is having a hash table which maps keys to value records. A value record contains the value itself plus a reference to a "counter node". A counter node is a part of a doubly linked list, and consists of:
An access counter
The set of keys having this access count (as a hash set)
next pointer
prev pointer
The list is maintained such that it's always sorted by the access counter (where the head is the minimum), and the counter values are unique. A node with access counter C contains all keys having this access count. Note that this doesn't increase the overall space complexity of the data structure.
A get(K) operation involves promoting K by migrating it to another counter record (either a new one or the next one in the list).
An eviction operation triggered by a put operation roughly consists of checking the head of the list, removing an arbitrary key from its key set, and then removing it from the hash table.
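Here is a compact sketch of that idea. Instead of an explicit doubly linked list of counter nodes it keeps a map from access count to the set of keys with that count, plus the running minimum count, which gives the same amortized O(1) behaviour; the class and method names are illustrative, not from any library.
// uses java.util.HashMap, java.util.LinkedHashSet, java.util.Map
class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> counts = new HashMap<>();
    private final Map<Integer, LinkedHashSet<K>> keysByCount = new HashMap<>();
    private int minCount = 0;

    LfuCache(int capacity) { this.capacity = capacity; }

    V get(K key) {
        if (!values.containsKey(key)) return null;
        promote(key);
        return values.get(key);
    }

    void put(K key, V value) {
        if (capacity <= 0) return;
        if (values.containsKey(key)) {
            values.put(key, value);
            promote(key);
            return;
        }
        if (values.size() == capacity) {
            // evict: take an arbitrary key from the least-frequently-used bucket
            K evict = keysByCount.get(minCount).iterator().next();
            keysByCount.get(minCount).remove(evict);
            values.remove(evict);
            counts.remove(evict);
        }
        values.put(key, value);
        counts.put(key, 1);
        keysByCount.computeIfAbsent(1, c -> new LinkedHashSet<>()).add(key);
        minCount = 1;
    }

    // move the key from its current count bucket to the next one (the "promotion" done by get)
    private void promote(K key) {
        int count = counts.get(key);
        keysByCount.get(count).remove(key);
        if (count == minCount && keysByCount.get(count).isEmpty()) minCount++;
        counts.put(key, count + 1);
        keysByCount.computeIfAbsent(count + 1, c -> new LinkedHashSet<>()).add(key);
    }
}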
It is possible if we make reasonable (to me, anyway) assumptions about your data set.
As you say you could do it if you could hash, because you can simply count-by-hash. The problem is that you may get non-unique hashes. You mention 20-bit numbers, so presumably 2^20 possible values and a desire for a small and fixed amount of working memory for the actual hash counts. This, one presumes, will lead to hash collisions and thus a breakdown of the hashing algorithm. But you can fix this by doing more than one pass with complementary hashing algorithms.
Because these are memory addresses, it's likely not all of the bits are actually going to be capable of being set. For example if you only ever allocate word (4 byte) chunks you can ignore the two least significant bits. I suspect, but don't know, that you're actually only dealing with larger allocation boundaries so it may be even better than this.
Assuming word aligned; that means we have 18 bits to hash.
Next, you presumably have a maximum cache size which is presumably pretty small. I'm going to assume that you're allocating a maximum of <=256 items because then we can use a single byte for the count.
Okay, so to make our hashes we break up the number in the cache into two nine-bit chunks, in order of significance from highest to lowest, discarding the last two bits as discussed above. Take the first of these chunks and use it as a hash to build a first-part count. Then take the second chunk and use it as a hash, but this time only count it if the first-part chunk matches the one we identified as having the highest count. The value left with the highest count in the second pass is now uniquely identified as having the highest count overall.
This runs in O(n) time and requires a 512 byte hash table for counting. If that's too large a table you could divide into three chunks and use a 64 byte table.
Added later
I've been thinking about this and I've realised it has a failure condition: if the first pass counts two groups as having the same number of elements, it cannot effectively distinguish between them. Oh well
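For what it's worth, a rough sketch of that two-pass scheme for word-aligned 20-bit addresses (the two low bits dropped, the remaining 18 bits split into two 9-bit chunks). It uses an int[512] count table rather than the byte table discussed above, and it inherits the tie-breaking failure mode noted in the addendum:
// uses java.util.Arrays
static int mostFrequentTwoPass(int[] addresses) {
    int[] counts = new int[512];
    for (int a : addresses) counts[(a >>> 11) & 0x1FF]++;       // pass 1: count by the high 9 bits (bits 11..19)
    int topHigh = argMax(counts);

    Arrays.fill(counts, 0);
    for (int a : addresses) {
        if (((a >>> 11) & 0x1FF) == topHigh) {
            counts[(a >>> 2) & 0x1FF]++;                        // pass 2: count the low 9 bits (bits 2..10) within that group
        }
    }
    int topLow = argMax(counts);
    return (topHigh << 11) | (topLow << 2);                     // reassemble the word-aligned address
}

static int argMax(int[] counts) {
    int best = 0;
    for (int i = 1; i < counts.length; i++) if (counts[i] > counts[best]) best = i;
    return best;
}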
Assumption: all the elements are integers; for other data types we can achieve the same thing using hashCode().
We can achieve O(n log n) time complexity and O(1) space.
First, sort the array; the time complexity is O(n log n) (we should use an in-place sorting algorithm like quicksort in order to maintain the space complexity).
Then use four integer variables: current, which indicates the value we are currently looking at; count, which indicates the number of occurrences of current; result, which indicates the final result; and resultCount, which indicates the number of occurrences of result.
Iterate from the start to the end of the array data:
// wrapped in a method so the snippet compiles; data is assumed to be sorted already (step 1 above)
static int mostFrequentSorted(int[] data) {
    int result = 0;
    int resultCount = -1;
    int current = data[0];
    int count = 1;
    for (int i = 1; i < data.length; i++) {
        if (data[i] == current) {
            count++;
        } else {
            if (count > resultCount) {
                result = current;
                resultCount = count;
            }
            current = data[i];
            count = 1;
        }
    }
    if (count > resultCount) {           // account for the final run
        result = current;
        resultCount = count;
    }
    return result;
}
So, in the end, only four variables are used.

Data structure recommendation

Developing in Java, I need a data structure to select N distinct random numbers between 0 and 999999.
I want to be able to quickly allocate N numbers and make sure they don't repeat themselves.
The main goal is to not use too much memory while still keeping performance reasonable.
I am considering using a BitSet, but I am not sure about the memory implications.
Can someone tell me whether the memory requirements of this class are related to the number of bits or to the number of set bits? And what is the complexity of setting/testing a bit?
UPDATE:
Thanks for all the replies so far.
I think I had this in my initial wording of this question but removed it when I first saw the BitSet class.
Anyway I wanted to add the following info:
Currently I am looking at an N of a few thousand at most (most likely around 1000-2000) and a number range of 0 to 999999.
But I would like my choice to take into consideration the option of increasing the range to 8 digits (i.e. 0 to 99,999,999) while keeping N in roughly the same range (maybe increasing it to 5K or 10K).
So the "used values" are quite sparse.
It depends on how large N is.
For small values of N, you could use a HashSet<Integer> to hold the numbers you have already issued. This gives you O(1) lookup and O(N) space usage.
A BitSet for the range 0-999999 is going to use roughly 125KB, regardless of the value of N. For large enough values of N, this will be more space efficient than a HashSet. I'm not sure exactly what the value of N is where a BitSet will use less space, but my guesstimate would be 10,000 to 20,000.
Can someone tell me if the memory requirements of BitSet are related to the number of bits or to the number of set bits?
The size is determined either by the largest bit that has ever been set, or the nBits parameter if you use the BitSet(int nBits) constructor.
and what is the complexity to setting/testing a bit ?
Testing bit B is O(1).
Setting bit B is O(1) best case, and O(B) if you need to expand the bitset backing array. However, since the size of the backing array is the next largest power of 2, the cost of expansion can typically be amortized over multiple BitSet operations.
A BitSet covering 1,000,000 values takes up 1,000,000 bits, which is 125,000 bytes or roughly 122 KB, plus some minor overhead and space to grow. An array of the actual numbers, i.e. an int[], takes N × 4 bytes plus some overhead. The break-even point is
4 × N = 125,000
N = 31,250
I'm not intimately familiar with Java internals, but I suspect it won't allocate more than twice the actual space used, so you're using less than 250 KB of memory with a BitSet. Also, an array makes it harder to find duplicates when you need unique integers, so I'd use the BitSet either way and perhaps convert it to an array at the end, if that's more convenient for further processing.
Setting/getting a bit in a BitSet has constant complexity, although it takes a few more operations than reading a value out of a boolean[].
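A minimal sketch of that approach (the method name is illustrative), using a BitSet to track used values and simply re-drawing on a collision, which is fine as long as N is small relative to the range, as in your numbers:
// uses java.util.BitSet, java.util.Random
static int[] pickDistinct(int n, int rangeExclusive) {
    Random rnd = new Random();
    BitSet used = new BitSet(rangeExclusive);     // ~rangeExclusive/8 bytes, regardless of n
    int[] result = new int[n];
    for (int i = 0; i < n; i++) {
        int candidate;
        do {
            candidate = rnd.nextInt(rangeExclusive);
        } while (used.get(candidate));            // O(1) test; collisions are rare when n << range
        used.set(candidate);                      // O(1) set
        result[i] = candidate;
    }
    return result;
}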
