Graphs are often represented using an adjacency matrix. Various sources indicate it is possible to avoid the O(V^2) initialization cost (V is the number of vertices), but I have not been able to figure out how.
In Java, simply by allocating the matrix, e.g. boolean[][] adj = new boolean[V][V], the runtime automatically initializes the array with false, and this comes at O(V^2) cost (the dimensions of the array).
Do I misunderstand? Is it possible to avoid the quadratic cost of initialization of the adjacency matrix, or is this just something theoretical that depends on the implementation language?
That would be possible by using a sparse matrix representation of the adjacency matrix, where only the positions of the "ones" are allocated rather than each and every element of the matrix (which might include a large number of zeros). You might find this thread useful as well.
The default initialization of the matrix's values is in fact a feature. Without the default initialization, wouldn't you still need to initialize every field yourself so you know what value to expect?
Adjacency matrices have this drawback: they are bad in terms of memory efficiency (they require O(n^2) memory cells), and, as you said, their initialization is slower. The initialization, however, is never considered the biggest problem. Believe me, the memory allocation is a lot slower and the required memory is much more limiting than the initialization time.
In many cases people prefer adjacency lists instead of the matrix. Such lists require O(m) memory, where m is the number of edges in the graph. This is a lot more efficient, especially for sparse graphs. The only operation for which this representation is worse than the adjacency matrix is the query "is there an edge between vertices i and j": the matrix answers in O(1) time, while the list will certainly be slower.
However, many of the typical graph algorithms (like Dijkstra, Bellman-Ford, Prim, Tarjan, BFS and DFS) only need to iterate over all the neighbours of a given vertex. All these algorithms benefit immensely from an adjacency list instead of a matrix.
There is a good deal of confusion and misinformation in this thread. In fact, there is a method of avoiding initialization costs of adjacency matrices (and any array in general). However, it is not possible to use the method with Java primitives since they are initialized with default values under the hood.
Suppose you could create an array data[0..n] that is not auto-initialized. To start, it is filled with junk from whatever was previously in memory. If we don't want to spend O(n) time overwriting it, we need some way to differentiate the good data we add from junk data.
The "trick" is to use an auxiliary stack that tracks cells containing good data. The first time we write to data[i], we add index i to the stack. Since a stack only grows as we add to it, it never contains any junk we need to worry about.
Now whenever we access data[k], we can check whether it's junk or not by scanning the stack for k. But that would take O(n) time for each read, defeating the point of an array in the first place!
To solve this, we make another auxiliary array stack_check[0..n] that also starts full of junk. This array contains pointers to elements in the stack. Now when we first write to data[i], we push i onto the stack and set stack_check[i] to point to the new stack element.
If data[k] is good data, then stack_check[k] points to a stack element holding k. If data[k] is junk, then the junk value of stack_check[k] either points outside of the stack or points to some stack element besides k (since k was never put on the stack). Checking this property only takes O(1) time so our array access is still fast.
Bringing it all together, we can create our array and helper structures in O(1) time by letting them be full of junk. On each read and write, we check if the given cell contains junk in O(1) time using our helpers. If we are writing over junk, we update our helper structures to mark the cell as valid data. If we read junk, we can handle it in whatever way is appropriate for the given problem. For example, we could return a default value like 0 (now you can't even tell we didn't initialize it!) or maybe throw an exception.
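To make the bookkeeping concrete, here is a minimal Java sketch of the technique described above (LazyInitArray and its method names are just illustrative). Note the caveat from earlier in the thread: Java zero-initializes arrays anyway, so this only demonstrates the idea; in a language with truly uninitialized allocation it would actually save the O(n) pass.

class LazyInitArray {
    private final int[] data;        // may conceptually hold "junk"
    private final int[] stack;       // indices written so far, in write order
    private final int[] stackCheck;  // stackCheck[i] = position of i in stack, if i is valid
    private int top = 0;             // number of valid entries on the stack

    LazyInitArray(int n) {
        data = new int[n];
        stack = new int[n];
        stackCheck = new int[n];
    }

    private boolean isValid(int i) {
        // valid only if stackCheck[i] points inside the live part of the stack
        // and the stack entry it points to really is i
        return stackCheck[i] >= 0 && stackCheck[i] < top && stack[stackCheck[i]] == i;
    }

    void set(int i, int value) {
        if (!isValid(i)) {           // first write to this cell: record i on the stack
            stack[top] = i;
            stackCheck[i] = top;
            top++;
        }
        data[i] = value;
    }

    int get(int i) {
        return isValid(i) ? data[i] : 0;  // treat junk as a default value of 0
    }
}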
I'll elaborate on A_A's answer. He recommends a sparse matrix, which basically means you're back to maintaining adjacency lists.
You have two reasons to use a matrix - if you don't care about performance at all and like the simple code it offers, or if you do care about performance but your matrix is going to be relatively full (let's say at least 20% full, for the sake of this post).
You obviously do care about performance. If your matrix is going to be relatively empty, its initialization overhead can be meaningful, and you're better off using adjacency lists. If it's going to be quite full, initialization becomes negligible - you'll need to fill the right cells in the matrix (which will take more than initializing it), and you need to process them (which, again, will take more time than initializing it).
Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have a somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering their size the space needed to hold the references alone would be like 80 megabytes, so that would be considered a lower bound for memory consumption.
We can further assume that the structure won't be completely filled, but will be quite dense.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep, and we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that, though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index-maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
[x,null,x,null]
/ \
[A1/B1,null,A2/B2,null] [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys, so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
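Since the question is about Java, here is a hedged sketch of the same idea in Java terms; the class and method names (FlatStore, put, get, offset) are illustrative, not an existing API. It uses four insertion-order index maps plus one flat reference array, with the same index math as above.

import java.util.HashMap;
import java.util.Map;

class FlatStore<V> {
    private static final int A = 100, B = 5, C = 10000, D = 2;
    private final Object[] data = new Object[A * B * C * D];   // ~80 MB of references when full
    private final Map<Object, Integer> aIdx = new HashMap<>(), bIdx = new HashMap<>(),
                                       cIdx = new HashMap<>(), dIdx = new HashMap<>();

    private static int idx(Map<Object, Integer> dim, Object key) {
        // assign the next free index the first time a key is seen
        return dim.computeIfAbsent(key, k -> dim.size());
    }

    void put(Object a, Object b, Object c, Object d, V value) {
        data[offset(idx(aIdx, a), idx(bIdx, b), idx(cIdx, c), idx(dIdx, d))] = value;
    }

    @SuppressWarnings("unchecked")
    V get(Object a, Object b, Object c, Object d) {
        Integer ai = aIdx.get(a), bi = bIdx.get(b), ci = cIdx.get(c), di = dIdx.get(d);
        if (ai == null || bi == null || ci == null || di == null) return null;
        return (V) data[offset(ai, bi, ci, di)];
    }

    private static int offset(int ai, int bi, int ci, int di) {
        // same math as above: DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex))
        return di + D * (ci + C * (bi + B * ai));
    }
}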
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on whether your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*-trees (like the R-tree, but much faster for updates) or the PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient; it scales better with dimensionality, and its depth is limited by the maximum bit width of the values, i.e. a maximum of '10000' requires 14 bits, so the depth will not be more than 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because it will on average anyway return 33%-50% of the whole tree.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the different selectivity of your dimensions, which may result in something close to a full scan of the tree. For example, when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (the one with the least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).
Generally, they say that we have moved from arrays to ArrayList for the following reason:
Arrays are fixed size, whereas ArrayLists are not.
One of the disadvantages of ArrayList is:
When it reaches its capacity, an ArrayList grows to 3/2 of its current size. As a result, memory can be wasted if we do not utilize the space properly. In this scenario, arrays are preferred.
If we use ArrayList.trimToSize(), will that make ArrayList a unanimous choice, eliminating the only advantage (fixed size) arrays have over it?
One short answer would be: trimToSize doesn't solve everything, because shrinking an array after it has grown is not the same as preventing growth in the first place; the former has the cost of copying plus garbage collection.
The longer answer would be: int[] is low level, ArrayList is high level, which means it's more convenient but gives you less control over the details. Thus in business-oriented code (e.g. manipulating a short list of "Products") I'll prefer ArrayList, so that I can forget about the technicalities and focus on the business. In mathematically-oriented code I'll probably go for int[].
There are additional subtle differences, but I'm not sure how relevant they are to you. E.g. concurrency: if you structurally modify an ArrayList while iterating over it (say, from another thread), it is designed to fail fast with a ConcurrentModificationException, because that's the intuitive requirement for most business code. An int[] will allow you to do whatever you want, leaving it up to you to make sure it makes sense. Again, this can all be summarized as "low level"...
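A quick sketch of that fail-fast behaviour (the class name is just illustrative). Strictly speaking, ArrayList only detects structural modification through its iterators, and the check is best-effort rather than guaranteed, so unsynchronized writes from other threads can also silently corrupt the data instead of failing.

import java.util.ArrayList;
import java.util.List;

public class FailFastDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        for (Integer i : list) {       // the for-each loop uses list.iterator() internally
            if (i == 2) {
                list.add(99);          // structural modification during iteration...
            }
        }                              // ...throws ConcurrentModificationException
    }
}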
If you are developing an extremely memory-critical application, need resizability as well, and can trade off some performance, then trimming the ArrayList is your best bet. This is the only time an ArrayList with trimming will be the unanimous choice.
In other situations, what you are actually doing is:
1. You create an ArrayList. The default capacity of the list is 10.
2. You add an element and call trimToSize(). Both size and capacity are now 1. How does trimToSize() work? It creates a new array with the actual size of the list and copies the old array's data into it; the old array is left for garbage collection.
3. You add another element. Since the list is full, it is reallocated with 50% more space, again copying as in step 2.
4. You call trimToSize() again, and it follows the same procedure as step 2.
5. And so on...
So you see, we are incurring a lot of performance overhead just to keep the list's capacity and size the same. Fixed size doesn't offer anything advantageous here except saving a bit of extra space, which is hardly an issue on modern machines.
In a nutshell, if you want resizability without writing lots of boilerplate code, then ArrayList is the unanimous choice. But if the size never changes and you don't need any dynamic functionality such as removal, then an array is the better choice. A few extra bytes are hardly an issue.
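A small sketch of the grow/trim cycle described above (TrimDemo is just an illustrative name). Every call to trimToSize() and every growth past the current capacity allocates a new backing array and copies the elements over, leaving the old array for garbage collection.

import java.util.ArrayList;

public class TrimDemo {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<>(); // default capacity of 10
        list.add(1);
        list.trimToSize();  // capacity shrinks to 1: new backing array + copy
        list.add(2);        // list is at capacity, so it grows again: another copy
        list.trimToSize();  // shrinks to 2: yet another copy
        // repeating this pattern pays the copy cost on nearly every operation
    }
}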
When representing graphs in memory in a language like Java, either an adjacency matrix is used (for dense graphs) or an adjacency list for sparse graphs.
So say we represent the latter like
Map<Integer, LinkedList<Integer>> graph;
The integer key represents the vertex and the LinkedList contains all the other vertices it points to.
Why use a LinkedList to represent the edges? Couldn't an int[] or ArrayList work just as well, or is there a reason why you'd want to represent the edges in a way that maintains an ordering such as
2 -> 4 -> 1 -> 5
Either an int[] or ArrayList could also work.
I wouldn't recommend an int[] right off the bat though, as you'll need to cater for resizing in case you don't know all the sizes from the start, essentially simulating the ArrayList functionality, but it might make sense if memory is an issue.
A LinkedList might be slightly preferable, since with an array or ArrayList you'd need to either make it large enough to handle the maximum possible number of edges or resize it as you go, a problem you don't have with a LinkedList. Then again, creating the graph probably isn't the most resource-intensive task for most applications.
Bottom line - it's most likely going to make a negligible difference for most applications - just pick whichever one you feel most comfortable with (unless of course you need to do access-by-index a lot, or something else at which one of the two performs a lot better than the other).
Algorithms 4th Edition by Sedgewick and Wayne proposes the following desired performance characteristics for a graph implementation that is useful for most graph-processing applications:
1. Space usage proportional to V + E
2. Constant time to add an edge
3. Time proportional to the degree of v to iterate through vertices adjacent to v (constant time per adjacent vertex processed)
Using a linked list to represent the vertices adjacent to each vertex has all these characteristics. Using an array instead of a linked list would result in either (1) or (2) not being achieved.
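For reference, here is a minimal sketch along the lines the book describes (an undirected graph with vertices numbered 0..V-1), using LinkedList as discussed in this thread rather than the book's own Bag type:

import java.util.LinkedList;

public class Graph {
    private final int V;
    private final LinkedList<Integer>[] adj;

    @SuppressWarnings("unchecked")
    public Graph(int V) {
        this.V = V;
        adj = (LinkedList<Integer>[]) new LinkedList[V];  // space proportional to V + E
        for (int v = 0; v < V; v++) {
            adj[v] = new LinkedList<>();
        }
    }

    public void addEdge(int v, int w) {   // constant time
        adj[v].add(w);
        adj[w].add(v);
    }

    public Iterable<Integer> adj(int v) { // iterating costs time proportional to degree(v)
        return adj[v];
    }
}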
I have an assignment in my intro to programming course that I don't understand at all. I've been falling behind because of problems at home. I'm not asking you to do my assignment for me I'm just hoping for some help for a programming boob like me.
The question is this:
Calculate the average-case time complexity for searching, adding, and removing in:
- an unsorted vector
- a sorted vector
- an unsorted singly linked list
- a sorted singly linked list
- a hash table
Let n be the number of elements in the data structure, and present the solution in a table with three rows and five columns.
I'm not sure what this even means.. I've read as much as I can about time complexity but I don't understand it.. It's so confusing. I don't know where I would even start.. Remember I'm a novice programmer, as dumb as they come. I did really well last semester but had problems at home at the start of this one so I missed a lot of lectures and the first assignments so now I'm in over my head..
Maybe if someone could give me the answer and the reasoning behind it on a couple of them I could maybe understand it and do the others? I have a hard time learning through theory, examples work best.
Time complexity is a formula that describes how the cost of an operation varies with the number of elements. It is usually expressed using "big-O" notation, for example O(1) (constant time), O(n) where the cost grows linearly with n, or O(n^2) where the cost increases as the square of the size of the input. There can be others involving exponentials or logarithms. Read up on "Big-O notation".
You are being asked to evaluate five different data structures, and provide average cost for three different operations on each data structure (hence the table with three rows and five columns).
Time complexity is an abstract concept that allows us to compare the complexity of various algorithms by looking at how many operations are performed in order to handle their inputs. To be precise, the exact number of operations isn't important; the bottom line is how the number of operations scales with increasing size of the input.
Generally, the number of inputs is denoted as n and the complexity is denoted as O(p(n)), with p(n) being some expression in n. If an algorithm has O(n) complexity, it means that it scales linearly: with every new input, the time needed to run the algorithm increases by the same amount.
If an algorithm has complexity O(n^2), it means that the number of operations grows as the square of the number of inputs. This goes on and on, up to exponentially complex algorithms that are effectively useless for large enough inputs.
What your professor asks of you is to look at the given operations and judge how they are going to scale with the increasing size of the lists you are handling. Basically, this is done by looking at the algorithm and imagining what kinds of loops are going to be necessary. For example, if the task is to pick the first element, the complexity is O(1), meaning that it doesn't depend on the size of the input. However, if you want to find a given element in the list, you need to scan the whole list, and the cost depends on the list size. I hope this gives you a bit of an idea of how algorithmic complexity works; good luck with your assignment.
Ok, well there are a few things you have to start with first. Algorithmic complexity has a lot of heavy math behind it and so it is hard for novices to understand, especially if you try to look up Wikipedia definitions or more-formal definitions.
A simple definition is that time-complexity is basically a way to measure how much an operation costs to perform. Alternatively, you can also use it to see how long a particular algorithm can take to run.
Complexity is described using what is known as big-O notation. You'll usually end up seeing things like O(1) and O(n). n is usually the number of elements (possibly in a structure) on which the algorithm is operating.
So let's look at a few big-O notations:
O(1): This means that the operation runs in constant time. What this means is that regardless of the number of elements, the operation always runs in constant time. An example is looking at the first element in a non-empty array (arr[0]). This will always run in constant time because you only have to directly look at the very first element in an array.
O(n): This means that the time required for the operation increases linearly with the number of elements. An example is if you have an array of numbers and you want to find the largest number. To do this, you will have to, in the worst case, look at every single number in the array until you find the largest one. Why is that? This is because you can have a case where the largest number is the last number in the array. So you cannot be sure until you have examined every number in the array. This is why the cost of this operation is O(n).
O(n^2): This means that the time required for the operation increases quadratically with the number of elements. This usually means that for each element in the set of elements, you are running through the entire set. So that is n x n or n^2. A well-known example is the bubble-sort algorithm. In this algorithm you run through and swap adjacent elements to ensure that the array is sorted according to the order you need. The array is sorted when no-more swaps need to be made. So you have multiple passes through the array, which in the worst case is equal to the number of elements in the array.
Now there are certain things in code that you can look at to get a hint to see if the algorithm is O(n) or O(n^2).
Single loops are usually O(n), since it means you are iterating over a set of elements once:
for(int i = 0; i < n; i++) {
    ...
}
Doubly-nested loops are usually O(n^2), since you are iterating over an entire set of elements for each element in the set:
for(int i = 0; i < n; i++) {
    for(int j = 0; j < n; j++) {
        ...
    }
}
Now how does this apply to your homework? I'm not going to give you the answer directly, but I will give you more than enough hints to figure it out :). What I wrote above describing big-O should also help you. Your homework asks you to apply runtime analysis to different data structures. Well, certain data structures have certain runtime properties based on how they are set up.
For example, in a linked list, the only way you can get to an element in the middle of the list, is by starting with the first element and then following the next pointer until you find the element that you want. Think about that. How many steps would it take for you to find the element that you need? What do you think those steps are related to? Do the number of elements in the list have any bearing on that? How can you represent the cost of this function using big-O notation?
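As a concrete hint, here is a tiny sketch of that walk (Node and LinkedListSearch are hypothetical names for a singly linked node and a helper class). Count how many steps the loop takes in the worst case and relate that to n yourself.

class LinkedListSearch {
    static class Node {
        int value;
        Node next;
    }

    static boolean contains(Node head, int target) {
        for (Node cur = head; cur != null; cur = cur.next) { // one step per element visited
            if (cur.value == target) return true;
        }
        return false;
    }
}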
For each data structure that your teacher has asked you about, try to figure out how it is set up and work out manually what each operation (searching, adding, removing) entails. I'm talking about writing the steps out and drawing pictures of the structures on paper! This will help you out immensely! Looking at that, you should have enough information to figure out the number of steps required and how it relates to the number of elements in the set.
Using this approach you should be able to solve your homework. Good luck!
As an optional assignment, I'm thinking about writing my own implementation of the BigInteger class, where I will provide my own methods for addition, subtraction, multiplication, etc.
This will be for arbitrarily long integer numbers, even hundreds of digits long.
While doing the math on these numbers digit by digit isn't hard, what do you think the best data structure would be to represent my "BigInteger"?
At first I was considering using an Array but then I was thinking I could still potentially overflow (run out of array slots) after a large add or multiplication. Would this be a good case to use a linked list, since I can tack on digits with O(1) time complexity?
Is there some other data-structure that would be even better suited than a linked list? Should the type that my data-structure holds be the smallest possible integer type I have available to me?
Also, should I be careful about how I store my "carry" variable? Should it, itself, be of my "BigInteger" type?
Check out the book C Interfaces and Implementations by David R. Hanson. It has 2 chapters on the subject, covering the vector structure, word size and many other issues you are likely to encounter.
It's written for C, but most of it is applicable to C++ and/or Java. And if you use C++ it will be a bit simpler because you can use something like std::vector to manage the array allocation for you.
Always use the smallest int type that will do the job you need (bytes). A linked list should work well, since you won't have to worry about overflowing.
If you use binary trees (whose leaves are ints), you get all the advantages of the linked list (unbounded number of digits, etc.) with simpler divide-and-conquer algorithms. In this case you do not have a single base but many, depending on the level at which you're working.
If you do this, you need to use a BigInteger for the carry. You may consider it an advantage of the "linked list of ints" approach that the carry can always be represented as an int (and this is true for any base, not just base 10 as most answers seem to assume you should use; in any base, the carry is always a single digit).
I might as well say it: it would be a terrible waste to use base 10 when you can use 2^30 or 2^31.
Accessing elements of a linked list is slow. I think arrays are the way to go, with lots of bounds checking and run-time array resizing as needed.
Clarification: Traversing a linked list and traversing an array are both O(n) operations, but traversing a linked list requires dereferencing a pointer at each step. Just because two algorithms have the same complexity doesn't mean they take the same time to run. The overhead of allocating and deallocating n nodes in a linked list will also be much heavier than the memory management of a single array of size n, even if the array has to be resized a few times.
Wow, there are some… interesting answers here. I'd recommend reading a book rather than trying to sort through all this contradictory advice.
That said, C/C++ is also ill-suited to this task. Big-integer arithmetic is a kind of extended-precision math. Most CPUs provide instructions to handle extended-precision math at comparable or the same speed (bits per instruction) as normal math. When you add 2^31 + 2^31 in a 32-bit register, the answer is 0… but there is also a special carry flag output from the processor's ALU which a program can read and use.
Standard C++ cannot access that flag, and there's no portable way in C either. You have to use assembler.
Just to satisfy curiosity, you can use the standard Boolean arithmetic to recover carry bits etc. But you will be much better off downloading an existing library.
I would say an array of ints.
An array is indeed a natural fit. I think it is acceptable to throw an OverflowException when you run out of room. The teacher will see attention to detail.
A multiplication roughly doubles the number of digits, while addition increases it by at most 1. It is easy to create a sufficiently big array to store the result of your operation.
The carry is at most a one-digit number in multiplication (9*9 = 81, i.e. digit 1, carry 8). A single int will do.
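Here is a hedged Java sketch of schoolbook addition on base-10 digit arrays (least significant digit first), matching the sizing argument above: the sum needs at most one extra digit, and the carry always fits in a plain int (0 or 1 for addition). DigitMath is just an illustrative name.

class DigitMath {
    static int[] add(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        int[] result = new int[n + 1];              // room for at most one extra digit
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int sum = carry
                    + (i < a.length ? a[i] : 0)
                    + (i < b.length ? b[i] : 0);
            result[i] = sum % 10;                   // current digit
            carry = sum / 10;                       // 0 or 1
        }
        result[n] = carry;
        return result;
    }
}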
std::vector<bool> or std::vector<unsigned int> is probably what you want. You will have to push_back() or resize() on them as you need more space for multiplication, etc. Also, remember to push_back the correct sign bits if you're using two's complement.
I would say a std::vector of char (since it only has to hold 0-9), if you plan to work in BCD.
If not BCD, then use a vector of int (you didn't make that clear).
Much less space overhead than a linked list.
And all the advice says "use vector unless you have a good reason not to".
As a rule of thumb, use std::vector instead of std::list, unless you need to insert elements in the middle of the sequence very often. Vectors tend to be faster, since they are stored contiguously and thus benefit from better spatial locality (a major performance factor on modern platforms).
Make sure you use elements that are natural for the platform. If you want to be platform independent, use long. Remember that unless you have some special compiler intrinsics available, you'll need a type at least twice as large to perform multiplication.
I don't understand why you'd want carry to be a big integer. Carry is a single bit for addition and element-sized for multiplication.
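To illustrate the "twice as large" point in Java terms (the language this thread started with), here is a hedged sketch of multiplying a little-endian array of 32-bit limbs by a single 32-bit value, using a 64-bit long for the intermediate product so the high half can carry into the next limb. LimbMath is just an illustrative name.

class LimbMath {
    private static final long MASK = 0xFFFFFFFFL;   // treat ints as unsigned 32-bit limbs

    static int[] mulBySingle(int[] a, int m) {
        int[] result = new int[a.length + 1];        // product may need one extra limb
        long carry = 0;                              // carry is limb-sized, as noted above
        for (int i = 0; i < a.length; i++) {
            long prod = (a[i] & MASK) * (m & MASK) + carry;  // needs the full 64 bits
            result[i] = (int) prod;                  // low 32 bits are the result limb
            carry = prod >>> 32;                     // high 32 bits carry into the next limb
        }
        result[a.length] = (int) carry;
        return result;
    }
}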
Make sure you read Knuth's Art of Computer Programming, algorithms pertaining to arbitrary precision arithmetic are described there to a great extent.