December 07, 2020
These are my notes from CS 3000, Data Structures and Algorithms. I took this class in the fall of 2020.
See the outline on the left to jump to a relevant section. Here’s where the second half of the semester starts.
Many problems are easier to solve when data is sorted.
Selection sort takes $O(n^2)$ time.
At step $i$, insert element $i$ into the prefix $[0…i]$ so that the prefix remains sorted. Before the $i$'th element is moved, the array is split into the first section $[0…i-1]$, which contains elements in sorted order, and the rest of the array, $[i…n-1]$, which is unsorted. After the final iteration, the entire array is sorted. This takes $O(n^2)$ time as well, since iteration $j$ could take time proportional to $j$ to insert element $j$ into the sorted section.
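A minimal Python sketch of the insertion procedure described above:

```python
def insertion_sort(arr):
    """Sort arr in place; the prefix arr[0..i] stays sorted after step i."""
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        # Shift larger elements of the sorted prefix right to make room.
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
    return arr
```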
Recursively divide the array into two halves, sort the halves, and recombine them. We can show this is correct by induction. At each level of the recursion tree, the subproblem size is halved, and a linear amount of work is done recombining the two halves. This comes from the recurrence $T(n) = T(n/2) + T(n/2) + Cn = 2T(n/2) + Cn$. There are $\log_2(n)$ levels in this tree, so the total work is the work per level times the number of levels, which is $Cn\log_2(n)$.
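A sketch of this divide-and-conquer in Python, allocating new lists for simplicity rather than merging in place:

```python
def merge_sort(arr):
    """Sort by recursively halving, then merging in linear time per level."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])
    # Merge: the linear work done at each level of the recursion tree.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```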
The limit test: compute $\lim_{n\to\infty} \frac{f(n)}{g(n)}$. If the limit is $0$, then $f = o(g)$; if it is a finite positive constant, then $f = \Theta(g)$; if it is $\infty$, then $f = \omega(g)$.
For this class, using this limit test is an acceptable way to prove runtime.
The Master Theorem is a general way to determine the runtime of a divide-and-conquer algorithm that splits the problem into a pieces of size $\frac{n}{b}$, and then merges solutions in time $O(n^d)$. The recurrence of such an algorithm would be $T(n) = a\cdot T(\frac{n}{b}) + C\cdot n^d$, with base case of $T(1) = 1$.
Note: we can use a still more general form of the master theorem, where we replace $n^d$ with a function $f(n)$. This does not change much.
Choosing a pivot element and partitioning around it takes $\Theta(n)$ time. When the recursive call is made, the size of the problem is reduced. In the average case, the pivot lands near the middle, so the subproblems shrink quickly.
Problem: Given n points on a line, find the pair of points with the smallest distance from each other.
Sorting Solution: We could solve this by sorting all the points, then walking through them in order to find the pair with the smallest distance between them. This takes $O(n\log n)$ time since it is bounded by the time of sorting.
Non-Sorting Solution: We can split this problem into two sections of equal size, by first finding the median element using linear-time selection. Once the median is selected, one subproblem will be all elements less than the median and including the median, and the other subproblem will include all elements greater than the median and including the median. We recurse over both halves.
The recurrence is $T(n) = 2T(n/2) + 2n = Θ(n\log(n))$, which achieves the same runtime as the sorting solution.
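The sorting solution above is short enough to sketch directly in Python (returning the closest pair itself):

```python
def closest_pair_1d(points):
    """Return the pair of points on a line with the smallest gap, O(n log n)."""
    pts = sorted(points)
    # After sorting, the closest pair must be adjacent in sorted order.
    return min(zip(pts, pts[1:]), key=lambda pair: pair[1] - pair[0])
```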
Problem: Given n points in 2D space, find the pair of points with the smallest Euclidean distance.
Once again we can divide our problem into two subproblems by selecting the median value for X-coordinate among all points. This splits our data into two halves. We can recurse over the two halves.
In regard to pairs of points that may cross over the median line: such a pair only matters if its distance is smaller than the closest pair found within each half. If the closest distance between any pair found within either half is $\delta$, then the only points we need to consider are the points within $\delta$ of the median X line, and each point only needs to be compared against points within $\delta$ of itself.
Using some math we can show that there are actually a constant number of points to be compared against, so only a linear number of comparisons are made. If we sort by y coordinate, we can achieve this in $O(n \log(n))$ time. The total recurrence would be $T(n) = 2T(n/2) + Θ(n\log(n))$, which works out to $Θ(n \log^2(n))$. We can fix this by embedding merge sort in every layer of the recursive call, returning each half in sorted order – this enables merging in linear time, removing the need to sort after getting the points back. This cuts the runtime down to $Θ(n\log(n))$.
Memoization | Bottom Up |
---|---|
Keeping a global array to remember previously computed solutions. This allows us to calculate a recurrence in linear time. | Computing a recurrence iteratively, so by the time a previously computed solution is needed, it has already been calculated. |
Weighted Interval Scheduling | Segmented Least Squares |
---|---|
Outputs a subset of the input | Outputs a partitioning of the input |
“Does the solution include the last item?” | “What is the last segment of the solution?” |
Input: $n$ intervals, each (start, end) with a value $v_i$, sorted by finish time in ascending order.
Output: A list of compatible intervals that maximizes the sum of the intervals' values. Two intervals are compatible if neither begins before the other ends.
Naive approaches, such as choosing the most valuable intervals first or choosing the earliest intervals, fail easily.
Let us begin with a probing question: Does the optimal schedule contain the last interval?
Let us continue with an insightful answer: The optimal schedule either contains, or does not contain, the last interval.
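A bottom-up sketch in Python, assuming intervals arrive as `(start, end, value)` tuples already sorted by finish time:

```python
import bisect

def weighted_interval_scheduling(intervals):
    """intervals: list of (start, end, value), sorted by end time.
    Returns the maximum total value of a compatible subset."""
    ends = [end for _, end, _ in intervals]
    n = len(intervals)
    OPT = [0] * (n + 1)
    for i in range(1, n + 1):
        start, end, value = intervals[i - 1]
        # p = number of earlier intervals finishing no later than this one starts.
        p = bisect.bisect_right(ends, start, 0, i - 1)
        # Either the solution excludes interval i, or it includes it
        # plus the best schedule over the compatible prefix.
        OPT[i] = max(OPT[i - 1], value + OPT[p])
    return OPT[n]
```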
Given: $n$ data points in 2D space, as well as a cost parameter $c$.
Output: A partitioning of the data points into contiguous segments, each fit with a line, minimizing the total cost: the sum of each line's error plus a constant penalty $c$ for every new segment.
We begin with a probing question: What is the final segment?
We continue with an insightful answer: The final segment is the segment that minimizes the total cost of itself and the rest of the segments.
Given: A knapsack with maximum weight $T$, and $n$ items, each with a value $v_i$ and a weight $w_i$.
Output: A subset of items whose total weight is at most $T$, and which maximizes the total value.
Special cases: subset sum, when values = weights, or tug-of-war, when $T = \frac{1}{2}$ of the total weight of all items
We begin with a probing question: Does the optimal knapsack include item $n$?
We continue with an insightful answer: The optimal knapsack either includes the item, or it does not.
The recurrence to the knapsack problem is interesting because it involves two variables (maximum weight of the knapsack, and the subset of the items).
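A bottom-up sketch of this two-variable recurrence in Python, assuming integer weights so the table can be indexed by capacity:

```python
def knapsack(values, weights, T):
    """0/1 knapsack: maximum value with total weight at most T.
    OPT[i][t] = best value using the first i items with capacity t."""
    n = len(values)
    OPT = [[0] * (T + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(T + 1):
            OPT[i][t] = OPT[i - 1][t]            # case: exclude item i
            if weights[i - 1] <= t:              # case: include item i, if it fits
                OPT[i][t] = max(OPT[i][t],
                                values[i - 1] + OPT[i - 1][t - weights[i - 1]])
    return OPT[n][T]
```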
Given: A set of frames of a video, and access to a black box that can determine the Quality of a set of frames.
Output: A set of contiguous blocks of frames that maximizes the total quality.
We begin with a probing question: What is the final segment?
We continue with an insightful answer: The final segment is any number of frames $\{i, i+1, \dots, n\}$ that maximizes the sum of the quality of that segment and the optimal segmentation of all frames that come before it.
$\text{OPT}(j)$ = the value of the optimal segmentation of frames $\{1 \dots j\}$.
$\text{OPT}(j) = \max_{1 \le i \le j}\{\text{OPT}(i-1) + \text{Quality}(i, j)\}$
$\text{OPT}(0) = 0$
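A sketch of this recurrence in Python; `quality` here is a stand-in parameter for the black-box Quality oracle, so any callable of two frame indices works:

```python
def best_segmentation(n, quality):
    """OPT(j) = max over 1 <= i <= j of OPT(i-1) + quality(i, j); OPT(0) = 0.
    `quality(i, j)` stands in for the black-box Quality of frames i..j."""
    OPT = [0] * (n + 1)
    for j in range(1, n + 1):
        # Try every possible final segment {i, ..., j}.
        OPT[j] = max(OPT[i - 1] + quality(i, j) for i in range(1, j + 1))
    return OPT[n]
```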
Given: Two strings, $X$ and $Y$ of lengths $n$ and $m$ respectively.
Output: The edit distance between $X$ and $Y$, defined as the number of insertions, deletions, and swaps needed to turn one string into another. This can be visualized as aligning the two strings, and adding the number of positions where the two strings disagree.
We begin with a probing question: What goes in the final column of the optimal alignment?
We continue with an insightful answer: The final column of the optimal alignment could contain a symbol from X against a gap in Y, a symbol from Y against a gap in X, or symbols from both X and Y.
Our recurrence therefore is to take the minimum of these three cases.
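A bottom-up sketch of the three-case recurrence in Python:

```python
def edit_distance(X, Y):
    """OPT[i][j] = edit distance between X[:i] and Y[:j]."""
    n, m = len(X), len(Y)
    OPT = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        OPT[i][0] = i          # delete all of X[:i]
    for j in range(m + 1):
        OPT[0][j] = j          # insert all of Y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if X[i - 1] == Y[j - 1] else 1
            OPT[i][j] = min(OPT[i - 1][j] + 1,             # X symbol, gap in Y
                            OPT[i][j - 1] + 1,             # Y symbol, gap in X
                            OPT[i - 1][j - 1] + mismatch)  # symbols from both
    return OPT[n][m]
```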
Given: Two strings, X and Y, of lengths n and m respectively.
Output: The longest common subsequence of X and Y
We begin with a probing question: Are the last symbols of X and Y part of the subsequence?
We continue with an insightful answer: If the last symbols of X and Y match, they are part of the subsequence; otherwise, either the last symbol of X is not part of the subsequence, or the last symbol of Y is not part of the subsequence.
Our recurrence takes the maximum of these two cases, and in the second case, takes the maximum of the two subcases of symbols not matching.
Our bottom-up solution here takes time O(mn), since the table we need to fill in is of size mn.
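A sketch of the $O(mn)$ bottom-up table in Python:

```python
def longest_common_subsequence(X, Y):
    """OPT[i][j] = length of the LCS of X[:i] and Y[:j]."""
    n, m = len(X), len(Y)
    OPT = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                OPT[i][j] = OPT[i - 1][j - 1] + 1  # matching symbols join the LCS
            else:
                # Drop the last symbol of X, or the last symbol of Y.
                OPT[i][j] = max(OPT[i - 1][j], OPT[i][j - 1])
    return OPT[n][m]
```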
Given: A sequence $X_1, X_2, \dots, X_n$ of anything comparable
Output: the longest increasing subsequence. A subsequence is given by a set of strictly increasing indices, and it is increasing if the elements at those indices are strictly increasing.
We begin with a probing question: Is the last element in the longest increasing subsequence?
We continue with an insightful answer: The last element either is, or is not.
Remember that if we take the last element to be part of the longest increasing subsequence, we must still fulfill the requirement that the subsequence is strictly increasing. We can filter the subproblem to the values of $X$ that are less than $X_n$.
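An $O(n^2)$ bottom-up sketch in Python, where the filtering shows up as the `X[j] < X[i]` check:

```python
def longest_increasing_subsequence(X):
    """L[i] = length of the longest increasing subsequence ending at X[i]."""
    n = len(X)
    L = [1] * n
    for i in range(n):
        for j in range(i):
            # Only extend subsequences whose last element is below X[i].
            if X[j] < X[i]:
                L[i] = max(L[i], L[j] + 1)
    return max(L, default=0)
```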
Maximum edges in a simple undirected graph with n vertices: $\binom{n}{2} = \frac{n(n-1)}{2}$, because you can have an edge between any pair of vertices.
A directed graph is strongly connected if for every two vertices $u$ and $v$ in the graph, there is a path from u to v and from v to u.
In any undirected graph with $n$ vertices and $m$ edges, the sum of vertex degrees is $2m$. If you add up the degrees of all vertices, each edge is counted twice (once per endpoint), so naturally the sum of degrees is $2m$.
Adjacency matrices use a matrix form where entry $(i, j)$ indicates whether edge $(i, j)$ is in the set of edges in the graph. This uses $O(n^2)$ space, allows edge lookups in $Θ(1)$ time, and neighbor enumeration in $Θ(n)$ time.
Adjacency Lists use a two-dimensional linked list structure where the adjacency list associated with a vertex is the set of all vertices that are directly connected to that vertex. This uses space Θ(m+n).
Depth-first search starts at a vertex and examines the nodes that share an edge with it, following the first edge to an undiscovered neighbor and making a recursive call there. Nodes are marked as discovered once visited, and are visited only once.
Using depth-first search, edges can be categorized into four categories: tree edges, back edges, forward edges, and cross edges.
Keeping track of the discovery and finish times of each node can yield insights into the type of graph we are looking at and the properties of its edges. For any directed edge (u ⇒ v), with d[u] = discovery time of u and f[u] = finish time of u:
Using adjacency lists, DFS runs in O(n+m) time and space.
Directed Acyclic Graphs (DAGs) are directed graphs with no directed cycles. They give rise to topological orderings, which are labelings of the nodes in the graph such that all edges go from a node with lesser index to a node with greater index. A graph has a topological ordering if and only if it is a DAG.
Observe: in a DAG, there is always a node with no incoming edges – an in-degree of 0.
In any traversal of a DAG, you will never find a back edge. A back edge would imply the existence of a directed cycle, which we know a DAG will never have.
In order to produce a list of nodes in topological order, we could use the fact that every DAG has a node with an in-degree of 0. We could repeatedly find a node with no incoming edges, remove it from the graph, and append it to the ordering. Implemented carefully, this takes O(n+m) time.
Another approach would be to utilize the discovery and finish times of a DFS traversal of the graph. We claim that ordering the vertices by decreasing finish time creates a topological ordering.
Assume for the sake of contradiction that vertices are in order of decreasing finish time, but this does not create a topological ordering. This means that there are two nodes u and v such that f[v] > f[u], but there is an edge from u to v. That makes (u, v) a back edge – a contradiction, since the presence of a back edge implies a cycle, and a DAG cannot have any cycles.
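A sketch of the decreasing-finish-time approach in Python, assuming the graph is given as an adjacency dict and is a DAG:

```python
def topological_sort(graph):
    """graph: dict mapping node -> list of successors (assumed to be a DAG).
    Nodes listed in decreasing DFS finish time form a topological order."""
    visited, order = set(), []
    def dfs(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs(v)
        order.append(u)     # appended at u's finish time
    for u in graph:
        if u not in visited:
            dfs(u)
    return order[::-1]      # reversed = decreasing finish time
```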
Problem: given an undirected graph, split it into connected components – a labeling of vertices, by their connected component.
Algorithm:
The total runtime is Θ(# of vertices + # of edges), which is linear since every vertex and edge is only looked at once.
Given a directed graph, split it into strongly connected components.
By definition, the strongly connected component containing a node $s$ consists of all nodes $v$ that are both reachable from $s$ and can reach $s$. Observe that DFS(s) gives the set of all nodes reachable from $s$, and DFS(s) on the graph with all directed edges reversed gives the set of all nodes that can reach $s$. The intersection of these two sets is exactly the strongly connected component containing $s$.
This naive approach runs in O(n(n+m)) time: for each of the n vertices, you run DFS twice, and each run takes O(n+m).
But also, observe that if you started the DFS from a “sink” component, a component with no outward connections, then you would be able to recover that component without visiting all the nodes in the graph. If we do the DFS’s in the right order, we can actually find the strongly connected components in linear time.
Repeatedly, until all nodes are marked as belonging to a strongly connected component:
A node lying in a sink component of the original graph is the node with the largest finish time in a DFS traversal of the reversed graph.
In a breadth-first search, we first explore all the immediate neighbors of the starting node, before exploring the immediate neighbors of each of those neighbors. This can be done on both directed and undirected graphs.
In the tree produced by breadth-first search, notice that edges of the graph only occur within a level, or between consecutive levels. If an edge skipped a level, BFS would have placed its far endpoint on the earlier level.
Breadth-first search’s easy application is to find the distances from one node to every other node, where distance is defined as the number of edges in the shortest path between the two nodes.
BFS runs in $O(n+m)$ time using adjacency lists.
A graph is bipartite if its vertices can be split into two halves, where all edges go between the two halves and no edge connects two nodes within the same half.
A related and useful problem is that of 2-coloring, which asks if nodes in a graph can be assigned a coloring such that no two nodes of the same color share an edge. We can solve the 2-coloring problem by running a breadth-first search and assigning alternating colors to alternating levels of the BFS tree. If we ever encounter an edge between two nodes in the same level, then this graph cannot be 2-colored.
We can see that a successful 2-coloring of a graph implies that the graph is bipartite, and that if we fail at 2-coloring, then the graph is not bipartite. If BFS 2-coloring fails, this means the graph contains a cycle of odd length. It is impossible to color a cycle of odd length in two colors and have no two same-colored vertices share an edge.
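A sketch of BFS 2-coloring in Python, assuming an adjacency-dict representation; it returns `None` exactly when a same-color (odd-cycle) edge is found:

```python
from collections import deque

def two_color(graph):
    """BFS 2-coloring: returns a node -> color dict if the graph is
    bipartite, or None if some edge joins two same-colored nodes."""
    color = {}
    for s in graph:
        if s in color:
            continue                        # component already colored
        color[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # alternate colors by BFS level
                    queue.append(v)
                elif color[v] == color[u]:   # same-level edge: odd cycle
                    return None
    return color
```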
Shortest paths in a graph are particularly interesting in a weighted graph, where each edge has a weight. The length of a path is the sum of edge weights on the path, while the distance from $s$ to $t$ is the length of the shortest path from $s$ to $t$.
Dijkstra’s algorithm only works on graphs with non-negative edge weights.
There are two key observations we will make, which will help us in understanding Dijkstra’s Algorithm. Both concern the situation where we have three nodes $s, u, v$, we know some shortest path from $s$ to $u$, and we know the weight of the edge from $u$ to $v$.
- If $(u, v)\in E$, then $d(s,v)\le d(s,u)+w(u,v)$ for every node $v$. This tells us that the shortest path from $s$ to $v$ cannot be longer than the path made by taking an intermediate hop through another node $u$: going from $s$ to $u$, then across edge $(u, v)$. If the shortest path were longer than $d(s,u)+w(u,v)$, it wouldn’t be a shortest path, because we just exhibited a shorter one.
- If $(u,v)\in E$ and $d(s,v)=d(s,u)+w(u,v)$, then there exists a shortest $s\to v$ path that ends with edge $(u,v)$. This is saying that if the distance from $s$ to $v$ equals the distance from $s$ to $u$ plus the weight of edge $(u,v)$, then some shortest path from $s$ to $v$ does indeed end in $(u,v)$.
Dijkstra’s algorithm:
Before we explore the $k$-th node $v$, the previous distance to $v$ that was known when the last node was explored, is tight. We only decide to explore the $k$-th node when it is the next closest unexplored node. If there was another way to reach $v$ with a smaller distance than what we already know for $v$, we would have chosen to explore that node instead since it would be the next closest unexplored node. But we didn’t, which indicates that it is not possible to have a shorter path from $s$ to $v$ than what we’ve already found. We can show this more formally using strong induction (which I will not).
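A heap-based sketch of Dijkstra's algorithm in Python, assuming non-negative weights and an adjacency dict of `(neighbor, weight)` pairs:

```python
import heapq

def dijkstra(graph, s):
    """graph: dict u -> list of (v, weight) with non-negative weights.
    Returns a dict of shortest distances from s."""
    dist = {s: 0}
    explored = set()
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in explored:
            continue               # stale heap entry; a shorter path won
        explored.add(u)            # d is now tight for u
        for v, w in graph.get(u, []):
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w    # relax edge (u, v)
                heapq.heappush(heap, (dist[v], v))
    return dist
```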
Heaps organize data as a complete binary tree, where each node has at most 2 children. If node $a$ is the parent of node $b$, then $\text{val}(a) \le \text{val}(b)$. Heaps have a few operations that are efficient:
ExtractMin
: Returns and removes the smallest value in the heap. Remove the topmost node, move the last node in the tree into its place, and then swap it downward until the heap order is restored. In the worst case this node travels all the way back to the bottom of the tree, $\log{n}$ levels, so this takes $O(\log{n})$ time.

DecreaseKey
: Change the value of a key, then perform swaps upward to restore the heap order. This is again at most $O(\log{n})$ swaps.

Insert
: Puts a key in the correct location, and then repairs the heap like before. $O(\log{n})$.

Lookup
: Finding an arbitrary key in a plain heap takes linear time, but with the key-to-index dictionary described next, it takes $O(1)$.

We can implement heaps directly in an array, and use a dictionary to map keys to their locations in the array. This lets us find the position of a key in $O(1)$ time and perform swaps in $O(1)$ time. Using this approach, our actual runtimes are:

ExtractMin, DecreaseKey, Insert: $O(\log{n})$ time.

Using heaps, we can have Dijkstra’s algorithm run in time $O(m\log{n})$. Using Fibonacci heaps and some other tricks we can get it to run even faster!
Negative edge weights come up in modeling currency exchange rates or chemical reactions that release energy. Dijkstra’s algorithm fails to produce a correct shortest path when our graph has negative edge weights, and simply adding a constant value to all edge weights does not fix this.
Bellman-Ford uses dynamic programming.
Let $\text{OPT}(v, j)$ be the length of the shortest path from $s$ to $v$ with at most $j$ hops $(0\le j \le n-1)$, where $n$ is the number of vertices. On a path with $n$ nodes and no cycles, you can have at most $n-1$ edges. We count on never taking a cycle because a cycle with non-negative weight never shortens a path, and for now we assume there are no negative cycles.
We observe that on every shortest path from $s$ to $v$, there is a final hop on some edge $(u,v)$. The optimal length of the shortest path from $s$ to $v$ that makes at most $j$ hops is:

\[\text{OPT}(v,j)=\min\left(\text{OPT}(v,j-1),\ \min_{(u,v)\in E}{(\text{OPT}(u,j-1)+w(u,v))}\right)\]

with base cases $\text{OPT}(s,0)=0$ and $\text{OPT}(v,0)=\infty$ for $v\ne s$.

There are two cases: either we don’t need the full $j$ hops, in which case the length is just $\text{OPT}(v,j-1)$, or we select the best last hop using $\min_{(u,v)\in E}{(\text{OPT}(u,j-1)+w(u,v))}$.
A bottom-up implementation of Bellman-Ford takes $O(nm)$ time, and has a space complexity of $O(n^2)$. There are a few optimizations we can do:
We can also use Bellman-Ford to detect negative cycles: run it for $n$ iterations instead of $n-1$, and check whether $\text{OPT}(v,n) < \text{OPT}(v, n-1)$ for some $v\in V$. If so, taking more edges than any simple path can contain – that is, taking a cycle – made some route shorter. That could never happen with a non-negative cycle, so the graph must contain a negative cycle. If the condition never holds, there are no negative cycles.
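A bottom-up sketch of Bellman-Ford in Python, using the single-array space optimization and an extra round for negative-cycle detection:

```python
def bellman_ford(n, edges, s):
    """n nodes numbered 0..n-1; edges: list of (u, v, w), weights may be
    negative. Returns distances from s, or None if a negative cycle exists."""
    INF = float('inf')
    dist = [INF] * n
    dist[s] = 0
    for _ in range(n - 1):             # a simple path has at most n-1 hops
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w  # relax the last hop (u, v)
    # An n-th round of improvement means some cycle shortens a route,
    # which is only possible if that cycle has negative weight.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None
    return dist
```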
Given a set of nodes $V$, a set of possible edges $E\subseteq V\times V$, and a weight function on edges $w(e)$, we want to build a network with the cheapest set of edges that connects all of the nodes in $V$. This problem comes up a lot in network design.
Input: A weighted graph $G=(V, E, {W_e})$ that is undirected, connected, and could have negative weights. Assume all edge weights are distinct.
Output: A spanning tree of minimum cost
There are four primary MST algorithms: Prim’s, Kruskal’s, Reverse-Kruskal’s, and Boruvka’s.
Define a cut as a subset of vertices, and the cutset as the edges which have one endpoint in the cut and one endpoint out of the cut. Edges between two vertices in the cut are not in the cutset.
Define a safe edge as the minimum-weight edge in the cutset of a cut $S$. The MST must contain this edge.
Define a cycle as a path where the first and last nodes are the same.
In any cycle $C$, there is a maximum-weight edge, called $f$. The MST does not contain $f$; $f$ is called the useless edge.
Fact: a cycle and a cutset must intersect in an even number of edges. Every time you enter the cut, you must leave it.
Assume we have a cycle $C$ in a graph, and the MST $T^*$ contains $f$, the maximum-weight edge of $C$. If we removed $f$ from $T^*$, we would break the tree into two connected components, call them CC1 and the rest. We know that $f$ is in the cutset of CC1, and since a cycle and a cutset must intersect in an even number of edges, there must be another edge of $C$ in the cutset. Call it $e$. We know that $w(e)<w(f)$, so replacing $f$ with $e$ yields a spanning tree of smaller cost: $\text{cost}(T^*-f+e) < \text{cost}(T^*)$. This is a contradiction to $T^*$ being an MST, since we’ve found another tree that spans and has less weight. Therefore edge $f$ must not be in the MST.
So in summary:
Prim’s Algorithm | Kruskal’s Algorithm |
---|---|
Iteratively builds the solution, one connected component at a time | Maintains several connected components separately and joins them together |
Faster on dense graphs | Faster on sparse graphs |
$O(m\log n)$ runtime | $O(m\log n)$ runtime |
Prim’s algorithm produces a MST. Intuitively, Prim’s algorithm gradually expands the search each time only adding safe edges of the cut containing all already handled nodes and ensuring that $T$ is connected. By using the cut property, Prim’s algorithm ensures that every time an edge is added to the MST, it connects to a node not already in the MST, and does so in the cheapest possible way.
An implementation of Prim’s algorithm using heaps will run in $O(m\log{n})$.
Kruskal’s algorithm also produces an MST. Kruskal’s is greedy in always adding the next least valued edge that would reduce the number of connected components in the graph. To begin, each node is its own connected component, and there exist no edges.
We can do this efficiently by first sorting the edges in ascending order of weight, then adding edges in increasing weight order until there is only one connected component. By induction, we know that every edge added is safe: the first edge is safe, and each subsequent edge we pick goes between two connected components, and is thus safe. The result must also be acyclic, because we only add edges that reduce the number of connected components: an edge that would create a cycle would not decrease the number of connected components.
In order to be efficient, we need a data structure that can efficiently perform the operations we would like on connected components. This is the union-find data structure, which lets us group nodes into components with operations `Find(u)`, which looks up which component contains `u`, and `Union(u, v)`, which merges the connected components of `u` and `v`.

A naive union-find implementation would use a dictionary, which would enable `Find(u)` in $O(1)$ but would take $O(n)$ time to perform a `Union(u, v)`, since it would need to update the assignments of all elements in those groups.
If we used an array where element $i$ denoted the current component of vertex $i$, and a linked list for the items in each component, we could join components in $O(\text{length of smaller component})$, and $O(1)$ to update assignments in our array.
So the total runtime is $O(k\log{k})$ for union-find to perform $k$ unions.
Kruskal’s algorithm, using this implementation of union-find, takes time $O(m\log n)$ because the runtime is dominated by the speed of sorting all edges.
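A sketch of Kruskal's algorithm with a size-based union-find in Python, assuming nodes are numbered 0 to n-1:

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) over nodes 0..n-1.
    Returns (total MST weight, list of MST edges)."""
    parent = list(range(n))
    size = [1] * n
    def find(u):                        # walk up to the component's root
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving shortcut
            u = parent[u]
        return u
    mst, total = [], 0
    for w, u, v in sorted(edges):       # runtime is dominated by this sort
        ru, rv = find(u), find(v)
        if ru != rv:                    # edge joins two components: safe edge
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru             # merge smaller component into larger
            size[ru] += size[rv]
            mst.append((u, v))
            total += w
    return total, mst
```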
In amortized analysis, we argue that the running time of the worst case of an algorithm is not as bad as it sounds because the number of times the worst case can happen is bounded by something. In other words, usually the worst case doesn’t happen. For example, consider a resizing vector, or an ArrayList implementation that must resize the underlying array and copy elements to an array twice as large. This worst case runtime is $O(n)$ for an insertion, but usually it’s not that bad.
When computing the amortized runtime using aggregate analysis, we calculate the total amount of work done by a set of calls, then divide by the number of operations performed. We average over one worst-case run.
We don’t average over different input distributions, but over one single worst case execution. We want to show that in the worst case, it may not be possible to incur the largest cost on every iteration.
Given a stack with $O(1)$ `push` and `pop`, add a function `multipop()` which calls `pop()` $k$ times, giving a runtime of $O(n)$ in the worst case. But note that `multipop` can only take as much time as the number of `push()` operations we perform first. The worst case operation must be balanced by a number of fast operations.
We define our single worst case run as pushing $N-1$ items onto the stack, and then calling `multipop()`. The total work done is $N \cdot O(1) + O(N)$, since each `push()` is $O(1)$. $O(N) / N = O(1)$, so `multipop()`’s amortized runtime is constant.
Imagine incrementing a number, and thinking about it in terms of bit flips in the number’s binary representation. We are interested in counting the total number of bits that we flip. In the worst case, a single increment may flip all $k = \log_2{N}$ bits in one step: `011...1` to `100...0`. But we know that the less significant bits flip a lot more than the more significant bits: bit $i$ flips once every $2^i$ increments. We can express the total number of flips over $N$ increments as a sum:

\[\sum_{i=0}^{k-1}{\frac{N}{2^i}} \le N\sum_{i=0}^{\infty}{\frac{1}{2^i}} = 2N\]

We can sum to infinity because that can only be greater than summing to $k-1$, and this sequence converges. We see that $N$ increments use $O(N)$ flips, not $O(N\log N)$. The amortized number of flips per increment is $O(1)$.
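A small Python experiment confirming the aggregate bound, counting the actual flips performed over N increments:

```python
def count_bit_flips(N):
    """Total bits flipped while incrementing a binary counter N times."""
    counter, flips = 0, 0
    for _ in range(N):
        bit = 0
        while (counter >> bit) & 1:   # trailing 1s flip back to 0
            flips += 1
            bit += 1
        flips += 1                    # the first 0 flips to 1
        counter += 1
    return flips
```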
Every time we must resize the array, we double it. In $N$ insertions, we double $\log_2{N}$ times. The cost of the $j$’th doubling is $2^j$ (assuming we start at size 0), so the total work for $N$ insertions is:
\[N + \sum_{j=0}^{\log{N}}{2^j}=N + (2^{1 + \log{N}}-1) < N + 2N = 3N\]

The amortized cost of an add operation is $\frac{3N}{N} = 3 = O(1)$.
Imagine “prepaying” for future expensive operations with a cheap operation.
We could assign a cost of $2$ to `push()`, and $0$ to `pop()` and `multipop()`. Now we’ve overpaid for `push()`, and we break even if every pushed item is popped. Note that we can never underpay: the cost we’ve paid is always greater than or equal to the actual cost. In the worst case, the work is just counting the number of pushes, $O(2N)$. This makes analysis simpler: we only count pushes.
Every time we flip a bit to $1$, we can prepay for flipping it back to $0$. The credit never goes negative, and in $N$ increments we make at most $N$ flips to $1$, so the total cost is $2N = O(N)$. This matches the aggregate analysis approach, but is much easier to reason about.
We can budget 3 operations that every `add` operation pays for:
After every resizing, no credit is left over.
Define potential as a function of what operations can do on the data structure. It’s kind of like the potential energy of a data structure. It is related in some well-defined way to the work that operations can do on the data structure, and the operations that the data structure could support in the future.
Potential is a measure of a data structure’s state. Potential functions must always be non-negative, and should bound/limit the work done for the most expensive operations that would reduce the potential. This could be the number of items in the list/stack, for example. This lends a key insight:
Amortized cost = (true cost) + (change in potential)
Given $\Phi(s)$ is the potential of the data structure in state $s$, the amortized cost of operation $i$ is $C_i + \Phi(s_i) - \Phi(s_{i-1})$. The amortized cost is the true cost of operation, $C_i$, added to the difference in potential between states $s_i$ (after the operation) and $s_{i-1}$ (before the operation).
Actually, this is just a roundabout way to calculate the actual sums. The sum over all amortized costs is equal to the sum of all true costs plus the total change in potential, which is greater than or equal to the sum of all true costs. This guarantees that the outcome of the potential method is at least the true cost of operations, which is good for our upper bound.
$\sum_i{(C_i+\Phi(S_i) - \Phi(S_{i-1})) = (\sum_i{C_i}) + \Phi(S_n) - \Phi(S_0)} \geq \sum_i{C_i}$
The amortized costs generated by examining the change in potential give us the same costs as we saw in the accounting method.
The number of $1$’s in the counter is the potential: it starts at 0, and is non-negative.
Define $\Phi(s) = 2\cdot(N-\frac{\text{array.length}}{2})$ , where $N$ is the number of items in the array. This sets the potential to be $0$ when the array is half full, and the potential to be $N$ when the array is full. The amortized cost ends up being 3 again.
Randomized algorithms are useful when you have a problem where many choices are good, and all you need to do is avoid the rare bad choices. These are situations where a deterministic algorithm could be forced into its worst case. The key fact we must remember is that of expected wait times, which informs the expected success:
If something has an independent probability $p$ of happening each trial, then the expected time until it occurs is $1/p$.
For example, the expected time to roll 5 on a 6-sided die is 6; the expected time to flip heads on a coin is 2. As a result, we usually don’t have to wait very long to find the hay in the haystack.
The task is to find the $k$’th smallest item. We divide items into piles smaller than and greater than a random pivot, then recurse over the appropriate side.
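A sketch of randomized selection (quickselect) in Python, where k = 1 asks for the minimum:

```python
import random

def quickselect(items, k):
    """Return the k-th smallest item (k = 1 is the minimum)."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    if k <= len(smaller):
        return quickselect(smaller, k)     # answer is in the smaller pile
    if k <= len(smaller) + len(equal):
        return pivot                       # the pivot itself is the answer
    # Otherwise recurse into the larger pile, adjusting k.
    return quickselect(larger, k - len(smaller) - len(equal))
```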
Linearity of expectation: $\text{E}[X+Y]=\text{E}[X]+\text{E}[Y]$. This is cool because it works even if $X$ and $Y$ are not independent events.
If we are collecting coupons, how long will it take us to collect all $n$ unique coupons? Let $x_i$ be the number of coupons we collect to go from $i$ to $i+1$ unique coupons; by linearity of expectation, $\text{E}[x] = \sum_{i=0}^{n-1}{\text{E}[x_i]}$.
What is $\text{E}[x_i]$? When we already hold $i$ unique coupons, a fresh coupon is new with probability $\frac{n-i}{n}$, so by the expected wait time fact, $\text{E}[x_i] = \frac{n}{n-i}$. Summing gives $\text{E}[x] = n\sum_{i=0}^{n-1}{\frac{1}{n-i}} = \Theta(n\log n)$.
QuickSort takes a random pivot and partitions elements into smaller and larger piles, then sorts those piles. This has a worst case of $\Theta(N^2)$ but an expected time of $\Theta(N\log N)$. Randomness protects us from bad pivot choices.
Big Idea: Repeated runs of an algorithm can increase its chance of success. One round of the Miller-Rabin Primality Test catches a composite number with probability around $3/4$. If we run it many many times, the probability of a number passing $k$ such tests and still being composite is at most $(\frac{1}{4})^k$ – if you run 100 tests, the chance it is composite is about $\frac{1}{2^{200}}$, while the chance of being struck by lightning this year is around $\frac{1}{2^{20}}$.
More specifically, prime numbers satisfy Fermat’s Little Theorem: if $p$ is prime, then $a^{p-1} \mod{p}=1$ for all $a$ with $1 \le a \le p-1$. Unfortunately, some composite numbers still pass this test for all values of $a$: we call these Carmichael Numbers. Fortunately, we also know that $1$ has no nontrivial square roots modulo a prime: if $a^2\mod{p}=1$, then $a \mod{p} = 1$ or $a \mod{p} = p-1$.
So we will compute $a^{p-1} \mod p$ by repeated squaring, and we can reject $p$ if we ever square a value other than $\pm 1$ and get $1$. This catches the Carmichael numbers. We repeat this many times with random values of $a$ chosen from $[2, p-2]$.
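A sketch of the Miller-Rabin test following the idea above (my own implementation; the seeded RNG is just to keep the example reproducible):

```python
import random

def is_probably_prime(p, rounds=40, seed=0):
    """Miller-Rabin: write p-1 = 2^r * d with d odd. For a prime p, the sequence
    a^d, a^(2d), ... must hit 1 only via +/-1, since 1 has no nontrivial square
    roots mod a prime."""
    if p < 2:
        return False
    if p in (2, 3):
        return True
    if p % 2 == 0:
        return False
    r, d = 0, p - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    rng = random.Random(seed)
    for _ in range(rounds):
        a = rng.randrange(2, p - 1)  # random base in [2, p-2]
        x = pow(a, d, p)
        if x == 1 or x == p - 1:
            continue
        for _ in range(r - 1):
            x = pow(x, 2, p)
            if x == p - 1:
                break
        else:
            return False  # witness found: p is definitely composite
    return True  # passed every round: probably prime
```

Note that 561 is a Carmichael number, so it passes the plain Fermat test for every coprime base, but the square-root check still rejects it.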
Hash functions are designed to have an equal probability of sending an input to a value in a set of possible values, unless the inputs are identical. A linear function on bits modulo some prime number seems to work well for this. The chance of collisions in a table of size $N$ is $\frac{1}{N}$, regardless of how similar or dissimilar elements are. This low collision probability is what makes the expected time of `put` and `get` $O(1)$, but the worst case is linear time if you have to traverse all the elements that collide.
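A small sketch (mine, not from lecture) of a hash table with chaining, which makes the expected/worst case split concrete: `put` and `get` land in one bucket in expectation, but a bucket full of colliding keys degrades to a linear scan.

```python
class ChainedHashTable:
    """Hash table with chaining: expected O(1) put/get when the hash spreads
    keys uniformly; worst case O(n) if everything collides into one bucket."""
    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):  # traverses any colliding elements
            if k == key:
                return v
        raise KeyError(key)
```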
Bloom filters are useful in applications when we want to know with relative certainty if we’ve seen a key before or not. This could be useful in a network filter that is blacklisting IP addresses. Rather than use a huge hashtable, we can use many smaller hashtables each with different random hashing functions.
These smaller tables can be relatively small, but if all of them agree, we can be fairly sure we’ve seen a key before. With $m$ bits, the chance of a collision is $\frac{1}{m}$, and the chance of wrongly saying “yes, I’ve seen this key” in $b$ different tables is $(\frac{1}{m})^b$. We would need a single hashtable of size $m^b$ to have the same false positive rate as $b$ tables of size $m$.
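A sketch of this idea in Python (my own; the random salts stand in for the "different random hashing functions", and the sizes are arbitrary):

```python
import random

class BloomFilter:
    """b small bit-tables of size m, each with its own salted hash function.
    A key is reported 'probably seen' only if every table agrees."""
    def __init__(self, m=1000, b=4, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(b)]
        self.m = m
        self.tables = [[False] * m for _ in range(b)]

    def _slots(self, key):
        return [hash((salt, key)) % self.m for salt in self.salts]

    def add(self, key):
        for table, slot in zip(self.tables, self._slots(key)):
            table[slot] = True

    def probably_contains(self, key):
        # False means definitely never added; True may be a false positive
        return all(table[slot] for table, slot in zip(self.tables, self._slots(key)))
```

The one-sided guarantee is the point: an added key is never reported missing, while a never-added key is wrongly reported present only with probability about $(\frac{1}{m})^b$.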
Greedy algorithms take actions that prioritize a short-term goal which is often easily computable. This short term goal is often to minimize or maximize something, which lends itself well to speedup from sorting input data (in the case that all data is known beforehand) or storing data in a min/max-heap (in the case that the algorithm must operate on changing data).
Greedy algorithms commit to a single solution as they go, unlike dynamic programming approaches which don’t commit to any one solution until the very end.
A simple example of a greedy algorithm is the change-making process using US coins. We begin by finding the biggest coin that fits into the total amount of change left to give back, and simply repeat this until the entire change is made. This happens to work with the US currency denominations, but surprisingly this approach is not guaranteed to work in general. If we had to make $0.66 of change and also had $0.22 coins available, we would first pick out a quarter and struggle along with five coins instead of simply picking three 22-cent coins.
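The change-making process above can be sketched in a few lines (my own example; amounts are in cents):

```python
def greedy_change(amount, denominations):
    """Repeatedly take the biggest coin that fits. Optimal for US coins,
    but not for arbitrary denominations."""
    coins = []
    for coin in sorted(denominations, reverse=True):
        while amount >= coin:
            coins.append(coin)
            amount -= coin
    return coins

us = [1, 5, 10, 25]  # US denominations in cents: greedy happens to be optimal here
```

With a 22-cent coin in the mix, greedy still grabs the quarter first and ends up with five coins for 66 cents, even though three 22-cent coins would do.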
Often, greedy algorithms will make use of priority queues or min/max-heaps to avoid a potential bottleneck of multiple sorts. Both Prim’s and Dijkstra’s algorithms are greedy algorithms.
Problem: many people want to use a resource, and our goal is to simply satisfy the most intervals.
There are plenty of greedy but non-optimal approaches: for example, picking the request that starts earliest, picking the shortest request, or picking the request with the fewest conflicts – each of these has a counterexample.
Optimal approach: Repeatedly pick the request that ends first (gives us the most time left after its completion) – as long as the request does not conflict with any previous requests.
Like most greedy algorithms, this one’s runtime is dominated by sorting. We begin by sorting by the value we’re being greedy about (finish time), giving an $O(n\log{n})$ time complexity overall. If we knew something about the distribution of the dataset or some other prior, we might be able to do the sorting in $O(n)$.
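The earliest-finish-first rule can be sketched as follows (my own sketch; requests are hypothetical `(start, end)` pairs):

```python
def max_intervals(requests):
    """Earliest-finish-time-first: sort by end time, then keep any request that
    starts at or after the last chosen finish time. O(n log n) from the sort."""
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(requests, key=lambda r: r[1]):
        if start >= last_end:   # no conflict with previously chosen requests
            chosen.append((start, end))
            last_end = end
    return chosen
```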
In this proof strategy, we show that the greedy algorithm starts out optimal and never falls behind an optimal solution at any step. We compete our algorithm against an imaginary optimal algorithm, making no assumptions about how that algorithm works.
Base case: At $i=1$, the greedy solution picks out the very earliest-finishing request in the whole set, so this is the only optimal solution possible at this point. The optimal algorithm can’t possibly finish any sooner than our solution finishes.
Inductive step: Assume the claim is true for all $i \leq k$. Then, there is no way for the greedy algorithm to fall behind at request $k+1$. The optimal move is legal for the greedy algorithm because we finished earlier or at the same time as the optimal algorithm at the previous time step.
By our greedy algorithm, our request $i$ always ends no later than the optimal solution’s request $i$. This means that our approach must return an optimal scheduling. The greedy algorithm’s solution is always at least as good as the optimal algorithm’s solution.
This is a relative of the knapsack problem, but this time you are bringing spices, so you can arbitrarily divide up quantities and take a fractional weight of an item, which gives you a fractional value. The greedy approach we will use is to take as much as possible of the item with the highest value-to-weight ratio, then switch to the next best item.
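The fractional (spice) knapsack rule can be sketched as follows (my own example; items are hypothetical `(value, weight)` pairs):

```python
def fractional_knapsack(capacity, items):
    """items: (value, weight) pairs. Take the best value-per-weight item first,
    possibly only a fraction of the last one. Returns total value carried."""
    total = 0.0
    for value, weight in sorted(items, key=lambda i: i[0] / i[1], reverse=True):
        take = min(weight, capacity)       # whole item, or whatever still fits
        total += value * (take / weight)   # fractional weight gives fractional value
        capacity -= take
        if capacity == 0:
            break
    return total
```

With capacity 50 and items worth (60, 10), (100, 20), (120, 30), greedy takes the first two whole and two-thirds of the last, for a total value of 240.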
Instead of only having one resource that multiple users would like to use, we want to schedule all requests using as few resources as possible. For example, we could be trying to schedule multiple tracks in a conference to use the least number of rooms.
We can solve this by sorting the requests by start time, and then greedily assigning each request to the earliest resource that is available.
We claim that the maximum number of simultaneous requests is the minimum number of resources we need. If there are $d$ requests all for the same time, it is not possible to create a schedule with fewer than $d$ resources. We call this the depth of the requests.
$d$ is a lower bound for the number of resources used. Our algorithm never uses more than $d$ resources at once. Because our algorithm meets the lower bound, no solution can be more optimal than ours.
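The assignment above is a natural fit for a min-heap of finish times (a sketch of mine, not from lecture; requests are `(start, end)` pairs):

```python
import heapq

def min_resources(requests):
    """Sort by start time; give each request to a resource that has already
    freed up (min-heap of finish times), else open a new resource. The heap
    size at the end equals the depth of the requests."""
    finish_times = []  # min-heap: earliest-finishing resource on top
    for start, end in sorted(requests):
        if finish_times and finish_times[0] <= start:
            heapq.heapreplace(finish_times, end)  # reuse the freed resource
        else:
            heapq.heappush(finish_times, end)     # need a brand new resource
    return len(finish_times)
```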
We want to schedule tasks, each of which has a duration and a deadline, in order to minimize the largest deadline miss.
Our greedy algorithm will schedule the tasks in order of the earliest deadline. We will use the exchange argument to prove its correctness.
The crux of the exchange argument is that we can transform an optimal solution into our solution through a number of moves that preserve optimality – moves that do not make the solution any less optimal. Call the arbitrary optimal solution OPT.
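The earliest-deadline-first schedule itself is tiny to write down (my own sketch; tasks are hypothetical `(duration, deadline)` pairs and all tasks start at time 0):

```python
def schedule_edf(tasks):
    """tasks: (duration, deadline) pairs. Running them in order of earliest
    deadline minimizes the maximum lateness. Returns (schedule, max_lateness)."""
    time = 0
    max_late = 0
    schedule = []
    for duration, deadline in sorted(tasks, key=lambda t: t[1]):
        time += duration
        schedule.append((duration, deadline))
        max_late = max(max_late, time - deadline)  # lateness of this task
    return schedule, max_late
```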
ASCII is an encoding, but it’s not very efficient for compression because every character has the same encoding length: 8 bits. The crux of all compression is making sure that often-used characters have a shorter encoding than less common characters. This is called a variable-length code, and it is used in compressing text, images, and other media. Not all variable-length codes are good: Morse code is one variable-length code that suffers from ambiguity, since a sequence of Morse code could be decoded as more than one output.
Prefix codes assure there can be no ambiguity about what a sequence of coded characters means, because no character’s code is a prefix of another’s. When expressed as a binary tree, prefix codes have all characters at leaves of the tree.
Information = \(-\log_2{Pr(s)}\) , where \(Pr(s)\) is the probability of some symbol \(s\).
\(Pr(s) =\) (count of char s) / (count of all symbols)
ABL (Average Bits per Letter) is a measure of how compressed some information is, given frequencies of a character. \(\sum_x{f_x \cdot \gamma(x)}\) where \(f_x\) is the frequency or probability of a symbol occurring, and \(\gamma(x)\) is the length of the code for that symbol.
Huffman compression is a greedy algorithm for generating the optimal code for each symbol in a corpus of symbols to be encoded. Huffman compression begins by labeling all symbols with their frequency, or number of occurrences in a representative sample of text. It then repeatedly merges the two least frequent symbols into a node that has them as children and takes on their summed frequency. These nodes are recursively merged with the next smallest groupings of symbols, until all symbol-frequency tuples are merged into one binary tree.
This method ensures that the deepest leaves of the tree are the least common symbols, while the leaves closest to the root are the most commonly occurring symbols. This guarantees that the code is not only unambiguous, but also optimal (it minimizes ABL).
Given k unique characters, Huffman compression needs k-1 rounds of merging, since each merge results in one fewer connected component.
Huffman’s compression is greedy because it keeps looking for the two least commonly used symbols and merging them, so we can apply a min-heap data structure to avoid the need for sorting. In each round of merging, Huffman’s algorithm needs to extract the two smallest elements, taking \(O(2\cdot\log{k}) = O(\log{k})\) time. Inserting a new key is \(O(\log{k})\) time. The overall runtime is \(O(k\cdot\log{k})\), since we repeat this merging \(k-1\) times. Note that the runtime depends on the number of unique characters, not the length of the message.
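The heap-based merging can be sketched as follows (my own implementation; the counter is only there to break ties in the heap without comparing trees):

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Repeatedly merge the two least frequent subtrees (k-1 merges total) and
    read codes off the resulting binary tree. frequencies: {symbol: count}."""
    tiebreak = count()
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two cheapest subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...merged

    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # internal node: recurse left/right
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:                                # leaf: a symbol
            codes[node] = code or "0"        # single-symbol edge case
    walk(heap[0][2], "")
    return codes
```

On the frequencies {a: 45, b: 13, c: 12, d: 16, e: 9, f: 5}, the most common symbol gets a 1-bit code and the rarest get 4-bit codes, and no code is a prefix of another.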
\(ABL(T') = ABL(T) - f_m\) when two Huffman trees \(T\) and \(T'\) differ by a node \(m\) with children \(y\) and \(z\), and \(f_m = f_y + f_z\).
Unlike the Huffman encoding, LZW is able to create codes that take advantage of patterns in texts, compressing multiple occurrences. It can be used by itself, or in conjunction with Huffman encodings.
LZW is able to build a codebook in parallel on the sender and receiver, without ever transmitting a copy of the codebook or any codebook data.
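A minimal sketch of LZW (my own, not from lecture) showing how the sender and receiver grow identical codebooks, one step apart, without ever transmitting one:

```python
def lzw_compress(text):
    """Sender: extend the current phrase while it's in the codebook; on a miss,
    emit the phrase's code and register the longer phrase. Starts from a
    256-entry single-byte codebook."""
    codebook = {chr(i): i for i in range(256)}
    current = ""
    out = []
    for ch in text:
        if current + ch in codebook:
            current += ch
        else:
            out.append(codebook[current])
            codebook[current + ch] = len(codebook)
            current = ch
    if current:
        out.append(codebook[current])
    return out

def lzw_decompress(codes):
    """Receiver: rebuilds the identical codebook one step behind the sender."""
    codebook = {i: chr(i) for i in range(256)}
    prev = codebook[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in codebook:
            entry = codebook[code]
        else:
            entry = prev + prev[0]  # the one tricky case: code defined this step
        out.append(entry)
        codebook[len(codebook)] = prev + entry[0]
        prev = entry
    return "".join(out)
```

On a repetitive string like "TOBEORNOTTOBEORTOBEORNOT", the output code list is shorter than the input because repeated phrases collapse to single codes.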
Network flows deal with directed graphs $G = (V, E)$ that have two special nodes: a source $s$ and a sink $t$. These graphs have edge capacities $c(e)$ for $e \in E$. These networks can model railways, pipelines, and a lot of other applications dealing with how much can travel from one station to another.
For simplicity, we will assume that there are no edges leading into the source, no edges leaving the sink, and every node can reach the sink.
A flow is defined with a function $f(e)$ outputting how much to send on a given edge. This function has two properties:

- Capacity: $0 \le f(e) \le c(e)$ for every edge $e$.
- Conservation: for every node $v$ other than $s$ and $t$, the flow in equals the flow out: $\sum_{e~\text{into}~v}{f(e)} = \sum_{e~\text{out of}~v}{f(e)}$.
The value of a flow is defined $\text{val}(f) = \sum_{e~\text{out of s}}{f(e)}$, the sum of all edges out of the source $s$. This value must also be the sum of flow entering the sink $t$.
An $S-T$ cut is a partitioning of the set of vertices $V$ into sets $A$ and $B$, with the source in one set and the sink in the other: $s \in A, t \in B$.
The capacity of a cut is the sum of capacities for all edges going from $A$ to $B$.
Conservation of Flow: If $f$ is any $s-t$ flow and $(A, B)$ is any $s-t$ cut, then the net flow across $(A, B)$ is equal to the amount leaving the source $s$.
\[\text{val}(f) = \sum_{edges~A\to B}{f(e)} - \sum_{edges~B\to A}{f(e)} = \text{net flow}\]
The value of a flow is at most the capacity of the cut. For any $s-t$ flow $f$ and any $s-t$ cut $(A, B)$, $\text{val}(f) \leq \text{cap}(A, B)$.
\[\text{val}(f) = \sum_{e~\text{out of}~A}{f(e)} - \sum_{e~\text{into}~A}{f(e)} \\\leq \sum_{e~\text{out of}~A}{f(e)}~~~~~\text{by nonnegativity of flows, we can drop the flow into } A \\\leq \sum_{e~\text{out of}~A}{c(e)} = \text{cap}(A, B)~~~~~\text{since the flow on each edge is at most its capacity}\]
Weak Duality Theorem: If $f$ is a flow, $(A, B)$ is a cut, and $\text{val}(f) = \text{cap}(A, B)$, then $f$ is a maximum flow and $(A, B)$ is a minimum cut.
Weak Maxflow-Mincut Duality: The value of the maximum flow $\leq$ the capacity of the minimum cut.
An augmenting path is an $s-t$ path where the flow is less than the capacity on every edge in the path: $f(e) < c(e)$. On each of these edges, we can increase the flow and still maintain flow conservation.
This is the idea of a residual graph, in which every edge $(u, v)$ becomes two edges: a forward edge with residual capacity $c(e) - f(e)$, and a backward edge $(v, u)$ with capacity $f(e)$.
By thinking about residual edges, we can reason about reversing or reducing flow.
If we want to increase the flow in a residual graph, we can do the following steps:

- Find a path from $s$ to $t$ with positive residual capacity on every edge.
- Find the bottleneck: the minimum residual capacity along that path.
- Increase the flow by the bottleneck amount along each forward edge of the path, decreasing it along each backward edge.
The Ford-Fulkerson algorithm is a greedy algorithm for finding maximum flows through a network. It works by greedily looking for augmenting paths in the residual graph. (Remember that an augmenting path is a path from the source to the sink where there is potential to add flow to all edges in that path).
It begins with all 0 flows, $f(e) =0~\forall e \in E$.
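The loop can be sketched as follows (my own implementation; the adjacency scan in the DFS is simple rather than fast, and capacities are assumed to be integers):

```python
def max_flow(capacity, s, t):
    """Ford-Fulkerson: capacity is a dict {(u, v): c}. Repeatedly find an
    augmenting path in the residual graph (via DFS), push the bottleneck
    amount along it, and repeat until no augmenting path remains."""
    residual = dict(capacity)
    for (u, v) in list(capacity):
        residual.setdefault((v, u), 0)  # reverse edges start at 0 capacity

    def find_path():
        # DFS for an s-t path using only edges with positive residual capacity
        stack, parent = [s], {s: None}
        while stack:
            u = stack.pop()
            if u == t:
                path = []
                while parent[u] is not None:
                    path.append((parent[u], u))
                    u = parent[u]
                return path[::-1]
            for (a, b), cap in residual.items():
                if a == u and b not in parent and cap > 0:
                    parent[b] = u
                    stack.append(b)
        return None

    total = 0
    while (path := find_path()) is not None:
        bottleneck = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= bottleneck  # use up forward capacity
            residual[(v, u)] += bottleneck  # allow the flow to be undone later
        total += bottleneck
    return total
```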
Theorem: the following statements are equivalent for all flows $f$.
- There exists a cut $(A, B)$ such that $\text{val}(f) = \text{cap}(A, B)$: there is a cut where the flow across the cut is equal to the total capacity across the cut.
- Flow $f$ is a maximum flow.
- There is no augmenting path in $G_f$.
We can show 1 implies 2 by the weak duality theorem: if $f$ is a flow, $(A, B)$ is a cut, and $\text{val}(f) = \text{cap}(A, B)$, then $f$ is a maximum flow and $(A, B)$ is a minimum cut.
We can show 2 implies 3 by the contrapositive: $\neg 3 \implies \neg 2$: If there is an augmenting path in $G_f$, then by augmenting along that path we can increase the flow $f$.
We can show 3 implies 1 by construction: let $A$ be the set of nodes reachable from $s$ in $G_f$. Since there is no augmenting path, $t \notin A$, so $(A, B)$ is a valid cut. Every edge out of $A$ must be saturated and every edge into $A$ must carry zero flow – otherwise its endpoint would be reachable in $G_f$ – so $\text{val}(f) = \text{cap}(A, B)$.
If our graph has integer capacities, each augmentation will increase the flow by at least 1. The number of augmentations then is at most $\text{val}(f^*)$, the value of the maximum flow in the graph. Each augmentation takes $O(m)$ time, since it runs DFS and then takes linear time to update the flows. DFS is $O(m+n)$ time, but we are almost always working on a connected graph with at least as many edges as vertices, so $n \le m$ and thus it’s $O(m)$ time to run DFS. The total runtime is therefore $O(m\cdot\text{val}(f^*))$.
We notice that so far, we have only considered Ford-Fulkerson on integer capacities, which lend themselves to an integer maximum flow. We also observe that the runtime depends on the value of the maximum flow, which is suboptimal. Furthermore, the number of augmentations can be unbounded if capacities are irrational, and very large in some special integer cases.
There are a few strategies we can take in choosing our augmenting paths that will allow us to terminate sooner and when dealing with both special and non-integer capacities.
For the purposes of this course we can assume that we can solve maximum flow in time $O(mn)$. This is an area of active research!
Maximum Bottleneck Path | Shortest Path |
---|---|
Finds the path maximizing the minimum edge capacity, $\text{min}(c(e))$ | Finds the path with the fewest hops from the start to the end node |
Can be computed in $O(m\log m)$ with a variant of Dijkstra’s algorithm | Can be computed in $O(m)$ using BFS |
 | Arbitrary Path | Max Bottleneck Path |
---|---|---|
Value of maxflow | $v^*$ | $v^*$ |
Value of augmenting path | $\ge1$ | $\ge\frac{v^*}{m}$ |
Flow remaining after one path | $\le v^*-1$ | $\le v^* - \frac{v^*}{m}$ |
# of augmenting paths | $\le v^*$ | $\le m\log v^*$ |
When taking the max bottleneck path, the number of augmenting paths can be bounded as follows: if it takes us $k+1$ augmenting paths, then after $k$ paths the remaining flow is at least $1$. Each path removes at least a $\frac{1}{m}$ fraction of the remaining flow, so \(1 \le (1-\frac{1}{m})^k v^* \le e^{-\frac{k}{m}}v^*\), which gives $k \le m\ln(v^*)$. So the number of augmenting paths is $O(m\log(v^*))$, and the running time of each augmentation is $O(m\log m)$. The total runtime of using the max bottleneck path is $O(m^2\log(m)\log(v^*))$, which is an improvement over being linear in $v^*$. This is still going to be slower on large graphs with low maximum flows though, since it has an $m^2$ term in there.
We can find the shortest augmenting path (the one with the fewest hops) in $O(m)$ time using BFS (recall from above that $n \le m$, so $O(n+m)=O(m)$). The total number of augmenting paths is $O(mn)$. On arbitrary non-integer capacities, this heuristic gives a runtime of $O(m^2n)$, which is pretty nice, since it’s not bounded by the value of the maximum flow.
A reduction is a new algorithm that solves a new problem using function calls to an old algorithm that solves an old problem. The reduction should not make any assumptions about the implementation details of how the old algorithm solves the old problem.
New Input: $(G, S, T)$ = old input for Ford Fulkerson
Old Output (from Ford-Fulkerson): $f^*~ \text{for}~ G$
New Output: Compute residual graph $G_{f^*}$, compute $A$ as the set of nodes reachable from $S$ in $G_{f^*}$.
Input: a bipartite graph $G=(V, E)$ with $V=\text{left}~\cup~\text{right}$.
Output: a maximum cardinality matching $M \subseteq E$, where $M$ is a set of edges where no endpoint is connected to more than one edge in $M$. The cardinality is the number of edges in $M$. This problem has a variety of use cases.
Input for Bipartite Matching | Input for Integer Maxflow | Adjustment |
---|---|---|
Undirected | Directed | We can think of the matching as sending tokens from nodes on the left hand side to nodes on the right hand side. So we direct every existing edge from the left hand side to the right hand side. |
Unweighted | Integer edge capacities | All edges are equally important for matching, so we can assign a capacity of 1 to each edge |
No source/sink | Has a source/sink | We can create a source node that we connect to every node on the left side, and a sink node which is connected to all nodes on the right hand side. |
The output of integer maxflow is a flow, where every edge is assigned either a 0 or a 1. We can simply use the set of edges that have a flow of 1 as our matching, after we remove our specially added source and sink nodes.
We know this is a correct matching since for every node in the matching, on the left there is only one edge entering it (from our special source node). By flow conservation, we know that each must connect to only 1 other node. On the right, each node only has one output, and so can only be connected to by 1 node on the left.
Transforming the input for bipartite matching into the input for integer maxflow takes time $O(n+m)$ to add $m$ new edges and a constant 2 new nodes. Finding the max flow in this new graph takes $O((n+2)(n+m))=O(nm)$ since $n^2 = O(nm)$. We can transform the integer maxflow output to our desired output in time $O(n+m)$.
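Because every capacity is 1, the maxflow search specializes to a very short matching algorithm (Kuhn's augmenting-path method, which is exactly what Ford-Fulkerson does on this unit-capacity network). This sketch is mine, not from lecture; `edges` maps each left node to its neighbors on the right:

```python
def max_bipartite_matching(left, edges):
    """Each call to try_place is one augmenting-path search: place left node u,
    possibly evicting and re-placing previously matched left nodes."""
    match_right = {}  # right node -> left node currently matched to it

    def try_place(u, visited):
        for v in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # take v if it's free, or if its current partner can move elsewhere
            if v not in match_right or try_place(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    return sum(try_place(u, set()) for u in left)
```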
Given: a directed graph $G$ with source $S$ and sink $T$
Output: a largest set of edge-disjoint $S-T$ paths. An edge-disjoint set of paths is a set of paths from $S$ to $T$ that share no edges in common. If there are $k$ such paths, the network can survive the failure of any $k-1$ edges and still connect $S$ to $T$, since each failure can destroy at most one path.
Transforming the input for EDP to Integer Maxflow is fairly straightforward; we just need to assign an edge weight of 1 for every edge in $G$.
Transforming the output of Integer Maxflow to the output for EDP is a bit more involved.
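One way that output transformation can go (my own sketch, assuming the simplifications from above, in particular no edges into $S$): walk from $S$ along edges carrying flow, deleting each edge as it is used, and each walk peels off one edge-disjoint path. Flow conservation guarantees a walk can only get stuck back at $S$, never at an internal node.

```python
def extract_disjoint_paths(flow_edges, s, t):
    """flow_edges: the set of edges carrying flow 1 in the maxflow solution.
    Repeatedly walk s -> t along flowing edges, consuming each edge once."""
    remaining = set(flow_edges)
    paths = []
    while True:
        path, node = [], s
        while node != t:
            edge = next(((u, v) for (u, v) in remaining if u == node), None)
            if edge is None:
                return paths  # no flow left leaving s: all paths extracted
            remaining.remove(edge)
            path.append(edge)
            node = edge[1]
        paths.append(path)
```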
The task is to separate an image, such as into a foreground and a background, given some prior knowledge about the pixels. Specifically:
Inputs:
Output: a partitioning of $V$ into sets $(A, B)$ that maximizes a score function:
\[q(A, B)=\sum_{i\in A}{a_i} + \sum_{j\in B}{b_j} - \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}\]

To reduce this maximization function to the Mincut problem, there are a few steps. Our reduction must create a graph such that the capacity of the minimum cut corresponds to the function we are trying to minimize.
Segmentation wants to maximize a function, while Mincut wants to minimize a function. We can negate our function to maximize and then minimize it.
\[-q(A, B)=-\sum_{i\in A}{a_i} - \sum_{j\in B}{b_j} + \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}\]

Adding the constant $\sum_{v\in V}{(a_v + b_v)}$ does not change which partition is best, so we can equivalently minimize:

\[q'(A, B)=\sum_{i\in B}{a_i} + \sum_{j\in A}{b_j} + \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}\]

Segmentation now has a minimizing function with terms for nodes in $A$ and nodes in $B$, the $\sum_{i\in B}{a_i} + \sum_{j\in A}{b_j}$ terms, while Mincut only has a term $\sum_{(i,j)\in E}{p_{ij}}$. We can add a source $S$ with an edge to every pixel node and a sink $T$ with an edge from every pixel node. Since $S\in A$ and $T\in B$, we set the capacity of each edge $S\to v$ to $a_v$ and the capacity of each edge $v\to T$ to $b_v$.

This means that any cut will cut the edges from $S$ to the nodes in $B$, which carry weight $a_v$, and the edges from any node $v\in A$ to $T$, which carry weight $b_v$. Looking back at our rewritten function, the capacity of a cut is exactly what we wanted to minimize.
The capacity of every cut is equal to $-(\text{quality of partitioning in image segmentation})$ plus a fixed constant, so the minimum cut corresponds to the partitioning of maximum quality.
The reduction itself takes $O(m)$ time: building the flow network and transforming the mincut back into a partitioning of pixels are both linear in the number of edges, so the mincut computation dominates the runtime.
“Hard” problems are problems that we do not know how to solve in polynomial time. We commonly use reductions to show the relationship between the hardness of problems. There are three primary problem classes: P, NP, and NP-Complete.
2-SAT is considered an “easy” problem, and has a solution that can be found in polynomial time. 2-SAT is part of complexity class P. We define variables and literals: variables are things like $X$ and $Y$, and literals are variables or their negations, like $X$ and $\sim Y$. A 2-Clause is comprised of two literals joined with an OR, like $X \vee \sim Z$.
Input: a bunch of 2-Clauses joined with AND operators $\wedge$
Output: an assignment of variables that makes the input’s formula true
This problem can be solved in polynomial time. Consider a 2-Clause $(X\vee Y)$: if $X$ is false, then $Y$ must be true, and if $Y$ is false, then $X$ must be true.
So $(X\vee Y)$ is equivalent to $(\sim X \implies Y)\wedge(\sim Y\implies X)$. We can use this fact to create an implication graph, where each literal is a node and each clause contributes two directed edges, $(\sim X, Y)$ and $(\sim Y, X)$, one per implication. A path from X to Y indicates that if X is true, Y must also be true.
We can check to make sure that this graph is valid. For each variable $X$, there are two possibilities to look for:

- A path from $X$ to $\sim X$, meaning setting $X$ true forces a contradiction.
- A path from $\sim X$ to $X$, meaning setting $X$ false forces a contradiction.
If both of these paths exist, then we have found a contradiction and thus an assignment cannot be made. The formula is unsatisfiable. Otherwise, the formula is satisfiable.
Algorithm: For every variable $X$, check for paths $(X, \sim X)$ and $(\sim X, X)$. Each check is a graph search taking $O(n+m)$ time, and we do $2n$ of them, so for $n$ variables this is solvable in polynomial time.
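The implication-graph check can be sketched as follows (my own sketch; it uses the reachability test above rather than the faster strongly-connected-components approach, and encodes literal $x_i$ as the integer `i` and $\sim x_i$ as `-i`):

```python
def two_sat(n, clauses):
    """clauses: pairs of nonzero ints; i means variable i, -i means its negation.
    Build the implication graph and reject iff, for some variable x, both x
    reaches -x and -x reaches x."""
    graph = {}
    for a, b in clauses:
        # each clause (a OR b) gives implications -a => b and -b => a
        graph.setdefault(-a, set()).add(b)
        graph.setdefault(-b, set()).add(a)

    def reaches(u, target):
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            if x == target:
                return True
            for y in graph.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return all(not (reaches(x, -x) and reaches(-x, x)) for x in range(1, n + 1))
```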
This is the same as 2-SAT, but instead each clause has 3 literals. It is unlikely that 3-SAT is solvable in polynomial time, since solving it seems to require trying all possible solutions. The search space has no ordering. If we were to take the same graph-based approach as solving 2-SAT, we could decompose an equation $(a \vee b \vee \sim c)$ into $\sim a \implies(b\vee\sim c)$, but we would not have a node in our graph for the latter part of the implication. We could introduce a new node but that brings additional complexities.
2-SAT $\in$ P
3-SAT not necessarily in P
2-SAT $\in$ NP
3-SAT $\in$ NP
P $\subseteq$ NP, since finding a solution is at least as hard as checking one.
Problems in NP that are not known to be in P: 3-SAT, Hamiltonian Cycles, Traveling Salesman, Circuit-SAT, subset-sum, etc.
NP-Complete problems are in class NP (solutions can be checked in polynomial time), and are at least as hard as any other problem in NP: every problem in NP is polynomial-time reducible to them.
Theorem: If problem A is polynomial time reducible to problem B, and A is NP Complete, and B is in NP, then B is also NP Complete.
Subset-Sum | Partitioning |
---|---|
A multiset of positive integers $M$ and a positive integer target $t$ | A multiset of positive integers $N$ |
Is there a subset of $M$ that sums to $t$? | Is there a partitioning of $N$ into two disjoint sets $N_1$ and $N_2$, $N_1 \cap N_2=\emptyset$, such that $\sum_{N_1}=\sum_{N_2}$? |
NP Complete | To prove |
We know that subset-sum is NP Complete, and we’d like to show that Partitioning also is. According to our theorem, we need to show that Subset-Sum is polynomial-time reducible to Partitioning, and then that Partitioning is in NP.
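The notes don't spell out the reduction itself; here is one standard Subset-Sum → Partitioning reduction as a sketch (my own), with a brute-force checker included only to sanity-check it on tiny inputs:

```python
from itertools import combinations

def reduce_subset_sum_to_partition(M, t):
    """With S = sum(M), append the two elements S + t and 2S - t. The new
    multiset sums to 4S, and the two added elements sum to 3S > 2S, so any
    balanced split (2S per side) must separate them; the side holding 2S - t
    then needs exactly t more from M. So the Partitioning instance is a yes
    iff some subset of M sums to t. (Assumes 0 <= t <= S; otherwise the
    Subset-Sum answer is trivially no.)"""
    S = sum(M)
    return M + [S + t, 2 * S - t]

def has_partition(N):
    """Brute-force Partitioning check, only for verifying the reduction on
    tiny instances -- the real problem is NP Complete."""
    total = sum(N)
    if total % 2 == 1:
        return False
    return any(sum(combo) == total // 2
               for r in range(len(N) + 1)
               for combo in combinations(N, r))
```

The reduction runs in linear time, and a yes/no answer to the Partitioning instance answers the original Subset-Sum question.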
In our first introduction to 3-SAT, we learned that 3-SAT is NP Complete.
Here, we introduce the Clique problem. A Clique is a set of nodes in an undirected graph, who are all neighbors with each other. The clique problem asks, “Does a clique of size $k$ exist?”
We will show that the Clique problem is NP Complete by reducing 3-SAT to it in polynomial time.