← Back to all posts

# Algorithms and Data Structures, Notes

December 07, 2020

These are my notes from CS 3000, Data Structures and Algorithms. I took this class in the fall of 2020.

See the outline on the left to jump to a relevant section. Here’s where the second half of the semester starts.

# Sorting Algorithms

Many problems are easier to solve when data is sorted.

## Selection Sort

1. Find the minimum value in the unsorted part of the array
2. Perform swaps with elements to the left until it is no longer smaller than the element to its left
3. Repeat with the next minimum value, $n-1$ times.

Selection sort takes $O(n^2)$ time.

## Insertion Sort

At step i, insert element i into the space $[0…i]$ in order to maintain the sortedness of $[0…i]$. Before the i’th element is moved, the array is split into the first section $[0…i-1]$ which contains elements in sorted order and the rest of the array, $[i…n]$ which is unsorted. At the end of $i = n+1$, the entire array is sorted. This takes $O(n^2)$ time as well, since iteration $j$ could take time proportional to $j$ time to insert it into the properly sorted section.

## Merge Sort

Recursively divide the array into two halves, sort the halves, and recombine them. We can show this to be correct by induction. At each level in the recursion tree, the work is divided by $2$, and a linear amount of work is done in recombining the two halves. This comes from the recursion $T(n) = T(n/2) + T(n/2) + Cn = 2T(n/2) + Cn$. There are a total of $\log_2(n)$ levels to this tree, so the total work is the work at each level * the number of levels. This is $Cn\log_2(n)$.

# Big-O Notation

## Rules

• Constant factors can be ignored: for any constant $c$, $O(c\cdot n) = O(n)$
• Smaller exponents are O(larger exponents): for $a < b$, $na = O(nb)$
• Any log is O(any polynomial): for any positive $a$, $b$, $(\log_2(n))a = O(nb)$
• Any polynomial is O(any exponential): for $a > 0$, $b > 1$, $na = O(bn)$
• Lower order terms can be dropped

## Other Notations

• Big $O$: $f(n) ≤ g(n)$
• Big omega $Ω$: $f(n) ≥ g(n)$
• Theta $Θ$: $f(n) = g(n)$ also implies Big O, Big Omega
• Little $o$: $f(n) < g(n)$
• Little omega $\omega$: $f(n) > g(n)$

## Limit Test

Limit as n approaches infinity of f(n) / g(n):

• $0$ ⇒ $f(n) = o(g(n))$ ⇒ $f(n) = O(g(n))$
• Constant > $0$ ⇒ $f(n) = θ(g(n))$ ⇒ $f(n) = Ω(g(n))$ and $f(n) = O(g(n))$
• Infinity ⇒ $f(n) = \omega(g(n))$ ⇒ $f(n) = Ω(g(n))$

For this class, using this limit test is an acceptable way to prove runtime.

## Master Theorem

The Master Theorem is a general way to determine the runtime of a divide-and-conquer algorithm that splits the problem into a pieces of size $\frac{n}{b}$, and then merges solutions in time $O(n^d)$. The recurrence of such an algorithm would be $T(n) = a\cdot T(\frac{n}{b}) + C\cdot n^d$, with base case of $T(1) = 1$.

• Case $\frac{a}{b\cdot d} > 1$: The last term dominates the runtime, since this forms a geometrically increasing sequence. The last term is then $n^d\cdot (\frac{a}{b\cdot d})^{\log_b{n}}$, so $T(n) = \theta(n^{\log_b{a}})$.
• Case $\frac{a}{b\cdot d} = 1$ : The work is constant at each layer as layers increase, so the total work is proportional to the number of layers. The number of layers is $\log_b(n)$, so $T(n) = \theta(n^d \log_b{n})$.
• Case $\frac{a}{b\cdot d} < 1$: The work done on each layer is geometrically decreasing, and convergent to a constant $\theta(1)$. The total work is proportional to the work spent recombining solutions, so $T(n) = \theta(n^d)$.

Note: we can use a still more general form of the master theorem, where we replace $n^d$ with a function $f(n)$. This does not change much.

# Divide and Conquer Examples

## Linear Time Selection

1. Taking a random element of the array
2. Partitioning the array into two parts: elements smaller than this element, and elements larger
3. If the index of the random element is the $k$ (the $k$’th largest element), then return the element
4. Otherwise, continue the search in the appropriate partition of the array

Choosing an element and partitioning takes $θ(n)$ time. When the recursive call is made, the size of the problem set is reduced. On the average case,

## Closest Pair: 1D

Problem: Given n points on a line, find the pair of points with the smallest distance from each other.

Sorting Solution: We could solve this by sorting all the points, then walking through them in order to find the pair with the smallest distance between them. This takes $O(n\log n)$ time since it is bounded by the time of sorting.

Non-Sorting Solution: We can split this problem into two sections of equal size, by first finding the median element using linear-time selection. Once the median is selected, one subproblem will be all elements less than the median and including the median, and the other subproblem will include all elements greater than the median and including the median. We recurse over both halves.

The recurrence is $T(n) = 2T(n/2) + 2n = Θ(n\log(n))$, which achieves the same runtime as the sorting solution.

## Closest Pair: 2D

Problem: Given n points in 2D space, find the pair of points with the smallest Euclidean distance.

Once again we can divide our problem into two subproblems by selecting the median value for X-coordinate among all points. This splits our data into two halves. We can recurse over the two halves.

In regard to pairs of points that may cross over the median line, we know that the greatest distance between any pair with points on both sides of the median must be closer than the closest pair in each half. If the closest distance between any pair found on each half is ƍ, then the only points we need to consider are the points within ƍ distance of the median X line, and each point only needs to be compared against points within ƍ of itself.

Using some math we can show that there are actually a constant number of points to be compared against, so only a linear number of comparisons are made. If we sort by y coordinate, we can achieve this in $O(n \log(n))$ time. The total recurrence would be $T(n) = 2T(n/2) + Θ(n\log(n))$ which works out to be $θ((n \log(n))^2)$. We can fix this if we embed merge sort in every layer of our recursive call, to also return the sorted half – this would enable merging in linear time, removing the need to sort after getting the points back. This cuts down the runtime to $Θ(n\log(n))$.

# Dynamic Programming

Memoization Bottom Up
Keeping a global array to remember previously computed solutions. This allows us to calculate a recurrence in linear time. Computing a recurrence iteratively, so by the time a previously computed solution is needed, it has already been calculated.
Weighted Interval Scheduling Segmented Least Squares
Outputs a subset of the input Outputs a partitioning of the input
“Does the solution include the last item?” “What is the last segment of the solution?”

## Weighted Interval Scheduling

Input: Given $n$ intervals, (start, end), with a value $v_i$, that are all sorted by finish time in ascending order.

Output: A list of compatible intervals that maximizes the sum of values of the intervals. Schedules are compatible if no interval begins before another ends.

Naive approaches, such as choosing the most valuable intervals first or choosing the earliest intervals, fail easily.

Let us begin with a probing question: Does the optimal schedule contain the last interval?

Let us continue with an insightful answer: The optimal schedule either contains, or does not contain, the last interval.

• The optimal solution contains the last interval ⇒ The optimal solution therefore consists of the last interval, in addition to the optimal solution for all intervals coming before the last interval.
• The optimal solution does not contain the last interval ⇒ The optimal solution therefore consists of the optimal solution for all intervals coming before this last interval.

## Segmented Least Squares

Given: $n$ data points in 2D space, as well as a cost parameter $c$.

Output: A partitioning of the data points into segments and lines, optimizing the total cost which is the sum of the error of the line plus a constant cost penalty for every new segment.

We begin with a probing question: What is the final segment?

We continue with an insightful answer: The final segment is the segment that minimizes the total cost of itself and the rest of the segments.

• The final segment is the set of points $P_i$ until $P_n$. The optimal solution, therefore, is to have the final segment as ${P_i…P_n}$ + optimal segments for ${P_0…P_{i-1}}$.

## Knapsack

Given: A knapsack with maximum weight $T$, and $n$ items, each with a value $v_i$ and a weight $w_i$.

Output: A subset of items which has a weight less than $T$, and which maximizes the total value.

Special cases: Subset sum, when values = weights, or tug-of-war, when $T= \frac{1}{2}$ of the maximum weight

We begin with a probing question: Does the optimal knapsack include item $n$?

We continue with an insightful answer: The optimal knapsack either includes the item, or it does not.

• If the optimal knapsack includes the item, then the optimal solution consists of that item and the optimal solution of the knapsack problem with a knapsack with that much less weight, and that item removed.
• If the optimal knapsack does not include the item, then the optimal solution consists of the optimal solution to the knapsack problem with the same knapsack with that item removed.

The recurrence to the knapsack problem is interesting because it involves two variables (maximum weight of the knapsack, and the subset of the items).

## Video Segmentation

Given: A set of frames of a video, and access to a black box that can determine the Quality of a set of frames.

Output: A set of contiguous blocks of frames that maximizes the total quality.

We begin with a probing question: What is the final segment?

We continue with an insightful answer: The final segment is any number of frames ${i, i+1, … n}$ that maximizes the sum of the quality of that segment and the optimal segmentation of all frames that come before it.

$\text{OPT}(j)$ = the value of the optimal segmentation from ${1…j}$.

$\text{OPT}(j)$ = $\text{argmax}(1 ≤ i ≤ j){\text{OPT}(i-1) + \text{Quality}(i, j)}$

$\text{OPT}(0) = 0$

## Edit Distance

Given: Two strings, $X$ and $Y$ of lengths $n$ and $m$ respectively.

Output: The edit distance between $X$ and $Y$, defined as the number of insertions, deletions, and swaps needed to turn one string into another. This can be visualized as aligning the two strings, and adding the number of positions where the two strings disagree.

We begin with a probing question: What goes in the final column of the optimal alignment?

We continue with an insightful answer: The final column of the optimal alignment could be X and nothing from Y, Y and nothing from X, or both X and Y.

• If the final column of the optimal alignment is X and nothing from Y:
• The optimal alignment is exactly that appended to the optimal alignment of ${X_1, X_2, … X_{n-1}}$ and ${Y_1, Y_2, … Y_m}$.
• The edit distance is 1 + the edit distance of the optimal alignment of the rest
• If the final column of the optimal alignment is Y and nothing from X:
• Then the optimal alignment is exactly that appended to the optimal alignment of ${Y_1, Y_2, … Y_{m-1}}$ and ${X1, X_2, … X_n}$.
• The edit distance is 1 + the edit distance of the optimal alignment of the rest
• If the final column of the optimal alignment is both X and Y:
• The optimal alignment is exactly that appended to the optimal alignment of ${X_1, X_2, … X_{n-1}}$ and ${Y_1, Y_2, … Y_{m-1}}$.
• If the last characters match, the edit distance is just the edit distance of the optimal alignment of the rest.
• If the last characters don’t match, then the edit distance is 1 + the edit distance of the optimal alignment of the rest.

Our recurrence therefore is to take the minimum of these three cases.

## Longest Common Substrings

Given: Two strings, X and Y, of lengths n and m respectively.

Output: The longest common substring of X and Y

We begin with a probing question: Is the last symbol of X and Y part of the substring?

We continue with an insightful answer: If the last symbol of X and Y match, it is part of the substring, but otherwise, either the last symbol from X is not part of the substring, or the last symbol from Y is not part of the substring.

• If the last symbol of X and Y match, then we always include that symbol in the substring.
• If the last symbol of X and Y do not match:
• If the last symbol of X is not in the substring, then the longest common substring is just that of symbols {X1, X2, … Xn-1} and {Y1, Y2, … Ym}
• If the last symbol of Y is not in the substring, then the longest common substring is just that of symbols {X1, X2, … Xn} and {Y1, Y2, … Ym-1}
• Note that there is no case where neither symbols are in the substring, since that would already be matched by one of the above cases.

Our recurrence takes the maximum of these two cases, and in the second case, takes the maximum of the two subcases of symbols not matching.

Our bottom-up solution here takes time O(mn), since the table we need to fill in is of size mn.

## Longest Increasing Subsequence

Given: A sequence X1, X2, …, Xn of anything comparable

Output: the longest subsequence that is increasing. A sequence is a set of strictly increasing indices of elements such that the elements at those indices are strictly increasing.

We begin with a probing question: Is the last element in the longest increasing subsequence?

We continue with an insightful answer: The last element either is, or is not.

• If the last element is in the longest increasing subsequence, then the longest increasing subsequence is the longest increasing subsequence of X1, X2, …, Xn-1 that are less than Xn, in addition to Xn.
• If the last element is not in the longest increasing subsequence, then the longest increasing subsequence is the longest increasing subsequence of X1, X2, …, Xn-1.

Remember that if we take the last element to be part of the longest increasing subsequence, we must still fulfill the other requirement that the longest increasing subsequence is strictly increasing. We can filter the subproblem by the values of X that are less than Xn.

# Graphs

## Definitions

• Simple Graphs: no duplicate edges, no self-loops.
• Maximum edges in a simple directed graph with n vertices: $2nC^2 = n (n-1)$
• Maximum edges in a simple undirected graph with n vertices: $nC^2 = n (n-1) / 2$ because now you can have an edge between any pairing of two vertices.

• Path: a sequence of consecutive edges. The length of a path is its number of edges.
• An undirected graph is connected if for every two vertices in the graph, there is a path between them.
• A directed graph is strongly connected if for every two vertices $u$ and $v$ in the graph, there is a path from u to v and from v to u.

• In an undirected graph, the degree of a vertex is the number of edges connected to it.
• In any undirected graph with m vertices and m edges, the sum of vertex degrees is $2m$. If you add up the degrees of all vertices, each edge will be counted twice, so naturally the vertex degree is $2m$.

• A tree is a simple undirected graph that is connected and contains no cycles.
• Any two of these implies the third. A tree satisfies all of these.
1. A graph is connected
2. A graph contains no cycles
3. A graph has exactly n - 1 edges
• A rooted tree is created by choosing a root node and orienting edges away from it. This models a hierarchical structure.

## Ways to Represent Graphs

Adjacency Matrices use a matrix form where space (i, j) indicates if the edge (i, j) is in the set of edges in the graph. This uses O(V2) space, allows lookups in Θ(1) time, and neighbor lookup in Θ(v) time.

Adjacency Lists use a two-dimensional linked list structure where the adjacency list associated with a vertex is the set of all vertices that are directly connected to that vertex. This uses space Θ(m+n).

Depth-first search starts on a vertex and examines all nodes who share an edge with it. It follows the first edge to the vertex, and makes a recursive call there. Nodes are marked as discovered once visited, and are visited only once.

Using depth-first search, edges can be categorized into four categories:

• Tree Edges: Edges that discover new nodes
• Forward Edges: Edges from a node discovered first to a node discovered later, but not a tree edge. (ancestor ⇒ descendant)
• Backwards Edges: Edges from a node discovered later to a node discovered earlier. The existence of backwards edges implies the existence of a directed cycle. (descendant ⇒ ancestor)
• Cross Edges: No ancestral relation implied by these edges.

Keeping track of the discovery and finish times of each node can yield us insights into the type of graph we are looking at and the properties of the edges. For any directed edge (u ⇒ v), d[u] = discovery time of u, and f[u] = finish time of u:

• d[u] < d[v] < f[v] < f[u] ⇒ TREE/FORWARD EDGE u is an ancestor of v. Ancestor node u was discovered before v, and finished after v, implying that v was discovered by a path from u.
• d[v] < d[u] < f[u] < f[v] ⇒ BACK EDGE v is an ancestor of u. Ancestor node v was discovered before u, and finished after u, implying that there is a path (v, u). But, we know that the edge from u to v also exists: this forms a back edge, creating a cycle.
• d[v] < f[v] < d[u] < f[u] ⇒ CROSS EDGE v and u have no ancestral connection. Node v was discovered and finished before node u was discovered and finished. When exploring v there was no connection to u, and so these have no ancestral relation but may share a common parent.
• d[u] < f[u] < d[v] < f[v] ⇒ IMPOSSIBLE If there was an edge from u to v, we would have discovered it when exploring u. But we didn’t discover it, since v was discovered after u. This is a contradiction, and cannot happen.

Using adjacency lists, DFS runs in O(n+m) time and space.

### Directed Acyclic Graphs

Directed Acyclic Graphs are directed graphs with no directed cycles. They give rise to topological orderings, which are a labeling of the nodes in the graph such that all edges go from a node with lesser index to a node with greater index. Iff a graph has a topological ordering, then it is a Directed Acyclic Graph (DAG).

Observe: in a DAG, there is always a node with no incoming edges – an in-degree of 0.

In any traversal of a DAG, you will never find a back edge. A back edge would imply the existence of a directed cycle, which we know a DAG will never have.

### Topological Ordering: Removing In-Degree 0 Vertices

In order to produce a list of nodes in topological order, we could use the fact that there every DAG has a node with an in-degree of 0. We could repeatedly find this node with no incoming edges, and push that onto a queue such that that node would be the first in the ordering. This would take O(n+m) time.

### Topological Ordering: DFS Finish Times

Another approach would be to utilize the discovery and finish times of a DFS traversal of the graph. We claim that ordering the vertices by decreasing finish time creates a topological ordering.

Assume for the sake of contradiction that vertices are in order of decreasing finish time, but this does not create a topological ordering. This means that there are two nodes u and v such that f[v] > f[u], but there is an edge from u to v. This fits in the category of there existing a back edge – which is a contradiction, since the presence of a back edge implies a cycle existing and a DAG cannot have any cycles.

### Connected Components

Problem: given an undirected graph, split it into connected components – a labeling of vertices, by their connected component.

Algorithm:

• Pick a node v, use DFS to find all nodes reachable from v
• Label all those nodes as the same label
• Repeat with a yet undiscovered vertex.

The total runtime is Θ(# of vertices + # of edges), which is linear since every vertex and edge is only looked at once.

### Strongly Connected Components

Given a directed graph, split it into strongly connected components.

By definition, the strongly connected components of a directed graph consists of all nodes v that are both reachable from and can reach the starting node s. Observe how if we do DFS(s), we get the set of all nodes reachable from s, and if we do DFS(s) on the graph with all directed edges reversed, we get the set of all nodes that can reach s. If we then take the intersection of these two sets, we get the vertices that comprise the strongly connected component that has s in it.

This naive approach runs in O(n * (n+m)), where at each of n vertices you must run DFS twice, which each time takes O(n+m).

But also, observe that if you started the DFS from a “sink” component, a component with no outward connections, then you would be able to recover that component without visiting all the nodes in the graph. If we do the DFS’s in the right order, we can actually find the strongly connected components in linear time.

Repeatedly, until all nodes are marked as belonging to a strongly connected component:

• Find a node in a sink component of the graph
• Run DFS(u) to find the SCC of that node
• Mark the nodes in the SCC to not visit them twice

The sink component is the node with the largest finish time in a DFS traversal of the reversed graph.

In a breadth-first search, we first explore all the immediate neighbors of the starting node, before exploring the immediate neighbors of each of those neighbors. This can be done on both directed and undirected graphs.

In the tree produced by depth-first search, notice that edges only occur within a level, or between consecutive levels. If there was an edge skipping a level, then that would put that node on the previous level.

Breadth-first search’s easy application is to find the distances from one node to every other node, where distance is defined as the number of edges in the shortest path between the two nodes.

BFS runs in $O(mn)$ time.

### Bipartite Graphs & 2-Coloring

A graph is bipartite if it can be split into two halves, where all vertices go between the two halves and no vertices go between nodes inside the same half.

A related and useful problem is that of 2-coloring, which asks if nodes in a graph can be assigned a coloring such that no two nodes of the same color share an edge. We can solve the 2-coloring problem by running a breadth-first search and assigning alternating colors to alternating levels of the BFS tree. If we ever encounter an edge between two nodes in the same level, then this graph cannot be 2-colored.

We can see that a successful 2-coloring of a graph implies that the graph is bipartite, and that if we fail at 2-coloring, then the graph is not bipartite. If BFS 2-coloring fails, this means the graph contains a cycle of odd length. It is impossible to color a cycle of odd length in two colors and have no two same-colored vertices share an edge.

# Graph Optimization

Shortest paths in a graph are particularly useful in a weighted graph, where each edge has a different weight. The length of a path is the sum of edge weights on the path, while the distance from $s$ to $t$ is the length of the shortest path from $s$ to $t$.

Dijkstra’s algorithm only works on graphs with non-negative edge weights.

## Dijkstra’s

There are two key observations we will make, which will help us in understanding Dijkstra’s Algorithm. Both of these are in the situation where we have three nodes, $s, u, v$, and we know of some shortest path from $s$ to $u$, and the weight of the path from $u$ to $v$.

• If $(u, v)\in E$, then $d(s,v)\le d(s,u)+w(u,v)$ for every node in $v$. This tells us that the shortest path from $s$ to $v$ cannot be longer than the path made by taking an intermediate hop through another node $u$. This already known path is going from $s$ to $u$ to $v$. If the shortest path were longer than $d(s,u)+d(u,v)$, then the other path wouldn’t be a shortest path because we just found a shorter path.
• If $(u,v)\in E$ and $d(s,v)=d(s,u)+w(u,v)$, then there exists a shortest $s\to v$ path that ends in $u,v$. This is saying that if the distance (shortest path) from $s$ to $v$ is equal to the sum of the length of the shortest paths $(s,u)$ and $(u,v)$, then there is a shortest path from $s$ to $v$ that does indeed end in $(u,v)$.

Dijkstra’s algorithm:

1. Maintain an upper bound on distances for all nodes, where all of them start out at $\infty$ except the source which is $0$.
2. Explore the closest unexplored node: the node with the smallest upper bound on its distance
3. When a node is explored, look at its out neighbors and the weights on those edges to lower the upper bound on its distance and all of its out neighbors.

Before we explore the $k$-th node $v$, the previous distance to $v$ that was known when the last node was explored, is tight. We only decide to explore the $k$-th node when it is the next closest unexplored node. If there was another way to reach $v$ with a smaller distance than what we already know for $v$, we would have chosen to explore that node instead since it would be the next closest unexplored node. But we didn’t, which indicates that it is not possible to have a shorter path from $s$ to $v$ than what we’ve already found. We can show this more formally using strong induction (which I will not).

### Dijkstra’s Using Heaps

Heaps organize data as a complete binary tree, where each node has 2 children. If node $a$ is the parent of node $b$, then $\text{val}(a) \le \text{val}(b)$. Heaps have a few operations that are efficient:

• ExtractMin: Returns and removes the smallest value in the heap. This can be done by removing the very topmost node in the tree. But then we must combine the two new trees by selecting the next smallest node to be the root, and then swapping nodes to move the next smallest node down in the tree until it belongs again in a binary tree. In the worst case we will have to move this node all the way to the bottom of the tree, which is $\log{n}$ levels, so this ultimately takes $O(\log{n})$ time.
• DecreaseKey: Change the value of a key, the performs swaps upwards to restore the heap order. This is again at most $O(\log{n})$ swaps.
• Insert: Puts a key in the correct location, and then repairs the heap like before. $O(\log{n})$.
• Lookup: Searches the heap in $O(\log{n})$ time.

We can directly implement heaps an array, and use a dictionary to map keys to their location in the array based on their value. This allows us to get the value for a key in $O(1)$ time and perform swaps in $O(1)$ time. Using this approach, our actual runtimes are:

• Fixing heap: $O(1)$ to fix a single swap, with a maximum of having to swap $O(\log{n})$ times.
• Lookup: $O(1)$ since we can consult our dictionary.
• ExtractMin, DecreaseKey, Insert: $O(\log{n})$ time.

Using heaps, we can have Dijkstra’s algorithm run in time $O(m\log{n})$. Using fibonacci heaps and some other tricks we can get it to run even faster!

## Bellman-Ford

Negative edge weights come up in modeling currency exchange rates or chemical reactions that release energy. Dijkstra’s algorithm fails to produce a correct shortest path when our graph has negative edge weights, and simply adding a constant value to all edge weights does not fix this.

Bellman-Ford uses dynamic programming.

Let $\text{OPT}(v, j)$ be the length of the shortest path from $s$ to $v$ with at most $j$ hops $(0\le j \le n-1)$ where $n$ is the number of vertices. On a path with $n$ nodes and no cycles, you can only have at most $n-1$ edges. We count on never taking a cycle because:

• If the cycle has a negative sum, there is no solution.
• If the cycle has a non-negative sum, there is no point in taking the path at all.

We observe that on every shortest path from $s$ to $v$, there is a final hop on edge $(u,v)$. The optimal length of the shortest path from $s$ to $v$ on a path that makes $j$ hops is: $\text{OPT}(v,j)=\min(\text{OPT}(v,j-1), \min_{(u,v)\in E}{(\text{OPT}(u,j-1)+w(u,v))})\\ \text{OPT}(s,0)=0\\ \text{OPT}(v,0)=\infty\\$ There are two cases: either we don’t need the full $j$ hops, in which case the length is just$\text{OPT}(v,j-1)$, or we select the best last hop using $\min_{(u,v)\in E}{(\text{OPT}(u,j-1)+w(u,v))}$.

A bottom-up implementation of Bellman-Ford takes $O(nm)$ time, and has a space complexity of $O(n^2)$. There are a few optimizations we can do:

• Once we know $\text{OPT}(v,j)$, we don’t care about $\text{OPT}(v, j-1)$. So we don’t need to carry around a whole table, we can just keep an array containing the shortest path from $s$ to each node $v$. This will take $O(n)$ space now.
• There’s no need to check edges $(u,v)$ again unless $d[u]$ has changed. Doesn’t affect worst case runtime, but can speed it up.
• Stop early if no $d[v]$ has changed through a whole pass through $v$. If no distances are updated by exploring a new node.

### Negative Cycle Detection

If a graph has more than $n-1$ edges, it could contain a negative cycle. To find out, we can run Bellman-Ford for $n$ iterations and check if $\text{OPT}(v,n) < \text{OPT}(v, n-1)$ for some $v\in V$. If so, this means that by taking more edges (forming a cycle), you can get a smaller route length. This could not have been possible in a positive cycle, so this must be a negative cycle. If this condition never exists, then there are no negative cycles.

## Minimum Spanning Trees

Given a set of nodes $V$, a set of possible edges $E\subseteq V\times V$, and a weightfunction on edges $w(e)$, we want to build a network with the cheapest set of edges that connects all possible nodes in $V$. This problem comes up a lot in network design.

Input: A weighted graph $G=(V, E, {W_e})$ that is undirected, connected, and could have negative weights. Assume all edge weights are distinct.

Output: A spanning tree of minimum cost

• A spanning tree of $G$ is a subset $T\subseteq E$ of the edges such that $V, T$ forms a tree.
• A spanning tree, like all trees, must be connected and acyclic.
• The cost of a spanning tree is the sum over edge weights, $\sum_{e\in T}{W_e}$.

There are four primary MST algorithms: Prim’s, Kruskal’s, Reverse-Kruskal’s, and Boruvka’s.

### Cut Property

Define a cut as a subset of vertices, and the cutset as the edges which have one endpoint in the cut and one endpoint out of the cut. Edges between two vertices in the cut are not in the cutset.

Define a safe edge as the minimum weight edge cut by a cut $S$. The MST must contain this edge.

### Cycle Property

Define a cycle as a path where the first and last nodes are the same.

In any cycle $C$, there is a maximum edge weight edge, called $f$. Then, the MST does not contain edge $f$. $f$ is called the useless edge.

Fact: a cycle and a cutset must intersect in an even number of edges. Every time you enter the cut, you must leave it.

Assume we have a cycle $C$ in a graph, and the MST called $T^*$ contains the edge $f$, the maximum weight edge. Because the MST is a MST, if we removed this edge $f$ from it, we would break the tree into two connected components, call them CC1 and the rest. We know that $f$ is in the cutset of CC1, but we also know there must be another edge in the graph that is in the cutset, since a cycle and a cutset must intersect in an even number of places. Therefore, there must be an edge $e$ in the cutset and in $C$. We know that $w(e)<w(f)$ , so the cost of the tree minus $f$ and adding $e$ (essentially replacing $f$ with $e$) is less than the cost of the MST. $T^*-f+e < \text{cost}(T^*)$. This is a contradiction to $T^*$ being a MST, since we’ve found another tree that spans and has less eight. So therefore edge $f$ must not be in the MST.

So in summary:

• If $e$ is the minimum weight edge in a graph $G$, then $e$ is always in the MST. This is because $e$ connects nodes $u$ to $v$, and since $e$ is the smallest weight edge, it is the best way (smallest weight way) to connect $u$ to $v$.
• If $e$ is the maximum weight edge in a graph $G$, it could still be in the MST, if there’s no other way to get from $u$ to $v$.
Prim’s Algorithm Kruskal’s Algorithm
Iteratively builds the solution, one connected component at a time Maintains several connected components separately and joins them together
Faster on dense graphs Faster on sparse graphs
$O(m\log n)$ runtime $O(m\log n)$ runtime

## Prim’s Algorithm

Prim’s algorithm produces a MST. Intuitively, Prim’s algorithm gradually expands the search each time only adding safe edges of the cut containing all already handled nodes and ensuring that $T$ is connected. By using the cut property, Prim’s algorithm ensures that every time an edge is added to the MST, it connects to a node not already in the MST, and does so in the cheapest possible way.

• Let $T$ be the MST, and we begin with nothing in the MST, $T=\emptyset$.
• Let $s$ be the starting node, and the cut $C={s}$. $C$ is the cut with only element $s$.
• Repeatedly find the cheapest edge $(u, v)$ cut by $C$. Add this edge to $T$, and set $s$ to node $v$. We repeat this step until the cutset $C$ contains all nodes in $G$.

An implementation of Prim’s algorithm using heaps will run in $O(m\log{n})$.

## Kruskal’s Algorithm

Kruskal’s algorithm also produces an MST. Kruskal’s is greedy in always adding the next least valued edge that would reduce the number of connected components in the graph. To begin, each node is its own connected component, and there exist no edges.

We can do this efficiently by first sorting the edges in ascending order of weight, then adding edges in increasing weight until there is only one connected component in our MST. By induction, we know that every edge must be safe: the first edge is safe, and when we pick the next edge it must go between two strongly connected components, and is thus safe. The MST produced must also by acyclic because if we only add edges that reduce the number of strongly connected components: if a new edge would produce a cycle, it would not decrease the number of strongly connected components.

### Runtime

In order to be efficient, we need a data structure that can efficiently perform the operations we would like on connected components. This is the union-find data structure, which will let us group nodes into components with operations Find(u) which looks up which component contains u, and Union(u, v) which merges the connected components of u and v.

A naive union-find implementation would use a dictionary, which would enable Find(u) in $O(1)$ but would take $O(n)$ time to perform a Union(u, v) since it would need to update assignments of all elements in those groups.

If we used an array where element $i$ denoted the current component of vertex $i$, and a linked list for the items in each component, we could join components in $O(\text{length of smaller component})$, and $O(1)$ to update assignments in our array.

• After $k$ unions, only $O(k)$ items will have changed components
• The largest component has size $O(k)$, and at most $k+1$
• Every time an item changes components, the new component is at least twice the size of its old component, since only elements in the smaller of two components changes.
• No item changes components more than $O(\log{k})$ times

So the total runtime is $O(k\log{k})$ for union-find to perform $k$ unions.

Kruskal’s algorithm, using this implementation of union-find, takes time $O(m\log n)$ because the runtime is dominated by the speed of sorting all edges.

• $O(m\log n)$ to sort all edges. This is because $O(m\log m)$ becomes $m=O(n^2)$, so $O(m\log n^2)=O(m\log n)$ as exponents drop out of logs.
• $O(1)\times m=O(m)$ to test edges
• $O(n\log n)$ to add $n-1$ edges and merge them, using union-find.

# Amortized Analysis

In amortized analysis, we argue that the running time of the worst case of an algorithm is not as bad as it sounds because the number of times the worst case can happen is bounded by something. In other words, usually the worst case doesn’t happen. For example, consider a resizing vector, or an ArrayList implementation that must resize the underlying array and copy elements to an array twice as large. This worst case runtime is $O(n)$ for an insertion, but usually it’s not that bad.

• The amortized runtime is not the same as expected runtime: the amortized runtime averages over a single worst-case run, whereas the expected runtime averages across many different runs.
• The worst case for a single call can look worse than the total amortized cost, so it only makes sense to think of the amortized runtime over multiple calls.
• Different operations can be mixed together in analysis, since the bound of amortized runtime will apply to all of them.

## Aggregate Analysis

When computing the amortized runtime using aggregate analysis, we calculate the total amount of work done by a set of calls, then divide by the number of operations performed. We average over one worst-case run.

We don’t average over different input distributions, but over one single worst case execution. We want to show that in the worst case, it may not be possible to incur the largest cost on every iteration.

### Example: Multipop

Given a stack with $O(1)$ push and pop, add a function multipop() which calls pop() $k$ times, giving a runtime of $O(n)$ in the worst case. But note that multipop can only take as much time as the number of push() operations we perform first. The worst case operation must be balanced by a number of fast operations.

We define our single worst case run as pushing $N-1$ items onto the stack, and then calling multipop(). The total work done is $N \cdot O(1) + O(N)$, since each push() is $O(1)$. $O(N) / N = O(1)$, so multipop()’s amortized runtime is constant.

### Example: Incrementing a Binary Counter

Imagine incrementing a number, and thinking about it in terms of bit flips in the number’s binary representation. We are interested in counting the number of bits that we flip. In the worst case, we may flip all $k=n\log{n}$ bits in one step: 011...1 to 100...0. But we know that the les significant bits flip a lot more than the more significant bits. We can express the sum of flips as a sum:

$\sum_{i=0}^{k-1}{\frac{n}{2^i}} < n\cdot\sum_{i=0}^{\infty}{\frac{1}{2^i}}=2n=O(n)$

We can sum to infinity because we know that must be greater than summing to $k-1$, and this sequence converges. We see that $N$ increments utilize $O(n)$ flips, not $N\log N$. The amortized number of flips per increment is $O(1)$.

### Example: ArrayList

Every time we must resize the array, we double it. In $N$ insertions, we double $\log_2{N}$ times. The cost of the $j$’th doubling is $2^j$ (assuming we start at size 0), so the total work for $N$ insertions is:

$N + \sum_{j=0}^{\log{N}}{2^j}=N + (2^{1 + \log{N}}-1) < N + 2N = 3N$

The amortized cost of an add operation = $\frac{3N}{N} = 3 = O(1)$.

## Accounting Method

Imagine “prepaying” for future expensive operations with a cheap operation.

### Example: Multipop

We could assign a cost $2$ to push(), and $0$ for pop() and multipop(). Now we’ve overpayed for push(), and we break even if every pushed item is popped. Note that we can never underpay: the cost we’ve payed is always greater than or equal to the actual cost. In the worst case, the work is just counting the number of pushes = $O(2N)$. This makes analysis simpler when we only count pushes.

### Example: Incrementing a Binary Counter

Every time we flip a bit to $1$, we can prepay its being flipped back to 0. The credit remains positive, so in $N$ increments we make $N$ on-flips, so the total is $2N = O(N)$. This matches the aggregate analysis approach, but is much easier to reason about.

### Example: ArrayList

We can budget 3 operations that every add operation pays for:

1. Inserting the own element into the data structure
2. Moving this element once when the array is resized
3. Moving one old item to the new array when it is resized. In any state, $\frac{n}{2}$ elements are added before the resize and can pay for the existing $\frac{n}{2}$ elements which don’t have the payment to move to the new array.

After every resizing, no credit is left over.

## Potential Method

Define potential as a function of what operations can do on the data structure. It’s kind of like the potential energy of a data structure. It is related in some well-defined way to the work that operations can do on the data structure, and the operations that the data structure could support in the future.

Potential is a measure of a data structure’s state. Potential functions must always be non-negative, and should bound/limit the work done for the most expensive operations that would reduce the potential. This could be the number of items in the list/stack, for example. This lends a key insight:

Amortized cost = (true cost) + (change in potential)

Given $\Phi(s)$ is the potential of the data structure in state $s$, the amortized cost of operation $i$ is $C_i + \Phi(s_i) - \Phi(s_{i-1})$. The amortized cost is the true cost of operation, $C_i$, added to the difference in potential between states $s_i$ (after the operation) and $s_{i-1}$ (before the operation).

Actually, this is just a roundabout way to calculate the actual sums. The sum over all amortized costs is equal to the sum of all true costs plus the total change in potential, which is greater than or equal to the sum of all true costs. This guarantees that the outcome of the potential method is at least the true cost of operations, which is good for our upper bound.

$\sum_i{(C_i+\Phi(S_i) - \Phi(S_{i-1})) = (\sum_i{C_i}) + \Phi(S_n) - \Phi(S_0)} \geq \sum_i{C_i}$

### Example: Multipop

The amortized costs generated by examiningk the change in potential give us the same costs as we saw in the accounting method.

• push: (1 operation) + (increase stack size by 1) = 1 + 2 = 2
• pop: (1 operation) + (decrease stack size by 1) = 1 - 1 = 0
• multipop: (k operations) + (decrease stack size by k) = k - k = 0

### Example: Incrementing a Binary Counter

The number of $1$’s in the counter is the potential: it starts at 0, and is non-negative.

• At increment $i$, we reset $t_i$ bits to $0$ and flip at most $1$ bit to $1$. This change in potential is $1-t_i$.
• The actual work done at flip $i$ is $1+t_i$, since we need to do at most that many flips.
• The amortized cost of a single increment is $(1-t_i) + (1+t_i)=2$, so the bits flipped in $N$ increments is $O(N)$.

### Example: ArrayList

Define $\Phi(s) = 2\cdot(N-\frac{\text{array.length}}{2})$ , where $N$ is the number of items in the array. This sets the potential to be $0$ when the array is half full, and the potential to be $N$ when the array is full. The amortized cost ends up being 3 again.

# Randomized Algorithms

Randomized algorithms are useful when you have a problem where many choices are good, and all you need to do is avoid the rare bad choices that. These are situations where a bad deterministic algorithm could give you the worst case. The key fact we must remember is that of expected wait times, which informs the expected success:

If something has an independent probability $p$ of happening each trial, then the expected time until it occurs is $1/p$.

For example, the expected time to roll 5 on a 6-sided die is 6; the expected time to flip heads on a coin is 2. As a result, we usually don’t have to wait very long to find the hay in the haystack.

## QuickSelect (review)

The task is to find the $k$’th smallest item. We divide items into piles smaller than and greater than a random pivot, then recurse over the appropriate side.

• In the worst case, we keep picking bad pivots, and end up with a $\Theta(n^2)$ runtime.
• In the expected case, randomizing the pivot prevents from bad pivots that could arise if we started with an already sorted list. This has an expected runtime of $\Theta(n)$.
• Imagine all the elements in the array, sorted in order. Define “picking a good pivot” as picking a pivot in the middle 50% of the array, which would allow us to throw out at least 25% of the items. 
• The probability of picking a good pivot is $1/2$, so the expected number of random pivots we must pick before coming across a good pivot is $2$.
$\sum_j{2(cn\cdot(\frac{3}{4})^j)} = O(n)$

## Linearity of Expectation: E[x+y] = E[x] + E[y]

This is cool because it works even if $X$ and $Y$ are independent events.

If we are collecting coupons, how long will it take us to collect all the unique coupons? By linearity of expectation, $\text{E}[x] = \sum_{i=0}^{n-1}{\text{E}[x_i]}$ if we consider $x_i$ the number of coupons we collect to get a new unique coupon (going from $i$ to $i+1$ unique coupons). So if we repeat this collecting $n$ times, by linearity of expectation the number of coupons we collect is that.

What is $\text{E}[x_i]$?

• On the first coupon, we have a $\frac{N}{N}=1$ chance of getting a unique coupon.
• On the second coupon, we have a $\frac{N-1}{N}$ chance of getting a unique coupon, so our expected time is $\frac{N}{N-1}$.
• On the third coupon, our expected time is $\frac{N}{N-2}$.
• On the $N-1$th coupon, our expected time is $N$ since we have a $\frac{1}{N}$ chance of getting a unique coupon.
$\text{E}[x] = \sum{\text{E}[x_i]} = \sum_{i=0}^{n-1}{\frac{N}{N-i}} \approx N\ln N$

## QuickSort

QuickSort takes a random pivot and partitions elements into smaller and larger piles, then sorts those piles. This has a worst case of $\Theta(N^2)$ but an expected time of $\Theta(N\log N)$. Randomness protects us from bad pivot choices.

## Miller-Rabin Primality Test

Big Idea: Repeated runs of an algorithm can increase its chance of success. The Miller-Rabin Primality Test checks the primality of a number with around a $3/4$ success rate. But if we run this many many times, the probability of passing $k$ such tests and still being nonprime is $(\frac{1}{4})^k$ – if you run 100 tests, the chance it is composite is $\frac{1}{2^{200}}$, while the chance of being struck by lightning this year is $\frac{1}{2^{20}}$.

More specifically, prime numbers satisfy Fermat’s Little Theorem, which states that $p$ is prime implies that $a^{p-1} \mod{p}=1 ~~\forall a \in \mathbb{N}, a \ge 2$ . Unfortunately, some nonprime numbers still pass Fermat’s Little Theorem for all values of $a$: we call these Carmichael Numbers. Fortunately, we do know that primes shouldn’t have nontrivial square roots, where $a^2\mod{p}=1$ when $a\ne1$ and $a\ne p-1$

So we will repeatedly square to calculate $a^{p-1}$, so we can easily reject $p$ if we square it and get 1 when we shouldn’t. This will catch the Carmichael numbers. We can repeat this many many times with random values of $a$, chosen between $[2, p-2]$.

## Hashing

Hash functions are designed to have an equal probability of sending an input to a value in a set of possible values, unless the inputs are identical. A linear function on bits modulo some prime number seems to work well for this. The chance of collisions in a table of $N$ size is $\frac{1}{N}$, regardless of how similar or dissimilar elements are. This low collision probability is what makes the expected time of put and get $O(1)$, but the worst case linear time if you have to traverse elements on collision.

## Bloom Filters

Bloom filters are useful in applications when we want to know with relative certainty if we’ve seen a key before or not. This could be useful in a network filter that is blacklisting IP addresses. Rather than use a huge hashtable, we can use many smaller hashtables each with different random hashing functions.

These smaller tables can be relatively small, but if all of them agree, we can be fairly sure we’ve seen a key before. With $m$ bits, the chance of a collision is $\frac{1}{m}$, and the chance of wrongly saying “yes, I’ve seen this key” in $b$ different tables is $(\frac{1}{m})^b$. We would need a single hashtable of size $m^b$ to have the same false positive rate as $b$ tables of size $m$.

# Greedy Algorithms

Greedy algorithms take actions that prioritize a short-term goal which is often easily computable. This short term goal is often to minimize or maximize something, which lends itself well to speedup from sorting input data (in the case that all data is known beforehand) or storing data in a min/max-heap (in the case that the algorithm must operate on changing data).

Greedy algorithms build a unique solution as they go, unlike dynamic programming approaches which don’t decide on any one solution until the very end.

A simple example of a greedy algorithm is the change-making process using US coins. We begin by finding the biggest coin that can fit into the total amount of change left to give back, and simply repeat this until the entire change is made. This happens to work with the US currency denominations, but surprisingly this approach is not guaranteed to work. If we had to make $0.66 of change and had$0.22 coins available, we would first pick out a quarter and struggle along instead of simply picking two 22 cent coins.

Often, greedy algorithms will make use of priority queues or min/max-heaps to avoid a potential bottleneck of multiple sorts. Both Prim’s and Dijkstra’s algorithms are greedy algorithms.

• Prim’s algorithm repeatedly finds the next lowest weight edge cut by the cutset of included nodes
• Dijkstra’s algorithm repeatedly explores the next closest unexplored node and subsequently updates edge weights

## Unweighted Interval Scheduling

Problem: many people want to use a resource, and our goal is to simply satisfy the most intervals.

There are plenty of greedy but non-optimal approaches:

• Picking the first request: bad if the very first request takes the entire time, leaving no space for other intervals
• Picking the shortest interval: bad if the shortest request is in an odd timing and prevents you from doing other intervals
• Picking the interval with fewest conflicts: fails if there are many intervals stacked on top of each other when you should be picking one from those.

Optimal approach: Repeatedly pick the request that ends first (gives us the most time left after its completion) – as long as the request does not conflict with any previous requests.

### Runtime

Like most greedy algorithms, this one’s runtime is lower bounded by the speed of sorting. We begin by sorting by the value we’re being greedy about, giving us a lower bound of $n\log{n}$ time complexity. If we knew something about the distribution of the dataset or some other prior, we might be able to do sorting in $O(n)$.

In this proof strategy, we show that the greedy algorithm starts out optimal, and stays optimal or never falls behind the optimal solution. We compete our algorithm against an imaginary optimal algorithm which always is optimal (regardless of method).

Base case: At $i=1$, the greedy solution picks out the very earliest-finishing request in the whole set, so this is the only optimal solution possible at this point. The optimal algorithm can’t possibly finish any sooner than our solution finishes.

Inductive step: Asume the claim is true for all $i \leq k$. Then, there is no way for the greedy algorithm to fall behind at request $k+1$. The optimal move is legal for the greedy algorithm because we finished earlier or the same time as the optimal algorithm at the previous time step.

By our greedy algorithm, our request $i$ always ends no later than the optimal solution’s request $i$. This means that our approach must return an optimal scheduling. The greedy algorithm’s solution is always at least as good as the optimal algorithm’s solution.

## Fractional Knapsack

This is a relative of the knapsack problem, but this time you are bringing spices so you can arbitrarily divide up quantities and take a fractional weight of an item which gives you a fractional value. The greedy approach we will use is to take as much of the most valuable item as possible, then switch to the next most valuable item.

## Interval Partitioning

Instead of only having one resource that multiple users would like to use, we want to schedule all requests using as few resources as possible. For example, we could be trying to schedule multiple tracks in a conference to use the least number of rooms.

We can solve this by sorting the requests by start time, and then greedily assign each request to the earliest resource that available.

### Lower Bound Proof

We claim that the maximum number of simultaneous requests is the minimum number of resources we need. If there are $d$ requests all for the same time, it is not possible to create a schedule with fewer than $d$ resources. We call this the depth of the requests.

$d$ is a lower bound for the number of resources used. Our algorithm never uses more than $d$ resources at once. Because our algorith meets the lower bound, no solution can be more optimal than ours.

## Minimizing Max Lateness

We want to schedule tasks, each of which has a duration and a deadline, in order to minimize the largest deadline miss.

Our greedy algorithm will schedule the tasks in order of the earliest deadline. We will use the exchange argument to prove its correctness.

### Exchange Argument

The crux of the exchange argument is that we can transform an optimal solution into our solution through a number of moves that preserve optimality – that do not make the optimal solution no less optimal. Call the arbitrary optimal solution OPT.

• OPT has no idle times, and already has that in common with our algorithm’s schedule.
• An inversion is a pair of jobs where one job has a later deadline than the other, but is scheduled before the one with the earlier deadline. All schedules with no idle time and no inversions have the same maximum lateness.
• Optimal Schedules with no inversions can only differ when multiple jobs have the same deadline; shuffling these same-deadline jobs will not change the maximum lateness.
• In this case, swapping adjacent, inverted jobs in the schedule keeps the schedule optimal

# Compression

Ascii is a compression format, but it’s not very efficient because every character has the same length of encoding: 8 bits. The crux of all compression is making sure that often-used characters have a smaller compressed size than less common characters. This is called a variable-length code, and is used in compressing text, images, and other mediums. Not all variable-length codes are good: morse code is one variable-length code that suffers from ambiguity since a sequence of morse code could be decoded as more than one output.

Prefix codes assure there can be no ambiguity with what a sequence of coded characters mean. When expressed as a binary tree, prefix codes have all characters at leaves of the tree.

## Information

Information = $-\log_2{Pr(s)}$ , where $Pr(s)$ is the probability of some symbol $s$.

$Pr(s) =$ (count of char s) / (count of all symbols)

ABL (Average Bits per Letter) is a measure of how compressed some information is, given frequencies of a character. $\sum_x{f_x \cdot \gamma(x)}$ where $f_x$ is the frequency or probability of a symbol occurring, and $\gamma(x)$ is the length of the code for that symbol.

## Huffman Compression

Huffman compression is a greedy algorithm for generating the optimal code for each symbol in a corpus of symbols to be encoded. Huffman compression begins by labeling all symbols with their frequency, or number of occurrences in a representative sample of text. Starting from most infrequently used symbols, it merges symbols into nodes that have the two symbols as children and takes on the summed frequency of the two children. These are recursively merged with the next smallest groupings of symbols, until all symbol-frequency tuples are merged into one binary tree.

This method ensures that the lowest nodes of the tree are the least common symbols, while the nodes closest to the root are the most commonly occurring symbols. This guarantees that the code is not only unambiguous, but also ideal.

### Runtime

Given k unique characters, Huffman compression needs k-1 rounds of merging, since each merge results in one fewer connected component.

Huffman’s compression is greedy because it keeps looking for the next least commonly used character and merging it, so we can apply a min-heap data structure to avoid the need of sorting. In each round of merging, Huffman’s algorithm needs to extract two smallest elements, taking $O(2\cdot\log{k}) = O(\log{k})$ time. Inserting a new key is $O(\log{k})$ time. The overall runtime is $O(k\cdot\log{k})$, since we repeat this merging k times. Note that the runtime is dependent on the number of unique characters, not the length of the message.

### Usage Notes

• Sending a Huffman tree with your encoding can actually increase the size of whatever you’re sending, if the encoding doesn’t save as much space as the tree you’re sending takes. By starting with a shared codebook, you can reduce this overhead.
• Huffman encodings are only optimal as far as encoding on a character-by-character basis, and indicates nothing about compressing adjacent characters.

$ABL(T') = ABL(T) - f_m$ when two Huffman trees $T$ and $T'$ differ by a node $m$ with children $y$ and $z$, and $f_m = f_y + f_z$.

## Lempel-Ziv-Welch (LZW)

Unlike the Huffman encoding, LZW is able to create codes that take advantage of patterns in texts, compressing multiple occurrances. It can be used by itself, or in conjunction with Huffman encodings.

LZW is able to build a codebook in parallel on the sender and receiver, without ever transmitting a copy of the codebook or any codebook data.

# Network Flows

Network flows deal with directed graphs $G = (V, E)$ that have two special nodes: a source s and a sink t. These graphs have edge capacities $c(e)$ for $e \in E$. These networks can model railways, pipelines, and a lot of other applications dealing with how much can travel from one station to another.

For simplicity, we will assume that there are no edges leading into the source, no edges leaving the sink, and every node can reach the sink.

## Flow: $S-T$ flow

A flow is defined with a function $f(e)$ outputting how much to send on a given edge. This function has two properties:

• $\forall e \in E, 0 \leq f(e) \leq c(e)$ For every edge, the flow on that edge must be positive and less than the maximum capacity of that edge.
• $\forall v \in V, \sum_{e~\text{in to}~v}{f(e)} = \sum_{e~\text{out of}~v}{f(e)}$ For every vertex, the total flow into that vertex must equal the total flow out of that vertex.

The value of a flow is defined $\text{val}(f) = \sum_{e~\text{out of s}}{f(e)}$, the sum of all edges out of the source $s$. This value must also be the sum of flow entering the sink $t$.

## Cuts

An $S-T$ cut is a partitioning of the set of vertices $V$ into sets $A$ and $B$ that has the source is in one set and the sink in another. $s \in A, t \in B$

The capacity of a cut is the sum of capacities for all edges going from $A$ to $B$.

Conservation of Flow: If $f$ is any $s-t$ flow and $(A, B)$ is any $s-t$ cut, then the net flow across $(A, B)$ is equal to the amount leaving the source $s$.

$\text{val}(f) = \sum_{edges~A\to B}{f(e)} + \sum_{edges~B\to A}{f(e)} = \text{net flow}$

The value of a flow is at most the capacity of the cut. For any $s-t$ flow $f$ and any $s-t$ cut $(A, B)$, $\text{val}(f) \leq \text{cap}(A, B)$.

$\text{val}(f) = \sum_{e~\text{out of}~A}{f(e)} - \sum_{e~\text{into}~A}{f(e)} \\\leq \sum_{e~\text{into}~A}{f(e)}~~~~~\text{by negativity of flows, we can subtract the flow out A} \\\leq \sum_{e~\text{into}~A}{c(e)}~~~~~\text{since we know flow is less than the capacity of the cut}$

Weak Duality Theorem: If $f$ is a flow, $(A, B)$ is a cut, and $\text{val}(f) = \text{cap}(A, B)$, then $f$ is a maximum flow and $(A, B)$ is a minimum cut.

Weak Maxflow-Mincut Duality: The value of the maximum flow $\leq$ the capacity of the minimum cut.

## Augmenting Paths

An augmenting path is an $s-t$ flow where the flow is less than the capacity on every edge in the path: $f(e) < c(e)$. On each of these edges, we can increase the flow and still maintain flow conservation.

### Residual Graphs

This is the idea of a residual graph, which for every edge $(u, v)$, there are now two edges.

• Forward edge, in the original direction, has value of how much more flow can be added to that edge. $\text{capacity} = c(e) - f(e)$: how much more flow can we send?
• Residual (backward) edge, has value of how much can the flow be reduced. This is the current flow. $\text{capacity} = f(e)$: how much can we reduce flow before it becomes negative?

By thinking about residual edges, we can reason about reversing or reducing flow.

If we want to increase the flow in a residual graph, we can do the following steps:

1. Find the bottleneck capacity, which is the minimum capacity of all edges in P. This is the minimum value of all forward edges in the residual graph.
2. For every edge in the augmenting path in the residual graph:
• If it is a forward edge: add flow by the bottleneck amount
• If it is a residual/backward edge: subtract flow by the bottleneck amount
3. Repeat. After each augmenting step, this is still a valid flow.

## Ford-Fulkerson Algorithm

The Ford-Fulkerson algorithm is a greedy algorithm for finding maximum flows through a network. It works by greedily looking for augmenting paths in the residual graph. (Remember that an augmenting path is a path from the source to the sink where there is potential to add flow to all edges in that path).

It begins with all 0 flows, $f(e) =0~\forall e \in E$.

1. Find an augmenting path in the residual graph, update the flow
• Finding an augmenting path in the residual graph can be found by finding any path from $S$ to $T$, since because it is in the residual graph that means that there is capacity left on all those edges. This can be done with DFS, taking $O(m+n)$ time, but because there are
2. Repeat until we are stuck: there is no longer an $s-t$ path in the residual graph.

Theorem: the following statements are equivalent for all flows $f$.

1. There exists a cut $(A, B)$ such that $\text{val}(f) = \text{cap}(A, B)$: there is a cut where the flow across the cut is equal to the total capacity across the cut.
2. Flow $f$ is a maximum flow.
3. There is no augmenting path in $G_f$.

We can show 1 implies 2 by the weak duality theorem: if $f$ is a flow, $(A, B)$ is a cut, and $\text{val}(f) = \text{cap}(A, B)$, then $f$ is a maximum flow and $(A, B)$ is a minimum cut.

We can show 2 implies 3 by the contrapositive: $\neg 3 \implies \neg 2$: If there is an augmenting path in $G_f$, then by augmenting along that path we can increase the flow $f$.

We can show 3 implies 1 by reasoning.

• Create a cut $(A, B)$ such that $A$ consists of all nodes reachable from A in the residual graph, and $B$ is all the other nodes. Observe that no edges in the residual graph go from $A$ to $B$.
• If an edge from $A$ to $B$ is in the original graph but is not in the residual graph, that means that $A\to B$ is saturated: its flow is exhausted, $f(e) = c(e)$.
• If an edge from $B$ to $A$ is in the original graph but is not in the residual graph, it has flow $0$ or else we would have had a backwards edge $A\to B$ in the residual graph.
$\text{val}(f) = \sum_{e\in~\text{A \to B}}{f(e)} - \sum_{e\in~\text{B \to A}}{f(e)}$
• We can apply these two facts that all edges $A\to B$ are $c(e)$, and all edges $B\to A$ are $0$:
$\text{val}(f) = \sum_{e\in~\text{A \to B}}{c(e)} - \sum_{e\in~\text{B\toA}}{0} = \text{cap}(A, B)$
• This tells us that the flow is optimal, but it also tells us that given a max flow, we can compute a min cut in linear time using BFS to figure out all the vertices reachable from $s$ – essentially finding $A$.

### Runtime

If our graph has integer capacities, each augmentation will increase the flow by at least 1. The number of augmentations then is at least $\text{val}(f^*)$, the value of the maximum flow in the graph. Each augmentation takes $O(m)$ time, since it runs both DFS and then takes linear time to update the flows. DFS is $O(m+n)$ time, but we are almost always working on a graph with more vertices than edges, so $n < m$ and thus it’s $O(m)$ time to run DFS.

• Ford-Fulkerson has a runtime of $O(m\cdot \text{val}(f^*))$ where $\text{val}(f^*)$ is the value of the max flow.
• Given a maximum $S\to T$ flow, we can run depth-first search to find the set of all nodes reachable from $S$ in the augmented graph. Given a max flow, we can find a mincut in time $O(n+m)$.

### Improvements to Ford-Fulkerson

We notice that so far, we have only considered Ford-Fulkerson on integer capacities, which lend themselves to an integer maximum flow. We also observe that the runtime is dependent on the value of the maximum flow, which is suboptimal. Furthermore, the number of augmentations can be unbounded if capacities are non-integers or special integer cases.

There are a few strategies we can take in choosing our augmenting paths that will allow us to terminate sooner and when dealing with both special and non-integer capacities.

For the purposes of this course we can assume that we can solve maximum flow in time $O(mn)$. This is an area of active research!

Maximum Bottleneck Path Shortest Path
Finds the path maximizing the minimum edge capacity, $\text{min}(c(e))$ Finds the path with the shortest hops from the start to end node
Can be computed in $O(m\log m)$ Can be computed in $O(m)$ using BFS
Variant of Dijkstra’s algorithm

#### Max Bottleneck Path Heuristic

Arbitrary Path Max Bottleneck Path
Value of maxflow $v^*$ $v^*$
Value of augmenting path $\ge1$ $\ge\frac{v^*}{m}$
Flow remaining $\le v^*$ $\le v^* - \frac{v^*}{m}$
# of augmenting paths $\le v$ $\le m\log v^*$

When taking the max bottleneck path, the number of augmenting paths can be calculated as follows: If it takes us $k+1$ augmenting paths, after $k$ paths, the remaining flow is $\gt 0$. With some calculus trickery: $0 \le (1-\frac{1}{m})^k v^* \approx (e^{-1})^{-\frac{k}{m}}v^*$ Using this heuristic, the number of augmenting paths is $O(m\log(v^*))$, and the running time of each augmentation is $O(m\log m)$. The total runtime of using the max bottleneck path is $O(m^2\log(m)\log(v^*))$, which is an improvement over being linear in $v^*$. This is still going to be slower on large graphs with low maximum flows though, since it has an $m^2$ term in there.

#### Shortest Path Heuristic

We can find the shortest augmenting path (the one with the least hops) in $O(m)$ time using DFS (recall from above that $n<m$ so $O(n+m)=O(m)$). The total number of augmenting paths is $O(mn)$. On arbitrary non-integer capacities, this heuristic gives a runtime of $O(m^2n)$ which is pretty nice, since it’s not bounded by the value of the maximum flow.

## Reductions to Network Flow

A reduction is a new algorithm that solves a new problem using function calls to an old algorithm that solves an old problem. The reduction should not make any assumptions about the implementation details of how the old algorithm solves the old problem.

### Example: Mincut

New Input: $(G, S, T)$ = old input for Ford Fulkerson

Old Output (from Ford-Fulkerson): $f^*~ \text{for}~ G$

New Output: Compute residual graph $G_{f^*}$, compute $A$ as the set of nodes reachable from $S$ in $G_{f^*}$.

### Example: Bipartite Matching

Input: a bipartite graph $G=(V, E)$ with $V=\text{left}~\cup~\text{right}$.

Output: a maximum cardinality matching $M \subseteq E$, where $M$ is a set of edges where no endpoint is connected to more than one edge in $M$. The cardinality is the number of edges in $M$. This problem has a variety of use cases.

Input for Bipartite Matching Input for Integer Maxflow Adjustment
Undirected Directed We can think of the matching as sending tokens from all nodes on the left hand side to all nodes on the right hand side. So we can connect every node on the left hand side to every node on the right hand side with a directed edge.
Unweighted Integer edge capacities All edges are equally important for matching, so we can assign a capacity of 1 to each edge
No source/sink Has a source/sink We can create a source node that we connect to every node on the left side, and a sink node which is connected to all nodes on the right hand side.

The output of integer maxflow is a flow, where every edge is assigned either a 0 or a 1. We can simply use the set of edges that have a flow of 1 as our matching, after we remove our specially added source and sink nodes.

We know this is a correct matching since for every node in the matching, on the left there is only one edge entering it (from our special source node). By flow conservation, we know that each must connect to only 1 other node. On the right, each node only has one output, and so can only be connected to by 1 node on the left.

Transforming the input for bipartite matching into the input for integer maxflow takes time $O(n+m)$ to add $m$ new edges and a constant 2 new nodes. Finding the max flow in this new graph takes $O((n+2)(n+m))=O(nm)$ since $n^2 = O(nm)$. We can transform the integer maxflow output to our desired output in time $O(n+m)$.

### Edge-Disjoint Paths (EDP)

Given: a directed graph $G$ with source $S$ and sink $T$

Output: a largest set of edge-disjoint $S-T$ paths. An edge-disjoint set of paths are paths from $S$ to $T$ that share no edges in common. This indicates that this network can tolerate that many edge failures.

Transforming the input for EDP to Integer Maxflow is fairly straightforward; we just need to assign an edge weight of 1 for every edge in $G$.

Transforming the output of Integer Maxflow to the output for EDP is a bit more involved.

• Integer maxflow gives weights of 1 on all edges that are part of an $S-T$ path.
• We want to choose paths only from these maxflow edges, so we can search along these maxflow edges for a path from $S$ to $T$, ignoring all other edges. Once we find one path, we remove those edges and keep searching. The set of maxflow edges is guaranteed to have an $S-T$ path.
• Furthermore, because each edge weight is 1, the value of the maxflow is equal to the number of edge-disjoint paths we will find.

#### Correctness

• If there are $k$ disjoint paths, then there is a flow of value $k$. Each edge-disjoint path is like a new augmenting path that can be used. So if we have $k$ of these, then we can add flow 1 $k$ times.
• If there is a flow of value $k$, then there are $k$ disjoint paths. Can prove this by induction.

#### Running time

1. Producing $G’$ from $G$ takes $O(mn)$ since we are just labeling that many edges.
2. Finding the max flow on $G’$ takes $O(mn)$ since we are solving integer maxflow on $n$ nodes, $m$ edges, $f(e)=1$.
3. Producing the set of paths $P_1…P_k$ from the maxflow takes $O(mn)$ since we run BFS as many times as the value of the maximum flow. The value of the maximum flow is $O(n)$ and $n = O(m)$, so $O(m+n)\cdot O(n)=O(mn)$.

### Image Segmentation

The task is to separate an image, such as into a foreground and a background, given some prior knowledge about the pixels. Specifically:

Inputs:

• An undirected graph $G=(V,E)$ where $V=\text{pixels}$ and $E=\text{pairs of pixels}$
• Likelihoods $a_i, b_i \ge 0 ~\forall i \in V$ indicating for each pixel, the likelihood it is in in set A or set B
• A separation penalty $P_{ij} \ge 0$ for all $(i, j)\in E$, denoting the cost penalty if we put pixels $i$ and $j$ into separate sets

Output: a partitioning of $V$ into sets $(A, B)$ that maximizes the a score function:

$q(A, B)=\sum_{i\in A}{a_i} + \sum_{j\in B}{b_j} - \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}$

To reduce this maximization function to the Mincut problem, there are a few steps. Our reduction must create a graph such that the capacity of the minimum cut should correspond to the function we are trying to minimize.

1. Segmentation wants to maximize a function, while Mincut wants to minimize a function. We can negate our function to maximize and then minimize it.

$q(A, B)=-\sum_{i\in A}{a_i} - \sum_{j\in B}{b_j} + \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}$ $q(A, B)=\sum_{i\in B}{a_i} + \sum_{j\in A}{b_j} + \sum_{(i, j)\in E~\text{between A and B}}{p_{ij}}$
2. Segmentation allows any partitioning of the graph, while Mincut requires a source node $S\in A$ and sink node $T\in B$. We can just add new nodes $S$ and $T$ which are not connected to anything, allowing Mincut to make any partition it wants.
3. Segmentation has undirected edges between sets $A$ and $B$, while Mincut considers edges from $A$ to $B$ (these are directed edges). For every undirected edge, we can create both a forwards and a backwards edge, both with the same capacity.
4. Segmentation has a minimizing function with terms for nodes in $A$ and nodes in $B$, the $\sum_{i\in B}{a_i} + \sum_{j\in A}{b_j}$ terms, while Mincut only has a term $\sum_{(i,j)\in E}{p_{ij}}$. We can connect our added source and sink nodes such that $S$ connects to all nodes, and $T$ connects from all nodes. $S\in A$ and $T\in B$, so we set the weights on all edges connecting out from $S$ to be $a_j$ and weights on all edges connecting to $T$ to be $b_i$.

This means that any cut will cut the edges from $S$ to the nodes in $B$, which carry weight $a_j$, and this cut will cut edges from any node in $A$ to $T$, which carry weight $b_j$. Looking back at our equation from step 1, we now have exactly what we wanted.

#### Correctness

The capacity of every cut is equal to $-(\text{quality of partitioning in image segmentation})$. The minimum cut is that of the maximum quality.

#### Running Time

Image segmentation takes time $O(m)$ to find the partitioning of pixels:

• It takes $O(m)$ time to produce $G’$, which involves adding $2m$ edges (both from $S$ to all the nodes, and from all nodes to $T$).
• It takes $O(mn)$ time to solve the Mincut problem on $G’$, a graph of $(n+2)$ nodes and $(2m+2n)$ edges.
• It takes $O(m)$ time to find the partitioning.

# Complexity Classes

“Hard” problems are problems that cannot be solved in polynomial time. We commonly use reductions to show the relationship between hardness of problems. There are three primary problem classes.

• P: problems that can be solved in polynomial time
• NP: problems whose solutions can be checked in polynomial time
• NP Complete: the very hardest problems in NP. These problems are polynomial-time reducible from any other problem in NP.

## 2-SAT

2-SAT is considered an “easy” problem, and has a solution that can be found in polynomial time. 2-SAT is part of complexity class P. We define variables and literals, where variables are things like $X$ and $Y$, and literals are combinations of variables and operators, like $\sim X$ and $\sim Y$. A 2-Clause is comprised of two literals joined with an OR, like $X \vee \sim Z$.

Input: a bunch of 2-Clauses joined with AND operators $\wedge$

Output: an assignment of variables that makes the input’s formula true

This problem can be solved in polynomial time. Consider a 2-Clause $(X\vee Y)$:

• If $\sim X \implies Y=\text{true}$
• If $\sim Y \implies X = \text{true}$

So $(X\vee Y)$ is equivalent to $(\sim X \implies Y)\wedge(\sim Y\implies X)$. We can use this fact to create an implication graph, where each literal is a node and each clause forms two directed edges edges, $(\sim X, Y)$ and the $(\sim Y, X)$ part representing the two parts of it. A path from X to Y indicates that if X is true, Y must also be true.

We can check to make sure that this graph is valid. There are two possibilities:

1. A path from $X$ to $\sim X$ exists. This means that $X\implies\sim X$, so $X$ must be false.
2. A path from $\sim X$ to $X$ exists. This means that $\sim X \implies X$, so $X$ must be true.

If both of these paths exist, then we have found a contradiction and thus an assignment cannot be made. The formula is unsatisfiable. Otherwise, the formula is satisfiable.

Algorithm: For every variable, check for paths $(X, \sim X)$ and $(\sim X, X)$. For $n$ number of variables, this is solvable in polynomial time:

• 1 node per literal $\le2n=O(n)$ to add all nodes to graph
• 2 edges / clause $\le$ ${2n}\choose{2}$ $=\frac{4n^2-2n}{2}=O(n^2)$ to add all edges to graph
• To check the paths $(X, \sim X)$ and $(\sim X,X)~\forall~X$ will require running DFS or BFS. Both will take time $O(\vert V\vert + \vert E\vert)=O(n^2)$, and we will need to run this twice per node. This is a total runtime of $O(n^3)$.

## 3-SAT

This is the same as 2-SAT, but instead each clause has 3 literals. It is unlikely that 3-SAT is solvable in polynomial time, since solving it seems to require trying all possible solutions. The search space has no ordering. If we were to take the same graph-based approach as solving 2-SAT, we could decompose an equation $(a \vee b \vee \sim c)$ into $\sim a \implies(b\vee\sim c)$, but we would not have a node in our graph for the latter part of the implication. We could introduce a new node but that brings additional complexities.

• Proving no polynomial time solution exists for 3-SAT: P≠NP
• Finding a polynomial time solution for 3-SAT: P=NP

2-SAT $\in$ P

3-SAT not necessarily in P

2-SAT $\in$ NP

3-SAT $\in$ NP

P $\subseteq$ NP, since finding a solution is at least as hard as checking one.

Problems in NP that are not in P: 3-SAT, Hamiltonian Cycles, Traveling Salesman, Circuit-SAT, subset-sum, etc.

## NP-Completeness

NP-Complete problems are in class NP (solutions can be checked in polynomial time), and are at least as hard as any other problem in NP. NP complete problems are linear-time reducible from any other problem in NP.

Theorem: If problem A is polynomial time reducible to problem B, and A is NP Complete, and B is in NP, then B is also NP Complete.

### Subset-Sum ➡ Partition

Subset-Sum Partitioning
A multiset of positive integers $M$
A positive integer target $t$
A multiset of positive integers $N$
Is there a subset of $M$ that sums to $t$? Is there a partitioning of $N$ into two sets, $N_1$ and $N_2$ , $N_1 \cap N_2=\emptyset, s.t.\sum_{N_1}=\sum_{N_2}$?
NP Complete To prove

We know that subset-sum is NP Complete, and we’d like to show that Partitioning also is. According to our theorem, we need to show that Subset-Sum is polynomial-time reducible to Partitioning, and then that Partitioning is in NP.

• Partitioning is in NP: Given a partitioning, we can calculate the sum of the partitions in $O(n)$ time and check if the solution is correct. So this is polynomial time, and Partitioning is in NP. ✔️
• Subset-Sum is polynomial-time reducible to Partitioning:
• We convert our input $M$ from Subset-Sum to the input for Partitioning, $N=M\cup{{S-2t}}$ where $S=\sum_{M}$ and $t$ is the target sum.
• For example, if Subset-Sum’s input $M={4, 6,7,7,30}$ and $t=20$, then $S=54$ and $M={4,6,7,7,30,14}$
• The runtime to reduce Subset-Sum to Partitioning is linear since it only involves a sum over all elements.
• Subset-Sum being true means Partitioning is true: ✔️
• Because Subset-Sum is true, this means $\exists M’ \subseteq M~ s.t. \sum_{M’}=t$.
• Let $N_1=M-M’$ and $N_2=M’\cup {s-2t}$
• $\sum_{N_1}=\sum_{M}-\sum_{M’} = s-t$, and $\sum_{N_2}=\sum_{M’} + (s-2t) = t+(s-2t)=s-t$
• So both partitions have the same sum.
• Partitioning being true means Subset-Sum is true: ✔️
• Because Partitioning is true, this means $\exists N_1, N_2 \subseteq N,~N_1\cup N_2 = N,~ N_1 \cap N_2=\emptyset,~s.t.~ \sum_{N_1}=\sum_{N_2}=\frac{\sum_N}{2}=\frac{s+s-2t}{2}=s-t$
• Assume, without loss of generality, that $s-2t \in N_1$ (it could also be in $N_2$).
• Let $M^*=N_1-{s-2t}\subseteq M$ which is saying everything in $N_1$ not in $s-2t$.
• $\sum_{M^*}=(s-t)-(s-2t)=t$ which is what we want.

### 3-SAT ➡ Clique

In our first introduction to 3-SAT, we learned that 3-SAT is NP Complete.

Here, we introduce the Clique problem. A Clique is a set of nodes in an undirected graph, who are all neighbors with each other. The clique problem asks, “Does a clique of size $k$ exist?”

Here, we introduce the Clique problem. We will show that the Clique problem is NP Complete because we can reduce 3-SAT to it in polynomial time.

• Clique is in NP: Given some solution, we can check for edges pairwise in the solution and ensure that all edges are between neighbors. This takes time $O(\vert C \vert^2)$, which is polynomial time.
• 3-SAT is polynomial-time reducible to Clique: ✔️
• Each appearance of each literal in 3-SAT becomes a node, once every time it appears in an expression. ($X$ and $\neg X$ appear independently)
• Edges are made between all nodes except literals within a clause, and between literals $X$ and $\neg X$.
• There are $n^3$ literals, so creating all the nodes takes time $O(n^3)$. Adding edges requires checking if they are within the same clause, and so making edges takes $O(n^3)\times O(n^3)=O(n^6)$ which is still polynomial time.
• We would now be checking for a clique of size $k$, where $k$ is the number of clauses.
• Clique being true means 3-SAT is true: If there exists a clique of size $k$, there must be 1 node from every clause, since there are no edges within a clause and between conflicting literals. This means we can satisfy 3-SAT. ✔️
• 3-SAT being true means Clique is true: If 3-SAT is true, there can’t be any contradictions so if we just pick one literal from each $k$ clauses this will automatically form a clique of size $k$. ✔️