RNA Secondary Structure Prediction
Problem Definition
Definition 1. The primary structure of an RNA molecule $B = b_1 \ldots b_n$ is a sequence of symbols drawn from $\lbrace A, C, G, U\rbrace$.
Definition 2. A secondary structure $S \subseteq \lbrace 1,\ldots,n\rbrace \times \lbrace 1,\ldots,n\rbrace$ over primary structure $B$ satisfies:
- (No sharp turns) $(i,j) \in S \implies i < j - 4$
- (Complementary base pairs) $\lbrace b_i, b_j\rbrace = \lbrace A,U\rbrace$ or $\lbrace C,G\rbrace$
- (Matching) Each index appears at most once
- (Noncrossing) For $(i,j),(k,l) \in S$, $i < k < j < l$ is forbidden
Problem 1. Given an RNA primary structure $B$, find the secondary structure that maximizes the number of complementary base pairs.
Subproblem Definition
Defining $\text{OPT}(j)$ as the maximum base pairs over the first $j$ bases fails: pairing $b_j$ with $b_k$ ($k < j-4$) leaves a subproblem that is not a prefix. Instead, we use:
Definition 3. $\text{OPT}(i,j)$ = the maximum secondary structure size over $b_i \ldots b_j$
Recurrence
Lemma 1. For $1 \leq i \leq j \leq n$:
\[\text{OPT}(i,j) = \max \begin{cases} \text{OPT}(i, j-1) \\ \displaystyle\max_{\substack{k:\ i \leq k < j-4 \\ b_j \text{ and } b_k \text{ complementary}}} \bigl[\text{OPT}(i, k-1) + \text{OPT}(k+1, j-1) + 1\bigr] \end{cases}\]where $\text{OPT}(i, i-1) := 0$.
Proof.
- Case $j$ is unpaired: $\text{OPT}(i,j) = \text{OPT}(i, j-1)$.
- Case $(k, j) \in S$: $b_k$ and $b_j$ must be complementary and $i \leq k < j-4$. By the noncrossing condition, the remaining structure decomposes into $b_i \ldots b_{k-1}$ and $b_{k+1} \ldots b_{j-1}$. Defining $\text{OPT}(i, i-1) = 0$ handles boundary cases $k = i$ and $k = j-1$. ∎
Evaluation Order and Algorithm
The right-hand side always has smaller size ($= j - i$) than the left-hand side, so fill the table in order of increasing size:
1 | |
Theorem 1. The algorithm is correct ($M[i,j] = \text{OPT}(i,j)$ by induction on $\ell$) and runs in $O(n^3)$ time.
Shortest Path with Negative Edges (Bellman-Ford)
Problem Definition
Problem 2. Given a weighted directed graph $G = (V, A)$, a cost function $c: A \to \mathbb{Q}$, a source $s$, and a destination $t$, find the shortest path from $s$ to $t$.
- Difference from Dijkstra: Negative-cost edges are allowed.
- Assumption: No negative cycles (otherwise shortest path length is undefined).
Key Lemma
Lemma 2. If $G$ has no negative cycles, there exists a simple path among the shortest $s$-$t$ paths, with at most $n-1$ edges.
Proof. A non-simple path has a repeated vertex. Removing the cycle reduces cost by a nonnegative amount. Repeating this yields a simple path. ∎
Subproblem and Recurrence
Definition 4. $\text{OPT}(i, v)$ = the length of the shortest path from $v$ to $t$ using at most $i$ edges. $+\infty$ if no such path exists.
Boundary conditions: $\text{OPT}(0, t) = 0$, $\text{OPT}(0, v) = +\infty$ for $v \neq t$
Lemma 3. For all $i \geq 1$:
\[\text{OPT}(i, v) = \min \begin{cases} \text{OPT}(i-1, v) \\ \min_{\langle v,w \rangle \in A} \bigl[c(v,w) + \text{OPT}(i-1, w)\bigr] \end{cases}\]Proof. Consider the shortest $v$-$t$ path using at most $i$ edges. If it uses $\leq i-1$ edges, the first case applies; if it uses the first edge $\langle v,w \rangle$, the second case applies. ∎
Bellman-Ford Algorithm
1 | |
Theorem 2. The algorithm returns the shortest path length from $s$ to $t$.
Theorem 3. Time complexity: $O(n(m+n))$ (outer loop $n-1$ iterations, each processing all edges $O(m+n)$).
Space-Optimized Version
1 | |
This version using only a 1D array is also correct.