RNA Secondary Structure Prediction

Problem Definition

Definition 1. The primary structure of an RNA molecule $B = b_1 \ldots b_n$ is a sequence of symbols drawn from $\lbrace A, C, G, U\rbrace$.

Definition 2. A secondary structure $S \subseteq \lbrace 1,\ldots,n\rbrace \times \lbrace 1,\ldots,n\rbrace$ over primary structure $B$ satisfies:

(No sharp turns) $(i,j) \in S \implies i < j - 4$

(Complementary base pairs) $\lbrace b_i, b_j\rbrace = \lbrace A,U\rbrace$ or $\lbrace C,G\rbrace$

(Matching) Each index appears at most once

(Noncrossing) For $(i,j),(k,l) \in S$, $i < k < j < l$ is forbidden

Problem 1. Given an RNA primary structure $B$, find the secondary structure that maximizes the number of complementary base pairs.

Subproblem Definition

Defining $\text{OPT}(j)$ as the maximum base pairs over the first $j$ bases fails: pairing $b_j$ with $b_k$ ($k < j-4$) leaves a subproblem that is not a prefix. Instead, we use:

Definition 3. $\text{OPT}(i,j)$ = the maximum secondary structure size over $b_i \ldots b_j$

Recurrence

Lemma 1. For $1 \leq i \leq j \leq n$:

\[\text{OPT}(i,j) = \max \begin{cases} \text{OPT}(i, j-1) \\ \displaystyle\max_{\substack{k:\ i \leq k < j-4 \\ b_j \text{ and } b_k \text{ complementary}}} \bigl[\text{OPT}(i, k-1) + \text{OPT}(k+1, j-1) + 1\bigr] \end{cases}\]

where $\text{OPT}(i, i-1) := 0$.

Proof.

Case $j$ is unpaired: $\text{OPT}(i,j) = \text{OPT}(i, j-1)$.
Case $(k, j) \in S$: $b_k$ and $b_j$ must be complementary and $i \leq k < j-4$. By the noncrossing condition, the remaining structure decomposes into $b_i \ldots b_{k-1}$ and $b_{k+1} \ldots b_{j-1}$. Defining $\text{OPT}(i, i-1) = 0$ handles boundary cases $k = i$ and $k = j-1$. ∎

Evaluation Order and Algorithm

The right-hand side always has smaller size ($= j - i$) than the left-hand side, so fill the table in order of increasing size:

M[i, i-1] ← 0  for all 1 ≤ i ≤ n
for ℓ ← 0, ..., n-1 do
    for i ← 1, ..., n-ℓ do
        j ← i + ℓ
        M[i,j] ← M[i, j-1]
        k[i,j] ← 0
        if M[i,j] < max{ M[i,k-1]+M[k+1,j-1]+1 | i≤k<j-4, bj,bk comp. } then
            M[i,j] ← (above maximum)
            k[i,j] ← corresponding k (argmax)
sol ← M[1,n]
OutputSecondaryStructure(1, n)

function OutputSecondaryStructure(i, j):
    if i ≤ j then
        if k[i,j] = 0 then
            OutputSecondaryStructure(i, j-1)
        else
            OutputSecondaryStructure(i, k[i,j]-1)
            print (k[i,j], j)
            OutputSecondaryStructure(k[i,j]+1, j-1)

Theorem 1. The algorithm is correct ($M[i,j] = \text{OPT}(i,j)$ by induction on $\ell$) and runs in $O(n^3)$ time.

Shortest Path with Negative Edges (Bellman-Ford)

Problem Definition

Problem 2. Given a weighted directed graph $G = (V, A)$, a cost function $c: A \to \mathbb{Q}$, a source $s$, and a destination $t$, find the shortest path from $s$ to $t$.

Difference from Dijkstra: Negative-cost edges are allowed.
Assumption: No negative cycles (otherwise shortest path length is undefined).

Key Lemma

Lemma 2. If $G$ has no negative cycles, there exists a simple path among the shortest $s$-$t$ paths, with at most $n-1$ edges.

Proof. A non-simple path has a repeated vertex. Removing the cycle reduces cost by a nonnegative amount. Repeating this yields a simple path. ∎

Subproblem and Recurrence

Definition 4. $\text{OPT}(i, v)$ = the length of the shortest path from $v$ to $t$ using at most $i$ edges. $+\infty$ if no such path exists.

Boundary conditions: $\text{OPT}(0, t) = 0$, $\text{OPT}(0, v) = +\infty$ for $v \neq t$

Lemma 3. For all $i \geq 1$:

\[\text{OPT}(i, v) = \min \begin{cases} \text{OPT}(i-1, v) \\ \min_{\langle v,w \rangle \in A} \bigl[c(v,w) + \text{OPT}(i-1, w)\bigr] \end{cases}\]

Proof. Consider the shortest $v$-$t$ path using at most $i$ edges. If it uses $\leq i-1$ edges, the first case applies; if it uses the first edge $\langle v,w \rangle$, the second case applies. ∎

Bellman-Ford Algorithm

M[0, t] ← 0
M[0, v] ← +∞  for all v ≠ t
for i ← 1, ..., n-1 do
    for each v ∈ V do
        M[i, v] ← min(
            M[i-1, v],
            min{ c(v,w) + M[i-1, w] | ⟨v,w⟩ ∈ A }
        )
return M[n-1, s]

Theorem 2. The algorithm returns the shortest path length from $s$ to $t$.

Theorem 3. Time complexity: $O(n(m+n))$ (outer loop $n-1$ iterations, each processing all edges $O(m+n)$).

Space-Optimized Version

M[t] ← 0
M[v] ← +∞  for all v ≠ t
for i ← 1, ..., n-1 do
    for each v ∈ V do
        M[v] ← min(
            M[v],
            min{ c(v,w) + M[w] | ⟨v,w⟩ ∈ A }
        )
return M[s]

This version using only a 1D array is also correct.