TDT4287 - Algoritmer for bioinformatikk

Sources and syllabus

01 Introduction

Bioinformatics - the collection, classification, storage, and analysis of biochemical and biological information using computers especially as applied in molecular genetics and genomics.

This course will focus on sequence (string) analysis

A cell as seen by a computer scientist

DNAfTRRNAfTLProtein\text{DNA} \longrightarrow_{f_{TR}} \text{RNA} \longrightarrow_{f_{TL}} \text{Protein}

List of biological sequence problems

How do genes encode proteins?

Protein structure

The genetic code

The genetic code must use at least 3 nucleotides to encode all 20 amino acids, but will have some redundancy (43=644^3 = 64).

What is the function of a protein?

Protein sequence determines function.

SequenceStructureFunction\text{Sequence} \longrightarrow \text{Structure} \longrightarrow \text{Function}

Genes evolved from common ancestor - homologs - often have similar functions. Thus, we can search for similar gene sequences to determine similar gene functions. We determine similar genes by finding common gene subsequences.

Longest common subsequence (LCS)

E.g. the LCS of AGCTAGCT and ATGCATGC is AGCAGC.

Recursive LCS

Three possibilities:

  1. Last letter from both strings part of LCS if identical; none of them otherwise
  2. Last letter from left string is part of LCS
  3. Last letter from right string is part of LCS

Thus we have

\begin{align*} \text{LCS}_{(x,y)} &= \max\{ \\ & \text{LCS}_{(x,y-1)}, \\ & \text{LCS}_{(x-1,y)}, \\ & \text{LCS}_{(x-1,y-1)} + (\text{LCS}[x] == \text{LCS}[y] \; ? \; 1 : 0) \\ & \} \end{align*}

Recursive LCS solves subproblems multiple times.

Dynamic programming (DP) LCS

Dynamic programming can be applied to recursive LCS such that subproblems are only solved once. We solve the matrix bottom-up. To record the actual subsequence, we also build a backtrack matrix.

// LCS(S, R)

a[0,0] = 0
for i in range(1, |S|):
    a[i,0] = 0
for j in range(1, |R|):
    a[0,j] = 0
for i in range(1, |S|):
    for j in range(1, |R|):
        a[i,j] = max{
            a[i-1,j],
            a[i,j-1],
            a[i-1,j-1] + (s[i] == r[j] ? 1 : 0)
        }
        // The following creates the backtrack matrix
        b[i,j] = {
            0 if a[i,j] == a[i-1,j],        // "up"
            1 if a[i,j] == a[i,j-1],        // "left"
            2 if a[i,j] == a[i-1,j-1] + 1   // "up-left"
        }
// PrintLCS(b, S, i, j):

if i == 0 or j == 0:
    return
if b[i,j] == 2:
    PrintLCS(b, S, i-1, j-1)
    print s[i]
else if b[i,j] == 0:
    PrintLCS(b, S, i-1, j)
else:
    PrintLCS(b, S, i, j-1)

Homework

  1. Compute the LCS DP matrix for LCS(ATTCGGTTA, TAGTGATG).
  2. Find the LCS without using a backtrack matrix.
  3. LCS(AGCTTAGCTG, TCGGATG) has multiple solutions (for example, TTG or ATG). Find an algorithm that returns all the LCSs. (Hint: a stack could be useful)

02 Alignment

Subsequences

A subsequence is an ordered sequence of characters from a string. They characters do not have to be consecutive in the original string. We may represent a subsequence as an ordered set of indexes.

A common subsequence of two strings is a subsequence of both strings.

Algorithms have already been discussed in chaper 01 Introduction. // TODO: Move and join

Edit distance (ED)

The Levenshtein edit distance between two strings S and R is the minimum number of edit operations (insertions, deletions, substitutions) for transforming S to R.

Recursive ED may be optimized by dynamic programming.

Edit operations model evolutionary events. Edit distance-based comparisons measure evolutionary distance. Alignments visualize evolutionary changes.

Global alignment (GA)

Find the best alignment between S and R given a scoring function.

Local alignment (LA)

Find the best local alignment between S and R given a scoring function. Output: Substrings of S and R with maximal global alignment of all substrings of S and R

Local alignment is useful when changes are not uniform across sequence

Local/Global alignment variants

03 Heuristic alignment

04 Substring indexes

Exact pattern matching

Text T
|T| = m
Ex: “I like bananas and potatoes”
Pattern P
|P| = n
Ex: “banana”

Find all occurrences of P in T
- Naive: O(mn)
- Possible: O(m + n)

05 Suffix tree applications

06 Sequencing

07 Alignment revisited

08 Motif Discovery

09 Markov Chains