Hidden Markov Models

  • November 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Hidden Markov Models as PDF for free.

More details

  • Words: 4,642
  • Pages: 104
Rohit Jhawer

Lecture 7: Molecular Biology Databases Hidden Markov Models

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Digitally signed by Rohit Jhawer DN: cn=Rohit Jhawer, o, ou, email=rohit_jhawer@hotm ail.com, c=IN Date: 2007.03.09 13:47:34 +05'30'

Putting it All Together • Each Database contains specific information • Biological organisms are a system • Like the system, each of these databases is interrelated CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

PROTEIN PIR

DISEASE ASSEMBLED GENOMES

LocusLink

SWISS-PROT

OMIM

GoldenPath

OMIA

WormBase

MOTIFS

TIGR

BLOCKS Pfam

GENOMIC DATA

Prosite

GenBank

ESTs dbEST

DDBJ

GENES

EMBL

RefSeq

unigene

AllGenes

SNPs

GENE EXPRESSION

dbSNP

STRUCTURE

PATHWAY

Stanford MGDB

KEGG

NetAffx

COG

ArrayExpress

PDB MMDB SCOP CECS 694-02 Introduction to Bioinformatics University of Louisville

GDB

LITERATURE PubMed Spring 2004 Dr. Eric Rouchka

http://www.ebi.ac.uk/2can/tutorials/database/srs_linking_databases.html CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Factors in Constructing a Biological Database • • • •

Fast retrieval of information Ability to store large amounts of data Ability to update data Choice of paradigm – Object oriented? – Relational? – Flat Files?

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

EnsEMBL • MySQL interface to EMBL databases • Connect to database using: • mysql –u anonymous –h kaka.sanger.ac.uk

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Hidden Markov Models (HMMs): – Probabilistic models for studying sequence of symbols – Can model matches, mismatches, insertions and deletions of symbols – Deeply rooted in speech recognition CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMMs in Speech Recognition • Problem: determine phonemes (words) spoken in a particular time frame • Everyone has different voices – Accents – Person having a cold – Differences in physiological development – Speed of speech CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMMs in Speech Recognition • Humans are able to distinguish speech a large percentage of the time • Computer: – Take a speech signal in a time frame – Fit it to a specific model of phonemes

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMMs in Speech Recognition • Humans probably do something similar – Consider the Sprint PCS commercials! – 200 oxen? Or 200 dachshund?

• Speech is fit into a model of words we have learned

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMMs in Sequence Analysis • Given an amino acid sequence, determine protein family • Similar to fitting a speech signal to a word model • Insertions, deletions, and substitutions are allowed CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain • Probabilistic Model – Generates sequence – Probability of symbol depends upon the previous symbol

• Examples – Traffic Light – Dinucleotides

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain • Consider a random DNA sequence: – Four states: A, C, G, T – Given a certain state, there is a transition to the next state associate with a probability

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain

A

C

start

END

G

CECS 694-02 Introduction to Bioinformatics University of Louisville

T

Spring 2004 Dr. Eric Rouchka

Markov Chain • Key Property: – Probability of a symbol S at position p (Sp) is dependent only upon the previous symbol S at position p-1 (Sp-1)

• Biological Example – CpG islands: rich in the dinucleotide CG – High CG: More gene rich CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

CpG Islands • Methylation: process that converts C to T with a high probability when CG is encountered – TpG dinucleotides overrepresented – CpG dinucleotides underrepresented

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • Ignore Start/Stop states for now • 16 different transitions – Four states; four transitions from each state

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • Study of genomic DNA reveals the following 16 transition probabilities:

• The edges of the system can be assigned these weights CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • Model describes regions where methylation occurs • Some regions of the human genome are protected from methylation – Promoter regions of genes

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • Study of protected regions produces the following probabilities (CpG islands):

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • A Second CpG Island model can be created, using these new probabilities

Given a sequence, we can find probability it is in CpG Island model, or in non-CpG Island model CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • property of a Markov chain – probability of a symbol S at position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1)

• • to find probability a sequence fits a model, multiply the conditional probabilities: • • P(x) = P(xL|x L-1)P(x L-1|x L-2)…P(x2|x1)P(x1)

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands P(x) = P(xL|x L-1)P(x L-1|x L-2)…P(x2|x1)P(x1) Can be rewritten as: L

P ( x ) = P ( x1 )∏ a xi −1 xi i =2

Where

a xi −1xi

is a transition probability

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • Now consider the sequence: GGCGACG • • The probability for this sequence is as follows: • • P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)

• • Conditional (transition) probabilities are different based on the model CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands •

P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)

• • For the non-CpG model can be calculated as: • •

(0.20)(0.298)(0.246)(0.078)(0.248)(0.205)(0.078) = 0.000000453499

• • For the CpG model can be calculated as: • •

(0.25)(0.375)(0.339)(0.274)(0.161)(0.274)(0.274)(0.125) = 0.0010526

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Markov Chain for CpG Islands • More likely sequence fits the CpG model • Note how quickly probability goes to zero – Shows importance of using log statistics

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Discriminating Between two Models • How different are the two models? • Is there enough difference in the information? • To discriminate, take a log ratio of probabilities and create a new table • log2(P(x|CpG model) / P(x| non-CpG model))

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Discriminating Between two Models • Sequence with positive score belongs to CpG Islands model • Sequence with negative score belongs to non-CpG Islands model

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Discriminating Between two Models

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Two separate models: – Non-CpG – CpG

• Sequence belongs to either one or the other model

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Consider: given a long piece of genomic DNA • Some pieces are in CpG islands; some are not • Need a way to combine the two models CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Solution: Take two models; include transitions to switch back and forth between two models • Problem is more complicated – Two states corresponding to each nucleotide symbol

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • State emitting symbol is now “hidden” since there is no longer 1 to 1 correspondence • State can emit a symbol based on a probabilistic distribution – CpG island example: emission is 100% or 0%

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Using emission probabilities, joint probability of seeing observed sequence x and path Pi through the HMM is: L

P( x, π )= aoπ 1 ∏ eπ i ( xi )aπ iπ i +1 i =1

• Where eπ ( xi ) is probability of emitting residue x at i position i, when at state πi in the path

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Hidden Markov Models • Problem: – Path is hardly ever known

• Calculate: – Most Probable Path (Viterbi Algorithm) – Probability sequence fits HMM (forward algorithm) – Posterior probability (backward algorithm)

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Most probable path through an HMM • Can be calculated recursively • Implementation: Dynamic Programming – Initialization; Recursive Step; Trace-Back

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • To find maximum path through a state, consider all transitions into state, and emission probability for the state

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Dynamic Programming Matrix

ÅSTATESÆ

ÅSEQUENCE CHARACTERSÆ

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Initialization • • set the probability of a path of length 0 ending at state 0 to 1, and the probability of all other paths of length 0 ending at all other states equal to 0: • • v0(0) = 1; vk(0) = 0 for k > 0 • CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Recursion • • go through the whole length of the input sequence, one at a time, calculate maximum probability for being in a state, l, given the current input i: • • • vl(i) = el(xi)maxk(vk(i-1)akl) • CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Recursion • • recursion step keeps track of a pointer for getting to each state • traceback can be performed to reconstruct the path with the maximum probability: • • ptri(l) = argmaxk(vk(i-1)akl) CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Termination • • probability of the sequence and the maximum path set to be maximum value at the final position • pointer for maximal path is set at that point; a termination step, ak0 is introduced:

• • P(x,π* ) = maxk(vk(L)ak0) • • πL* = argmaxk(vk(L)ak0) CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Viterbi Algorithm • Traceback • • traceback is similar sequence alignment. • start at the final state; trace back through the set of transitions that led to that state. recurse back until we get to the first state:

• • (i = L..1): πi-1* = ptri(πi*) CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Probability of Sequence • determine probability of a sequence given a particular HMM • could be done by summing the probability over all possible paths • number of potential paths increases exponentially with length of the sequence -brute force method is not possible. CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Forward Algorithm • Once again, Dynamic Programming can solve this problem! • the probability of the observed sequence up to and including a point xi, where the path ends at state πi , can be calculated similar to the Viterbi algorithm: • • fk(i)=P(x1...xi, πi=k) CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Forward Algorithm • Initialization • • identical to the Viterbi algorithm, except now the v’s (maximum probable path) are replaced by f’s (overall probability) • • f0(0) = 1; fk(0) = 0 for k > 0 CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Forward Algorithm • Recursion • • Go through length of the input sequence, one at a time, calculating overall probability for being in a state, l, given the current input i. in the forward algorithm, we will consider sum over all possibilities:

f l (i ) = el ( xi )∑ f k (i − 1)a kl k

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Forward Algorithm • Termination • • The termination step is an extension of the recursion step with the difference that a terminating transition is used. The overall probability of the sequence being described by the HMM is then given in the final cell:

P ( x) = ∑ f k ( L)a k 0 k

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Backward Algorithm • In addition to determining the maximum path and overall probability, may want to determine probability of being in a particular state • Example: Is a sequence emitted from a CpG model, or from a non CpG model

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Backward Algorithm • Backward algorithm calculates posterior probability: • P(πi = k | x) • • similar to the traceback step in dynamic programming. start at the final step and work back to the beginning CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Backward Algorithm • Initialization • • value of the posterior probability for each of the final transitions is assigned the value of the final transition probability: • • bk(L) = ak0 for all k CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Backward Algorithm • Recursion • • The recursion works backwards (i = L-1, ..., 1) •

bk (i ) = ∑ a kl el ( xi +1 )bl (i + 1) l

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Backward Algorithm • Termination • • The termination step reports back the same value as the forward algorithm, calculated in the reverse direction:

P( x) = ∑ a0l el ( x1 )bl (1) l

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Parameter Estimation • How are the model and probabilities calculated? • Two steps: – Assignment of transition and emission probabilities – Design of the structure

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the state sequence is known • Consider a multiple alignment of a protein family • the transition probabilities and emission probabilities are calculated using a maximum likelihood estimation – observed frequencies of residues in a column – observed frequencies of transitioning from one column to the next.

• CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the state sequence is known • to deal with insufficient data and overtraining of models, pseudocounts should be incorporated

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the Paths are unknown – Baum-Welch • If state sequence is unknown, iterative method is used • Baum-Welch: transition and emission probabilities calculated as the expected number of times each transition or emission is used, given the training data • log likelihood computed; transition and emission probabilities recalculated iteratively CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the Paths are unknown – Baum-Welch • Initialization: – Pick arbitrary model parameters

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the Paths are unknown – Baum-Welch • Recurrence: – Set all the transition and emission variables to their pseudocount values – For each sequence j = 1..n: • Calculate the forward value for sequence j • Calculate the backward value for sequence j • Add the contribution of sequence j to the transition and emission variables

– Calculate the new model parameters using maximum likelihood – Calculate the new log likelihood of the model

• CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Estimation when the Paths are unknown – Baum-Welch • Termination: – Stop if the change in log likelihood is less than some predefined threshold

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM Model Structure • Duration Modeling • • Complex length distributions can be modeled by introducing several states with transitions between one another: • State for one or • More emissions

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM Model Structure • Six or more characters

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM Model Structure • Two to seven emissions

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM Model Structure • Silent States • • Silent states will allow for arbitrary deletions of a chain states.

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM Model Structure • Insertion States • it may be necessary to emit residues between matching states

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMMs in Computational Biology • • • • •

Multiple sequence alignments sequence profiles gene prediction protein structure prediction protein family classification

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Programs to Calculate HMMs HMMER home page: • http://hmmer.wustl.edu

• SAM home page: • http://www.cse.ucsd.edu/research/compbio/sam.html

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Pairwise Alignments Using HMMs • finite state machine can be created for pairwise alignment using five states: – start state – stop state – match/mismatch state – insertion state (for sequence 1) – insertion state (for sequence 2)

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Pairwise Alignments Using Five State FSA

X

M

END

BEG

Y

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Conversion of FSA to HMM • assigning probabilities for transitions and emissions at states M, X, and Y, converts FSA into a pair HMM • A pair HMM emits a pair of sequences • This pair HMM will generate an aligned pair of sequences CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Conversion of FSA to HMM • Chapter 4 (Durbin, et al) discusses HMMs and pairwise alignments • we will concentrate on where the real power of HMMs lies: in determining families of sequences based on multiple alignments

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Profile HMMs and Sequence Alignments • HMMs can create a model of a sequence family • Such a relationship would allows concentration on conserved features • Profile HMMs are similar to sequence profiles CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Profile HMMs and Sequence Alignments • profile HMMs begin with a multiple alignment that has been correctly calculated • multiple alignment can then be used to build a model to score potential matches to new sequences • . For example, consider the following multiple alignment of globin sequences: CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Alignment of globin sequences

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

PSSMs and HMMs • consider only the BLOCKS with no insertion and deletion events • a position specific scoring matrix (PSSM) can be calculated for the BLOCK

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Converting a PSSM to a HMM • Converting a PSSM for a block to a HMM is trivial, due to the absence of insertions and deletions • begin state, end state and five match states, one for each column • Match states are represented by squares in a diagram of a profile HMM CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Converting a PSSM to a HMM • Transition probabilities between match states in the HMM would be 1 (can only transition to the next match state) • emission probabilities for match state calculated based on the PSSM

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Resulting HMM Structure

B

E

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Alignment to HMM • Calculating an alignment to this HMM is trival, due to the absence of insertion/deletion states

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Adding Insertions/Deletions to HMMs •

For a profile HMM, insertions and deletions are treated separately



For insertions a new set of insert states is inserted, denoted by diamonds



Transitions are needed from: – last match state in a block to the insertion – insertion to itself (to allow for multiple length insertions) – insertion to the first match state of the next block

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HMM With an Insertion

B

CECS 694-02 Introduction to Bioinformatics University of Louisville

E

Spring 2004 Dr. Eric Rouchka

HMM With an Insertion • score of a gap of length k:

log aM i I j +( k − 1) log a I j I j + log a I j M j + 1

B

CECS 694-02 Introduction to Bioinformatics University of Louisville

E

Spring 2004 Dr. Eric Rouchka

Deletions from an HMM • Handled by introducing silent states between matches • do not emit a residue (thus, the name “silent” state) • denoted by circles in a HMM CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Deletions from an HMM • Example HMM emitting a sequence between 0 and 3 residues long:

B

CECS 694-02 Introduction to Bioinformatics University of Louisville

E

Spring 2004 Dr. Eric Rouchka

Profile HMM • A HMM with match states, insertion states, and delete states • first introduced by David Haussler, et al and Anders Krogh, et al in the mid1990s

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Profile HMM

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Profile HMMs for Pairwise Alignments • used to perform generalized pairwise alignments, similar to the dynamic programming approach • emission probabilities are match/mismatch probabilities or probabilities of matching two amino acids in a scoring matrix CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Profile HMMs for Pairwise Alignments • transition probabilities from a match state to an insert or delete state is equivalent to a gap-open score • transition probabilities within an insertion state or between delete states equivalent to a gap extension penalty

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Deriving profile HMMs from multiple alignments • Multiple sequence alignments can be used to create profile HMMs • models describing the consensus sequence of a sequence family

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Choosing Model Length • corresponds to a decision on which columns to assign to match states, and which to assign to insert states • simple rule: columns more than half gap characters should be modeled by inserts

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Choosing Model Length • Considering the alignment: – columns 1, 2, 3, 6, and 7 to be match states – columns 4 and 5 to be insert states

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Assigning Probabilities • problem is to assign the transition and emission probability parameters • simplest solution is to take the frequencies

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Transition Frequencies • each state has three possible transitions leading from it • transition probability from state k to state l can be calculated as:

Akl a kl = ∑ Akl ' l'

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Emission Frequencies • emission probability of residue a at state k can be calculated:

Ek (a) ek ( a ) = ∑ Ek (a' ) a'

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Laplace’s Rule • difficulty in using straight frequencies when considering multiple alignments • if not enough sequences are used – some residues will be underrepresented (or not represented at all) – others will be overrepresented

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Laplace’s Rule • pseudocounts are introduced (much like in problem 3 of the homework) • The easiest form of pseudocounts is Laplace’s rule, which adds one to each count

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Transition Probabilities • From column 1 to column 2: – 10 transitions to next match state – 0 transitions to an insertion state – 0 transitions to a delete state

• Probabilities Using Laplace’s rule: – 11/13 aM1M2 – 1/13 aM1I1 – 1/13 aM1D2

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Transition Probabilities • The probabilities are the same for the transitions from the second column to the third column • fourth and fifth columns are insertion columns • next set of match transitions will be from column three to column four – Match: 5/13 – Deletion: 2/13 – Insertion: 6/13

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Transition Probabilities • Remembering that we have a total of five match states, the probabilities can be stored in the following table:

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Emission Probabilities • emission probabilities can be calculated for the match states in a similar fashion • In the first column, there are 9 H’s, and 1 F. Using Laplace’s rule, this becomes 10 H’s, 2 F’s, and 1 each of the remaining 18 amino acids. • probabilities are 10/30 H; 2/30 F; 1/30 for each of the remaining 18 amino acids.

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Searching Profile HMMs • sequences can be searched against the HMM to detect whether or not they belong to a particular family of sequences described by the profile HMM

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Searching Profile HMMs • Using a global alignment, the probability of the most probable alignment and sequence can be determined using the Viterbi algorithm (yielding P(x, PI*|M)).

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Searching Profile HMMs • full probability of a sequence aligning to the profile HMM determined using the forward algorithm (yielding P(x|M)). • Viterbi equations specifically designed for profile HMMs are given on page 109, Durbin et al. The forward equations are given on page 110.

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

HOMEWORK • Project #2 Due 3/4/2004 • Midterm: 3/11/2004 • Read Durbin, et al., chapter 3, 4, 5

CECS 694-02 Introduction to Bioinformatics University of Louisville

Spring 2004 Dr. Eric Rouchka

Related Documents