Rohit Jhawer
Lecture 7: Molecular Biology Databases Hidden Markov Models
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Digitally signed by Rohit Jhawer DN: cn=Rohit Jhawer, o, ou, email=rohit_jhawer@hotm ail.com, c=IN Date: 2007.03.09 13:47:34 +05'30'
Putting it All Together • Each Database contains specific information • Biological organisms are a system • Like the system, each of these databases is interrelated CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
PROTEIN PIR
DISEASE ASSEMBLED GENOMES
LocusLink
SWISS-PROT
OMIM
GoldenPath
OMIA
WormBase
MOTIFS
TIGR
BLOCKS Pfam
GENOMIC DATA
Prosite
GenBank
ESTs dbEST
DDBJ
GENES
EMBL
RefSeq
unigene
AllGenes
SNPs
GENE EXPRESSION
dbSNP
STRUCTURE
PATHWAY
Stanford MGDB
KEGG
NetAffx
COG
ArrayExpress
PDB MMDB SCOP CECS 694-02 Introduction to Bioinformatics University of Louisville
GDB
LITERATURE PubMed Spring 2004 Dr. Eric Rouchka
http://www.ebi.ac.uk/2can/tutorials/database/srs_linking_databases.html CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Factors in Constructing a Biological Database • • • •
Fast retrieval of information Ability to store large amounts of data Ability to update data Choice of paradigm – Object oriented? – Relational? – Flat Files?
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
EnsEMBL • MySQL interface to EMBL databases • Connect to database using: • mysql –u anonymous –h kaka.sanger.ac.uk
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Hidden Markov Models (HMMs): – Probabilistic models for studying sequence of symbols – Can model matches, mismatches, insertions and deletions of symbols – Deeply rooted in speech recognition CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMMs in Speech Recognition • Problem: determine phonemes (words) spoken in a particular time frame • Everyone has different voices – Accents – Person having a cold – Differences in physiological development – Speed of speech CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMMs in Speech Recognition • Humans are able to distinguish speech a large percentage of the time • Computer: – Take a speech signal in a time frame – Fit it to a specific model of phonemes
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMMs in Speech Recognition • Humans probably do something similar – Consider the Sprint PCS commercials! – 200 oxen? Or 200 dachshund?
• Speech is fit into a model of words we have learned
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMMs in Sequence Analysis • Given an amino acid sequence, determine protein family • Similar to fitting a speech signal to a word model • Insertions, deletions, and substitutions are allowed CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain • Probabilistic Model – Generates sequence – Probability of symbol depends upon the previous symbol
• Examples – Traffic Light – Dinucleotides
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain • Consider a random DNA sequence: – Four states: A, C, G, T – Given a certain state, there is a transition to the next state associate with a probability
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain
A
C
start
END
G
CECS 694-02 Introduction to Bioinformatics University of Louisville
T
Spring 2004 Dr. Eric Rouchka
Markov Chain • Key Property: – Probability of a symbol S at position p (Sp) is dependent only upon the previous symbol S at position p-1 (Sp-1)
• Biological Example – CpG islands: rich in the dinucleotide CG – High CG: More gene rich CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
CpG Islands • Methylation: process that converts C to T with a high probability when CG is encountered – TpG dinucleotides overrepresented – CpG dinucleotides underrepresented
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • Ignore Start/Stop states for now • 16 different transitions – Four states; four transitions from each state
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • Study of genomic DNA reveals the following 16 transition probabilities:
• The edges of the system can be assigned these weights CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • Model describes regions where methylation occurs • Some regions of the human genome are protected from methylation – Promoter regions of genes
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • Study of protected regions produces the following probabilities (CpG islands):
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • A Second CpG Island model can be created, using these new probabilities
Given a sequence, we can find probability it is in CpG Island model, or in non-CpG Island model CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • property of a Markov chain – probability of a symbol S at position p (Sp) depends only upon the previous symbol S at position p – 1 (Sp-1)
• • to find probability a sequence fits a model, multiply the conditional probabilities: • • P(x) = P(xL|x L-1)P(x L-1|x L-2)…P(x2|x1)P(x1)
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands P(x) = P(xL|x L-1)P(x L-1|x L-2)…P(x2|x1)P(x1) Can be rewritten as: L
P ( x ) = P ( x1 )∏ a xi −1 xi i =2
Where
a xi −1xi
is a transition probability
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • Now consider the sequence: GGCGACG • • The probability for this sequence is as follows: • • P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)
• • Conditional (transition) probabilities are different based on the model CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands •
P(G)P(G|G)P(C|G)P(G|C)P(A|G)P(C|A)P(G|C)
• • For the non-CpG model can be calculated as: • •
(0.20)(0.298)(0.246)(0.078)(0.248)(0.205)(0.078) = 0.000000453499
• • For the CpG model can be calculated as: • •
(0.25)(0.375)(0.339)(0.274)(0.161)(0.274)(0.274)(0.125) = 0.0010526
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Markov Chain for CpG Islands • More likely sequence fits the CpG model • Note how quickly probability goes to zero – Shows importance of using log statistics
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Discriminating Between two Models • How different are the two models? • Is there enough difference in the information? • To discriminate, take a log ratio of probabilities and create a new table • log2(P(x|CpG model) / P(x| non-CpG model))
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Discriminating Between two Models • Sequence with positive score belongs to CpG Islands model • Sequence with negative score belongs to non-CpG Islands model
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Discriminating Between two Models
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Two separate models: – Non-CpG – CpG
• Sequence belongs to either one or the other model
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Consider: given a long piece of genomic DNA • Some pieces are in CpG islands; some are not • Need a way to combine the two models CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Solution: Take two models; include transitions to switch back and forth between two models • Problem is more complicated – Two states corresponding to each nucleotide symbol
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • State emitting symbol is now “hidden” since there is no longer 1 to 1 correspondence • State can emit a symbol based on a probabilistic distribution – CpG island example: emission is 100% or 0%
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Using emission probabilities, joint probability of seeing observed sequence x and path Pi through the HMM is: L
P( x, π )= aoπ 1 ∏ eπ i ( xi )aπ iπ i +1 i =1
• Where eπ ( xi ) is probability of emitting residue x at i position i, when at state πi in the path
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Hidden Markov Models • Problem: – Path is hardly ever known
• Calculate: – Most Probable Path (Viterbi Algorithm) – Probability sequence fits HMM (forward algorithm) – Posterior probability (backward algorithm)
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Most probable path through an HMM • Can be calculated recursively • Implementation: Dynamic Programming – Initialization; Recursive Step; Trace-Back
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • To find maximum path through a state, consider all transitions into state, and emission probability for the state
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Dynamic Programming Matrix
ÅSTATESÆ
ÅSEQUENCE CHARACTERSÆ
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Initialization • • set the probability of a path of length 0 ending at state 0 to 1, and the probability of all other paths of length 0 ending at all other states equal to 0: • • v0(0) = 1; vk(0) = 0 for k > 0 • CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Recursion • • go through the whole length of the input sequence, one at a time, calculate maximum probability for being in a state, l, given the current input i: • • • vl(i) = el(xi)maxk(vk(i-1)akl) • CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Recursion • • recursion step keeps track of a pointer for getting to each state • traceback can be performed to reconstruct the path with the maximum probability: • • ptri(l) = argmaxk(vk(i-1)akl) CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Termination • • probability of the sequence and the maximum path set to be maximum value at the final position • pointer for maximal path is set at that point; a termination step, ak0 is introduced:
• • P(x,π* ) = maxk(vk(L)ak0) • • πL* = argmaxk(vk(L)ak0) CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Viterbi Algorithm • Traceback • • traceback is similar sequence alignment. • start at the final state; trace back through the set of transitions that led to that state. recurse back until we get to the first state:
• • (i = L..1): πi-1* = ptri(πi*) CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Probability of Sequence • determine probability of a sequence given a particular HMM • could be done by summing the probability over all possible paths • number of potential paths increases exponentially with length of the sequence -brute force method is not possible. CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Forward Algorithm • Once again, Dynamic Programming can solve this problem! • the probability of the observed sequence up to and including a point xi, where the path ends at state πi , can be calculated similar to the Viterbi algorithm: • • fk(i)=P(x1...xi, πi=k) CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Forward Algorithm • Initialization • • identical to the Viterbi algorithm, except now the v’s (maximum probable path) are replaced by f’s (overall probability) • • f0(0) = 1; fk(0) = 0 for k > 0 CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Forward Algorithm • Recursion • • Go through length of the input sequence, one at a time, calculating overall probability for being in a state, l, given the current input i. in the forward algorithm, we will consider sum over all possibilities:
f l (i ) = el ( xi )∑ f k (i − 1)a kl k
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Forward Algorithm • Termination • • The termination step is an extension of the recursion step with the difference that a terminating transition is used. The overall probability of the sequence being described by the HMM is then given in the final cell:
P ( x) = ∑ f k ( L)a k 0 k
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Backward Algorithm • In addition to determining the maximum path and overall probability, may want to determine probability of being in a particular state • Example: Is a sequence emitted from a CpG model, or from a non CpG model
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Backward Algorithm • Backward algorithm calculates posterior probability: • P(πi = k | x) • • similar to the traceback step in dynamic programming. start at the final step and work back to the beginning CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Backward Algorithm • Initialization • • value of the posterior probability for each of the final transitions is assigned the value of the final transition probability: • • bk(L) = ak0 for all k CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Backward Algorithm • Recursion • • The recursion works backwards (i = L-1, ..., 1) •
bk (i ) = ∑ a kl el ( xi +1 )bl (i + 1) l
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Backward Algorithm • Termination • • The termination step reports back the same value as the forward algorithm, calculated in the reverse direction:
P( x) = ∑ a0l el ( x1 )bl (1) l
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Parameter Estimation • How are the model and probabilities calculated? • Two steps: – Assignment of transition and emission probabilities – Design of the structure
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the state sequence is known • Consider a multiple alignment of a protein family • the transition probabilities and emission probabilities are calculated using a maximum likelihood estimation – observed frequencies of residues in a column – observed frequencies of transitioning from one column to the next.
• CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the state sequence is known • to deal with insufficient data and overtraining of models, pseudocounts should be incorporated
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the Paths are unknown – Baum-Welch • If state sequence is unknown, iterative method is used • Baum-Welch: transition and emission probabilities calculated as the expected number of times each transition or emission is used, given the training data • log likelihood computed; transition and emission probabilities recalculated iteratively CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the Paths are unknown – Baum-Welch • Initialization: – Pick arbitrary model parameters
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the Paths are unknown – Baum-Welch • Recurrence: – Set all the transition and emission variables to their pseudocount values – For each sequence j = 1..n: • Calculate the forward value for sequence j • Calculate the backward value for sequence j • Add the contribution of sequence j to the transition and emission variables
– Calculate the new model parameters using maximum likelihood – Calculate the new log likelihood of the model
• CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Estimation when the Paths are unknown – Baum-Welch • Termination: – Stop if the change in log likelihood is less than some predefined threshold
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM Model Structure • Duration Modeling • • Complex length distributions can be modeled by introducing several states with transitions between one another: • State for one or • More emissions
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM Model Structure • Six or more characters
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM Model Structure • Two to seven emissions
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM Model Structure • Silent States • • Silent states will allow for arbitrary deletions of a chain states.
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM Model Structure • Insertion States • it may be necessary to emit residues between matching states
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMMs in Computational Biology • • • • •
Multiple sequence alignments sequence profiles gene prediction protein structure prediction protein family classification
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Programs to Calculate HMMs HMMER home page: • http://hmmer.wustl.edu
• SAM home page: • http://www.cse.ucsd.edu/research/compbio/sam.html
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Pairwise Alignments Using HMMs • finite state machine can be created for pairwise alignment using five states: – start state – stop state – match/mismatch state – insertion state (for sequence 1) – insertion state (for sequence 2)
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Pairwise Alignments Using Five State FSA
X
M
END
BEG
Y
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Conversion of FSA to HMM • assigning probabilities for transitions and emissions at states M, X, and Y, converts FSA into a pair HMM • A pair HMM emits a pair of sequences • This pair HMM will generate an aligned pair of sequences CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Conversion of FSA to HMM • Chapter 4 (Durbin, et al) discusses HMMs and pairwise alignments • we will concentrate on where the real power of HMMs lies: in determining families of sequences based on multiple alignments
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Profile HMMs and Sequence Alignments • HMMs can create a model of a sequence family • Such a relationship would allows concentration on conserved features • Profile HMMs are similar to sequence profiles CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Profile HMMs and Sequence Alignments • profile HMMs begin with a multiple alignment that has been correctly calculated • multiple alignment can then be used to build a model to score potential matches to new sequences • . For example, consider the following multiple alignment of globin sequences: CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Alignment of globin sequences
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
PSSMs and HMMs • consider only the BLOCKS with no insertion and deletion events • a position specific scoring matrix (PSSM) can be calculated for the BLOCK
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Converting a PSSM to a HMM • Converting a PSSM for a block to a HMM is trivial, due to the absence of insertions and deletions • begin state, end state and five match states, one for each column • Match states are represented by squares in a diagram of a profile HMM CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Converting a PSSM to a HMM • Transition probabilities between match states in the HMM would be 1 (can only transition to the next match state) • emission probabilities for match state calculated based on the PSSM
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Resulting HMM Structure
B
E
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Alignment to HMM • Calculating an alignment to this HMM is trival, due to the absence of insertion/deletion states
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Adding Insertions/Deletions to HMMs •
For a profile HMM, insertions and deletions are treated separately
•
For insertions a new set of insert states is inserted, denoted by diamonds
•
Transitions are needed from: – last match state in a block to the insertion – insertion to itself (to allow for multiple length insertions) – insertion to the first match state of the next block
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HMM With an Insertion
B
CECS 694-02 Introduction to Bioinformatics University of Louisville
E
Spring 2004 Dr. Eric Rouchka
HMM With an Insertion • score of a gap of length k:
log aM i I j +( k − 1) log a I j I j + log a I j M j + 1
B
CECS 694-02 Introduction to Bioinformatics University of Louisville
E
Spring 2004 Dr. Eric Rouchka
Deletions from an HMM • Handled by introducing silent states between matches • do not emit a residue (thus, the name “silent” state) • denoted by circles in a HMM CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Deletions from an HMM • Example HMM emitting a sequence between 0 and 3 residues long:
B
CECS 694-02 Introduction to Bioinformatics University of Louisville
E
Spring 2004 Dr. Eric Rouchka
Profile HMM • A HMM with match states, insertion states, and delete states • first introduced by David Haussler, et al and Anders Krogh, et al in the mid1990s
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Profile HMM
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Profile HMMs for Pairwise Alignments • used to perform generalized pairwise alignments, similar to the dynamic programming approach • emission probabilities are match/mismatch probabilities or probabilities of matching two amino acids in a scoring matrix CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Profile HMMs for Pairwise Alignments • transition probabilities from a match state to an insert or delete state is equivalent to a gap-open score • transition probabilities within an insertion state or between delete states equivalent to a gap extension penalty
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Deriving profile HMMs from multiple alignments • Multiple sequence alignments can be used to create profile HMMs • models describing the consensus sequence of a sequence family
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Choosing Model Length • corresponds to a decision on which columns to assign to match states, and which to assign to insert states • simple rule: columns more than half gap characters should be modeled by inserts
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Choosing Model Length • Considering the alignment: – columns 1, 2, 3, 6, and 7 to be match states – columns 4 and 5 to be insert states
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Assigning Probabilities • problem is to assign the transition and emission probability parameters • simplest solution is to take the frequencies
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Transition Frequencies • each state has three possible transitions leading from it • transition probability from state k to state l can be calculated as:
Akl a kl = ∑ Akl ' l'
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Emission Frequencies • emission probability of residue a at state k can be calculated:
Ek (a) ek ( a ) = ∑ Ek (a' ) a'
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Laplace’s Rule • difficulty in using straight frequencies when considering multiple alignments • if not enough sequences are used – some residues will be underrepresented (or not represented at all) – others will be overrepresented
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Laplace’s Rule • pseudocounts are introduced (much like in problem 3 of the homework) • The easiest form of pseudocounts is Laplace’s rule, which adds one to each count
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Transition Probabilities • From column 1 to column 2: – 10 transitions to next match state – 0 transitions to an insertion state – 0 transitions to a delete state
• Probabilities Using Laplace’s rule: – 11/13 aM1M2 – 1/13 aM1I1 – 1/13 aM1D2
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Transition Probabilities • The probabilities are the same for the transitions from the second column to the third column • fourth and fifth columns are insertion columns • next set of match transitions will be from column three to column four – Match: 5/13 – Deletion: 2/13 – Insertion: 6/13
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Transition Probabilities • Remembering that we have a total of five match states, the probabilities can be stored in the following table:
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Emission Probabilities • emission probabilities can be calculated for the match states in a similar fashion • In the first column, there are 9 H’s, and 1 F. Using Laplace’s rule, this becomes 10 H’s, 2 F’s, and 1 each of the remaining 18 amino acids. • probabilities are 10/30 H; 2/30 F; 1/30 for each of the remaining 18 amino acids.
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Searching Profile HMMs • sequences can be searched against the HMM to detect whether or not they belong to a particular family of sequences described by the profile HMM
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Searching Profile HMMs • Using a global alignment, the probability of the most probable alignment and sequence can be determined using the Viterbi algorithm (yielding P(x, PI*|M)).
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
Searching Profile HMMs • full probability of a sequence aligning to the profile HMM determined using the forward algorithm (yielding P(x|M)). • Viterbi equations specifically designed for profile HMMs are given on page 109, Durbin et al. The forward equations are given on page 110.
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka
HOMEWORK • Project #2 Due 3/4/2004 • Midterm: 3/11/2004 • Read Durbin, et al., chapter 3, 4, 5
CECS 694-02 Introduction to Bioinformatics University of Louisville
Spring 2004 Dr. Eric Rouchka