Hidden Markov Model Applications in Bioinformatics

Outline

  • Gradient Descent method for learning
  • Profile HMMs for protein families
  • Multiple protein sequence alignments
  • Protein classification
  • Structural analysis and pattern discovery

Gradient Descent

We maximize the log-likelihood,

$$\log P(O \mid w) \to \max,$$

where the variables of the problem are

$$w = (\ldots, e_{iX}, \ldots, t_{ij}, \ldots),$$

subject to

$$\sum_{X} e_{iX} = 1 \text{ for all } i \in E \qquad \text{and} \qquad \sum_{j \in S} t_{ij} = 1 \text{ for all } i \in S.$$

We reparameterize the variables:

$$e_{iX} = \frac{\exp(v_{iX})}{\sum_{Y} \exp(v_{iY})}, \qquad t_{ij} = \frac{\exp(v_{ij})}{\sum_{k} \exp(v_{ik})}.$$

With these variables the constraints hold automatically, and the emission and transition probabilities always lie between 0 and 1.

We will need the following formulae:

$$\frac{\partial e_{iX}}{\partial v_{iX}} = e_{iX}(1 - e_{iX}) \qquad \text{and} \qquad \frac{\partial e_{iY}}{\partial v_{iX}} = -e_{iX} e_{iY} \quad (Y \neq X).$$

These can be obtained by simple calculation from

$$e_{iX} = \frac{\exp(v_{iX})}{\sum_{Y} \exp(v_{iY})}, \qquad t_{ij} = \frac{\exp(v_{ij})}{\sum_{k} \exp(v_{ik})}.$$
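The two derivative formulae are easy to verify numerically. Below is a minimal sketch in Python (the names v and e mirror the slides' v_iX and e_iX; the concrete numbers are illustrative) comparing the analytic Jacobian of the softmax reparameterization with a finite-difference estimate.

    # Numerical check of the reparameterization and its derivatives.
    import numpy as np

    def softmax(v):
        """e_X = exp(v_X) / sum_Y exp(v_Y), shifted for numerical stability."""
        z = np.exp(v - v.max())
        return z / z.sum()

    v = np.array([0.2, -1.0, 0.5, 0.0])   # v_iX for one state i, |A| = 4
    e = softmax(v)                        # emissions: each in (0, 1), sum to 1

    # Analytic Jacobian: d e_Y / d v_X = e_X (1 - e_X) if Y == X, else -e_X e_Y.
    jac = np.diag(e) - np.outer(e, e)

    # Central-difference estimate of the same derivatives.
    eps, I = 1e-6, np.eye(len(v))
    num = np.column_stack([(softmax(v + eps * I[X]) - softmax(v - eps * I[X])) / (2 * eps)
                           for X in range(len(v))])
    assert np.allclose(jac, num, atol=1e-8)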


We need to solve the maximization problem

$$\log P(O \mid w) \to \max$$

with the new variables $(\ldots, v_{iX}, \ldots, v_{ij}, \ldots)$. To do that we need to calculate the coordinates of the gradient,

$$\frac{\partial \log P(O \mid w)}{\partial v_{iX}} \quad \text{for all } i \in E \text{ and } X \in A,$$

$$\frac{\partial \log P(O \mid w)}{\partial v_{ij}} \quad \text{for all } i \in S \text{ and } j \in N(i).$$

By the chain rule,

$$\frac{\partial \log P(O \mid w)}{\partial v_{iX}} = \sum_{Y} \frac{\partial \log P(O \mid w)}{\partial e_{iY}} \frac{\partial e_{iY}}{\partial v_{iX}} = \frac{\partial \log P(O \mid w)}{\partial e_{iX}} \frac{\partial e_{iX}}{\partial v_{iX}} + \sum_{Y \neq X} \frac{\partial \log P(O \mid w)}{\partial e_{iY}} \frac{\partial e_{iY}}{\partial v_{iX}}.$$

Using $\partial \log P(O \mid w) / \partial e_{iY} = n_{iY} / e_{iY}$, where n_iY is the expected number of times state i emits letter Y (computed by the forward-backward algorithm), this becomes

$$\frac{n_{iX}}{e_{iX}} e_{iX}(1 - e_{iX}) + \sum_{Y \neq X} \frac{n_{iY}}{e_{iY}} (-e_{iX} e_{iY}) = n_{iX} - n_{iX} e_{iX} + \sum_{Y \neq X} (-n_{iY} e_{iX})$$

$$= n_{iX} + \sum_{Y} (-n_{iY} e_{iX}) = n_{iX} - e_{iX} \sum_{Y} n_{iY} = n_{iX} - e_{iX} n_i.$$


Gradient Descent method

We obtained that

$$\frac{\partial \log P(O \mid w)}{\partial v_{iX}} = n_{iX} - n_i e_{iX},$$

and we can obtain similarly that

$$\frac{\partial \log P(O \mid w)}{\partial v_{ij}} = n_{ij} - n_i t_{ij}.$$

The general step of the Gradient Descent method, i.e. how we move from the point $(\ldots, v_{iX}^k, \ldots, v_{ij}^k, \ldots)$ to $(\ldots, v_{iX}^{k+1}, \ldots, v_{ij}^{k+1}, \ldots)$, is

$$v_{iX}^{k+1} = v_{iX}^k + \alpha (n_{iX} - n_i e_{iX}), \qquad v_{ij}^{k+1} = v_{ij}^k + \alpha (n_{ij} - n_i t_{ij}),$$

where α is the step length. We calculate n_iX, n_i, e_iX, n_ij and t_ij from the previous formulae using $(\ldots, v_{iX}^k, \ldots, v_{ij}^k, \ldots)$.
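A minimal sketch of one such step in Python, assuming the expected counts n_iX and n_ij have already been computed by the forward-backward algorithm for the current model; all array names and shapes are illustrative.

    import numpy as np

    def gradient_step(v_emit, v_trans, n_emit, n_trans, alpha):
        """One step v^{k+1} = v^k + alpha * gradient.

        v_emit  : (S, A) values v_iX     n_emit  : (S, A) counts n_iX
        v_trans : (S, S) values v_ij     n_trans : (S, S) counts n_ij
        """
        # Current probabilities from the softmax reparameterization.
        e = np.exp(v_emit);  e /= e.sum(axis=1, keepdims=True)    # e_iX
        t = np.exp(v_trans); t /= t.sum(axis=1, keepdims=True)    # t_ij
        # n_i: expected number of visits to state i (row sums of the counts).
        n_i_e = n_emit.sum(axis=1, keepdims=True)
        n_i_t = n_trans.sum(axis=1, keepdims=True)
        v_emit = v_emit + alpha * (n_emit - n_i_e * e)      # n_iX - n_i e_iX
        v_trans = v_trans + alpha * (n_trans - n_i_t * t)   # n_ij - n_i t_ij
        return v_emit, v_trans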

Gradient Descent method: Summary

Initialization: we take a starting point $(\ldots, v_{iX}^0, \ldots, v_{ij}^0, \ldots)$. For example, we can calculate initial values of e_iX and t_ij as in the Baum-Welch method, and then set $v_{iX}^0 = \log e_{iX}$ and $v_{ij}^0 = \log t_{ij}$.

General step: we calculate the points $(\ldots, v_{iX}^k, \ldots, v_{ij}^k, \ldots)$ for k = 1, 2, 3, … until there is no significant change in the log-likelihood function. The step length α is determined using non-linear optimization methods. The value of the log-likelihood function increases step by step.
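A skeleton of this procedure, reusing gradient_step from the previous sketch. The helper expected_counts is hypothetical and not shown here: it stands for a forward-backward pass returning the current log-likelihood together with the counts n_iX and n_ij. A fixed step length α is used for simplicity, whereas the summary above determines it by non-linear optimization.

    import numpy as np

    def train(e0, t0, sequences, alpha=0.01, tol=1e-4):
        # Initialization: v^0 = log e, v^0 = log t, with e0 and t0 obtained
        # e.g. as in the Baum-Welch method, as the summary suggests.
        v_emit, v_trans = np.log(e0), np.log(t0)
        prev_ll = -np.inf
        while True:
            # expected_counts is an assumed forward-backward routine.
            ll, n_emit, n_trans = expected_counts(v_emit, v_trans, sequences)
            if ll - prev_ll < tol:          # no significant change: stop
                return v_emit, v_trans, ll
            prev_ll = ll
            v_emit, v_trans = gradient_step(v_emit, v_trans,
                                            n_emit, n_trans, alpha)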


Gradient Descent method: Summary

Batch gradient-descent equations can easily be derived by summing over all training sequences. Similarly to the Baum-Welch algorithm, the gradient descent method is not guaranteed to find the global optimum; what we obtain is a local maximum. Also similarly to the Baum-Welch algorithm, evaluating the gradient-descent equations requires one forward and one backward pass, so the running time of each iteration is $O(KN^2)$.

Three basic questions

Likelihood question: How likely is a given sequence for a particular HMM? Given O and w: $P(O \mid w) = ?$

Decoding question: What is the most probable sequence of transitions and emissions through the HMM underlying the production of this particular sequence? Given O and w: $\max_{\pi} P(O, \pi \mid w) = ?$

Learning question: Given a sequence, what is the model that produces this sequence with the highest probability? Given O: $\max_{w} P(O \mid w) = ?$
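The likelihood question is answered by the standard forward algorithm. A minimal sketch for an HMM without silent states follows; the matrix layout (rows are states, columns are letters) is an assumption of this example.

    import numpy as np

    def log_likelihood(O, start, trans, emit):
        """log P(O | w) by the forward algorithm with rescaling.

        O     : observed sequence as a list of letter indices
        start : (S,)   initial state probabilities
        trans : (S, S) transition probabilities t_ij
        emit  : (S, A) emission probabilities e_iX
        """
        f = start * emit[:, O[0]]              # f_i(1) = start_i e_i(O_1)
        scale = [f.sum()]; f = f / scale[-1]   # rescale to avoid underflow
        for x in O[1:]:
            f = (f @ trans) * emit[:, x]       # f_j(t) = sum_i f_i(t-1) t_ij e_j(x)
            scale.append(f.sum()); f = f / scale[-1]
        return float(np.sum(np.log(scale)))    # log P(O | w)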


Profile HMM

Given a protein family, our aim is to set up an HMM, a profile HMM, that characterizes this family. We train the HMM on the members of the family, i.e. on the sequences $O^1, \ldots, O^K$; the members of the family are similar to each other.

Multiple sequence alignment: "aligning the sequences to the model"

Given the sequences $O^1, \ldots, O^K$ and a trained model, i.e. w is known. (The HMM can be trained on a large number of sequences; see the learning question.)

We calculate the Viterbi path for each of the sequences $O^1, \ldots, O^K$; together, these paths provide a multiple sequence alignment.

Example: the Viterbi paths of NFL, NKYL and QW,

start → m1 → m2 → m3 → end
start → m1 → i2 → m2 → m3 → end
start → m1 → m2 → d3 → end

give the alignment (columns m1, i2, m2, m3):

N · F L
N K Y L
Q · W -


Once we have the trained model, the running time of the multiple sequence alignment is $O(KN^2)$: the computational cost of K Viterbi paths is $K \cdot O(N^2)$, where N is the length of the model. This is significantly less than the computational cost of the dynamic programming method, which is $O(N^K)$.

The multiple alignments derived from HMMs are in some sense richer than conventional alignments: a gap can be the result of an insert state as well as of a delete state.
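The merging of the K Viterbi paths into alignment columns can be sketched as follows. The path representation, (state, symbol) pairs with states named like m1, i2, d3 and symbol None for deletes, is an assumption of this example; running it on the three paths above reproduces the alignment from the previous slide.

    def paths_to_alignment(paths, model_length):
        """Merge Viterbi paths into alignment rows (one string per sequence)."""
        # Width of each insert block = most symbols any path emits there.
        width = {k: 0 for k in range(1, model_length + 1)}
        per_seq = []
        for path in paths:
            match = {}                        # k -> symbol, or '-' for delete
            insert = {k: [] for k in width}   # k -> symbols emitted in i_k
            for state, sym in path:
                kind, k = state[0], int(state[1:])
                if kind == "m":   match[k] = sym
                elif kind == "d": match[k] = "-"
                else:             insert[k].append(sym)
            per_seq.append((match, insert))
            width = {k: max(width[k], len(insert[k])) for k in width}
        rows = []
        for match, insert in per_seq:
            row = []
            for k in range(1, model_length + 1):
                row += insert[k] + ["."] * (width[k] - len(insert[k]))
                row.append(match.get(k, "-"))
            rows.append(" ".join(row))
        return rows

    paths = [
        [("m1", "N"), ("m2", "F"), ("m3", "L")],               # NFL
        [("m1", "N"), ("i2", "K"), ("m2", "Y"), ("m3", "L")],  # NKYL
        [("m1", "Q"), ("m2", "W"), ("d3", None)],              # QW
    ]
    print("\n".join(paths_to_alignment(paths, 3)))
    # N . F L
    # N K Y L
    # Q . W -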

Protein classification

Given a sequence O and protein families characterized by their profile HMMs, we calculate a score for the alignment of sequence O to a protein family by the log-odds ratio

$$\log \frac{P(O \mid w)}{P(O \mid r)}, \qquad \text{where} \quad P(O \mid r) = \prod_{t=1}^{T} q_{X_t}.$$

Model r is referred to as the standard random model. It is the simple dice model; equivalently, it can be interpreted as an HMM in which the transition probabilities between the main states are 1 and all the others are 0, and the probability of emitting letter X is the same, q_X, in every emitting state.


An alternative is a null model whose parameters reflect the general amino acid distribution in the training sequences. Barrett, Hughey and Karplus have shown that a good choice of null model is the geometric mean of the amino acid distributions in the match states (C. Barrett, R. Hughey, and K. Karplus. Scoring hidden Markov models. CABIOS, 13(2):191-199, 1997).
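A minimal sketch of the log-odds score, reusing the log_likelihood function from the forward-algorithm sketch above; the background frequencies q and the cut-off are illustrative inputs.

    import numpy as np

    def log_odds(O, start, trans, emit, q):
        """log [ P(O | w) / P(O | r) ] for sequence O (letter indices)."""
        ll = log_likelihood(O, start, trans, emit)        # log P(O | w)
        null = float(np.sum(np.log(q[np.asarray(O)])))    # log prod_t q_{X_t}
        return ll - null

    # Classification: predict membership when the score exceeds a cut-off
    # chosen on training data, e.g.
    # is_member = log_odds(O, start, trans, emit, q) > cutoff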

Protein classification

We use a cut-off value to predict whether sequence O is a member of the family or not. In practice, this cut-off (threshold) value can be determined easily, because the points corresponding to family members are clearly separated from those corresponding to proteins outside the family. Alternatively, a score based on the Viterbi path $\pi^*$ can be calculated as

$$\log \frac{P(O, \pi^* \mid w)}{P(O \mid r)}.$$


Structural analysis and pattern discovery

Given an HMM profile of a protein family, the parameters of the HMM can be studied. High emission or transition probabilities are usually associated with conserved regions that may have structural or functional significance. Such regions can be found by plotting the entropy of the emission distributions along the backbone of the model,

$$-\sum_{i=1}^{n} p_i \log p_i,$$

where p_i is the probability that letter X_i is emitted.
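A minimal sketch of this entropy profile; emit stands for the (states × letters) emission matrix of the trained profile HMM, and the names are illustrative.

    import numpy as np

    def emission_entropy(emit):
        """H_i = -sum_X e_iX log e_iX for each emitting state i."""
        p = np.clip(emit, 1e-12, 1.0)        # guard against log(0)
        return -(p * np.log(p)).sum(axis=1)

    # Plotting H_i along the backbone of the model highlights conserved,
    # possibly functionally important, regions (low-entropy states).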

