Needleman-wunsch And Smith-waterman Algorithm

May 2020
PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA

Overview

Download & View Needleman-wunsch And Smith-waterman Algorithm as PDF for free.

More details

Words: 1,525
Pages: 19

Preview
Full text

BIOTOOLS AND DATABASES

BY C.GAYATHRI (I M.Sc.BIOINFORMATICS)

 Needleman Wunsch Algorithm  Smith Waterman Algorithm

 Published

in 1970 by SAUL NEEDLEMAN and CHRISTIAN WUNSCH  General algorithm for sequence comparison  Commonly used in bioinformatics to align protein or nucleotide sequences  Example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.

 Scores

for aligned characters are specified by a SIMILARITY MATRIX. Here, S(i, j) is the similarity of characters i and j. It uses a LINEAR GAP PENALTY, here called d.  Maximizes a similarity score, to give ‘MAXIMUM MATCH’  Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions  Finds the best GLOBAL alignment of any two sequences

 N-W

involves an iterative matrix method of calculation  All

possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array  All possible alignments (comparisons) are represented by pathways through this array

 Three

main steps

1. Assign similarity values 2. For each cell, allowing insertions and deletions give the maximum possible scoring value 3. Construct an alignment (pathway) back from the highest scoring cell

Similarity values 

Numerical value is assigned to M P R C L C Q R J N C B A every cell (depending on the P 1 similarity/dissimilarity of the two B 1 residues) R 1 1



C simple scores or more complicated, (chemical similarities K C or frequency of observed R substitutions)



The example shown here has  match = +1  mismatch = 0

N J C J A

1

1

1

1

1

1 1 1 1

1

1

1 1 1

Score pathways through array  to know the maximum possible score for an alignment  Searches sub rows and sub columns, for the highest score  Adds this to the score for the current cell  Proceeds row by row through the array  Gap penalty for the introduction of gaps in the alignment = 0

P B R C K C R N J C J A

M 0 0 0 0 0 0 0

P 1 0 0 0 0 0 0

R 0 1 2 1 1 1 2

C 0 1 1 3 2 3 2

L 0 1 1 2 3 3 3

C 0 1 1 3 3 4 3

Q 0 1 1 2 3 3 4

R 0 1 2 2 3 3 ?

J 0 1 1 2 3 3

N 0 1 1 2 3 3

C 0 1 1 3 3 4

B 0 2 1 2 3 3

A 0 1 2 2 3 3

1 1

1

1 1

Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j-1 -Wk +s(ai,bj)}, max{Hi-1, j-l -Wl +s(ai,bj)}}

1

Construct alignment  The alignment score is cumulative by adding along a P M0 P1 R0 C0 L0 C0 Q0 R0 J0 N0 C0 B0 A0 path through the array B 0 0 1 1 1 1 1 1 1 1 1 2 1  The best alignment has the R 0 0 2 1 1 1 1 2 1 1 1 1 2 C 0 0 1 3 2 3 2 2 2 2 3 2 2 highest score i.e. the K 0 0 1 2 3 3 3 3 3 3 3 3 3 maximum match C 0 0 1 3 3 4 3 3 3 3 4 3 3  Maximum match = largest R 0 0 2 2 3 3 4 5 4 4 4 4 4 number resulting from N 0 0 1 2 3 3 4 4 5 6 5 5 5 summing the cell values of J 0 0 1 2 3 3 4 4 6 5 6 6 6 every pathway C 0 0 1 3 3 4 4 4 5 6 7 6 6  The maximum match will J 0 0 1 2 3 3 4 4 6 6 6 7 7 ALWAYS be somewhere in the A 0 0 1 2 3 3 4 4 5 6 6 7 8 outer row or column shown  The alignment is constructed by working backwards from MP –RCLCQR the maximum match

JNCBA | | | | | | | |

Statistical Significance  Maximum

match is a function of sequence relationship and composition  Useful to know probability of obtaining result (maximum match) from a pair of random sequences  Estimate this experimentally  Sequences

from random proteins are taken(I.e. having same composition as the real proteins)  if the value for the random proteins is significantly different from that for the real proteins then the difference is a function of the sequences alone and not of their composition

 Proposed

by Temple Smith and Michael Waterman in 1981  Smith-Waterman algorithm is useful for performing local sequence alignment  Determining similar regions between two nucleotide or protein sequences

 Instead

of looking at entire sequence, it compares segments of all possible lengths and optimizes the similarity measure.  For every cell the algorithm calculates ALL possible paths that can be of any length and contain insertions, deletions and gaps

 

 

Works effectively, only when gap penalties are used In example shown  match = +1  mismatch = -1/3  gap = -1+1/3k (k=extent of gap) Start with all cell values = 0 Looks in sub column and sub row shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account

A A U G C C A U U G A C G G

C 0.0 0.0 0.0 0.0 1.0 1.0

A 1.0 1.0 0.0 0.0 0.0 0.7

G 0.0 0.7 0.8 1.0 0.0 0.0

C 0.0 0.0 0.3 0.3 2.0 1.0

C 0.0 0.0 0.0 0.0 1.3 3.0

U 0.0 0.0 0.0 0.0 0.3 1.7

C 0.0 0.0 0.0 0.7 1.0 ?

G 0.0 0.0 0.0 1.0 0.3

C 0.0 0.0 0.0 0.0 2.0

U 0.0 0.0 1.0 0.0 0.7

U 0.0 0.0 1.0 0.7 0.3

A 1.0 1.0 0.0 0.7 0.3

G 0.0 0.7 0.7 1.0 0.3

Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0}



Four possible ways of forming a path For every residue in the query sequence

1.

4.

To align with next residue, score =previous score +similarity score Deletion (i.e. match residue of query with a gap), score =previous score - gap penalty dependent on size of the gap Insertion (i.e. match residue of db sequence with a gap, score =previous score - gap penalty dependent on size of the gap Stop when the score is zero



Choose whichever of these which has the highest score

7. 9.

Construct Alignment 

The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates



Trace pathway back from highest scoring cell



This cell can be anywhere in the array



Align highest scoring segment

A A U G C C A U U G A C G G

C 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

A 1.0 1.0 0.0 0.0 0.0 0.7 2.0 0.7 0.3 0.0 1.0 0.0 0.7 0.0

G 0.0 0.7 0.8 1.0 0.0 0.0 0.7 1.7 0.3 1.3 0.0 0.7 1.0 1.7

C 0.0 0.0 0.3 0.3 2.0 1.0 0.3 0.3 1.3 0.0 1.0 1.0 0.3 0.7

C 0.0 0.0 0.0 0.0 1.3 3.0 1.7 1.3 1.0 1.0 0.3 2.0 0.7 0.3

U 0.0 0.0 0.0 0.0 0.3 1.7 2.7 2.7 2.3 1.0 0.7 0.7 1.7 0.3

C 0.0 0.0 0.0 0.7 1.0 1.3 1.3 2.3 2.3 2.0 0.7 1.7 0.3 1.3

G 0.0 0.0 0.0 1.0 0.3 1.0 1.0 1.0 2.0 3.3 2.0 1.7 2.7 1.3

GCCUCG GCCAUUG

C 0.0 0.0 0.0 0.0 2.0 1.3 0.7 0.7 0.7 2.0 3.0 3.0 1.7 2.3

U 0.0 0.0 1.0 0.0 0.7 1.7 1.0 1.7 1.7 1.7 1.7 2.7 2.7 1.3

U 0.0 0.0 1.0 0.7 0.3 0.3 1.3 2.0 2.7 1.3 1.3 1.3 2.3 2.3

A 1.0 1.0 0.0 0.7 0.3 0.0 1.3 1.0 1.7 2.3 2.3 1.0 1.0 2.0

G 0.0 0.7 0.7 1.0 0.3 0.0 0.0 1.0 1.0 2.7 2.0 2.0 2.0 2.0



Needleman-Wunsch



Smith-Waterman

1. Global alignments

1. Local alignments

2. Requires alignment score for a pair of residues to be >=0

2. Residue alignment score may be positive or negative

3. No gap penalty required

3. Requires a gap penalty to work effectively

4. Score cannot decrease between two cells of a pathway

4. Score can increase, decrease or stay level between two cells of a pathway

5. Trace back is mostly from the last cell that has the highest score

5. Trace back is from the cell that has the highest score

CONCLUSION Hence from calculating and working many times on these algorithms considering different organisms, it is found that NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the different organisms

Needleman-wunsch And Smith-waterman Algorithm

Overview

More details

Related Documents

Algorithm

Algorithm

Algorithm

Algorithm

Prim And Kruskal Algorithm

Cpr Update And Algorithm (lecture)