BIOTOOLS AND DATABASES
BY C.GAYATHRI (I M.Sc.BIOINFORMATICS)
Needleman Wunsch Algorithm Smith Waterman Algorithm
Published
in 1970 by SAUL NEEDLEMAN and CHRISTIAN WUNSCH General algorithm for sequence comparison Commonly used in bioinformatics to align protein or nucleotide sequences Example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison.
Scores
for aligned characters are specified by a SIMILARITY MATRIX. Here, S(i, j) is the similarity of characters i and j. It uses a LINEAR GAP PENALTY, here called d. Maximizes a similarity score, to give ‘MAXIMUM MATCH’ Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions Finds the best GLOBAL alignment of any two sequences
N-W
involves an iterative matrix method of calculation All
possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array All possible alignments (comparisons) are represented by pathways through this array
Three
main steps
1. Assign similarity values 2. For each cell, allowing insertions and deletions give the maximum possible scoring value 3. Construct an alignment (pathway) back from the highest scoring cell
Similarity values
Numerical value is assigned to M P R C L C Q R J N C B A every cell (depending on the P 1 similarity/dissimilarity of the two B 1 residues) R 1 1
C simple scores or more complicated, (chemical similarities K C or frequency of observed R substitutions)
The example shown here has match = +1 mismatch = 0
N J C J A
1
1
1
1
1
1 1 1 1
1
1
1 1 1
Score pathways through array to know the maximum possible score for an alignment Searches sub rows and sub columns, for the highest score Adds this to the score for the current cell Proceeds row by row through the array Gap penalty for the introduction of gaps in the alignment = 0
P B R C K C R N J C J A
M 0 0 0 0 0 0 0
P 1 0 0 0 0 0 0
R 0 1 2 1 1 1 2
C 0 1 1 3 2 3 2
L 0 1 1 2 3 3 3
C 0 1 1 3 3 4 3
Q 0 1 1 2 3 3 4
R 0 1 2 2 3 3 ?
J 0 1 1 2 3 3
N 0 1 1 2 3 3
C 0 1 1 3 3 4
B 0 2 1 2 3 3
A 0 1 2 2 3 3
1 1
1
1 1
Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j-1 -Wk +s(ai,bj)}, max{Hi-1, j-l -Wl +s(ai,bj)}}
1
Construct alignment The alignment score is cumulative by adding along a P M0 P1 R0 C0 L0 C0 Q0 R0 J0 N0 C0 B0 A0 path through the array B 0 0 1 1 1 1 1 1 1 1 1 2 1 The best alignment has the R 0 0 2 1 1 1 1 2 1 1 1 1 2 C 0 0 1 3 2 3 2 2 2 2 3 2 2 highest score i.e. the K 0 0 1 2 3 3 3 3 3 3 3 3 3 maximum match C 0 0 1 3 3 4 3 3 3 3 4 3 3 Maximum match = largest R 0 0 2 2 3 3 4 5 4 4 4 4 4 number resulting from N 0 0 1 2 3 3 4 4 5 6 5 5 5 summing the cell values of J 0 0 1 2 3 3 4 4 6 5 6 6 6 every pathway C 0 0 1 3 3 4 4 4 5 6 7 6 6 The maximum match will J 0 0 1 2 3 3 4 4 6 6 6 7 7 ALWAYS be somewhere in the A 0 0 1 2 3 3 4 4 5 6 6 7 8 outer row or column shown The alignment is constructed by working backwards from MP –RCLCQR the maximum match
JNCBA | | | | | | | |
Statistical Significance Maximum
match is a function of sequence relationship and composition Useful to know probability of obtaining result (maximum match) from a pair of random sequences Estimate this experimentally Sequences
from random proteins are taken(I.e. having same composition as the real proteins) if the value for the random proteins is significantly different from that for the real proteins then the difference is a function of the sequences alone and not of their composition
Proposed
by Temple Smith and Michael Waterman in 1981 Smith-Waterman algorithm is useful for performing local sequence alignment Determining similar regions between two nucleotide or protein sequences
Instead
of looking at entire sequence, it compares segments of all possible lengths and optimizes the similarity measure. For every cell the algorithm calculates ALL possible paths that can be of any length and contain insertions, deletions and gaps
Works effectively, only when gap penalties are used In example shown match = +1 mismatch = -1/3 gap = -1+1/3k (k=extent of gap) Start with all cell values = 0 Looks in sub column and sub row shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account
A A U G C C A U U G A C G G
C 0.0 0.0 0.0 0.0 1.0 1.0
A 1.0 1.0 0.0 0.0 0.0 0.7
G 0.0 0.7 0.8 1.0 0.0 0.0
C 0.0 0.0 0.3 0.3 2.0 1.0
C 0.0 0.0 0.0 0.0 1.3 3.0
U 0.0 0.0 0.0 0.0 0.3 1.7
C 0.0 0.0 0.0 0.7 1.0 ?
G 0.0 0.0 0.0 1.0 0.3
C 0.0 0.0 0.0 0.0 2.0
U 0.0 0.0 1.0 0.0 0.7
U 0.0 0.0 1.0 0.7 0.3
A 1.0 1.0 0.0 0.7 0.3
G 0.0 0.7 0.7 1.0 0.3
Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0}
Four possible ways of forming a path For every residue in the query sequence
1.
4.
To align with next residue, score =previous score +similarity score Deletion (i.e. match residue of query with a gap), score =previous score - gap penalty dependent on size of the gap Insertion (i.e. match residue of db sequence with a gap, score =previous score - gap penalty dependent on size of the gap Stop when the score is zero
Choose whichever of these which has the highest score
7. 9.
Construct Alignment
The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
Trace pathway back from highest scoring cell
This cell can be anywhere in the array
Align highest scoring segment
A A U G C C A U U G A C G G
C 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
A 1.0 1.0 0.0 0.0 0.0 0.7 2.0 0.7 0.3 0.0 1.0 0.0 0.7 0.0
G 0.0 0.7 0.8 1.0 0.0 0.0 0.7 1.7 0.3 1.3 0.0 0.7 1.0 1.7
C 0.0 0.0 0.3 0.3 2.0 1.0 0.3 0.3 1.3 0.0 1.0 1.0 0.3 0.7
C 0.0 0.0 0.0 0.0 1.3 3.0 1.7 1.3 1.0 1.0 0.3 2.0 0.7 0.3
U 0.0 0.0 0.0 0.0 0.3 1.7 2.7 2.7 2.3 1.0 0.7 0.7 1.7 0.3
C 0.0 0.0 0.0 0.7 1.0 1.3 1.3 2.3 2.3 2.0 0.7 1.7 0.3 1.3
G 0.0 0.0 0.0 1.0 0.3 1.0 1.0 1.0 2.0 3.3 2.0 1.7 2.7 1.3
GCCUCG GCCAUUG
C 0.0 0.0 0.0 0.0 2.0 1.3 0.7 0.7 0.7 2.0 3.0 3.0 1.7 2.3
U 0.0 0.0 1.0 0.0 0.7 1.7 1.0 1.7 1.7 1.7 1.7 2.7 2.7 1.3
U 0.0 0.0 1.0 0.7 0.3 0.3 1.3 2.0 2.7 1.3 1.3 1.3 2.3 2.3
A 1.0 1.0 0.0 0.7 0.3 0.0 1.3 1.0 1.7 2.3 2.3 1.0 1.0 2.0
G 0.0 0.7 0.7 1.0 0.3 0.0 0.0 1.0 1.0 2.7 2.0 2.0 2.0 2.0
Needleman-Wunsch
Smith-Waterman
1. Global alignments
1. Local alignments
2. Requires alignment score for a pair of residues to be >=0
2. Residue alignment score may be positive or negative
3. No gap penalty required
3. Requires a gap penalty to work effectively
4. Score cannot decrease between two cells of a pathway
4. Score can increase, decrease or stay level between two cells of a pathway
5. Trace back is mostly from the last cell that has the highest score
5. Trace back is from the cell that has the highest score
CONCLUSION Hence from calculating and working many times on these algorithms considering different organisms, it is found that NW and SW algorithms are excellent methods for finding the similarity and dissimilarity between the different organisms