Sequence Analysis
Sequence analysis Pairwise Alignment: ¾Dot Matrix ¾Dynamic Programming ¾Heuristic Methods: FASTA BLAST
Pairwise Sequence Alignment
• The process of comparing two sequences by searching for a series of individual characters that are in the same order in the sequences
Terms used to indicate sequence similaity Term used
Description of the term
Homologous
Characters similar due to common ancestry
Analogous
Characters similar due to convergent evolution
Orthologous
Charecters homologous with conserved function
Paralogous
Characters homologous with divergent function
Xenologous
Sequence identity due to horizontal transfer
Dot matrix • • • •
Useful for simple alignments Uses graphical methods Easy to understand and apply Disadvantage: Does not produce an optimal alignment
Dot matrix • Constructed to visually compare two sequences and detect regions of close similarity between them • Similarity between two sequences can be detected as a diagonal on an identity matrix • Word size can be fixed to filter the best dot plot
Dynamic programming • A little bit challenging to understand when compared to Dot matrix • Produces optical alignment using both Global and Local alignments • Disadvantage: Needs skill to produce the most optical alignment
Scoring Matrices • The alignment that gives the highest similarity score is the optimal alignment • A scoring method used in the alignment of one residue against another • The number of alignments that must be checked to identify the optimal alignment increases exponentially with the lengths of the sequences
SCORING MATRICES for Proteins
• PAM (Percent Accepted Mutations) • BLOSUM (BLOcks SUbstitution Matrix)
PAM • The first substitution matrix developed for comparision of protein sequences • Also called as Dayhoff, MDM or PAM matrices • Developed by Margaret Dayhoff and co-workers • Developed from global alignments of closely related sequences • Number accompanying PAM refers to evolutionary distance between the sequences • Larger number – greater evolutionary distance
Table showing the PAM matrices and the corresponding percent identities
Matrix
PAM0 PAM30 PAM80 PAM110 PAM200
PAM250
% Identity
100
20
75
50
60
25
BLOSUM • Developed by Steve Henikoff and co-workers • Based on local alignments of distantly related sequences • Small number - greater evolutionary distance
Global Alignment • Is done across the entire sequence length to include as many matches as possible upto and including the sequence end • Needleman-Wunsh Algorithm • Example of programme using global alignment by implementing NeedlemanWunsh algorithm: Needle tool of EMBOSS
Local Alignment • Finds out the pair of regions, one from each sequence that exhibit high similarity • Smith-Waterman Algorithm • Example of programme using global alignment by implementing SmithWaterman algorithm: Water tool of EMBOSS
Heuristics methods • Fast and effective but approximate methods • BLAST and FASTA
FASTA • It is a programme written by W.R.Pearson and D.J.Lipman • Sensitive and rapid algorithm • EMBL and DDBJ use FASTA for whole database search • Disadvantage: Completely ignores more more distantly related sequences that have functional homology but no longer retain complete identity
Flavours of FASTA TYPE of fasta
Query sequence
Library/ Database scanned
fasta
Protein or DNA
Protein or DNA
tfasta
Protein
Translated DNA
lfasta
Compares two sequences for local similarity and shows local sequence alignments
plfasta
Compares two sequences for local similarity and plots the local sequence alignments
fastx
Tanslated DNA (all the six frames)
fasty
Protein
FASTA Considers exact matches for a given set of sequences
Uses DP algorithm to compute optimal alignment out of the exact matches
BLAST (Basic Local Alignment Search Tool) • Karlin and Altschul • Multiple hits to the same sequence is allowed in order to find out patches of regional similarity
BLAST Program
Database
Query
BLASTN
Nucleotide
Nucleotide
BLASTP
Protein
Protein
BLASTX
Protein
Nucleotide translated into protein
TBLASTN
Nucleotide translated into protein
Protein
TBLASTX
Nucleotide translated into protein
Nucleotide translated into protein
BLAST Looks for short subsequences that are likely to have significant matches
Tries to extend these subsequences in order to obtain maximum sequence similarity
Steps in alignment process 1. Matrix construction. 2. Matrix filling. 3. Back tracing. 4. Alignment plotting.
The two example sequences taken are: 5’ GAATTCAGTTA 3’ 5’ CCATCGG 3’
Matrix construction and filling: The matrix is constructed as in the slides that follow, the number of rows and columns taken depends on the number of bases present in both the sequences to be aligned. Once the matrix is constructed, it is filled based on some scoring rules and formulas as follows: Formula for determining the score of a cell (for global) :
Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W)]
(or)
Si,j = Max [(Si-1,j-i), (Si-1,j), (Si,j-1)] + W Where, Si,j = Score of the cell represented by ith column and jth row i = row in which the cell of consideration lies j = column in which the cell of consideration lies W = weight given to the substitution of bases represented by the cell of consideration
Scoring rules used: The weights given in this example for substitution or indel are the following :Match = 2
Si,j = Max [Si-1,j-1, Si-1,j, Si,j-1] + W
Mismatch = -1 Gap = -2
S2,2 = Max [S1,1, S1,2, S2,1] + W For example, the calculation of score in the case of the cell S2,2: S1,1 = 0 S1,2 = -2 S2,1 = -2 and W (GQC) = -1 Max of all the three is 0. So, 0 + -1 = -1 So, this score is given to the cell and arrow is placed from the diagonal cell from which the value is obtained, it can be to more than one cell also.
row
(i)
G col
(j)
1
0
-2
-2
-1
3
C
-4 4
A
-6 5
T
-8 6
C
-10 7
G
-12 8
G
-14
A 4
3
2
2
C
A -4
T 5
-6
T 6
-8
C 8
7
-10
A
-12
G 9
-14
T 10
-16
-18
T 11
-20
A 12
-22
Similarly all the cells are assigned scores and arrows are placed accordingly. Finally when the entire matrix is filled, the back tracing is done. The two differences between local and global alignment are: Scoring: In case of global alignment whatever score we get from the formula, we assign it to the cell. Where as in case of local alignment the cell score is equaled to zero if the calculated maximum score is less than zero. So, the formula for calculating score of a cell for local alignment is written as: Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W), 0] Back tracing: In case of global it starts from the bottom right and ends at the top left. Where as in case of local alignment, back tracing is done starting from the highest score in the matrix to the lowest score other than zero, and if more than one traces are there the longest one is chosen to plot the optimal local alignment. While back tracing, it is a must that trace should be always from the cell of consideration to only those cells from which it’s score is attained.
The final filled matrix for global alignment (including the optimal alignment trace in red) obtained for the example sequences taken is . . . .
row (i)
G col
(j)
1
A 4
3
2
A
T 5
T 6
C
A 8
7
G 9
T 10
T 11
A 12
0
-2
-4
-6
-8
-10
-12
-14
-16
-18
-20
-22
-2
-1
-2
-3
-4
-5
-3
-4
-5
-6
-7
-8
-4
-2
-2
-3
-4
-5
-3
-4
-5
-6
-7
-8
-6
-3
0
2
1
0
-1
1
0
-1
-2
0
-8
-4
-1
1
4
6
5
4
3
5
7
6
-10
-5
-2
0
3
5
8
7
6
5
4
6
-12
-3
-3
-1
2
4
7
7
9
8
7
6
-14
-1
-2
-2
1
3
6
6
11
10
9
8
2
C 3
C 4
A 5
T 6
C 7
G 8
G
If a score is followed by a diagonal, then there will be no gap. The bases represented by the cell can be aligned side by side in such case. If a score is followed by a horizontal or vertical arrow, a gap is inserted in the sequence towards which the arrow is pointing.
CAUTION: Gap must be inserted only when we come across a horizontal or vertical pointing arrow following a score, but not for a diagonal arrow following it. A diagonal arrow following a score represents an acceptable match, which can be either match or mismatch. Properly check where a gap must be inserted (the sequence towards which the arrow points should get a gap).
From the previous slide, there can be a total of four OPTIMAL GLOBAL ALIGNMENTS possible. These are:
GA A T T C A GT T A C C ATCG G GA A T T C A GT T A C C ATC GG G AA T T C A GT T A C C ATCG G G AA T T C A GT T A C C ATC GG
The final filled matrix for local alignment (including the optimal alignment trace in red) obtained for the example sequences taken is . . . .
row
(i)
G col
(j)
1
A 4
3
2
A
T 5
T 6
C
A 8
7
G 9
T 10
T 11
A 12
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
0
0
0
0
0
0
0
0
4
3
2
1
0
0
0
0
2
4
3
2
3
6
5
4
3
5
0
0
1
3
6
8
7
6
5
7
9
8
0
0
0
2
5
7
10
9
8
7
8
8
0
2
1
1
4
6
9
9
11
10
9
8
0
4
3
2
3
5
8
8
13
12
11
10
2
C 3
C 4
A 5
T 6
C 7
G 8
G
From the previous slide, there can be a total of two OPTIMAL LOCAL ALIGNMENTS possible. These are:
A A T T C A G
A A T T C A G
A TC G G
A TC GG
Distance • Distance between two sequences is the minimal sum of weights for a set of mutations transforming one into the other
• Similarity is useful for database searching • Distance measures are useful for phylogenetic tree construction