Protein Bioinformatics 410.639 (http://jhunix.hcf.jhu.edu/~flebeda)
Lecture 3 SEQUENCE SEARCH & ALIGNMENT I. Similar 3D structures without sequence similarity Frank J. Lebeda 1
Experimental Data AA
2D data
composition
Protein sequences
Course Flow-chart Patterns
Search Database
Substitution matrices Gap penalties
2D predictions
2D prediction evaluation Protein class predictions
Pair & Multiple alignments Sequence Homologues
TM & hTM predictions
Solvent accessibility predictions
3D predictions
2
The variety of Definitions and Terms in Protein Bioinformatics • Due to the many people with diverse backgrounds who have made contributions: • biologists • chemists • computer scientists • statisticians • linguists • mathematicians • speech, pattern recognition & image analysts 3
Sequence search & alignment • Rationale • Matrices & scoring pairwise alignments • Scoring algorithms – 1st generation: Optimal – 2nd generation: Heuristic
• Scoring algorithms -3rd generation – multiple alignments
• Profiles/patterns - 4th generation 4
Why do protein sequence searching, pairwise or multiple-alignments, matching? - to determine whether a new sequence has already been deposited in a databank - to validate predicted ORFs - to find homology = to deduce a common evolutionary ancestry: usually similar over an entire sequence; alignments >50% identity in short <40 residue regions can occur solely by chance
- to help build new 3D models - to represent protein families & superfamilies - to identify conserved sequence features that correlate 5 with function
Why do alignments? (continued) Build evolutionary trees. - Understand the lineage of different species. - Have an organizing principle for sorting species into a taxonomy. Understand how various functions evolved. Understand forces and constraints on evolution.
Altman’s Phylogenetic lecture http://smi-web.stanford.edu/projects/helix/bmi214/ 6
Results from similarity searches: The most important conclusion that one can infer from doing sequence searching and alignment: high sequence similarity usually implies significant structural & functional similarity. (The 1st fact of biological sequence comparison -Gusfield 1997)
HOMOLOGY - implies a common ancestry - similarity in structure & function - applies either to a pair of sequences or to a large family of proteins By definition - homologous proteins share a common fold, i.e. the secondary & 3D structure, …however... 7
Similarity does not necessarily mean Homology - Homologous proteins “need not share statistically significant or even detectable sequence similarity” [BUT they can still share the same 2D, 3D structures, & functions] (The 2nd fact of biological sequence comparison- Gusfield, 1997)
- Proteins having similar 3D structures may not necessarily share similar functions - Proteins having similar sequences may not have similar 3D structures 8
Pearson, 2000
1chg
1chg
Similar 3D structures without sequence similarity
Sequence similarity, without similar 3D structures Pearson, 2000
1sgc
2baa
9
Protein Evolution - governed by the constraint of maintaining a characteristic fold which enables some function. From this concept, we can infer relationships between proteins that last shared a common ancestor ~2.5 billion years ago. (De La Vega’s notes)
10
Some important dates in history
-2.5
11
Pearson, 2000
Tree of Life
12
adapted from Dayhoff et al, 1978 Pearson, 2000
Types of homology Orthologous sequences: have the same function in various species, and have arisen via speciation; differences among related species at a single locus Paralogous sequences: members of multigene families that have arisen via gene duplication; differences among similar sequences in the same genome A (parent gene) speciation duplication B C C’ (orthologs) (paralogs) 13 After Altman http://smi-web.stanford.edu/projects/helix/bmi214/)
Concepts of similarity & redundancy “...all of biology is an enormous redundancy.” (Doolittle) Evolution: “...duplication with modification...” (Doolittle) “...reuses, builds on, duplicates, and modifies successful structures.” (Gusfield) (Quotes from Gusfield, 1997)
14
Orthologous sequences: cytochrome c family
15
Pearson, 1998
Search (comparison) algorithms • Align sequences Exact matching of patterns Inexact matching (matching with errors) • Calculate similarity scores for the alignments ==> each score represents an quantitative assessment of an alignment • Infer sequence homologies 16
Searching databases for sequence homologs - major concepts 1. We assume that the amino acid sequences, the protein structures and functions are inter-related. 2. The algorithms used provide: - search efficiency (vs. doing it by hand) - a mathematical basis to do searches consistently - results that can be translated into a measure of statistical significance Adapted from A. McLysaght http://www.maths.tcd.ie/~lily/pres2
17
DNA vs protein sequence comparison
Pearson, 1998
18
Distance between sequences (in the language of computer scientists) A quantitative measure of the difference between two sequences that examines a set of k different (sub)strings {x1, ..., xk} over an alphabet Σ. string = an ordered list of contiguous characters sequence #1: x1...x2... x3 (3 substrings) sequence #2: x1..........x3 (2 substrings) |Σ| = the number of characters in an alphabet: |Σ| = 4 for nucleic acids |Σ| = 20 for amino acids
19
Historical Background of Alignment Scores Hamming Distance (1950) substitutions allowed; no gaps or insertions: Maximal Segment Pair (MSP) algorithm ==> BLAST Levenshtein Distance (1966) substitutions, gaps & insertions allowed with a unit cost: Smith-Waterman P G algorithm ==> SSEARCH E ILEPGHAC Î L H IL----AC I
A C
In 1879, C.L. Dodgson (Lewis Carroll) made an early 20 reference to an alignment problem ==>
Superimposing 3D and 1D structures P E I
L
G H A
...ILEPGHAC... ...IL----AC...
A “gap” in a sequence alignment is an artificial way of showing locations of inserts & deletes No “gaps” in 3D structures Protein 1 ...ILAC... vs 21 Protein 2 ...ILEPGHAC... C
Carroll’s word puzzle: Substitute letters to transform an English word into another by going through a series intermediate English words - Start with two words of equal length - then transform HEAD ==> TAIL
HEAD HEAL (D L) TEAL (H T) TELL (A L) TALL (E A) TAIL (L I)
total number of substitutions d1 =0 d2 = 1 d3 = 1 [Each substitution has a COST, d4 = 1 or SCORE = 1 ] d5 = 1 d6 = 1
total no. of substitutions (Hamming distance) = DH = Σ di = 5 [ summing the “scores”]
22
The use of distance measures in pattern recognition/image analysis longer distance
shorter distance “unknown” “templates”
23
Scoring Matrices for residue pairs (weighting, substitution, matching) • Identity/unity/PAM0 - either identical or nonidentical (score = 1 or 0) • genetic code - 0,1,2, or 3 base changes in codon to code for a different residue • chemical - hydropathicity, charge, solvent access. • Observed residue substitutions in aligned sequences – PAM – BLOSUM (from Barton, Ch. 2, Sternberg, 1996) 24
Scoring Matrices •“...the proper scoring matrix is the most critical technical element in a successful search of protein databases.” Gusfield •Their MAJOR limitation: no unique set of scores is satisfactory to always give a good alignment e.g., amino acids in the same protein may experience different local environments • BUT, the structures of these matrices have been investigated and used with some success 25
Alignment locates possible mutation Theory tries to predict reality Point Accepted Mutation or Percent Accepted Mutations (Dayhoff et al.’s PAM score (1967, ‘69, ‘72, ‘78) -- accepted by natural selection. •There is a different mutation probability matrix for each evolutionary distance (or interval) measured in PAMs (e.g., 250PAM, 120PAM, 50PAM, 1PAM etc.) •These matrices quantitatively represent the average way in which residues change during evolution 26
Meaning of PAM scoring units • 1 PAM means having 1 point mutation per 100 residues, a 1% rate of substitution • 250 PAM means a lot of change has occurred; such alignments are still expected to remain about 20% identical since a given location can be associated with several mutations • These alignments are in the ‘Twilight Zone’ - a region in sequence comparison space “that contains distantly related (true positives) and relatively high-scoring but unrelated sequences (false positives)” (Pearson, 1996) (Lebeda’s italics) 27
The limits of sequence similarity 0 20 40
% identity
60 80 100 (mutations/ 100 residues)
Pearson, 2000; Doolittle et al., 1986
28
The limits of sequence similarity 0 20 40
% identity
60 80 100 (mutations/ 100 residues)
automatic pairwise methods
consensus, templates, MAs
threading predictions 29
Pearson, 2000; Doolittle et al., 1986
Comparison of alignment methods for predicting homologous 3D structures “Twilight zone”
% identity 100
80
60
40
20
0
|
|
|
|
|
|
automatic pairwise methods
threading consensus & predictions template (pattern) methods; mult. alignments
modified from Schulze-Kremer, 1996
30
Log-odds scoring matrix for the evolutionary distance of 250 PAMs (190 possible exchanges)
31
(Data from Dayhoff et al., 1978)
Substitution matrices • The PAM250 was one of the most widely used substitution matrices • The BLOSUM series of matrices appear to be more sensitive to distantly related sequences
32
Scoring an alignment For each pair of residues (i and j) in an aligned pair of sequences, a score is calculated
p(i ← j) signal/noise score ~ p(i) p(j) ratio (odds) Where p(i ← j) is the probability of a location having residue “i” in one sequence that is substituted with residue “j” in the other sequence. p(i) and p(j) are the probabilities (frequencies) of these two residues in the database 33
Building the PAM250 matrix • Each element is a substitution score: target
p(i ← j) score(i ← j ) = log base ( ) p(i) p(j) background The scores that best distinguish between pairwise alignments from chance are called log-odds scores 34
Foundations: data for the PAM1 matrix • Start with the stochastic* matrix giving the target probability distribution for a distance of 1PAM • was experimentally determined by studying >1500 substitutions (accepted point mutations) • used 71 groups of closely related proteins (>85% identical)
* involving chance or probability
35
Log-odds scoring matrix for the evolutionary distance of 250 PAMs C
C S
(250PAM * 10 = MDM78)
S
R
K aromatics
hydrophil.
G
W
acidic
|
basic
hydrophob.
(Data from Dayhoff et al., 1978)
W
36
W
Varying the clustering percentages generated a family of BLOSUM matrices • different matrices were constructed by changing the value of the clustering % identity • For example, the BLOSUM62 substitution matrix (at 62%) contains scores from “clustered” segments that are on average 62% identical • Structure-based log-odds matrices were also constructed 37
Henikoff, 1996
Search algorithms (continued) Optimal - slow, rigorous, exhaustive: guaranteed to calculate an optimal similarity score for every sequence in a database Heuristic - a short-cut, rapid, approximate: not guaranteed to calculate an optimal similarity score 38
Pearson, 1998
Two Optimal algorithms Used to derive different alignment scores: 1. Global scores - calculated for an alignment that starts at the beginning of each sequence and extends to the end of each sequence; with or without gaps (Needleman and Wunsch; 1970) 2. Local scores - used to find short regions with similar residues (SSEARCH; Smith-Waterman, 1981)
39
Dynamic Programming Algorithms • Compares two sequences, computes the minimal distance between them when seeking an optimal alignment • Represents a group of efficient algorithms by which both optimal alignments and the resulting distances are computed at the same time 40
An example of scores (edit, distance or unit cost) • In an alignment, scores are given for an alignment of sequences A={a1,a2...ai..aN} and sequence B ={b1,b2…bj…bM} for every pair of: Matches (identical residues) σ(ai, bj) =1 Mismatches (substitutions) σ(ai, bj) =1 Insertions or deletions (gaps) σ(ai, -), σ(-, bj), =0 • Scores are calculated by adding all of these local (positive) scores for each alignment. Similarity scores unlike distance or cost scores may be < 0. • There may be many optimal alignments but only one optimal 41 score.
Local differences & their edit graphs a
K C insert b D Y
K CinsertEa F 1
K CinsertEa F
K C E F insert a = delete b
1
1
2
2
insert b= delete a
3
4 Scores: 4(optimal) Sequence A: KCEF
3 KCE-F
2 K-CE-F
Sequence B: KCDY
K-CDY
KC--DY
3
diagonal = substitution or identity vertical = insert b/delete a horizontal= delete b/ insert a
2
b
42
Another matrix representation: AATGC (sequence 1) EXAMPLE AG_GC (sequence 2)
ALIGNMENT (scores) s(i,j) s(i,j) s(-,j) s(i,-)
(insert seq 1)
A G G C
A 1 0 0 0
A 1 1 1 1
T 0 1 1 1
G 0 2 2 1
C 0 1 2 3
= = = =
+1 (i=j) 0 (i<>j) 0 gap 0 gap
(insert seq 2)
first arrow= G-G match
3+2+1+1+1= 8 43
Two heuristic algorithms FAST family - (Lipman & Pearson, 1985) (FASTP, FASTA, FASTX, TFASTX, etc.)
BLAST family- (Altschul et al., 1990, etc.) (Basic Local Alignment Search Tool)
In many cases, these algorithms produce results of a quality similar to the Optimal algorithms, although some sensitivity is sacrificed. 44
Searching sequence databases involves a tradeoff between: False negatives (FN) type I errors (α) False positives (FP) type II errors (β) ------Hypothesis: homologous sequences-----------Not Homologous Decision Homologous Accept True pos (TP) False pos (FP) (β) Reject False neg (FN) (α) True neg (TN) (Remington & Schork, 1985; brc.mcw.edu/MetaGene/General/ Analysis/criteria.html45 Gusfield, 1997; Pearson, 2000) #Sensitivity and Selectivity
Sensitivity (precision): ability to detect distant evolutionary relationships in sequences: TP ----------------[ TP + FN ] <= no. of Homologous sequences Selectivity or specificity (recall): ability to avoid assignment of high scores to unrelated sequences: TN --------------[ TN + FP ] <= no. of Non-homologous sequences (terms in parentheses from Prosite) 46
FASTA vs. BLAST • In general, BLAST programs are faster • FASTA programs give more accurate results • the Expect E() values for both are a statistical measure of the likelihood that the observed similarity score could have occurred by chance these values describe the random background noise that exists for matches between sequences
47
(Pearson, in press)
FASTA flowchart
(init1 score)
FromBarton’s chapter 48
(initn score)
BLAST flowchart
many HSPs 49
From Barton’s chapter
BLAST algorithm 1. Score each triple (3 character words) in the query using the identity elements in the chosen scoring matrix - if higher than threshold (T=11), add the word to a list of high-scoring triples, 2. Use this list to scan the database for matches 3. Extend the lengths of the segments 4. Find the local high scoring (HSP) and maximal segment pairs (MSP) 5. Estimate the statistical significance of the MSP scores assuming sequences are randomly composed 50
Others scores related to the PAM matrix • The scores that best distinguish between pairwise alignments
from chance are of this form and are called log-odds scores: q ij
S (i , j ) =
ln( p
ipj
λ
) (the BLAST score)
qi,j’s = frequency with which residues i and j are aligned (target probability or freq.) ==> data for an accepted point mutation matrix px’s = frequency of residue x occurring in a sequence (background probability or freq.) λ = a normalization (scaling) constant (from Karlin & Altschul, 1990, Altschul et al., 1994)
51
Statistical significance of the BLAST score S Expected freq. (E-value) of a chance occurrence of a segment having at least a score of S E = K N exp -λ S Given 2 random sequences (with the same AA composition), E is the number of local high-scoring segment pairs (HSP) expected to occur by chance. (an E-value = 0.1 means that 1 out of 10 matches having the score (S) are expected to occur merely by chance using this database) This equation relates the score, S, of an HSP to its expected frequency of being a chance occurrence N = product of the query & database sequence lengths K and λ are pre-calculated from substitution scores and sequence composition. 52 (www.ncbi.nlm.nih.gov/BLAST/blast_help.html)
Which to use?
NOTE: a no longer recommended 53
Pearson, in press
What threshold Expect value to use? • For both FASTA & BLASTP, a sensible E()-value threshold = 0.001-0.01 • the default max E()-values are set to 10 - these thresholds avoid false positive (type I) errors - little can be done to avoid false negative (type II) errors
54
Relating BLAST’s E- and P-values • P value (probability in the range of 0 to 1) • E value (range: 0 to infinity) P = 1 - exp(-E) as E as E
infinity, P 1 0, E = P (E<0.05)
55
E-values vs. P-values 1.2
P=1
P-values
1 0.8 0.6
P = 1 - exp(-E) if P=E
0.4 0.2 0 0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
2.2
E-values
56
Other BLAST items (v.2) • N = no. of High Scoring segment Pairs (HSPs) • Smallest Sum P-value: adds up all the HSP scores for a given alignment; this statistic tends to rank database matches better than the previously used Poisson statistic • >40 command line options (parameters) available to customize your query session • BLASTP version 1.4 still available for ungapped alignments 57
a partial BLAST output
Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N Rv1646, info ... 56 0.062 1 Rv1758, info ... 49 0.15 1
http://www.sanger.ac.uk/DataSearch/omniblast.shtml
58
More BLAST output... Rv1646, [Full Sequence] info TB.seq 1855762:1856691 MW:30220 Length = 311 Score = 56 (19.7 bits), Expect = 0.064, P = 0.062 Identities = 13/31 (41%), Positives = 19/31 (61%) Query: Sbjct:
21 NEFVALAAIPGGAATFAVCQMPNLDEIVSNA 51 N+FV A+ GGAA +A + N+ + V NA 74 NQFVG--ALRGGAAAYASAEAANVQQTVVNA 102
59
An additional BLAST output • A normalized S score can expressed as a relative entropy (H) value : H (bits) = (λS - log2 (K) )/log2(2) • the information expected to be obtained for each HSP that distinguishes it from a random alignment • unlike S, statistical significance can be associated with H that is independent of the scoring matrix used 60
Structure homology threshold as a function of sequence alignment length (from PDB) 80
Average % residue identity (homology thresholds)
NOT 3D homologues @30% identity
70 60
3D homologues @30% identity
50
Select 30% threshold
40 30
Asymptote
20
Short segments with 50% ID not significant
10 0 <10 10
20
30
40
50
60
70
80 >80
can get ~25% ID from unrelated proteins
Alignment lengths
At 30% sequence identity, alignment lengths of >=60 residues implies homology while a 30% identity for an alignment of <59 residues does not 61
The BLAST family blastn - for nucleotide - nucleotide comparisons blastp - for protein - protein comparisons tblastn - compares the protein "Sequence 1" against the nucleotide "Sequence 2" which has been translated in all six reading frames blastx - compares the nucleotide "Sequence 1" against the protein "Sequence 2" tblastx - compares nucleotide "Sequence 1" translated in all six reading frames against the nucleotide "Sequence 2" translated in all six reading frames.
62
...besides FASTA and NCBI’s BLAST: • Some Second Generation algorithms – WU-BLAST – PSI-BLAST
63
WU-BLAST - 1 • WU-BLAST (v.2.0) Washington Univ., St. Louis has some additional features • reports groups of statistically similar regions- (not just the “best” regions in a particular alignment) that are made up of individual, statistically insignificant segments 64
WU-BLAST - 2 • has an optional Smith-Waterman postprocessor that runs that alignment on the pairs of sequences found • supports parallel processing on different platforms • is run on selected databases (TIGR)
65
BLAST documents on the Internet • History of BLAST modifications maintained at http://blast.wustl.edu/blast-1.4/HISTORY • WU-BLAST information and BLAST reference manual pages (.pdf file): http://blast.wustl.edu/blast/README.html
66
Position-Specific Iterated (PSI) BLAST • a “last-resort” search approach • conducts an iterated (repeated) database searches; unlike BLAST, which only does a single search • unlike BLAST which uses a 20 x 20 residue substitution matrix, psi-BLAST is not tied to a specific score matrix
67
PSI- BLAST (cont.) • PSI-BLAST then “converges” and stops searching if all sequences found in round i+1 having E-values below the threshold were already in the model at the beginning of round (i). • This version of BLAST is more sensitive to statistically weak but biologically relevant sequence similarities • a PSI-BLAST site: http://seqsim.ncgr.org/newBlast.html Altschul et al., 1997
68
References •Apostolico and Giancarlo, J. Comput. Biol. 5: 173-196, 1998 •Barton, G.J. Ch. 2, Sternberg’s book, 1996 •Carroll, Vanity Fair, 1879 •Dayhoff, Schwartz & Orcutt, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Wash. D.C. vol 5, suppl., 3, pps 345-352, 1978 •Doolittle et al. Cold Spring Harbor Symp.Quant. Biol.51:447455, 1986 •Gusfield, Algorithms on strings, trees, and sequences, Cambridge Univ. Press, NY, 1997. •Hamming, Bell Sys. Tech. J. 29: 147-160, 1950 •Henikov and Henikov, Meth. Enzymol. 266:88-105, 1996 •Levenshtein, Soviet Phys. Dokl., 6:707-710, 1966 •Pearson, Tutorial – ISMB2000 (www.med.virginia.edu/~wrp) •Pearson, Meth. Molec. Biol. 1998, in press ( “ ” ” “/ “ ) •Remington & Schork, Statistics with applications to the 69 biological and health sciences. 2nd Ed., Prentice Hall, NY, 1985