Sequence Search And Alignment

  • November 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Sequence Search And Alignment as PDF for free.

More details

  • Words: 3,418
  • Pages: 69
Protein Bioinformatics 410.639 (http://jhunix.hcf.jhu.edu/~flebeda)

Lecture 3 SEQUENCE SEARCH & ALIGNMENT I. Similar 3D structures without sequence similarity Frank J. Lebeda 1

Experimental Data AA

2D data

composition

Protein sequences

Course Flow-chart Patterns

Search Database

Substitution matrices Gap penalties

2D predictions

2D prediction evaluation Protein class predictions

Pair & Multiple alignments Sequence Homologues

TM & hTM predictions

Solvent accessibility predictions

3D predictions

2

The variety of Definitions and Terms in Protein Bioinformatics • Due to the many people with diverse backgrounds who have made contributions: • biologists • chemists • computer scientists • statisticians • linguists • mathematicians • speech, pattern recognition & image analysts 3

Sequence search & alignment • Rationale • Matrices & scoring pairwise alignments • Scoring algorithms – 1st generation: Optimal – 2nd generation: Heuristic

• Scoring algorithms -3rd generation – multiple alignments

• Profiles/patterns - 4th generation 4

Why do protein sequence searching, pairwise or multiple-alignments, matching? - to determine whether a new sequence has already been deposited in a databank - to validate predicted ORFs - to find homology = to deduce a common evolutionary ancestry: usually similar over an entire sequence; alignments >50% identity in short <40 residue regions can occur solely by chance

- to help build new 3D models - to represent protein families & superfamilies - to identify conserved sequence features that correlate 5 with function

Why do alignments? (continued) Build evolutionary trees. - Understand the lineage of different species. - Have an organizing principle for sorting species into a taxonomy. Understand how various functions evolved. Understand forces and constraints on evolution.

Altman’s Phylogenetic lecture http://smi-web.stanford.edu/projects/helix/bmi214/ 6

Results from similarity searches: The most important conclusion that one can infer from doing sequence searching and alignment: high sequence similarity usually implies significant structural & functional similarity. (The 1st fact of biological sequence comparison -Gusfield 1997)

HOMOLOGY - implies a common ancestry - similarity in structure & function - applies either to a pair of sequences or to a large family of proteins By definition - homologous proteins share a common fold, i.e. the secondary & 3D structure, …however... 7

Similarity does not necessarily mean Homology - Homologous proteins “need not share statistically significant or even detectable sequence similarity” [BUT they can still share the same 2D, 3D structures, & functions] (The 2nd fact of biological sequence comparison- Gusfield, 1997)

- Proteins having similar 3D structures may not necessarily share similar functions - Proteins having similar sequences may not have similar 3D structures 8

Pearson, 2000

1chg

1chg

Similar 3D structures without sequence similarity

Sequence similarity, without similar 3D structures Pearson, 2000

1sgc

2baa

9

Protein Evolution - governed by the constraint of maintaining a characteristic fold which enables some function. From this concept, we can infer relationships between proteins that last shared a common ancestor ~2.5 billion years ago. (De La Vega’s notes)

10

Some important dates in history

-2.5

11

Pearson, 2000

Tree of Life

12

adapted from Dayhoff et al, 1978 Pearson, 2000

Types of homology Orthologous sequences: have the same function in various species, and have arisen via speciation; differences among related species at a single locus Paralogous sequences: members of multigene families that have arisen via gene duplication; differences among similar sequences in the same genome A (parent gene) speciation duplication B C C’ (orthologs) (paralogs) 13 After Altman http://smi-web.stanford.edu/projects/helix/bmi214/)

Concepts of similarity & redundancy “...all of biology is an enormous redundancy.” (Doolittle) Evolution: “...duplication with modification...” (Doolittle) “...reuses, builds on, duplicates, and modifies successful structures.” (Gusfield) (Quotes from Gusfield, 1997)

14

Orthologous sequences: cytochrome c family

15

Pearson, 1998

Search (comparison) algorithms • Align sequences Exact matching of patterns Inexact matching (matching with errors) • Calculate similarity scores for the alignments ==> each score represents an quantitative assessment of an alignment • Infer sequence homologies 16

Searching databases for sequence homologs - major concepts 1. We assume that the amino acid sequences, the protein structures and functions are inter-related. 2. The algorithms used provide: - search efficiency (vs. doing it by hand) - a mathematical basis to do searches consistently - results that can be translated into a measure of statistical significance Adapted from A. McLysaght http://www.maths.tcd.ie/~lily/pres2

17

DNA vs protein sequence comparison

Pearson, 1998

18

Distance between sequences (in the language of computer scientists) A quantitative measure of the difference between two sequences that examines a set of k different (sub)strings {x1, ..., xk} over an alphabet Σ. string = an ordered list of contiguous characters sequence #1: x1...x2... x3 (3 substrings) sequence #2: x1..........x3 (2 substrings) |Σ| = the number of characters in an alphabet: |Σ| = 4 for nucleic acids |Σ| = 20 for amino acids

19

Historical Background of Alignment Scores Hamming Distance (1950) substitutions allowed; no gaps or insertions: Maximal Segment Pair (MSP) algorithm ==> BLAST Levenshtein Distance (1966) substitutions, gaps & insertions allowed with a unit cost: Smith-Waterman P G algorithm ==> SSEARCH E ILEPGHAC Î L H IL----AC I

A C

In 1879, C.L. Dodgson (Lewis Carroll) made an early 20 reference to an alignment problem ==>

Superimposing 3D and 1D structures P E I

L

G H A

...ILEPGHAC... ...IL----AC...

A “gap” in a sequence alignment is an artificial way of showing locations of inserts & deletes No “gaps” in 3D structures Protein 1 ...ILAC... vs 21 Protein 2 ...ILEPGHAC... C

Carroll’s word puzzle: Substitute letters to transform an English word into another by going through a series intermediate English words - Start with two words of equal length - then transform HEAD ==> TAIL

HEAD HEAL (D L) TEAL (H T) TELL (A L) TALL (E A) TAIL (L I)

total number of substitutions d1 =0 d2 = 1 d3 = 1 [Each substitution has a COST, d4 = 1 or SCORE = 1 ] d5 = 1 d6 = 1

total no. of substitutions (Hamming distance) = DH = Σ di = 5 [ summing the “scores”]

22

The use of distance measures in pattern recognition/image analysis longer distance

shorter distance “unknown” “templates”

23

Scoring Matrices for residue pairs (weighting, substitution, matching) • Identity/unity/PAM0 - either identical or nonidentical (score = 1 or 0) • genetic code - 0,1,2, or 3 base changes in codon to code for a different residue • chemical - hydropathicity, charge, solvent access. • Observed residue substitutions in aligned sequences – PAM – BLOSUM (from Barton, Ch. 2, Sternberg, 1996) 24

Scoring Matrices •“...the proper scoring matrix is the most critical technical element in a successful search of protein databases.” Gusfield •Their MAJOR limitation: no unique set of scores is satisfactory to always give a good alignment e.g., amino acids in the same protein may experience different local environments • BUT, the structures of these matrices have been investigated and used with some success 25

Alignment locates possible mutation Theory tries to predict reality Point Accepted Mutation or Percent Accepted Mutations (Dayhoff et al.’s PAM score (1967, ‘69, ‘72, ‘78) -- accepted by natural selection. •There is a different mutation probability matrix for each evolutionary distance (or interval) measured in PAMs (e.g., 250PAM, 120PAM, 50PAM, 1PAM etc.) •These matrices quantitatively represent the average way in which residues change during evolution 26

Meaning of PAM scoring units • 1 PAM means having 1 point mutation per 100 residues, a 1% rate of substitution • 250 PAM means a lot of change has occurred; such alignments are still expected to remain about 20% identical since a given location can be associated with several mutations • These alignments are in the ‘Twilight Zone’ - a region in sequence comparison space “that contains distantly related (true positives) and relatively high-scoring but unrelated sequences (false positives)” (Pearson, 1996) (Lebeda’s italics) 27

The limits of sequence similarity 0 20 40

% identity

60 80 100 (mutations/ 100 residues)

Pearson, 2000; Doolittle et al., 1986

28

The limits of sequence similarity 0 20 40

% identity

60 80 100 (mutations/ 100 residues)

automatic pairwise methods

consensus, templates, MAs

threading predictions 29

Pearson, 2000; Doolittle et al., 1986

Comparison of alignment methods for predicting homologous 3D structures “Twilight zone”

% identity 100

80

60

40

20

0

|

|

|

|

|

|

automatic pairwise methods

threading consensus & predictions template (pattern) methods; mult. alignments

modified from Schulze-Kremer, 1996

30

Log-odds scoring matrix for the evolutionary distance of 250 PAMs (190 possible exchanges)

31

(Data from Dayhoff et al., 1978)

Substitution matrices • The PAM250 was one of the most widely used substitution matrices • The BLOSUM series of matrices appear to be more sensitive to distantly related sequences

32

Scoring an alignment For each pair of residues (i and j) in an aligned pair of sequences, a score is calculated

p(i ← j) signal/noise score ~ p(i) p(j) ratio (odds) Where p(i ← j) is the probability of a location having residue “i” in one sequence that is substituted with residue “j” in the other sequence. p(i) and p(j) are the probabilities (frequencies) of these two residues in the database 33

Building the PAM250 matrix • Each element is a substitution score: target

p(i ← j) score(i ← j ) = log base ( ) p(i) p(j) background The scores that best distinguish between pairwise alignments from chance are called log-odds scores 34

Foundations: data for the PAM1 matrix • Start with the stochastic* matrix giving the target probability distribution for a distance of 1PAM • was experimentally determined by studying >1500 substitutions (accepted point mutations) • used 71 groups of closely related proteins (>85% identical)

* involving chance or probability

35

Log-odds scoring matrix for the evolutionary distance of 250 PAMs C

C S

(250PAM * 10 = MDM78)

S

R

K aromatics

hydrophil.

G

W

acidic

|

basic

hydrophob.

(Data from Dayhoff et al., 1978)

W

36

W

Varying the clustering percentages generated a family of BLOSUM matrices • different matrices were constructed by changing the value of the clustering % identity • For example, the BLOSUM62 substitution matrix (at 62%) contains scores from “clustered” segments that are on average 62% identical • Structure-based log-odds matrices were also constructed 37

Henikoff, 1996

Search algorithms (continued) Optimal - slow, rigorous, exhaustive: guaranteed to calculate an optimal similarity score for every sequence in a database Heuristic - a short-cut, rapid, approximate: not guaranteed to calculate an optimal similarity score 38

Pearson, 1998

Two Optimal algorithms Used to derive different alignment scores: 1. Global scores - calculated for an alignment that starts at the beginning of each sequence and extends to the end of each sequence; with or without gaps (Needleman and Wunsch; 1970) 2. Local scores - used to find short regions with similar residues (SSEARCH; Smith-Waterman, 1981)

39

Dynamic Programming Algorithms • Compares two sequences, computes the minimal distance between them when seeking an optimal alignment • Represents a group of efficient algorithms by which both optimal alignments and the resulting distances are computed at the same time 40

An example of scores (edit, distance or unit cost) • In an alignment, scores are given for an alignment of sequences A={a1,a2...ai..aN} and sequence B ={b1,b2…bj…bM} for every pair of: Matches (identical residues) σ(ai, bj) =1 Mismatches (substitutions) σ(ai, bj) =1 Insertions or deletions (gaps) σ(ai, -), σ(-, bj), =0 • Scores are calculated by adding all of these local (positive) scores for each alignment. Similarity scores unlike distance or cost scores may be < 0. • There may be many optimal alignments but only one optimal 41 score.

Local differences & their edit graphs a

K C insert b D Y

K CinsertEa F 1

K CinsertEa F

K C E F insert a = delete b

1

1

2

2

insert b= delete a

3

4 Scores: 4(optimal) Sequence A: KCEF

3 KCE-F

2 K-CE-F

Sequence B: KCDY

K-CDY

KC--DY

3

diagonal = substitution or identity vertical = insert b/delete a horizontal= delete b/ insert a

2

b

42

Another matrix representation: AATGC (sequence 1) EXAMPLE AG_GC (sequence 2)

ALIGNMENT (scores) s(i,j) s(i,j) s(-,j) s(i,-)

(insert seq 1)

A G G C

A 1 0 0 0

A 1 1 1 1

T 0 1 1 1

G 0 2 2 1

C 0 1 2 3

= = = =

+1 (i=j) 0 (i<>j) 0 gap 0 gap

(insert seq 2)

first arrow= G-G match

3+2+1+1+1= 8 43

Two heuristic algorithms FAST family - (Lipman & Pearson, 1985) (FASTP, FASTA, FASTX, TFASTX, etc.)

BLAST family- (Altschul et al., 1990, etc.) (Basic Local Alignment Search Tool)

In many cases, these algorithms produce results of a quality similar to the Optimal algorithms, although some sensitivity is sacrificed. 44

Searching sequence databases involves a tradeoff between: False negatives (FN) type I errors (α) False positives (FP) type II errors (β) ------Hypothesis: homologous sequences-----------Not Homologous Decision Homologous Accept True pos (TP) False pos (FP) (β) Reject False neg (FN) (α) True neg (TN) (Remington & Schork, 1985; brc.mcw.edu/MetaGene/General/ Analysis/criteria.html45 Gusfield, 1997; Pearson, 2000) #Sensitivity and Selectivity

Sensitivity (precision): ability to detect distant evolutionary relationships in sequences: TP ----------------[ TP + FN ] <= no. of Homologous sequences Selectivity or specificity (recall): ability to avoid assignment of high scores to unrelated sequences: TN --------------[ TN + FP ] <= no. of Non-homologous sequences (terms in parentheses from Prosite) 46

FASTA vs. BLAST • In general, BLAST programs are faster • FASTA programs give more accurate results • the Expect E() values for both are a statistical measure of the likelihood that the observed similarity score could have occurred by chance these values describe the random background noise that exists for matches between sequences

47

(Pearson, in press)

FASTA flowchart

(init1 score)

FromBarton’s chapter 48

(initn score)

BLAST flowchart

many HSPs 49

From Barton’s chapter

BLAST algorithm 1. Score each triple (3 character words) in the query using the identity elements in the chosen scoring matrix - if higher than threshold (T=11), add the word to a list of high-scoring triples, 2. Use this list to scan the database for matches 3. Extend the lengths of the segments 4. Find the local high scoring (HSP) and maximal segment pairs (MSP) 5. Estimate the statistical significance of the MSP scores assuming sequences are randomly composed 50

Others scores related to the PAM matrix • The scores that best distinguish between pairwise alignments

from chance are of this form and are called log-odds scores: q ij

S (i , j ) =

ln( p

ipj

λ

) (the BLAST score)

qi,j’s = frequency with which residues i and j are aligned (target probability or freq.) ==> data for an accepted point mutation matrix px’s = frequency of residue x occurring in a sequence (background probability or freq.) λ = a normalization (scaling) constant (from Karlin & Altschul, 1990, Altschul et al., 1994)

51

Statistical significance of the BLAST score S Expected freq. (E-value) of a chance occurrence of a segment having at least a score of S E = K N exp -λ S Given 2 random sequences (with the same AA composition), E is the number of local high-scoring segment pairs (HSP) expected to occur by chance. (an E-value = 0.1 means that 1 out of 10 matches having the score (S) are expected to occur merely by chance using this database) This equation relates the score, S, of an HSP to its expected frequency of being a chance occurrence N = product of the query & database sequence lengths K and λ are pre-calculated from substitution scores and sequence composition. 52 (www.ncbi.nlm.nih.gov/BLAST/blast_help.html)

Which to use?

NOTE: a no longer recommended 53

Pearson, in press

What threshold Expect value to use? • For both FASTA & BLASTP, a sensible E()-value threshold = 0.001-0.01 • the default max E()-values are set to 10 - these thresholds avoid false positive (type I) errors - little can be done to avoid false negative (type II) errors

54

Relating BLAST’s E- and P-values • P value (probability in the range of 0 to 1) • E value (range: 0 to infinity) P = 1 - exp(-E) as E as E

infinity, P 1 0, E = P (E<0.05)

55

E-values vs. P-values 1.2

P=1

P-values

1 0.8 0.6

P = 1 - exp(-E) if P=E

0.4 0.2 0 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

2.2

E-values

56

Other BLAST items (v.2) • N = no. of High Scoring segment Pairs (HSPs) • Smallest Sum P-value: adds up all the HSP scores for a given alignment; this statistic tends to rank database matches better than the previously used Poisson statistic • >40 command line options (parameters) available to customize your query session • BLASTP version 1.4 still available for ungapped alignments 57

a partial BLAST output

Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N Rv1646, info ... 56 0.062 1 Rv1758, info ... 49 0.15 1

http://www.sanger.ac.uk/DataSearch/omniblast.shtml

58

More BLAST output... Rv1646, [Full Sequence] info TB.seq 1855762:1856691 MW:30220 Length = 311 Score = 56 (19.7 bits), Expect = 0.064, P = 0.062 Identities = 13/31 (41%), Positives = 19/31 (61%) Query: Sbjct:

21 NEFVALAAIPGGAATFAVCQMPNLDEIVSNA 51 N+FV A+ GGAA +A + N+ + V NA 74 NQFVG--ALRGGAAAYASAEAANVQQTVVNA 102

59

An additional BLAST output • A normalized S score can expressed as a relative entropy (H) value : H (bits) = (λS - log2 (K) )/log2(2) • the information expected to be obtained for each HSP that distinguishes it from a random alignment • unlike S, statistical significance can be associated with H that is independent of the scoring matrix used 60

Structure homology threshold as a function of sequence alignment length (from PDB) 80

Average % residue identity (homology thresholds)

NOT 3D homologues @30% identity

70 60

3D homologues @30% identity

50

Select 30% threshold

40 30

Asymptote

20

Short segments with 50% ID not significant

10 0 <10 10

20

30

40

50

60

70

80 >80

can get ~25% ID from unrelated proteins

Alignment lengths

At 30% sequence identity, alignment lengths of >=60 residues implies homology while a 30% identity for an alignment of <59 residues does not 61

The BLAST family blastn - for nucleotide - nucleotide comparisons blastp - for protein - protein comparisons tblastn - compares the protein "Sequence 1" against the nucleotide "Sequence 2" which has been translated in all six reading frames blastx - compares the nucleotide "Sequence 1" against the protein "Sequence 2" tblastx - compares nucleotide "Sequence 1" translated in all six reading frames against the nucleotide "Sequence 2" translated in all six reading frames.

62

...besides FASTA and NCBI’s BLAST: • Some Second Generation algorithms – WU-BLAST – PSI-BLAST

63

WU-BLAST - 1 • WU-BLAST (v.2.0) Washington Univ., St. Louis has some additional features • reports groups of statistically similar regions- (not just the “best” regions in a particular alignment) that are made up of individual, statistically insignificant segments 64

WU-BLAST - 2 • has an optional Smith-Waterman postprocessor that runs that alignment on the pairs of sequences found • supports parallel processing on different platforms • is run on selected databases (TIGR)

65

BLAST documents on the Internet • History of BLAST modifications maintained at http://blast.wustl.edu/blast-1.4/HISTORY • WU-BLAST information and BLAST reference manual pages (.pdf file): http://blast.wustl.edu/blast/README.html

66

Position-Specific Iterated (PSI) BLAST • a “last-resort” search approach • conducts an iterated (repeated) database searches; unlike BLAST, which only does a single search • unlike BLAST which uses a 20 x 20 residue substitution matrix, psi-BLAST is not tied to a specific score matrix

67

PSI- BLAST (cont.) • PSI-BLAST then “converges” and stops searching if all sequences found in round i+1 having E-values below the threshold were already in the model at the beginning of round (i). • This version of BLAST is more sensitive to statistically weak but biologically relevant sequence similarities • a PSI-BLAST site: http://seqsim.ncgr.org/newBlast.html Altschul et al., 1997

68

References •Apostolico and Giancarlo, J. Comput. Biol. 5: 173-196, 1998 •Barton, G.J. Ch. 2, Sternberg’s book, 1996 •Carroll, Vanity Fair, 1879 •Dayhoff, Schwartz & Orcutt, Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Wash. D.C. vol 5, suppl., 3, pps 345-352, 1978 •Doolittle et al. Cold Spring Harbor Symp.Quant. Biol.51:447455, 1986 •Gusfield, Algorithms on strings, trees, and sequences, Cambridge Univ. Press, NY, 1997. •Hamming, Bell Sys. Tech. J. 29: 147-160, 1950 •Henikov and Henikov, Meth. Enzymol. 266:88-105, 1996 •Levenshtein, Soviet Phys. Dokl., 6:707-710, 1966 •Pearson, Tutorial – ISMB2000 (www.med.virginia.edu/~wrp) •Pearson, Meth. Molec. Biol. 1998, in press ( “ ” ” “/ “ ) •Remington & Schork, Statistics with applications to the 69 biological and health sciences. 2nd Ed., Prentice Hall, NY, 1985

Related Documents