Sequence Analysis

  • Uploaded by: Ravi Kanth Redy
  • 0
  • 0
  • November 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Sequence Analysis as PDF for free.

More details

  • Words: 1,766
  • Pages: 36
Sequence Analysis

Sequence analysis Pairwise Alignment: ¾Dot Matrix ¾Dynamic Programming ¾Heuristic Methods: FASTA BLAST

Pairwise Sequence Alignment

• The process of comparing two sequences by searching for a series of individual characters that are in the same order in the sequences

Terms used to indicate sequence similaity Term used

Description of the term

Homologous

Characters similar due to common ancestry

Analogous

Characters similar due to convergent evolution

Orthologous

Charecters homologous with conserved function

Paralogous

Characters homologous with divergent function

Xenologous

Sequence identity due to horizontal transfer

Dot matrix • • • •

Useful for simple alignments Uses graphical methods Easy to understand and apply Disadvantage: Does not produce an optimal alignment

Dot matrix • Constructed to visually compare two sequences and detect regions of close similarity between them • Similarity between two sequences can be detected as a diagonal on an identity matrix • Word size can be fixed to filter the best dot plot

Dynamic programming • A little bit challenging to understand when compared to Dot matrix • Produces optical alignment using both Global and Local alignments • Disadvantage: Needs skill to produce the most optical alignment

Scoring Matrices • The alignment that gives the highest similarity score is the optimal alignment • A scoring method used in the alignment of one residue against another • The number of alignments that must be checked to identify the optimal alignment increases exponentially with the lengths of the sequences

SCORING MATRICES for Proteins

• PAM (Percent Accepted Mutations) • BLOSUM (BLOcks SUbstitution Matrix)

PAM • The first substitution matrix developed for comparision of protein sequences • Also called as Dayhoff, MDM or PAM matrices • Developed by Margaret Dayhoff and co-workers • Developed from global alignments of closely related sequences • Number accompanying PAM refers to evolutionary distance between the sequences • Larger number – greater evolutionary distance

Table showing the PAM matrices and the corresponding percent identities

Matrix

PAM0 PAM30 PAM80 PAM110 PAM200

PAM250

% Identity

100

20

75

50

60

25

BLOSUM • Developed by Steve Henikoff and co-workers • Based on local alignments of distantly related sequences • Small number - greater evolutionary distance

Global Alignment • Is done across the entire sequence length to include as many matches as possible upto and including the sequence end • Needleman-Wunsh Algorithm • Example of programme using global alignment by implementing NeedlemanWunsh algorithm: Needle tool of EMBOSS

Local Alignment • Finds out the pair of regions, one from each sequence that exhibit high similarity • Smith-Waterman Algorithm • Example of programme using global alignment by implementing SmithWaterman algorithm: Water tool of EMBOSS

Heuristics methods • Fast and effective but approximate methods • BLAST and FASTA

FASTA • It is a programme written by W.R.Pearson and D.J.Lipman • Sensitive and rapid algorithm • EMBL and DDBJ use FASTA for whole database search • Disadvantage: Completely ignores more more distantly related sequences that have functional homology but no longer retain complete identity

Flavours of FASTA TYPE of fasta

Query sequence

Library/ Database scanned

fasta

Protein or DNA

Protein or DNA

tfasta

Protein

Translated DNA

lfasta

Compares two sequences for local similarity and shows local sequence alignments

plfasta

Compares two sequences for local similarity and plots the local sequence alignments

fastx

Tanslated DNA (all the six frames)

fasty

Protein

FASTA Considers exact matches for a given set of sequences

Uses DP algorithm to compute optimal alignment out of the exact matches

BLAST (Basic Local Alignment Search Tool) • Karlin and Altschul • Multiple hits to the same sequence is allowed in order to find out patches of regional similarity

BLAST Program

Database

Query

BLASTN

Nucleotide

Nucleotide

BLASTP

Protein

Protein

BLASTX

Protein

Nucleotide translated into protein

TBLASTN

Nucleotide translated into protein

Protein

TBLASTX

Nucleotide translated into protein

Nucleotide translated into protein

BLAST Looks for short subsequences that are likely to have significant matches

Tries to extend these subsequences in order to obtain maximum sequence similarity

Steps in alignment process 1. Matrix construction. 2. Matrix filling. 3. Back tracing. 4. Alignment plotting.

The two example sequences taken are: 5’ GAATTCAGTTA 3’ 5’ CCATCGG 3’

Matrix construction and filling: The matrix is constructed as in the slides that follow, the number of rows and columns taken depends on the number of bases present in both the sequences to be aligned. Once the matrix is constructed, it is filled based on some scoring rules and formulas as follows: Formula for determining the score of a cell (for global) :

Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W)]

(or)

Si,j = Max [(Si-1,j-i), (Si-1,j), (Si,j-1)] + W Where, Si,j = Score of the cell represented by ith column and jth row i = row in which the cell of consideration lies j = column in which the cell of consideration lies W = weight given to the substitution of bases represented by the cell of consideration

Scoring rules used: The weights given in this example for substitution or indel are the following :Match = 2

Si,j = Max [Si-1,j-1, Si-1,j, Si,j-1] + W

Mismatch = -1 Gap = -2

S2,2 = Max [S1,1, S1,2, S2,1] + W For example, the calculation of score in the case of the cell S2,2: S1,1 = 0 S1,2 = -2 S2,1 = -2 and W (GQC) = -1 Max of all the three is 0. So, 0 + -1 = -1 So, this score is given to the cell and arrow is placed from the diagonal cell from which the value is obtained, it can be to more than one cell also.

row

(i)

G col

(j)

1

0

-2

-2

-1

3

C

-4 4

A

-6 5

T

-8 6

C

-10 7

G

-12 8

G

-14

A 4

3

2

2

C

A -4

T 5

-6

T 6

-8

C 8

7

-10

A

-12

G 9

-14

T 10

-16

-18

T 11

-20

A 12

-22

Similarly all the cells are assigned scores and arrows are placed accordingly. Finally when the entire matrix is filled, the back tracing is done. The two differences between local and global alignment are: ™ Scoring: In case of global alignment whatever score we get from the formula, we assign it to the cell. Where as in case of local alignment the cell score is equaled to zero if the calculated maximum score is less than zero. So, the formula for calculating score of a cell for local alignment is written as: Si,j = Max [(Si-1,j-1 + W), (Si-1,j + W), (Si,j-1 + W), 0] ™ Back tracing: In case of global it starts from the bottom right and ends at the top left. Where as in case of local alignment, back tracing is done starting from the highest score in the matrix to the lowest score other than zero, and if more than one traces are there the longest one is chosen to plot the optimal local alignment. While back tracing, it is a must that trace should be always from the cell of consideration to only those cells from which it’s score is attained.

The final filled matrix for global alignment (including the optimal alignment trace in red) obtained for the example sequences taken is . . . .

row (i)

G col

(j)

1

A 4

3

2

A

T 5

T 6

C

A 8

7

G 9

T 10

T 11

A 12

0

-2

-4

-6

-8

-10

-12

-14

-16

-18

-20

-22

-2

-1

-2

-3

-4

-5

-3

-4

-5

-6

-7

-8

-4

-2

-2

-3

-4

-5

-3

-4

-5

-6

-7

-8

-6

-3

0

2

1

0

-1

1

0

-1

-2

0

-8

-4

-1

1

4

6

5

4

3

5

7

6

-10

-5

-2

0

3

5

8

7

6

5

4

6

-12

-3

-3

-1

2

4

7

7

9

8

7

6

-14

-1

-2

-2

1

3

6

6

11

10

9

8

2

C 3

C 4

A 5

T 6

C 7

G 8

G

If a score is followed by a diagonal, then there will be no gap. The bases represented by the cell can be aligned side by side in such case. If a score is followed by a horizontal or vertical arrow, a gap is inserted in the sequence towards which the arrow is pointing.

CAUTION: Gap must be inserted only when we come across a horizontal or vertical pointing arrow following a score, but not for a diagonal arrow following it. A diagonal arrow following a score represents an acceptable match, which can be either match or mismatch. Properly check where a gap must be inserted (the sequence towards which the arrow points should get a gap).

From the previous slide, there can be a total of four OPTIMAL GLOBAL ALIGNMENTS possible. These are:

GA A T T C A GT T A C C ATCG G GA A T T C A GT T A C C ATC GG G AA T T C A GT T A C C ATCG G G AA T T C A GT T A C C ATC GG

The final filled matrix for local alignment (including the optimal alignment trace in red) obtained for the example sequences taken is . . . .

row

(i)

G col

(j)

1

A 4

3

2

A

T 5

T 6

C

A 8

7

G 9

T 10

T 11

A 12

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

2

1

0

0

0

0

0

0

0

0

0

0

4

3

2

1

0

0

0

0

2

4

3

2

3

6

5

4

3

5

0

0

1

3

6

8

7

6

5

7

9

8

0

0

0

2

5

7

10

9

8

7

8

8

0

2

1

1

4

6

9

9

11

10

9

8

0

4

3

2

3

5

8

8

13

12

11

10

2

C 3

C 4

A 5

T 6

C 7

G 8

G

From the previous slide, there can be a total of two OPTIMAL LOCAL ALIGNMENTS possible. These are:

A A T T C A G

A A T T C A G

A TC G G

A TC GG

Distance • Distance between two sequences is the minimal sum of weights for a set of mutations transforming one into the other

• Similarity is useful for database searching • Distance measures are useful for phylogenetic tree construction

Related Documents

Sequence Analysis
November 2019 13
Sequence
April 2020 30
Area Sequence
May 2020 14
Sequence & Series
May 2020 15
Sequence Diagram.docx
December 2019 21

More Documents from "Irvan Sevenfold"