Final 2

  • Uploaded by: nby_j
  • 0
  • 0
  • May 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Final 2 as PDF for free.

More details

  • Words: 2,618
  • Pages: 85
Phylogenet ics

Course Bio310: Bioinformatics-I Waseem Haider Alvi Lecturer, Dept. of Biosciences, COMSATS Institute of Information Technology Islamabad

Phylogenetic Analysis Overview Insight into evolution Evolutionary relationships shown as branches Length and nesting reflects degree of

similarity between any two items

Phylogenetic Analysis Relationship among species

IMAGE: http://www.howe.k12.ok.us/~jimaskew/bclasfy.htm

Phylogenetic Analysis Relationship within a species (HIV-1 vs HIV-2)

IMAGE: http://www-hto.usc.edu/~cbmp/2002/PhylogeneticAnalysis/Distance%20Methods.html

Phylogenetic Analysis Overview C-Terminal Motor Kinesin sequences  http://www.proweb.org/kinesin/BE4_Cterm.html

Phylogenetic Analysis Overview Objective: determine branch length and figure

out how the tree should be drawn Sequences most closely related drawn as

neighboring branches

Phylogenetic Analysis Overview Dependent upon good multiple sequence

alignment programs Group sequences with similar patterns of

substitutions

Phylogenetic Analysis Overview Consider two related sequences: ancestoral sequence can be (partially) derived With additional sequences, more information

can be gathered

Uses of Phylogenetic Analysis Given a set of genes, determine genes

likely to have equivalent functions

Follow changes occurring in a rapidly

changing species

Example: influenza Study rapidly changing genes Next year’s strain can be predicted Flu vaccination can be developed

Tree of Life Phylogenies study the evolution of a species

 Image: http://microbialgenome.org/primer/tree.html

Tree of Life Traditionally, morphological (visible features)

characters used to classify organisms Living organisms Fossil records

Sequence data now can be used

Tree of Life Many different resources including:  NCBI taxonomy web sites University of Arizona’s tree of life project

NCBI Taxonomy Web Site

 http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html

/

Tree of Life http://tolweb.org/tree/phylogeny.html

Evolutionary Trees Two dimensional graph showing evolutionary

relationship among a set of items organisms, genes, or sequences Each unit defined by a distinct branch

Evolutionary Trees leaves represent the units(taxa) being studied nodes and branches represent the

relationships among the taxa

Two taxa derived from the same common

ancestor share a node

Evolutionary Trees Two types: Evolutionary Time 

Assumes molecular clock hypothesis

Edit Distance (Phylogram)

Number of sequence level changes occurring  Distance not in direct relation to age 

Cladograms measure time.

Phylograms measure change.

sharks seahorses

Root

frogs owls

50 million years

crocodiles armadillos bats

sharks

seahorses frogs owls

Root crocodiles armadillos 5% change

bats

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Ultrametricity All tips are an equal distance from the root.

X a Root

b c

d

e Y

a=b+c+d+e

Additivity Distance between any two tips equals the total branch length between them.

a X b Root

c

e d

XY = a + b + c + d + e

In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Y

Homology: identical character due to shared ancestry (evolutionary signal) Homoplasy: identical character due to evolutionary convergence or reversal (evolutionary noise)

lizards snakes +hair

rodents primates

Homology

+flight birds snakes rodents bats +flight Homoplasy (Convergence)

worms lizards snakes +legs –legs Homoplasy (Reversal)

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Rooted Trees One sequence (root) defined to be

common ancestor of all other sequences

Unique path leads from root node to any

other node

Direction of path indicates evolutionary

time

Root chosen as a sequence thought to

have branched off earliest

Rooted Trees If molecular clock hypothesis holds, it is

possible to predict a root

As the number of sequences increase, the

number of possible rooted trees increases very rapidly

bifurcating binary tree is usually the best

model to simulate evolutionary events

Example Rooted Tree

 Image source:

http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Unrooted Tree (Star) Indicates evolutionary relationship without

revealing location of oldest ancestry Fewer possible unrooted trees than a rooted

tree

Example Unrooted Tree

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jp

 Image:

http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Number of Trees Unrooted trees

Rooted trees

# pairwise distances

3

3

1

3

3

4

4

6

3

5

15

6

5

10

15

7

105

8

6

15

105

9

945

10

10

45

2,027,025

17

34,459,425

18

30

435

8.69 × 1036

57

4.95 × 1038

58

N

N (N - 1) 2

# branches /tree

# branches /tree

# sequences

# trees

(2N - 5)! 2N - 3 (N - 3)!

2N - 3

# trees

(2N - 3)! 2N - 2 (N - 2)!

Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

2N - 2

Methods for Determining Trees Three main methods: maximum parsimony Distance maximum likelihood

Maximum Parsimony Predicts evolutionary tree minimizing number

of steps required to generate observed variation Multiple sequence alignment must first be

obtained

Maximum Parsimony For each position, phylogenetic trees

requiring smallest number of evolutionary changes to produce observed sequence changes are identified Trees producing smallest number of

changes for all sequence positions are identified

Maximum Parsimony Time consuming algorithm Only works well if the sequences have a

strong sequence similarity

Maximum Parsimony Example 1 2 3 4

A A A A

A G G G

G C A A

A C T G

G G A A

T T T T

G G C C

C C C C

A G A G

four sequences, three possible unrooted trees

Maximum Parsimony Example Possible Trees:

1

2

3

4

1

3

2

4

1

4

3

2

Maximum Parsimony Example Some sites are informative, others are not Informative site has same sequence character

in at least two different sequences

Only informative sites are considered

Maximum Parsimony Example 1 2 3 4

A A A A

A G G G

G C A A

A C T G

G G A A

T T T T

G G C C

C C C C

A G A G

Three informative columns

Maximum Parsimony Example 1 2 3 4

G G A A

G G C C

A G A G

1

3

1

2

1

3

2

4

3

4

4

2

Column 1 1

3

1

2

1

3

2

4

3

4

4

2

Column 2

Tree 1: 4 Tree 2: 5 Tree 3: 6

1

3

1

2

1

3

2

4

3

4

4

2

Column 3 Is a substitution

Maximum Parsimony Columns representing greater variation

dominate the analysis

To overcome this look only at transversion

events

Most significant base changes (i.e. changes a

purine to a pyrimidine or vice versa)

Lake’s method of invariants

Distance Method Looks at number of changes between each

pair in a group of sequences Goal: identify tree positioning neighbors

correctly that has branch lengths reproducing original data as closely as possible

Distance Method CLUSTALW: neighbor-joining method as a

guide to multiple sequence alignments PHYLIP suite of programs employ

neighbor-joining methods  http://evolution.genetics.washington.edu/phylip.html

Distance Programs in Phylip FITCH: estimates phylogenetic tree assuming

additivity of branch lengths using the FitchMargoliash method KITSH: same as FITCH, but under the

assumption of a molecular clock

Distance Programs in Phylip NEIGHBOR: estimates phylogenies using

either: neighbor-joining (no molecular clock assumed) unweighted pair group method with arithmetic

mean (UPGMA) (molecular clock assumed)

Distance Analysis distance score counted as: number of mismatched positions in alignment number of sequence positions changed to

generate the second sequence

Success depends on degree the distances

are additive on a predicted evolutionary tree

Example of Distance Analysis Consider the alignment:

A B C D

ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT

Example of Distance Analysis Distances can be shown as a table A B C D

ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT

Example of Distance Analysis Using this information, a tree can be drawn: A B C D

ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT

A

B

C 2

1

4

1

2

D

Fitch and Margoliash Algorithm (3 sequences) Distance table used Sequences combined in threes define branches of predicted tree calculate branch lengths of the tree

Fitch and Margoliash Algorithm (3 sequences) 1) Draw unrooted tree with three branches

originating from common node:

A

B

a

b

c

C

Fitch and Margoliash Algorithm (3 sequences) 1) Calculate lengths of tree branches algebraically:              

distance from A to B = a + b = 22 (1) distance from A to C = a + c = 39 (2) distance from B to C = b + c = 41 (3)   subtracting (3) from (2) yields:   b + c = 41 -a – c = -39 __________ b – a = 2 (4)   adding (1) and (4) yields 2b = 24; b = 12 so a + 12 = 22; a = 10 10 + c = 39; c = 29

Fitch and Margoliash Algorithm (3 sequences) 3) Resulting tree:

A

10 C 29 12

B

Fitch and Margoliash Algorithm (5 sequences) Algorithm can be extended to more

sequences. Consider the distances:

A

a b

B

C

c

f g

d

e

D

E

Fitch and Margoliash Algorithm (5 sequences) Locate most closely related sequences

Fitch and Margoliash Algorithm (5 sequences) create a new table by combining

remaining sequences Distance from D to A,B,C: average distance of

each of these to D ( (39 + 41 + 18) / 3 = 32.7) Distance from E to A,B,C: average distance of

each of these to E ((41+43+20)/3 = 34.7).

Fitch and Margoliash Algorithm (5 sequences) average distance from D to ABC and E to ABC can be found by averaging the sum of the appropriate branch lengths: D to E: d + e = 10 (1) D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2) E to ABC: e + m = 34.7 (3)   By subtracting the third equation from the second equation we get: A d – e = -2 a Adding this result to (1) we get: 2d = 8; d = 4 Substitute back in to get e = 6 b B

C

c

f g

d

e

D

E

Fitch and Margoliash Algorithm (5 sequences) Treat DE as a single sequence Create new distance table Distance from A to DE is average of A to D and A to E (similar for other distances as well)

Fitch and Margoliash Algorithm (5 sequences) A

a b

B

By algebra, c = 9; g = 5

c

f g

5

C

9 d

e

D

4 6

E

Fitch and Margoliash Algorithm (5 sequences) Continue process until all lengths are found:

A

a 10

f

c

20 b B

12

g5

C

9 d

6

e

D

4 E

Summary of FitchMargoliash 1) Find the mostly closely related pairs of sequences (A, B). 2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others; and from B to all others. 3) Use these values to calculate the length of the edges a and b.

Summary of FitchMargoliash 4)

Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.

5)

Identify next pair of related sequences and begin as with step 1.

6)

Subtract extended branch lengths to calculate lengths of intermediate branches.

Summary of FitchMargoliash 7) Repeat entire process with all possible pairs of sequences. 8) Calculate predicted distances between each pair of sequences for each tree to find the best tree.

Neighbor Joining Similar to Fitch-Margoliash Sequences chosen to give best least-squares

estimate of branch length

Neighbor Joining Begin with star topology – no neighbors have

been joined B

C

D

A E

Neighbor Joining Tree modified by joining pairs of

sequences Pair is chosen by calculating sum of

branch lengths for the corresponding tree:

S mn

d mn ∑ d ij + + 2( N − 2) 2 N −2

d ∑ =

im

+ d in

Neighbor Joining If A and B are joined: B

C D

A

S mn

E

d mn ∑ d ij + + 2( N − 2) 2 N −2

d ∑ =

im

+ d in

Neighbor Joining Pair with smallest branch length chosen to

be joined Fitch-Margoliash algorithm used to

compute actual branch lengths a new distance table is created with joined

sequences entered as a composite

Neighbor Joining Repeat process to select next pair to join Process continues until correctly branched

tree and distances identified

UPGMA “Unweighted Pair Group Method Using

Arithmetic Averages”

Works by clustering sequences Produces Rooted Trees by assembling the

tree upwards beginning with most similar sequences

UPGMA distance dij between clusters Ci and Cj is

average distance between pairs of sequences from each cluster:  

1 d ij = d pq ∑ | Ci || C j | pinCi , qinC j

|Ci| and |Cj| are the number of sequences

in clusters i and j

UPGMA  INITIALIZATION STEPS

Assign each sequence i to its own cluster Ci Define one leaf of the tree T for each sequence, and place it at height 0.

UPGMA Iteration steps 3. Determine the two clusters, i and j for which dij is minimal 4. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l

UPGMA Iteration steps 5. Define a node k with daughter nodes i and j, and place it at height dij/2. 6. Add k to the current clusters and remove i and j.   7. Continue steps 3-6 until only two clusters i and j remain, and place the root of the tree at height dij/2

UPGMA Example using 5 Sequences: 

9 1

6

2

8 7

3

4

1 5

2

4

5

3

UPGMA Rooted trees provide distance measures that

can be used with constant molecular clock See Durbin, et al. 166-167 See Mount, 262-268

Maximum Likelihood Calculates likelihood of a tree given an

alignment Trees with least number of changes will be most likely Incorporates Jukes-Cantor and Kimura models

of evolution

Maximum Likelihood Probability of each tree is product of mutation

rates in each branch Likelihoods given by each column multiplied

to give the likelihood of the tree

Maximum Likelihood PHYLIP Programs: DNAML: Maximum likelihood, allows for different mutation models, unequal rates of mutation DNAMLK: Same as DNAML, but assuming

molecular clock

Maximum Likelihood Disadvantages: Computationally intensive Can only be done for a handful of sequences

Assessing significance of Tree Bootstrapping methods are used to assess

significance Given an alignment, randomly pick columns

from alignment with replacement Apply tree building to new data set

Assessing significance of Tree Repeat selection a number of times Frequency at which phylogenetic feature

appears is measure of confidence for this feature

Which Method to Choose? depends upon the sequences that are being

compared strong sequence similarity: 

maximum parsimony

clearly recognizable sequence similarity 

distance methods

All others: 

maximum likelihood

Which Method to Choose? Best to choose at least two approaches Compare the results – if they are similar, you

can have more confidence

Neighbor-joining

Maximum parsimony

Maximum likelihood

Uses only pairwise distances

Uses only shared derived characters

Uses all data

Minimizes distance between nearest neighbors

Minimizes total distance

Maximizes tree likelihood given specific parameter values

Very fast

Slow

Very slow

Easily trapped in local optima

Assumptions fail when evolution is rapid

Highly dependent on assumed evolution model

Good for generating tentative tree, or choosing among multiple trees

Best option when tractable (<30 taxa, homoplasy rare)

Good for very small data sets and for testing trees built using other methods

Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt

Building Alignments and Trees Simultaneously Sankoff & Cedergren gap substitution (using

dynamic programming) Hein’s affine cost algorithm Covered in detail in Durbin, et al. (not in this

class)

Difficulties With Phylogenetic Analysis Horizontal or lateral transfer of genetic

material (for instance through viruses) makes it difficult to determine phylogenetic origin of some evolutionary events Genes selective pressure can be rapidly

evolving, masking earlier changes that had occurred phylogenetically

Difficulties With Phylogenetic Analysis two sites within comparative sequences

may be evolving at different rates Rearrangements of genetic material can

lead to false conclusions   duplicated genes can evolve along separate pathways, leading to different functions

HOMEWORK Read Mount, Chapter 7 (Phylogenetic

Analysis) Read Durbin, et al., chapters 7

Related Documents

Final 2
May 2020 1
2 Final
May 2020 4
Final 2
April 2020 8
Final 2
May 2020 3
Final 2
November 2019 6
Final 2
October 2019 9