Phylogenet ics
Course Bio310: Bioinformatics-I Waseem Haider Alvi Lecturer, Dept. of Biosciences, COMSATS Institute of Information Technology Islamabad
Phylogenetic Analysis Overview Insight into evolution Evolutionary relationships shown as branches Length and nesting reflects degree of
similarity between any two items
Phylogenetic Analysis Relationship among species
IMAGE: http://www.howe.k12.ok.us/~jimaskew/bclasfy.htm
Phylogenetic Analysis Relationship within a species (HIV-1 vs HIV-2)
IMAGE: http://www-hto.usc.edu/~cbmp/2002/PhylogeneticAnalysis/Distance%20Methods.html
Phylogenetic Analysis Overview C-Terminal Motor Kinesin sequences http://www.proweb.org/kinesin/BE4_Cterm.html
Phylogenetic Analysis Overview Objective: determine branch length and figure
out how the tree should be drawn Sequences most closely related drawn as
neighboring branches
Phylogenetic Analysis Overview Dependent upon good multiple sequence
alignment programs Group sequences with similar patterns of
substitutions
Phylogenetic Analysis Overview Consider two related sequences: ancestoral sequence can be (partially) derived With additional sequences, more information
can be gathered
Uses of Phylogenetic Analysis Given a set of genes, determine genes
likely to have equivalent functions
Follow changes occurring in a rapidly
changing species
Example: influenza Study rapidly changing genes Next year’s strain can be predicted Flu vaccination can be developed
Tree of Life Phylogenies study the evolution of a species
Image: http://microbialgenome.org/primer/tree.html
Tree of Life Traditionally, morphological (visible features)
characters used to classify organisms Living organisms Fossil records
Sequence data now can be used
Tree of Life Many different resources including: NCBI taxonomy web sites University of Arizona’s tree of life project
NCBI Taxonomy Web Site
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html
/
Tree of Life http://tolweb.org/tree/phylogeny.html
Evolutionary Trees Two dimensional graph showing evolutionary
relationship among a set of items organisms, genes, or sequences Each unit defined by a distinct branch
Evolutionary Trees leaves represent the units(taxa) being studied nodes and branches represent the
relationships among the taxa
Two taxa derived from the same common
ancestor share a node
Evolutionary Trees Two types: Evolutionary Time
Assumes molecular clock hypothesis
Edit Distance (Phylogram)
Number of sequence level changes occurring Distance not in direct relation to age
Cladograms measure time.
Phylograms measure change.
sharks seahorses
Root
frogs owls
50 million years
crocodiles armadillos bats
sharks
seahorses frogs owls
Root crocodiles armadillos 5% change
bats
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Ultrametricity All tips are an equal distance from the root.
X a Root
b c
d
e Y
a=b+c+d+e
Additivity Distance between any two tips equals the total branch length between them.
a X b Root
c
e d
XY = a + b + c + d + e
In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Y
Homology: identical character due to shared ancestry (evolutionary signal) Homoplasy: identical character due to evolutionary convergence or reversal (evolutionary noise)
lizards snakes +hair
rodents primates
Homology
+flight birds snakes rodents bats +flight Homoplasy (Convergence)
worms lizards snakes +legs –legs Homoplasy (Reversal)
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Rooted Trees One sequence (root) defined to be
common ancestor of all other sequences
Unique path leads from root node to any
other node
Direction of path indicates evolutionary
time
Root chosen as a sequence thought to
have branched off earliest
Rooted Trees If molecular clock hypothesis holds, it is
possible to predict a root
As the number of sequences increase, the
number of possible rooted trees increases very rapidly
bifurcating binary tree is usually the best
model to simulate evolutionary events
Example Rooted Tree
Image source:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Unrooted Tree (Star) Indicates evolutionary relationship without
revealing location of oldest ancestry Fewer possible unrooted trees than a rooted
tree
Example Unrooted Tree
Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jp
Image:
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Number of Trees Unrooted trees
Rooted trees
# pairwise distances
3
3
1
3
3
4
4
6
3
5
15
6
5
10
15
7
105
8
6
15
105
9
945
10
10
45
2,027,025
17
34,459,425
18
30
435
8.69 × 1036
57
4.95 × 1038
58
N
N (N - 1) 2
# branches /tree
# branches /tree
# sequences
# trees
(2N - 5)! 2N - 3 (N - 3)!
2N - 3
# trees
(2N - 3)! 2N - 2 (N - 2)!
Taken From: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
2N - 2
Methods for Determining Trees Three main methods: maximum parsimony Distance maximum likelihood
Maximum Parsimony Predicts evolutionary tree minimizing number
of steps required to generate observed variation Multiple sequence alignment must first be
obtained
Maximum Parsimony For each position, phylogenetic trees
requiring smallest number of evolutionary changes to produce observed sequence changes are identified Trees producing smallest number of
changes for all sequence positions are identified
Maximum Parsimony Time consuming algorithm Only works well if the sequences have a
strong sequence similarity
Maximum Parsimony Example 1 2 3 4
A A A A
A G G G
G C A A
A C T G
G G A A
T T T T
G G C C
C C C C
A G A G
four sequences, three possible unrooted trees
Maximum Parsimony Example Possible Trees:
1
2
3
4
1
3
2
4
1
4
3
2
Maximum Parsimony Example Some sites are informative, others are not Informative site has same sequence character
in at least two different sequences
Only informative sites are considered
Maximum Parsimony Example 1 2 3 4
A A A A
A G G G
G C A A
A C T G
G G A A
T T T T
G G C C
C C C C
A G A G
Three informative columns
Maximum Parsimony Example 1 2 3 4
G G A A
G G C C
A G A G
1
3
1
2
1
3
2
4
3
4
4
2
Column 1 1
3
1
2
1
3
2
4
3
4
4
2
Column 2
Tree 1: 4 Tree 2: 5 Tree 3: 6
1
3
1
2
1
3
2
4
3
4
4
2
Column 3 Is a substitution
Maximum Parsimony Columns representing greater variation
dominate the analysis
To overcome this look only at transversion
events
Most significant base changes (i.e. changes a
purine to a pyrimidine or vice versa)
Lake’s method of invariants
Distance Method Looks at number of changes between each
pair in a group of sequences Goal: identify tree positioning neighbors
correctly that has branch lengths reproducing original data as closely as possible
Distance Method CLUSTALW: neighbor-joining method as a
guide to multiple sequence alignments PHYLIP suite of programs employ
neighbor-joining methods http://evolution.genetics.washington.edu/phylip.html
Distance Programs in Phylip FITCH: estimates phylogenetic tree assuming
additivity of branch lengths using the FitchMargoliash method KITSH: same as FITCH, but under the
assumption of a molecular clock
Distance Programs in Phylip NEIGHBOR: estimates phylogenies using
either: neighbor-joining (no molecular clock assumed) unweighted pair group method with arithmetic
mean (UPGMA) (molecular clock assumed)
Distance Analysis distance score counted as: number of mismatched positions in alignment number of sequence positions changed to
generate the second sequence
Success depends on degree the distances
are additive on a predicted evolutionary tree
Example of Distance Analysis Consider the alignment:
A B C D
ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT
Example of Distance Analysis Distances can be shown as a table A B C D
ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT
Example of Distance Analysis Using this information, a tree can be drawn: A B C D
ACGCGTTGGGCGATGGCAAC ACGCGTTGGGCGACGGTAAT ACGCATTGAATGATGATAAT ACACATTGAGTGATAATAAT
A
B
C 2
1
4
1
2
D
Fitch and Margoliash Algorithm (3 sequences) Distance table used Sequences combined in threes define branches of predicted tree calculate branch lengths of the tree
Fitch and Margoliash Algorithm (3 sequences) 1) Draw unrooted tree with three branches
originating from common node:
A
B
a
b
c
C
Fitch and Margoliash Algorithm (3 sequences) 1) Calculate lengths of tree branches algebraically:
distance from A to B = a + b = 22 (1) distance from A to C = a + c = 39 (2) distance from B to C = b + c = 41 (3) subtracting (3) from (2) yields: b + c = 41 -a – c = -39 __________ b – a = 2 (4) adding (1) and (4) yields 2b = 24; b = 12 so a + 12 = 22; a = 10 10 + c = 39; c = 29
Fitch and Margoliash Algorithm (3 sequences) 3) Resulting tree:
A
10 C 29 12
B
Fitch and Margoliash Algorithm (5 sequences) Algorithm can be extended to more
sequences. Consider the distances:
A
a b
B
C
c
f g
d
e
D
E
Fitch and Margoliash Algorithm (5 sequences) Locate most closely related sequences
Fitch and Margoliash Algorithm (5 sequences) create a new table by combining
remaining sequences Distance from D to A,B,C: average distance of
each of these to D ( (39 + 41 + 18) / 3 = 32.7) Distance from E to A,B,C: average distance of
each of these to E ((41+43+20)/3 = 34.7).
Fitch and Margoliash Algorithm (5 sequences) average distance from D to ABC and E to ABC can be found by averaging the sum of the appropriate branch lengths: D to E: d + e = 10 (1) D to ABC: d + m = 32.7 where m = g + (c + 2f + a + b) / 3 (2) E to ABC: e + m = 34.7 (3) By subtracting the third equation from the second equation we get: A d – e = -2 a Adding this result to (1) we get: 2d = 8; d = 4 Substitute back in to get e = 6 b B
C
c
f g
d
e
D
E
Fitch and Margoliash Algorithm (5 sequences) Treat DE as a single sequence Create new distance table Distance from A to DE is average of A to D and A to E (similar for other distances as well)
Fitch and Margoliash Algorithm (5 sequences) A
a b
B
By algebra, c = 9; g = 5
c
f g
5
C
9 d
e
D
4 6
E
Fitch and Margoliash Algorithm (5 sequences) Continue process until all lengths are found:
A
a 10
f
c
20 b B
12
g5
C
9 d
6
e
D
4 E
Summary of FitchMargoliash 1) Find the mostly closely related pairs of sequences (A, B). 2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others; and from B to all others. 3) Use these values to calculate the length of the edges a and b.
Summary of FitchMargoliash 4)
Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.
5)
Identify next pair of related sequences and begin as with step 1.
6)
Subtract extended branch lengths to calculate lengths of intermediate branches.
Summary of FitchMargoliash 7) Repeat entire process with all possible pairs of sequences. 8) Calculate predicted distances between each pair of sequences for each tree to find the best tree.
Neighbor Joining Similar to Fitch-Margoliash Sequences chosen to give best least-squares
estimate of branch length
Neighbor Joining Begin with star topology – no neighbors have
been joined B
C
D
A E
Neighbor Joining Tree modified by joining pairs of
sequences Pair is chosen by calculating sum of
branch lengths for the corresponding tree:
S mn
d mn ∑ d ij + + 2( N − 2) 2 N −2
d ∑ =
im
+ d in
Neighbor Joining If A and B are joined: B
C D
A
S mn
E
d mn ∑ d ij + + 2( N − 2) 2 N −2
d ∑ =
im
+ d in
Neighbor Joining Pair with smallest branch length chosen to
be joined Fitch-Margoliash algorithm used to
compute actual branch lengths a new distance table is created with joined
sequences entered as a composite
Neighbor Joining Repeat process to select next pair to join Process continues until correctly branched
tree and distances identified
UPGMA “Unweighted Pair Group Method Using
Arithmetic Averages”
Works by clustering sequences Produces Rooted Trees by assembling the
tree upwards beginning with most similar sequences
UPGMA distance dij between clusters Ci and Cj is
average distance between pairs of sequences from each cluster:
1 d ij = d pq ∑ | Ci || C j | pinCi , qinC j
|Ci| and |Cj| are the number of sequences
in clusters i and j
UPGMA INITIALIZATION STEPS
Assign each sequence i to its own cluster Ci Define one leaf of the tree T for each sequence, and place it at height 0.
UPGMA Iteration steps 3. Determine the two clusters, i and j for which dij is minimal 4. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l
UPGMA Iteration steps 5. Define a node k with daughter nodes i and j, and place it at height dij/2. 6. Add k to the current clusters and remove i and j. 7. Continue steps 3-6 until only two clusters i and j remain, and place the root of the tree at height dij/2
UPGMA Example using 5 Sequences:
9 1
6
2
8 7
3
4
1 5
2
4
5
3
UPGMA Rooted trees provide distance measures that
can be used with constant molecular clock See Durbin, et al. 166-167 See Mount, 262-268
Maximum Likelihood Calculates likelihood of a tree given an
alignment Trees with least number of changes will be most likely Incorporates Jukes-Cantor and Kimura models
of evolution
Maximum Likelihood Probability of each tree is product of mutation
rates in each branch Likelihoods given by each column multiplied
to give the likelihood of the tree
Maximum Likelihood PHYLIP Programs: DNAML: Maximum likelihood, allows for different mutation models, unequal rates of mutation DNAMLK: Same as DNAML, but assuming
molecular clock
Maximum Likelihood Disadvantages: Computationally intensive Can only be done for a handful of sequences
Assessing significance of Tree Bootstrapping methods are used to assess
significance Given an alignment, randomly pick columns
from alignment with replacement Apply tree building to new data set
Assessing significance of Tree Repeat selection a number of times Frequency at which phylogenetic feature
appears is measure of confidence for this feature
Which Method to Choose? depends upon the sequences that are being
compared strong sequence similarity:
maximum parsimony
clearly recognizable sequence similarity
distance methods
All others:
maximum likelihood
Which Method to Choose? Best to choose at least two approaches Compare the results – if they are similar, you
can have more confidence
Neighbor-joining
Maximum parsimony
Maximum likelihood
Uses only pairwise distances
Uses only shared derived characters
Uses all data
Minimizes distance between nearest neighbors
Minimizes total distance
Maximizes tree likelihood given specific parameter values
Very fast
Slow
Very slow
Easily trapped in local optima
Assumptions fail when evolution is rapid
Highly dependent on assumed evolution model
Good for generating tentative tree, or choosing among multiple trees
Best option when tractable (<30 taxa, homoplasy rare)
Good for very small data sets and for testing trees built using other methods
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
Building Alignments and Trees Simultaneously Sankoff & Cedergren gap substitution (using
dynamic programming) Hein’s affine cost algorithm Covered in detail in Durbin, et al. (not in this
class)
Difficulties With Phylogenetic Analysis Horizontal or lateral transfer of genetic
material (for instance through viruses) makes it difficult to determine phylogenetic origin of some evolutionary events Genes selective pressure can be rapidly
evolving, masking earlier changes that had occurred phylogenetically
Difficulties With Phylogenetic Analysis two sites within comparative sequences
may be evolving at different rates Rearrangements of genetic material can
lead to false conclusions duplicated genes can evolve along separate pathways, leading to different functions
HOMEWORK Read Mount, Chapter 7 (Phylogenetic
Analysis) Read Durbin, et al., chapters 7