Development and Optimization of Phootprinter: A Web-based Tool for Phylogenetic Footprinting Across Hundreds of Eukaryotic Genomes
By: Lucas Lochovsky
Course: BOT 460
Professor: Nick Provart
Date: Friday, August 18, 2006
TABLE OF CONTENTS
1.) Abstract
2.) Introduction
3.) Phootprinter Overview
4.) User Interface
1.) ABSTRACT A phylogenetic profile of a gene consists of a binary string (i.e. consisting of 0s and 1s), where each bit corresponds to a particular organism, and a 1 indicates that the gene is present in that organism, and a 0 indicates that the gene is absent from that organism (Wu 1-2). The comparison of phylogenetic profiles can reveal clade-specific or species-specific gene families or pathways that can be further investigated to discover the unique biological characteristics of those organisms (Vandepoele 31). This ability is particularly useful in discovering genes unique to pathogen lineages, which can provide us with drug targets so that the pathogens are affected while the host remains unharmed (Searls 613). Phootprinter (http://theileria.ccb.sickkids.ca/phylo/cgi-bin/phootprinter.php) is a tool that allows users to construct phylogenetic profiles for genes in an organism and cluster them in a manner similar to the clustering of gene expression data in microarray analysis. It is intended to be used with PartiGeneDB, a database of partial genome sequences composed of expressed sequence tags (ESTs) based at the University of Toronto. This document describes the work I did to prepare Phootprinter for deployment on the PartiGeneDB server. Firstly, the interface was redesigned to make it easier for the user to select species and construct queries for Phootprinter. Phootprinter was test-driven by members of the Parkinson lab (who are responsible for running PartiGeneDB), and the interface was found to be simple and intuitive. Secondly, the running times of the clustering software used by Phootprinter and the scripts involved in generating Phootprinter’s output were benchmarked, and a major inefficiency in Phootprinter’s original design was discovered in the final step, in which the HTML of the results page is printed. This bottleneck was eliminated, drastically improving Phootprinter’s running time and making it suitable for public use.
2.) INTRODUCTION
Although a number of significant genomes, including those of Homo sapiens (humans), Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit fly), and Arabidopsis thaliana (thale cress), have now been fully sequenced, the task of assigning functions, or lack thereof as the case may be, to each part of those genomes is ongoing. Functions have traditionally been elucidated through “wet lab” procedures, including biochemical assays and genetic knockout experiments. However, because DNA’s primary structure directly influences proteins’ primary structures, and therefore their higher-order structures as well, genomic sequences that are very similar to each other are likely to share the same function, as it is expected that important functions will be conserved across many species. As a result, searching for such “homologous sequences”, and annotating sequences of unknown function with the known function of a similar sequence, has proven to be very valuable in high-throughput functional annotation of genomic sequences. Of greatest interest to comparative geneticists are orthologous sequences, which are derived from a single common ancestor and therefore are most likely to have had their functions and special roles conserved over the course of evolution. Other homologous sequences include paralogous sequences, which are related by a gene duplication event, and xenologous sequences, which are related by a horizontal gene transfer event. In both these cases, the likelihood of functional similarity is much weaker, as at least one copy is under much less negative selection and is freer to evolve a different function (Frazer 1-2). Researchers have also found that similarity in the set of organisms in which two or more genes are present is a strong indicator of functional similarity. In order to compare two genes on this basis, a “phylogenetic profile” must be constructed for each gene to be compared.
A phylogenetic profile for gene x is a string of m binary bits (i.e. 0s and 1s), where m is the number of organisms in which one wishes to search for gene x, and the n-th bit corresponds to the n-th organism in this group. If gene x is found in the n-th organism, the n-th bit is set to 1; otherwise it is set to 0. The determination of whether to put a 0 or a 1 in the n-th bit typically depends on whether or not gene x produced a BLAST score (or other score of similarity) higher than a particular threshold when
queried against all the genomic data of the n-th organism. These profiles can then be compared and clustered in a manner similar to the clustering of orthologous gene sequences and microarray expression data (Wu 1-2). Phylogenetic profiling has proven useful not only in functional determination of unannotated genes, but also in determining which genes and regulatory elements are species-specific or clade-specific. For example, comparison of phylogenetic profiles was used by Gutierrez et al. to determine which gene families are specific to Arabidopsis thaliana, as well as plants in general. Such comparisons have been most helpful in identifying the unique biological characteristics of various organisms. Of particular interest to many researchers is the identification of genes and transcription factors that are specific to certain pathogen lineages, as these may yield insights into new therapies and drug targets that may effectively neutralize these pathogens while minimizing the damage to the host (Searls 613). Phylogenetic profiling is most effective when there is sequence data available for a large number of organisms. Most efforts to date, however, have focused on sequencing entire genomes for a small number of organisms (Portnoy 4). Given the effort and time required to produce an entire genome sequence, it is unreasonable to wait for the whole genomes of enough organisms before making use of phylogenetic profiling. Instead, researchers have refocused their efforts on sequencing small parts of the genomes of a large number of organisms (Portnoy 6). This process is facilitated by the sequencing of expressed sequence tags, or ESTs. ESTs are sequence fragments of protein-coding DNA that are obtained by reverse-transcribing mRNA, thus forming a cDNA, which is then sequenced. They have proven useful for identifying transcribed regions of the genome that they represent, and determining the expression pattern of those genes. 
In many microarray experiments, EST hybridization to cellular mRNA is used to obtain the fluorescent patterns that are then analyzed to determine the expression patterns of numerous genes across several cell types, development stages, cell cycle stages, or other conditions (“Expressed Sequence Tag”). ESTs are ideal for phylogenetic profiling as they can quickly provide information on protein-coding regions, which we are most interested in comparing, across a wide range of species (Portnoy 2).
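The binary profile defined in this Introduction can be illustrated with a short sketch. This is not Phootprinter's code: the function names, the organism labels, and the example scores are hypothetical, and the only rule taken from the text is that a bit is set to 1 when the similarity score exceeds a chosen threshold. The Hamming distance is included only as one simple way to compare two profiles.

```python
from typing import Dict, List


def build_profile(scores: Dict[str, float], organisms: List[str], threshold: float) -> str:
    """Return a binary phylogenetic profile for one gene.

    scores maps organism name -> best similarity score for the gene.
    Organisms absent from the mapping are treated as having no hit.
    """
    return "".join(
        "1" if scores.get(org, 0.0) > threshold else "0" for org in organisms
    )


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical organisms and scores, for illustration only.
organisms = ["org_A", "org_B", "org_C", "org_D"]
gene_x = build_profile({"org_A": 0.9, "org_C": 0.5}, organisms, threshold=0.3)
gene_y = build_profile({"org_A": 0.8, "org_C": 0.2, "org_D": 0.7}, organisms, threshold=0.3)

print(gene_x, gene_y, hamming(gene_x, gene_y))  # prints: 1010 1001 2
```

Genes whose profiles differ in few positions are candidates for clustering together, which is the intuition behind comparing profiles in the first place.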
PartiGeneDB is a database of partial genome sequences for over 450 eukaryotic organisms collated from ESTs drawn from the NCBI’s dbEST database, a public repository of EST information. Every six months it is updated with new EST sequence information. These sequences are clustered with all existing ESTs in the database on the basis of sequence similarity, and consensus sequences are created from these clusters. These consensus sequences represent putative genes and form the partial genomes contained within PartiGeneDB. Currently, PartiGeneDB’s website offers users access to the database’s information via four portals. They may: 1.) Search for genes with a common annotation term 2.) Search for genes sharing similar sequences using a range of BLAST algorithms 3.) View all the gene sequences available for one particular organism, or 4.) Compare the genes of one organism with the genes of three other organisms using a visualization tool called SimiTri Given the phylogenetic coverage of PartiGeneDB's EST clusters, a phylogenetic profile analysis tool would make a natural addition to the database's repertoire. Phootprinter is a phylogenetic profile analysis tool developed by Professor Nick Provart's group at the University of Toronto. It consists of a web form that allows the user to select a reference organism and any number of "present in" organisms. Upon submission, a series of scripts are run to query the database for all the genes available in the reference organism's dataset, retrieve the highest precomputed BLAST score of that gene against each of the "present in" organisms' datasets, cluster genes with similar profiles, and return the output to the user in the form of a colour-coded heatmap. The matrix from which the heatmap was derived is also provided as a downloadable text file. The user is also given the option of selecting any number of "absent from" organisms. 
Conceptually, these are supposed to represent organisms in which the user does not expect to find the reference organism's genes, but as of this writing Phootprinter does not actually do anything with the "absent from" organisms. Over the summer of 2006, I overhauled Phootprinter in preparation for its
deployment on PartiGeneDB's server. A new frontend interface for organism selections has been written, which makes it easier for the user to locate and select particular organisms or clades of organisms. I also improved the efficiency of Phootprinter’s backend processing by benchmarking the running time of each step in Phootprinter’s results generation to discover bottlenecks in the code and fix them. Initially, it was thought that the biggest bottleneck in this process was in the clustering step, but my analysis revealed that the slowest step was, in fact, the step in which the HTML of the results page is printed. This bottleneck was removed by using fewer HTML elements to achieve the same functionality as the original Phootprinter. The following sections describe in detail the design and implementation of Phootprinter's new features and provide an evaluation of the program's efficiency. Section 3 provides a high-level description of each of Phootprinter’s main files. Section 4 describes the features of Phootprinter’s new user interface. Section 5 describes Phootprinter’s output-generating scripts, as well as the results page. Section 6 describes the setup of the benchmark trials conducted and the results of these trials. Section 7 discusses all of Phootprinter’s improvements and their significance to the user and to comparative genomics research. Finally, Section 8 details issues to address in the future development of Phootprinter.
3.) PHOOTPRINTER OVERVIEW
Phootprinter consists of a series of scripts and files through which:
i.) The organisms available for use in queries are retrieved and presented to the user
ii.) The user constructs queries with these organisms
iii.) Precomputed BLAST scores are retrieved, clustered, and presented as a heatmap to the user
As the first step is performed only on each database update, and is not really part of the user’s interaction with Phootprinter, it is dealt with in Appendix 1. Each of the remaining steps will be described in the following sections. An overview of the entire process is shown in Fig. 1. Each of the files in the main flow chart will be discussed, with filenames appearing in italics.
Figure 1: Overview of the Phootprinter process. Each important file and its interactions with other files and entities (such as the database and clustering software) are indicated and explained.
4.) USER INTERFACE
4.1) Overview
Phootprinter’s original user interface presented all the organisms available for queries as a single phylogenetic tree that spanned several pages (Fig. 2), making it difficult to zero in on a particular organism that one wished to investigate. Additionally, user selections were reflected with a massive number of checkboxes and radio buttons (one for each organism) that, again, were very difficult to manage. Phootprinter’s new interface addresses both of these issues. Organisms can now be selected by a series of cascading pull-down menus that allow the user to start from the highest taxonomic rank and work their way down to a list of species limited to those in the groups they selected. Their current selections are also tracked with dynamic lists that display only the organisms the user has selected so far. These selections can be easily modified, and a new search field is provided to allow the user to immediately retrieve any species (or taxonomic groups) that match the search string.
Figure 2: Phootprinter’s original interface. All organisms are presented in a multi-page “tree of life” with a vast array of accompanying checkboxes and radio buttons for user selections.
4.2) Design & Implementation
Figure 3: Front-end interface as displayed in Mozilla Firefox 1.5.0.6 running in Mac OS X 10.4.7
When the user loads Phootprinter, s/he is presented with the form depicted in Fig. 3. The choices are presented as a series of pull-down menus, where the user selects a taxonomic group from the highest rank, which generates the next pull-down menu accordingly (Fig. 4).
Figure 4: The user has selected “Basidiomycetes”, so the “Species” menu is generated, allowing the user to select species belonging to that phylum.
Note that for each phylum (and subphylum, if applicable), the number of members is displayed in brackets next to the name. Additionally, when a species is selected, its taxonomic hierarchy string is displayed under the pull-down menus in the space labelled “Taxonomic hierarchy of selected species:”, as demonstrated in Fig. 5.
Figure 5: The taxonomic groups that Ustilago maydis is a member of are displayed below the pull-down menus, as Ustilago maydis is the currently selected species.
Between the pull-down menus and the display space for the taxonomic hierarchy string are a series of checkboxes and radio buttons through which the user will primarily make his/her choices. Clicking the “P” checkbox will add that organism, or, in the case of phyla and subphyla, their member organisms, to the “present in” organisms list under “Current selections”. The “A” button behaves similarly, but adds the organism(s) to the “absent from” organisms list. The “R” radio button, available only for the “Species” menu, will set the reference organism to be the currently selected species in the “Species” menu. In this way, users can add species to the “Current Selections” lists as they see fit and submit a query to the Phootprinter tool with these selections. Biologically speaking, it does not make sense to expect a “present in” organism to
be an “absent from” organism at the same time. Due to restrictions such as these, an organism can appear in only one of the “Current selections” lists. The interface automatically ensures that any organism about to be added that already exists in any list is excluded from being added. For example, suppose Ustilago maydis was already used in the “present in” organism list. If the user then opted to add organisms under “Basidiomycetes” to the “absent from” list, all organisms under “Basidiomycetes” would be added except for Ustilago maydis (Fig. 6). This setup also ensures that duplicate entries will not appear in any of the lists.
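The exclusion rule described above can be sketched as follows. This is an illustrative model only: the actual interface runs in the browser, and the function name, the dictionary layout, and the short species list used here are assumptions made for the example.

```python
from typing import Dict, Iterable, List


def add_to_list(
    selections: Dict[str, List[str]], target: str, organisms: Iterable[str]
) -> List[str]:
    """Add organisms to selections[target], skipping any organism that is
    already present in any of the lists. Returns the organisms actually added."""
    already_used = {org for names in selections.values() for org in names}
    added = []
    for org in organisms:
        if org not in already_used:
            selections[target].append(org)
            already_used.add(org)  # also prevents duplicates within one call
            added.append(org)
    return added


# Hypothetical state: Ustilago maydis is already a "present in" organism.
selections = {"present": ["Ustilago maydis"], "absent": []}
# Illustrative subset of a group's members, not the database's actual list.
group_members = ["Ustilago maydis", "Pleurotus ostreatus", "Heterobasidion annosum"]

added = add_to_list(selections, "absent", group_members)
print(added)  # prints: ['Pleurotus ostreatus', 'Heterobasidion annosum']
```

Checking against the union of all lists, rather than only the target list, is what yields both behaviours mentioned in the text: no organism appears in two lists, and no list contains duplicates.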
Figure 6: As Ustilago maydis is a Basidiomycete, when all the Basidiomycetes are added to the “absent from” list, Ustilago maydis is excluded, as it is already used in the “present in” list.
Users are given finer control over their selections with the four buttons immediately below the “Current Selections” lists. Each selected organism has a checkbox
next to it (Fig. 7), and clicking one of the four buttons will perform the action specified on the organisms with checked boxes, moving or removing organisms as directed by the user.
Figure 7: (top) Organisms can be selected with the checkboxes next to them. Clicking one of the four buttons below the “Current Selections” lists (in this case, the “Set selected as ‘present in’ organisms” button) will move or delete the selected organisms accordingly. (bottom) Pleurotus ostreatus and Heterobasidion annosum are now “present in” organisms.
Another useful new feature is the search field between the pull-down menus and the “Current Selections” lists. The user may enter any text string and hit the “Search” button, and all the taxonomic names available will be searched for any that match the search string (case-insensitively), in whole or in part. These names are returned to the user below the search field (Fig. 8).
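A case-insensitive partial match of this kind can be sketched in a few lines. The function name and the sample list of names are hypothetical, and returning no results for an empty query is a design choice made for this example rather than documented behaviour of the interface.

```python
from typing import Iterable, List


def search_names(names: Iterable[str], query: str) -> List[str]:
    """Return every name containing the query as a substring, ignoring case."""
    needle = query.strip().casefold()
    if not needle:
        return []  # assumption: an empty search returns nothing
    return [name for name in names if needle in name.casefold()]


# Illustrative list of taxonomic names.
names = ["Plasmodium berghei", "Plasmodium yoelii", "Ustilago maydis", "Basidiomycetes"]

matches = search_names(names, "plasmo")
print(matches)  # prints: ['Plasmodium berghei', 'Plasmodium yoelii']
```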
Figure 8: A search for “plasmo” has turned up two matches: Plasmodium berghei and Plasmodium yoelii.
This allows the user to quickly produce a particular organism or group of organisms that s/he had in mind without having to navigate through the pull-down menus to reach them. Note that each search result also has its own checkbox. The user can check these and use the four buttons below the “Current Selections” to manipulate these entries as well. If a phylum or subphylum appears in the search results, then moving it into a “Current
Selections” list will add all its member species to that list. The user will be informed that it is impossible to use more than one organism as the reference organism if a phylum, subphylum, or multiple species from the other lists are selected when the user hits the “Set selected as the reference organism” button.
The output given to the user is in the form of a heatmap (see Section 5 for details). Each gene-organism pair is coloured according to the highest score between the two. However, it is expected that there will be numerous weak hits (i.e. low scores) relative to the number of strong hits (i.e. high scores). Phootprinter is designed to filter out low-scoring pairs by assigning them all a distinct “below cutoff” colour instead. The user can set this cutoff threshold with a small text box at the bottom of the form (default is 0.3). This is perhaps the only part of the original interface to be carried over to the new interface.
Finally, the standard Submit and Reset buttons are provided at the very bottom of the form. “Submit selections” will pass the user’s selections to a CGI script (genecmpGetDBData.cgi) which will produce the output on the next page (for details see the next section). “Reset form” will clear all the user’s selections and set all of the form’s other elements to their original state when the page was first loaded.
5.) OUTPUT-GENERATING SCRIPTS 5.1) Overview The output-generating scripts consist of all the backend processing between form submission and the displaying of the results page. The files involved, and the functions they perform, are outlined in Fig. 9. Although much focus was placed on the clustering step performed by clusterDBData.pm, it was discovered that the worst running time was in genecmpGetDBData.cgi’s third step: printing the HTML of the results page. This step has been improved, which has resulted in drastically reduced running times for the entire process. Additionally, the heatmap was made wider and clearer to read by changing the Missing Score colour to white (the same as the background colour) at the request of the Parkinson lab members.
Figure 9: Overview of files and functions involved in output generation.
5.2) Design & Implementation User inputs are passed to a CGI script (genecmpGetDBData.cgi) that takes the form parameter strings and parses them into arrays and strings so that they are easier to work with. genecmpGetDBData.cgi then passes this information to two Perl modules in sequence that together handle the rest of the steps. clusterDBData.pm opens a connection to the “estdb” database on the PartiGeneDB server and consults the “blast_profile” table for the precomputed BLAST scores between each of the genes in the reference organism and each of the organisms listed by the user in the “present in” organism list. These scores are assembled and written to a plain text file in a format that Cluster 3.0, the clustering software available on the PartiGeneDB server, can understand (hereafter referred to as the “input cluster file”). Some gene-organism scores are missing from the database, and these are treated as zeroes. If all the scores for a gene are missing or equal to zero, that gene is not included in the input cluster file. Additionally, for queries in which there are four or more “present in” organisms, genes that have a nonzero hit to only one organism (a.k.a. “singletons”) are also not included in the input cluster file. They are, however, saved to an array to be used later when the output page is generated for the user. The clustering step is carried out by Cluster 3.0, an open source software package originally designed for gene expression data analysis. It was originally developed by Michael Eisen at Stanford University, and is available for Windows, Mac OS X, Linux and Unix. (More information on Cluster 3.0, including its source code, can be found at ) When the input cluster file is ready, Cluster 3.0 is invoked with a system call. It will cluster the genes and organisms in the file using complete linkage simple hierarchical clustering and using Pearson Correlation Coefficient as the distance metric when comparing both genes and organisms. 
For an analysis of the performance of Cluster 3.0 using various clustering methods and distance metrics, see Sections 6 and 7.3. Cluster 3.0 will produce a file with the same rows and columns as the original file, but reordered according to its cluster analysis (hereafter referred to as the “output cluster file”). This file is then utilized by the second Perl module called upon by
genecmpGetDBData.cgi, called outputClusterData.pm. It parses the output cluster file and assigns each score a colour that corresponds to its magnitude. It then produces a heatmap image similar to those used in microarray data analysis in the form of an HTML image map. This image map is used as part of the final HTML output that the user sees after s/he submits the web form. Fig. 10 is representative of what the output page displayed to the user will look like.
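The colour assignment step can be illustrated with a small sketch. The white colour for missing scores and the default cutoff of 0.3 come from this report; everything else is an assumption made for the example, including the function name, the grey “below cutoff” colour, the yellow-to-red ramp, and the premise that scores lie between 0 and 1. The actual module is written in Perl and its palette is not described here.

```python
from typing import Optional

MISSING_COLOUR = "#ffffff"       # missing scores are shown in white (per the report)
BELOW_CUTOFF_COLOUR = "#cccccc"  # hypothetical "below cutoff" colour


def score_to_colour(score: Optional[float], cutoff: float = 0.3) -> str:
    """Map one gene-organism score to an HTML hex colour.

    Assumes scores fall in [0, 1]; values above 1 are clamped.
    """
    if score is None:
        return MISSING_COLOUR
    if score < cutoff:
        return BELOW_CUTOFF_COLOUR
    # Illustrative linear ramp: yellow at the cutoff, red at a score of 1.0.
    clamped = min(score, 1.0)
    fraction = (clamped - cutoff) / (1.0 - cutoff) if cutoff < 1.0 else 1.0
    green = round(255 * (1.0 - fraction))
    return f"#ff{green:02x}00"


print(score_to_colour(None))  # prints: #ffffff
print(score_to_colour(0.1))   # prints: #cccccc
print(score_to_colour(1.0))   # prints: #ff0000
```

Collapsing every low score into a single colour keeps the many weak hits from visually competing with the strong hits.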
Figure 10: The output that the user will see if the form is submitted with the reference organism set to Ustilago maydis, and the remaining Basidiomycetes and Ascomycetes are used as “present in” organisms.
The page begins with a number of statistics for the user’s reference, which include the reference organism, the number of genes compared (i.e. number of reference organism genes that were found in the database), the number of genes displayed (i.e. genes used in
the output after zero-hit genes and singletons were filtered out), and the cutoff score used. Below that is a guide to the colours used in the heatmap, and below that, the heatmap itself. The “present in” organisms are represented by the rows, and the genes (clusters) by the columns. Two radio buttons are provided above the heatmap that allow the user to determine what information will be displayed when the heatmap is moused over. By default, the gene (column) the mouse is over will be displayed, and clicking on a specific column in the heatmap will open a separate window with more detailed information on that particular gene. Alternatively, the user may have the organism (row) the mouse is currently over be displayed instead (Fig. 11). These two states can be toggled by pressing the “e” key. Below the heatmap is a series of numbers that tell the user how many genes are represented in the heatmap, how many were filtered out as singletons, and how many did not register any hits (Fig. 12).
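The three counts reported below the heatmap follow from the filtering rules given in Section 5.2: missing scores are treated as zeroes, genes with no nonzero score are dropped, and, when there are four or more “present in” organisms, genes with exactly one nonzero score are set aside as singletons. A minimal sketch of those rules, with a hypothetical function name and made-up scores, might look like this:

```python
from typing import Dict, List, Optional, Tuple

Row = List[Optional[float]]  # one score per "present in" organism; None = missing


def filter_genes(
    scores: Dict[str, Row]
) -> Tuple[Dict[str, List[float]], List[str], List[str]]:
    """Split genes into (kept for clustering, singletons, genes with no hits)."""
    kept: Dict[str, List[float]] = {}
    singletons: List[str] = []
    no_hits: List[str] = []
    for gene, row in scores.items():
        filled = [0.0 if s is None else s for s in row]  # missing -> zero
        nonzero = sum(1 for s in filled if s != 0)
        if nonzero == 0:
            no_hits.append(gene)
        elif nonzero == 1 and len(filled) >= 4:
            singletons.append(gene)  # saved for the output page, not clustered
        else:
            kept[gene] = filled
    return kept, singletons, no_hits


# Made-up scores for four genes against four organisms.
example = {
    "g1": [0.9, None, 0.4, 0.0],
    "g2": [None, None, None, None],
    "g3": [0.0, 0.7, None, 0.0],
    "g4": [0.5, 0.5, 0.5, 0.5],
}
kept, singletons, no_hits = filter_genes(example)
print(len(kept), len(singletons), len(no_hits))  # prints: 2 1 1
```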
Figure 11: Mousing over the heatmap can produce gene information (top) or organism information (bottom). These two states can be toggled with the “e” key.
Figure 12: The number of sequences represented in the heatmap, as well as the number of singleton sequences and sequences with no hits at all are displayed just below the heatmap.
6.) BENCHMARK RESULTS
6.1) Clustering step
The running time of the clustering step involving Cluster 3.0 was benchmarked with gprof. Cluster 3.0 was evaluated using two types of simple hierarchical clustering (complete linkage and average linkage) and self-organizing maps (SOMs). Cluster 3.0 was also evaluated on its efficiency using several different distance measures, including:
1.) Pearson correlation
2.) Absolute value Pearson correlation (the absolute value of the final result is used to compare two genes, so genes with opposite, as well as similar, profiles will cluster with each other)
3.) Spearman rank correlation
4.) Euclidean distance, and
5.) City Block distance
Information on each of these can be found in the Cluster 3.0 documentation. All these trials were run with two configurations of Cluster 3.0:
i.) Gene clustering only
ii.) Gene and organism clustering
The benchmarks were run on two datasets: one with Ustilago maydis as the reference organism and all the remaining Basidiomycetes and Ascomycetes as “present in” organisms (i.e. 23 organisms, designated the “Average dataset”), and one with Ustilago maydis as the reference organism and all the other available organisms as “present in” organisms (i.e. 192 organisms, designated the “Everything dataset”). The results of these benchmarks are summarized in Fig. 13, with detailed results in Appendix 2, Table 1. From these results it is clear that SOMs perform with the best running time. Note that simple hierarchical clustering with complete linkage and Pearson correlation was the method used by Phootprinter before this evaluation.
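To make the first two distance measures concrete, the sketch below computes a Pearson correlation and converts it to a distance. The conversion used here, one minus the correlation (or one minus its absolute value), is a common convention and is an assumption of this example; the Cluster 3.0 documentation remains the authority on how that program defines each measure. Function names and sample profiles are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed convention: distance = 1 - r, so r = 1 gives distance 0."""
    return 1.0 - pearson(x, y)


def abs_pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed convention: distance = 1 - |r|, so opposite profiles are also close."""
    return 1.0 - abs(pearson(x, y))


similar = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
opposite = ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
print(round(pearson_distance(*similar), 6), round(pearson_distance(*opposite), 6))
print(round(abs_pearson_distance(*opposite), 6))
```

Under the absolute-value variant the two opposite profiles end up at distance zero, which matches the description above of opposite profiles clustering together.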
[Figure 13 plots: bar charts of Cluster 3.0 running time (s) for the Average dataset and the Everything dataset. Panels: Simple hierarchical, complete linkage; Simple hierarchical, average linkage; Self-Organizing Maps (SOMs). Bars within each panel: Pearson correlation; Pearson correlation, abs. value; Spearman's rank correlation; Euclidean distance; City Block distance.]
Figure 13: Running time of Cluster 3.0 on Different Clustering Algorithms and Dataset Sizes: (top) Gene Clustering Only (bottom) Gene and Organism Clustering
6.2) All output-generating scripts Benchmarks were also conducted on all output-generating scripts, a statistic that takes into account all activities associated with form submission. The time spent in each part of the output-generating process, as described in Fig. 9, was measured on both the Average dataset and the Everything dataset, the results of which are summarized in Fig. 14. Both tests were also run with two different configurations of Phootprinter. In its original configuration, Phootprinter creates an image map with a clickable area for each dot in the heatmap, which links to further information for the gene under that area. However, because each heatmap column represents a gene, it is possible to create a clickable area for the entire column and achieve the same result. Phootprinter was tested in both configurations, and the results are graphed in Fig. 14. Additionally, a comparison of the total running time of all output-generating scripts under each of the conditions described in Fig. 14 is provided in Fig. 15 (see Appendix 2 for raw data). From these results it is clear that links for each column dramatically improve the time taken to print the HTML output.
[Figure 14 plots: bar charts of running time (s) by script (genecmpGetDBData.cgi, clusterDBData.pm, outputClusterData.pm, Printing HTML), comparing “Links for each square” with “Links for each column”.]
Figure 14: Running time of Output-Generating Scripts on (top) Medium Dataset, and (bottom) Everything Dataset.
[Figure 15 plot: bar chart of total running time (s) for “Links for each square” and “Links for each column” on the Medium Dataset and the Everything Dataset.]
Figure 15: Total running time of all output-generating scripts on the Medium and Everything Datasets. Each trial was conducted once with Phootprinter configured to produce a link for each heatmap square, and once when configured to produce a link for each heatmap column.
7.) SUMMARY & DISCUSSION 7.1) User Interface The form in which the user may make his/her selections has been uploaded to the PartiGeneDB server, and its functions have been tested in a number of browsers, including Mozilla Firefox 1.5.0.4 and above, Microsoft Internet Explorer 6.0, Netscape 7.2, Safari 2.0.4, Camino 1.0.1, Seamonkey 1.0.1, and Epiphany 1.4.9. The form is fully functional in each of these browsers. Additionally, members of the Parkinson lab have attempted to use the form for their research purposes and find it very easy to handle. As there are now fewer obstacles to mastering the interface, it is now more likely that individuals that may not otherwise use Phootprinter will be drawn to it for their own research. The improved interface will also mean users spend less time getting things up and running and more time doing actual work with Phootprinter. 7.2) Output-Generating Scripts Based on the clustering benchmark results in Fig. 13, it is clear that SOMs are vastly superior to all simple hierarchical methods tested. There is also a clear trend in distance metrics regardless of the clustering method used. Spearman’s rank correlation produces the worst running time, followed by the Pearson correlation metrics, followed by Euclidean distance and city block distance. Although these last two perform the best, both require the dataset to be normalized, which changes the value of the scores in the input cluster file. While it is possible that city block or Euclidean could be used on a normalized dataset, with the output constructed using the scores of the original dataset and the ordering of the normalized dataset, it would require an extensive rewrite of the existing code, and would necessitate reading two files to obtain all the information to produce the output, rather than one. Additionally, in all trials, the Pearson trials are not significantly slower compared to the Euclidean and city block trials. 
Therefore, it was decided that, in the final version of the code, both genes and organisms would be clustered using Pearson’s correlation coefficient as the distance metric. Despite the promise shown by SOMs in the benchmarks, it was later determined by members of the Parkinson lab that significant features could not be discerned from the heatmap produced by SOM clustering. When simple hierarchical clustering was used, the output did produce significant features that could be identified and followed up on (Fig. 16). Therefore, in order to obtain meaningful results, simple hierarchical clustering–the original algorithm used by Phootprinter–was used in the final code.
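The difference between the distance metrics discussed above can be illustrated with a small sketch. It is written in Python purely for illustration and is not Cluster 3.0's implementation; the function names and the two score profiles are hypothetical, and the Pearson distance is assumed to be one minus the correlation coefficient. It may help show why a correlation-based metric can be applied to raw scores, whereas Euclidean and city block distances are sensitive to the scale of the scores.

```python
import math


def pearson_distance(x, y):
    """Assumed definition: 1 - Pearson correlation of the two profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_x * var_y)


def euclidean_distance(x, y):
    """Straight-line distance between the two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block_distance(x, y):
    """Sum of absolute score differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical score profiles: gene_b has the same shape as gene_a,
# but every score is twice as large.
gene_a = [50.0, 100.0, 150.0, 200.0]
gene_b = [100.0, 200.0, 300.0, 400.0]

print(pearson_distance(gene_a, gene_b))     # close to 0: same shape
print(euclidean_distance(gene_a, gene_b))   # large: scale matters
print(city_block_distance(gene_a, gene_b))  # large: scale matters
```

In this sketch the two profiles are treated as identical by the Pearson-based distance but as far apart by the other two metrics, which is consistent with the normalization requirement noted above.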
Figure 16: SOM clustering (top) produces a seemingly random clustering of profiles compared to complete linkage simple hierarchical clustering (bottom).
When the entire system of scripts, pages and programs was assembled and tested, it was discovered that the overall running time was not significantly improved when SOMs were used instead of simple hierarchical clustering. Benchmarks of the output-generating scripts revealed that an abnormally large amount of time was being taken just to print out the HTML of the results page (Fig. 14). Examination of the results page’s source code revealed that it was dominated by an incredibly large number of <area> tags that formed the HTML image map that specifies clickable areas on the heatmap, which link to pages with further information on the corresponding genes. Since each column corresponds to one gene, it is possible to use only one link area for each column to achieve the same effect. The benchmarks in Figs. 14 and 15 show that using a link only for each column virtually eliminates this bottleneck in output generation, as the number of link areas is now equal to the number of genes, not the number of genes multiplied by the number of organisms. Therefore, the production of link areas, which was previously an O(mn) operation, where m is the number of organisms and n is the number of genes, has now been reduced to an O(n) operation, which exhibits linear growth. It was also discovered that producing a separate set of link areas for organisms, and allowing the user to toggle which set to use (genes or organisms), has an insignificant effect on the overall running time. Therefore, the production of link areas in the final code actually runs in O(m+n), which is asymptotically better than the original O(mn) running time. The running time of Phootprinter is now feasible for queries of any size, as the worst case running time is now two-and-a-half minutes, as opposed to 30 minutes in the original configuration.
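The reduction from O(mn) to O(m+n) link areas can be sketched as follows. This is a minimal Python illustration rather than Phootprinter's actual output code: the square size, the "gene.php" and "organism.php" targets, and all function names are assumptions made for the example.

```python
from html import escape
from urllib.parse import quote

CELL = 10  # hypothetical size of one heatmap square, in pixels


def area(x1, y1, x2, y2, href, title):
    """Format one rectangular <area> tag for an HTML image map."""
    return (f'<area shape="rect" coords="{x1},{y1},{x2},{y2}" '
            f'href="{escape(href)}" title="{escape(title)}">')


def areas_per_square(genes, organisms):
    """Original scheme: one link area per heatmap square, O(m*n) tags."""
    tags = []
    for row, organism in enumerate(organisms):
        for col, gene in enumerate(genes):
            tags.append(area(col * CELL, row * CELL,
                             (col + 1) * CELL, (row + 1) * CELL,
                             f"gene.php?id={quote(gene)}",
                             f"{gene} / {organism}"))
    return tags


def areas_per_column(genes, n_rows):
    """One link area per gene column, spanning every row: O(n) tags."""
    height = n_rows * CELL
    return [area(col * CELL, 0, (col + 1) * CELL, height,
                 f"gene.php?id={quote(gene)}", gene)
            for col, gene in enumerate(genes)]


def areas_per_row(organisms, n_cols):
    """One link area per organism row, spanning every column: O(m) tags."""
    width = n_cols * CELL
    return [area(0, row * CELL, width, (row + 1) * CELL,
                 f"organism.php?name={quote(organism)}", organism)
            for row, organism in enumerate(organisms)]


genes = [f"gene{i}" for i in range(200)]
organisms = [f"organism{i}" for i in range(30)]

print(len(areas_per_square(genes, organisms)))        # m * n = 6000 tags
print(len(areas_per_column(genes, len(organisms))))   # n = 200 tags
print(len(areas_per_row(organisms, len(genes))))      # m = 30 tags
```

With the two separate sets, the page carries 230 tags instead of 6000 for this hypothetical query, at the cost that a single area no longer identifies both the gene and the organism.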
Interestingly enough, the slowest step in the process now is the clustering step, and any further improvements to the running time would undoubtedly involve improving the running time of this step.
7.3) Results Page
Increasing the size of the heatmap has made it easier to pick out and identify specific dots in the image for further analysis, and it is clearer which dot the mouse is over when mouseover information is displayed. Analysis is also helped by the new Data Missing colour: by making it the same as the page background, it is easier for the human
eye to pick out the signals from the noise. Furthermore, both gene and organism information can now be recovered from the heatmap without incurring massive running time penalties.
7.4) Recapitulation of Austin et al.’s comparative analysis of Ustilago maydis ESTs
In their Functional & Integrative Genomics paper, Austin et al. conducted a comparative analysis of ESTs from Ustilago maydis, a basidiomycete plant pathogen. As a test of Phootprinter’s abilities, the comparisons they made were duplicated as closely as possible in a Phootprinter query so that the results could be compared with those of the Austin paper. Unfortunately, many of the species that were used in the Austin paper were not available for use in Phootprinter. This may be because the database table that Phootprinter relies on for organism selections does not yet reflect the entire organism database, as it is part of another, as-yet-unfinished project. A query was constructed with the following “present in” organisms:
- Cryptococcus neoformans (basidiomycete)
- Schizosaccharomyces pombe (ascomycete)
- Saccharomyces cerevisiae (ascomycete)
- Blumeria graminis (ascomycete)
- Gibberella zeae (ascomycete)
- Fusarium sporotrichioides (ascomycete)
- Mycosphaerella graminicola (ascomycete)
- Verticillium dahliae (ascomycete)
- Phytophthora infestans (oomycete)
This produced the heatmap in Fig. 17. Of the 4612 U. maydis sequences available, 1430 were represented in the heatmap, and 537 were filtered out as singletons. The number of genes with significant hits to Cryptococcus neoformans, U. maydis’s closest relative in this query, was determined to be 538. As 1430 genes total were studied, this means that about 37.6% of the studied genes had significant hits to C. neoformans. In the Austin
paper, Austin et al. studied 4,221 U. maydis ESTs and found that 1,975 of them had significant hits to C. neoformans, which is roughly 46.8% of all studied ESTs. A meaningful comparison of these two results is, unfortunately, not possible, as Phootprinter’s analysis involved genes while Austin et al.’s analysis involved ESTs. Additionally, the output was analyzed to see if genes of similar function did indeed cluster with each other in the heatmap. Fig. 18 illustrates that this was indeed a widespread phenomenon in the heatmap, indicating that Phootprinter can produce meaningful clusters. However, some of the genes in these clusters may in fact be isoforms of the same gene. This is plausible because isoform sequences are very similar, which makes them cluster with each other, yet still slightly different, which makes them appear to be different genes to Phootprinter. Also, unlike the heatmap featured in Austin et al. (Fig. 19), there is no clear, gradual progression of profiles over the entire set of genes. Improving Phootprinter’s clustering in this regard would undoubtedly be quite helpful. Nevertheless, Phootprinter’s results were generated far more rapidly (~10 seconds) than those in the Austin et al. paper (~4 months).
Figure 17: Heatmap results of query with Ustilago maydis and other organisms used in Austin et al.
Figure 18: The proximity of numerous proteins of the same function with each other proves that Phootprinter’s clustering does produce meaningful results.
Figure 19: Heatmap results from the Austin et al. paper
8.) FUTURE WORK
Although the interface and output-generating efficiency were both greatly improved, not much attention was devoted to the presentation of the results page. It was originally planned that a Java applet for displaying the heatmap to the user and allowing him/her to zoom in and out on any part of the heatmap was to be integrated into the system, but time constraints forced this feature to be dropped. Any future work on Phootprinter should undoubtedly make the integration and evaluation of this feature a priority. Furthermore, the use of separate sets of link areas for columns and rows means that mousing over heatmap squares no longer provides the organism and the gene at the same time. Providing an efficient and well thought out means for letting the user know both the row and the column under the cursor without significantly worsening the computational efficiency of the whole system might also be useful in future versions of Phootprinter. It would be helpful to obtain feedback from Phootprinter’s users in order to verify whether or not Phootprinter’s features are appreciated and used, and gain insight into improvements to existing features or new features users would like to see implemented. Additionally, a more thorough analysis of the clustering algorithms in Cluster 3.0, as well as analysis of other algorithms not available in Cluster 3.0, would be invaluable to both improving the clustering step’s computational efficiency (as it is now the slowest step), and improving the quality of the clustered output. Finally, in order to make Phootprinter much more useful, the table that it consults for the precomputed BLAST scores must be expanded to encompass all organisms and all genes in PartiGeneDB, as BLASTs between all genes and all organisms have not yet been completed.
In order to do this, there must be a clearly established, efficient, and regularly performed update process for the BLAST tables so that when PartiGeneDB’s coverage inevitably expands, the BLAST results available will maintain parity with the rest of the database.
9.) REFERENCES
Austin et al. “A comparative genomic analysis of ESTs from Ustilago maydis.” Funct Integr Genomics 4.4 (2004): 207-218.
“Expressed Sequence Tag.” Wikipedia: The Free Encyclopedia. 30 July 2006. 9 August 2006.
Frazer et al. “Cross-Species Sequence Comparisons: A Review of Methods and Available Resources.” Genome Research 13.1 (2003): 1-12.
Gutierrez et al. “Phylogenetic profiling of the Arabidopsis thaliana proteome: what proteins distinguish plants from other organisms?” Genome Biology 5.8 (2004): R53.
Portnoy, Matthew E., and Eric D. Green. “Comparative sequencing of vertebrate genomes.” John Wiley & Sons Canada Ltd. 9 August 2006.
Searls, David B. “Pharmacophylogenomics: Genes, Evolution and Drug Targets.” Nature Reviews Drug Discovery 2.8 (2003): 613-623.
Tong Shen et al. “Phootprinter: a web-based query tool for phylogenetic footprinting.” Eastern Great Lakes Molecular Evolution Meeting 2005 (Toronto, ON, Canada), April 2005, Poster 25.
Vandepoele, Klaas, and Yves Van de Peer. “Exploring the Plant Transcriptome through Phylogenetic Profiling.” Plant Physiology 137.1 (2005): 31-42.
Wu et al. “Gene annotation and network inference by phylogenetic profiling.” BMC Bioinformatics 7:80 (2006): 1-18.
APPENDIX 1: DATABASE UPDATE PROCESS
As there are about 300 organisms available in PartiGeneDB, connecting directly to the database and retrieving all available organisms every time the web form is loaded is not practical. Instead, the database is only consulted every six months, i.e. whenever it is updated. A Perl script (dbconn.pl) connects to the “estdb” database in PartiGeneDB’s PostgreSQL server and consults the “scheme_division” table, which maps phyla to the individual species that are part of those phyla. This information is written to a plain text file (species_tree.txt), which places each entry (i.e. phylum, subphylum, or species) on a single line, nested according to its place in the hierarchy. Phyla are always placed on a line with no preceding tabs, and their member species may or may not be divided into subphyla. If they are, each subphylum name is produced on a line preceded by one tab, followed by the list of species belonging to that subphylum, with each species name on its own line preceded by two tabs. This list begins with the entry "Species:" to indicate that this is a list of species (the rationale for this will be explained further on). If the phylum's species are not divided into subphyla, then the species list is written below the phylum name with each species name on a separate line preceded by one tab. Again, this list starts with the entry "Species:". Thus, the number of tabs and the ordering of the names recapitulates the taxonomic hierarchy and membership of each organism. Furthermore, the script that parses this file can tell if the list nested under any given phylum is a list of subphyla or species, due to the use of "Species:" as the first entry of every species list. This allows the list to be parsed correctly. Additionally, for each species retrieved, all the taxonomic groups that this species is contained in are written immediately following the species name on the same line, separated by an arrow (i.e. the characters "-->").
The groups themselves are semicolon-delimited, and are ordered from the highest rank (Eukaryota) to the lowest rank, which is the species name. This string of groups is hereafter referred to as a “taxonomic hierarchy string”. Following is an example of tdbtest5.pl’s output for the phylum “Apicomplexa” and its member species (note that in the actual output file the taxonomic hierarchy string is not word-wrapped; all information pertaining to a single organism will be contained on one line):

Apicomplexa
    Species:
    Plasmodium berghei-->Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium; Plasmodium berghei.
    Plasmodium yoelii-->Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
    Theileria parva-->Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae; Theileria.
    Sarcocystis neurona-->Eukaryota; Alveolata; Apicomplexa; Coccidia; Eimeriida; Sarcocystidae; Sarcocystis.
    Neospora hughesi-->Eukaryota; Alveolata; Apicomplexa; Coccidia; Eimeriida; Sarcocystidae; Neospora.
    Gregarina niphandrodes-->Eukaryota; Alveolata; Apicomplexa; Gregarinia; Eugregarinida; Gregarinidae; Gregarina.

In this case there are only species listed under “Apicomplexa”. As of this writing, the subphylum divisions have yet to be written to the “scheme_division” table, so to demonstrate the output for a phylum with subphyla, a sample from the test dataset will be used (note that the following are nonsense values and are in no way reflective of biological reality):

Alpha
    Subphy 1
        Species:
        Worm-->Eukaryota; Superworms; Worm.
        Slimeball-->Eukaryota; Weird Lifeforms; Slimy Lifeforms; Slimeball.
    Subphy 2
        Species:
        Cat-->Eukaryota; Household pet; Cat.
        Dog-->Eukaryota; Household pet; Dog.
        Chimpanzee-->Eukaryota; Found in the jungle; Hanging from trees; Chimpanzee.

In this case, the species are split into two subphyla (Subphy 1 and Subphy 2), so the species lists are placed under their respective subphyla and further nested so that an organism’s phylum membership and subphylum membership may be recovered from species_tree.txt. (Note that a similar scheme for organizing taxonomic structure is used in PartiGeneDB’s SimiTri tool.)
Another Perl script (update_form.pl) then reads species_tree.txt and generates arrays that map each phylum to its member subphyla (or species) and each subphylum to its member species, if applicable (e.g. array “Apicomplexa” contains the entries “Plasmodium berghei”, “Plasmodium yoelii”, “Theileria parva”, “Sarcocystis neurona”, “Neospora hughesi”, and “Gregarina niphandrodes”). Each array actually contains two versions of each entry: the first is a version that is more amenable to backend manipulation (basically the original name in lowercase with whitespace removed, e.g. “plasmodiumberghei”), and the second is simply the original name as provided, which is more human-readable and is displayed in the frontend interface. The taxonomic hierarchy string of each species encountered is stored in a hash, keyed by the species it belongs to, so that all the taxonomic hierarchy strings are available later. After this, update_form.pl writes the HTML form (phootprinter.php) that will be loaded by the user’s browser every time Phootprinter is accessed. More information on this form can be found in Section 4.
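The file format described in this appendix can be read back with a short parser. The following is a minimal sketch in Python rather than the Perl of update_form.pl, and it does not reproduce that script's logic; the function names and the dictionary-based return values are assumptions made for illustration. It relies only on the rules stated above: tab depth gives the level, "Species:" opens a species list, and "-->" separates a species name from its taxonomic hierarchy string.

```python
def machine_label(name):
    """Backend form of a name: lowercase with all whitespace removed."""
    return "".join(name.lower().split())


def parse_species_tree(text):
    """Parse species_tree.txt-style text.

    Returns (members, hierarchy): members maps each phylum or subphylum
    to its list of children; hierarchy maps each species to its
    taxonomic hierarchy string.
    """
    members = {}
    hierarchy = {}
    phylum = parent = None
    species_depth = None  # tab depth of the species list being read
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry:
            continue
        depth = len(raw) - len(raw.lstrip("\t"))
        if depth == 0:
            # A line with no leading tabs starts a new phylum.
            phylum = parent = entry
            members[phylum] = []
            species_depth = None
        elif entry == "Species:":
            # Marker: the following lines at this depth are species.
            species_depth = depth
        elif depth == species_depth:
            name, _, lineage = entry.partition("-->")
            members[parent].append(name.strip())
            hierarchy[name.strip()] = lineage.strip()
        else:
            # Indented entry outside a species list: a subphylum.
            members[phylum].append(entry)
            members[entry] = []
            parent = entry
            species_depth = None
    return members, hierarchy


SAMPLE = "\n".join([
    "Apicomplexa",
    "\tSpecies:",
    "\tPlasmodium berghei-->Eukaryota; Alveolata; Apicomplexa; "
    "Haemosporida; Plasmodium; Plasmodium berghei.",
    "\tTheileria parva-->Eukaryota; Alveolata; Apicomplexa; "
    "Piroplasmida; Theileriidae; Theileria.",
    "Alpha",
    "\tSubphy 1",
    "\t\tSpecies:",
    "\t\tWorm-->Eukaryota; Superworms; Worm.",
    "\tSubphy 2",
    "\t\tSpecies:",
    "\t\tCat-->Eukaryota; Household pet; Cat.",
])

members, hierarchy = parse_species_tree(SAMPLE)
print(members["Alpha"])
print(hierarchy["Worm"])
print(machine_label("Plasmodium berghei"))
```

The "Species:" marker is what lets the parser distinguish a one-tab species list under a phylum from a one-tab list of subphyla, which is the rationale the appendix gives for including it.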
APPENDIX 2: DATA TABLES
APPENDIX 3: FUNCTION REFERENCE
dbconn.pl
Global variables
Name        Purpose
$outfile    File to write species information to
$con        PartiGeneDB “estdb” database connection handle
makeAChoice()
This function iterates through the rows of the “scheme_division” table and determines whether or not the current row, which corresponds to a phylum, contains subphyla. If it does, then control passes to the underSubPhylum() subroutine; otherwise it passes to underPhylum().
underPhylum()
This function reads the tax_id’s of the current phylum and retrieves the corresponding species names, which are then written to the $outfile.
underSubPhylum()
This function executes if a subphylum was detected in the current row of the “scheme_division” table. It takes the tax_id’s of that subphylum and writes the subphylum and its member species to $outfile, formatted so that it is clear that this subphylum and its species are part of the last encountered phylum.
update_form.pl
Global variables
Name                   Purpose
$readin                Name of file to read organism names from
$printout              Name of file to write the HTML form to
%global_hash           Hash that maps the name of each array to the actual array data structure. Keeps track of the arrays that hold all the taxonomic groupings, species, and their relationships with each other.
%hash_tax_hierarchy    Hash that maps the name of each species to its “taxonomic hierarchy string”, which lists all the taxonomic groups it is a member of.
@start                 Array for the menu choices for the highest rank (phylum)
$i                     Keeps track of the next available index in @start for adding a new menu choice.
@label                 An array that is used as a stack for labels that have been encountered so far. If a list of subphyla is encountered, the iteration through the phyla is put on hold while the subphyla list is “explored”. Likewise, if a list of species is encountered, the iteration through the subphyla (or phyla) is put on hold while the species list is “explored”. As this “exploration” proceeds, new entries are added to this stack. Upon a list’s completion, the last x labels in this stack correspond to entries in this list. Those labels are taken off the stack, added to an array, and this array is added to %global_hash. The list that was being processed before this still needs to be completed, and the labels encountered so far for this list are still in the stack, so the processing of the previous list can continue normally. Labels are the machine-readable form of each entry.
@line                  Functions like @label, but stores the human-readable form of each entry.
$j                     Keeps track of the next available index in @label and @line for adding new entries to these “stacks”, as well as retrieving entries off the top of these “stacks”.
$the_string            A change in the indentation preceding an entry in the input file signals a change in the hierarchy level that is currently being “explored”. When this happens, control passes to the block of code that is responsible for handling that level. This variable allows the last line read in to be available to that code block for processing of the first entry.
The first half of this file reads input line-by-line from the input file “$readin” and creates arrays out of the lists of taxonomic names. This is accomplished with the first two subroutines: menuBuildStart() and menuBuild(). Before this code is executed, a filehandle FILEIN is opened on “$readin”.
menuBuildStart()
This function is used to process the highest level of the species hierarchy found in “$readin” (i.e. the phyla). Two versions of the phylum name are created: a human-readable version, which is the unmodified name, and a machine-readable version, which converts all characters to lowercase and removes whitespace from the name. Any member species or subphyla of this phylum are processed with a call to menuBuild(), which returns the number of members and allows it to be appended to the human-readable name so the user of the web form can see how many members each phylum contains. This information is stored in the array “@start”.
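The stack bookkeeping described for @label and $j can be sketched as follows. This is a Python illustration under stated assumptions, not a translation of update_form.pl: the input is assumed to be already reduced to (depth, label) pairs in file order, the function name collect_lists and the "start" key for the top-level list are hypothetical, and a Python list stands in for the Perl array and its index variable.

```python
def collect_lists(entries, root="start"):
    """Group (depth, label) entries into parent -> children lists.

    Labels are pushed onto a stack as they are read. When a list is
    finished, its labels are the topmost run of equal depth on the
    stack; they are popped and stored under their parent's label.
    """
    global_hash = {}
    stack = []  # (depth, label) pairs whose list is still open

    def close_lists(down_to_depth):
        # Close every open list that is deeper than the given depth.
        while stack and stack[-1][0] > down_to_depth:
            depth = stack[-1][0]
            finished = []
            while stack and stack[-1][0] == depth:
                finished.append(stack.pop()[1])
            finished.reverse()  # restore file order
            parent = stack[-1][1] if stack else root
            global_hash[parent] = finished

    for depth, label in entries:
        close_lists(depth)  # a shallower entry ends any deeper lists
        stack.append((depth, label))
    close_lists(-1)  # end of input closes everything, including phyla
    return global_hash


entries = [
    (0, "alpha"),
    (1, "subphy1"),
    (2, "worm"),
    (2, "slimeball"),
    (1, "subphy2"),
    (2, "cat"),
    (0, "apicomplexa"),
    (1, "plasmodiumberghei"),
]

for parent, children in collect_lists(entries).items():
    print(parent, children)
```

The point of the stack is that an outer list (for example the phyla) stays on hold underneath while an inner list is explored, and resumes unchanged once the inner list has been popped.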
menuBuild()
This function is used to process lines of input that correspond to a subphylum or species. A human-readable and a machine-readable version are created as in menuBuildStart(), and the next line is examined to determine what to do next. There are three possibilities:
1.) A new list is nested below this entry, which means this entry is a subphylum and the following lines are a list of its member species.
2.) There is another entry in the current list just below this one.
3.) The end of this list has been reached, so the program must return to the list that this entry’s parent was part of (e.g. if the current entry is a subphylum, then its parent was a phylum and the program returns to processing the phylum list).
The parent of this entry is determined by the variable “$parent_label”, which can be passed as a parameter in calls to menuBuild(). The parent is the name of the array to which this entry is added, so each array is named after a “parent label” and contains that parent’s members.
******
The second half of this file writes the HTML form (i.e. the front-end user interface) to “$printout” via the FILEOUT filehandle. In the content of the first <script> tag, the arrays that were generated in the first half are written as JavaScript arrays, and the name-to-array mappings are placed in a JavaScript associative array called “%hashy”, which is the JavaScript equivalent of “%global_hash”. Basically the complete array structure of “update_form.pl” is translated into JavaScript, where the scripts in “jscripts.js” can use these values in their client-side scripting when responding to the user’s input. More information on this can be found in the “jscripts.js” reference further down. After that some PHP code is written so that the form can detect which browser the user is using.
Since different browsers have slightly different implementations of JavaScript, browser detection is important for loading the correct JavaScript file so that all scripts will function correctly. By default, the scripts from “jscripts.js” are used, with three alternative versions available for the following browsers: Windows Internet Explorer, Safari, and Camino. Following this, the first table, “main”, is created, which holds the cells that will be filled with the pull-down menus and the R, P, and A buttons. Initially only the first menu is generated from the entries listed in “@start”, along with its corresponding checkboxes; user selections will generate the remaining menus accordingly. Again, there is some PHP code used to detect the browser used, as the options available for detecting menu changes and calling JavaScript functions vary across browsers. Therefore, the invocation of the JavaScript functions used to effect the user-driven changes described above will be implemented depending on the user’s browser. After table “main”, a space for the display of a species’ taxonomic hierarchy string is created, and then the search field elements are created, including the search
string text field, the search button, and a cell for writing the results. Next the table “curselects” is created that holds the current selections lists. The page concludes with the four buttons for manipulating current selections, the cutoff score field, and the Submit and Reset buttons. It also adds three hidden input tags called “reference”, “allPresentIns”, and “allAbsentFroms”. These inputs hold the organisms selected by the user, delimited by the “|” character, and are passed as input to the Phootprinter script when the form is submitted. Additionally, a hidden input called “allPresentInColors” is used for recording a specific colour associated with a corresponding “present in” organism, which reflects which phylum it came from. The colours are listed in the same order as the “present in” organisms in “allPresentIns”.
jscripts.js
This file contains the JavaScript scripts that rewrite the contents of “phootprinter.php” according to the user’s input. Originally embedded in the HTML, the sheer amount of scripting required to make every feature work forced the scripts to be moved to an external file.
buildMenu()
Parameters:
number: The rank of the requested menu, and its table cell
option: Determines which array to look in for the menu’s options
This function is called in response to the user selecting a choice from one of the non-Species menus. The initial choice (e.g. “<< Select a Phylum >>”) has no effect, as the selection of this choice makes the function return immediately. The string “ret” is created to hold the HTML that is to be written to the appropriate table cell (i.e. the next rank down). The corresponding array is retrieved from “hashy” and the labels are used as the