Current and Future Industry Applications of Genomics
KARIN SCHMITT
Exelixis Pharmaceuticals, South San Francisco, CA, USA
Already, there are numerous biotech companies that focus on gene identification and validation by applying the latest genomics technologies. The focus of research at Exelixis is on model organisms, ranging from worm, fly and zebrafish to mouse and human. Our goals are not only to identify new genes but also to delineate entire biochemical pathways using our PathFinder (TM) technology. Since many of these networks have been conserved throughout evolution, this knowledge can be applied directly to common human diseases, as well as to various agricultural applications. Genomics, including high-throughput mapping and sequencing, has revolutionized many areas of traditional genetics and has made it possible to identify every gene in a growing number of organisms. Applying these technologies at Exelixis required establishing a genomics platform that includes sequencing, mapping, gene identification and mutation detection. All of these activities must be supported by computational biology, and most of the bioinformatics infrastructure must be tailor-made for these specific applications. The main goal is to identify new genes efficiently and to translate this knowledge across different species. Given the current trend in genomics of steadily decreasing costs and increasing capacities, it will be possible in the future to branch out to as yet unexplored organisms with specific applications in agriculture. Similar approaches will be taken in the field of animal breeding and genetics, where initial genomics efforts are already underway.
Exelixis Pharmaceuticals: A Model Organism Company
At Exelixis Pharmaceuticals, a young biotechnology company, we use model organism genomics and related technologies to identify novel genes with applications in the pharmaceutical, human diagnostic, crop protection and animal health industries. Model organism genetics is coupled with genomics, informatics and biology to directly elucidate gene regulatory networks. This strategy is used to identify critical genes in disease and physiological pathways, and to determine functional relationships between genes. Obtaining this knowledge requires sophisticated genetic manipulations and other experimental tools that are today only possible in a few organisms. It also seems reasonable to expect that many pathways are conserved, even between worm, fly and humans. In addition, work with insect and nematode models can lead to the identification of novel products for crop protection and animal health, which could revolutionize the pharmaceutical and agrochemical industries. This paper describes Exelixis’ philosophy on several areas that we consider crucial in the application of genomics technology to model organisms. New research directions and the future of genomics are discussed, especially as related to different methods for gene identification.
Genomic Applications: PathFinder (TM) Technology
The PathFinder (TM) technology (Artavanis-Tsakonas et al., 1995; Karim et al., 1996) utilizes model organism genetics to identify gene targets that would be difficult or impossible to uncover using other experimental approaches. PathFinder (TM) screens identify novel genes that, when mutated, either reduce or increase the activity of the disease pathway (Figure 1). These new genes constitute potential targets for antagonists and, in some cases, agonists that are predicted to restore biochemical balance to the disease pathway.
FIGURE 1. Genetics PathFinder (TM) Technology.
The PathFinder (TM) technology is built on two key principles: 1) the value of genetics as a method to identify functional relationships between genes and pathways, and 2) the evolutionary conservation of biochemical pathways between invertebrates and humans. Genetics allows us to use invertebrates as in vivo assay systems that survey each and every gene in the genome to identify those that influence a particular disease pathway of interest. The short generation intervals of these model systems, coupled with the ease of germ line manipulation and the ability to screen a large number of mutagenized individuals for rare mutations, allow this type of genetic analysis to be carried out quickly and efficiently. The primary model systems for this type of research are the invertebrate organisms Drosophila melanogaster (fruit fly) and Caenorhabditis elegans (nematode). The ease with which Drosophila and C. elegans can be manipulated makes them ideal biological read-out systems to identify new genes, elucidate biochemical pathways and characterize biological processes in the context of a living organism. In order to confirm and extend the data obtained from model organisms, knowledge from mouse and human genetics, as well as from computational biology, genomics and biochemistry, is considered with regard to pathway function and target validation.
To initiate a PathFinder (TM) screen, a candidate gene is selected that has been implicated in a specific disease. By either misexpressing the human gene or mutating the orthologue of this gene in one of the model organisms, the biochemical pathway in which the disease gene operates is perturbed, which results in a visible change in the morphology or behavior of the organism. This change serves as a visible assay for the output of the affected pathway and allows for the identification of additional genes in the disease pathway. The end result of the PathFinder (TM) screen is the identification of a new set of validated targets based on their in vivo functional relevance to the biochemical pathway of interest. Genes that increase or decrease the activity of the pathway can be chosen, depending on the desired outcome. These new interacting genes are then cloned, and their human and mouse counterparts isolated using genomics and informatics approaches. Vertebrate orthologues of new genes discovered in invertebrate screens can be rapidly identified. The end result of this analysis is the identification of a new set of genes that, based on their in vivo functional relevance to the biochemical pathways, are major contributors to a given disease. A variety of functional analyses are then performed, including further genetic interaction tests in the model systems, cell-based assays, and biochemical characterization, in order to further validate the new genes as potential therapeutic targets in the disease pathway. In addition, many of the genes discovered will find utility as diagnostic reagents or gene-based therapeutics. In order to utilize this system in a high-throughput way and guarantee efficient gene identification, a number of genetic and genomic resources are needed. The creation of genome sequence banks and improved computational biology tools facilitate the rapid translation of genetic information between organisms.
Genomics Technology Platform
The goal of Exelixis is to provide the genomic resources needed to facilitate timely identification of genes through PathFinder (TM) analysis, including EST (expressed sequence tag)-based gene identification, SNP (single nucleotide polymorphism) development and construction of mapping resources. Specifics of the development of each of these resources are described below.
DNA Sequencing Resources
Our sequencing operations began several years ago, with the goal of creating a truly scalable and efficient process suitable for large cDNA projects and sequencing of entire genomes. Automation and informatics have been central to achieving this. Many of the laboratory tasks are very repetitive and can be completed more efficiently on a large scale. Automating the experimental set-up not only dramatically reduces the error rate but, by reducing labor costs, also makes it possible to consider genome-wide projects. We have developed ways to process large volumes of samples with a small staff, through automation and informatics systems that minimize human involvement. We have automated the processes for colony picking and replicating, template preparation and reaction set-up. Several of these processes have been developed at Exelixis and represent the outcome of the combined efforts of molecular biologists and hardware and software engineers. We have implemented a comprehensive informatics system that manages data flow and data analysis. All samples are tracked from submission to completion with state-of-the-art sample tracking software. Once the sequencing process is complete, DNA trace files are submitted to our data analysis pipeline for annotation, loaded into our internal database and made available for viewing and further analysis. This includes comparisons to other databases and contigging with other sequences.
One of our main strengths is the ability to respond rapidly and flexibly to new data and technological opportunities. The introduction of new sequencing equipment serves as an example. We consider the choice of sequence detectors a key issue for further scale-up: the existing sequencers employ conventional slab-gel electrophoresis, which requires considerable manual labor (for plate washing, gel pouring, gel loading, and gel breakdown). Two new capillary instruments are commercially available that promise much greater automation. In addition, these instruments allow vastly increased instrument throughput via electrophoresis at higher electric fields. Enhancements of the data handling have to go hand in hand with such a change in sequencing platform.
As a specific example of the development of DNA sequencing resources and their application, the Exelixis FlyTag (TM) project is directed towards the isolation, characterization, and functional annotation of the complete set of expressed genes in Drosophila. Currently over 95% of expressed genes are represented in this proprietary sequence database. This project has identified a wide array of novel genes, including orthologues of many human disease genes (see Rubin, 1998, for a review of public efforts). Resources such as this have helped to identify cross-species orthologues. In addition, intra-species surveys can increase the breadth of a protein family, which can result in the discovery of new genes and facilitate the positional cloning process by providing additional information to validate open reading frames. We have generated ESTs from various stages of fruit fly development and from different tissues.
Because some genes are expressed at much higher levels than others, they are over-represented in certain cDNA libraries. Directly sampling the transcripts in EST projects will therefore result in significant redundancy of genes. Typically, cDNA libraries used in EST projects are biochemically normalized (deselection) to reduce the degree of redundancy of sampled genes. Critical parameters for such projects include the choice of RNA material for library construction, methods of library construction, deselection procedures and data analysis. ESTs are an efficient tool to identify new genes and, in addition, provide a resource of basic information for developing other tools. The sequencing reads obtained in such a project often show significant homology to human, mouse and other model organism genes, which allows us to functionally annotate our resource. Sequencing reads from the 3’-end of library clones are ideal for development of polymorphic markers. ESTs can easily be mapped onto radiation hybrid panels to obtain localization along the chromosomes. The next phase of this project will focus on generating full-length coverage for novel ESTs. We are in the process of implementing high-throughput methods for generating full-length sequences of cDNA. Such a resource will increase the utility of EST databases as a genomic analysis tool, facilitate rapid full-length cloning for cDNA of interest and lead to the development of tools in other species where genomic sequence information is limited.
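The effect of unequal expression levels on EST redundancy can be illustrated with a small calculation. The sketch below is not part of the Exelixis pipeline; it assumes that each EST read samples a transcript independently and in proportion to its abundance, and the library composition (10 abundant genes, 990 rare ones) is a made-up example. Under that assumption, gene i with sampling probability p_i is missed by n reads with probability (1 - p_i)^n, so the expected number of distinct genes recovered is the sum of 1 - (1 - p_i)^n over all genes.

```python
def expected_distinct_genes(abundances, n_reads):
    """Expected number of distinct genes seen after n_reads random EST reads.

    abundances: relative transcript abundance per gene (any positive scale).
    Each read is assumed to sample a transcript independently in proportion
    to abundance, so gene i is missed with probability (1 - p_i) ** n_reads.
    """
    total = sum(abundances)
    return sum(1.0 - (1.0 - a / total) ** n_reads for a in abundances)


# Hypothetical library: 10 highly expressed genes and 990 rare ones.
skewed = [1000.0] * 10 + [1.0] * 990
# Idealized normalized library: every gene equally represented.
normalized = [1.0] * 1000

reads = 2000
raw_yield = expected_distinct_genes(skewed, reads)
norm_yield = expected_distinct_genes(normalized, reads)
print(f"raw library:        {raw_yield:.0f} distinct genes expected")
print(f"normalized library: {norm_yield:.0f} distinct genes expected")
```

For the same sequencing effort, the idealized normalized library is expected to yield far more distinct genes than the skewed one, which is the motivation for deselection. Real normalization is never perfectly uniform, so the second figure should be read as an upper bound under this simple model.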
Genome Analysis Resources
SNP maps, large insert genomic libraries, and other physical and genetic mapping tools have facilitated the rapid mapping and cloning of genes in Drosophila, C. elegans and other invertebrates. In order to accomplish this type of research, one needs to have state-of-the-art capability in automated genotyping, genomic sequencing, mutation detection and full-length cDNA cloning (see Collins et al., 1998, for a review of the Human Genome Project). The technique of positional cloning of genes has become a common component in modern genetics. This procedure assumes no functional information about the genes and identifies the gene solely on the basis of map position. The fine-scale localization of a particular gene in model systems requires recombination mapping. To improve on classical methods of recombination mapping, we have developed genome-wide SNPs for our ongoing projects in nematode and fruit fly (Wang et al., 1998). Positional cloning begins once recombination mapping has identified the region of the genome that contains a locus of interest. The first stage in the positional cloning process is the creation of a physical map. Such maps consist of a series of overlapping recombinant DNA clones and their associated markers. The creation of a physical map can be divided into the following steps: 1) create a bacterial artificial chromosome (BAC) map with available markers, 2) test additional genetic markers to narrow down the linked region, and 3) identify all genes and possible mutations in the region of interest. BAC libraries are currently available for many model organisms and we have constructed several in house.
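The idea of a physical map as overlapping clones tied together by shared markers can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: clone and marker names are hypothetical, marker content per clone is assumed to be known without error, and two clones are treated as overlapping whenever they share at least one marker. It groups clones into contigs with a union-find structure; it does not order the clones within a contig.

```python
from collections import defaultdict


def build_contigs(clone_markers):
    """Group clones into contigs: two clones are joined if they share a marker.

    clone_markers: dict mapping clone name -> set of marker names it contains.
    Returns a sorted list of contigs, each a sorted list of clone names.
    """
    parent = {clone: clone for clone in clone_markers}

    def find(clone):
        # Follow parent links to the representative, compressing the path.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    # Index clones by marker, then merge all clones that carry the same marker.
    clones_by_marker = defaultdict(list)
    for clone, markers in clone_markers.items():
        for marker in markers:
            clones_by_marker[marker].append(clone)
    for clones in clones_by_marker.values():
        first = find(clones[0])
        for other in clones[1:]:
            parent[find(other)] = first

    contigs = defaultdict(list)
    for clone in clone_markers:
        contigs[find(clone)].append(clone)
    return sorted(sorted(members) for members in contigs.values())


# Hypothetical BAC clones and the markers detected on each.
bacs = {
    "BAC_01": {"m1", "m2"},
    "BAC_02": {"m2", "m3"},
    "BAC_03": {"m3"},
    "BAC_04": {"m7", "m8"},
    "BAC_05": {"m8"},
    "BAC_06": {"m12"},
}
contigs = build_contigs(bacs)
for members in contigs:
    print(members)
```

Gaps between contigs (here, between the group containing BAC_03 and the group containing BAC_04) are the places where additional markers or clones would be needed to close the map.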
Genotyping and Genetic Mapping
Genetic mapping and genotyping technologies have been set up as a tool to support our map-based cloning approaches for gene discovery. There are multiple methods for genetic mapping and genotyping, each of which has pros and cons (see Dietrich et al., 1999, for a review). For the construction of de novo maps, radiation hybrid mapping provides the best combination of high throughput and good resolution (Schuler et al., 1996). However, radiation hybrid panels are not yet available for many species. We mainly use meiotic maps in combination with our PathFinder (TM) technology. There are many types of markers that are used for genetic mapping, including Simple Sequence Repeats (SSRs), Restriction Fragment Length Polymorphisms (RFLPs), Amplified Fragment Length Polymorphisms (AFLPs), Random Amplified Polymorphic DNA (RAPDs) and SNPs derived from random or genic sequence (cSNPs). These markers have been developed to saturate the genetic map and can be used for easy identification of recombinant breakpoints. Such markers are also invaluable for marker-assisted selection strategies. The choice between these markers is highly dependent on the rate of polymorphism, the available sequence information and the cost and efficiency of large-scale genotyping. For the Exelixis fruit fly mapping effort, we used SNPs derived from already localized sequence tags to build a collection of SNP markers covering the whole genome (Kwok et al., 1996). These markers are abundant in the fruit fly genome, have been derived from publicly available resources and are easily amenable to high-throughput genotyping.
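How a dense marker set localizes recombinant breakpoints can be shown with a minimal sketch. It assumes that markers are already ordered along the chromosome and that each typed marker has been assigned to one of two parental strains, labeled "A" and "B" here; marker names and genotypes are hypothetical. A breakpoint is reported for every pair of adjacent typed markers whose parental origin differs.

```python
def find_breakpoints(marker_order, genotypes):
    """Return marker intervals in which the parental origin switches.

    marker_order: marker names in map order along one chromosome.
    genotypes: dict marker -> parental allele label (e.g. "A" or "B");
               markers that were not typed may be omitted.
    """
    typed = [(m, genotypes[m]) for m in marker_order if m in genotypes]
    breakpoints = []
    for (left, left_allele), (right, right_allele) in zip(typed, typed[1:]):
        if left_allele != right_allele:
            breakpoints.append((left, right))
    return breakpoints


order = ["snp1", "snp2", "snp3", "snp4", "snp5"]
# Hypothetical recombinant chromosome; snp3 was not typed.
recombinant = {"snp1": "A", "snp2": "A", "snp4": "B", "snp5": "B"}
intervals = find_breakpoints(order, recombinant)
print(intervals)
```

In this example the crossover lies somewhere between snp2 and snp4. Typing snp3, or any additional marker in that interval, would narrow the breakpoint further, which is why marker density matters for fine-scale mapping.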
There are many different techniques available for high-throughput genotyping, and the method of choice depends on the available markers and the scale of the experiment. Most of our efforts are centered around SNPs, which are expected to take the place of SSR-based markers (mainly microsatellites). SNPs are more prevalent in the genome than microsatellite markers, providing a large set of markers near any locus of interest. In addition, SNP genotyping is easier to automate, and increasingly efficient SNP typing methods will become available for very high throughput applications. Our current methods for genotyping are mainly denaturing HPLC and fluorescent SSCP (single-strand conformation polymorphism) analysis (see Schafer and Hawkins, 1998, for a review). We are in the process of investigating non-gel-based assays (oligonucleotide ligation assay (OLA) and mini-sequencing methods using fluorescent primers), as well as genotyping by mass spectrometry. Despite the many methods available, there is a pressing need for increased throughput and a reduction in genotyping costs. There has been a trend towards miniaturization and alternative assays that will not require prior PCR amplification. Chip-based assays are now available for the human genome (Landegren et al., 1998).
Mutation Detection
The final step of a map-based cloning effort or the genetic evaluation of candidate genes is mutation detection. For high-throughput mutation detection, we employ sequencing of cloned PCR products, denaturing HPLC and fluorescent SSCP. All three methods require PCR amplification from genomic DNA as a first phase. The second phase involves the application of a suitable mutation detection technique to the PCR-amplified fragments. SSCP relies on the differential migration of single-stranded molecules through acrylamide, based on the effect of sequence variation on intra-strand loop formation. Denaturing HPLC involves the chromatographic separation of double-stranded DNA fragments. Under partially denaturing conditions, duplexes containing mismatches are eluted earlier than those that are perfectly base-paired. Sequencing is performed on cloned PCR products. The key issues in deciding on a mutation detection system are: (i) sensitivity (if only a single strain or allele is available for screening, sensitivity must be high; if several independent alleles at a locus are known, a less sensitive but faster method may be more appropriate), (ii) cost and throughput, and (iii) rate of polymorphism. SSCP and HPLC are rapid, relatively inexpensive (except for primer cost, especially when using fluorescently labelled oligos) and highly sensitive if the assays are formatted correctly. However, neither method reveals the nature of a sequence change, and both require follow-up by sequencing if this information is important. Throughput is limited only by the number of machines available for mutation detection. Sequencing of cloned products is very sensitive and relatively inexpensive. High quality sequence information is obtained from several clones using vector-specific primers. The current limit on throughput is the manual analysis of sequence traces, since actual polymorphisms or mutational changes have to be distinguished from PCR artifacts.
We are currently developing software to make this process more automated. Efficient and inexpensive mutation detection remains a challenge and new methods are constantly being developed. Resequencing chips are at present a means of rapid and unambiguous determination of variants (Wang et al., 1998). Several human chips are commercially available but it remains unclear whether a similar effort to provide chips will be undertaken for model organisms. In addition, complete sequence information is required for chip design. Mass spectrometry is most immediately applicable to high-throughput genotyping, but may also be relevant to the identification of unknown variants (Kwok, 1998).
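The task of separating real sequence changes from PCR artifacts when several clones of the same PCR product have been sequenced can be sketched as a simple voting rule. This is an illustrative assumption, not a description of the software mentioned above: reads are assumed to be already aligned to the reference and of equal length (no insertions or deletions), and a change is reported only if at least `min_clones` independent clones agree on it. The threshold of 2 and the sequences are hypothetical.

```python
from collections import Counter


def call_variants(reference, clone_reads, min_clones=2):
    """Report positions where at least min_clones reads agree on a non-reference base.

    reference: reference sequence for the amplified fragment.
    clone_reads: sequences from independent clones, assumed already aligned
                 to the reference and of the same length (no indels).
    Returns a list of (position, reference_base, variant_base, supporting_clones)
    with 0-based positions.
    """
    for read in clone_reads:
        if len(read) != len(reference):
            raise ValueError("reads must be pre-aligned to the reference length")

    calls = []
    for pos, ref_base in enumerate(reference):
        # Count non-reference bases observed at this position across clones.
        counts = Counter(read[pos] for read in clone_reads if read[pos] != ref_base)
        for base, support in sorted(counts.items()):
            if support >= min_clones:
                calls.append((pos, ref_base, base, support))
    return calls


reference = "ACGTACGTAC"
clones = [
    "ACGTTCGTAC",  # A->T at position 4
    "ACGTTCGTAC",  # same change, second clone
    "ACGTTCGTAC",  # same change, third clone
    "ACGTACGTGC",  # A->G at position 8, seen in one clone only
]
variants = call_variants(reference, clones)
print(variants)
```

The change at position 4 is supported by three clones and is reported, while the singleton at position 8 is treated as a likely polymerase error and dropped. For a heterozygous sample only about half of the clones are expected to carry a true change, so the threshold has to be chosen with the number of clones sequenced in mind.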
Computational Biology Resources
Bioinformatics plays a central role in any genomics-oriented project and is a major factor in organizing any large-scale project efficiently. The tasks range from archiving data in specially designed databases, providing electronic notebooks to capture experimental results and tracking all samples in a laboratory, to developing sophisticated algorithms to compare and annotate sequence information. Very few such tools are available commercially, and most applications have to be written to fit a particular laboratory set-up. In order to support our activities, we have had to develop our own suite of tools for the identification and analysis of orthologous genes across broad evolutionary distances, from human to insects, nematodes, yeasts, and bacteria. In addition to expertise in sequence analysis and function prediction, we have needed to establish expertise in structural biology and computational genomics, as well as specialized software tools for laboratory information management, automated data capture, data organization and representation. Furthermore, this effort must be supported by a state-of-the-art hardware platform to maintain rates of data transfer and sustain enhanced performance of large-scale computational tools.
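One widely used heuristic for proposing orthologous gene pairs between two species is the reciprocal best hit: gene a in species A and gene b in species B are paired if b is the best-scoring match for a and a is the best-scoring match for b. The sketch below illustrates that heuristic only; the text does not state which method the Exelixis tools use, and the gene names and similarity scores are hypothetical stand-ins for the output of a sequence comparison program.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict (query, subject) -> similarity score (higher is better).
    Ties are resolved in favor of the pair encountered first.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a_gene, b_gene) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical similarity scores from searching fly genes against human genes
# and human genes against fly genes.
fly_vs_human = {
    ("fly_g1", "hum_g1"): 250,
    ("fly_g1", "hum_g2"): 90,
    ("fly_g2", "hum_g2"): 180,
    ("fly_g3", "hum_g2"): 120,
}
human_vs_fly = {
    ("hum_g1", "fly_g1"): 245,
    ("hum_g2", "fly_g2"): 175,
    ("hum_g2", "fly_g3"): 118,
    ("hum_g2", "fly_g1"): 88,
}
pairs = reciprocal_best_hits(fly_vs_human, human_vs_fly)
print(pairs)
```

Here fly_g3 matches hum_g2 best, but hum_g2 matches fly_g2 best, so fly_g3 is left unpaired. Cases like this often indicate members of a gene family and call for closer analysis than a best-hit rule can provide.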
The Future of Genomics at Exelixis
Mutagenesis has been a powerful tool for understanding gene function and disease in many organisms. Large-scale mutagenesis screens that generate a vast collection of mutant phenotypes have been extensively used in nematode and fruit fly research and form the basis of our PathFinder (TM) technology (Artavanis-Tsakonas et al., 1995; Karim et al., 1996). Such projects are now underway on a genome-wide scale in the mouse (Brown and Nolan, 1998). After generating a large number of mutants that are carefully phenotypically analyzed, positional cloning methods are employed for identifying the underlying genes. In addition, one can screen a large collection of mutagenized animals for changes in a gene of interest using mutation detection, in order to create an allele series for a particular gene. One of the most immediate applications following gene identification in model organisms and the cloning of the mouse and human orthologues will be in the area of pharmacogenomics (Marshall, 1997). Genes identified through PathFinder (TM) analysis will be tested for variations in human population-based disease samples, with the goal of validating their involvement in a certain disease. In addition, similar population resources will allow us to demonstrate a statistical correlation between a patient’s response to a drug and variations in certain genes that are suspected to be involved in the process.
Our FlyTag (TM) project has been directed towards the isolation, characterization, and functional annotation of the complete set of expressed genes in Drosophila. We are now in the process of extending this effort to other invertebrates, both beneficial species and insect or nematode pests, that are relevant to crop protection and animal health, as part of the AgTag (TM) project. This database will cross-reference orthologous sequences from a wide array of pest and beneficial invertebrates, in order to enhance genetic target discovery and speed cross-species cloning. We are also embarking on genomic sequencing programs for other major invertebrate pests, since genomic sequencing is one of the most efficient ways to identify all genes for a given organism. Comparative genomics approaches will increasingly be used to explore many diverse insect species in our agricultural program. In addition, we will continue to improve technologies such as sequencing, genotyping, mutation detection, transcriptional profiling, and proteomics.
Genomics Applications in Animal Breeding and Genetics
All of the genomic infrastructure described above, which is primarily aimed at gene identification, has already been put in place for the pig genome (Rohrer et al., 1996; Yerle et al., 1996, 1998) and other livestock species. But one of the challenges that lies ahead is to take the next step and expand the resources on a truly genome-wide scale. In order to make positional cloning more efficient, a genome-wide physical map and a high-density genetic map are required. The initiation of positional cloning, fine-scale mapping, and BAC contigging requires markers within approximately 1 cM of a given mutation or locus. An important question will be the choice and availability of markers. The human field has seen a transition to SNP markers over the last year because they are easy to identify, abundant in the genome and inexpensive to genotype. A large-scale EST project, together with a radiation hybrid mapping effort, will make the candidate gene approach possible. Such an integrated mapping resource will enhance the comparative and functional analysis of the livestock genomes. Genes are ideal markers for comparative genomics. Both ESTs and physical mapping resources (BACs) will be available to generate new markers that can be used for genetic mapping. All of this has become possible within the last few years, and costs and labor are no longer overwhelming. The advent of genetic maps with high-density markers for livestock species will enable the genetic unraveling of many of the polygenic traits. The full strategy for polygenic trait gene identification incorporates many stages that employ the genomic techniques discussed above. The ultimate goal is gene identification. In contrast, strategies for phenotype-driven, marker-assisted selection are different. However, the experimental techniques at the genomic level are identical.
With ongoing improvements in throughput and considerable decreases in the cost of many assays over the last few years, large-scale projects have become feasible. Providing mapping resources and large genomic clone libraries has become routine. Mutation detection projects on a large number of candidate genes can, therefore, be considered. Genotyping of extremely large sample sizes is possible with minimal set-up costs. Most importantly, with genomics resources being provided for many model organisms and the full human genome sequence becoming available, comparative genomics approaches and synteny mapping are becoming more powerful.
References
Artavanis-Tsakonas, S., K. Matsuno, M.E. Fortini, 1995. Notch signaling. Science 268:225–232.
Brown, S.D.M., and P.M. Nolan, 1998. Mouse mutagenesis – systematic studies of mammalian gene function. Human Molecular Genetics 7:1627–1633.
Collins, F.S., A. Patrinos, E. Jordan, A. Chakravarti, R. Gesteland, L. Walter, 1998. New Goals for the U.S. Human Genome Project: 1998-2003. Science 282:682–689.
Dietrich, W.F., J.L. Weber, D.A. Nickerson, P.Y. Kwok, 1999. Identification and analysis of DNA polymorphisms. Page 135 in: Genome Analysis: A laboratory manual, Volume 3. B. Birren, E.D. Green, S. Klapholz, R.M. Myers, H. Riethman, J. Roskams, eds. Cold Spring Harbor Laboratory Press.
Karim, F.D., H.C. Chang, M. Therrien, D.A. Wassarman, T. Laverty, G.M. Rubin, 1996. A screen for genes that function downstream of Ras1 during Drosophila eye development. Genetics 143:315–329.
Kwok, P.Y., 1998. Genotyping by mass spectrometry takes flight. Nature Biotechnology 16:1314–1315.
Kwok, P.Y., Q. Deng, H. Zakeri, S.L. Taylor, D.A. Nickerson, 1996. Increasing the information content of STS-based genome maps: identifying polymorphisms in mapped STSs. Genomics 31:123–126.
Landegren, U., M. Nilsson, P.Y. Kwok, 1998. Reading bits of genetic information: methods for single-nucleotide polymorphism analysis. Genome Research 8:769–776.
Marshall, A., 1997. Getting the right drug into the right patient. Nature Biotechnology 15:1249–1252.
Rohrer, G.A., L. Alexander, Z. Hu, T.P. Smith, J.W. Keele, and C. Beattie, 1996. A comprehensive map of the porcine genome. Genome Research 6:371–391.
Rubin, G.M., 1998. The Drosophila Genome Project: a progress report. Trends Genet. 14:340–343.
Schafer, A.J., and J.R. Hawkins, 1998. DNA variation and the future of human genetics. Nature Biotechnology 16:33–39.
Schuler, G.D., M.S. Boguski, E.A. Stewart, L.D. Stein, G. Gyapay, K. Rice, R.E. White, P. Rodriguez-Tomé, A. Aggarwal, E. Bajorek, S. Bentolila, B.B. Birren, A. Butler, A.B. Castle, N. Chiannilkulchai, A. Chu, C. Clee, S. Cowles, P.J.R. Day, T. Dibling, C. East, N. Drouot, I. Dunham, S. Duprat, C. Edwards, J.-B. Fan, N. Fang, C. Fizames, C. Garrett, L. Green, D. Hadley, M. Harris, P. Harrison, S. Brady, A. Hicks, E. Holloway, L. Hui, S. Hussain, C. Louis-Dit-Sully, J. Ma, A. MacGilvery, C. Mader, A. Maratukulam, T.C. Matise, K.B. McKusick, J. Morissette, A. Mungall, D. Muselet, H.C. Nusbaum, D.C. Page, A. Peck, S. Perkins, M. Piercy, F. Qin, J. Quackenbush, S. Ranby, T. Reif, S. Rozen, C. Sanders, X. She, J. Silva, D.K. Slonim, C. Soderlund, W.-L. Sun, P. Tabar, T. Thangarajah, N. Vega-Czarny, D. Vollrath, S. Voyticky, T. Wilmer, X. Wu, M.D. Adams, C. Auffray, N.A.R. Walter, R. Brandon, A. Dehejia, P.N. Goodfellow, R. Houlgatte, J.R. Hudson Jr., S.E. Ide, K.R. Iorio, W.Y. Lee, N. Seki, T. Nagase, K. Ishikawa, N. Nomura, C. Phillips, M.H. Polymeropoulos, M. Sandusky, K. Schmitt, R. Berry, K. Swanson, R. Torres, J.C. Venter, J.M. Sikela, J.S. Beckmann, J. Weissenbach, R.M. Myers, D.R. Cox, M.R. James, D. Bentley, P. Deloukas, E.S. Lander, and T.J. Hudson, 1996. A gene map of the human genome. Science 274:540–546.
Wang, D.G., J.-B. Fan, C.-J. Siao, A. Berno, P. Young, R. Sapolsky, G. Ghandour, N. Perkins, E. Winchester, J. Spencer, L. Kruglyak, L. Stein, L. Hsie, T. Topaloglou, E. Hubbell, E. Robinson, M. Mittmann, M.S. Morris, N. Shen, D. Kilburn, J. Rioux, C. Nusbaum, S. Rozen, T.J. Hudson, R. Lipshutz, M. Chee, and E.S. Lander, 1998. Large-scale identification of single-nucleotide polymorphisms in the human genome. Science 280:1077–1082.
Yerle, M., Echard, G., Robic, A., Mairal, A., Dubut-Fontana, C., Riquet, J., Pinton, P., Milan, D., Lahbib-Mansais, Y., and Gellin, J., 1996. A somatic cell hybrid panel for pig regional gene mapping characterized by molecular cytogenetics. Cytogenet. Cell Genet. 73(3):194–202.
Yerle, M., P. Pinton, A. Robic, C. Delcros, R. Hawken, A. Alexander, C. Beattie, L. Schook, D. Milan, and E. Gellin, 1998. Construction of a whole genome radiation hybrid panel for high resolution gene mapping in pigs. Proc. 27th ISAG meetings, Auckland, NZ, p. 60.