Chapter 5
EST DATABASES
Table of contents
1. dbEST
   1.1 Introduction to ESTs and dbEST
   1.2 Released entries in dbEST
   1.3 Access to dbEST data
   1.4 Example
   1.5 Result page
2. UniGene
   2.1 Introduction
   2.2 Expressed Sequence Tags
   2.3 Clustering-UniGene build procedure
   2.4 Digital Differential Display
   2.5 Current data
3. STACK
   3.1 Introduction
   3.2 Stack generation
   3.3 Incorporating RH mapping information
   3.4 Gene discovery resource
   3.5 Data access via the web
   3.6 Architectural overview and future directions
4. TGI
   4.1 Introduction
   4.2 Construction of database
   4.3 New databases and tools
5. References
1. dbEST
1.1 Introduction to ESTs and dbEST:
Expressed Sequence Tags (ESTs) are short (usually about 300-500 bp), single-pass sequence reads derived from mRNA (via cDNA). Typically they are produced in large batches. They represent a snapshot of genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library.
Edited by- Dipali N.V, Anil S. T, Punit R.C, Sumit M.J, Yogesh G.P Published by- Anand M. B.
dbEST is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms. dbEST is maintained by the National Center for Biotechnology Information (NCBI). The web site for dbEST is http://www.ncbi.nlm.nih.gov/dbEST/
Most EST projects develop large numbers of sequences. These are commonly submitted to GenBank and dbEST as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. A brief account of the history of human ESTs in GenBank is available.
dbEST is reserved for single-pass reads. Assembled sequences should not be submitted to dbEST. GenBank will accept assembled EST submissions for the forthcoming TSA (Transcriptome Shotgun Assembly) division.
1.2 Released Entries In dbEST:
The first dbEST release was on 8/5/1998. It contained more than 1.6 million ESTs in total, of which more than 1 million were from Homo sapiens and about 300,000 were from Mus musculus and M. domesticus. The current release, dated July 4, 2008, contains 54,447,050 entries, of which 8,137,901 are from Homo sapiens (human) and 4,850,258 are from Mus musculus + domesticus (mouse).
1.3 Access to dbEST Data:
EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez.
The nucleotide sequences may be searched using the BLAST electronic mail server. The TBLASTN program which takes an amino acid query sequence and compares it with six-frame translations of dbEST DNA sequences is particularly useful.
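To make the idea of a six-frame translation concrete, the following is a minimal sketch in Python. It is not the TBLASTN implementation; it only shows how one DNA sequence yields six candidate protein sequences (three reading frames on each strand), which is what a protein query is compared against. The function names are illustrative, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard table is appropriate for the sequences being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to the translated protein strings."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Because a single-pass EST may come from either strand and may start at any codon position, searching all six frames avoids having to know the correct frame in advance.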
EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the /repository/dbEST directory at ftp.ncbi.nih.gov
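A FASTA flat file such as the one mentioned above can be read with a few lines of Python. The sketch below is a generic FASTA reader, not an NCBI tool; the record names in the example are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record file used only to demonstrate the reader.
example = io.StringIO(">est_1 hypothetical description\nACGT\nTTGA\n>est_2\nGGCC\n")
records = list(read_fasta(example))
```

For a real download, the same function can be given a file opened with `open(path)` instead of the in-memory example.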
EST entries can be found by searching with a gene name. The search can be narrowed with limits such as organism name, creation date, author, protein name, EST name and many more.
The result page includes the dbEST Id, EST name, GenBank accession, GenBank gi, clone information, primers, sequence, comments, library, submitter and citations, together with a link to map data.
1.4 EXAMPLE:
Here "cytochrome" is used as the gene name for a search against the ESTs in dbEST. The search returns 75610 entries for cytochrome, and the first link shows the result given below:
1.5 Result page:
IDENTIFIERS
dbEST Id: 59134977
EST name: Hh_matS2_32C02_T3
GenBank Acc: FK702975
GenBank gi: 193890712
CLONE INFO
Clone Id: Hh_matS2_32C02 (3')
Plate: 32 Row: C Column: 02
DNA type: cDNA
PRIMERS
PCR forward: T7 short (AATACGACTCACTATAG)
PCR backward: T3 short (ATTAACCCTCACTAAAG)
Sequencing: T3(AATTAACCCTCACTAAAGGG)
PolyA Tail: yes
SEQUENCE
TTTCGAGCGGCCGCCCGGGCAGGTACCTGGACGATTGAACCAAGCAACCTTTATTGTTAG
CCGACCAGGCGTATTTTTCGGACAATGTTCTGAAATTTGCGGGGCTAACCATAGTTTTAT
ACCCATTGTTGTAGAAGCAGTCCCCCTTGAACACTTTGAGAACTGGTCTTCACTAATAAT
TGAAGACGCCTAA
Quality: High quality sequence starts at base: 2
Quality: High quality sequence stops at base: 194
Entry Created: Jul 9 2008
Last Updated: Jul 9 2008
COMMENTS
A section within this EST was identified as potentially derived from vector sequence. It has been replaced by 'N' residues.
PUTATIVE ID
Assigned by submitter
ref|YP_001403132.1| cytochrome c oxidase subunit II Hippoglossus hippoglossus. Score = 116 bits (291), Expect = 5e-25
LIBRARY
Lib Name: Subtracted maternal mRNA library of Atlantic halibut
Organism: Hippoglossus hippoglossus
Develop. stage: Embryonic
Lab host: E.coli
Vector: pCR4-TOPO
Description: SSH library using RNA from fertilized eggs as tester and from 10-somite stage embryos as driver
SUBMITTER
Name: Igor Babiak
Lab: Reproductive biology
Institution: Bodoe University College, Norway
Address: 8049 Bodoe, Norway
Tel: +47 75517922
Fax: +47 75517349
E-mail: [email protected]
CITATIONS
Title: Maternal mRNA as molecular markers of egg quality in the Atlantic halibut (Hippoglossus hippoglossus L.)
Authors: Mommens,M., Fernandes,J.M.O., Johnston,I.A., Babaik,I.
Year: 2009
Status: Unpublished
MAP DATA
2. UniGene
2.1 Introduction
Website: www.ncbi.nlm.nih.gov/UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included.
2.2 Expressed Sequence Tags
ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.
2.3 Clustering-UniGene Build Procedure: Transcriptome Based
Clustering is the process of finding subsets of sequences that belong together within a larger set. This is done by converting discrete similarity scores to Boolean links between sequences. That is, two sequences are considered linked if their similarity exceeds a threshold. UniGene clustering proceeds in several stages, with each stage adding less reliable data to the results of the preceding stage. This staged clustering affords greater control than a more egalitarian treatment of all links between sequences.
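The step of turning similarity scores into Boolean links and then into clusters can be sketched as follows. This is an illustrative single-linkage grouping with a union-find structure, not the actual UniGene build code; the sequence identifiers, scores and threshold are made-up values.

```python
def cluster_by_threshold(ids, scores, threshold):
    """Group ids into clusters, linking a pair when its score exceeds threshold.

    ids: iterable of sequence identifiers.
    scores: dict mapping (id_a, id_b) pairs to a similarity score.
    Returns a sorted list of sorted clusters.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score > threshold:  # the Boolean link
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers and scores, for illustration only.
example_ids = ["est1", "est2", "est3", "est4", "mrna1"]
example_scores = {
    ("est1", "est2"): 0.97,
    ("est2", "mrna1"): 0.95,
    ("est3", "est4"): 0.60,
    ("est1", "est3"): 0.40,
}
example_clusters = cluster_by_threshold(example_ids, example_scores, 0.9)
```

Note that est1 and mrna1 end up in the same cluster although they were never compared directly, because both link to est2. A staged procedure could be approximated by running the linking loop first over the most reliable links and then over progressively less reliable ones, although the text does not give the exact rules UniGene applies at each stage.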
2.4 Digital Differential Display (DDD):
DDD is a tool for comparing EST-based expression profiles among the various libraries, or pools of libraries, represented in UniGene. These comparisons allow the identification of those genes that differ among libraries of different tissues, making it possible to determine which genes may be contributing to a cell's unique characteristics, e.g., those that make a muscle cell different from a skin or liver cell. Along similar lines, DDD can be used to try to identify genes for which the expression levels differ between normal, premalignant, and cancerous tissues or different stages of embryonic development. As in UniGene, the DDD resource is organism specific and is available from the UniGene Web site for that organism. For those libraries that have sequences in UniGene, DDD lists the title and tissue source and provides a link to the UniGene Library page, which gives additional information about the library. From the libraries listed, the user can select two for comparison. DDD then displays those genes for which the frequency of the transcript is significantly different between the two libraries. The output includes, for each gene, the frequency of its transcript in each library and the title of the gene's corresponding UniGene cluster. Results are sorted by significance, with the genes having the largest differences in frequencies displayed at the top. Libraries can be added sequentially to the analysis, and DDD will perform an analysis on each possible library–gene pair combination.
Similarly, groups of libraries can be pooled together and compared with other pools or single libraries.
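The comparison of transcript frequencies between two libraries can be illustrated with a small calculation. The text above does not say which statistical test DDD applies, so the sketch below uses Fisher's exact test purely as an assumption, since it is a common choice for comparing counts of this kind. The function name and the EST counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, n_a, b, n_b):
    """Two-sided Fisher exact p-value for a gene seen in a of n_a ESTs in
    library A and in b of n_b ESTs in library B."""
    gene_total = a + b
    est_total = n_a + n_b

    def prob(x):
        # Hypergeometric probability that x of the gene's ESTs fall in library A.
        return comb(n_a, x) * comb(n_b, gene_total - x) / comb(est_total, gene_total)

    p_observed = prob(a)
    lowest = max(0, gene_total - n_b)
    highest = min(gene_total, n_a)
    # Sum the probabilities of all tables at least as extreme as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: 30 of 1000 ESTs in library A versus 5 of 1000 in library B.
freq_a = 30 / 1000
freq_b = 5 / 1000
p_value = fisher_exact_two_sided(30, 1000, 5, 1000)
```

Genes could then be sorted by p-value so that the most significant frequency differences come first, which mirrors the ordering of the output described above.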
2.5 Current Data
As of July 14, 2008, the human subset of UniGene contained 6,919,847 sequences in 122,958 clusters; 98% of these clustered sequences were ESTs, and the remaining 2% were from mRNAs or CDSs annotated on genomic DNA.
3. STACK:
3.1 Introduction:
STACK is a tool for detection and visualisation of expressed transcript variation in the context of developmental and pathological states. The expression state of a transcript can include developmental state, pathological association, site of expression and isoform of expressed transcript. STACK consensus transcripts are reconstructed from clusters that capture and reflect the growing evidence of transcript diversity. The comprehensive capture of transcript variants is achieved by the use of a novel clustering approach that is tolerant of sub-sequence diversity and does not rely on pairwise alignment. This is in contrast with other gene indexing projects. STACK is generated at least four times a year and represents the full processing of all publicly available human EST data extracted from GenBank. This processed information can be explored through 15 tissue-specific categories, a disease-related category and a whole-body index, and is accessible via the WWW at http://www.sanbi.ac.za/Dbases.html. STACK represents a broadly applicable resource, as it is the only reconstructed transcript database for which the tools for its generation are also broadly available (http://www.sanbi.ac.za/CODES).
Expressed sequence tags (ESTs) represent single-pass reads from the 5’ and/or 3’ end of cDNA clones. These error-prone sequences gain significant biological value if they are reconstructed into consensus transcripts. STACK differs from other gene indexing projects in that it focuses on capturing the context and form of expression and, through this, on reconstructing transcripts in their context of expression.
The broad implementation of alignment-based clustering tools can narrow the capture of potential variation within gene expression products. STACK does not rely on alignment-based clustering tools to perform clustering, but instead incorporates a novel, loose clustering approach, d2_cluster, that performs comparisons via non-contextual assessment of the composition and multiplicity of words within each sequence.
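The idea of comparing sequences by the composition and multiplicity of their words, without aligning them, can be sketched as below. This is a simplified whole-sequence version of a d2-style measure (the sum of squared differences in word counts) and is not the d2_cluster program itself; the word length and the function names are assumptions made for illustration.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_distance(x, y, k=6):
    """Sum of squared differences between the word counts of two sequences.

    The measure ignores where each word occurs (it is non-contextual), so
    rearranged or partially diverged sequences can still score as similar.
    """
    counts_x, counts_y = word_counts(x, k), word_counts(y, k)
    return sum(
        (counts_x[word] - counts_y[word]) ** 2
        for word in counts_x.keys() | counts_y.keys()
    )
```

A small distance means the two sequences share many words with similar multiplicity. A clustering tool could link two sequences when this distance falls below a chosen cutoff, but the cutoff and any windowing scheme used by d2_cluster are not given in the text.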
The non-alignment-based approach tends to capture transcript variants and sequences that could represent chimeric clones or aberrant transcripts associated with disease. The use of a loose clustering approach also allows for the incorporation of sequences that would otherwise be discarded as poor quality. The inclusion of low-quality sequence in STACK has led to the development of error-checking tools to assess the integrity of each cluster and, in some cases, to elongate reconstructed transcripts.
3.2 Stack generation:
Tissue data sets
The STACK database represents the processing of all publicly available human EST data extracted from GenBank. ESTs are first partitioned arbitrarily into tissue/state context bins, followed by removal of contaminating sequences such as vector, low complexity, mitochondrial and ribosomal sequences. Each tissue-grouped set is sent through a pipeline comprising clustering, assembly, consensus generation and clone linking. ESTs are clustered using d2_cluster, which attempts to capture alternate expressed forms of a gene within the same cluster, in contrast to other protocols that may discard sequences containing ‘noise’.
Alignment assemblies are generated using the PHRAP package (http://www.phrap.org), and the aligned clusters are used as input for assembly analysis using stack Analyze and CRAW. Assembly artifacts and isoforms are partitioned within a cluster and the longest consensus sequence is assigned to a given cluster, while additional sub-consensus sequences are captured within each record if they exist. Sequences that originate from the same cDNA clone are traced, and the corresponding clusters are joined by a simple linker sequence, producing extended STACK linked consensus entries.
Whole-body index
Tissue-based clusters, their consensus sequences and mRNAs extracted from GenBank have been used to create the STACK whole-body index. Tissue-level consensus sequences are decomposed to their constituent ESTs prior to a PHRAP assembly in order to maximise the alignment accuracy over EST reads that are of low quality. The whole-body index data is subjected to assembly analysis and consensus generation. Radiation hybrid mapping information is integrated into the consensus sequences using ePCR. The final consensus sequence is presented in FASTA format, where the header line captures the unique STACK-ID, the GenBank accession numbers for each of the constituent ESTs, the original clone libraries and mapping information if it exists.
3.3 Incorporating radiation hybrid (RH) mapping information
Mapping methodologies have centred around the use of sequence-tagged sites (STSs) as unique landmarks across the genome. EST-based landmarks entered the realm of feasibility when it was demonstrated that single-pass sequences provide suitable templates for the design of gene-based STSs. An international consortium was established to develop STSs from ESTs for mapping studies using primarily RH techniques. Approximately 1000 genetic markers from the Genethon map were included in the analysis to serve as a mapping framework and to allow gene positions to be related to genetic linkage information.
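The ePCR step mentioned above, which ties a mapped STS to a consensus sequence, rests on a simple idea: an STS is considered present when both of its primers are found in the right orientation and at roughly the expected spacing. The sketch below illustrates that idea with exact string matching only; it is not the e-PCR program, and the primers, sequence and size margin are made-up values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, product_size) for each exact primer-pair match whose
    product size is within margin bases of expected_size.

    forward and reverse are both written 5' to 3', so the reverse primer is
    searched for as its reverse complement on the given strand.
    """
    sequence = sequence.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = sequence.find(reverse_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


# Hypothetical consensus sequence and primer pair, for illustration only.
consensus = "TTT" + "ACGTAC" + "G" * 10 + "TCTAGG" + "AAA"
sts_hits = find_sts(consensus, "ACGTAC", "CCTAGA", expected_size=22, margin=2)
```

A consensus sequence with a hit inherits the map position of the matching STS. Real primer searches usually tolerate a few mismatches, which this sketch deliberately leaves out.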
Recently, about 41,000 markers were placed onto RH panels and formed the basis of GeneMap’98.
3.4 Gene discovery resource:
Capture of alternate gene expression forms
Databases such as TIGR and UniGene have focused on reconstructing the gene complement of the human genome, and their technological developments have been directed towards achieving that goal. STACK, however, focuses on the detection and visualisation of transcript variation in the context of developmental and pathological states.
Tissue specificity
STACK allows the user to rapidly explore 15 tissue categories, a disease-related category and a whole-body index. The tissue-based segmentation speeds disease analysis and functional annotation by providing the user with an immediate representation of areas of the body where a gene is expressed, as well as detailed library information that pinpoints the expression location for primary and alternate expression transcripts. For example, STACK was used in the characterisation of the retinitis pigmentosa (RP1) gene. Matches with an eye-specific transcript in STACK were used successfully in the mapping of retina-specific ESTs to chromosomal regions that coincide with the RP1 locus.
3.5 Data access via the web:
The STACK database can be queried via the Web at http://www.sanbi.ac.za/stacksearch.html using a sequence as input. The BLAST search algorithms implemented in the search engine allow for both DNA and protein queries. The results of a BLAST query are hyperlinked to the STACK viewer, which allows for the extraction of detailed information pertaining to the matching STACK sequence. STACK consensus sequences matched to Drosophila sequences are searchable on the Drosophila Related Expressed Sequences (DRES) home page at the Telethon Institute of Genetics and Medicine (http://www.tigem.it).
Web Probe, the STACK database extraction and viewing tool. The STACK tissue category is used as input to the ‘project name’ field that returns a list of all the clustered information.
3.6 Architectural overview and future directions:
The future development of STACK focuses on linking the underlying data more firmly to biological processes and making the resultant information accessible to a widening range of users. Inclusion of genomic information will be used to map clusters and expression state to genome location, as well as to confirm the quality of the indices. Genome context also allows reorganisation of the index into exons and transcript isoforms. Prediction of proteins from transcript isoforms and cross-references to known protein records follow, opening the door for association with standardised annotations, such as Gene Ontology (http://www.geneontology.org). STACK is freely available to academia and is distributed via the web site at http://www.sanbi.ac.za/CODES. The STACK_PACK tool set, which performs clustering, clustering management, alignment processing and analysis, is also freely available to academic institutions from the same site.
4. TGI:
4.1 Introduction:
The TIGR Gene Indices (TGI) are databases built by clustering and assembling ESTs.
EST: A short cDNA sequence, around 200-500 bp long, that is specific to a particular gene and is obtained from a particular condition or a specific developmental stage.
The TGI databases have moved to DFCI and are no longer available at TIGR/JCVI. The new web address is http://compbio.dfci.harvard.edu/tgi/
TGI is constructed using all publicly available EST and gene sequence data stored in GenBank. Individual databases are updated and released three times yearly. Sequences are first cleaned to identify and remove contaminating sequences, including
o vector,
o adaptor,
o mitochondrial, ribosomal and chimeric sequences.
These sequences are then searched pairwise against each other and grouped into clusters based on shared sequence similarity. The clusters are assembled at high stringency to produce tentative consensus (TC) sequences.
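As a toy illustration of what a consensus sequence is, the sketch below takes reads that are assumed to be already aligned (padded with gap characters to equal length) and reports the most common base in each column. Real assemblers perform the alignment themselves and apply quality and stringency checks, none of which is modelled here; the reads shown are made up.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Return the per-column majority base of equal-length, pre-aligned reads.

    Gap characters are ignored when voting; a column made only of gaps
    contributes nothing to the consensus.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if counts:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical aligned ESTs; the second read carries one sequencing error.
example_reads = ["ACGT--", "ACCTAG", "-CGTAG"]
example_consensus = majority_consensus(example_reads)
```

The example shows why combining error-prone single-pass reads is useful: the erroneous base in the second read is outvoted, and the consensus is longer than any single read.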
TCs: A TC is a predicted sequence created by assembling ESTs into virtual transcripts.
At present, 77 species are represented in the Gene Index databases, including
o 29 animals,
o 25 plants,
o 8 fungi and
o 15 protists.
4.2 Construction of database:
The process used to assemble each Gene Index is efficient and accurate. mgBLAST, a modified version of the Megablast program, is now used for the pairwise sequence comparisons that define the sequence clusters on which assembly is based. For large clusters containing hundreds or thousands of sequences (e.g. highly expressed genes such as actin), sequence representation is reduced prior to assembly using a variety of multilayer approaches, including transitive clustering. The Paracel Transcript Assembler (PTA), a modified version of the CAP3 assembly program, is used to assemble each TC.
4.3 New databases and tools:
1] EGO: This database, previously known as TIGR Orthologous Gene Alignments (TOGA), uses pairwise sequence similarity searches and a transitive, reciprocal closure process to identify Tentative Ortholog Groups (TOGs) in eukaryotes. Website: http://www.tigr.org/tdb/tgi/ego
EGO has expanded its representation to include all 77 species represented in the TGI, and TOGs have been cross-referenced to Online Mendelian Inheritance in Man (OMIM).
2] RESOURCERER: Provides annotation based on the TIGR Gene Indices for widely available microarray resources in human, mouse, rat, zebrafish and Xenopus. RESOURCERER provides a wide range of annotation and integration with genomic and other resources. Owing to its integration with the TGI and EGO, RESOURCERER also provides links between microarray platforms both within and between species.
5. References:
(a) Heading 1st (dbEST - Dipali N.V):
1. Webliography:
1> http://www.ncbi.nlm.nih.gov/dbEST.html
(b) Heading 2nd (UniGene - Anil S.T, Punit R.C):
1. Bibliography:
1> BIOINFORMATICS – A practical guide to the Analysis of Genes and Proteins. Authors: Andreas D. Baxevanis, B. F. Francis Ouellette. Page no. 288-292
2. Webliography:
1> www.ncbi.nlm.nih.gov/UniGene
2> http://www.ncbi.nlm.nih.gov/books
(c) Heading 3rd (STACK - Sumit M.J):
1. Webliography:
1> http://nar.oxfordjournals.org/cgi/content/abstract/29/1/234?ck=nck
2> http://www.bioinformatics.fr/biology.php?subsection=Human&topics=databases#1001
3> http://www.ingentaconnect.com/search/download?pub=infobike%3a%2f%2foup%2fnar%2f2001%2f00000029%2f00000001%2fart00234&mimetype=application%2fpdf
4> http://en.wikipedia.org/w/index.php?title=Protocol_stack&action=edit&section=1
(d) Heading 4th (TGI - Yogesh G.P):
1. Bibliography:
1> BIOINFORMATICS – A practical guide to the Analysis of Genes and Proteins. Authors: Andreas D. Baxevanis, B. F. Francis Ouellette
2. Webliography:
1> http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=540018
2> http://compbio.dfci.harvard.edu/tgi/
3> http://compbio.dfci.harvard.edu/tgi/definitions.html