Identification of Genes from Databases
Presented with respect to: Dr. C.G. Joshi, Professor, Department of Animal Biotechnology, College of Veterinary Science & A.H., Anand
Presented by: Patel Hiren M., M.V.Sc. (Animal Biotechnology)
Introduction
• Evolution is somewhat conservative
• Evolution often seems to have involved the duplication and divergence of genes
• A certain sequence may indicate a certain function
• A database is a structured set of data held in a computer
• The structures of new genes are constantly being added
Identifying Genes from Databases Requires Some Basic Knowledge
• What is a gene?
• What is gene structure?
  - Prokaryotes
  - Eukaryotes
• ORFs (open reading frames)
• cDNA libraries
• Human Genome Project
Prokaryote Genome
• In prokaryotes, introns are uncommon, and a gene is often a single uninterrupted stretch of DNA, called a cistron, that codes for a product.
• Functionally related genes are often clustered (as an operon) and can be transcribed together on the same mRNA.
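Because a prokaryotic gene is often one uninterrupted coding stretch, a first pass at finding candidates is to scan for open reading frames. The following is a minimal sketch under simplifying assumptions that are not from the slides: only the forward strand is scanned, only ATG is treated as a start codon, and the function name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of ORFs on the forward strand.

    An ORF here runs from an ATG to the next in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons`
    counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```

A real gene finder would also scan the reverse complement and consider alternative start codons; this sketch only illustrates the reading-frame idea.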
Eukaryotic Genes
• Much more complex than prokaryotic genes.
• Large genomes (0.1 to 3 billion bases).
• A typical mammalian cell has about 1,500 times as much DNA as an E. coli cell.
• Low coding density (<30%) in many eukaryotes, and only 2-3% in humans.
Gene Structure in Eukaryotes
Data Mining
Development of new tools for data mining:
– Sequence alignment
– Genome sequencing
– Genome comparison
– Microarray data analysis
– Proteomics data analysis
– Small-molecule array analysis
What is a database?
• A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions.
• The software used to manage and query a database is known as a database management system (DBMS).
• The properties of database systems
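To make the database/DBMS distinction concrete, here is a small sketch using SQLite (through Python's standard `sqlite3` module) as the DBMS. The table layout, the three example rows and the helper `genes_on_chromosome` are illustrative assumptions, not the schema of any database named in this presentation.

```python
import sqlite3

# An in-memory database: the "collection of information".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT, chromosome TEXT)"
)
# Illustrative rows only.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("BRCA1", "Homo sapiens", "17"),
        ("TP53", "Homo sapiens", "17"),
        ("lacZ", "Escherichia coli", None),
    ],
)

def genes_on_chromosome(conn, organism, chromosome):
    """Ask the DBMS a question: which genes lie on this chromosome?"""
    rows = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? AND chromosome = ? "
        "ORDER BY symbol",
        (organism, chromosome),
    )
    return [row[0] for row in rows]

print(genes_on_chromosome(conn, "Homo sapiens", "17"))  # ['BRCA1', 'TP53']
```

The program never reads the stored data directly; it states the question in SQL and the DBMS works out how to answer it.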
Strategies and Adapted Tools for Gene Identification
• Find candidate genes for the trait (time and cost!)
  - What genes are there?
  - How are the genes expressed?
  - What do they do?
  - How could they play a role in the disease?
  - Gene synonyms
  - Gene location
DATA SOURCES
• PubMed
• Conserved Domain Database
• GeneAtlas
• dbSNP

Links to the above-mentioned databases:
Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
HomoloGene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
GeneAtlas: http://wombat.gnf.org/
dbSNP: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp
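The NCBI links above are web pages. For scripted access, NCBI also provides the Entrez E-utilities, and the sketch below builds an ESearch request URL for the same `db` names that appear in those links. It only constructs the URL and sends nothing; the helper name `esearch_url` and the search terms are hypothetical, and whether each database is still served today is not guaranteed.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Database names taken from the db= parameters in the links above.
KNOWN_DBS = {"gene", "pubmed", "cdd", "homologene", "snp"}

def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch query URL."""
    db = db.lower()
    if db not in KNOWN_DBS:
        raise ValueError(f"unexpected database name: {db}")
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query

print(esearch_url("PubMed", "gene prediction", retmax=5))
```

Fetching the URL would return a list of matching record identifiers, which a second E-utilities call can then retrieve; that step is left out here.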
Gene Databases
• Once a genome sequence is in place, it is desirable to study the regions that make a particular organism what it is.
• The most important of these are the genic regions of the organism, and several databases of genes and related structures exist.
• One such database is the RefSeq database, curated at NCBI.
Classification of Databases • Primary sequence databases • Secondary sequence databases • Genomic database resource
Secondary sequence databases
• UniGene: historically used for selecting sequences for microarrays.
• The TIGR Gene Indices: the gene indices at The Institute for Genomic Research (TIGR) are arranged by species. They cover 19 animal species, 18 plant species and 7 fungal species, and include full information about splice variants.
• RefSeq (NCBI Reference Sequence Project): RefSeq aims to collect sequences of genomes, complete chromosomes, genomic regions, mRNAs and other types of RNA.
Genomic Database Resource
• Ensembl - http://www.ensembl.org (19 species)
• UCSC Genome Browser - http://genome.ucsc.edu/ (28 species, including insects)
• NCBI MapViewer - http://www.ncbi.nlm.nih.gov/mapview/ (38 species, including plants and fungi)
Comparison of a Sequence against a Sequence Database
• The most commonly used programs for comparing an unknown sequence against the sequences in a database are BLAST and FASTA.
• BLAST and FASTA are faster, heuristic approximations of the Smith-Waterman local alignment algorithm.
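For reference, the Smith-Waterman recurrence that both programs approximate can be sketched in a few lines. This version returns only the best local alignment score and uses a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are illustrative assumptions, not parameters given in the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what
            # makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6 (the shared "ACG")
```

The full algorithm fills a table whose size is the product of the two sequence lengths, which is why heuristic shortcuts are used for database-scale searches.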
The FASTA algorithm
• Developed by Lipman and Pearson (1985)
• First program to search sequence databases for gapped local alignments
• The best-scoring local region is given as output
• It is an approximate, heuristic algorithm, so the pairwise similarity it reports may be sub-optimal.
• http://www-nbrf.georgetown.edu/pirwww/sear
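The heuristic behind FASTA starts by finding short exact word matches (k-tuples) shared by the two sequences and noting which diagonal of the comparison matrix they fall on. The sketch below covers only that first step, under the assumption of a tiny word size `k=2`; the function name `best_diagonal` is hypothetical, and the later rescoring and gapped-joining stages of FASTA are omitted.

```python
from collections import defaultdict

def best_diagonal(query, subject, k=2):
    """Return (offset, hits) for the diagonal with the most shared k-tuples.

    offset = position in subject minus position in query.
    Returns None when the sequences share no k-tuple.
    """
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Count word hits per diagonal while walking along the subject.
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1

    if not hits:
        return None
    # Most hits wins; ties go to the smallest offset for determinism.
    offset = min(hits, key=lambda d: (-hits[d], d))
    return offset, hits[offset]

print(best_diagonal("ACGTAC", "TTACGTAC"))  # (2, 5)
```

Only the regions around the densest diagonals are examined further, which is what makes the search fast and also why the result can be sub-optimal.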
BLAST
• Nucleotide query: BLASTn, MegaBLAST, BLASTx, tBLASTx
• Protein query: tBLASTn
• Conserved domains: RPS-BLAST, CDART
• Pairwise BLAST: 2BLAST
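Choosing among these programs comes down to the type of the query and the type of the database. The lookup below is a small sketch of that rule; the function `choose_blast` and its arguments are hypothetical, and `blastp` (protein query against a protein database) is included even though it does not appear in the list as extracted above.

```python
# (query type, database type) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}

def choose_blast(query_type, db_type, translate_both=False):
    """Pick a BLAST program name from the query and database types."""
    if translate_both:
        # tblastx compares translated query against translated database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]

print(choose_blast("nucleotide", "protein"))  # blastx
```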
What is BLAST?
• BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
• “Local” means it searches for and aligns sequence segments rather than aligning entire sequences. It can therefore detect relationships among sequences that share only isolated regions of similarity.
• Currently, it is the most popular and most widely accepted sequence analysis tool.
Why BLAST?
• Identify unknown sequences - The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
• Help gene/protein function and structure prediction – genes with similar sequences tend to share similar functions or structure.
• Identify protein family – group related (paralog or ortholog) genes and their proteins into a family.
• Prepare sequences for multiple alignments
Go to BLAST
Go to nucleotide BLAST
SOFTWARE FOR IDENTIFICATION OF GENES
• Some of the computational tools most commonly used for gene prediction:
  – GeneMark
  – GlimmerM
  – GRAIL
  – GenScan
  – Genebuilder
GeneMark
• Used for finding prokaryotic genes.
• This software employs inhomogeneous Markov models to classify DNA regions into protein-coding and non-coding sequences.
• Limitation: the query sequence must be longer than 100 kbp.
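The idea of an inhomogeneous Markov model can be illustrated with a toy classifier: the coding model keeps a separate transition table for each codon position, the non-coding model keeps one table, and a region is called coding when the coding model gives it the higher likelihood. This is a simplified sketch and not GeneMark's actual implementation; the first-order model, add-one smoothing, the tiny training strings and all function names are assumptions, and input is assumed to start at codon position 0.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train(seqs, periodic):
    """First-order transition probabilities with add-one smoothing.

    periodic=True keeps one table per codon position (inhomogeneous,
    3-periodic model); periodic=False shares a single table.
    """
    phases = 3 if periodic else 1
    counts = [defaultdict(lambda: 1.0) for _ in range(phases)]
    for s in seqs:
        for i in range(1, len(s)):
            counts[i % phases][s[i - 1] + s[i]] += 1
    model = []
    for table in counts:
        probs = {}
        for a in BASES:
            total = sum(table[a + b] for b in BASES)
            for b in BASES:
                probs[a + b] = table[a + b] / total
        model.append(probs)
    return model

def log_likelihood(seq, model):
    """Log-probability of the transitions in seq under the model."""
    phases = len(model)
    return sum(math.log(model[i % phases][seq[i - 1] + seq[i]])
               for i in range(1, len(seq)))

def looks_coding(seq, coding_model, noncoding_model):
    return log_likelihood(seq, coding_model) > log_likelihood(seq, noncoding_model)

# Toy training data, purely illustrative.
coding = train(["GCCGCCGCCGCCGCC"], periodic=True)
noncoding = train(["ATATATATATATAT"], periodic=False)
print(looks_coding("GCCGCCGCC", coding, noncoding))  # True
print(looks_coding("TATATATA", coding, noncoding))   # False
```

With realistic amounts of training sequence the per-position tables capture codon usage bias, which is the signal that separates coding from non-coding DNA.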
Glimmer
• Glimmer uses interpolated Markov models to identify coding regions and distinguish them from non-coding DNA.
• Glimmer is used as the primary gene-finding tool at TIGR.
• The computation consists of two steps: model building and gene prediction. Model building involves training on the input sequence, which optimizes the parameters of the model.
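An interpolated Markov model blends predictions from contexts of several lengths, trusting a long context only when it was seen often enough in training. The class below is a deliberately simplified sketch of that idea and not Glimmer's actual weighting scheme; the count-based weight, the `threshold` parameter and the class name are assumptions.

```python
from collections import defaultdict

class InterpolatedMarkovModel:
    def __init__(self, max_order=3, threshold=4):
        self.max_order = max_order
        # Hypothetical count at which a context is fully trusted.
        self.threshold = threshold
        # context string -> next base -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq):
        """Model building: count next bases for every context length."""
        for i in range(len(seq)):
            for k in range(self.max_order + 1):
                if i - k >= 0:
                    self.counts[seq[i - k:i]][seq[i]] += 1

    def prob(self, context, base):
        """P(base | context), interpolated from short to long contexts."""
        p = 0.25  # uniform fallback when nothing has been observed
        for k in range(min(self.max_order, len(context)) + 1):
            ctx = context[len(context) - k:]
            seen = self.counts.get(ctx)
            if not seen:
                continue
            total = sum(seen.values())
            weight = min(1.0, total / self.threshold)
            p = weight * seen.get(base, 0) / total + (1 - weight) * p
        return p

imm = InterpolatedMarkovModel()
imm.train("ACGACGACGACG")
print(imm.prob("AC", "G"))  # 1.0: "AC" was always followed by G
```

For a context never seen in training, such as `"TT"`, the estimate falls back to the order-0 base frequencies instead of failing.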
GRAIL
• Used for eukaryotes.
• This tool identifies exons, polyA sites, promoters, CpG islands, repetitive elements and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements.
• Based on a neural network algorithm.
GRAIL (cont.)
• The program scans the query sequence with windows of variable lengths.
• It scores each window for coding potential and finally outputs a list of exon candidates.
• The program is currently trained for human, mouse, Arabidopsis, Drosophila, and Escherichia coli sequences.
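The window-scanning step can be sketched as follows. The scoring function here is a plain GC fraction, used only as a stand-in: GRAIL's real score comes from a trained neural network, which is not reproduced. The window lengths, step size, cutoff and function names are all assumptions chosen for illustration.

```python
def gc_fraction(window):
    """Stand-in score (NOT GRAIL's neural network): fraction of G/C."""
    return sum(base in "GC" for base in window) / len(window)

def scan_windows(seq, window_lengths=(6, 9), step=3,
                 score=gc_fraction, cutoff=0.8):
    """Slide windows of several lengths along seq and keep high scorers.

    Returns sorted (start, end, score) tuples; end is exclusive.
    """
    candidates = []
    for length in window_lengths:
        for start in range(0, len(seq) - length + 1, step):
            s = score(seq[start:start + length])
            if s >= cutoff:
                candidates.append((start, start + length, round(s, 3)))
    return sorted(candidates)

print(scan_windows("ATATATGCGCGCATATAT"))  # [(6, 12, 1.0)]
```

Swapping `score` for a trained classifier keeps the scanning logic unchanged, which is the design point this sketch is meant to show.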
GenScan
• The program uses a probabilistic model of gene structure that is based on actual biological information about transcriptional, translational and splicing signals.
• Its high speed and accuracy make GenScan the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
• GenScan has been used as the principal tool for gene prediction in the international Human Genome Project.
GenScan (cont.)
• Makes predictions based on 5th-order HMMs.
• It combines hexamer frequencies with sequence signals (initiation codons, TATA box, cap site, polyA, etc.) in its predictions.
• Exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable.
• This program is trained for sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.
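Applying the P > 0.5 rule to a list of predicted exons is a simple filter. The sketch below assumes the predictions have already been parsed into records; the `ExonPrediction` class, its field names and the example coordinates and probabilities are hypothetical and do not reflect GenScan's actual output format.

```python
from dataclasses import dataclass

@dataclass
class ExonPrediction:
    start: int
    end: int
    probability: float  # P of being a true exon

def reliable_exons(predictions, cutoff=0.5):
    """Keep only exons whose probability is strictly above the cutoff."""
    return [p for p in predictions if p.probability > cutoff]

# Made-up predictions for illustration.
predictions = [
    ExonPrediction(100, 250, 0.91),
    ExonPrediction(400, 460, 0.32),
    ExonPrediction(900, 1010, 0.50),
]
print([(p.start, p.end) for p in reliable_exons(predictions)])  # [(100, 250)]
```

The comparison is strict, so an exon scored at exactly 0.5 is not kept.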
Genebuilder
• Genebuilder performs ab initio gene prediction using numerous parameters, such as GC content, di-codon frequencies, splice site data, CpG islands, repetitive elements and others.
• It also performs BLAST searches of predicted genes against protein and EST databases.
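Two of the parameters listed for Genebuilder, GC content and CpG islands, are easy to compute directly. The sketch below uses the conventional observed/expected CpG ratio; it is not Genebuilder's own code, and the function names are assumptions.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible without both bases
    return seq.count("CG") * len(seq) / (c * g)

print(gc_content("CGCGCGCG"))   # 1.0
print(cpg_obs_exp("CGCGCGCG"))  # 2.0
```

In practice these values are computed over a sliding window, and a region is flagged as a candidate CpG island when both stay above chosen thresholds.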