Databases in Bioinformatics (Roald Forsberg)
Overview
• The role of databases in bioinformatics
• The structure of databases
– Relational databases
– Database Management Systems
– Accessing databases
• Types of databases
– Data types
– Integrated databases (Entrez)
• Nucleotide sequence formats
– FASTA format
– GenBank format
– XML formats
Databases in Bioinformatics
Bioinformatics – attempted definition:
“The application of computational techniques to understand and organise the information associated with biological macromolecules”
Adapted from Oxford English Dictionary
Biological experiments
Computational Biology
Databases
Ask your neighbour
• What would you like to do with a database?
• Which types of biological information could be stored in a database?
Use of databases
• Homology searching:
– Use of knowledge from other, often better-described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans etc.
– Sequence level – position, annotation
– Structural level – proteins, RNA
• Evolutionary analyses:
– Phylogenetics
– Population genetics
– Molecular evolution of genetic elements
– Genome evolution
• Primer design
• Microarray design
• Drug design
• Many more…
General types of databases
• Primary
– Raw, unprocessed data
• Secondary
– Curated – data selected according to criteria
– E.g. non-redundancy, fold
• Tertiary
– Processed data, e.g. HMM profiles
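As a rough illustration of how a secondary set can be derived from primary data by a selection criterion, the sketch below applies the simplest possible non-redundancy rule: keep only the first entry for each distinct sequence. The identifiers and sequences are made up for the example, and real non-redundant databases use more elaborate criteria than exact sequence identity.

```python
# Hypothetical primary entries: (identifier, sequence) pairs as submitted.
primary = [
    ("entry1", "MKTAYIAKQR"),
    ("entry2", "MKTAYIAKQR"),  # same sequence as entry1
    ("entry3", "MSLLTEVETY"),
]


def non_redundant(entries):
    """Keep the first entry seen for each distinct sequence."""
    seen = set()
    kept = []
    for identifier, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            kept.append((identifier, sequence))
    return kept


secondary = non_redundant(primary)
print(secondary)
```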
Structure of relational databases
Table 1
• Table = genetic element
• Entries: MEQ147631, MEQ147632, MEQ147633, MEQ147634, MEQ147635, MEQ147636, MEQ147637, MEQ147638, MEQ147639, MEQ147640, MEQ147641
• Fields of an entry:
– Field = position = chr. 4
– Field = size = 3540 bp
– Field = coding = true
– Field = known EST = true
Table 2
• Field = known structure = false
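The table, entry and field layout can be tried out with Python's built-in sqlite3 module. This is only a sketch under assumptions: the table and column names are invented, the field values are attached to the first entry ID for illustration, and linking Table 2 to Table 1 through the entry ID is a guess at what the figure intends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Table 1: one row per genetic element, one column per field.
conn.execute(
    """CREATE TABLE genetic_element (
           entry_id   TEXT PRIMARY KEY,
           position   TEXT,
           size_bp    INTEGER,
           coding     INTEGER,   -- 1 = true, 0 = false
           known_est  INTEGER
       )"""
)
# Table 2 (hypothetical name): extra field, linked by entry ID.
conn.execute(
    """CREATE TABLE structure_info (
           entry_id        TEXT REFERENCES genetic_element(entry_id),
           known_structure INTEGER
       )"""
)

conn.execute(
    "INSERT INTO genetic_element VALUES (?, ?, ?, ?, ?)",
    ("MEQ147631", "chr. 4", 3540, 1, 1),
)
conn.execute("INSERT INTO structure_info VALUES (?, ?)", ("MEQ147631", 0))

# Combine fields from both tables for the same entry.
row = conn.execute(
    """SELECT g.entry_id, g.position, g.size_bp, s.known_structure
       FROM genetic_element AS g
       JOIN structure_info AS s ON s.entry_id = g.entry_id"""
).fetchone()
print(row)
```

The join is what makes the layout relational: fields stored in different tables are combined through a shared key.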
Structure of relational databases
[Diagram: layers of a database system]
• Interface (Web): browser input and scripts send queries to the DBMS; results from the DBMS come back as browser output and result files
• Terminal: terminal input and scripts send queries to the DBMS; results from the DBMS come back as terminal output and stored results
• Database management system: DBMS software and the SQL language (Structured Query Language); holds the structure of the data and sends queries to the data
• Database files: the underlying files
Database management systems
• A software package designed to store and manage databases.
• A computerized record-keeping system
• Allows operations such as:
– Adding new files
– Inserting data into existing files
– Retrieving data from existing files
– Changing data
– Deleting data
– Removing existing files from the database
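The six operations map fairly directly onto SQL statements. The sketch below uses Python's built-in sqlite3 module and assumes that a "file" corresponds to a table in a relational DBMS. The table name, entry IDs and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Adding a new "file" (table)
cur.execute("CREATE TABLE est (entry_id TEXT PRIMARY KEY, length_bp INTEGER)")

# Inserting data
cur.executemany("INSERT INTO est VALUES (?, ?)", [("EST1", 420), ("EST2", 515)])

# Retrieving data
cur.execute("SELECT length_bp FROM est WHERE entry_id = ?", ("EST1",))
before = cur.fetchone()[0]

# Changing data
cur.execute("UPDATE est SET length_bp = ? WHERE entry_id = ?", (431, "EST1"))

# Deleting data
cur.execute("DELETE FROM est WHERE entry_id = ?", ("EST2",))
remaining = cur.execute("SELECT entry_id, length_bp FROM est").fetchall()

# Removing the table from the database
cur.execute("DROP TABLE est")
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(before, remaining, tables)
```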
Accessing a database
• Web – graphical user interface (GUI)
• Web – automated procedures
– Batch search with script (Entrez)
– Search robots with e-mail updates
• Local
– Buy a big computer and a thick cable
– Speed improvement
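A batch search with a script might look roughly like the sketch below. It assumes NCBI's E-utilities `esearch` endpoint as the scripted route into Entrez, which the slides do not name. The search terms and the helper `build_search_urls` are made up, and the code only builds the request URLs without sending anything.

```python
from urllib.parse import urlencode

# Assumed endpoint: the E-utilities search service for Entrez databases.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_urls(terms, db="nucleotide", retmax=20):
    """Return one search URL per query term (no network access here)."""
    urls = []
    for term in terms:
        query = urlencode({"db": db, "term": term, "retmax": retmax})
        urls.append(f"{EUTILS_ESEARCH}?{query}")
    return urls


# Hypothetical batch of query terms.
for url in build_search_urls(["HIV-1 envelope", "Fugu rubripes"]):
    print(url)
```

Each URL could then be fetched with `urllib.request.urlopen`, pausing between requests so the batch stays within the provider's usage guidelines.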
Protein sequence databases
Database: URL
Protein sequence (primary)
• SWISS-PROT: www.expasy.ch/sprot/sprot-top.html
• PIR-International: www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
• OWL: www.bioinf.man.ac.uk/dbbrowser/OWL
• NRDB: www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Protein sequence (secondary)
• PROSITE: www.expasy.ch/prosite
• PRINTS: www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
• Pfam: www.sanger.ac.uk/Pfam/
Nucleotide sequence databases
• GenBank: www.ncbi.nlm.nih.gov/Genbank
• EMBL: www.ebi.ac.uk/embl
• DDBJ: www.ddbj.nig.ac.jp
Types of nucleotide data
• cDNA
– Reverse-transcribed from mRNA
• Genomic sequences
– Sequenced directly from the DNA of various species
• ESTs
– A small portion of an entire gene, derived from mRNA
Macromolecular structure databases
• Protein Data Bank (PDB): www.rcsb.org/pdb
• Nucleic Acids Database (NDB): http://ndbserver.rutgers.edu//
• PDBsum: www.biochem.ucl.ac.uk/bsm/pdbsum
• CATH: www.biochem.ucl.ac.uk/bsm/cath
• SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
• FSSP: www.embl-ebi.ac.uk/dali/fssp
Molecular interaction databases
• General
– Biomolecular Interaction Network Database: http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman
– Molecular interactions Database (MINT): http://cbm.bio.uniroma2.it/mint/
• Protein-Protein interactions
– Database of interacting proteins: http://dip.doe-mbi.ucla.edu/
• Biochemical pathways
– KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/metabolism.html
Proteomics databases
• Yeast Proteome Database: http://www.incyte.com/sequence/proteome/databases/YPD.shtml
• SWISS-2DPAGE: http://us.expasy.org/ch2d/
• TMIG-2DPAGE: http://proteome.tmig.or.jp/2D/
Genome databases
• Entrez genomes: www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
• Ensembl genomes: http://www.ensembl.org/
• HIV Sequence Database: http://hiv-web.lanl.gov/content/hiv-db/mainpage.html
• FlyBase: http://flybase.bio.indiana.edu/
• COGs: www.ncbi.nlm.nih.gov/COG
Integrated databases
Increasing the value of information
• InterPro: www.ebi.ac.uk/interpro
• Sequence retrieval system (SRS): www.expasy.ch/srs5
• Entrez: www.ncbi.nlm.nih.gov/Entrez
The (ever) Expanding Entrez System
[Diagram: Entrez at the centre, linked to the following databases]
UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books
EBI services
• http://www.ebi.ac.uk/services/index.html
The International Sequence Database Collaboration
[Diagram: three partner databases, each receiving its own submissions and updates]
• GenBank: NCBI (NIH), accessed through Entrez
• EMBL: EBI (EMBL), accessed through SRS
• DDBJ: CIB (NIG), accessed through getentry
A closer look at GenBank
• Maintained by NCBI
• Accessed through Entrez
• Synchronized with DDBJ and EMBL
Sequence file formats
• Ideally – a stringent, easy to parse, specified format to facilitate the dissemination of information
• Reality – a plethora of coincidental and badly specified formats
• Different levels of information
• Some common formats
– FASTA
– GenBank
– PHYLIP (PHYLIP package and others)
– Nexus (PAUP package, MacClade and others)
– Up and coming: XML
• Simple – sequence and name attribute
• Advanced – several attributes
FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
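Because a FASTA record is just a ">" header line followed by sequence lines, a parser fits in a few lines. The sketch below is a minimal reader, not a full implementation: it does not validate the residue alphabet, and the function name `parse_fasta` is made up for the example. The sample text reuses the header and the first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```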
GenBank format
• A verbose but very informative format
• Contains much information in a carefully specified format
• Harder to parse than FASTA
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
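To give a feel for why GenBank is harder to parse than FASTA, the sketch below pulls only a few fields out of a flat-file record: keyword lines at the start of a line, and the sequence between ORIGIN and the terminating "//". The record is a made-up, heavily shortened example (the accession X00000 is not a real entry), and the parser ignores multi-line definitions and the whole feature table. For real work an established library such as Biopython is a safer choice.

```python
def parse_genbank(text):
    """Extract locus, definition, accession and sequence from one record."""
    record = {"locus": None, "definition": None, "accession": None}
    in_origin = False
    seq_parts = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            seq_parts.extend(line.split()[1:])
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(seq_parts)
    return record


# Hypothetical, shortened record for illustration only.
example = """LOCUS       EXAMPLE01                 24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   X00000
ORIGIN
        1 gatcctccat atacaacggt atct
//
"""

print(parse_genbank(example))
```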
eXtensible Markup Language (XML)
• Markup language for data representation – derived from SGML, a sibling of HTML
• Stringent, simple language with rigid rules
• Human-readable and versatile
• Good parsers exist for multiple platforms
• The ability to design your own Document Type Definitions, which parsers can use to validate a document, permits complex data structures and grammars
• Examples of use for sequence data:
– NCBI GBSeqXML
– NCBI TinySeqXML
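With a standard XML parser, reading sequence data takes very little code. The sketch below uses Python's built-in xml.etree.ElementTree on a small document modelled on NCBI's TinySeq XML. The element names are given from memory and should be checked against the DTD, the record content is invented, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the style of TinySeq XML.
example = """<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="protein"/>
    <TSeq_accver>X00000.1</TSeq_accver>
    <TSeq_defline>hypothetical example protein</TSeq_defline>
    <TSeq_length>10</TSeq_length>
    <TSeq_sequence>MKTAYIAKQR</TSeq_sequence>
  </TSeq>
</TSeqSet>"""


def parse_tinyseq(xml_text):
    """Return one dict per <TSeq> element found in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for tseq in root.iter("TSeq"):
        records.append({
            "defline": tseq.findtext("TSeq_defline"),
            "length": int(tseq.findtext("TSeq_length")),
            "sequence": tseq.findtext("TSeq_sequence"),
        })
    return records


for rec in parse_tinyseq(example):
    print(rec["defline"], rec["length"], rec["sequence"])
```

Compared with the flat-file formats, none of the parsing logic here depends on column positions or line breaks. The parser handles the syntax and the script only names the elements it wants.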
Links
• http://www.ncbi.nlm.nih.gov/Education/
• http://www.infobiogen.fr/services/dbcat/
• http://www.science.gmu.edu/~ntongvic/Bioinformatics/database.html
• http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
• http://www.no.embnet.org/Programs/DB/srs.php3