Biological Sequence Databases 2

  • November 2019
  • PDF

Databases in Bioinformatics (Roald Forsberg)

Overview

• The role of databases in bioinformatics
• The structure of databases
  – Relational databases
  – Database management systems
  – Accessing databases
• Types of databases
  – Data types
  – Integrated databases (Entrez)
• Nucleotide sequence formats
  – FASTA format
  – GenBank format
  – XML formats

Databases in Bioinformatics

Bioinformatics – attempted definition: “The application of computational techniques to understand and organise the information associated with biological macromolecules” (adapted from the Oxford English Dictionary)

[Diagram: Biological experiments, Computational Biology, Databases]

Ask your neighbour

• What would you like to do with a database?
• Which types of biological information could be stored in a database?

Use of databases

• Homology searching:
  – Use of knowledge from other, often better-described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans, etc.
  – Sequence level – position, annotation
  – Structural level – proteins, RNA
• Evolutionary analyses:
  – Phylogenetics
  – Population genetics
  – Molecular evolution of genetic elements
  – Genome evolution
• Primer design
• Microarray design
• Drug design
• Many more…

General types of databases

• Primary
  – Raw and non-processed data
• Secondary
  – Curated – data chosen according to criteria
  – E.g. non-redundancy, fold
• Tertiary
  – Data processed
  – HMM profile

Structure of relational databases

Table 1 (Table = genetic element)
  Entries: MEQ147631, MEQ147632, MEQ147633, MEQ147634, MEQ147635, MEQ147636, MEQ147637, MEQ147638, MEQ147639, MEQ147640, MEQ147641
  Fields of one entry:
  – Field = position = chr. 4
  – Field = size = 3540 bp
  – Field = coding = true
  – Field = known EST = true

Table 2
  – Field = known structure = false
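
The two-table layout above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (genetic_element, structure_info, entry_id and so on) are assumptions, and so is attaching the slide's field values to entry MEQ147631. Only the values themselves come from the slide.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables linked by a shared entry identifier (names are hypothetical).
conn.executescript("""
CREATE TABLE genetic_element (
    entry_id  TEXT PRIMARY KEY,
    position  TEXT,
    size_bp   INTEGER,
    coding    INTEGER,   -- 1 = true, 0 = false
    known_est INTEGER
);
CREATE TABLE structure_info (
    entry_id        TEXT REFERENCES genetic_element(entry_id),
    known_structure INTEGER
);
""")

# Field values taken from the slide; the pairing with this entry is assumed.
conn.execute(
    "INSERT INTO genetic_element VALUES (?, ?, ?, ?, ?)",
    ("MEQ147631", "chr. 4", 3540, 1, 1),
)
conn.execute("INSERT INTO structure_info VALUES (?, ?)", ("MEQ147631", 0))

# A query that combines both tables through the shared key.
row = conn.execute("""
    SELECT g.entry_id, g.position, g.size_bp, s.known_structure
    FROM genetic_element AS g
    JOIN structure_info AS s ON s.entry_id = g.entry_id
    WHERE g.coding = 1
""").fetchone()
print(row)
```

Splitting the fields over two tables keeps each table focused on one kind of information, and the shared key lets a query recombine them when needed.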

Structure of relational databases

[Diagram: how queries reach the data]
• Interface (WEB): browser input and scripts send queries to the DBMS; results from the DBMS come back as browser output and result files
• Terminal: terminal input and scripts send queries to the DBMS; results from the DBMS come back as terminal output and stored results
• Database management system: DBMS software, SQL language (Structured Query Language), structure of data
• Database files: the DBMS passes queries to the data, which is stored in multiple files

Database management systems

• A software package designed to store and manage databases.
• A computerized record-keeping system
• Allows operations such as:
  – Adding new files
  – Inserting data into existing files
  – Retrieving data from existing files
  – Changing data
  – Deleting data
  – Removing existing files from the database
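
As a rough illustration, the six operations listed above map onto SQL statements as follows. The sketch uses Python's sqlite3 module and treats a "file" as a table, which is an interpretation rather than something the slide states. The table name est and its two rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Adding a new file (here: a new table; name and columns are hypothetical).
cur.execute("CREATE TABLE est (entry_id TEXT PRIMARY KEY, length_bp INTEGER)")

# 2. Inserting data into an existing file.
cur.executemany(
    "INSERT INTO est VALUES (?, ?)",
    [("EST0001", 412), ("EST0002", 388)],  # made-up rows
)

# 3. Retrieving data from an existing file.
long_ests = cur.execute(
    "SELECT entry_id FROM est WHERE length_bp > 400"
).fetchall()

# 4. Changing data.
cur.execute("UPDATE est SET length_bp = 390 WHERE entry_id = 'EST0002'")

# 5. Deleting data.
cur.execute("DELETE FROM est WHERE entry_id = 'EST0001'")
remaining = cur.execute("SELECT entry_id, length_bp FROM est").fetchall()

# 6. Removing an existing file from the database.
cur.execute("DROP TABLE est")
tables_left = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(long_ests, remaining, tables_left)
```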

Accessing a database

• WEB – graphical user interface (GUI)
• WEB – automated procedures
  – Batch search with script (Entrez)
  – Search robots with e-mail updates
• Local
  – Buy a big computer and a thick cable
  – Speed improvement

Protein sequence databases

Database                        URL

Protein sequence (primary)
  SWISS-PROT                    www.expasy.ch/sprot/sprot-top.html
  PIR-International             www.mips.biochem.mpg.de/proj/protseqdb

Protein sequence (composite)
  OWL                           www.bioinf.man.ac.uk/dbbrowser/OWL
  NRDB                          www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Protein sequence (secondary)
  PROSITE                       www.expasy.ch/prosite
  PRINTS                        www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
  Pfam                          www.sanger.ac.uk/Pfam/

Nucleotide sequence databases

• GenBank – www.ncbi.nlm.nih.gov/Genbank
• EMBL – www.ebi.ac.uk/embl
• DDBJ – www.ddbj.nig.ac.jp

Types of nucleotide data

• cDNA
  – Reverse-transcribed from mRNA
• Genomic sequences
  – Sequenced directly from the DNA of various species
• ESTs (expressed sequence tags)
  – A short portion of an entire gene, derived from mRNA

Macromolecular structure databases

• Protein Data Bank (PDB) – www.rcsb.org/pdb
• Nucleic Acids Database (NDB) – http://ndbserver.rutgers.edu//
• PDBsum – www.biochem.ucl.ac.uk/bsm/pdbsum
• CATH – www.biochem.ucl.ac.uk/bsm/cath
• SCOP – http://scop.mrc-lmb.cam.ac.uk/scop/
• FSSP – www.embl-ebi.ac.uk/dali/fssp

Molecular interaction databases

• General
  – Biomolecular Interaction Network Database: http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman
  – Molecular interactions Database (MINT): http://cbm.bio.uniroma2.it/mint/
• Protein-Protein interactions
  – Database of interacting proteins: http://dip.doe-mbi.ucla.edu/
• Biochemical pathways
  – KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/metabolism.html

Proteomics databases

• Yeast Proteome Database – http://www.incyte.com/sequence/proteome/databases/YPD.shtml
• SWISS-2DPAGE – http://us.expasy.org/ch2d/
• TMIG-2DPAGE – http://proteome.tmig.or.jp/2D/

Genome databases

• Entrez genomes – www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
• Ensembl genomes – http://www.ensembl.org/
• HIV Sequence Database – http://hiv-web.lanl.gov/content/hiv-db/mainpage.html
• FlyBase – http://flybase.bio.indiana.edu/
• COGs – www.ncbi.nlm.nih.gov/COG

Integrated databases

Increasing the value of information

• InterPro – www.ebi.ac.uk/interpro
• Sequence retrieval system (SRS) – www.expasy.ch/srs5
• Entrez – www.ncbi.nlm.nih.gov/Entrez

The (ever) Expanding Entrez System

[Diagram: Entrez at the centre, linked to the following databases]
UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books

EBI services

• http://www.ebi.ac.uk/services/index.html

The International Sequence Database Collaboration

[Diagram: three partner databases, each receiving submissions and updates]
• GenBank – NCBI (NIH), accessed through Entrez
• EMBL – EBI, accessed through SRS
• DDBJ – CIB (NIG), accessed through getentry

A closer look at GenBank

• Maintained by NCBI
• Accessed through Entrez
• Synchronized with DDBJ and EMBL

Sequence file formats

• Ideally – a stringent, easy to parse, specified format to facilitate the dissemination of information
• Reality – a plethora of coincidental and badly specified formats
• Different levels of information
  – Simple – sequence and name attribute
  – Advanced – several attributes
• Some common formats
  – FASTA
  – GenBank
  – PHYLIP (PHYLIP package and others)
  – Nexus (PAUP package, MacClade and others)
  – Up and coming: XML

FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
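
Because a FASTA record is just a ">" header line followed by sequence lines, a parser can be very small. The sketch below is one possible minimal reader, not a reference implementation. The function name parse_fasta is made up, and the example uses only the header and first two sequence lines of the record shown on the slide.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


# Truncated copy of the slide's record (first two sequence lines only).
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
"""

records = list(parse_fasta(example.splitlines()))
header, seq = records[0]
print(header)
print(len(seq), "residues")
```

Writing the parser as a generator means that large files can be processed one record at a time instead of being loaded into memory at once.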

GenBank format

• A verbose but very informative format
• Contains much information in carefully specified format
• Harder to parse than FASTA

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
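
To give a feel for why GenBank is harder to parse than FASTA, here is a deliberately minimal sketch that reads only the top-level keywords and the sequence after ORIGIN. It assumes the usual flat-file layout: keywords start in the first column, continuation lines are indented, and "//" ends the record. It does not interpret the feature table. The function name parse_genbank_record and the toy record are invented for illustration and are not real GenBank data.

```python
def parse_genbank_record(text):
    """Return (fields, sequence) for one GenBank-style flat-file record."""
    fields = {}
    seq_chunks = []
    current = None       # keyword whose value is being read
    in_origin = False    # True once the sequence block has started
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry a position number and spaced blocks;
            # keep only the letters.
            seq_chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # A top-level keyword starts in the first column.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Indented continuation of the previous keyword's value.
            fields[current] += " " + line.strip()
    return fields, "".join(seq_chunks)


# Hypothetical toy record, made up for this example.
TOY_RECORD = """LOCUS       TOY0001   30 bp   DNA
DEFINITION  Hypothetical toy record for illustration
            only.
ACCESSION   TOY0001
ORIGIN
        1 atggcgtacg ttagcctagg ctaatcgtaa
//
"""

fields, sequence = parse_genbank_record(TOY_RECORD)
print(fields["ACCESSION"], fields["DEFINITION"])
print(sequence)
```

Even this reduced version needs state (the current keyword, whether ORIGIN has been reached) that a FASTA reader does not. Real records add a nested feature table on top of that.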

eXtensible Markup Language (XML)

• Markup language for data representation – derived from SGML, sibling of HTML
• Stringent, simple language with rigid rules
• Human readable and versatile
• Good parsers exist for multiple platforms
• The ability to design your own Document Type Definitions, which parsers can use to validate a document, permits complex data structures and grammars
• Examples of use for sequence data:
  – NCBI GBSeqXML
  – NCBI TinySeqXML
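
With a standard XML parser, reading sequence data takes only a few lines. The sketch below uses Python's xml.etree.ElementTree on a small hand-written document. The element names (TSeqSet, TSeq, TSeq_accver, TSeq_defline, TSeq_length, TSeq_sequence) are modelled on NCBI's TinySeq XML as an assumption and should be checked against the published DTD. The record itself is a made-up toy.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; element names assumed to follow TinySeq XML.
TOY_XML = """<TSeqSet>
  <TSeq>
    <TSeq_accver>TOY0001.1</TSeq_accver>
    <TSeq_defline>Hypothetical toy record</TSeq_defline>
    <TSeq_length>30</TSeq_length>
    <TSeq_sequence>ATGGCGTACGTTAGCCTAGGCTAATCGTAA</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

root = ET.fromstring(TOY_XML)

# Each <TSeq> element becomes one dictionary with named attributes.
records = [
    {
        "accession": tseq.findtext("TSeq_accver"),
        "defline": tseq.findtext("TSeq_defline"),
        "length": int(tseq.findtext("TSeq_length")),
        "sequence": tseq.findtext("TSeq_sequence"),
    }
    for tseq in root.iter("TSeq")
]

for rec in records:
    # A simple consistency check that the explicit markup makes possible.
    assert rec["length"] == len(rec["sequence"])
    print(rec["accession"], rec["defline"], rec["length"])
```

Note that ElementTree checks well-formedness only. Validating a document against a DTD needs a validating parser, which is outside this sketch.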

Links

• http://www.ncbi.nlm.nih.gov/Education/
• http://www.infobiogen.fr/services/dbcat/
• http://www.science.gmu.edu/~ntongvic/Bioinformatics/database.html
• http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
• http://www.no.embnet.org/Programs/DB/srs.php3
