Biological Sequence Databases 2

November 2019

Databases in Bioinformatics (Roald Forsberg)

Overview
• The role of databases in bioinformatics
• The structure of databases
  – Relational databases
  – Database Management Systems
  – Accessing databases
• Types of databases
  – Data types
  – Integrated databases (Entrez)
• Nucleotide sequence formats
  – FASTA format
  – GenBank format
  – XML formats

Databases in Bioinformatics
Bioinformatics – attempted definition:
“The application of computational techniques to understand and organise the information associated with biological macromolecules”
Adapted from the Oxford English Dictionary

[Figure: Biological experiments, Computational Biology, Databases]

Ask your neighbour
• What would you like to do with a database?
• Which types of biological information could be stored in a database?

Use of databases
• Homology searching:
  – Use of knowledge from other, often better described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans etc.
  – Sequence level – position, annotation
  – Structural level – proteins, RNA
• Evolutionary analyses:
  – Phylogenetics
  – Population genetics
  – Molecular evolution of genetic elements
  – Genome evolution
• Primer design
• Microarray design
• Drug design
• Many more…

General types of databases
• Primary
  – Raw and non-processed data
• Secondary
  – Curated – data chosen according to criteria
  – E.g. non-redundancy, fold
• Tertiary
  – Processed data
  – E.g. HMM profiles

Structure of relational databases
Table 1 (table = genetic element)
• Entries: MEQ147631, MEQ147632, MEQ147633, MEQ147634, MEQ147635, MEQ147636, MEQ147637, MEQ147638, MEQ147639, MEQ147640, MEQ147641
• Fields of one entry:
  – position = chr. 4
  – size = 3540 bp
  – coding = true
  – known EST = true
Table 2
• Fields of one entry:
  – known structure = false
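The two-table layout can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (`genetic_element`, `structure_info`, `entry_id` and so on) are hypothetical, and the slide does not say which entry the field values belong to, so MEQ147631 is picked arbitrarily.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genetic_element (      -- "Table 1"
    entry_id   TEXT PRIMARY KEY,
    position   TEXT,
    size_bp    INTEGER,
    coding     INTEGER,             -- 1 = true, 0 = false
    known_est  INTEGER
);
CREATE TABLE structure_info (       -- "Table 2"
    entry_id        TEXT REFERENCES genetic_element(entry_id),
    known_structure INTEGER
);
""")

# Field values from the slide, attached to an arbitrarily chosen entry.
conn.execute("INSERT INTO genetic_element VALUES (?, ?, ?, ?, ?)",
             ("MEQ147631", "chr. 4", 3540, 1, 1))
conn.execute("INSERT INTO structure_info VALUES (?, ?)", ("MEQ147631", 0))

# The shared entry_id relates the two tables: a join combines fields
# that are stored separately.
row = conn.execute("""
    SELECT g.entry_id, g.position, g.size_bp, s.known_structure
    FROM genetic_element AS g
    JOIN structure_info AS s ON s.entry_id = g.entry_id
""").fetchone()
print(row)
```

Keeping structural information in its own table means it can be filled in later, or left out for entries without a known structure, without touching the main table.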

Structure of relational databases
[Figure: data flow around the database management system (DBMS)]
• Interface (WEB): browser input scripts send queries to the DBMS; results from the DBMS return as browser output and result files
• Terminal: terminal input scripts send queries to the DBMS; results from the DBMS return as terminal output and stored results
• DBMS software with the SQL language (Structured Query Language): holds the structure of the data and sends queries to the data
• Database files: the files in which the data are stored

Database management systems
• A software package designed to store and manage databases
• A computerized record-keeping system
• Allows operations such as:
  – Adding new files
  – Inserting data into existing files
  – Retrieving data from existing files
  – Changing data
  – Deleting data
  – Removing existing files from the database
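If "files" are read as tables, each operation in the list maps onto one SQL statement. The following is a minimal sketch using Python's sqlite3 module as the DBMS; the table name `est`, its columns, and the size values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Adding a new "file" (table)
cur.execute("CREATE TABLE est (entry_id TEXT PRIMARY KEY, size_bp INTEGER)")

# Inserting data into an existing table
cur.executemany("INSERT INTO est VALUES (?, ?)",
                [("MEQ147631", 3540), ("MEQ147632", 410)])

# Retrieving data from an existing table
sizes = dict(cur.execute("SELECT entry_id, size_bp FROM est"))

# Changing data
cur.execute("UPDATE est SET size_bp = ? WHERE entry_id = ?", (420, "MEQ147632"))

# Deleting data
cur.execute("DELETE FROM est WHERE entry_id = ?", ("MEQ147631",))
remaining = cur.execute("SELECT entry_id, size_bp FROM est").fetchall()

# Removing an existing table from the database
cur.execute("DROP TABLE est")
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```

The `?` placeholders let the DBMS handle quoting of values, which matters as soon as queries are built from user input such as a web form.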

Accessing a database
• WEB – graphical user interface (GUI)
• WEB – automated procedures
  – Batch search with script (Entrez)
  – Search robots with e-mail updates
• Local
  – Buy a big computer and a thick cable
  – Speed improvement
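A scripted batch search against Entrez could look roughly like the sketch below. It assumes NCBI's E-utilities, the programmatic interface to Entrez, which the slides do not name; the helper functions and the example queries are hypothetical. The code only builds the request URLs, so it runs without network access.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption; not given in the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns the IDs of records matching a query."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def efetch_url(db, ids, rettype="fasta"):
    """Build an EFetch URL that downloads the given records as plain text."""
    params = {"db": db, "id": ",".join(ids),
              "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

# Hypothetical batch of search terms. A real script would fetch each URL
# (for example with urllib.request.urlopen) and pause between requests.
queries = ["HIV-1[Organism] AND env[Gene]", "Fugu[Organism]"]
urls = [esearch_url("nucleotide", q, retmax=5) for q in queries]
for u in urls:
    print(u)
```

The two-step pattern (search for IDs, then fetch the records) keeps each request small; NCBI's usage guidelines on request rates should be checked before running such a loop at scale.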

Protein sequence databases

Database                        URL
Protein sequence (primary)
  SWISS-PROT                    www.expasy.ch/sprot/sprot-top.html
  PIR-International             www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
  OWL                           www.bioinf.man.ac.uk/dbbrowser/OWL
  NRDB                          www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Protein sequence (secondary)
  PROSITE                       www.expasy.ch/prosite
  PRINTS                        www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
  Pfam                          www.sanger.ac.uk/Pfam/

Nucleotide sequence databases
• GenBank    www.ncbi.nlm.nih.gov/Genbank
• EMBL       www.ebi.ac.uk/embl
• DDBJ       www.ddbj.nig.ac.jp

Types of nucleotide data
• cDNA
  – Reverse-transcribed from mRNA
• Genomic sequences
  – Sequenced directly from the DNA of various species
• ESTs (expressed sequence tags)
  – A tiny portion of an entire gene, derived from mRNA

Macromolecular structure databases
• Protein Data Bank (PDB)         www.rcsb.org/pdb
• Nucleic Acid Database (NDB)     http://ndbserver.rutgers.edu//
• PDBsum                          www.biochem.ucl.ac.uk/bsm/pdbsum
• CATH                            www.biochem.ucl.ac.uk/bsm/cath
• SCOP                            http://scop.mrc-lmb.cam.ac.uk/scop/
• FSSP                            www.embl-ebi.ac.uk/dali/fssp

Molecular interaction databases
• General
  – Biomolecular Interaction Network Database (BIND)
    http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman
  – Molecular Interactions Database (MINT)
    http://cbm.bio.uniroma2.it/mint/
• Protein-protein interactions
  – Database of Interacting Proteins (DIP)
    http://dip.doe-mbi.ucla.edu/
• Biochemical pathways
  – KEGG Metabolic Pathways
    http://www.genome.ad.jp/kegg/metabolism.html

Proteomics databases
• Yeast Proteome Database
  http://www.incyte.com/sequence/proteome/databases/YPD.shtml
• SWISS-2DPAGE
  http://us.expasy.org/ch2d/
• TMIG-2DPAGE
  http://proteome.tmig.or.jp/2D/

Genome databases
• Entrez genomes            www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
• Ensembl genomes           http://www.ensembl.org/
• HIV Sequence Database     http://hiv-web.lanl.gov/content/hiv-db/mainpage.html
• FlyBase                   http://flybase.bio.indiana.edu/
• COGs                      www.ncbi.nlm.nih.gov/COG

Integrated databases
Increasing the value of information
• InterPro                           www.ebi.ac.uk/interpro
• Sequence retrieval system (SRS)    www.expasy.ch/srs5
• Entrez                             www.ncbi.nlm.nih.gov/Entrez

The (ever) Expanding Entrez System
[Figure: Entrez at the centre, linked to the following databases]
UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books

EBI services
• http://www.ebi.ac.uk/services/index.html

The International Sequence Database Collaboration
[Figure: three databases that exchange data, each receiving its own submissions and updates]
• GenBank – NCBI (NIH), accessed through Entrez
• EMBL – EBI (EMBL), accessed through SRS
• DDBJ – CIB (NIG), accessed through getentry

A closer look at GenBank
• Maintained by NCBI
• Accessed through Entrez
• Synchronized with DDBJ and EMBL

Sequence file formats
• Ideally – a stringent, easy to parse, specified format to facilitate the dissemination of information
• Reality – a plethora of coincidental and badly specified formats
• Different levels of information
  – Simple – sequence and name attribute
  – Advanced – several attributes
• Some common formats
  – FASTA
  – GenBank
  – PHYLIP (PHYLIP package and others)
  – Nexus (PAUP package, MacClade and others)
  – Up and coming: XML

FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
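Because a FASTA record is just a ">" header line followed by sequence lines, a reader fits in a few lines of Python. The sketch below is one possible implementation, not a reference parser; the function name `parse_fasta` is made up, and the example text is the first two sequence lines of the record shown on the slide, so the sequence is deliberately truncated.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        yield header, "".join(chunks)

# Truncated copy of the slide's record, for illustration only.
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
"""
records = list(parse_fasta(example))
header, seq = records[0]

# By convention the identifier runs up to the first space of the header.
identifier, _, description = header.partition(" ")
print(identifier, description, len(seq))
```

For real work, an established library such as Biopython's `SeqIO` module is usually a better choice than a hand-written parser.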

GenBank format
A verbose but very informative format
Contains much information in a carefully specified format
Harder to parse than FASTA

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
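To illustrate why GenBank records take more effort to parse than FASTA, here is a deliberately minimal reader for one flat-file record. It relies on three layout rules: top-level keywords start in the first column, continuation lines are indented, and the sequence follows the ORIGIN keyword until the "//" terminator. The record below is invented for the example and is not a real GenBank entry; the function name `parse_genbank` is likewise hypothetical.

```python
def parse_genbank(text):
    """Parse one GenBank flat-file record into (fields, sequence).

    Only top-level keywords are kept; the FEATURES table stays raw text.
    """
    fields, seq_parts = {}, []
    current, in_origin = None, False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # record terminator
            break
        if in_origin:                      # numbered sequence lines
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].strip():             # keyword starts in column 1
            current, _, value = line.partition(" ")
            in_origin = current == "ORIGIN"
            fields[current] = value.strip()
        elif current is not None:          # indented continuation line
            fields[current] += "\n" + line.strip()
    return fields, "".join(seq_parts)

# Invented toy record, laid out like a GenBank flat file.
toy_record = """LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Invented example record, not a real GenBank entry.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""
fields, seq = parse_genbank(toy_record)
print(fields["ACCESSION"], len(seq))
```

Even this sketch ignores the nested structure of the FEATURES table, which is where most of the annotation and most of the parsing difficulty lives.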

eXtensible Markup Language (XML)
• Markup language for data representation – derived from SGML, sibling of HTML
• Stringent, simple language with rigid rules
• Human readable and versatile
• Good parsers exist for multiple platforms
• The ability to design your own Document Type Definitions, which parsers can use to validate a document, permits complex data structures and grammars
• Examples of use for sequence data:
  – NCBI GBSeqXML
  – NCBI TinySeqXML
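As a sketch of how a standard parser reads sequence data in XML, the example below uses Python's built-in `xml.etree.ElementTree` on a small document laid out like NCBI TinySeq XML. The record is invented, and the element names (`TSeq`, `TSeq_defline` and so on) are given from memory as an assumption; they should be checked against NCBI's DTD before real use.

```python
import xml.etree.ElementTree as ET

# Invented record; element names are an assumption about the TinySeq layout.
doc = """<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="protein"/>
    <TSeq_defline>hypothetical example protein</TSeq_defline>
    <TSeq_length>12</TSeq_length>
    <TSeq_sequence>MKTAYIAKQRQI</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

root = ET.fromstring(doc)
records = []
for tseq in root.iter("TSeq"):
    records.append({
        "type": tseq.find("TSeq_seqtype").get("value"),
        "defline": tseq.findtext("TSeq_defline"),
        "length": int(tseq.findtext("TSeq_length")),
        "sequence": tseq.findtext("TSeq_sequence"),
    })

# Named fields make simple consistency checks possible.
for rec in records:
    assert rec["length"] == len(rec["sequence"])
print(records[0]["defline"])
```

Unlike FASTA, every attribute here is named, so no conventions about header layout are needed. Note that ElementTree only checks that the document is well-formed; validating against a Document Type Definition requires a validating parser such as lxml.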

Links
http://www.ncbi.nlm.nih.gov/Education/
http://www.infobiogen.fr/services/dbcat/
http://www.science.gmu.edu/~ntongvic/Bioinformatics/database.html
http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
http://www.no.embnet.org/Programs/DB/srs.php3
