Module 1 (Database Structures)

  • June 2020
Module 1: Databases, Structures, Local Patterns

Data, Data Collection, Database

Contents

  • Molecular structures
  • Spectra
  • Patent information
  • Molecular properties
  • Scientific literature
  • Cross-references
  • Supplier information
  • Prices
  • …

Implementation

  • "Flat file"
  • Local database
  • WWW access
  • …

Definition of a Database

Database = management component + storage component for persistent data that serve a specific purpose.

"Molecule databases"

[Figure: data-flow diagram. Labels: Raw data, Source file, Filtering, Index file, Library file, Data 1, Data 2, Data 3, Application software, User interface, User.]

What is this?

  • How do we recognize a "tree"?
  • Which "trees" are most similar to each other?

Molecular Similarity

Typical applications of the "similarity concept"

  • Similarity searching in databases
  • Pattern recognition in molecular structures
  • Similarity searching in virtual compound libraries
  • Data clustering / classification
  • Single compound design, de novo design
  • Compound library design
  • "Diversity" analysis of compound collections
  • SAR modeling & prediction

Analogy-based Feature Selection

"Find function-determining features of macromolecular receptors and their small molecule effectors"

[Figure: receptors A1, A2, B1, B2, B3, C1, C2 shown with their ligands a1, a2, b1, b2, b3, c1, c2.]

Applications

  • Drug Discovery
  • Chemical Biology
  • Functional Genomics
  • Similarity Searching & Virtual Screening
  • Identification of targets & ligands
  • Design of compound libraries

The Early Drug Discovery Process

Target Identification → Target Validation → Hit Identification → Lead Identification → Lead Optimization → Preclinical Development → Drug

[Figure labels: Bioinformatics, Cheminformatics]

Primary Sequence Databases

Database    Version           No. of Sequences
SwissProt   54.3 (10/2007)    285,335 (02/05: 168,297)
EMBL        92 (09/2007)      105,696,243 (12/04: 46,105,397)
TrEMBL      37.3 (10/2007)    4,935,209 (02/05: 1,589,670)
MIPS
OWL
…

UniProt combines SwissProt, TrEMBL, and PIR.
UniProtKB/TrEMBL Release 33.7: 3,189,332 entries

http://www.ebi.ac.uk/Databases/index.html
http://pir.georgetown.edu/pirwww/
http://pir.georgetown.edu/pirwww/dbinfo/

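Genome, mRNA, cDNA, EST

[Figure: eukaryotic gene with intron/exon structure. Genome: exons E1 to E4 separated by introns I1 to I3; the coding part is ~1% of the genome in H. sapiens. Splicing gives the mRNA: 5’-UTR, E1, E2, E3, E4, 3’-UTR. Reverse transcription gives the cDNA, which is read as 5’-ESTs and 3’-ESTs (most common).]

  • ~7 x 10^6 (70%) ESTs in GenBank!
  • EST: C. Venter, 1990s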

From Raw Data to Sequences

  I) cDNA sequence fragments (ESTs)
  II) Fragment matching (clustering) (>40 bp; >95% identity)
  III) Contig assembly
  IV) "Contig" (contiguous clone map)
  V) DNA complement (5’→3’ and 3’→5’ strands)
  VI) ORF (open reading frame) prediction: six-frame translation
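Steps V and VI (DNA complement and six-frame translation) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used for the slides: the function names are made up for this example, the standard genetic code is assumed, and an "ORF" is simplified here to the longest stretch that starts with Met and contains no stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Step V: return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Step VI: translate three frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(protein):
    """Longest Met-initiated stretch without a stop (*); may be open-ended."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"], longest_orf(frames["+1"]))
```

Running `longest_orf` over all six frames and keeping the longest hit gives a crude ORF prediction for a contig.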

Sequence "maturation" in a database

New sequence → DB entry: Unannotated → Preliminary → Unreviewed → Standard



Some Numbers

Organism                    Genome Size (bp)   Genes
Epstein-Barr virus          0.172 x 10^6       80
Escherichia coli            4.6 x 10^6         4406
Saccharomyces cerevisiae    12.1 x 10^6        5885
Drosophila melanogaster     180 x 10^6         13601
Homo sapiens                3200 x 10^6        ~25000

Most human genes are "hypothetical", "unclassified", or "unknown".

UniProt_SwissProt Line Types

ID   Identification
AC   Accession number(s)
DT   Date
DE   Description
GN   Gene name(s)
OS   Organism species
OG   Organelle
OC   Organism classification
RN   Reference number
RP   Reference position
RC   Reference comments
RX   Reference cross-references
RA   Reference authors
RL   Reference location
CC   Comments or notes
DR   Database cross-references
KW   Keywords
FT   Feature table data
SQ   Sequence header
     (blanks) sequence data
//   Termination line

A SwissProt Entry

ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN   LEPB.
OS   ESCHERICHIA COLI.
OC   PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC   ENTEROBACTERIACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 84008229.
RA   WOLFE P.B., WICKNER W., GOODMAN J.M.;
RL   J. BIOL. CHEM. 258:12073-12080(1983).
CC   -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM
CC       SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE
CC       I LEADER PEPTIDASE FAMILY.
DR   EMBL; K00426; G146600; -.
DR   PIR; A00998; ZPECS.
DR   PROSITE; PS00501; SPASE_I_1; 1.
KW   INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT   MOD_RES       1      1       BLOCKED.
FT   TRANSMEM      4     22
FT   DOMAIN       23     58       CYTOPLASMIC.
FT   TRANSMEM     59     77
FT   DOMAIN       78    324       PERIPLASMIC.
FT   ACT_SITE     91     91
FT   ACT_SITE    146    146
FT   MUTAGEN      62     62       E->V: INDIFFERENT.
LEP_ECOLI  Length: 324  January 7, 1999 14:23  Type: P  Check: 8977  ..
       1  MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT
          ..
//
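Because every line of such an entry starts with a two-letter line code, the flat file is easy to read programmatically. The sketch below groups lines by code; it is an illustration only. The function name `parse_swissprot_entry` is invented for this example, the sample contains just a few lines of the entry above, and real entries have details (continuation lines, the sequence block) that this sketch ignores.

```python
from collections import defaultdict


def parse_swissprot_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # termination line ends the entry
            break
        code = line[:2].strip()
        if code:                       # skip blank-coded sequence lines
            fields[code].append(line[2:].strip())
    return dict(fields)


# A few lines taken from the LEP_ECOLI entry shown above.
SAMPLE = """\
ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN   LEPB.
OS   ESCHERICHIA COLI.
FT   TRANSMEM      4     22
FT   ACT_SITE     91     91
//
"""

entry = parse_swissprot_entry(SAMPLE)
entry_name = entry["ID"][0].split()[0]
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
print(entry_name, accessions, len(entry["FT"]))
```

Keeping each line code as a list preserves repeated lines such as DT, CC, DR, and FT.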

SwissProt Feature Table

The feature table may indicate regions that

  • perform or affect function
  • interact with other molecules
  • affect replication
  • are involved in recombination
  • are a repeat unit
  • have secondary or tertiary structure
  • are revised or corrected

The feature table also supports

  • DB searching
  • links between databases

Amino acid codes

A   Ala   Alanine
C   Cys   Cysteine
D   Asp   Aspartic acid
E   Glu   Glutamic acid
F   Phe   Phenylalanine
G   Gly   Glycine
H   His   Histidine
I   Ile   Isoleucine
K   Lys   Lysine
L   Leu   Leucine
M   Met   Methionine
N   Asn   Asparagine
P   Pro   Proline
Q   Gln   Glutamine
R   Arg   Arginine
S   Ser   Serine
T   Thr   Threonine
V   Val   Valine
W   Trp   Tryptophan
Y   Tyr   Tyrosine

B   Asx   Aspartic acid or Asparagine
Z   Glx   Glutamine or Glutamic acid
X   Xaa   Any amino acid
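For practice with the codes, the table can be written as a lookup dictionary. This is a small sketch; the names `THREE_LETTER` and `to_three_letter` are chosen for this example only.

```python
# One-letter to three-letter codes, including the ambiguity codes B, Z, X.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}


def to_three_letter(sequence, sep="-"):
    """Spell out a one-letter sequence; raises KeyError on unknown letters."""
    return sep.join(THREE_LETTER[aa] for aa in sequence.upper())


# First five residues of the LEP_ECOLI sequence.
print(to_three_letter("MANMF"))
```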

Levels of Pattern Conservation

  • Active site
  • 3D protein structure
  • Protein fold / domains
  • 2D protein structure
  • 1D protein structure (amino acid sequence)
  • mRNA sequence
  • DNA sequence

[Figure labels: Predictive Conserved Patterns; Alignment studies]

The 20 standard L-amino acids

[Figure: structural formulas of the 20 standard L-amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y).]

Peptide backbone (N → C)

[Figure: peptide backbone with side chains R1, R2, R3.]

Stereochemistry of amino acids: Fischer projection

[Figure: Fischer projections of an amino acid. L form: COOH at the top, H2N on the left, H on the right, R at the bottom. D form: COOH at the top, H on the left, NH2 on the right, R at the bottom.]

Compounds with one or more chirality centers can be drawn using the Fischer projection (Emil Fischer):

  • The main carbon chain is arranged vertically.
  • The C atom with the highest oxidation state is written at the top.
  • Vertical bonds point backward, behind the plane of the paper; horizontal bonds point forward, out of the plane of the paper.

The 21st and 22nd proteinogenic amino acids

[Figure: structural formulas of selenocysteine and pyrrolysine.]

  • Selenocysteine, Sec (UGA stop codon)
  • Pyrrolysine, Pyl (UAG stop codon)

Selenocysteine and pyrrolysine are encoded by codons that normally terminate protein synthesis. These codons must be redefined by a recoding process so that these amino acids can be incorporated into proteins.

http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf

The Peptide Bond

Ramachandran Plot

[Figure: trans peptide bond and Ramachandran plot; labelled region: 3-10 helix.]

  • Peptide notation: N → C
  • White regions are disallowed except for glycine

Tutorial: http://www.cryst.bbk.ac.uk/PPS2/course/

The alpha-Helix

Right-handed α-helix

[Figure: α-helix with residues i, i+4, and i+8 marked; 5.4 Å pitch.]

  • 3.6 residues per turn (36 residues = 10 turns)

Helical Structures: 3-10 Helix

  • 3 residues per turn
  • 10 atoms in the ring formed by a hydrogen bond

The beta-Strand & beta-Sheet

[Figure: beta-strand conformation (7 Å pitch) and antiparallel beta-sheet; C-terminus marked.]

Beta-Sheets

[Figure: Flavodoxin (PDB: 1AG9).]

Reverse Turns ("Beta-Turns")

  • generally occur at the surface of the protein
  • hydrogen bond between residues i and i+3 (Cα distance < 7 Å)
  • nucleation centers during protein folding?
  • difference between type I and type II: orientation of the peptide bond between i+1 and i+2
  • type II: Gly (G) causes no hindrance with the C=O of (i+1)
  • account for approx. 50% of all turns

[Figure: Type I and Type II turns.]

Beta-Hairpin Turns

  • Beta-hairpin turns occur between two antiparallel beta-strands (= supersecondary structure)
  • Type I’: residue 2 is always Gly
  • Type II’: residue 1 is always Gly

Local Conformations are Context-Dependent

[Figure: the sequence VDLLKN in different 3D structures.]

  • identical sequence, different 3D structure
  • too short for homology assessment!

Global and local sequence features determine protein structure and function

Ribonuclease T1 from Aspergillus oryzae, a guanyl-specific hydrolase

Amino acid sequence:

ACDYTCGSNCYSSSDVSTAQAAGYQL
HEDGETVGSNSYPHKYNNYEGFDFSV
SSPYYEWPILSSGDVYSGGSPGADRV
VFNENNQLAGVITHTGASGNNFVECT

Structural model: PDB 4RNT

Bioinformatics: Searching for Homologues

Homolog: similar protein with a common ancestral sequence

  • may have similar function or structure
  • structural homology
  • functional homology
  • homology ≠ similarity!
  • no "% homology"!

Ortholog: homologous proteins in different species

Paralog: homologous proteins in the same species

Secondary Databases (Patterns & Motifs)

Database    Primary Source     Stored Information
PROSITE     SwissProt          Regular expressions (patterns)
Profiles    SwissProt          Weighted matrices (profiles)
PRINTS      OWL (SwissProt)    Aligned motifs (fingerprints)
BLOCKS      PROSITE/PRINTS     Aligned motifs (blocks)
IDENTIFY    BLOCKS/PRINTS      Fuzzy regular expressions (patterns)
Pfam        SwissProt          Hidden Markov Models (HMM)
…

Databases integrating Genetic, Molecular, or Metabolic Data

Database            Content                             URL
Amaze               Biochemical pathways                www.ebi.ac.uk/research/pfmp/
Ecocyc / Metacyc    Metabolic pathways                  http://biocyc.org
KEGG                Metabolic pathways                  www.genome.ad.jp/kegg/
TransPath           Signal transduction pathways        http://transpath.gbf.de/
BIND                Protein interaction and complexes   www.bind.ca/
GeneNet             Gene networks                       http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
CSNDB               Cell-signaling networks             http://geo.nihs.go.jp/csndb/

Information Retrieval Systems

  • SRS, the Sequence Retrieval System (at EBI, UK): http://www.srs.ebi.ac.uk
  • Entrez (at NCBI, USA): http://www.ncbi.nlm.nih.gov/Entrez

Homework: practice!

  • Amino acids, structures and codes: http://bioinf.man.ac.uk/aacids/amino_acid.htm

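Amino Acid Classification: A Venn Diagram

[Figure: Venn diagram of the 20 amino acids. Set labels: small, tiny, aliphatic, aromatic, hydrophobic, polar, charged, positive, negative, proline. Cysteine appears twice, as the disulfide form (C S-S) and as the free thiol (C SH).]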

Sliding-Window: The Helical-Wheel Plot

  • Alpha-helix: 3.6 residues per turn (100 degrees per residue)
  • http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html

[Figure: transmembrane helices of rhodopsin (PDB).]
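The wheel positions follow directly from the 100 degrees per residue stated above (360 / 3.6). The sketch below computes the angle of each residue; the function name `helical_wheel` and the convention of placing the first residue at 0 degrees are assumptions of this example.

```python
def helical_wheel(sequence, degrees_per_residue=100.0):
    """Return (position, residue, angle) tuples with angle in [0, 360)."""
    return [(i + 1, aa, (i * degrees_per_residue) % 360.0)
            for i, aa in enumerate(sequence)]


# Example with the hexapeptide VDLLKN.
for position, residue, angle in helical_wheel("VDLLKN"):
    print(position, residue, angle)
```

With 100 degrees per residue the wheel repeats after 18 residues (18 x 100 = 5 x 360), so residues i and i+18 fall on the same spoke.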

Sliding-Window: The Hydrophobicity Plot

Detect potential transmembrane segments

[Figure: hydrophobicity plot of human rhodopsin (AC P08100 at ExPASy); ExPASy service ProtScale; window size = 9; Kyte & Doolittle hydrophobicity scale.]
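A sliding-window hydrophobicity profile can be sketched as below, using the Kyte & Doolittle scale and window size 9 as in the plot. This is a plain unweighted mean per window; the function name `hydropathy_profile` and the choice to report each window by its center position are assumptions of this example, and ProtScale may treat window edges and weights differently. The test sequence is the N-terminal part of LEP_ECOLI from the SwissProt entry shown earlier.

```python
# Kyte & Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy per window, as (center position, score), 1-based."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    profile = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + window // 2 + 1, mean))
    return profile


# First 50 residues of LEP_ECOLI (FT line: TRANSMEM 4 22).
LEP_N_TERMINUS = "MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"
profile = hydropathy_profile(LEP_N_TERMINUS, window=9)
center, score = max(profile, key=lambda item: item[1])
print(center, round(score, 2))
```

For this fragment the highest-scoring window is centered inside the annotated TRANSMEM segment (residues 4 to 22), which is the kind of signal the plot is meant to reveal.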

Sliding-Window: Secondary Structure

Chou-Fasman method

  • based on analyzing the frequency of amino acids in different secondary structures
  • A, E, L, and M are strong predictors of alpha helices
  • P and G are predictors of a break in a helix
  • a table of predictive values is created for alpha helices, beta sheets, and loops
  • the structure with the greatest overall prediction value is used to determine the structure (80% majority, α+β window size = 5, turn: 4 residues)

GOR method

  • improves upon the Chou-Fasman method
  • assumes that the amino acids surrounding the central amino acid influence the secondary structure the central amino acid is likely to adopt
  • uses scoring matrices

Example of a multiple sequence alignment (ClustalW)

SW_LEP_BACAM    .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA
SW_LEPP_BACSU   .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI
SW_LEP_ECOLI    FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA
SW_LEP_SALTY    FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA
SW_LEP_PSEFL    FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF
SW_LEPC_BACCL   .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC
SW_LEP_HAEIN    VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA
SW_LEP_MYCTU    AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV

SW_LEP_BACAM    LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........
SW_LEPP_BACSU   LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........
SW_LEP_ECOLI    IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_SALTY    IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_PSEFL    IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID
SW_LEPC_BACCL   VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........
SW_LEP_HAEIN    VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ
SW_LEP_MYCTU    LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........

"Block": conserved region in the alignment
"Pattern" (regular expression matching): [S,G]-x-S-M-x-[P,S]

Searching for Consensus Patterns in PROSITE

Query: E. coli leader peptidase

  • Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF] [S is an active site residue]
    Sequences known to belong to this class detected by the pattern: ALL.
    Other sequence(s) detected in SWISS-PROT: 16.

  • Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY] [K is an active site residue]
    Sequences known to belong to this class detected by the pattern: ALL SPases I from prokaryotes as well as yeast IMP1, but not IMP2.
    Other sequence(s) detected in SWISS-PROT: NONE.

  • Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]
    Sequences known to belong to this class detected by the pattern: ALL.
    Other sequence(s) detected in SWISS-PROT: 10.

Spase_I_1

Pattern:       (G,S)xSMx(P,S)(A,T)(L,F)
Match:         (S)xSMx(P)(T)(L)
Position 89:   PFQIP SGSMMPTL LIGDF
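PROSITE patterns map almost one-to-one onto regular expressions, so the match above can be reproduced in a few lines. The converter below is a minimal sketch: the name `prosite_to_regex` is invented for this example, and it handles only single residues, `x`, `[...]` sets, `{...}` exclusions, and `(n)` or `(n,m)` repeats, not the `<` and `>` anchors. The test sequence is only the short E. coli fragment shown above, so the reported position is relative to that fragment, not to the full protein.

```python
import re

# One pattern element: residue, x, [set] or {exclusion}, optional repeat.
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        token, low, high = match.groups()
        if token == "x":
            regex = "."                          # any amino acid
        elif token.startswith("{"):
            regex = "[^" + token[1:-1] + "]"     # anything except these
        else:
            regex = token                        # residue or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]")
hit = re.search(regex, "PFQIPSGSMMPTLLIGDF")
print(regex, hit.group(), hit.start() + 1)
```

The same function converts the other two consensus patterns listed for this family, including the `(2)` and `x(3)` repeat counts.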

Amino Acid Composition

[Figure: bar chart of amino acid composition in % (y axis 0 to 16) for A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. Series: SwissProt V 40.30, archaebacterium (Thermoplasma volcanium), E. coli K-12, P. falciparum, Homo sapiens.]
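The quantity plotted here, percent composition per amino acid, is simple to compute for any sequence or set of sequences. A minimal sketch follows; the function name `composition_percent` is an assumption of this example, and letters outside the 20 standard amino acids are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_percent(sequence):
    """Percentage of each of the 20 standard amino acids in a sequence."""
    counts = Counter(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Ribonuclease T1 sequence from the slide above.
RNASE_T1 = ("ACDYTCGSNCYSSSDVSTAQAAGYQL" "HEDGETVGSNSYPHKYNNYEGFDFSV"
            "SSPYYEWPILSSGDVYSGGSPGADRV" "VFNENNQLAGVITHTGASGNNFVECT")
composition = composition_percent(RNASE_T1)
for aa in AMINO_ACIDS:
    print(aa, round(composition[aa], 1))
```

To compare whole proteomes as in the chart, the same function can be applied to the concatenation of all sequences of an organism.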

Protein Targeting Signals

[Figure: targeting signal, signal peptidase, mature protein.]

  • e.g. secreted proteins, mitochondrial matrix proteins, chloroplast stromal proteins
  • e.g. mitochondrial IMS proteins, apicoplast proteins

Known exceptions:

  • e.g. some mitochondrial proteins
  • SKL: some peroxisomal proteins

http://www.rockefeller.edu/pubinfo/proteintarget.html
