Module 1: Databases, Structures, Local Patterns
Data, Data Collection, Database
Contents
• Molecular structures • Spectra • Patent information • Molecular properties • Scientific literature • Cross-references • Supplier information • Prices • …
Implementation
• “Flat file” • Local database • Web access • …
Definition of a Database
Database = management component + storage component for persistent data that serve a specific purpose.
“Molecule Databases”
[Figure: components of a molecule database. Labels: raw data, source file, filtering, library file, index file, Data 1, Data 2, Data 3, application software, user interface, user.]
What Is This?
• How do we recognize a “tree”? • Which “trees” are most similar to each other?
Molecular Similarity
Typical applications of the “similarity concept”
• Similarity searching in databases
• Pattern recognition in molecular structures
• Similarity searching in virtual compound libraries
• Data clustering / classification
• Single compound design, de novo design
• Compound library design
• “Diversity” analysis of compound collections
• SAR modeling & prediction
Analogy-based Feature Selection
“Find function-determining features of macromolecular receptors and their small molecule effectors”
[Figure: Receptors: A1, A2, B1, B2, B3, C1, C2. Ligands: a1, a2, b1, b2, b3, c1, c2.]
Applications
• Drug Discovery • Chemical Biology • Functional Genomics • Similarity Searching & Virtual Screening • Identification of targets & ligands • Design of compound libraries
The Early Drug Discovery Process
Target Identification → Target Validation → Hit Identification → Lead Identification → Lead Optimization → Preclinical Development → Drug
Disciplines involved: Bioinformatics, Cheminformatics
Primary Sequence Databases
Database     Version           No. of Sequences
SwissProt    54.3 (10/2007)    285,335 (02/05: 168,297)
EMBL         92 (09/2007)      105,696,243 (12/04: 46,105,397)
TrEMBL       37.3 (10/2007)    4,935,209 (02/05: 1,589,670)
MIPS, OWL, …
UniProt combines SwissProt, TrEMBL, and PIR.
UniProtKB/TrEMBL Release 33.7: 3,189,332 entries
http://www.ebi.ac.uk/Databases/index.html
http://pir.georgetown.edu/pirwww/
http://pir.georgetown.edu/pirwww/dbinfo/
Genome → mRNA → cDNA → EST
[Figure: eukaryotic gene with intron/exon structure (exons E1 to E4, introns I1 to I3); the coding part of the genome is ~1% in H. sapiens. Splicing gives the mRNA (5’-UTR, E1 E2 E3 E4, 3’-UTR); reverse transcription gives the cDNA. Labels: 5’-EST, 3’-EST (most common).]
~7 × 10^6 (70%) ESTs in GenBank!
EST: C. Venter, 1990s
From Raw Data to Sequences
I) cDNA sequence fragments (ESTs)
II) Fragment matching (clustering) (>40 bp; >95% identity)
III) Contig assembly
IV) “Contig” (contiguous clone map)
V) DNA complement (5’→3’ strand and 3’→5’ strand)
VI) ORF (open reading frame) prediction: six-frame translation
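The last two steps (DNA complement and six-frame translation) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used by any particular database: the function names are hypothetical, the standard genetic code is assumed, and an "ORF" is simplified here to the longest stretch from a methionine to the next stop codon.

```python
import re

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by enumerating all codons in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Step V: the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Step VI: three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, peptide) for the longest Met-to-stop stretch, or None."""
    best = None
    for frame, protein in six_frame_translation(dna).items():
        for match in re.finditer(r"M[^*]*(?=\*)", protein):
            if best is None or len(match.group()) > len(best[1]):
                best = (frame, match.group())
    return best
```

For example, `six_frame_translation("ATGGCCTAA")` yields `"MA*"` in frame +1, and `longest_orf` reports that frame. Real ORF predictors add further criteria such as minimum length and codon usage.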
How a Sequence “Matures” in a Database
New sequence → DB entry: Unannotated → Preliminary → Unreviewed → Standard → †
Some Numbers
Organism                    Genome Size (bp)    Genes
Epstein-Barr virus          0.172 × 10^6        80
Escherichia coli            4.6 × 10^6          4406
Saccharomyces cerevisiae    12.1 × 10^6         5885
Drosophila melanogaster     180 × 10^6          13601
Homo sapiens                3200 × 10^6         ~25000
Most human genes are “hypothetical”, “unclassified”, or “unknown”.
UniProt_SwissProt Line Types
ID    Identification
AC    Accession number(s)
DT    Date
DE    Description
GN    Gene name(s)
OS    Organism species
OG    Organelle
OC    Organism classification
RN    Reference number
RP    Reference position
RC    Reference comments
RX    Reference cross-references
RA    Reference authors
RL    Reference location
CC    Comments or notes
DR    Database cross-references
KW    Keywords
FT    Feature table data
SQ    Sequence header
      (blanks) sequence data
//    Termination line
A SwissProt Entry
ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-NOV-1997 (REL. 35, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
GN   LEPB.
OS   ESCHERICHIA COLI.
OC   PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC   ENTEROBACTERIACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 84008229.
RA   WOLFE P.B., WICKNER W., GOODMAN J.M.;
RL   J. BIOL. CHEM. 258:12073-12080(1983).
CC   -!- CATALYTIC ACTIVITY: CLEAVAGE OF N-TERMINAL LEADER SEQUENCES FROM
CC       SECRETED AND PERIPLASMIC PROTEINS PRECURSOR.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. INNER MEMBRANE.
CC   -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S26; ALSO KNOWN AS TYPE
CC       I LEADER PEPTIDASE FAMILY.
DR   EMBL; K00426; G146600; -.
DR   PIR; A00998; ZPECS.
DR   PROSITE; PS00501; SPASE_I_1; 1.
KW   INNER MEMBRANE; TRANSMEMBRANE; HYDROLASE; PROTEASE.
FT   MOD_RES       1      1       BLOCKED.
FT   TRANSMEM      4     22
FT   DOMAIN       23     58       CYTOPLASMIC.
FT   TRANSMEM     59     77
FT   DOMAIN       78    324       PERIPLASMIC.
FT   ACT_SITE     91     91
FT   ACT_SITE    146    146
FT   MUTAGEN      62     62       E->V: INDIFFERENT.
LEP_ECOLI  Length: 324  January 7, 1999 14:23  Type: P  Check: 8977  ..
       1  MANMFALILV IATLVTGILW CVDKFFFAPK RRERQAAAQA AAGDSLDKAT
          ..
//
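Because every annotation line of such an entry starts with a two-letter line code, a flat-file entry can be read with very little code. The sketch below is an illustration under stated assumptions: `parse_flat_entry` is a hypothetical helper (not part of any SwissProt tool), it expects the old-style layout shown in the entry above, and it ignores sequence lines, which begin with blanks. The sample is an excerpt of that entry.

```python
from collections import defaultdict


def parse_flat_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):        # termination line ends the entry
            break
        code, content = line[:2], line[2:].strip()
        if code.strip() and content:     # skip blank-prefixed sequence lines
            fields[code].append(content)
    return dict(fields)


SAMPLE = """\
ID   LEP_ECOLI      STANDARD;      PRT;   324 AA.
AC   P00803; P78098;
DE   SIGNAL PEPTIDASE I (EC 3.4.21.89) (SPASE I) (LEADER PEPTIDASE I).
OS   ESCHERICHIA COLI.
FT   TRANSMEM      4     22
FT   TRANSMEM     59     77
//
"""

entry = parse_flat_entry(SAMPLE)
# Accession numbers are separated by semicolons on the AC line.
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
# Feature-table lines carry a key followed by start and end positions.
transmembrane = [tuple(int(x) for x in ft.split()[1:3])
                 for ft in entry["FT"] if ft.startswith("TRANSMEM")]
print(accessions, transmembrane)
```

Grouping by line code keeps repeated line types (DT, FT, CC) together, which is why each code maps to a list rather than a single string.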
SwissProt Feature Table
The feature table may indicate regions that
• perform or affect function
• interact with other molecules
• affect replication
• are involved in recombination
• are a repeat unit
• have secondary or tertiary structure
• are revised or corrected
It also supports
• DB searching
• links between databases
Amino Acid Codes
1-letter    3-letter    Name
A           Ala         Alanine
C           Cys         Cysteine
D           Asp         Aspartic acid
E           Glu         Glutamic acid
F           Phe         Phenylalanine
G           Gly         Glycine
H           His         Histidine
I           Ile         Isoleucine
K           Lys         Lysine
L           Leu         Leucine
M           Met         Methionine
N           Asn         Asparagine
P           Pro         Proline
Q           Gln         Glutamine
R           Arg         Arginine
S           Ser         Serine
T           Thr         Threonine
V           Val         Valine
W           Trp         Tryptophan
Y           Tyr         Tyrosine
B           Asx         Aspartic acid or asparagine
Z           Glx         Glutamine or glutamic acid
X           Xaa         Any amino acid
Levels of Pattern Conservation
Predictive conserved patterns can be found at the following levels:
• DNA sequence
• mRNA sequence
• 1D protein structure (amino acid sequence)
• 2D protein structure
• Protein fold / domains
• 3D protein structure
• Active site
Alignment studies
The 20 standard L-amino acids
[Figure: structural formulas of the 20 standard L-amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) and of the peptide backbone, N → C, with side chains R1, R2, R3.]
Stereochemistry of Amino Acids: Fischer Projection
[Figure: Fischer projections. L form: COOH at the top, H2N on the left, H on the right, R at the bottom. D form: COOH at the top, H on the left, NH2 on the right, R at the bottom.]
Compounds with one or more chirality centers can be drawn as a Fischer projection (Emil Fischer):
• The main carbon chain is arranged vertically.
• The carbon atom with the highest oxidation state is placed at the top.
• Vertical bonds point backward, behind the plane of the paper; horizontal bonds point forward, out of the plane of the paper.
The 21st and 22nd Proteinogenic Amino Acids
[Figure: structural formulas of selenocysteine and pyrrolysine.]
Selenocysteine, Sec (UGA stop codon)
Pyrrolysine, Pyl (UAG stop codon)
Selenocysteine and pyrrolysine are encoded by codons that normally terminate protein synthesis. These codons must be redefined by a process of recoding before the amino acids can be incorporated into proteins.
http://www.biophys.uni-duesseldorf.de/~wilm/doc/ls_2003_01_secis_pp4.pdf
The Peptide Bond / Ramachandran Plot
[Figure: peptide bond and Ramachandran plot. Labels: trans, 3-10; peptide notation N → C.]
• White regions are disallowed except for glycine
Tutorial: http://www.cryst.bbk.ac.uk/PPS2/course/
The alpha-Helix
Right-handed α-helix
[Figure: α-helix with residues i, i+4, and i+8 marked; pitch 5.4 Å.]
• 3.6 residues per turn (36 residues = 10 turns)
Helical Structures: 3-10 Helix
• 3 residues per turn
• 10 atoms in the ring formed by a hydrogen bond
The beta-Strand & beta-Sheet
[Figure: beta-strand conformation (7 Å pitch) and antiparallel beta-sheet; C-terminus marked.]
Beta-Sheets
[Figure: flavodoxin (PDB: 1AG9).]
Reverse Turns (“Beta-Turns”)
• generally occur at the surface of the protein
• hydrogen bond between residues i and i+3 (Cα distance < 7 Å)
• nucleation centers during protein folding?
• difference between type I and type II: orientation of the peptide bond between i+1 and i+2
• account for approx. 50% of all turns
[Figure: type I and type II turns. Type II label: Gly (G), no hindrance with C=O of (i+1).]
Beta-Hairpin Turns
• Beta-hairpin turns occur between two antiparallel beta-strands (= supersecondary structure)
• Type I’: residue 2 is always Gly
• Type II’: residue 1 is always Gly
Local Conformations are Context-Dependent
VDLLKN: identical sequence, different 3D structure
• too short for homology assessment!
Global and local sequence features determine protein structure and function
Ribonuclease T1 from Aspergillus oryzae, a guanyl-specific hydrolase
Amino acid sequence:
ACDYTCGSNCYSSSDVSTAQAAGYQL
HEDGETVGSNSYPHKYNNYEGFDFSV
SSPYYEWPILSSGDVYSGGSPGADRV
VFNENNQLAGVITHTGASGNNFVECT
Structural model: PDB 4RNT
Bioinformatics: Searching for Homologues
Homolog: similar protein with a common ancestral sequence
• may have similar function or structure
• structural homology
• functional homology
• homology ≠ similarity!
• no “% homology”!
Ortholog: homologous proteins in different species
Paralog: homologous proteins in the same species
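The warning against “% homology” can be made concrete: similarity is a measurable quantity, for instance percent identity between two aligned sequences, whereas homology is a yes/no statement about common ancestry. The sketch below computes percent identity; the function name is hypothetical, and the choice to ignore columns containing a gap is one of several conventions in use, so it is an assumption here rather than a standard.

```python
def percent_identity(aligned_a, aligned_b, gap_chars=".-"):
    """Percent of identical residues over all gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue                  # skip columns with a gap in either row
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


# Two aligned ten-residue segments that differ at a single position.
print(percent_identity("FAPKRRERQA", "FAPKRRARQA"))
```

A high percent identity makes homology plausible, but the number describes similarity only; two proteins are either homologous or not.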
Secondary Databases (Patterns & Motifs)
Database    Primary Source     Stored Information
PROSITE     SwissProt          Regular expressions (patterns)
Profiles    SwissProt          Weighted matrices (profiles)
PRINTS      OWL (SwissProt)    Aligned motifs (fingerprints)
BLOCKS      PROSITE/PRINTS     Aligned motifs (blocks)
IDENTIFY    BLOCKS/PRINTS      Fuzzy regular expressions (patterns)
Pfam        SwissProt          Hidden Markov Models (HMM)
…
Databases Integrating Genetic, Molecular, or Metabolic Data
Database           Content                              URL
Amaze              Biochemical pathways                 www.ebi.ac.uk/research/pfmp/
Ecocyc / Metacyc   Metabolic pathways                   http://biocyc.org
KEGG               Metabolic pathways                   www.genome.ad.jp/kegg/
TransPath          Signal transduction pathways         http://transpath.gbf.de/
BIND               Protein interaction and complexes    www.bind.ca/
GeneNet            Gene networks                        http://wwwmgs.bionet.nsc.ru/mgs/systems/genenet/
CSNDB              Cell-signaling networks              http://geo.nihs.go.jp/csndb/
Information Retrieval Systems
• SRS: Sequence Retrieval System (at EBI, UK) http://www.srs.ebi.ac.uk
• Entrez (at NCBI, USA) http://www.ncbi.nlm.nih.gov/Entrez
Homework: Practice!
• Amino acids: structures and codes http://bioinf.man.ac.uk/aacids/amino_acid.htm
Amino Acid Classification: A Venn Diagram
[Figure: Venn diagram grouping the 20 amino acids by property. Set labels: small, tiny, proline (P), aliphatic, aromatic, hydrophobic, polar, charged, positive, negative. Cysteine appears twice, as C(S-S) and as C(SH).]
Sliding-Window: The Helical-Wheel Plot
• Alpha-helix: 3.6 residues per turn (100 degrees per residue)
http://cti.itc.virginia.edu/~cmg/Demo/wheel/wheelApp.html
[Figure: transmembrane helices of rhodopsin (PDB).]
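The geometry behind a helical wheel follows directly from the 100 degrees per residue stated above: residue i sits at an angle of i × 100° around the helix axis. The following minimal sketch computes those positions on a unit circle; the function name and the returned tuple layout are illustrative assumptions, and no plotting is done.

```python
import math

DEGREES_PER_RESIDUE = 100.0   # 360 degrees / 3.6 residues per turn


def helical_wheel(sequence, start_angle=0.0):
    """Return (residue, angle in degrees, x, y) for each position."""
    wheel = []
    for i, residue in enumerate(sequence):
        angle = (start_angle + i * DEGREES_PER_RESIDUE) % 360.0
        rad = math.radians(angle)
        wheel.append((residue, angle, math.cos(rad), math.sin(rad)))
    return wheel


for residue, angle, x, y in helical_wheel("VDLLKN"):
    print(residue, angle, round(x, 2), round(y, 2))
```

Residues i and i+18 land on the same angle (18 × 100° = 5 full turns), so a wheel holds at most 18 distinct positions. Hydrophobic residues clustering on one side of the wheel indicate an amphipathic helix.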
Sliding-Window: The Hydrophobicity Plot
Detect potential transmembrane segments
[Figure: hydrophobicity plot of human rhodopsin (AC P08100 at ExPASy); ExPASy service ProtScale; window size = 9; Kyte & Doolittle hydrophobicity scale.]
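A hydrophobicity plot of this kind is a moving average of per-residue scale values. The sketch below uses the Kyte & Doolittle scale with a window of 9, as in the plot above. The function names are hypothetical, and the threshold of 1.6 in `hydrophobic_windows` is only an illustrative cut-off chosen for this example, not a value taken from the slide.

```python
# Kyte & Doolittle (1982) hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy per window; entry k covers residues k..k+window-1."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[k:k + window]) / window
            for k in range(len(values) - window + 1)]


def hydrophobic_windows(sequence, window=9, threshold=1.6):
    """Zero-based start positions of windows whose mean exceeds the threshold."""
    return [k for k, score in enumerate(hydropathy_profile(sequence, window))
            if score > threshold]


# First 50 residues of E. coli leader peptidase (LEP_ECOLI).
lep_n_terminus = "MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT"
print(hydrophobic_windows(lep_n_terminus))
```

For this fragment the high-scoring windows fall inside the first annotated transmembrane segment of LEP_ECOLI (residues 4 to 22), while the polar C-terminal end of the fragment scores low. Larger windows smooth the curve but blur segment boundaries.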
Sliding-Window: Secondary Structure
Chou-Fasman method
• based on analyzing the frequency of amino acids in different secondary structures
• A, E, L, and M are strong predictors of alpha helices
• P and G are predictors of a break in a helix
• a table of predictive values is created for alpha helices, beta sheets, and loops
• the structure with the greatest overall prediction value is used to determine the structure (80% majority, α+β window size = 5, turn: 4 residues)
GOR method
• improves upon the Chou-Fasman method
• assumes that the amino acids surrounding the central amino acid influence the secondary structure that the central amino acid is likely to adopt
• scoring matrices
Example of a multiple sequence alignment (ClustalW)
SW_LEP_BACAM    .......... ......MTEE Q..KPTSEKS VKRKSNTYWE WGKAIIIAVA
SW_LEPP_BACSU   .......... .......... ....MTKEKV FKKKS.SILE WGKAIVIAVI
SW_LEP_ECOLI    FAPKRRERQA AAQAAAGDSL D..KATLKKV APKPG..WLE TGASVFPVLA
SW_LEP_SALTY    FAPKRRARQA AAQTASGDAL D..NATLNKV APKPG..WLE TGASVFPVLA
SW_LEP_PSEFL    FAPRRRSAIA SYQGSVSQP. D..AVVIEKL NKEPL..LVE YGKSFFPVLF
SW_LEPC_BACCL   .......... .......... ....MTKQKE KRGRR..... WPWFVA..VC
SW_LEP_HAEIN    VLPKRHRQVA RAEQRSGKT. ...LSEEEKA KIEPISEASE FLSSLFPVLA
SW_LEP_MYCTU    AGQVFDAAPF DAAPDADSEG DSKAAKTDEP RPAKRSTLRE FAVLAVIAVV

SW_LEP_BACAM    LALLIRHFLF EPYLVEGSSM YPTLH..... DGERLFVN.. ..........
SW_LEPP_BACSU   LALLIRNFLF EPYVVEGKSM DPTLV..... DSERLFVN.. ..........
SW_LEP_ECOLI    IVLIVRSFIY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_SALTY    IVLIVRSFLY EPFQIPSGSM MPTLL..... IGDFILVEKF AYGIKDPIYQ
SW_LEP_PSEFL    IVLVLRSFLV EPFQIPSGSM KPTLD..... VGDFILVNKF SYGIRLPVID
SW_LEPC_BACCL   VVATLRLFVF SNYVVEGKSM MPTLE..... SGNLLIVN.. ..........
SW_LEP_HAEIN    VVFLVRSFLF EPFQIPSGSM ESTLR..... VGDFLVVNKY AYGVKDPIFQ
SW_LEP_MYCTU    LYYVMLTFVA RPYLIPSESM EPTLHGCSTC VGDRIMVD.. ..........

“Block”: conserved region within the alignment
“Pattern”: [S,G]-x-S-M-x-[P,S]
Regular expression matching
Searching for Consensus Patterns in PROSITE
Query: E. coli leader peptidase

- Consensus pattern: [GS]-x-S-M-x-[PS]-[AT]-[LF] [S is an active site residue]
- Sequences known to belong to this class detected by the pattern: ALL.
- Other sequence(s) detected in SWISS-PROT: 16.

- Consensus pattern: K-R-[LIVMSTA](2)-G-x-[PG]-G-[DE]-x-[LIVM]-x-[LIVMFY] [K is an active site residue]
- Sequences known to belong to this class detected by the pattern: ALL SPases I from prokaryotes as well as yeast IMP1, but not IMP2.
- Other sequence(s) detected in SWISS-PROT: NONE.

- Consensus pattern: [LIVMFYW](2)-x(2)-G-D-[NH]-x(3)-[SND]-x(2)-[SG]
- Sequences known to belong to this class detected by the pattern: ALL.
- Other sequence(s) detected in SWISS-PROT: 10.

Spase_I_1
(G,S)xSMx(P,S)(A,T)(L,F)
(S)xSMx(P)(T)(L)
89: PFQIP SGSMMPTL LIGDF
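PROSITE patterns map almost one-to-one onto ordinary regular expressions, which is what makes pattern scanning fast. The sketch below converts the pattern syntax used above into a Python regular expression and scans the E. coli fragment shown in the match line. `prosite_to_regex` is a hypothetical helper, not a PROSITE tool, and it covers only the common elements (residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors).

```python
import re

# One pattern element: optional '<', a residue / x / [set] / {excluded set},
# an optional repeat count "(n)" or "(n,m)", and an optional '>'.
_ELEMENT = re.compile(
    r"^(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any amino acid except these
        repeat = ""
        if low is not None:
            repeat = "{%s}" % low if high is None else "{%s,%s}" % (low, high)
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[GS]-x-S-M-x-[PS]-[AT]-[LF]")
fragment = "PFQIPSGSMMPTLLIGDF"   # segment of E. coli leader peptidase
match = re.search(regex, fragment)
print(regex, match.group(), match.start())
```

The match is the segment SGSMMPTL; its position is counted from the start of the fragment here, whereas the 89 in the scan output above refers to the full-length sequence.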
Amino Acid Composition
[Figure: bar chart of amino acid composition in % (y-axis 0 to 16) for the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y. Series: SwissProt V 40.30, archaebacterium (Thermoplasma volcanium), E. coli K-12, P. falciparum, Homo sapiens.]
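The percentages shown in such a chart are simple relative frequencies of the 20 standard amino acids. A minimal sketch of the calculation for a single sequence follows; the function name is hypothetical, and non-standard symbols (B, Z, X, gaps) are ignored here by assumption. Whole-proteome values like those in the chart would be obtained by pooling the counts over all sequences of an organism.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Percentage of each standard amino acid; other symbols are ignored."""
    counts = Counter(aa for aa in sequence.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# First 50 residues of E. coli leader peptidase (LEP_ECOLI).
composition = amino_acid_composition(
    "MANMFALILVIATLVTGILWCVDKFFFAPKRRERQAAAQAAAGDSLDKAT")
for aa in AMINO_ACIDS:
    print(aa, round(composition[aa], 1))
```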
Protein Targeting Signals
[Figure: targeting signal, signal peptidase, mature protein.]
• e.g. secreted proteins, mitochondrial matrix proteins, chloroplast stromal proteins
• e.g. mitochondrial IMS proteins, apicoplast proteins
Known exceptions:
• e.g. some mitochondrial proteins
• SKL: some peroxisomal proteins
http://www.rockefeller.edu/pubinfo/proteintarget.html