CCS HAU
Bioinformatics
Bio(-)informatics
Dr. Sudhir Kumar CCS HAU, Hisar
[email protected]
Bio = Biology/biological
Informatics = Information Science including technology
What is Bioinformatics?
• Mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.
• Bioinformatics is conceptualizing biology in terms of macromolecules and then applying “informatics” techniques to understand and organize the information associated with these molecules on a large scale.
Bioinformatics
• Bioinformatics is the application of information technology to analyze, process, and manage biological data.
• Bioinformatics provides computational tools to facilitate the process of Data → Information → Knowledge → Discovery.
Suggestive Biology-Language Homologies
• Cell: Nucleotide Bases, Amino Acids, Exons, Folding, Proteins, Protein Circuits, Biological Functions, Regulation of gene expression
• Human Language: Alphabet, Words, Phrases, Syntax, Word Senses, Sentences, Semantics, Language generation
Overview
• Biological databases are being produced at a phenomenal rate
• As a result, computers are becoming indispensable for biological research
• Aims
1- organize the data
2- develop tools
3- apply the tools to biological problems
- Genome and protein databases
- aligning sequences
- searching
- visualizing protein structure
- homology modeling
- molecular mechanics and molecular dynamics
- structure prediction
- docking
- drug design
- metabolic pathways
- NMR and X-ray crystallography
and many more …
Definitions:
• Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
• Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.
• Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
• Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
• Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
First “Behind the Screen”
• Biological databases are largely devoted to search.
– Also, integrity, security, etc.
• Search means taking a query and retrieving some database entry that matches it.
• Efficiency is key; we want to find things fast, regardless of how big the database gets.
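To illustrate why efficiency matters, the sketch below contrasts a linear scan with a lookup in a prebuilt hash index over a toy in-memory "database". The accession numbers and sequences are invented for illustration, and real biological databases use far more elaborate indexing than a Python dictionary.

```python
# Toy "database" of (accession, sequence) records. All entries are invented.
RECORDS = [
    ("SEQ0001", "ATGGCGTACGTTAG"),
    ("SEQ0002", "ATGCCCGGGTTTAA"),
    ("SEQ0003", "ATGAAACCCGGGTAG"),
]


def linear_search(records, accession):
    """Scan every record until the accession matches: O(n) per query."""
    for acc, seq in records:
        if acc == accession:
            return seq
    return None


def build_index(records):
    """Build a hash index once; each later lookup is O(1) on average."""
    return {acc: seq for acc, seq in records}


def indexed_search(index, accession):
    """Look the accession up directly in the prebuilt index."""
    return index.get(accession)


INDEX = build_index(RECORDS)
print(linear_search(RECORDS, "SEQ0002"))
print(indexed_search(INDEX, "SEQ0002"))
```

Both functions return the same entry; the difference is that the cost of the linear scan grows with the number of records, while the indexed lookup stays roughly constant.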
Rate of growth
Bioinformatics: post-genomic era
• High-throughput technologies generate petabytes of data: sequencing, microarrays, combinatorial chemistry, high-throughput screening, mass spectrometry, …
• Rapid growth of data and databases in the public and private domains: genomics, gene expression profiles, proteomics, pharmacogenomics, clinical trials, literature, …
• Proliferation of computational tools for data analysis and processing: statistical tools for sequence analysis and gene finding, clustering algorithms, protein folding and structure prediction, drug docking, visualization tools, data mining tools, …
The Promises
• Digitization of biological systems and processes: simulation and modeling of protein-protein interactions, protein pathways, genetic networks, biochemical and cellular processes, normal and disease physiological states, …
• Blurring of the boundary between experimentally generated data and computational data search and analysis
• In silico discovery as a complement to wet lab experiments
The Landscape of Biological Data Sources
(Figure: databases, resources and tools shown) PRINTS, Patent USPTO, PFAMB, BLOCKS, PIR, PFAMA, PROSITEDOC, LOCUS LINK, NRL3D, DOMO, Patent JPO, SWISSFAM, PROSITE, GENEPEPT, Patent PCT, TFCLASS, Medline, TREMBL, TFMATRIX, PRODOM, UNIGENE, TFSITE, EMBL, DSSP, DDBJ, DBSTS, GSDB, TFCELL, TIGR, SWISSPROT, Entrez, PDB, GENBANK, RHDB, TAXONOMY, EBI, Celera, GENETICCODE, HUGO, GDB, SNP, WIT, Fly Base, OMIM, Clinical DB, KEGG, dbSNP Contact, SNP Consortium, Microbial Genomes, STKE, ENZYME, FASTA, BLAST, dbSNP Population, SSEARCH, C. Elegans, CLUSTALW
Databases are of two types: Primary & Secondary
PRIMARY DATABASES
• Primary source of information; can be considered a reservoir of sequence information.
• Primary repository for newly discovered sequences.
• e.g. GenBank at NCBI, EMBL, DDBJ
SECONDARY DATABASES
• These databases derive their information by analysing the primary databases.
• They express a particular attribute of the primary databases (such as motifs or patterns).
• They add value to the information present in the primary databases.
• e.g. Pfam, BLOCKS, PRINTS etc.
Primary Nucleotide Repository
• NCBI (http://www.ncbi.nlm.nih.gov)
• EMBL (http://www.ebi.ac.uk/embl)
• DDBJ (http://www.ddbj.nig.ac.jp/)
Primary Protein Repository
• PIR (http://pir.georgetown.edu)
• Swissprot/Uniprot (http://www.ebi.ac.uk/swissprot)
• Protein Data Bank (http://www.rcsb.org/pdb)
Secondary ‘pattern’ databases (database: primary source; stored information)
• PROSITE: SWISS-PROT; regular expressions (patterns)
• PRINTS: SWISS-PROT/TrEMBL; aligned motifs (fingerprints)
• Pfam: SWISS-PROT/TrEMBL; Hidden Markov Models (HMMs)
• Profiles: SWISS-PROT; weight matrices (profiles)
• BLOCKS: PRINTS/InterPro/Domo; weighted motifs (blocks)
• IDENTIFY: PRINTS/InterPro; permissive regular expressions
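To make the "regular expressions (patterns)" entry concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The helper `prosite_to_regex` is a hypothetical name, the converter covers only the basic syntax (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors outside brackets), and the protein sequence is invented. The pattern `N-{P}-[ST]-{P}` is the commonly cited N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string.

    Simplified sketch: does not handle anchors placed inside brackets.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "two to four of any residue"
        element = element.replace("(", "{").replace(")", "}")
        # lowercase x means "any residue"
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence ends
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
sequence = "MKNGSAPNPTL"  # invented example sequence
for match in re.finditer(regex, sequence):
    # report 1-based start position and the matched residues
    print(match.start() + 1, match.group())
```

Pattern databases of this kind trade sensitivity for simplicity: a regular expression either matches or it does not, which is why the profile- and HMM-based databases listed alongside PROSITE use weighted scores instead.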
NUCLEOTIDE REPOSITORY
• EMBL: European Molecular Biology Laboratory, at Cambridge, UK.
• GenBank: at NCBI, a division on the NIH campus, USA.
• DDBJ: DNA Data Bank of Japan, Mishima, Japan.
• Working in collaboration since 1982.
• Each collects information from its own region.
• They automatically update each other every 24 hours.
To organize the huge amount of information, the database has been split into numerous divisions (17), and each division has a specific 3-letter code, e.g.
• Human: HUM
• Virus: VRL
• Fungi: FUN
NCBI, EMBL, DDBJ
Bioinformatics Centre, BISR, Jaipur
The Biological data and databases
• Complex: data types range from protein and nucleic acid sequences and texts to 3-dimensional molecular structures and images of cells and tissues
• Hierarchical: data organizations range from molecules, biochemical pathways, cells and tissues to organisms and populations
• Heterogeneous: database locations, storage formats, and access methods
• Dynamic: data contents and database schema are constantly changing
The computational tools and algorithms
• Input/Output data formats: each application program requires specific I/O data formats that may impede data flow from one program to the next
• Rapidly evolving: development of new algorithms and improvement of old ones
• Require graphical display or presentation of results: viewers for sequence alignments, 3-D structures, multidimensional plots, …
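The data-format point can be illustrated with a small converter. The sketch below parses FASTA-formatted text and re-emits it as tab-delimited lines, the kind of glue code often needed to pass data from one program to the next. The function names and the example records are invented for illustration, and the parser handles only plain FASTA (a `>` header line followed by sequence lines).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    chunks = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # the identifier is the first word after ">"
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks[current_id] = []
        elif current_id is not None:
            # sequence lines are concatenated under the current header
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def to_tab_delimited(records):
    """Emit one 'identifier<TAB>sequence' line per record."""
    return "\n".join(f"{seq_id}\t{seq}" for seq_id, seq in records.items())


EXAMPLE = ">seq1 toy sequence\nATGC\nGGTA\n>seq2\nTTAA\n"  # invented data
print(to_tab_delimited(parse_fasta(EXAMPLE)))
```

In practice an established library would normally be preferred over a hand-written parser; the point here is only that each hop between tools tends to need a conversion step like this one.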
Integration: Databases and Scientific Algorithms in Bioinformatics
(Figure: data sources and tools to be integrated, with their formats)
• Medline (ASN.1)
• Microarray Data (RDBMS, Excel)
• BLAST (FASTA)
• OMIM (Text File)
• KEGG (HTML Text, Binary Images)
• Entrez/NCBI (ASN.1)
• ClustalW (FASTA)
• PDB (Oracle, 3D images)
Examples of Bioinformatics
• Database interfaces – GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …
• Sequence alignment – BLAST, FASTA
• Multiple sequence alignment – Clustal, MultAlin, DiAlign
• Gene finding – Genscan, GenomeScan, GeneMark, GRAIL
• Protein domain analysis and identification – Pfam, BLOCKS, ProDom
• Pattern identification/characterization – Gibbs Sampler, AlignACE, MEME
• Protein folding prediction – PredictProtein, SWISS-MODEL
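As a minimal sketch of what sequence alignment computes, the function below scores a global alignment of two sequences with the classic Needleman-Wunsch dynamic programming recurrence. This is not how BLAST or FASTA work internally (both are heuristic search tools); it only shows the underlying idea. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping two rows in memory."""
    # Row 0: aligning an empty prefix of `a` against prefixes of `b` costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of `a` against an empty prefix of `b`.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Diagonal move: align a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Vertical / horizontal moves: insert a gap in one sequence.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "ACT"))
```

Identical sequences score one point per residue, while each insertion or deletion lowers the score by the gap penalty. Real tools use substitution matrices and more refined gap models.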
Five websites that all biologists should know
• NCBI (The National Center for Biotechnology Information) – http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute) – http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource – http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource) – http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank) – http://www.rcsb.org/PDB/
Database Growth (cont.)
The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of February 2001, 45 complete, finished genomes were publicly available for analysis, not counting the virus and viroid genomes. The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000.
What is bioinformatics, genomics, sequence analysis, computational molecular biology . . . ?
The Reverse Biochemistry Analogy. Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of a genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insights into that stretch of DNA!
The computer and molecular databases are a necessary, integral part of this entire process.
Vaccine development in the post-genomic era: Reverse Vaccinology Approach.
(Figure: example PDB file, 123.PDB. It contains a COMPND record, HETATM records with atom serial numbers up to 30 for O, C, N and H atoms, each with x, y, z coordinates, CONECT records listing bonded atoms, and a closing END record.)
Challenges in bioinformatics
• Explosion of information
– Need for faster, automated analysis to process large amounts of data
– Need for integration between different types of information (sequences, literature, annotations, protein levels, RNA levels etc…)
– Need for “smarter” software to identify interesting relationships in very large data sets
• Lack of “bioinformaticians”
– Software needs to be easier to access, use and understand
– Biologists need to learn about the software, its limitations, and how to interpret its results
New areas in Bioinformatics
• Microarrays
• Functional Genomics
• Structural Genomics
• Comparative Genomics
• Pharmacogenomics
• Medical Informatics
What is bioinformatics?
Your Turn: ANY Question(s)