Bioinformatics: A Starter
K. Mani, Reader in Botany, PSG College of Arts and Science, Coimbatore, Tamilnadu, India.
Summary
• What is Bioinformatics?
• How it started?
• Who are the Bioinformaticians? The tool makers and the tool users
• The information: Nucleotides, proteins, and structures
• The Database: Primary, Secondary and special
• The elements
• Sequence analysis: Pair wise and Multiple sequence analysis
• Phylogenetic analysis
• Genomics: Structural, functional and comparative
• Proteomics: Structural, functional, comparative and interactive
• Metabolomics: Reconstruction of metabolic pathways
• Systems Biology: Cell function Simulation
• Applications of Bioinformatics
• Conclusion
What is Bioinformatics?
Bioinformatics is about collecting, storing, maintaining and distributing biological data, and using computers to extract meaningful information from the data and convert it into knowledge. The first phase in bioinformatics is the conversion of biological data into digital format. The second phase is cleaning and arranging the data into an easily retrievable form. This is called the database. The third phase is the extraction of hidden information from the data by comparison and analysis using computer programs, so that the data become knowledge.
How it started?
Two major events in biology gave bioinformatics its kick start. The attempts of Margaret Dayhoff in the 1960s to analyze protein sequences with the help of computer programs were the initial groundwork for bioinformatics. However, the real jump start occurred only after the release of the human genome draft in the year 2001. All the information necessary for the growth and development of an organism is already present in the nucleotide sequences of its genome. Similarly, the structure and function of a protein are already inscribed in its primary sequence. Therefore all one needs to know is the sequence; the rest can be discerned from it by computational analysis. The attempt to cull every bit of information from a given sequence or structure is bioinformatics.
The two pillars of bioinformatics
Since bioinformatics is a happy marriage of computer science and biology, these two remain its pillars. Computer-savvy biologists and biology-loving computational scientists enjoy bioinformatics equally. Each of them has a share in its development and growth. The computational domain involves creating and maintaining biological databases, developing powerful programs that can crunch huge volumes of data, and building smart analytical tools that extract the secrets buried inside the genetic code. The second domain belongs to biologists, who use these intelligent tools to unearth the secrets hidden in nucleotide and amino acid sequences. For instance, from a given nucleic acid sequence the following information can be dug out:
• The positions of genes on the genome sequence
• The intron and exon positions
• Deletion, insertion and substitution mutations
• Paralogous genes, orthologous genes and pseudogenes
• Non-coding gene regulatory elements
• The protein for which the gene codes
• The gene flanking sequences that would serve as primers for PCR
Similarly, from protein sequences the following and much more information can be traced:
• Protein function
• Protein secondary structure
• Tertiary structure
• Super secondary structures
• Domains
• Patterns
• Motifs
• Antigenic regions
• Post translational modifications
• Grand average of hydropathy
• Molecular weight
• Electrical properties
• Half life
• Optical extinction coefficient
• Interaction with other molecules, including other proteins
• Cellular location
• Signal peptides
• Active site
• Probable location in 2D gel electrophoresis
• Transmembrane properties
• Phylogeny
• Iso-electric point
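As an illustration of how one of the listed properties is read directly from a protein sequence, the sketch below computes the grand average of hydropathy as the mean Kyte-Doolittle hydropathy value per residue. The function names and the example peptides are assumptions made for illustration; a real analysis would normally rely on an established tool.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


def composition(protein):
    """Count how often each residue occurs in the sequence."""
    counts = {}
    for aa in protein.upper():
        counts[aa] = counts.get(aa, 0) + 1
    return counts


# Hypothetical peptides: a positive value suggests a hydrophobic sequence,
# a negative value a hydrophilic one.
hydrophobic_value = gravy("LLIVAV")
hydrophilic_value = gravy("KDERKD")
```

A strongly positive average over a stretch of residues is one of the simple signals used to suspect a transmembrane segment.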
The nature of Biological data
There are four kinds of biological data:
• Gene related
• Protein related
• Gene and protein function related
• Structure related
The gene related data consist of gene sequences, the locations of the genes and their regulatory elements. They further include genome sequences, chromosomal architecture, non-coding regions, gene markers, chromosome map elements and single nucleotide polymorphism markers. Sequence data are stored as text files in a format known as FASTA. The protein related data include sequences, domains, patterns, motifs and so on; the sequences are likewise stored in FASTA format in the databanks. The structure related data include three dimensional atomic coordinate files of proteins, small molecular compounds, ligands and nucleic acids such as tRNA.
The Data Banks
Biological databanks collect, maintain and distribute data. Some databanks are large storehouses known as data warehouses. Others are well classified, like supermarkets. Sequence, structure and gene expression data are maintained digitally on robust computer servers and made available on the Internet through the World Wide Web. With a few exceptions, almost all biological data are free and open access.
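The FASTA text format mentioned above for sequence data is simple: a header line beginning with ">" followed by one or more lines of sequence. A minimal sketch of reading it is given below. The function name and the two example records are hypothetical, and established libraries offer more robust parsers.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Two made-up records, one nucleotide and one protein.
example = """>seq1 hypothetical example gene
ATGGCC
TTTAA
>seq2 hypothetical example protein
MKV
"""
records = parse_fasta(example)
```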
The Big three
Across the world there are three major biological databases. These databases can be accessed through Internet browsers. The site map of each database takes the visitor on a tour and introduces all its components.
• National Center for Biotechnology Information (NCBI)
• European Molecular Biology Laboratory (EMBL)
• DNA Data Bank of Japan (DDBJ)
Nucleic acid sequences, either as genes or as genomes, are available with complete annotations in these databases. Entrez (NCBI) and SRS (EMBL) are the user friendly data retrieval systems offered by the databases themselves.
Classification of Databases
Every year the first issue of Nucleic Acids Research comes out with a list of newly added databases. The total number of databases in the world is now nearing 1000. Databases can be classified based on the nature of the data source. Primary databases consist of data derived from the source laboratories. Secondary databases are information-enriched data derived from the primary databases. Specialized databases contain data of special interest.
Primary databases stock either nucleic acid sequence data or protein sequence data. Structure files from crystallography research centers are other primary data. Genes, genomes, microarray gene expression data, protein sequence data and protein structure data are the major primary data.
Specialized databases
Special data such as expressed sequence tags (EST), sequence tagged sites (STS) and single nucleotide polymorphisms (SNP) are created for special purposes and comprise the content of specialized databases. These data too can be obtained from NCBI. STS are short nucleic acid sequences which help to locate a gene by their unique sequence. EST are DNA sequences reverse transcribed from mRNA. The set of EST from a particular cell type gives information about which genes are expressed in it. Assembling EST sequences gives us the coding sequence of a gene without the intervening introns. SNP data are helpful as genetic markers for locating specific disease related genes on the chromosomes. They are also useful as disease diagnostic data. OMIM (Online Mendelian Inheritance in Man) is another specialized database that lists the known inherited diseases of man. Apart from giving the literature and sequences, it also offers links to databases that contain related data. Other specialized databases are species specific. The biological data concerned with a particular model organism are maintained separately as a special database. WormBase is for Caenorhabditis elegans, CyanoBase is for cyanobacteria, ATH for Arabidopsis thaliana, and FlyBase is for Drosophila. PubMed and Agricola are specialized databases for biological literature.
Secondary Database
Secondary, derived or value added databases are highly information-enriched data sources. Primary data that have been meaningfully classified and curated are called secondary data. Secondary data are free from redundancy, easy to retrieve and ready for analytical use. The overwhelming size of the primary data makes it very difficult for the user to retrieve the desired data. In secondary databases, for example, millions of protein sequences are reduced to a limited number of protein classes and domains. Similarly, billions of nucleotides of sequence are classified into unigenes, gene families and functionally related genes. Apart from being easy to search, secondary databases help one to predict the structure and function of an unknown gene or protein.
Elements of Bioinformatics
The study of bioinformatics has the following aspects:
• Sequence analysis
• Genomics
• Proteomics
• Metabolomics
• Systems biology
• Applied bioinformatics
Sequence analysis
Currently there are 52 billion nucleotide bases in 50 million DNA sequences in GenBank. These DNA sequences belong to many species of organisms including plants, animals, microbes and viruses. Equally amazing is the protein sequence data. UniProt, on the ExPASy server, has several thousands of well curated unique protein sequences. The first and basic approach in bioinformatics is the analysis of sequences. There are two types of sequence comparison, namely pair-wise sequence comparison and multiple sequence comparison.
Pair-wise sequence comparison
Two gene or two protein sequences can be compared to find out the relationship between them. How much do they resemble each other? Do they come from a common ancestral gene? How far have they diverged from each other since their origin from the common ancestor? All of this can be learned from sequence comparison. Before their similarity can be compared, the sequences have to be aligned with each other. Sequence alignment is a daunting job. For instance, two sequences of 100 residues each give ten thousand possible residue pairings, and the number of different alignments, with gaps, that can be built from them is astronomically larger. Only one among them is the real alignment, and identifying it needs the help of a computer. The most probable alignment is called the optimal alignment. The optimal alignment has the highest similarity score. There are two complementary sequence alignment algorithms, namely local alignment and global alignment. The local alignment algorithm created by Smith and Waterman tries to maximize the similarity score even if the sequences align only in fragments. As many proteins are modular in nature, they are often related to each other only in local domains rather than over their entire sequence length. Attempting to align two such sequences globally would miss the locally occurring significant similarity. The global alignment algorithm of Needleman and Wunsch tries to align two sequences from one end to the other. Both algorithms introduce gaps wherever necessary to optimize the alignment. Both made a breakthrough in the early days of computational biology. The creation of suitable substitution matrices augmented the development of bioinformatics. A substitution matrix is a lookup table for scoring the amino acid or nucleotide substitutions that are seen between related sequences. During the nineties of the last century, the number of sequences deposited in GenBank mounted to several millions. A fast heuristic sequence comparison algorithm was needed in place of the original reliable but slow algorithms. Thus BLAST (Basic Local Alignment Search Tool) was born. Altschul and others created this ultra fast heuristic alignment and scoring tool. Immediately bioinformatics took rapid strides towards gene annotation and rapid sequence retrieval. Now BLAST has its own family of programs that perform various tasks.
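The difference between global and local alignment can be made concrete with a small dynamic programming sketch. The code below computes only the optimal score, not the alignment itself. In place of a real substitution matrix it assumes a toy scoring scheme (match +1, mismatch -1, gap -1), and the function names and test sequences are chosen purely for illustration.

```python
def score_pair(a, b, match=1, mismatch=-1):
    """Toy stand-in for a substitution matrix lookup."""
    return match if a == b else mismatch


def global_score(s, t, gap=-1):
    """Needleman-Wunsch style score: align s and t from end to end."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = i * gap  # leading gaps in t
    for j in range(1, cols):
        m[0][j] = j * gap  # leading gaps in s
    for i in range(1, rows):
        for j in range(1, cols):
            m[i][j] = max(
                m[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),
                m[i - 1][j] + gap,
                m[i][j - 1] + gap,
            )
    return m[-1][-1]


def local_score(s, t, gap=-1):
    """Smith-Waterman style score: best-scoring pair of sub-sequences."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            m[i][j] = max(
                0,  # a local alignment may restart anywhere
                m[i - 1][j - 1] + score_pair(s[i - 1], t[j - 1]),
                m[i - 1][j] + gap,
                m[i][j - 1] + gap,
            )
            best = max(best, m[i][j])
    return best


# Two made-up sequences sharing only the central fragment GATTACA.
a = "TTTGATTACATTT"
b = "CCCGATTACACCC"
global_result = global_score(a, b)
local_result = local_score(a, b)
```

With these two sequences the local score stays high because the shared fragment is scored on its own, whereas the global score is dragged down by the unrelated flanks. This is the situation described above for proteins that share a single domain.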
Multiple Sequence Alignment
The complexity of pair-wise alignment is tiny compared with the intricacy of aligning several sequences together. Aligning several related sequences has set a new trend in the study of phylogeny. The branch of molecular biology envisaged by Linus Pauling has been realized and is labeled molecular phylogeny. The following information can be extracted from sequences on multiple alignment:
• The conserved (unchanged) portions of the sequences become evident.
• Conserved regions may be secondary structures like alpha helices or beta strands.
• The residues that constitute the active site of an enzyme, the binding site for ligands, the attachment sites of hetero atoms and the transmembrane regions can be identified.
• Residues that have mutated together although set apart in the sequence may be identified.
• Assembly of contigs into a single sequence is possible (genome assembly).
• Multiple sequence alignment is the initial step in the phylogenetic analysis of sequences.
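Once several sequences have been aligned, spotting the conserved portions is a matter of scanning the alignment column by column. The sketch below assumes the alignment is already given as equal-length strings with "-" for gaps. The function name and the three aligned fragments are hypothetical.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all sequences share one residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        # A column is conserved if it holds a single residue and no gap.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved


# Made-up aligned protein fragments.
aligned = [
    "MKV-LA",
    "MKVGLA",
    "MRV-LA",
]
columns = conserved_columns(aligned)
```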
Phylogenetic analysis
Organisms acquire new genes when they meet a new environment. New genes help the individual to survive better in the new environment. New genes are created from already existing genes by a process of duplication and variation. Some organisms borrow new genes laterally from other species. The members of a gene family that have arisen by duplication are related to each other as siblings. These are homologous sequences, but they are specifically named paralogous genes because they exist in the same genome. Homologous genes that are distributed in several species and yet have a common origin are called orthologous genes. Molecular phylogeny relies upon counting the number of mutational steps that have occurred between sequences to measure their phylogenetic distance. The number of mutations also serves as a kind of molecular clock to calculate the time that has passed between the common ancestor sequence and the present sequence. There are two major approaches to deciding the phylogenetic distance between sequences: the phenetic and the cladistic. The phenetic approach is straightforward and fast but artificial. The cladistic approach is computationally intensive but natural. The cladistic approach gives different weight to amino acid substitutions based upon the number of codon changes needed to go from the ancestral state to the current state. Maximum parsimony and maximum likelihood are the two cladistic algorithms used in phylogenetic analysis.
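Counting the differences between aligned sequences, as in the distance-based (phenetic) approach, can be sketched in a few lines. The code below computes the proportion of differing sites and, as one common correction for multiple substitutions at the same site, the Jukes-Cantor distance for nucleotides. The function names and the three short sequences are assumptions for illustration, and building a tree from the resulting distances is left out.

```python
import math


def p_distance(a, b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected nucleotide distance: -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {(m, n): p_distance(seqs[m], seqs[n]) for m in seqs for n in seqs}


# Made-up aligned sequences from three hypothetical species.
species = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTACGA",
    "sp3": "TCGAACTA",
}
matrix = distance_matrix(species)
```

In this toy data sp1 and sp2 differ at one site out of eight, so they would be joined first by any distance-based clustering method.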
Genomics
At present (2007) there are nearly 1000 completed genomes deposited in GenBank. Genomes of model organisms like mouse, rat, Drosophila, fish, chimpanzee, sea urchin, worm, yeast, Neurospora, Arabidopsis, rice and poplar, and of several hundred bacteria, archaea and viruses, are now available. Gene expression data, better known as microarray experiment data, are also accumulating in the public databases. Genomic data analysis has three perspectives: structural genomics, comparative genomics and functional genomics. The genome is the complete set of information on the basis of which a zygote develops and grows into an adult organism. If bioinformatics is what it is believed to be, it should decipher the coded genetic information that turns a single diploid cell into a complete organism.
The objective of genome sequencing is not only to unravel the mysteries that shroud developmental biology but also to improve crops and to mitigate human suffering from disease. "If only I had the opportunity to have entire human genome in my computer I wouldn't have spent 7 hard years to locate a single gene that was responsible for Cystic fibrosis disorder. It would have taken only a few hours to do so," said Collins, the Director of the Human Genome Project, before the project was completed.
Structural genomics
Structural genomics tries to locate the precise position of the genes and their regulatory elements on the genome. Before genome sequences were available, scientists relied upon genetic mapping to locate a gene. Neither genetic mapping (linkage mapping) nor physical mapping (RFLP, SINE or LINE repetitive sequence position mapping) was as precise as the genome sequence itself. The genome sequence is the ultimate gene map because it places each gene at its exact nucleotide. Structural genomics aims at mapping genes, their regulatory elements, exons, introns, poly A tail start points, pseudogenes, paralogs, conserved non-coding regions and so on onto the genome.
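A first, very rough approximation to locating genes on a sequence is to look for open reading frames, stretches that begin with the start codon ATG and run in frame to a stop codon. The sketch below scans the three forward reading frames only. It ignores the reverse strand, introns and regulatory signals, so it is an illustration under simplifying assumptions and not a gene finder. The function name and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) of ATG...stop stretches on the given strand.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Made-up sequence with one short reading frame.
sequence = "CCATGAAATTTTAGGG"
orfs = find_orfs(sequence)
```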
Functional genomics
Assigning functions to genes and studying the regulation of their expression under different situations is called functional genomics. The recently introduced microarray analysis of the entire genome under varied experimental conditions has revolutionized our understanding of gene expression. In a single experiment it is now possible to study the expression levels of 10000 genes under varied physiological conditions. Microarray data for yeast gene expression under 600 different sets of physiological conditions are available from the Stanford Microarray Database, and this is only a fraction of the data available there. These data allow biologists to compare gene expression levels in the diseased state and the normal state. Comparison of cancer cells with normal cells would reveal the genes that are involved in the cancer state. Genes that are under expressed, over expressed, co-expressed or contra-expressed can be clustered and studied in detail in the laboratory. Computer aided drug designers rely upon these analyses to find suitable drug target proteins.
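The comparison of expression levels between the normal and the diseased state is usually expressed as a log2 ratio. The sketch below labels each gene as over expressed, under expressed or unchanged against a fold-change threshold. The gene names, the intensity values and the two-fold threshold are all made up for illustration, and real microarray analysis also involves normalization and statistical testing.

```python
import math


def classify_expression(normal, diseased, threshold=1.0):
    """Label genes by log2(diseased / normal); threshold 1.0 means two-fold."""
    labels = {}
    for gene in normal:
        log_ratio = math.log2(diseased[gene] / normal[gene])
        if log_ratio >= threshold:
            labels[gene] = "over expressed"
        elif log_ratio <= -threshold:
            labels[gene] = "under expressed"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical spot intensities for three genes.
normal_cells = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}
cancer_cells = {"geneA": 400.0, "geneB": 50.0, "geneC": 55.0}
labels = classify_expression(normal_cells, cancer_cells)
```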
Comparative genomics
Comparison of two or more genomes with each other is a major study in bioinformatics. Genome level comparison of a couple of species supplies information such as:
• Genome evolution
• Differences in the metabolic pathways
• What distinguishes parasitic biology from saprophytic biology
• What distinguishes toxigenic strains from non-toxigenic strains
• What distinguishes drug resistant strains from susceptible ones
• What distinguishes tolerant organisms from intolerant ones
Comparing two closely related genomes with a distantly related genome has revealed several highly conserved but non-coding regions of the genome. Genomes are compared at several levels. One has to start with comparison at the nucleotide level. Dinucleotide frequency, codon preference, CpG islands and so on will reveal interesting differences between the organisms. Comparison at the gene level would reveal the evolution of genes and gene organization, gene synteny, chromosomal rearrangement and so on. Comparing gene regulatory elements will reveal the operon systems. Finally one may go up to the level of comparing the genes related to various metabolic functions and their gene ontology.
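Nucleotide-level comparison starts from simple counts. The sketch below tallies dinucleotide frequencies and computes the observed/expected CpG ratio commonly used when looking for CpG islands, count(CG) x N / (count(C) x count(G)) for a sequence of length N. The function names and the short test sequences are assumptions for illustration.

```python
from collections import Counter


def dinucleotide_counts(dna):
    """Count every overlapping dinucleotide in the sequence."""
    dna = dna.upper()
    return Counter(dna[i:i + 2] for i in range(len(dna) - 1))


def cpg_observed_expected(dna):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G
    return dinucleotide_counts(dna)["CG"] * len(dna) / (c * g)


# Made-up fragments: one CpG rich, one with no CpG at all.
rich = cpg_observed_expected("CGCGCG")
poor = cpg_observed_expected("CCCAGGG")
```

Computing such values in a sliding window along two genomes is one simple way to see the compositional differences referred to above.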
Proteomics
'One gene, one enzyme' has become a sentence of historic importance, because one gene, several proteins is the reality in eukaryotic systems. The fact that a single gene can code for more than one protein through alternative splicing of exons has helped us solve the riddle of how a cell can possess three times as many proteins as genes. We were unable to match phenotypes with genotypes because we were stuck with genes alone. Indeed the proteins are the real stalwarts that manifest the phenotypes. The epigenetic mysteries become explainable. Proteomics is the documentation of every individual protein of a cell. It includes understanding the structure and function of the proteins. Just like genomics, proteomics has three perspectives. Structural proteomics deals with constructing protein structures. Functional proteomics is about the expression of proteins in a cell under varied conditions. Comparative proteomics deals with comparing the entire protein complement of two different cells, or of one cell under different conditions. In addition to these perspectives, the study of protein-protein interaction has become a major interest in biology.
Protein function prediction
The volume of protein sequence data and protein structure data is impressive. Though UniProt prides itself on being a well curated protein database, 80 percent of the proteins in a given genome are still only hypothetical or putative. Assigning functions to the hypothetical proteins is a major task of bioinformatics.
Structure prediction
Protein structures are elucidated by X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Crystallography is possible only after a particular protein has been fully purified and made into a crystal. As transmembrane proteins are not amenable to purification and crystallization, the number of entries for transmembrane proteins in the Protein Data Bank is limited. Unfortunately most of the proteins implicated in human diseases are membrane proteins, so there is an urgent need to construct protein structures from sequence knowledge. Without the structure of the target protein, computer aided drug design is impossible. A multi-billion opportunity awaits the person capable of modeling membrane proteins theoretically.
Protein interaction
Proteins function in teams. Proteins are capable of self assembly. Many proteins assemble themselves into a single entity and bring about a certain cell function cooperatively. How proteins recognize their partners and interact precisely is a fascinating study. Though yeast two-hybrid experiments are now a widely used technique for studying protein-protein interaction, bioinformatics methods have greater scope in the near future.
Metabolomics
Nowadays an increasing amount of interest is shown in sequencing the genomes of extremophilic microorganisms. Many archaea are extremophiles, capable of living at temperatures near the boiling point of water and at pressures of hundreds of atmospheres. The genes and proteins of such organisms are extremely useful in biotechnology. It may be recalled how the discovery of Taq polymerase, an enzyme derived from the thermophilic bacterium Thermus aquaticus, revolutionized genetic engineering and biotechnology through PCR. Bioinformatics skill can help a person to reconstruct the entire metabolic pathway of an organism from its genomic sequence alone. Predicting the entire set of metabolic pathways from genomic sequences is now possible thanks to the metabolic pathway database of KEGG. Comparative genomics reveals only one fourth of the information. Proteomics supplies another fifty percent, and only metabolomics completes the information. While the former two are static, the latter captures the dynamics of the cell. In other words, metabolomics fulfills the demands of bioinformatics.
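The idea of reconstructing a pathway from a genome can be sketched as a lookup: each step of a reference pathway is tied to an enzyme, and the pathway is considered present to the extent that the annotated genome contains those enzymes. The reference table below is a hand-made, abridged list of a few glycolytic enzymes with their EC numbers, written for this illustration. It is not taken from KEGG, and the annotation set for the organism is hypothetical.

```python
# Abridged, hand-made reference: (enzyme name, EC number) for a few
# steps of glycolysis. Not a complete pathway and not a KEGG export.
GLYCOLYSIS_STEPS = [
    ("hexokinase", "2.7.1.1"),
    ("glucose-6-phosphate isomerase", "5.3.1.9"),
    ("6-phosphofructokinase", "2.7.1.11"),
    ("fructose-bisphosphate aldolase", "4.1.2.13"),
    ("triose-phosphate isomerase", "5.3.1.1"),
    ("pyruvate kinase", "2.7.1.40"),
]


def reconstruct(pathway_steps, annotated_ec_numbers):
    """Split pathway steps into those found in the annotation and those missing."""
    found = [name for name, ec in pathway_steps if ec in annotated_ec_numbers]
    missing = [name for name, ec in pathway_steps if ec not in annotated_ec_numbers]
    return found, missing


# Hypothetical set of EC numbers assigned to the genes of one organism.
genome_annotation = {"2.7.1.1", "5.3.1.9", "2.7.1.40", "1.1.1.1"}
found, missing = reconstruct(GLYCOLYSIS_STEPS, genome_annotation)
```

Steps reported as missing are either genuinely absent from the organism or correspond to genes that have not yet been annotated, which is why such reconstructions are usually followed by a targeted sequence search.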
Systems Biology
The pinnacle of bioinformatics is the simulation of cellular processes inside a computer. Systems biology is the simulation of the cellular system in its totality by a computer system. A cell with all its metabolic complements, regulation, energetics and dynamics can be simulated only if we know its entire cell biology. Though simulating a human cell may be the ultimate goal, our current knowledge permits only the simulation of a minimal prokaryotic cell. A minimal cell is one which is capable of living independently and carrying out all its biological activities, such as growth, reproduction and adjustment to the environment, with the minimum number of genes. The smallest genomes, comprising only 500-odd genes, belong to a couple of parasitic bacteria. A truly independent cell might need at least 1000 genes. Once the minimal gene set is known, the true E-Cell can be realized in full. Cell simulation has enormous potential. It is a boon for biotechnologists. They can study cell expression and metabolic regulation and perform gene manipulation and expression experiments virtually. The virtual cell would predict the toxicological, pharmacodynamic and pharmacokinetic properties of drugs without even touching a mouse or rat in the laboratory. Animal rights activists could rest easy.
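A whole-cell simulation is far beyond a short example, but its smallest building block, a single enzyme reaction followed over time, can be sketched. The code below integrates the Michaelis-Menten rate law dS/dt = -Vmax x S / (Km + S) with simple Euler steps. The parameter values are arbitrary, the function name is hypothetical, and the choice of this rate law and this integration method is an assumption made for illustration.

```python
def simulate_michaelis_menten(s0, vmax, km, dt=0.01, steps=1000):
    """Euler integration of substrate S converted to product P by one enzyme."""
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for step in range(1, steps + 1):
        rate = vmax * s / (km + s)  # Michaelis-Menten rate law
        s -= rate * dt
        p += rate * dt
        trajectory.append((step * dt, s, p))
    return trajectory


# Arbitrary parameters: initial substrate 1.0, Vmax 1.0, Km 0.5.
trajectory = simulate_michaelis_menten(s0=1.0, vmax=1.0, km=0.5)
final_time, final_s, final_p = trajectory[-1]
```

A cell model in the sense described above couples thousands of such equations, together with the regulation that changes their parameters.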
Bioinformatics spin off
Though the main objectives of bioinformatics are the storage, maintenance, update and supply of biological data on the one hand, and the extraction of information and its conversion into biological knowledge on the other, it has generated several useful spin off techniques for biologists. Several bioinformatics tools have relieved biologists from the drudgery of repetitive operations and trial and error methodologies:
• Identifying disease target proteins from microarray data or from comparative genomics studies
• Primer designing: the short flanking sequences of a gene which are used as primers for the PCR reaction can be predicted with precision
• Using disease markers for disease diagnosis
• Selection of suitable plasmid vectors and restriction enzymes for gene cloning
• Prediction of the antigenic sites of a protein for developing vaccines
• Distinguishing strains and species through phylogenetic analysis
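Primer designing, listed above, can be illustrated with a very small sketch: the forward primer is taken from the start of the target region, the reverse primer is the reverse complement of its end, and a rule-of-thumb melting temperature is estimated with the Wallace rule, Tm = 2 x (A+T) + 4 x (G+C) degrees C, which is meant for short oligonucleotides only. The function names and the template are hypothetical, and real primer design also checks for hairpins, dimers and specificity.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def wallace_tm(primer):
    """Rule-of-thumb melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def flanking_primers(template, start, end, length=18):
    """Forward primer from the start of template[start:end], reverse from its end."""
    region = template[start:end].upper()
    forward = region[:length]
    reverse = reverse_complement(region[-length:])
    return forward, reverse


# Made-up template; very short primers are used only to keep the example readable.
template = "ATGCCCGGGTAA"
forward, reverse = flanking_primers(template, 0, len(template), length=3)
```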
Computer Aided drug design
The convergent point of bioinformatics is knowledge based drug discovery. Every year 10,000 new chemicals are being sent for approval by FDA. Only 5 percent of them are rejected. There are 75000 drug molecules in the market. More than a million natural substances are known to exist in nature. Why, then, are there so few promising drugs that are specific and devoid of side effects? The National Cancer Institute (US) has screened 10,000 natural substances every year for the past decades on 60 different types of cancer cell lines and come up with just a handful of substances that really worked on them. These approaches are blind and irrational. Bioinformatics offers rational, knowledge based drug design. Computer aided drug design has the following phases:
1. Identification of the disease target gene (protein): microarray based identification, comparative genomics, consulting the OMIM database and extensive literature mining.
2. Validation of the target: knock out experiments, SNP analysis.
3. Constructing the 3D structure of the target protein: homology modeling or the classical crystallography approach.
4. Locating the ligand interacting site on the target molecule, employing an appropriate computational tool.
5. De novo lead construction: based on the receptor site, the lead is constructed by either the link or the grow method.
6. QSAR analysis, using the homologation or bioisosterism approach, to convert an already known lead into a more potent and safer one.
7. Screening the lead candidate for ADME/TOX properties and for compliance with Lipinski's rules.
8. Docking the selected lead onto the target and fine tuning the structure further if needed.
9. Asking the chemist to synthesize the drug or prodrug.
10. Going for clinical trials.
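The check against Lipinski's rules in phase 7 is easy to express in code once the molecular descriptors are known. The sketch below assumes the descriptors have already been computed elsewhere and applies the usual rule-of-five limits: molecular weight at most 500 Da, log P at most 5, at most 5 hydrogen bond donors and at most 10 hydrogen bond acceptors, with one violation commonly tolerated. The function names and the two sets of descriptor values are hypothetical.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """List which of Lipinski's rule-of-five limits a compound breaks."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("log P > 5")
    if h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


def passes_lipinski(mol_weight, log_p, h_donors, h_acceptors, max_violations=1):
    """A lead is usually kept if it breaks no more than one rule."""
    found = lipinski_violations(mol_weight, log_p, h_donors, h_acceptors)
    return len(found) <= max_violations


# Hypothetical descriptor values for a small lead and a bulky one.
small_lead_ok = passes_lipinski(180.2, 1.2, 1, 4)
bulky_lead_ok = passes_lipinski(900.0, 6.5, 7, 12)
```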
Conclusion
Like any other branch of science, bioinformatics started with humble objectives and a cold reception. But unlike other scientific disciplines, it has soared to unimaginable heights. It is a kind of science totally different from others in viewing a problem from the whole to the part, rather than from the part to the whole. Its Gestalt view of the problem and its parentage in several branches of science are unique. Molecular biologists, pathologists, biologists, computer scientists, statisticians, mathematicians, biophysicists and biochemists have joined hands in the development of bioinformatics. Biology is no longer the same. A biologist who leaves out bioinformatics is severely handicapped. The future battles against human suffering and for the conservation of nature are going to be fought with the weapons of bioinformatics.