
Note on The COG database

S. Saengamnatdej August 15, 2009

COG, which stands for Clusters of Orthologous Groups of proteins, is a database that classifies predicted proteins into groups of orthologues. Orthologues are homologous genes derived by vertical descent from a single ancestral gene in the last common ancestor of the compared species; they typically have the same function and domain architecture. COGs can therefore be used to annotate proteins in a new genome.

VERSIONS
Current version (Figure 1)
Initial version (reached through the link on the current version page; Figure 2)

Figure 1

Current version of the COG database

Figure 2

Initial version of the COG database

FEATURES
1. 4873 COGs as reported in [1] (the database began with 720 COGs, then grew to 860 and to 2091 [2]; currently 3307 COGs [3]). These include groups with known functions (e.g. chemotaxis proteins), predicted functions (e.g. predicted extracellular nucleases), and uncharacterized functions.
2. 138,458 proteins [1] (75% of the 185,505 predicted proteins); currently 192,987 proteins [3].
3. 66 genomes of prokaryotes and unicellular eukaryotes. The prokaryotic set started from 5 genomes, then grew to 6, to 43 and, now, 63; the unicellular eukaryotes started with S. cerevisiae alone and now number 3. COG coverage of most genomes is approaching saturation (~80% of the genes of most free-living prokaryotes belong to COGs).
4. 1% conserved phyletic patterns
5. Many new microbial genomes are being added (see Figure 3).

Figure 3

The upcoming microbial genomes.

6. KOGs (eukaryotic orthologous groups)
6.1) Predicted orthologues for seven eukaryotic genomes: 3 animals (nematode, Caenorhabditis elegans; fruit fly, Drosophila melanogaster; and human, Homo sapiens), 1 plant (thale cress, Arabidopsis thaliana), 2 fungi (Saccharomyces cerevisiae; Schizosaccharomyces pombe), and 1 parasite (Encephalitozoon cuniculi).
6.2) 4852 clusters
6.3) 59,838 proteins (54% of 110,655 gene products)
6.4) 20% conserved phyletic patterns (probably due to the small number of included eukaryotic genomes)
6.5) KOG coverage is still far from saturation.
6.6) The upcoming eukaryotic genomes are Oryza sativa (rice), Anopheles gambiae (mosquito), Pan troglodytes (chimpanzee), Canis familiaris (dog), Mus musculus (mouse), Rattus norvegicus (rat), and Ascomycota genomes including Magnaporthe grisea and Neurospora crassa.
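
The coverage percentages quoted for COGs and KOGs follow directly from the protein counts given in this note. A minimal check in Python, using only the figures quoted above (the function name coverage_percent is illustrative):

```python
# Protein counts quoted in this note (attributed to reference [1]).
COG_PROTEINS, COG_TOTAL = 138_458, 185_505
KOG_PROTEINS, KOG_TOTAL = 59_838, 110_655


def coverage_percent(in_clusters: int, total: int) -> float:
    """Percentage of predicted proteins that belong to a cluster."""
    return 100.0 * in_clusters / total


cog_cov = coverage_percent(COG_PROTEINS, COG_TOTAL)
kog_cov = coverage_percent(KOG_PROTEINS, KOG_TOTAL)
print(f"COG coverage: {cog_cov:.1f}%")  # rounds to the 75% quoted above
print(f"KOG coverage: {kog_cov:.1f}%")  # rounds to the 54% quoted above
```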

CONSTRUCTION
1. Mask the low-complexity and predicted coiled-coil regions of the proteins.
2. Use the gapped BLAST programme for an all-against-all protein sequence comparison.
3. Detect the proteins with consistent BeTs (genome-specific best hits).
4. Group the proteins to form COGs. (Each COG is required to include proteins from at least three sufficiently distant species.)
5. Manually split multidomain proteins (identified by RPS-BLAST) into their component domains.
6. Manually curate and annotate the proteins.
7. Other groups
7.1) Proteins that consist solely of widespread, "promiscuous" domains (e.g. SH2, SH3, WD40 repeats or TPR repeats) and do not show clear-cut orthologous relationships are assigned to Fuzzy Orthologous Groups (FOGs).
7.2) Groups from two genomes (TWOGs), as a preliminary group
7.3) Lineage-specific expansions (LSEs) of paralogues, as a preliminary group
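
Steps 3 and 4 can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the actual COG pipeline: a BeT is treated here as a mutual best hit, a "triangle" is three proteins from three different genomes linked pairwise by such hits, and triangles that share at least two proteins are merged into one group. The protein names, genome labels, and the best_hit table are toy data invented for the example.

```python
from itertools import combinations

# Toy data (hypothetical): best_hit[(protein, genome)] is the best BLAST hit
# of `protein` in `genome`. The genome label is the part before the colon.
best_hit = {
    ("A:p1", "B"): "B:p1", ("A:p1", "C"): "C:p1", ("A:p1", "D"): "D:p1",
    ("B:p1", "A"): "A:p1", ("B:p1", "C"): "C:p1", ("B:p1", "D"): "D:p1",
    ("C:p1", "A"): "A:p1", ("C:p1", "B"): "B:p1",
    ("D:p1", "A"): "A:p1", ("D:p1", "B"): "B:p1",
    ("A:p2", "B"): "B:p2", ("B:p2", "A"): "A:p2",
}


def genome_of(protein):
    return protein.split(":")[0]


def is_bet_pair(x, y):
    """True if x and y are each other's genome-specific best hit."""
    return (best_hit.get((x, genome_of(y))) == y
            and best_hit.get((y, genome_of(x))) == x)


def find_triangles(proteins):
    """All triples from three different genomes linked by mutual BeTs."""
    triangles = []
    for trio in combinations(sorted(proteins), 3):
        if len({genome_of(p) for p in trio}) < 3:
            continue
        if all(is_bet_pair(x, y) for x, y in combinations(trio, 2)):
            triangles.append(set(trio))
    return triangles


def merge_triangles(triangles):
    """Repeatedly merge groups that share at least two proteins."""
    groups = [set(t) for t in triangles]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= 2:
                groups[i] |= groups[j]
                del groups[j]
                changed = True
                break
    return groups


proteins = {p for p, _ in best_hit} | set(best_hit.values())
cogs = merge_triangles(find_triangles(proteins))
print(cogs)
```

In this toy data the two triangles share the pair A:p1 and B:p1 and so merge into a single four-genome group, while A:p2 and B:p2 are linked across only two genomes and therefore form no triangle.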

ASSIGNMENT OF A PROTEIN TO A GROUP
1. With the annotated protein from a new species, run a BLAST search against the COG database.
2. Use the COGNITOR programme to assign the protein to a COG.
3. If the protein cannot be assigned to any existing group, form a new COG.
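
The decision in steps 2 and 3 can be sketched as a simple vote over the query protein's genome-specific best hits. This is not the COGNITOR programme itself, only an illustration of the idea: the threshold of three BeTs is the figure this note quotes for COGNITOR, while the function name assign_to_cog, the COG identifiers, and the protein names are hypothetical.

```python
from collections import Counter


def assign_to_cog(bets, protein_to_cog, min_bets=3):
    """Return the COG supported by at least `min_bets` BeTs, else None.

    bets: dict mapping genome -> best-hit protein of the query in that genome.
    protein_to_cog: dict mapping protein -> COG identifier; proteins that
        belong to no COG are simply absent.
    A return value of None means no existing group fits (a candidate for a
    new COG). Ties between equally supported COGs are not resolved here.
    """
    votes = Counter(
        protein_to_cog[hit] for hit in bets.values() if hit in protein_to_cog
    )
    if not votes:
        return None
    cog, count = votes.most_common(1)[0]
    return cog if count >= min_bets else None


# Toy data (hypothetical identifiers).
protein_to_cog = {
    "eco:x1": "COG_X", "bsu:x1": "COG_X", "mja:x1": "COG_X",
    "sce:y1": "COG_Y",
}
query1 = {"eco": "eco:x1", "bsu": "bsu:x1", "mja": "mja:x1", "sce": "sce:y1"}
query2 = {"eco": "eco:x1", "bsu": "bsu:x1", "sce": "sce:y1"}

print(assign_to_cog(query1, protein_to_cog))  # three BeTs fall in COG_X
print(assign_to_cog(query2, protein_to_cog))  # only two BeTs agree
```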


APPLICATIONS
1. Functional annotation of newly sequenced genomes. The COGNITOR programme identifies orthologues without a phylogenetic analysis of the entire family of homologous proteins, which is time-consuming and error-prone; instead, it finds the protein in a target genome that is most similar to a given protein from the query genome. Three BeTs (genome-specific best hits) are used to assign the protein to a COG.
2. Identification of proteins that are found in one species but not in others. In some cases, an investigator may want to analyse the differences between two species to determine which proteins or gene products confer an organism's characteristics. This can be done with a phyletic pattern search over user-selected species.
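
A phyletic pattern search of the kind described in the second application amounts to a set filter: keep the clusters that contain every species marked "present" and none of the species marked "absent". The sketch below assumes each cluster is stored as a set of species codes; the cluster identifiers, the species codes, and the function name phyletic_search are made up for illustration and do not come from the database.

```python
def phyletic_search(patterns, present, absent):
    """Clusters found in every `present` species and in no `absent` species.

    patterns: dict mapping cluster id -> set of species codes in that cluster.
    """
    present, absent = set(present), set(absent)
    return sorted(
        cluster
        for cluster, species in patterns.items()
        if present <= species and not (absent & species)
    )


# Toy phyletic patterns (hypothetical cluster ids and species codes).
patterns = {
    "KOG_a": {"hsa", "cel", "dme"},
    "KOG_b": {"hsa"},
    "KOG_c": {"cel", "dme"},
    "KOG_d": {"hsa", "dme"},
}

# Clusters present in human ("hsa") but missing from the worm ("cel").
hits = phyletic_search(patterns, present=["hsa"], absent=["cel"])
print(hits)
```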

EXAMPLE
After clicking on the link to the eukaryotic clusters, you are presented with a new page. On this page, click on the programme "Kognitor", paste in your named protein sequence (of any length) in FASTA format, and click the compare button. The BLAST result page will appear. Then click on the linked KOG entry (KOG3378); you will be presented with a page like the one shown in Figure 4.


Figure 4

The result page for KOG entry (KOG3378)

This dendrogram should not be construed as a phylogenetic tree because of its crudeness. The result shows that there are 11 human paralogues, 24 worm paralogues, and 1 fruit fly paralogue. In addition, the vertebrate and nematode globins comprise co-orthologous sets.

REFERENCES
1. Tatusov, R. L. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41.
2. Tatusov, R. L. et al. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28(1), 33–36.
3. http://www.ncbi.nlm.nih.gov/COG (accessed August 2009).
