PATHEMA-Burkholderia
A CLADE-SPECIFIC BIOINFORMATICS RESOURCE CENTER
http://pathema.tigr.org
What is Pathema?
Pathema (http://pathema.tigr.org) is a NIAID Bioinformatics Resource Center designed to support bio-defense and infectious disease research.
Pathema provides detailed curation of six target pathogens:
→ Category A priority pathogens: Bacillus anthracis, Clostridium botulinum
→ Category B priority pathogens: Burkholderia mallei, Burkholderia pseudomallei, Clostridium perfringens, Entamoeba histolytica
What is Pathema?
• A customized suite of sophisticated bioinformatics software
• An extensive library of literature and manually curated data
• Comparative analysis tools customized to retrieve, display, and compute results relevant to ongoing bio-defense research
Overarching Goal: to provide core resources that will accelerate scientific progress on understanding, detecting, diagnosing, and treating Category A-C pathogens and other pathogens involved in new and re-emerging infectious diseases.
Differences from the CMR? (http://cmr.tigr.org)
Curated datasets related to biodefense & infectious disease research
Customized analysis & tool development
Most up-to-date genomic data
3rd party annotations
Community data submission, integration, and display
Will include sequence data from non-genome projects
Distinct clade-specific databases & web resources
Community Outreach Strategic Plan
Phase I: Subjective Assessment
→ Who is the constituency?
→ What are the information needs?
→ What are the bioinformatics needs?
Phase II: Design and Refinement
→ Prototype development
→ Continued community correspondence
Phase III: Performance Assessment & Refinement
→ Online usability study
→ Workshops and training
Entamoeba histolytica (1)
All Burkholderia (25)
All Bacillus (16) B. anthracis (10) B. cereus (3) B. thuringiensis (2) B. subtilis
All Clostridium (15) C. perfringens (4) C. botulinum (6) C. difficile C. thermocellum C. acetobutylicum C. tetani C. butyricum
B. mallei (8) B. pseudomallei (10) B. thailandensis B. cenocepacia B. sp. B. phages (4)
Workshop Overview
Part 1: Prokaryotic Annotation Methodologies
• Gene Model & Functional Curation
Introduction to Pathema and Curation
• Sequence Sources & Genomic Datatypes
• Pathogenicity & Literature curation
Part 2: Pathema Navigation and Analysis Tools
• Advanced Data Mining
• Genome and Comparative Analysis Tools
PATHEMA-Burkholderia Data Curation
Lauren Brinkac
lbrinkac@jcvi.org
Pathema-Burkholderia Home Page
Burkholderia Update Schedules
Pathema-Burkholderia Home Page
Community Registration
Participation and access to community curation tool.
Submission of experimentally characterized proteins.
Participation in future usability studies.
Community recognition.
Pathema-Burkholderia Home Page
Community Survey Results
General Survey Results
Top Burkholderia Research Interests
→ Gene/Annotation
→ Virulence Factors
→ Protocols
→ DNA/Clone/Strain information
Top Pathema Resource Interests
→ Identify virulence factors
→ Identification of biochemical pathways
→ Proteomics
→ Genetic differences between species/strains
Pathema-Burkholderia Home Page
Detailed Strain Information
BEI Resources
Biodefense & Emerging Infections Research Resources Repository (BEI)
Established by the National Institute of Allergy and Infectious Diseases (NIAID) to provide reagents and information for studying Category A, B, and C priority pathogens and emerging infectious disease agents to facilitate research and product development.
BEI Resources is managed under contract by American Type Culture Collection (ATCC) to acquire, generate, authenticate, store, and distribute these materials to the scientific community.
Available Burkholderia Genomes
Sequence Sources & Curation Types
Internally Sequenced Genomes
→ Automated - automated assignment of protein name, gene symbol, EC #, GO terms
→ Mapped - genome features/annotation transfer from a closely related manually curated reference genome
→ Manual - manual curation of gene models and gene functional predictions
- Sequenced internally
- Sequenced externally (3rd party annotation)
- Community curation
Externally Sequenced Genomes
→ JCVI Automated Annotation
Available Burkholderia Genomes
Genome-specific Home Page
Genome-specific Home Page
Region View
Sequence & Gene Retrieval
Functional Curation
What is Annotation?
Webster’s definition of “annotation”
→ “to make or furnish critical or explanatory notes or comments”
What this includes for genomics
→ gene finding
→ functional characteristics of gene products
→ physical characteristics of gene/protein/genome
→ overall metabolic profile of the organism
Steps in the annotation process …
Microbial Annotation Overview Gene Finding Glimmer
Homology Searches BLAST-Extend-Repraze Hidden Markov Models Paralogous Families
Functional Assignments Automated Manual Mapped
ORF Management Overlaps InterEvidence
Data Availability
Gene Finding using Glimmer
Gene Locator and Interpolated Markov Modeler
Glimmer uses Interpolated Markov Models (IMMs) to predict which ORFs in a genome are real genes.
Using Glimmer is a two-part process:
→ Train Glimmer with genes from the organism that was sequenced.
→ Run the trained Glimmer against the entire genome sequence.
Interpolated Markov Model (IMM)
• Computes the conditional probability that nucleotide X occurs following oligomer Y (for all combinations up to 8-mers) in a set of known genes.
• Used to score new sequences for representative genes.
ATGCGTAAGGCTTTCACAGTATGACAGTACACTGACGA
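The conditional-probability idea on this slide can be sketched in a few lines. This is a minimal illustration only: it assumes a single fixed context length (`order`) and add-one smoothing, whereas an IMM interpolates across context lengths, which is not reproduced here. The names `train_markov`, `cond_prob`, and `log_score` are hypothetical and are not part of Glimmer.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(genes, order=3):
    """Count how often each nucleotide follows each oligomer of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in genes:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def cond_prob(counts, context, nucleotide, pseudocount=1.0):
    """P(nucleotide | context); smoothing makes unseen contexts uniform (an assumption)."""
    seen = counts.get(context, {})
    total = sum(seen.values()) + pseudocount * len(BASES)
    return (seen.get(nucleotide, 0) + pseudocount) / total

def log_score(counts, seq, order=3):
    """Sum of log conditional probabilities; higher means closer to the training genes."""
    return sum(
        math.log(cond_prob(counts, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Training set: the single example sequence shown on the slide.
known_genes = ["ATGCGTAAGGCTTTCACAGTATGACAGTACACTGACGA"]
model = train_markov(known_genes)
```

A candidate ORF would then be scored with `log_score(model, orf_sequence)` and compared against other candidates; how scores are turned into gene calls is left out of this sketch.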
Gene Finding with Glimmer: Training Glimmer
Identification of ORFs with significant homology to other organisms using BLAST searches.
Identification of ORFs 500-1000 nucleotides in length, depending on GC content, and elimination of long overlapping ORFs.
Glimmer IMM
Gene Prediction
Gene Finding with Glimmer: Scoring ORFs & Calling Genes
1) Find ORFs above length cutoff
[Figure: ORFs drawn in the six reading frames +3, +2, +1, -1, -2, -3]
= ORFs meeting minimum length
= laterally transferred DNA
2) Scoring ORFs: Green ORFs scored well to the model, red ORFs scored less well. The green ORFs are chosen by Glimmer as the set of likely genes.
3) Calling Genes: Glimmer numbers the genes sequentially from the beginning of the DNA molecule on which they reside. Genes missed by Glimmer and overlapping genes are resolved by post-Glimmer processes, which will be discussed on later slides.
[Figure: called genes labeled ORF00001, ORF00002, ORF00003, and ORF00004, with one region marked “?”]
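Step 1 above (finding ORFs above a length cutoff in all six frames) can be illustrated with a short sketch. It assumes an ORF is a stop-free run of codons that ends at a stop codon, and that `min_len` is supplied by the caller; neither the function names nor the coordinate convention come from Glimmer.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def open_runs(seq, offset, min_len):
    """Yield (start, end) of stop-free codon runs of at least `min_len` nt.

    Coordinates are 0-based and end-exclusive, and exclude the stop codon.
    Runs that reach the end of the sequence without a stop are not reported.
    """
    run_start = offset
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i - run_start >= min_len:
                yield run_start, i
            run_start = i + 3

def find_orfs(seq, min_len):
    """Return (frame, start, end); minus-frame coordinates refer to the reverse complement."""
    seq = seq.upper()
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in open_runs(text, offset, min_len):
                hits.append((f"{strand}{offset + 1}", start, end))
    return hits
```

For example, `find_orfs("ATGAAACCCGGGTTTTAA", 12)` reports one run in frame +1. Scoring those candidates against the trained model is a separate step.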
Blast-extend-repraze (BER) searches
Glimmer
Protein Translation
BLAST query against an internal non-redundant amino acid database (NIAA: GenBank, SwissProt, PIR, TIGR)
Significant hits stored in mini-database
300 bp extension on both ends
Smith-Waterman Alignment
Full length pairwise alignment
BER Extensions
The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence.
Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.
[Figure: ORFxxxxx with 300 bp extensions at end5 and end3; search protein aligned against match protein]
normal full length match
FS (!): similarity extending through a frameshift upstream or downstream into extensions
PM (*): similarity extending in the same frame through a stop codon
FS or PM? (?)
two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs
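The 300 bp extension and the "similarity extends outside the coding coordinates" signal can be expressed as two small helpers. This is a sketch under assumptions: coordinates are 0-based and end-exclusive, `end5` and `end3` may arrive in either order, and both function names are hypothetical rather than taken from the BER software.

```python
def extended_window(end5, end3, molecule_len, extension=300):
    """Return (start, stop) covering the ORF plus `extension` bp on each side.

    The window is clipped so it never runs off either end of the molecule.
    """
    low, high = sorted((end5, end3))
    return max(0, low - extension), min(molecule_len, high + extension)

def similarity_leaves_orf(end5, end3, aln_start, aln_stop):
    """True when the aligned region reaches outside the predicted coding coordinates."""
    low, high = sorted((end5, end3))
    return aln_start < low or aln_stop > high

# An ORF at 1000..1900 on a 5000 bp molecule is searched over 700..2200.
window = extended_window(1000, 1900, 5000)
```

A hit for which `similarity_leaves_orf` returns True would be a candidate for frameshift or point-mutation review; deciding between FS and PM still requires looking at the alignment itself.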
Sample BER alignment
Frameshifted alignment
What makes a good alignment?
Minimum of 35% identity - preferably higher
Full-length match (approx 70%)
Match to experimentally characterized protein
- we have a database storing accessions of proteins known to be experimentally characterized
- our tools highlight experimentally characterized proteins
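The three criteria above can be written as a simple check. The sketch below assumes the thresholds stated on the slide (35% identity, roughly 70% of the length), represents the characterized-protein database as a plain set of accessions, and uses made-up accession strings; the `Match` class and `alignment_problems` function are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Match:
    accession: str
    percent_identity: float   # 0-100
    fraction_covered: float   # 0-1, share of the query spanned by the alignment

def alignment_problems(match, characterized, min_identity=35.0, min_coverage=0.70):
    """Return the list of criteria the match fails; an empty list means a good alignment."""
    problems = []
    if match.percent_identity < min_identity:
        problems.append("identity below minimum")
    if match.fraction_covered < min_coverage:
        problems.append("not full length")
    if match.accession not in characterized:
        problems.append("match not experimentally characterized")
    return problems

# Hypothetical accessions, for illustration only.
characterized = {"EXAMPLE_0001"}
good = Match("EXAMPLE_0001", percent_identity=62.0, fraction_covered=0.95)
weak = Match("EXAMPLE_0002", percent_identity=28.0, fraction_covered=0.40)
```

Returning the failed criteria rather than a bare yes/no keeps the reason visible to a curator reviewing the evidence.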
CHAR: Experimentally Characterized Protein Database
A manually curated, experimentally characterized protein database which relates protein names and taxonomic classifications with scientific literature.
What does it include:
→ Controlled vocabulary describes the type of experimentation performed in each publication
→ All relevant annotation data types (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
→ All synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored
CHAR: Experimentally Characterized Protein Database
Integrated into our Manual Annotation Tool (MANATEE) via the protein accessions stored in our non-redundant database, PANDA.
→ Actively update and enter TCHAR entries during curation
→ Use curated entries as evidence supporting curation
- Reduction in transitive annotation
Curated data predominantly from prokaryotic organisms:
→ 21,000 protein entries
→ 1,517 organisms
→ approx 9,000 manually curated
Hidden Markov Models: HMMs
Definition: statistical model of the patterns of amino acids in a multiple alignment of proteins which share sequence and functional similarity
Based on curated multiple protein alignments
Built in-house (TIGRFAMs) and at PFAM
TIGRFAMs are assigned to a category (“isology type”) which is determined by the types of proteins used for the HMM
All proteins searched versus HMMs
Building HMMs
Collect proteins to be in the “seed” (same function/similar domain/family membership)
Generate and curate multiple alignment of seed proteins
Region of good alignment and closest similarity
Run HMM algorithm
Computes statistical probabilities for amino acid patterns in the seed
this step may need a few iterations
Search new model against all proteins
Choose “noise” and “trusted” cutoff scores based on what scores the “known” vs. “unknown” proteins receive.
HMMs: choosing cutoff scores
matches (seed members in bold on the original slide)    score
protein “definitely”         547
protein “absolutely”         501
protein “confident”          398
protein “safe bet”           376
protein “very confident”     365
protein “has to be one”      355
-------- trusted cutoff: 300 --------
protein “could be”           210
protein “maybe”              198
protein “not sure”           150
-------- noise cutoff: 100 --------
protein “no way”              74
protein “can’t be”            54
protein “not a chance”        47
proteins scoring above trusted (300) are considered part of the family modeled by the HMM
proteins scoring below noise (100) are not considered part of the family modeled by the HMM
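The cutoff logic can be summarized as a three-way classification. This sketch uses the example cutoffs from the slide (trusted 300, noise 100) as defaults; in practice each HMM carries its own cutoffs. How a score exactly equal to a cutoff is treated is an assumption here, and `classify_hit` is a hypothetical name.

```python
def classify_hit(score, trusted=300, noise=100):
    """Classify a protein's HMM score relative to the trusted and noise cutoffs."""
    if score > trusted:
        return "member"
    if score < noise:
        return "not a member"
    # Between the cutoffs (inclusive, by assumption): needs further evidence.
    return "possible member"

# Scores from the example slide, highest to lowest.
scores = [547, 501, 398, 376, 365, 355, 210, 198, 150, 74, 54, 47]
calls = [classify_hit(s) for s in scores]
```

With the slide's scores, the top six proteins fall above trusted, the middle three land between the cutoffs, and the last three fall below noise.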
HMM Isology Types
Equivalog: designed so that all members of the family being modeled and all proteins scoring above trusted share the exact same function.
Superfamily: This type of HMM describes a group of proteins which have full length protein sequence similarity and have the same domain architecture, but which do not necessarily have the same function.
Subfamily: This type of HMM describes a group of proteins which also have full length homology, and which represent more specific sub-groupings within a superfamily.
Domain: These HMMs describe a region of homology that is not required to be the full length of a protein. The function of the region may or may not be known.
(The above are just a sampling of the most commonly seen isology types; there are more types than those listed here.)
Pfam: Indicates that no TIGR isology type has yet been assigned to the Pfam HMM.
Annotation is attached to HMMs
TIGR00433
→ isology: equivalog
→ name: biotin synthase
→ EC: 2.8.1.6
→ gene symbol: bioB
→ TIGR role: 77 (Biotin biosynthesis)
→ GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis)
PF04055
→ isology: domain
→ name: radical SAM domain protein
→ EC: not applicable
→ gene symbol: not applicable
→ TIGR role: 703 (enzymes of unknown specificity)
→ GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)
Evaluating HMM scores
If the protein’s total score is...
→ ...above trusted: the protein is a member of the family the HMM models
→ ...below noise: the protein is not a member of the family the HMM models
→ ...in between noise and trusted: the protein MAY be a member of the family the HMM models
→ ...above trusted and some or all scores are negative: the protein is a member of the family the HMM models
[Figure: score axes running from -50 to 100 with markers labeled P, N, T, and Q]
Useful Protein Databases
Swiss-Prot
→ European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
→ all entries manually curated
→ annotation includes
- Literature references
- coordinates of protein features, e.g. active sites, signal peptides
- links to cross-referenced databases
- HMMs
- Enzyme Commission
Protein Information Resource
National Center for Biotechnology Information (NCBI)
→ protein and DNA sequences
→ taxonomy resource
Omnium
→ database that underlies TIGR’s CMR
→ contains data from all completed sequenced bacterial genomes
→ data is downloaded from the sequencing center
Enzyme Commission
→ categorized collection of enzymatic reactions
→ reactions have accession numbers indicating the type of reaction
Metabolic pathways: KEGG, MetaCyc, etc.
Boutique databases: Transport Classification database, MEROPS, etc.
Other searches
PROSITE Motifs
→ collection of protein motifs associated with active sites, binding sites, etc.
InterPro
→ Brings together HMMs (both TIGR and Pfam), Prosite motifs, and other forms of motif/domain clustering (Prints, Smart)
→ GO terms have been assigned to many of these
TmHMM
→ an HMM that recognizes membrane spans
→ a product of the Center for Biological Sequence Analysis
Signal P
→ potential secreted proteins
Lipoprotein
Other Searches/Information/Calculations
Molecular Weight/pI
RNAs
→ tRNAs are found using tRNAscan (Sean Eddy)
→ structural RNAs are found using BLAST searches
→ We are starting to implement Rfam, a set of HMMs modeling non-coding RNAs (Sanger, WashU)
GC content
→ for the whole genome and individual genes
terminators
DNA repeats (currently not predicted)
Operons (currently not predicted)
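Of the calculations listed above, GC content is the simplest to show. The sketch below computes it for a whole sequence and for individual genes given as coordinate pairs; ignoring ambiguous bases such as N and using 0-based, end-exclusive coordinates are assumptions, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

def gene_gc(genome, genes):
    """Map each gene id to its GC content; `genes` maps id -> (start, end)."""
    return {gene_id: gc_content(genome[start:end]) for gene_id, (start, end) in genes.items()}

# Toy genome and gene coordinates, for illustration only.
genome = "ATATGGCCGGCCATAT"
per_gene = gene_gc(genome, {"geneA": (0, 4), "geneB": (4, 12)})
```

Comparing a gene's value with the genome-wide value is one common way to spot regions of atypical composition.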
Functional Assignments
Assign annotation to each protein:
→ Protein name
→ Gene symbol
→ Comment
→ EC number
→ TIGR role
→ GO terms
Annotation Procedures
Automated
→ a software script combines several programs: Glimmer, BER and HMM searches, and AutoAnnotate
Manual
→ Automated procedure above plus manual curation of individual genes
Transferred (Mapped)
→ Software script used to map annotation onto a genome
AutoAnnotate: Assignment Hierarchy
Identifying information is assigned to the predicted coding region from the first source that applies:
1. equivalog-level HMM (score above trusted)
2. BER search results (full length protein matches (>80%) with high percent identity (>35%))
3. non-equivalog level HMM, when BER matches are hypothetical proteins and/or do not fulfill match criteria
4. “conserved hypothetical protein”, when BER matches are hypothetical proteins and there is no HMM hit
5. “hypothetical protein”, when there are no BER matches and no HMM hits
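The assignment hierarchy reads naturally as a chain of fallbacks. The sketch below is an interpretation of the slide, not AutoAnnotate's code: the function and parameter names are hypothetical, and the case of non-hypothetical BER matches that fail the criteria with no HMM hit is not specified on the slide, so treating it as "conserved hypothetical protein" is an assumption.

```python
def auto_name(equivalog_name=None, ber_name=None, other_hmm_name=None, has_ber_matches=False):
    """Pick a protein name from the strongest available evidence.

    equivalog_name:  name attached to an equivalog HMM scoring above trusted
    ber_name:        name of a BER match meeting the length/identity criteria
                     that is not itself a hypothetical protein
    other_hmm_name:  name attached to a non-equivalog HMM hit
    has_ber_matches: whether any BER match exists at all
    """
    if equivalog_name:
        return equivalog_name
    if ber_name:
        return ber_name
    if other_hmm_name:
        return other_hmm_name
    if has_ber_matches:
        return "conserved hypothetical protein"
    return "hypothetical protein"
```

For instance, `auto_name(ber_name="ribokinase", other_hmm_name="carbohydrate kinase, FGGY family")` returns the BER name because it sits higher in the hierarchy.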
Manual Annotation: Evaluating the Evidence
Visually inspect alignments
→ should be full length with at least 35% identity
Check HMM scores and isology type
Review Genome Properties analysis
→ pathways, complexes?
Check for operon structure or other information from neighboring genes.
→ presence of a gene in an operon can supplement weak similarity evidence
Are there transmembrane regions?
Is there a signal peptide?
Are there any motifs that might give a clue to function?
Is there a paralogous family?
Functional Assignment: Protein Annotation
Knowledge about function reflected in specificity of protein names
high confidence
→ “adenylosuccinate lyase”, purB, 4.3.2.2
general function, lacks specificity
→ “carbohydrate kinase, FGGY family”
- no gene symbol, partial EC number
family designation
→ “Cbby family protein”
homolog designation
→ “recA homolog”
hypotheticals
→ “hypothetical protein”
→ “conserved hypothetical protein”
“putative recA”
→ “putative ribokinase”
Translation disruptions
Electropherograms reviewed in-house to verify sequence
authentic frameshift, authentic point mutation, degenerate, truncation, deletion, insertion, interruption, fusion, fragment
Single Gene Annotation
MANATEE: Manual Annotation Tool Etc.
→ Web-based manual annotation tool for accessing & editing annotation data
manatee.sourceforge.net
The Annotation Engine Service
A free service which provides automated annotation and tools for analysis of another center’s prokaryotic sequence.
Two components:
1. Production of output from JCVI’s automated annotation pipeline - includes search results and automatically generated annotation in a MySQL database and associated files.
2. The manual annotation tool Manatee - an open source web-based interface for interacting with and editing annotation data.
Mapped Curation
Mapping based on whole genome analysis using nucmer and associated scripts
All genome features (gene models, RNAs, repeats, etc.) and annotation from a manually curated reference genome mapped onto the genome(s) of closely related organisms.
Unique genes are manually reviewed in the query genome
Advantage vs. gene-based mapping:
→ Positional context of the gene within the genome is taken into account, increasing the confidence that similar genes are orthologous.
→ Identification of genes unique to the query genome
Functional Assignment: TIGR Roles
TIGR bacterial roles were first adapted from Monica Riley’s roles for E. coli
Amino acid biosynthesis
Purines, pyrimidines, nucleosides, and nucleotides
Fatty acid metabolism
Biosynthesis of cofactors, prosthetic groups, and carriers
Central intermediary metabolism
Energy metabolism
Transport and binding proteins
DNA metabolism
Transcription
Protein synthesis
Protein fate
Regulatory functions
Signal transduction
Cell envelope
Cellular processes
Other categories
Unknown
Hypothetical
AutoAnnotate makes a first pass at assigning role, based on roles associated with HMMs or with match proteins.
Human annotator checks and adjusts as necessary.
Functional Assignment: Gene Ontology (GO)
GO is a set of controlled vocabularies
The vocabularies are used to describe 3 aspects of a gene product: what it does (molecular function), why it does what it does (biological process), and where it does what it does (cellular component)
What is a controlled vocabulary? a set of defined terms used to attach meaningful descriptions to objects given term always means the same thing
the “vehicle” controlled vocabulary
vehicle
  car
    sedan
      2-door sedan
      4-door sedan
    station wagon
    hatch-back
  truck
    pick-up truck
    dump truck
  motorcycle
GO
a community response to need for precise, consistent terminology in biology
Different biological processes described by single term:
Example:
- sporulation could refer to reproductive sporulation
- sporulation to survive adverse conditions
Enzyme Commission has standardized enzyme names, but no concerted effort to do so for other protein functions
With the advent of genomics, ambiguity across organisms and communities is problematic
In order to conduct meaningful comparisons and searches, genomic data must be organized using a consistent set of descriptors
The structure of GO The GO consists of three controlled vocabularies stored as
ontologies: - molecular function (what) - biological process (why) - cellular component (where)
The ontologies are made up of individual terms Each term has a name and a detailed definition Each term also has a unique id number - this makes the term
searchable by a computer. Each term is related to others in a “parent-child” relationship where
the child term is more specific than the parent term.
Example GO term
• ID number: GO:0004076
• Name: biotin synthase activity
• Definition: Catalysis of the reaction: dethiobiotin + sulfur = biotin.
• comment: none
• cross reference: EC:2.8.1.6
• synonyms: none
• parent term: sulfurtransferase activity (GO:0016783)
• relationship to parent: “is a”
Functional Assignment: Assigning GO terms Knowledge about function reflected in specificity of GO terms
Annotation evidence for 3 genes
Sample GO trees
Example Gene #1
- HMM for ribokinase
- characterized match to ribokinase
Example Gene #2
- HMM for kinase
- characterized match to glucokinase and fructokinase
Example Gene #3
- HMM for kinase
Function: catalytic activity, kinase activity, carbohydrate kinase activity, ribokinase activity, glucokinase activity, fructokinase activity
Process: metabolism, carbohydrate metabolism, monosaccharide metabolism, hexose metabolism, glucose metabolism, fructose metabolism, pentose metabolism, ribose metabolism
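One way to read the three example genes is that each gets the most specific function term that all of its evidence supports. The sketch below illustrates that with a small parent map built from the function terms on this slide. The tree shape is an assumption (indentation was lost from the slide), each term is given a single parent although real GO terms can have several, and `most_specific_shared_term` is a hypothetical helper.

```python
# Child -> parent, assuming the nesting implied by the slide's function tree.
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}

def path_to_root(term):
    """List the term followed by each of its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def most_specific_shared_term(terms):
    """Return the deepest term that equals or is an ancestor of every input term."""
    paths = [path_to_root(term) for term in terms]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    for term in paths[0]:          # walk upward from the first term
        if term in shared:
            return term
    return None

gene1 = most_specific_shared_term(["ribokinase activity"])
gene2 = most_specific_shared_term(["glucokinase activity", "fructokinase activity"])
gene3 = most_specific_shared_term(["kinase activity"])
```

Under these assumptions, gene #1 keeps the specific ribokinase term, gene #2 backs off to the carbohydrate kinase level because its matches disagree, and gene #3 stays at the general kinase level.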
Gene Model Curation
Ensuring that the predicted genes have the correct coordinates and that the set of predicted genes is complete and correct
→ Start site curation
→ overlap resolution (false positives)
→ missed genes (false negatives)
Gene Model Curation: Start site edits
What is considered:
- Similarity to match protein, both in BER and Paralogous Family - probably the most important factor.
- Ribosome Binding Site (RBS): a string of AG-rich sequence located 5-11 bp upstream of the start codon
- Start site frequency: ATG >> GTG >> TTG
3 possible start sites
RBS upstream of start
This ORF’s upstream boundary
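Two of the start-site considerations (an AG-rich stretch 5-11 bp upstream, and the ATG >> GTG >> TTG preference) can be turned into a simple ranking. This sketch deliberately leaves out similarity to match proteins, which the slide calls the most important factor. The `min_ag` threshold of 0.7 and the function names are assumptions for illustration.

```python
START_RANK = {"ATG": 0, "GTG": 1, "TTG": 2}

def ag_fraction_upstream(seq, start):
    """Fraction of A/G in the bases lying 5 to 11 bp upstream of `start` (0-based)."""
    window = seq[max(0, start - 11):max(0, start - 4)]
    if not window:
        return 0.0
    return sum(base in "AG" for base in window) / len(window)

def rank_start_sites(seq, candidates, min_ag=0.7):
    """Order candidate start positions: RBS-like upstream first, then by codon preference."""
    def key(pos):
        has_rbs = ag_fraction_upstream(seq, pos) >= min_ag
        codon_rank = START_RANK.get(seq[pos:pos + 3], len(START_RANK))
        return (not has_rbs, codon_rank, pos)
    return sorted(candidates, key=key)

# Toy sequence: GTG at position 12 (no RBS), ATG at position 29 preceded by AGGAGGA.
seq = "TTTTTTTTTTTTGTGCCCAGGAGGACCCCATGAAA"
ranked = rank_start_sites(seq, [12, 29])
```

In the toy sequence the ATG at position 29 ranks first because it has both the AG-rich upstream stretch and the preferred codon.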
Gene Model Curation: Overlaps and InterEvidence
Overlap analysis
When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither has evidence, other considerations such as presence in a putative operon and potential start codon quality are considered.
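The overlap rule can be sketched as a function that names the ORF to drop, or defers when the rule alone cannot decide. ORFs are represented here as plain dictionaries with 0-based, end-exclusive coordinates; that representation, the `has_evidence` flag, and the function names are assumptions for illustration.

```python
def overlaps(a, b):
    """True when the two ORFs share at least one base."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def orf_to_remove(a, b):
    """Return the id of the ORF to remove, or None when the pair needs further review.

    None is returned when the ORFs do not overlap, when both have evidence,
    or when neither does (operon context and start codon quality then apply).
    """
    if not overlaps(a, b):
        return None
    if a["has_evidence"] and not b["has_evidence"]:
        return b["id"]
    if b["has_evidence"] and not a["has_evidence"]:
        return a["id"]
    return None

orf1 = {"id": "ORF00001", "start": 100, "end": 700, "has_evidence": True}
orf2 = {"id": "ORF00002", "start": 650, "end": 1200, "has_evidence": False}
orf3 = {"id": "ORF00003", "start": 1150, "end": 1800, "has_evidence": False}
```

Here `orf_to_remove(orf1, orf2)` names ORF00002, while the ORF00002/ORF00003 pair is left for the secondary considerations.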
InterEvidence regions Areas of the genome with no genes and areas within genes without any kind of evidence (no match to another protein, HMM, etc.) are translated in all 6 frames and searched against niaa by blastx. Results are evaluated by the annotation team.
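Finding the gene-free areas that feed this search is a simple interval sweep. The sketch below covers only the "areas of the genome with no genes" case, not regions within genes that lack evidence; coordinates are assumed to be 0-based and end-exclusive, and `uncovered_regions` is a hypothetical name.

```python
def uncovered_regions(genome_len, genes, min_len=1):
    """Return (start, end) gaps of at least `min_len` bp not covered by any gene.

    `genes` is an iterable of (start, end) pairs, which may overlap.
    """
    gaps = []
    cursor = 0
    for start, end in sorted(genes):
        if start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if genome_len - cursor >= min_len:
        gaps.append((cursor, genome_len))
    return gaps

gaps = uncovered_regions(100, [(10, 30), (25, 50), (70, 90)])
```

Each gap would then be translated in all six frames and searched by blastx, as described above.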
Data Availability
Publication
→ TIGR staff/collaborators analysis of genome data
GenBank
→ Sequence and annotation submitted to GenBank at the time of publication
→ Updates sent as needed
Pathema
→ Data available for download
→ Extensive analyses within and between genomes
GO repository
→ Data available for download
→ Whole repository searchable with AmiGO
Functional Curation
Community Annotation
Community Participants