Pages From 5thwmc_workshop-1

  • November 2019
  • PDF

PATHEMA-Burkholderia

A CLADE SPECIFIC BIOINFORMATICS RESOURCE CENTER
http://pathema.tigr.org

What is Pathema?

• Pathema (http://pathema.tigr.org)
• A NIAID Bioinformatics Resource Center designed to support bio-defense and infectious disease research.
• Pathema provides detailed curation of six target pathogens:
  - Category A priority pathogens: Bacillus anthracis, Clostridium botulinum
  - Category B priority pathogens: Burkholderia mallei, Burkholderia pseudomallei, Clostridium perfringens, Entamoeba histolytica

What is Pathema?

• A customized suite of sophisticated bioinformatics software
• An extensive library of literature and manually curated data
• Comparative analysis tools customized to retrieve, display, and compute results relevant to ongoing bio-defense research

Overarching Goal: to provide core resources that will accelerate scientific progress on understanding, detecting, diagnosing and treating Category A-C pathogens and other pathogens involved in new and re-emerging infectious diseases.

Differences from the CMR? (http://cmr.tigr.org)

• Curated datasets related to biodefense & infectious disease research
• Customized analysis & tool development
• Most up-to-date genomic data
• 3rd party annotations
• Community data submission, integration, and display
• Will include sequence data from non-genome projects
• Distinct clade-specific databases & web resources

Community Outreach Strategic Plan

• Phase I: Subjective Assessment
  - Who is the constituency?
  - What are the information needs?
  - What are the bioinformatics needs?
• Phase II: Design and Refinement
  - Prototype development
  - Continued community correspondence
• Phase III: Performance Assessment & Refinement
  - Online usability study
  - Workshops and training

Entamoeba histolytica (1)

All Burkholderia (25): B. mallei (8), B. pseudomallei (10), B. thailandensis, B. cenocepacia, B. sp., B. phages (4)

All Bacillus (16): B. anthracis (10), B. cereus (3), B. thuringiensis (2), B. subtilis

All Clostridium (15): C. perfringens (4), C. botulinum (6), C. difficile, C. thermocellum, C. acetobutylicum, C. tetani, C. butyricum

Workshop Overview

Part 1:
• Prokaryotic Annotation Methodologies
  - Gene Model & Functional Curation
• Introduction to Pathema and Curation
  - Sequence Sources & Genomic Datatypes
  - Pathogenicity & Literature curation

Part 2:
• Pathema Navigation and Analysis Tools
  - Advanced Data Mining
  - Genome and Comparative Analysis Tools

PATHEMA-Burkholderia Data Curation

Lauren Brinkac
lbrinkac@jcvi.org

Pathema-Burkholderia Home Page

Burkholderia Update Schedules

Pathema-Burkholderia Home Page

Community Registration

• Participation and access to community curation tool.
• Submission of experimentally characterized proteins.
• Participation in future usability studies.
• Community recognition.

Pathema-Burkholderia Home Page

Community Survey Results

General Survey Results

• Top Burkholderia Research Interests
  - Gene/Annotation
  - Virulence Factors
  - Protocols
  - DNA/Clone/Strain information
• Top Pathema Resource Interests
  - Identify virulence factors
  - Identification of biochemical pathways
  - Proteomics
  - Genetic differences between species/strains

Pathema-Burkholderia Home Page

Detailed Strain Information

BEI Resources

Biodefense & Emerging Infections Research Resources Repository (BEI)

Established by the National Institute of Allergy and Infectious Diseases (NIAID) to provide reagents and information for studying Category A, B, and C priority pathogens and emerging infectious disease agents to facilitate research and product development. BEI Resources is managed under contract by American Type Culture Collection (ATCC) to acquire, generate, authenticate, store, and distribute these materials to the scientific community.

Available Burkholderia Genomes

Sequence Sources & Curation Types

• Internally Sequenced Genomes
  - Automated: automated assignment of protein name, gene symbol, EC #, GO terms
  - Mapped: genome features/annotation transfer from a closely related manually curated reference genome
  - Manual: manual curation of gene models and gene functional predictions
    - Sequenced internally
    - Sequenced externally (3rd party annotation)
    - Community curation
• Externally Sequenced Genomes
  - JCVI Automated Annotation

Available Burkholderia Genomes

Genome-specific Home Page

Genome-specific Home Page

Region View

Sequence & Gene Retrieval

Functional Curation

What is Annotation?

• Webster's definition of "annotation"
  - "to make or furnish critical or explanatory notes or comments"
• What this includes for genomics
  - gene finding
  - functional characteristics of gene products
  - physical characteristics of gene/protein/genome
  - overall metabolic profile of the organism
• Steps in the annotation process …

Microbial Annotation Overview

• Gene Finding: Glimmer
• Homology Searches: BLAST-Extend-Repraze, Hidden Markov Models, Paralogous Families
• Functional Assignments: Automated, Manual, Mapped
• ORF Management: Overlaps, InterEvidence
• Data Availability

Gene Finding using Glimmer

• Gene Locator and Interpolated Markov Modeler
• Glimmer uses Interpolated Markov Models (IMMs) to predict which ORFs in a genome are real genes.
• Using Glimmer is a two-part process:
  - Train Glimmer with genes from the organism that was sequenced.
  - Run the trained Glimmer against the entire genome sequence.

Interpolated Markov Model (IMM)

• Computes the conditional probability that nucleotide X occurs following oligomer Y (for all combinations up to 8-mers) in a set of known genes.
• Used to score new sequences for representative genes.

ATGCGTAAGGCTTTCACAGTATGACAGTACACTGACGA
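The conditional-probability idea on this slide can be illustrated with a small sketch. It is a fixed-order Markov model with add-one smoothing, a simplification assumed for illustration only: it does not reproduce Glimmer's interpolation across context lengths, and the function names, the order of 2, and the pseudocount are all hypothetical choices.

```python
import math
from collections import defaultdict

def train_markov(genes, order=2):
    """Count how often each nucleotide follows each `order`-mer in the training genes."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in genes:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def cond_prob(counts, context, base, pseudocount=1.0):
    """P(base | context), smoothed so unseen combinations keep a non-zero probability."""
    seen = counts.get(context, {})
    total = sum(seen.values()) + 4 * pseudocount
    return (seen.get(base, 0) + pseudocount) / total

def log_score(counts, seq, order=2):
    """Sum of log conditional probabilities; higher means more gene-like under the model."""
    return sum(
        math.log(cond_prob(counts, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Training set: the single example sequence shown on the slide.
training = ["ATGCGTAAGGCTTTCACAGTATGACAGTACACTGACGA"]
model = train_markov(training, order=2)
```

A real training set would contain many genes, and a sequence would be scored in each reading frame; here the point is only that scoring reduces to looking up how often a base followed its preceding oligomer in known genes.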

Gene Finding with Glimmer: Training Glimmer

• Identification of ORFs with significant homology to other organisms using BLAST searches.
• Identification of ORFs 500-1000 nucleotides in length, depending on GC-content, and elimination of long overlapping ORFs.

Glimmer IMM

Gene Prediction

Gene Finding with Glimmer: Scoring ORFs & Calling Genes

1) Find ORFs above length cutoff

[Figure: ORFs drawn in the six reading frames (+3, +2, +1, -1, -2, -3). Legend: ORFs meeting minimum length; laterally transferred DNA.]

2) Scoring ORFs: Green ORFs scored well to the model, red ORFs scored less well. The green ORFs are chosen by Glimmer as the set of likely genes.

3) Calling Genes: Glimmer numbers the genes sequentially from the beginning of the DNA molecule on which they reside. Genes missed by Glimmer and overlapping genes are resolved by post-Glimmer processes, which will be discussed on later slides.

[Figure labels: ORF00001, ?, ORF00002, ORF00004, ORF00003]
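Step 1 above, finding ORFs above a length cutoff in all six frames, can be sketched as follows. This is an illustrative scan, not Glimmer's own code: the choice to start an ORF at the first ATG, GTG, or TTG and the `min_len` default are assumptions, and coordinates are reported on the strand that was scanned.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=90):
    """Return (frame, start, end) for start-to-stop ORFs of at least min_len nucleotides.

    frame is +1..+3 or -1..-3; start/end are 0-based, end-exclusive, and refer to
    the scanned strand (the reverse complement for negative frames).
    """
    results = []
    for sign, strand in ((+1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:   # keep only ORFs above the cutoff
                        results.append((sign * (offset + 1), start, i + 3))
                    start = None
    return results

# Hypothetical toy sequence: one 36-nt ORF in frame +3.
example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
```

Raising `min_len` trades sensitivity for fewer spurious ORFs; the scoring and gene-calling steps described on the slide would then run on whatever this scan returns.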

Blast-extend-repraze (BER) searches

• Glimmer
• Protein Translation
• BLAST query against an internal non-redundant amino acid database (NIAA: GenBank, SwissProt, PIR, TIGR)
• Significant hits stored in minidatabase
• 300 bp extension on both ends
• Smith-Waterman Alignment
• Full length pairwise alignment

BER Extensions

The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence. Blue line indicates predicted protein coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.

[Figure: ORFxxxxx (search protein) with 300 bp extensions at end5 and end3, aligned against a match protein. Cases shown:]

• normal full length match
• FS (!): similarity extending through a frameshift upstream or downstream into extensions
• PM (*): similarity extending in the same frame through a stop codon
• FS or PM? (?): two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs
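The coordinate bookkeeping behind the 300 bp extensions can be sketched in a few lines. The function names and the 1-based inclusive coordinate convention are assumptions made for this example; the alignment itself (Smith-Waterman) is not reproduced.

```python
def extend_orf(end5, end3, seq_len, flank=300):
    """Extend an ORF by `flank` bp on both sides, clamped to the molecule.

    end5/end3 are 1-based inclusive coordinates and may be given in either
    order (end5 > end3 for a reverse-strand gene).
    """
    lo, hi = sorted((end5, end3))
    return max(1, lo - flank), min(seq_len, hi + flank)

def similarity_leaves_orf(aln_start, aln_end, end5, end3):
    """True when the aligned region reaches outside the predicted coding coordinates,
    the situation the slide describes as a hint of a frameshift or point mutation."""
    lo, hi = sorted((end5, end3))
    a_lo, a_hi = sorted((aln_start, aln_end))
    return a_lo < lo or a_hi > hi
```

For example, an ORF at 1000..2500 on a 10,000 bp molecule is searched as 700..2800, and an alignment that begins at 850 would be flagged for review.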

Sample BER alignment

Frameshifted alignment

What makes a good alignment?

• Minimum of 35% identity, preferably higher
• Full-length match (approx 70%)
• Match to experimentally characterized protein
  - we have a database storing accessions of proteins known to be experimentally characterized
  - our tools highlight experimentally characterized proteins

CHAR: Experimentally Characterized Protein Database

• A manually curated, experimentally characterized protein database which relates protein names and taxonomic classifications with scientific literature.
• What does it include:
  - Controlled vocabulary describes the type of experimentation performed in each publication
  - All relevant annotation data types (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
  - All synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored

CHAR: Experimentally Characterized Protein Database

• Integrated into our Manual Annotation Tool (MANATEE) via the protein accessions stored in our non-redundant database, PANDA.
  - Actively update and enter TCHAR entries during curation
  - Use curated entries as evidence supporting curation
    - Reduction in transitive annotation
• Curated data predominantly from prokaryotic organisms:
  - 21,000 protein entries
  - 1,517 organisms
  - approx 9,000 manually curated

Hidden Markov Models: HMMs

• Definition: statistical model of the patterns of amino acids in a multiple alignment of proteins which share sequence and functional similarity
• Based on curated multiple protein alignments
• Built in-house (TIGRFAMs) and at PFAM
• TIGRFAMs are assigned to a category ("isology type") which is determined by the types of proteins used for the HMM
• All proteins searched versus HMMs

Building HMMs

• Collect proteins to be in the "seed" (same function/similar domain/family membership)
• Generate and curate multiple alignment of seed proteins
  - Region of good alignment and closest similarity
• Run HMM algorithm
  - Computes statistical probabilities for amino acid patterns in the seed
  - this step may need a few iterations
• Search new model against all proteins
• Choose "noise" and "trusted" cutoff scores based on what scores the "known" vs. "unknown" proteins receive.

HMMs: choosing cutoff scores

matches (seed members bold)     score
protein "definitely"            547
protein "absolutely"            501
protein "confident"             398
protein "safe bet"              376
protein "very confident"        365
protein "has to be one"         355
---------------- trusted cutoff: 300
protein "could be"              210
protein "maybe"                 198
protein "not sure"              150
---------------- noise cutoff: 100
protein "no way"                74
protein "can't be"              54
protein "not a chance"          47

Proteins scoring above trusted (300) are considered part of the family modeled by the HMM. Proteins scoring below noise (100) are not considered part of the family modeled by the HMM.
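The cutoff logic can be written as a three-way classification. This is a minimal sketch using the example scores and cutoffs from the slide; the function name and the treatment of a score exactly equal to the trusted cutoff (counted as a member) are assumptions.

```python
def classify_hit(score, trusted, noise):
    """Place an HMM score relative to the trusted and noise cutoffs."""
    if score >= trusted:
        return "member"           # at or above trusted: part of the family
    if score < noise:
        return "not a member"     # below noise: not part of the family
    return "possible member"      # between noise and trusted: needs review

# Example scores from the slide, with trusted = 300 and noise = 100.
scores = [547, 501, 398, 376, 365, 355, 210, 198, 150, 74, 54, 47]
calls = [classify_hit(s, trusted=300, noise=100) for s in scores]
```

With these numbers the first six proteins are called members, the next three fall in the gray zone, and the last three are rejected.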

HMM Isology Types

Equivalog: designed so that all members of the family being modeled and all proteins scoring above trusted share the exact same function.

Superfamily: This type of HMM describes a group of proteins which have full length protein sequence similarity and have the same domain architecture, but which do not necessarily have the same function.

Subfamily: This type of HMM describes a group of proteins which also have full length homology, and which represent more specific sub-groupings within a superfamily.

Domain: These HMMs describe a region of homology that is not required to be the full length of a protein. The function of the region may or may not be known.

(The above are just a sampling of the most commonly seen isology types; there are more types than those listed here.)

Pfam: Indicates that no TIGR isology type has yet been assigned to the Pfam HMM.

Annotation is attached to HMMs

• TIGR00433
  - isology: equivalog
  - name: biotin synthase
  - EC: 2.8.1.6
  - gene symbol: bioB
  - TIGR role: 77 (Biotin biosynthesis)
  - GO terms: GO:0004076 (biotin synthase activity), GO:0009102 (biotin biosynthesis)
• PF04055
  - isology: domain
  - name: radical SAM domain protein
  - EC: not applicable
  - gene symbol: not applicable
  - TIGR role: 703 (enzymes of unknown specificity)
  - GO terms: GO:0003824 (catalytic activity), GO:0008152 (metabolism)

Evaluating HMM scores

If the protein's total score is...

• above trusted: the protein is a member of the family the HMM models
• below noise: the protein is not a member of the family the HMM models
• in-between noise and trusted: the protein MAY be a member of the family the HMM models
• above trusted and some or all scores are negative: the protein is a member of the family the HMM models

[Figure: each case is drawn on a score axis with ticks at -50, 0, and 100 and markers labeled P, N, T, and Q.]

Useful Protein Databases

• Swiss-Prot
  - European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB)
  - all entries manually curated
  - annotation includes
    - Literature references
    - coordinates of protein features, e.g. active sites, signal peptides
    - links to cross-referenced databases
    - HMMs
    - Enzyme Commission
• Protein Information Resource
• National Center for Biotechnology Information (NCBI)
  - protein and DNA sequences
  - taxonomy resource
• Omnium
  - database that underlies TIGR's CMR
  - contains data from all completed sequenced bacterial genomes
  - data is downloaded from the sequencing center
• Enzyme Commission
  - categorized collection of enzymatic reactions
  - reactions have accession numbers indicating the type of reaction
• Metabolic pathways: KEGG, MetaCyc, etc.
• Boutique databases: Transport Classification database, MEROPS, etc.

Other searches

• PROSITE Motifs
  - collection of protein motifs associated with active sites, binding sites, etc.
• InterPro
  - Brings together HMMs (both TIGR and Pfam), Prosite motifs, and other forms of motif/domain clustering (Prints, Smart)
  - GO terms have been assigned to many of these
• TmHMM
  - an HMM that recognizes membrane spans
  - a product of the Center for Biological Sequence Analysis
• SignalP
  - potential secreted proteins
• Lipoprotein

Other Searches/Information/Calculations

• Molecular Weight/pI
• RNAs
  - tRNAs are found using tRNAscan (Sean Eddy)
  - structural RNAs are found using BLAST searches
  - We are starting to implement Rfam, a set of HMMs modeling non-coding RNAs (Sanger, WashU)
• GC content
  - for the whole genome and individual genes
• terminators
• DNA repeats (currently not predicted)
• Operons (currently not predicted)
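Of the calculations listed above, GC content is the simplest to show. The sketch below computes it for a whole sequence and for one gene; the function names and the 1-based inclusive gene coordinates are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gene_gc_content(genome, start, end):
    """GC content of one gene given 1-based inclusive coordinates on the genome."""
    return gc_content(genome[start - 1:end])
```

Comparing a gene's value with the genome-wide value is a common way to spot regions of unusual composition.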

Functional Assignments

• Assign annotation to each protein:
  - Protein name
  - Gene symbol
  - Comment
  - EC number
  - TIGR role
  - GO terms

Annotation Procedures

• Automated
  - a software script combines several programs: Glimmer, BER and HMM searches, and AutoAnnotate
• Manual
  - Automated procedure above plus manual curation of individual genes
• Transferred (Mapped)
  - Software script used to map annotation onto a genome

AutoAnnotate: Assignment Hierarchy

Identifying information assigned to predicted coding region, in order of precedence:

1) equivalog-level HMM (score above trusted)
2) BER search results (full length protein matches (>80%) with high percent identity (>35%))
3) non-equivalog level HMM: BER matches are hypothetical proteins and/or do not fulfill match criteria.
4) "conserved hypothetical protein": BER matches are hypothetical proteins and no HMM hit.
5) "hypothetical protein": No BER matches and no HMM hits
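One way to read the hierarchy above is as a chain of fallbacks. The sketch below encodes that reading; it is not AutoAnnotate's actual code. The `Evidence` fields, the function name, and the handling of a BER match that is not hypothetical but fails the match criteria (treated here like a hypothetical match) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evidence:
    equivalog_hmm: Optional[str] = None   # name from an equivalog HMM scoring above trusted
    other_hmm: Optional[str] = None       # name from a non-equivalog HMM hit
    ber_name: Optional[str] = None        # name of the best BER match, if any
    ber_coverage: float = 0.0             # fraction of full length covered by the match
    ber_identity: float = 0.0             # fraction of identical residues
    ber_is_hypothetical: bool = False     # best BER match is itself a hypothetical protein

def auto_name(ev):
    """Pick a protein name by walking the assignment hierarchy from strongest evidence down."""
    if ev.equivalog_hmm:
        return ev.equivalog_hmm
    good_ber = (
        ev.ber_name is not None
        and not ev.ber_is_hypothetical
        and ev.ber_coverage > 0.80
        and ev.ber_identity > 0.35
    )
    if good_ber:
        return ev.ber_name
    if ev.other_hmm:
        return ev.other_hmm
    if ev.ber_name is not None:
        return "conserved hypothetical protein"
    return "hypothetical protein"
```

The ordering matters more than the individual tests: a strong equivalog hit wins even when a good BER match exists, and the hypothetical names appear only when nothing more specific is available.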

Manual Annotation: Evaluating the Evidence

• Visually inspect alignments
  - should be full length with at least 35% identity
• Check HMM scores and isology type
• Review Genome Properties analysis
  - pathways, complexes?
• Check for operon structure or other information from neighboring genes.
  - presence of a gene in an operon can supplement weak similarity evidence
• Are there transmembrane regions?
• Is there a signal peptide?
• Are there any motifs that might give a clue to function?
• Is there a paralogous family?

Functional Assignment: Protein Annotation

Knowledge about function reflected in specificity of protein names

• high confidence
  - "adenylosuccinate lyase", purB, 4.3.2.2
• general function, lacks specificity
  - "carbohydrate kinase, FGGY family"
    - no gene symbol, partial EC number
• family designation
  - "Cbby family protein"
• homolog designation
  - "recA homolog"
• hypotheticals
  - "hypothetical protein"
  - "conserved hypothetical protein"
• "putative recA"
  - "putative ribokinase"

Translation disruptions

Electropherograms reviewed in-house to verify sequence

• authentic frameshift
• authentic point mutation
• degenerate
• truncation
• deletion
• insertion
• interruption
• fusion
• fragment

Single Gene Annotation

• MANATEE: Manual Annotation Tool Etc.
  - Web-based manual annotation tool for accessing & editing annotation data

manatee.sourceforge.net

The Annotation Engine Service

• A free service which provides automated annotation and tools for analysis of another center's prokaryotic sequence.
• Two components:
  1. Production of output from JCVI's automated annotation pipeline: includes search results and automatically generated annotation in a MySQL database and associated files.
  2. The manual annotation tool Manatee: an open source web-based interface for interacting with and editing annotation data.

Mapped Curation

• Mapping based on whole genome analysis using nucmer and associated scripts
• All genome features (gene models, RNAs, repeats, etc.) and annotation from a manually curated reference genome mapped onto the genome(s) of closely related organisms.
• Unique genes are manually reviewed in the query genome
• Advantage vs. gene-based mapping:
  - Positional context of the gene within the genome is taken into account, increasing the confidence that similar genes are orthologous.
  - Identification of genes unique to the query genome

Functional Assignment: TIGR Roles

TIGR bacterial roles were first adapted from Monica Riley's roles for E. coli

• Amino acid biosynthesis
• Purines, pyrimidines, nucleosides, and nucleotides
• Fatty acid metabolism
• Biosynthesis of cofactors, prosthetic groups, and carriers
• Central intermediary metabolism
• Energy metabolism
• Transport and binding proteins
• DNA metabolism
• Transcription
• Protein synthesis
• Protein fate
• Regulatory functions
• Signal transduction
• Cell envelope
• Cellular processes
• Other categories
• Unknown
• Hypothetical

AutoAnnotate makes a first pass at assigning role, based on roles associated with HMMs or with match proteins.

Human annotator checks and adjusts as necessary.

Functional Assignment: Gene Ontology (GO)

GO is a set of controlled vocabularies. The vocabularies are used to describe 3 aspects of a gene product: what it does (molecular function), why it does what it does (biological process), and where it does what it does (cellular component).

What is a controlled vocabulary?

• a set of defined terms used to attach meaningful descriptions to objects
• a given term always means the same thing

the "vehicle" controlled vocabulary

vehicle
  car
    sedan
      2-door sedan
      4-door sedan
    station wagon
    hatch-back
  truck
    pick-up truck
    dump truck
  motorcycle

GO: a community response to need for precise, consistent terminology in biology

• Different biological processes described by single term. Example:
  - sporulation could refer to reproductive sporulation
  - sporulation to survive adverse conditions
• Enzyme Commission has standardized enzyme names, but no concerted effort to do so for other protein functions
• With the advent of genomics, ambiguity across organisms and communities is problematic
• In order to conduct meaningful comparisons and searches, genomic data must be organized using a consistent set of descriptors

The structure of GO

• The GO consists of three controlled vocabularies stored as ontologies:
  - molecular function (what)
  - biological process (why)
  - cellular component (where)
• The ontologies are made up of individual terms
• Each term has a name and a detailed definition
• Each term also has a unique id number; this makes the term searchable by a computer.
• Each term is related to others in a "parent-child" relationship where the child term is more specific than the parent term.

Example GO term

• ID number: GO:0004076
• Name: biotin synthase activity
• Definition: Catalysis of the reaction: dethiobiotin + sulfur = biotin.
• comment: none
• cross reference: EC:2.8.1.6
• synonyms: none
• parent term: sulfurtransferase activity (GO:0016783)
• relationship to parent: "is a"

Functional Assignment: Assigning GO terms

Knowledge about function reflected in specificity of GO terms

Annotation evidence for 3 genes

• Example Gene #1
  - HMM for ribokinase
  - characterized match to ribokinase
• Example Gene #2
  - HMM for kinase
  - characterized match to glucokinase and fructokinase
• Example Gene #3
  - HMM for kinase

Sample GO trees

Function
  catalytic activity
    kinase activity
      carbohydrate kinase activity
        ribokinase activity
        glucokinase activity
        fructokinase activity

Process
  metabolism
    carbohydrate metabolism
      monosaccharide metabolism
        hexose metabolism
          glucose metabolism
          fructose metabolism
        pentose metabolism
          ribose metabolism
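The link between evidence and term specificity can be made concrete with the Function terms from the sample tree. The sketch below assumes the parent-child arrangement implied by that tree (real GO has additional intermediate terms) and a hypothetical rule: keep the most specific supported terms, then choose the deepest term that covers all of them.

```python
# Parent of each term, following the sample Function tree (simplified assumption).
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}

def lineage(term):
    """Return the term followed by its ancestors, most specific first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def assign_term(evidence_terms):
    """Choose the most specific term consistent with every piece of evidence."""
    unique = list(dict.fromkeys(evidence_terms))
    # A general term is already implied by any more specific descendant, so drop it.
    leaves = [t for t in unique if not any(t in lineage(o)[1:] for o in unique)]
    shared = lineage(leaves[0])
    for term in leaves[1:]:
        other = set(lineage(term))
        shared = [t for t in shared if t in other]
    return shared[0]

gene1 = assign_term(["ribokinase activity", "ribokinase activity"])
gene2 = assign_term(["kinase activity", "glucokinase activity", "fructokinase activity"])
gene3 = assign_term(["kinase activity"])
```

Under this rule, gene #1 receives the specific ribokinase term, gene #2 backs off to the carbohydrate kinase level because its two characterized matches disagree, and gene #3 stays at the general kinase level.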

Gene Model Curation

• Ensuring that the predicted genes have the correct coordinates and that the set of predicted genes is complete and correct
  - Start site curation
  - overlap resolution (false positives)
  - missed genes (false negatives)

Gene Model Curation: Start site edits

What is considered:

- Similarity to match protein, both in BER and Paralogous Family: probably the most important factor.
- Ribosome Binding Site (RBS): a string of AG-rich sequence located 5-11 bp upstream of the start codon
- Start site frequency: ATG >> GTG >> TTG

[Figure labels: 3 possible start sites; RBS upstream of start; this ORF's upstream boundary]
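Two of the criteria above, RBS presence and start codon frequency, lend themselves to a small sketch. The purine threshold, the exact window boundaries, and the ranking rule are assumptions for illustration, and the most important factor named on the slide, similarity to the match protein, is not modeled here.

```python
START_RANK = {"ATG": 0, "GTG": 1, "TTG": 2}   # ATG >> GTG >> TTG

def has_rbs(seq, start, min_purines=5):
    """True when the window 5-11 bp upstream of `start` (0-based) is AG-rich."""
    lo = max(0, start - 11)
    hi = max(0, start - 4)
    region = seq[lo:hi]
    return sum(base in "AG" for base in region) >= min_purines

def candidate_starts(seq, orf_begin, stop_pos):
    """List in-frame start codons from orf_begin up to the stop codon at stop_pos."""
    found = []
    for i in range(orf_begin, stop_pos, 3):
        codon = seq[i:i + 3]
        if codon in START_RANK:
            found.append((i, codon, has_rbs(seq, i)))
    return found

def rank_starts(candidates):
    """Prefer starts with an RBS, then the more frequent codon, then the upstream site."""
    return sorted(candidates, key=lambda c: (not c[2], START_RANK[c[1]], c[0]))

# Hypothetical toy sequence: an AG-rich stretch precedes the ATG at position 14.
example = "CCC" + "AGGAGGA" + "CCCC" + "ATG" + "GCT" * 3 + "GTG" + "GCT" * 2 + "TAA"
ranked = rank_starts(candidate_starts(example, 14, 35))
```

In practice an annotator would weigh this ranking against the alignment evidence rather than accept the top candidate automatically.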

Gene Model Curation: Overlaps and InterEvidence

Overlap analysis

When two ORFs overlap (boxed areas), the one without similarity to anything (another protein, an HMM, etc.) is removed. If neither has evidence, other considerations such as presence in a putative operon and potential start codon quality are considered.
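The overlap rule can be sketched as a small decision function. The dictionary fields and ORF identifiers are hypothetical, and sending the "both have evidence" and "neither has evidence" cases to manual review is an assumption; the slide only states the rule for the case where exactly one ORF lacks evidence.

```python
def overlaps(a, b):
    """True when two ORFs with 1-based inclusive 'start'/'end' share at least one base."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def resolve_overlap(a, b):
    """Decide what to do with a pair of ORFs based on which one has supporting evidence."""
    if not overlaps(a, b):
        return "keep both"
    if a["has_evidence"] and not b["has_evidence"]:
        return "remove " + b["id"]
    if b["has_evidence"] and not a["has_evidence"]:
        return "remove " + a["id"]
    # Operon context and start codon quality would be weighed here by a curator.
    return "manual review"

orf_a = {"id": "ORF00010", "start": 100, "end": 700, "has_evidence": True}
orf_b = {"id": "ORF00011", "start": 650, "end": 900, "has_evidence": False}
orf_c = {"id": "ORF00012", "start": 880, "end": 1200, "has_evidence": False}
```

Here `resolve_overlap(orf_a, orf_b)` removes the unsupported ORF, while the pair `orf_b`, `orf_c` is left for a curator because neither has evidence.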

InterEvidence regions

Areas of the genome with no genes and areas within genes without any kind of evidence (no match to another protein, HMM, etc.) are translated in all 6 frames and searched against NIAA by blastx. Results are evaluated by the annotation team.
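The six-frame translation that precedes the blastx search can be sketched as follows. The sketch uses the standard codon assignments and hypothetical function names; the database search itself is not reproduced, and unknown or ambiguous codons are written as X by assumption.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order line up with the amino acid string above.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3 of a DNA region."""
    dna = dna.upper()
    rc = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frames("ATGGCCTAA")
```

Each of the six protein strings would then be compared against the protein database, which is what a translated search such as blastx does internally.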

Data Availability

• Publication
  - TIGR staff/collaborators analysis of genome data
• GenBank
  - Sequence and annotation submitted to GenBank at the time of publication
  - Updates sent as needed
• Pathema
  - Data available for download
  - Extensive analyses within and between genomes
• GO repository
  - Data available for download
  - Whole repository searchable with AmiGO

Functional Curation

Community Annotation

Community Participants
