Background & Motivation Problem & Feature Construction Experiments Design & Results Conclusions and Future Work
Exploring Alternative Splicing Features using Support Vector Machines
Jing Xia (1), Doina Caragea (1), Susan J. Brown (2)
(1) Computing and Information Sciences, Kansas State University, USA
(2) Bioinformatics Center, Kansas State University, USA
Nov 4, 2008
Outline
1. Background & Motivation
2. Problem & Feature Construction
   - Problem Definition
   - Data Set
   - Feature Construction
3. Experiments Design & Results
   - Experimental Design
   - Experimental Results
4. Conclusions and Future Work
   - Conclusion
Alternative Splicing
- Splicing: an important step during gene expression
- Variable splicing process (alternative splicing): one gene -> many proteins
[Figure: gene expression, from genes to proteins. DNA (TSS, 5' UTR, ATG, exons separated by introns with GT ... AG boundaries, 3' UTR) -> transcription -> pre-mRNA (cap, 5' UTR, AUG, introns with GU ... AG boundaries, 3' UTR) -> splicing -> mRNA -> translation -> protein]
[Figure: one gene to many proteins. Gene -> pre-mRNA -> alternative splicing -> transcript isoforms -> proteins]
Patterns of Alternative Splicing
- Exon skipping (most frequent)
- Alternative 5' splice sites
- Alternative 3' splice sites
- Intron retention
- Mutually exclusive exons
[Figure: exon1 (CSE), exon2 (CSE), exon3 (ASE), exon4 (CSE), where CSE = constitutively spliced exon and ASE = alternatively spliced exon]
Here, we focus on predicting alternatively spliced exons (ASE) and constitutively spliced exons (CSE) with an SVM.
Identifying Alternative Splicing in the Genome
- Finding AS with wet-lab experiments is time consuming
- Traditionally, ESTs are aligned to the genome (limited by the amount of EST data available for the genome)
- Use machine learning algorithms to predict AS at the genome level
[Figure: transcripts aligned to genomic DNA, showing an alternative 3' exon and exon skipping]
Problem Definition
Given an exon, can we predict whether it is an alternatively spliced exon (ASE) or a constitutively spliced exon (CSE)?
[Figure: exon1 (CSE), exon2 (CSE), exon3 (ASE), exon4 (CSE)]
Problem Addressed and Our Approach
Problem definition: predict alternatively spliced exons (ASE) vs. constitutively spliced exons (CSE)
Use a Support Vector Machine (SVM)
- Task: two-class (ASE and CSE) classification problem
- Need: training data set containing labeled examples (ASE & CSE)
- Learning: train the classifier with the training data
- Application: predict unknown ASEs
Need features to represent ASEs & CSEs
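The two-class setup can be illustrated with a minimal linear SVM sketch in plain Python. This is not the implementation used in this work (an SVM library would normally be used); it only shows the soft-margin objective with cost C minimised by batch subgradient descent. The two feature columns and all numeric values are made up for illustration, with label +1 standing for ASE and -1 for CSE.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Minimise 0.5*||w||^2 + C * sum(hinge loss) by batch subgradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)          # gradient of the regulariser 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:        # example violates the margin: hinge loss is active
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (ASE) or -1 (CSE) from the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy feature vectors (two made-up features per exon).
X = [[2.0, 0.10], [1.5, 0.20], [1.8, 0.15],
     [-1.0, 0.90], [-1.5, 0.80], [-2.0, 0.95]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y, C=1.0)
predictions = [predict(w, b, x) for x in X]
```

A larger C penalises margin violations more strongly, which is why C is one of the parameters tuned later.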
Data Set
- Published data set from the model organism C. elegans (worm)
- Includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE)
- Contains 487 ASEs and 2531 CSEs
- 100-base local sequences around splice sites
Example of data set:
ASE  GTACTATAGCGTGCTG....ACCGTTCGTACTCGCT
ASE  ATACTATAGCGTCTTG....ACCGATCGTACACGCT
CSE  GTACTATAGCGTCTTG....ACCGATCGTACTCGCT
[Figure: an exon with AG and GT at its splice sites; a window from -100 to +100 is marked around each splice site]
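As a rough sketch of how such local windows could be cut from a longer sequence, the function below extracts the region from -100 to +100 around each exon boundary. The coordinate convention (0-based, exon = seq[exon_start:exon_end]) and the function name are assumptions for illustration; the published data set already ships the extracted sequences.

```python
def splice_site_windows(seq, exon_start, exon_end, flank=100):
    """Return (window around the exon's upstream boundary,
               window around the exon's downstream boundary).

    Positions are 0-based and the exon is seq[exon_start:exon_end].
    Each window covers `flank` bases on both sides of the boundary,
    truncated at the ends of the sequence.
    """
    def window(center):
        lo = max(0, center - flank)
        hi = min(len(seq), center + flank)
        return seq[lo:hi]

    return window(exon_start), window(exon_end)
```

With flank=100, each window is 200 bases long whenever the exon boundary lies at least 100 bases from either end of the sequence.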
Data Set (continued)
Previous work:
- Motifs captured and identified by kernel (G. Rätsch et al.)
- Length of exons and flanking introns (Sorek et al.)
Our work:
- Exploit more biologically significant features
- Use several additional approaches to derive features
Feature List
Several features known to be biologically important:
- Strength of splice sites (SSS)
- Motif features
  - Intronic splicing regulator (ISR)
  - Motifs derived from local sequences (MAST)
  - Exonic splicing enhancer (ESE)
- Reduced set of motif features based on locations of motifs on secondary structure (MAST-R)
- Optimal folding energy (OPE)
- Basic sequence features (BSF)
Strength of Splice Sites
SSS: Strength of Splice Site
We consider all splice sites.
[Figure: exons with their flanking splice-site sequences, e.g. CGAG | exon | AGGTAAGT, GGAG | exon | AGGTAGGT, CGAG | exon | AGGTTAGT; the 3' ss and 5' ss windows are marked with the positions -3, +7 and -26, +2]
$$\text{score} = \sum_i \log \frac{F(X_i)}{F(X)}$$
where X ∈ {A, U, G, C}, i ∈ {−3, +7} for 3' splice sites (3'ss) and i ∈ {−26, +2} for 5' splice sites (5'ss).
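The score can be sketched as a sum of log ratios over the positions of a splice-site window. The sketch assumes that F(X_i) is the frequency of nucleotide X at position i among the aligned splice sites and that F(X) is a background frequency of X; the slide does not spell out these definitions. The pseudocount and the uniform background are illustrative choices, and sequences are written with DNA letters (T instead of U).

```python
import math
from collections import Counter


def position_frequencies(sites):
    """Per-position nucleotide frequencies from equally long, aligned sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        total = sum(counts.values())
        freqs.append({base: n / total for base, n in counts.items()})
    return freqs


def splice_site_score(site, pos_freqs, background, pseudo=1e-3):
    """Sum over positions of log(F(X_i) / F(X)); `pseudo` avoids log(0)."""
    score = 0.0
    for i, base in enumerate(site):
        f_pos = pos_freqs[i].get(base, 0.0) + pseudo
        score += math.log(f_pos / background[base])
    return score


# Splice-site sequences taken from the figure on this slide.
sites = ["AGGTAAGT", "AGGTAAGT", "AGGTAGGT", "AGGTTAGT"]
pos_freqs = position_frequencies(sites)
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform

consensus_score = splice_site_score("AGGTAAGT", pos_freqs, background)
variant_score = splice_site_score("AGGTTAGT", pos_freqs, background)
unrelated_score = splice_site_score("CCCCCCCC", pos_freqs, background)
```

Sites that match the most frequent nucleotide at every position receive the highest score; a site that shares nothing with the aligned set receives a strongly negative one.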
Motif Features
Motif: a sequence pattern that occurs repeatedly in a group of sequences
- Intronic Splicing Regulator (ISR): identified in Kabat et al.
- MAST: derived by MEME using the [-100, +100] sequence
- Exonic Splicing Enhancers (ESE): based on two assumptions
  - more frequent in exons than in introns
  - more frequent in exons with weak splice sites than in exons with strong splice sites
[Figure: illustration of ISR dispersed among sequences]
[Figure: example of a 20-base motif derived from sequences around splice sites]
[Figure: ISR, MAST and ESE motifs dispersed among exons and introns]
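The two ESE assumptions can be turned into a simple k-mer filter. The sketch below keeps a k-mer only if it is more frequent in exons than in introns and more frequent in weak-splice-site exons than in strong-splice-site exons. The function names, the use of plain relative frequencies without a significance test, and the toy sequences are assumptions for illustration, not the procedure used to derive the ESE set in this work.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequency of every k-mer over a list of sequences."""
    counts = Counter()
    total = 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def candidate_eses(exons_weak, exons_strong, introns, k=6):
    """k-mers satisfying both assumptions, returned in sorted order."""
    f_exon = kmer_freqs(exons_weak + exons_strong, k)
    f_intron = kmer_freqs(introns, k)
    f_weak = kmer_freqs(exons_weak, k)
    f_strong = kmer_freqs(exons_strong, k)
    return sorted(
        kmer for kmer in f_exon
        if f_exon[kmer] > f_intron.get(kmer, 0.0)                 # assumption 1
        and f_weak.get(kmer, 0.0) > f_strong.get(kmer, 0.0)       # assumption 2
    )


# Toy sequences with k=3 so the result is easy to check by hand.
toy_candidates = candidate_eses(
    exons_weak=["GAAGAAGAA"],
    exons_strong=["CCCGAACCC"],
    introns=["TTTTTTCCC"],
    k=3,
)
```

In the toy input, GAA, AAG and AGA pass both tests, while CCC fails because it is not enriched in the weak-splice-site exon.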
pre-mRNA Secondary Structure
Pre-mRNA secondary structures influence exon recognition
- Secondary structure: derived from Mfold; filter motifs using secondary structure
- Optimal Folding Energy: stability of RNA secondary structure
[Figure: a motif in the sequence AUCCAUGGGCCGGAUGUGACGGUAGUAGGGUAUACGUCACAUAGGCUUCCUCUCAUGA located at different structures (loop, stem)]
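One possible way to filter motifs by structural context is sketched below. It assumes the predicted structure is available as a dot-bracket string ('.' for unpaired bases, '(' and ')' for paired bases), which is a common text representation but not necessarily the format used in this work. The 0.5 threshold for calling an occurrence "loop" is a hypothetical choice.

```python
def unpaired_fraction(structure, start, length):
    """Fraction of unpaired ('.') positions in structure[start:start+length]."""
    region = structure[start:start + length]
    return region.count(".") / len(region)


def motif_locations(seq, structure, motif, loop_threshold=0.5):
    """List (position, 'loop' or 'stem') for every occurrence of motif in seq."""
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        frac = unpaired_fraction(structure, pos, len(motif))
        hits.append((pos, "loop" if frac >= loop_threshold else "stem"))
        pos = seq.find(motif, pos + 1)   # also finds overlapping occurrences
    return hits


# Toy hairpin: three paired bases, a four-base loop, three paired bases.
toy_seq = "GGGAAAUCCC"
toy_structure = "(((....)))"
loop_hits = motif_locations(toy_seq, toy_structure, "AAAU")
stem_hits = motif_locations(toy_seq, toy_structure, "GGG")
```

A reduced motif set in the spirit of MAST-R could then keep only the occurrences with the desired structural label.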
Sequence Features
- GC content (G & C ratio), a characteristic of the sequence: $\text{GC} = \frac{G + C}{A + U + G + C}$
- Sequence length
  - Length of exons and length of exons' flanking introns
- Frames of stop codons
Summary of features:
- Motif features
- Secondary structure
- Strength of splice sites
- Sequence features
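The basic sequence features are simple to compute; a minimal sketch follows. Reading "frames of stop codons" as "which of the three reading frames contain a stop codon" is an assumption, as are the function and key names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """(G + C) / (A + U + G + C); T is counted in place of U for DNA input."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGTU")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def stop_codon_frames(seq):
    """For each of the three reading frames, True if it contains a stop codon."""
    seq = seq.upper().replace("U", "T")
    return [
        any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))
        for frame in range(3)
    ]


def basic_sequence_features(exon, upstream_intron, downstream_intron):
    """Collect the basic features for one exon and its flanking introns."""
    return {
        "gc_content": gc_content(exon),
        "exon_length": len(exon),
        "upstream_intron_length": len(upstream_intron),
        "downstream_intron_length": len(downstream_intron),
        "stop_codon_frames": stop_codon_frames(exon),
    }


features = basic_sequence_features("ATGGCCTAAGC", "GTAAGTTTTCAG", "GTGAGTCCAG")
```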
Experimental Design
- List of previously defined features as SVM input
- Combination of different features to represent ASEs & CSEs
- Tune SVM parameters during training (kernel: linear, RBF, ...; cost C)
- Choose parameters with the best cross-validation (CV) accuracy
- Test the trained SVM on the testing ASEs & CSEs
[Figure: five data splits (split1 to split5), each divided 20% / 80%; 5-fold cross validation]
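Parameter selection by cross-validation can be sketched independently of the classifier. The helper names below are hypothetical: `fit(X, y, param)` stands for any training routine (for example an SVM with cost C) and `score(model, X, y)` for any accuracy-like measure. Folds are stratified so that each keeps roughly the same ASE/CSE ratio, which is an added assumption, and the toy threshold classifier at the end exists only to make the sketch runnable.

```python
def stratified_folds(labels, k=5):
    """Split example indices into k folds with a similar class ratio in each."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for n, idx in enumerate(indices):
            folds[n % k].append(idx)
    return folds


def cross_val_score(X, y, fit, score, param, k=5):
    """Mean held-out score over k folds for one parameter value."""
    total = 0.0
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], param)
        total += score(model, [X[i] for i in held_out], [y[i] for i in held_out])
    return total / k


def select_param(X, y, fit, score, grid, k=5):
    """Return the parameter value with the best mean cross-validation score."""
    return max(grid, key=lambda p: cross_val_score(X, y, fit, score, p, k))


# Toy stand-in for a classifier: the "model" is just a decision threshold.
def toy_fit(X, y, threshold):
    return threshold


def toy_accuracy(threshold, X, y):
    predictions = [1 if x > threshold else -1 for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


toy_X = [float(v) for v in range(20)]
toy_y = [1 if v >= 10 else -1 for v in range(20)]
best = select_param(toy_X, toy_y, toy_fit, toy_accuracy, grid=[2.0, 9.5, 15.0])
```

After the best parameter is chosen on the training portion, the model would be retrained on that portion and evaluated once on the held-out test portion.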
Experimental Results
Results of alternatively spliced exon classification. All features, including ISR motifs, are used.

                 Cross-validation score    Test score
Split     C      fp 1%     AUC %           fp 1%     AUC %
Split1    0.05   32.45     86.55           56.48     90.05
Split2    0.05   39.33     88.32           52.04     89.04
Split3    0.1    37.56     87.76           38.71     87.97
Split4    0.01   40.86     89.02           37.63     84.42
Split5    0.1    36.48     87.50           35.79     85.69
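The two reported measures can be computed from classifier scores as sketched below. Reading the "fp 1%" column as the true positive rate reached at a false positive rate of at most 1% is an assumption; the slide does not define it. Labels are 1 for ASE and 0 for CSE, and the toy scores are made up.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (false positive rate, true positive rate) points.

    labels: 1 for positive (ASE), 0 for negative (CSE). Tied scores are
    processed together so the curve does not depend on their order.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr=0.01):
    """Highest true positive rate among points with FPR <= max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_points = roc_points(toy_scores, toy_labels)
toy_auc = auc(toy_points)
toy_tpr = tpr_at_fpr(toy_points, max_fpr=0.01)
```

For the toy input, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9, and two of the three positives are recovered before the first false positive.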
Experimental Results
[Figure: ROC curves, true positive rate vs. false positive rate. Mixed-Feas (85.55%), Base-Feas (78.78%)]
Comparison of ROC curves obtained using basic features only and basic features plus other mixed features (except conserved ISR motifs). Models trained using 5-fold CV with C = 1.
Experimental results
AUC score comparison between data sets with secondary structural features and data sets without secondary structural features
Motif Evaluation
Intersection between motifs derived from sequences and intronic splicing regulators
Motif Evaluation
Conserved ESE in metazoans (animals), Human and Mouse
Motif Evaluation
Comparison with A. thaliana
Conclusions
- Alternative splicing (AS) events can be found using transcripts
- Machine learning can be used effectively to predict AS events
- Identified features that are informative in predicting AS
- Explored comparatively comprehensive feature sets from a biological point of view
Future Work
- Apply this approach to a specific organism
- Identify motifs more accurately
- Refine relationships between features (secondary structure and motifs)
- Learn other types of AS events (not only skipped exons)
adapted from "Detection of Alternative Splicing Events Using Machine Learning"
Thank you for your attention!
Questions?
Related work
- RASE: http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE
Acknowledgement
- Data set from Dr. Rätsch's FML group: http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE/altsplicedexonsplits.tar.gz
- Dr. Caragea's MLB group: http://people.cis.ksu.edu/~dcaragea/mlb
- Dr. Brown's Bioinformatics Center at KSU: http://bioinformatics.ksu.edu