CARL
Probabilistic Cellular Automata Model for Identification of CpG island in DNA string
carl
Soumyabrata Ghosh
Cellular Automata Research Lab Group Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh
Salt Lake City | Kolkata | www.carltig.res.in
carl
Agendas •
Background
•
CpG Island Identification by Hidden Markov Model & its drawbacks
•
Probabilistic Cellular Automata (PrCA) & its Rule Vector Graph
•
PrCA Model for CpG island identification
•
Filter Design to reduce false positive predictions
•
Experimental results
Background •
Identification of CpG island in DNA string • Higher concentration of C-G dinucleotides • Probable “start” regions of genes
•
De facto standard: Hidden Markov Model • Based on Markov chain with transition & emission probabilities • Maximum probable path prediction by Viterbi Algorithm
carl
•
Disadvantages of HMM: • • • •
States are assumed to be independent of each other Local Maxima Over training False positive prediction
Hidden Markov Model (HMM) • • •
•
carl
•
Based on Markov Chain Probability of a state depends only on previous state Transition Probability Matrix • Transition between underlying states (A+, T+ , G+, C+ for CpG island and A-, T-, G-, C- for non-CpG) Emission Probability Matrix • Emission of expressed symbol (A, T, G, C) with respect to underlying state Viterbi Algorithm – Initializes the probability calculations by taking the product of the initial Transition probabilities with the associated emission probabilities. – Determines the most probable path through the underlying states of the HMM
carl
Our Approach •
Design of Probabilistic CA based model
•
Encoding of nucleotides by 3 bits binary string
•
Derivation of Probabilistic Rules from Transition Probability Matrix
•
Model Traversal by Viterbi algorithm
•
Feature Extraction from encoded DNA string
•
Filter design based on Feature Extraction to reduce false positive predictions
Probabilistic Cellular Automata (PrCA) •
Cellular Automata: discrete dynamical system. Space, time, and the states of the system are discrete.
•
The next state function of ith cell depends on current state of (i-1)th, ith, (i+1)th cell. The next state function is represented by CA Rule.
•
Three Neighborhood Periodic Boundary Probabilistic CA
carl
An n cell CA with Rule Vector •
Rule Mean Term (RMT): Minterm of a 3 variable Boolean function for a 3-neighborhood CA cell denoted as T= {T(m)}, (m=0 to 7)
•
In Probabilistic CA, for a given RMT, the next state value is 0 or 1 with probability less than 1
Probabilistic CA Rules
carl
Deterministic & Probabilistic Rule 232
carl
Rule Vector Graph (RVG) of PrCA •
Efficient data-structure for characterization of CA evolution
•
RVG of n cell CA consists of n subgraphs
•
The output of ith subgraph serves as input node of (i+1)th subgraph
•
Nodes contain possible RMTs derived from RMTs of previous level
•
Root Node & Sink Node contain all possible RMTs for periodic boundary CA
•
An edge explicitly denotes probability of 1 whereas the probability of 0 is implicit
RVG Structure of Probabilistic Periodic Boundary 3 Neighborhood CA
carl
Probabilistic Rule Table
RVG Traversal • 2n possible number of paths for n cell PrCA • Forward RVG Traversal • Most probable next state identification
• Backward RVG Traversal • Most probable transition identification
• Forward/Backward Traversal
carl
• Most probable path determination through the RVG
carl
Pr CA Model for CpG island Identification •
3 cell CA can generate 8 states
•
Each of 8 states represents one of underlying states (A+,G+, T+, C+,A-, T- G-, C-)
•
One RVG represent single underlying state transition
•
Sequence of N such transitions can be modeled by a network of N RVGS
•
Most probable path determination by N sequential traversal of the RVG using Viterbi Algorithm
carl
RVG Network of CA Model & Model Traversal
Synthesis of Probabilistic Rules • •
Determination from Transition Probability Matrix by permuting binary encoding Lowest conflict encoding is taken
carl
Probabilistic Rules applied on three bits of binary encoding of underlying states
Three bit encoding for 8 under lying states
Filter Design •
Two Binary slices derived from DNA sequence by using common two bit encoding of nucleotides: C(00), A(01), G(10), T(11)
•
Feature Extraction from binary string of traing DNA sequences based on experimentally known CpG/non-CpG differentiation
carl
– – –
•
Class separation using Bhattacharyya distance Window size 200 bits (lowest difference between 2 classes) K-means Clustering to produce representative Datapoints
Analysis of feature vectors of testing sequence to assign 4th bit encoding –
– –
Using Mahanalobis Distance Measure
1 for window belongs to CpG 0 for window belongs to non-CpG
carl
Experimental Results
Prediction of CpG Islands on AF111167 gene sequence. The model and filter are trained by CPGISLE dataset
Inferences of experimental results • HMM prediction: » 4 False Positive Predictions » 2 False Negative Predictions
• Pr CA Model (with Filter):
carl
» 1 False Positive Prediction » 2 False Negative Predictions
carl
Concluding Remarks •
Probabilistic CA based Model for DNA sequence analysis
•
Improvement of CpG island prediction by reducing False Positive predictions
•
Analysis of binary encoded DNA string may provide better insight of local interaction of nucleotides
carl
Thanks !
Cellular Automata Research Lab Group Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh
Salt Lake City | Kolkata | www.carltig.res.in