Data Mining in Bioinformatics
Presented by: Ji ZhiLiang (HT006436A), Hu YanJun (HT016111W), Li ChangQin (HT016121U)
General Content
Presented by: Ji ZhiLiang
What is Bioinformatics?
Bioinformatics is the computer-assisted data management discipline that helps us gather, analyze, and represent biological information in order to understand life's processes.
What is Data mining?
Data mining: the extraction of hidden predictive information from large databases.
Why do we need Data mining?
♦ Massive biological information of various forms
• The Human Genome Project: over 22.1 billion bases of raw sequence data, comprising overlapping fragments totaling 3.9 billion bases; many tens of thousands of genes have been identified from the genome sequence. Analysis of the current sequence shows 38,000 predicted genes confirmed by experimental evidence.
• Swiss_Prot Database: more than 10,000 proteins
• Pubmed: more than 12,000,000 biological abstracts for information extraction, and the amount is still growing.
Why do we need Data mining? (cont.)
♦ Sophisticated relationships among the biological data
Why do we need Data mining? (cont.)
♦ The biopharmaceutical industry is generating more chemical and biological screening data than it knows what to do with or how best to handle. As a result, deciding which target and lead compound to develop further is often a long and arduous task.
Why do we need Data mining? (cont.)
♦ The traditional data analysis methods are not adequate to deal with the enormous data flow.
Most common techniques in Data mining
♦ Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
♦ Decision trees: Tree-shaped structures that represent sets of decisions.
♦ Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
♦ Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
♦ Rule induction: The extraction of useful if-then rules from data based on statistical significance.
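The nearest neighbor method described above can be illustrated with a minimal sketch. It assumes numeric feature vectors, Euclidean distance, and a simple majority vote as the "combination of classes"; the function name `knn_classify` and the toy records are hypothetical and not taken from the presentation.

```python
from collections import Counter
import math


def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (feature_vector, class_label) pairs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the historical records by Euclidean distance to the query.
    nearest = sorted(history, key=lambda rec: math.dist(query, rec[0]))[:k]
    # Combine the neighbours' classes by simple majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per record.
history = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
]
print(knn_classify((1.1, 1.0), history, k=3))
```

Other distance measures or weighted votes are equally valid choices; the slide does not specify one.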
Data mining applications in Bioinformatics
♦ Decision trees: predicting the possible cause of diseases from clinical records; phylogenetic data mining of gene sequences from public databases
♦ Artificial neural networks: drug safety studies and new drug development
♦ Rule induction: patterns in clinical pathways
Data mining applications in Bioinformatics (cont.)
Approaches of Data mining in Bioinformatics
♦ Influence-based mining: complex and granular (as opposed to linear) data in large databases are scanned for influences between specific data sets, and this is done along many dimensions and in multi-table formats. These systems find applications wherever there are significant cause-and-effect relationships between data sets, as occurs, for example, in large and multivariate gene expression studies, which are behind areas such as pharmacogenomics.
Approaches of Data mining in Bioinformatics (cont.)
♦ Affinity-based mining: large and complex data sets are analyzed across multiple dimensions, and the data-mining system identifies data points or sets that tend to be grouped together. These systems differentiate themselves by providing hierarchies of associations and showing any underlying logical conditions or rules that account for the specific groupings of data. This approach is particularly useful in biological motif analysis, whereby it is important to distinguish "accidental" or incidental motifs from ones with biological significance.
Approaches of Data mining in Bioinformatics (cont.)
♦ Time delay data mining: the data set is not available immediately and in complete form, but is collected over time. The systems designed to handle such data look for patterns that are confirmed or rejected as the data set increases and becomes more robust. This approach is geared towards long-term clinical trial analysis and multicomponent mode of action studies, for example.
Approaches of Data mining in Bioinformatics (cont.)
♦ Trend-based mining: the software analyzes large and complex data sets in terms of any changes that occur in specific data sets over time. The data sets can be user-defined, or the system can uncover them itself. Essentially, the system reports on anything that is changing over time.
Approaches of Data mining in Bioinformatics (cont.)
♦ Comparative data mining: it focuses on overlaying large and complex data sets that are similar to each other and comparing them. This is particularly useful in all forms of clinical trial meta-analyses, where data collected at different sites over different time periods, and perhaps under similar but not always identical conditions, need to be compared. Here, the emphasis is on finding dissimilarities, not similarities.
Approaches of Data mining in Bioinformatics (cont.)
♦ Predictive data mining: data mining alone is lacking somewhat if it is unable to also offer a framework for making simulations, predictions, and forecasts, based on the data sets it has analyzed. It combines pattern matching, influence relationships, time set correlations, and dissimilarity analysis to offer simulations of future data sets.
Case study
Richards G, Rayward-Smith VJ, Sonksen PH, Carey S and Weng C. (2001). Data Mining for Indicators of Early Mortality in a Database of Clinical Records. Artificial Intelligence in Medicine. 22: 215-231.
Li changqing HT016121U
Case Study
Outline
♦ Introduction
♦ Data preparation
♦ Rule induction
♦ Rule evaluation
♦ Interpretation
Introduction
♦ How KDD can be used in bioinformatics (an actual medical system)
♦ Medical data has increased dramatically
♦ Manual analysis is not adequate
♦ Data mining is necessary
Objective
♦ To find the relationship between the first observations of diabetes and early death
Data preparation
♦ Classification of cases
♦ Missing data
♦ Leveling of ages of observation
Classification of cases
♦ A threshold value of 60 years of age was chosen to discriminate between those who have died young and those who have not
♦ ‘age of death’ <= 60: Class T; ‘age of death’ > 60: Class F
♦ The records of the people who have died can be used
♦ If age > 60 and not died, then Class F
Classification of cases
♦ Category C is discarded
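The classification rules above can be written as a small function. This is a sketch under stated assumptions: the argument names are hypothetical, and the slides do not define Category C, so the sketch assumes it covers living patients who have not yet passed the threshold and returns `None` for them.

```python
def classify_case(age_of_death=None, current_age=None, threshold=60):
    """Return 'T', 'F', or None for a patient record.

    age_of_death: age at death, or None if the patient is still alive.
    current_age:  latest known age of a living patient.
    """
    if age_of_death is not None:
        # Deceased patients can always be classified.
        return "T" if age_of_death <= threshold else "F"
    if current_age is not None and current_age > threshold:
        # Alive and already past the threshold: did not die young.
        return "F"
    # Alive and not past the threshold: outcome unknown (assumed Category C).
    return None


print(classify_case(age_of_death=55))   # died at 55
print(classify_case(current_age=65))    # alive at 65
print(classify_case(current_age=40))    # alive at 40
```

Records for which the function returns `None` would be dropped before building the dataset.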
Result of Classification
♦ The result of this stage was a file (D60) of 3971 records, randomly split into a training set (2/3) (D60_train) and a testing set (1/3) (D60_test)
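A random 2/3 to 1/3 split of this kind might look as follows. The helper `split_records`, the fixed seed, and the placeholder records are assumptions for illustration; the slides do not state the exact sizes of D60_train and D60_test or how the split was drawn.

```python
import random


def split_records(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records: integers standing in for the 3971 rows of D60.
d60 = list(range(3971))
d60_train, d60_test = split_records(d60)
print(len(d60_train), len(d60_test))  # 2647 1324
```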
Missing data
♦ Most records and most attributes have a significant proportion of blank fields
♦ Solutions:
a. discard all attributes or all records with missing data
b. estimate the missing values by reference to existing values
Example
[Table: sample records (Rec1, Rec2, …) against attributes (Attri1, Attri2, …); the cell layout is not recoverable from the source]
Leveling of ages of observation
♦ Those who die young (<= 60) will tend to have been examined at a younger age than those who do not; this introduces bias, so the two groups are not directly comparable
♦ Three methods to solve this problem
Methods for leveling
♦ Deletion
♦ Duplication
♦ Adjustment
♦ The leveled training sets were saved as files D60_del, D60_dup, D60_adj
Results for leveling
Rule induction
♦ Using the simulated annealing algorithm
♦ The following sets were defined
A = {r | record r satisfies the condition of the rule}
B = {r | record r satisfies the conclusion of the rule}
C = {r | record r satisfies both the condition and the conclusion of the rule}
♦ a = |A|, b = |B|, c = |C|
♦ Accuracy (confidence) = c/a, Coverage (sensitivity) = c/b
♦ Maximize c, minimize (a-c) and minimize (b-c)
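The accuracy and coverage definitions above can be computed directly from the counts a, b, and c. The sketch below assumes a rule is given as two predicates over a record; the function `rule_quality`, the field names, and the toy records are hypothetical.

```python
def rule_quality(records, condition, conclusion):
    """Return (accuracy, coverage) of the rule 'if condition then conclusion'."""
    a = sum(1 for r in records if condition(r))                    # |A|
    b = sum(1 for r in records if conclusion(r))                   # |B|
    c = sum(1 for r in records if condition(r) and conclusion(r))  # |C|
    accuracy = c / a if a else 0.0  # confidence: c/a
    coverage = c / b if b else 0.0  # sensitivity: c/b
    return accuracy, coverage


# Hypothetical toy records, not data from the study.
records = [
    {"nerve_damage": True, "cls": "T"},
    {"nerve_damage": True, "cls": "T"},
    {"nerve_damage": True, "cls": "F"},
    {"nerve_damage": False, "cls": "T"},
    {"nerve_damage": False, "cls": "T"},
    {"nerve_damage": False, "cls": "F"},
    {"nerve_damage": False, "cls": "F"},
]
accuracy, coverage = rule_quality(
    records,
    condition=lambda r: r["nerve_damage"],
    conclusion=lambda r: r["cls"] == "T",
)
print(accuracy, coverage)
```

Here a = 3, b = 4, and c = 2, so the accuracy is 2/3 and the coverage is 1/2.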
Rule evaluation on training data
♦ Based on training data, rules not significant at 5% were rejected
♦ Rules with coverage less than 5% were also rejected
♦ Then 9 of the 41 rules were rejected
Rule evaluation on testing data
♦ Based on testing data, rules not significant at 5% were rejected
♦ Then 22 of the remaining 32 rules were rejected; only 10 rules were kept for final use
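One common way to check whether a rule is significant at 5% is a Pearson chi-squared test on the 2×2 table of condition against conclusion. The sketch below assumes that test, with the counts a = |A|, b = |B|, c = |C| and the total number of records n; the function names are hypothetical, and 3.841 is the standard critical value for one degree of freedom at the 5% level.

```python
def chi_squared_2x2(a, b, c, n):
    """Pearson chi-squared statistic for a rule from the counts a, b, c and n."""
    # Observed table: rows = condition met / not met,
    # columns = conclusion met / not met.
    observed = [
        [c, a - c],
        [b - c, n - a - b + c],
    ]
    row_totals = [a, n - a]
    col_totals = [b, n - b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("degenerate table: an expected count is zero")
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


CRITICAL_5_PERCENT_DF1 = 3.841  # chi-squared critical value, 1 degree of freedom


def is_significant(a, b, c, n):
    """True if the rule's association is significant at the 5% level."""
    return chi_squared_2x2(a, b, c, n) > CRITICAL_5_PERCENT_DF1


print(chi_squared_2x2(50, 50, 40, 100))  # strong association
print(is_significant(50, 50, 25, 100))   # condition and conclusion independent
```

A coverage filter such as the 5% cut-off used on the training data could be applied alongside this test.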
Interpretation
♦ Rules indicate that nerve damage is the most important indicator of early death
♦ A recent medical study confirmed the rules mined from KDD
Conclusion
Hu YanJun
Conclusion on Bio-info DM
♦ First visit observations and early mortality are significantly associated
♦ KDD methods are useful in the above study
♦ To find whether the findings are general, statistical analysis is required to confirm that the association is a “real” association
KDD process
♦ KDD (Knowledge Discovery in Databases) methods are being developed to identify patterns within the data that can be exploited
♦ The following are the corresponding techniques that can be used in KDD
Data preparation
♦ The pre-processing stage
– Data not clean
– Too many attributes
– Discrete data needed
– Missing data: cannot be deleted easily!
Data preparation (cont)
♦ Usually the original data was derived from a relational stage
♦ Reforming the data into a spreadsheet format suitable for data mining
Data preparation (cont)
♦ This extensive pre-processing work is typical for clinical records
– Data collected for each patient vary considerably
– Some unreliability
– Some patients have long-period recording while other patients only have a single visit
Leveling
♦ Some attributes may be biased so that they influence the whole training data
♦ The training set was adjusted so that the distribution of one specific attribute is almost the same across the classes
Leveling (cont)
♦ The objective of leveling in this case is to eliminate the effect of age of observation from the training data
♦ Stripping unnecessary attributes so that only important attributes are left, such as gender, marital status …
FSS (Feature Subset Selection)
♦ Used to compensate for missing values
♦ Uses an entropy-based information measure
♦ Not useful for all cases, especially when the distinction between FSS and rule induction becomes blurred
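The slide does not say which entropy-based measure was used. As an illustration only, the sketch below computes information gain, one common entropy-based score for ranking attributes, and assumes that records with a missing value for the attribute are simply skipped. The function names, field names, and toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in class entropy obtained by splitting on `attribute`."""
    # Assumption: records with a missing (None) value are ignored.
    known = [r for r in records if r.get(attribute) is not None]
    if not known:
        return 0.0
    base = entropy([r[target] for r in known])
    remainder = 0.0
    for value in {r[attribute] for r in known}:
        subset = [r[target] for r in known if r[attribute] == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return base - remainder


# Hypothetical toy records: `x` predicts the class, `y` does not.
records = [
    {"x": "a", "y": "p", "cls": "T"},
    {"x": "a", "y": "q", "cls": "T"},
    {"x": "b", "y": "p", "cls": "F"},
    {"x": "b", "y": "q", "cls": "F"},
    {"x": None, "y": None, "cls": "T"},
]
print(information_gain(records, "x", "cls"))
print(information_gain(records, "y", "cls"))
```

Attributes with a gain close to zero would be candidates for removal from the feature subset.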
Data mining
♦ The interface provides users with extensive control of the simulated annealing algorithm
♦ The algorithm uses a heuristic search strategy to find the “best” rule
♦ Users can control various parameters in order to optimize the search
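To show what a simulated annealing search for a "best" rule can look like, here is a minimal sketch. It is not the software described in the slides: the rule form (a single numeric threshold), the fitness function, and every parameter name and default are assumptions chosen for illustration.

```python
import math
import random


def fitness(threshold, records):
    """Hypothetical fitness for the rule 'score >= threshold implies died young'."""
    a = sum(1 for score, _ in records if score >= threshold)
    c = sum(1 for score, died_young in records if score >= threshold and died_young)
    # Reward correctly matched records (c), penalise wrongly matched ones (a - c).
    return c - (a - c)


def anneal(records, start, steps=500, temp=5.0, cooling=0.99, step_size=1.0, seed=0):
    """Search for a threshold with high fitness using simulated annealing."""
    rng = random.Random(seed)
    current = best = start
    current_fit = best_fit = fitness(start, records)
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        cand_fit = fitness(candidate, records)
        delta = cand_fit - current_fit
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
        temp *= cooling
    return best, best_fit


# Hypothetical toy data: (score, died_young) pairs.
records = [(1, False), (2, False), (3, False), (4, True), (5, False),
           (6, True), (7, True), (8, True), (9, True)]
best_threshold, best_fit = anneal(records, start=0.0)
print(best_threshold, best_fit)
```

The starting temperature, cooling rate, and step size play the role of the user-controllable parameters mentioned above.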
Data mining (cont)
♦ Comprehensive pre-processing facilities are included
♦ The generated rules were simple to understand
– In the medical domain the primary objective was explanation rather than prediction
Data mining (cont)
♦ This data mining software can efficiently handle the missing values
– Medical databases typically have a high proportion of missing values
– Many of the discovered rules were based on attributes with 50% missing values
Rule evaluation
♦ Some rules have low levels of accuracy and coverage
– They may be produced by chance
♦ The χ² test was a useful method
Rule evaluation (cont)
♦ Needs further investigation
– A very large number of rules are generated and evaluated
♦ Further work
– Incorporating χ² into the fitness function of the simulated annealing algorithm.
End
♦ Questions and Answers