Data Mining

November 2019

Data Mining in Bioinformatics

Presented by: Ji ZhiLiang (HT006436A), Hu YanJun (HT016111W), Li ChangQin (HT016121U)

General Content

Presented by: Ji ZhiLiang

What is Bioinformatics?

Bioinformatics is the computer-assisted data management discipline that helps us gather, analyze, and represent biological information in order to understand life's processes.

What is Data mining?

Data mining: the extraction of hidden predictive information from large databases.

Why do we need Data mining?

♦ Massive biological information in various forms
• The Human Genome Project: over 22.1 billion bases of raw sequence data, comprising overlapping fragments totaling 3.9 billion bases; many tens of thousands of genes have been identified from the genome sequence. Analysis of the current sequence shows 38,000 predicted genes confirmed by experimental evidence.
• Swiss-Prot Database: more than 10,000 proteins
• PubMed: more than 12,000,000 biological abstracts for information extraction, and the number is still growing.

Why do we need Data mining? (cont.)

♦ Sophisticated relationships among the biological data

Why do we need Data mining? (cont.)

♦ The biopharmaceutical industry is generating more chemical and biological screening data than it knows what to do with or how best to handle. As a result, deciding which target and lead compound to develop further is often a long and arduous task.

Why do we need Data mining? (cont.)

♦ The traditional data analysis methods are not adequate to deal with the enormous data flow.

Most common techniques in Data mining

♦ Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
♦ Decision trees: Tree-shaped structures that represent sets of decisions.
♦ Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
♦ Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
♦ Rule induction: The extraction of useful if-then rules from data based on statistical significance.
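The nearest neighbor method described on this slide can be sketched in a few lines of Python. This is only an illustration: the Euclidean distance, the majority vote used to combine classes, and the toy `history` records are assumptions, not details from the slides.

```python
from collections import Counter
import math

def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k most similar historical records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank historical (features, label) pairs by Euclidean distance to the query.
    ranked = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (feature vector, class label).
history = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

print(knn_classify((1.1, 0.9), history, k=3))  # prints: A
```

With k = 3 the query's two closest records are class A and the third is class B, so the vote returns A.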

Data mining applications in Bioinformatics

♦ Decision trees: predict the possible cause of diseases from clinical records; phylogenetic data mining of gene sequences from public databases
♦ Artificial neural networks: drug safety study and new drug development
♦ Rule induction: patterns in clinical pathways

Data mining applications in Bioinformatics (cont.)

Approaches of Data mining in Bioinformatics

♦ Influence-based mining: complex and granular (as opposed to linear) data in large databases are scanned for influences between specific data sets, and this is done along many dimensions and in multi-table formats. These systems find applications wherever there are significant cause-and-effect relationships between data sets, as occurs, for example, in large and multivariant gene expression studies, which are behind areas such as pharmacogenomics.

Approaches of Data mining in Bioinformatics (cont.)

♦ Affinity-based mining: large and complex data sets are analyzed across multiple dimensions, and the data-mining system identifies data points or sets that tend to be grouped together. These systems differentiate themselves by providing hierarchies of associations and showing any underlying logical conditions or rules that account for the specific groupings of data. This approach is particularly useful in biological motif analysis, whereby it is important to distinguish "accidental" or incidental motifs from ones with biological significance.

Approaches of Data mining in Bioinformatics (cont.)

♦ Time delay data mining: the data set is not available immediately and in complete form, but is collected over time. The systems designed to handle such data look for patterns that are confirmed or rejected as the data set increases and becomes more robust. This approach is geared towards long-term clinical trial analysis and multicomponent mode of action studies, for example.

Approaches of Data mining in Bioinformatics (cont.)

♦ Trend-based mining: the software analyzes large and complex data sets in terms of any changes that occur in specific data sets over time. The data sets can be user-defined, or the system can uncover them itself. Essentially, the system reports on anything that is changing over time.

Approaches of Data mining in Bioinformatics (cont.)

♦ Comparative data mining: focuses on overlaying large and complex data sets that are similar to each other and comparing them. This is particularly useful in all forms of clinical trial meta-analyses, where data collected at different sites over different time periods, and perhaps under similar but not always identical conditions, need to be compared. Here, the emphasis is on finding dissimilarities, not similarities.

Approaches of Data mining in Bioinformatics (cont.)

♦ Predictive data mining: data mining alone is somewhat lacking if it cannot also offer a framework for making simulations, predictions, and forecasts based on the data sets it has analyzed. Predictive data mining combines pattern matching, influence relationships, time set correlations, and dissimilarity analysis to offer simulations of future data sets.

Case study

Richards G, Rayward-Smith VJ, Sonksen PH, Carey S and Weng C. (2001). Data Mining for Indicators of Early Mortality in a Database of Clinical Records. Artificial Intelligence in Medicine. 22: 215-231.

Case Study

Li changqing (HT016121U)

Outline

♦ Introduction
♦ Data preparation
♦ Rule induction
♦ Rule evaluation
♦ Interpretation

Introduction

♦ How KDD can be used in bioinformatics (an actual medical system)
♦ Medical data has increased dramatically
♦ Manual analysis is not adequate
♦ Data mining is necessary

Objective

♦ To find the relationship between the first observations of diabetes and early death

Data preparation

♦ Classification of cases
♦ Missing data
♦ Leveling of ages of observation

Classification of cases

♦ A threshold value of 60 years of age was chosen to discriminate between those who have died young and those who have not
♦ ‘age of death’ <= 60: Class T; ‘age of death’ > 60: Class F
♦ The records of the people who have died can be used
♦ If age > 60 and not died, then Class F
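The classification rules on this slide can be written as a small function. This is a sketch under assumptions: the function name and arguments are hypothetical, and returning `None` for living patients aged 60 or under is an assumption, since the slide does not say how those cases are labeled.

```python
from typing import Optional

THRESHOLD = 60  # years of age, as stated on the slide

def classify_case(age: int, died: bool) -> Optional[str]:
    """Return 'T' (died young), 'F' (did not die young), or None if undecided.

    `age` is the age at death for deceased patients, otherwise the
    last known age of the patient.
    """
    if died:
        return "T" if age <= THRESHOLD else "F"
    if age > THRESHOLD:
        # Still alive past the threshold, so the patient cannot die young.
        return "F"
    # Alive and not yet past the threshold: outcome unknown (assumed unclassifiable).
    return None

print(classify_case(55, True), classify_case(72, False), classify_case(40, False))
```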

Classification of cases

♦ Category C is discarded

Result of Classification

♦ The result of this stage was a file (D60) of 3971 records, randomly split into a training set (2/3) (D60_train) and a testing set (1/3) (D60_test)
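A random 2/3 to 1/3 split like the one described here might look as follows. The function name, the fixed seed, and the use of integers as stand-in records are illustrative assumptions; the slides do not say how the split was actually drawn.

```python
import random

def split_records(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 3971 records of D60.
d60 = list(range(3971))
d60_train, d60_test = split_records(d60)
print(len(d60_train), len(d60_test))  # prints: 2647 1324
```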

Missing data

♦ Most records and most attributes have a significant proportion of blank fields
♦ Solutions:
a. discard all attributes or all records with missing data
b. estimate the missing values by reference to existing values
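The two solutions listed on this slide can be illustrated as below. This is a sketch, not the study's procedure: missing values are represented as `None`, solution (a) is shown for records only, and solution (b) is assumed to fill each gap with the attribute's most common observed value, which is just one way to estimate from existing values. The attribute names are hypothetical.

```python
from collections import Counter

def drop_incomplete(records):
    """Solution (a): keep only records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_mode(records):
    """Solution (b): fill each missing value with the attribute's most common value."""
    attributes = sorted({key for r in records for key in r})
    modes = {}
    for attr in attributes:
        observed = [r[attr] for r in records if r.get(attr) is not None]
        if observed:
            modes[attr] = Counter(observed).most_common(1)[0][0]
    filled = []
    for r in records:
        filled.append({
            attr: r[attr] if r.get(attr) is not None else modes.get(attr)
            for attr in attributes
        })
    return filled

# Hypothetical clinical records with blank fields.
records = [
    {"smoker": "yes", "bp": "high"},
    {"smoker": None, "bp": "high"},
    {"smoker": "yes", "bp": None},
    {"smoker": "no", "bp": "low"},
]

print(len(drop_incomplete(records)))  # prints: 2
print(impute_mode(records)[1])
```

Discarding is simple but throws away half of this toy dataset, which hints at why it is unattractive when most records have blanks.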

Example

[Table: records Rec1, Rec2, … against attributes Attri1, Attri2, …]

Leveling of ages of observation

♦ Those who die young (<= 60) will tend to have been examined at a younger age than those who do not, so bias occurs and the two groups are not directly comparable
♦ Three methods to solve this problem

Methods for leveling

♦ Deletion
♦ Duplication
♦ Adjustment
♦ The leveled training sets were saved as files D60_del, D60_dup, D60_adj

Results for leveling

Rule induction

♦ Using simulated annealing algorithm
♦ The following sets were defined

A = {r | record r satisfies the condition of the rule}
B = {r | record r satisfies the conclusion of the rule}
C = {r | record r satisfies both the condition and the conclusion of the rule}

♦ a = |A|, b = |B|, c = |C|
♦ Accuracy (confidence) = c/a, Coverage (sensitivity) = c/b
♦ Maximize c, minimize (a - c) and minimize (b - c)
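The counts a, b, c and the two quality measures can be computed directly from a set of records. In this sketch a rule is represented as two Python predicates (`condition` and `conclusion`), and the sample records and attribute names are made up for illustration.

```python
def rule_quality(records, condition, conclusion):
    """Return (a, b, c, accuracy, coverage) for one if-then rule."""
    a = sum(1 for r in records if condition(r))                     # |A|
    b = sum(1 for r in records if conclusion(r))                    # |B|
    c = sum(1 for r in records if condition(r) and conclusion(r))   # |C|
    accuracy = c / a if a else 0.0  # confidence = c/a
    coverage = c / b if b else 0.0  # sensitivity = c/b
    return a, b, c, accuracy, coverage

# Hypothetical records: does the rule "nerve_damage -> class T" hold?
records = [
    {"nerve_damage": True, "cls": "T"},
    {"nerve_damage": True, "cls": "T"},
    {"nerve_damage": True, "cls": "F"},
    {"nerve_damage": False, "cls": "T"},
    {"nerve_damage": False, "cls": "F"},
    {"nerve_damage": False, "cls": "F"},
]

a, b, c, accuracy, coverage = rule_quality(
    records,
    condition=lambda r: r["nerve_damage"],
    conclusion=lambda r: r["cls"] == "T",
)
print(a, b, c)  # prints: 3 3 2
```

Here accuracy and coverage are both 2/3: a - c = 1 record matches the condition but not the conclusion, and b - c = 1 record has the conclusion without the condition.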

Rule evaluation on training data

♦ Based on training data, rules not significant at 5% were rejected
♦ Rules with coverage less than 5% were also rejected
♦ As a result, 9 of the 41 rules were rejected
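One way to apply these two filters is sketched below. It assumes the significance test is a chi-squared test of independence on the 2x2 table of condition against conclusion (without continuity correction); the function names and example counts are hypothetical.

```python
# Chi-squared critical value for 1 degree of freedom at the 5% level.
CHI2_CRITICAL_5PCT_DF1 = 3.841

def chi_squared_2x2(a, b, c, n):
    """Chi-squared statistic for a rule, given a=|A|, b=|B|, c=|C| and n records."""
    # Observed counts: rows = condition yes/no, columns = conclusion yes/no.
    observed = [[c, a - c], [b - c, n - a - b + c]]
    row_totals = [a, n - a]
    col_totals = [b, n - b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                return 0.0  # degenerate table: no evidence of association
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def keep_rule(a, b, c, n, min_coverage=0.05):
    """Keep a rule only if it is significant at 5% and covers at least 5%."""
    coverage = c / b if b else 0.0
    significant = chi_squared_2x2(a, b, c, n) > CHI2_CRITICAL_5PCT_DF1
    return significant and coverage >= min_coverage

print(keep_rule(a=20, b=30, c=15, n=100))   # strong association, good coverage
print(keep_rule(a=20, b=30, c=6, n=100))    # c equals the count expected by chance
print(keep_rule(a=5, b=200, c=5, n=1000))   # significant, but coverage is only 2.5%
```

The statistic is two-sided, so a fuller version would also check that the rule's accuracy exceeds the base rate b/n before keeping it.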

Rule evaluation on testing data

♦ Based on testing data, rules not significant at 5% were rejected
♦ As a result, 22 of the remaining 32 rules were rejected; only 10 rules were left for final use

Interpretation

♦ Rules indicate that nerve damage is the most important indicator of early death
♦ A recent medical study confirmed the rules mined by KDD

Conclusion

Hu YanJun

Conclusion on Bio-info DM

♦ First visit observations and early mortality are significantly associated
♦ KDD methods are useful in the above study
♦ To determine whether the findings are general, statistical analysis is required to confirm that the association is a “real” association

KDD process

♦ KDD (Knowledge Discovery in Databases) methods are being developed to identify patterns within the data that can be exploited
♦ The following are the corresponding techniques that can be used in KDD

Data preparation

♦ The pre-processing stage
– Data not clean
– Too many attributes
– Discrete data needed
– Missing data cannot be deleted easily!

Data preparation (cont)

♦ Usually the original data was derived from a relational stage
♦ The data is reformed into a spreadsheet format suitable for data mining

Data preparation (cont)

♦ This extensive pre-processing work is typical for clinical records
– Data collected for each patient vary considerably
– Some unreliability
– Some patients have long-period recording while other patients only have a single visit

Leveling

♦ Some attributes may be biased, so they may influence the whole training data
♦ The training set was adjusted so that the distribution on one specific attribute is almost the same

Leveling (cont)

♦ The leveling objective in this case is to eliminate the effect of age of observation from the training data
♦ Unnecessary attributes are stripped so that only important attributes are left, such as gender, marital status …

FSS (Feature Subset Selection)

♦ Adjusts to compensate for missing values
♦ Uses entropy-based information
♦ Not useful for all cases, especially when the distinction between FSS and rule induction becomes blurred

Data mining

♦ The interface provides users with extensive control of the simulated annealing algorithm
♦ The algorithm uses a heuristic search strategy to find the “best” rule
♦ Users can control various parameters in order to optimize the search
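A heuristic rule search by simulated annealing can be sketched as follows. Everything specific here is an assumption made for illustration: rules are conjunctions of tests on boolean attributes, the fitness is c - (a - c) - (b - c) (rewarding matches and penalizing both kinds of mismatch), and the parameters `steps`, `t_start` and `cooling` stand in for the user-controlled settings, which the slides do not list.

```python
import math
import random

def fitness(rule, records, target):
    """Assumed fitness: reward c, penalize (a - c) and (b - c)."""
    a = b = c = 0
    for rec in records:
        cond = all(rec[attr] == val for attr, val in rule.items())
        concl = rec[target]
        a += cond
        b += concl
        c += cond and concl
    return c - (a - c) - (b - c)

def neighbor(rule, attributes, rng):
    """Copy `rule` and set, flip, or remove the test on one random attribute."""
    new_rule = dict(rule)
    attr = rng.choice(attributes)
    choice = rng.choice([True, False, None])
    if choice is None:
        new_rule.pop(attr, None)
    else:
        new_rule[attr] = choice
    return new_rule

def anneal(records, attributes, target, steps=2000, t_start=2.0, cooling=0.995, seed=0):
    """Search for a high-fitness rule; returns (best_rule, best_fitness)."""
    rng = random.Random(seed)
    current, current_fit = {}, fitness({}, records, target)
    best, best_fit = current, current_fit
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current, attributes, rng)
        cand_fit = fitness(candidate, records, target)
        delta = cand_fit - current_fit
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
        temperature *= cooling  # geometric cooling schedule
    return best, best_fit

# Hypothetical data in which early death coincides exactly with nerve damage.
rows = [(True, True, True), (True, False, True), (True, True, True),
        (False, True, False), (False, False, False), (False, False, False)]
records = [{"nerve_damage": n, "smoker": s, "early_death": e} for n, s, e in rows]

best_rule, best_fit = anneal(records, ["nerve_damage", "smoker"], "early_death")
print(best_rule, best_fit)
```

Early on, the high temperature lets the search accept worse rules and escape poor starting points; as the temperature falls it behaves more and more like greedy hill climbing.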

Data mining (cont)

♦ Comprehensive pre-processing facilities are included
♦ The generated rules were simple to understand
– In the medical domain, the primary objective was explanation rather than prediction

Data mining (cont)

♦ This data mining software can efficiently handle the missing values
– Medical databases typically have a high proportion of missing values
– Many of the discovered rules were based on attributes with 50% missing values

Rule evaluation

♦ Some rules have low levels of accuracy and coverage
– They may be produced by chance
♦ The χ² test was a useful method

Rule evaluation (cont)

♦ Needs further investigation
– A very large number of rules are generated and evaluated
♦ Further work
– Incorporating χ² into the fitness function of the simulated annealing algorithm

End

♦ Questions and Answers
