Module 7 (Preprocessing)

Module 6: Data preprocessing, PCA, hypothesis testing, clustering

[Figure: an unknown sample (?) to be assigned to Class A or Class B]

Classification – "Work Flow"
Raw Data → Preprocessing → Feature Selection → Classifier → Quality Assessment

Raw Data
• Typically vectors

  ID     Features x = (x1, x2, …, xn)     Target value y
  Mol1   12, 2, 1, 3, 4, 36, 7.5          1
  Mol2   4, 5.6, 3, 8, 99, 1, 5.2         0
  Mol3   2.1, 2.2, 4, 6, 3, 1, 11         1
  …      …                                …
  MolM   x1, x2, …, xn                    yM

M data points in an N-dimensional space.

Preprocessing: Mean-Centering
• Eliminates a constant offset of the data

x^*_{ik} = x_{ik} - \bar{x}_k

(i: row index, k: column index)

Preprocessing: Range Scaling
• Eliminates different column metrics

x^*_{ik} = \frac{x_{ik} - x_{k,\min}}{x_{k,\max} - x_{k,\min}}, \qquad 0 \le x^*_{ik} \le 1

(i: row index, k: column index)

Preprocessing: Autoscaling
• Eliminates different column metrics
• Yields data with zero mean and unit variance

x^*_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad s_k = \sqrt{\frac{1}{M-1}\sum_{i=1}^{M}(x_{ik} - \bar{x}_k)^2}

(i: row index, k: column index)

→ "Sphericization" of groups
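To make the three scaling steps concrete, here is a minimal NumPy sketch (the small 3×3 matrix is made-up example data):

```python
import numpy as np

# Example data matrix: M rows (molecules) x N columns (features).
X = np.array([[12.0, 2.0, 1.0],
              [ 4.0, 5.6, 3.0],
              [ 2.1, 2.2, 4.0]])

# Mean-centering: subtract each column mean.
X_centered = X - X.mean(axis=0)

# Range scaling: map each column to [0, 1].
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Autoscaling: zero mean and unit variance per column (ddof=1 -> M-1 denominator).
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(X_auto.mean(axis=0))          # ~0 for every column
print(X_auto.std(axis=0, ddof=1))   # 1 for every column
```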

Preprocessing / Feature Extraction: PCA
• Reduces the dimension of the data vectors
• Generates a new orthogonal co-ordinate system

[Figure: raw data (3D, axes x1, x2, x3) projected by PCA onto a 2D scores plot (PC1 vs. PC2)]

PCA

Raw data and scores:

  ID      x1    x2    x3        PC1    PC2    PC3
  Mol1    34    12     3         0.7   -1.4    1.6
  Mol2     6     2     3         0.4    1.0   -1.5
  Mol3     2    12    28        -1.1    0.4   -0.2
  Mol4     9    17     4        -0.6   -1.7   -0.4
  Mol5    20     5     9         0.5    0.4    0.2
  Mol6     4    14    56        -1.9    1.1    …
  …        …     …     …          …      …      …
  Mol10    …     …     …          …      …      …

Correlation of the original variables with the scores:

         PC1     PC2     PC3
  x1     0.81   -0.16    0.57
  x2    -0.76   -0.63    0.16
  x3    -0.79    0.44    0.43

Excursus: PCA

Initial data matrix, X:

          x1   x2   x3   x4   x5   x6
  Mol1     2    1    1    1    3    3
  Mol2     1    1    2    1    6    7
  Mol3     6    5    6    4    6    5
  Mol4     5    5    6    5    4    4

• Which "basic properties" (factors) underlie the six variables?
• Are all six variables needed to describe the molecules?

→ Correlation analysis to reveal the relationships between the variables
→ Factor analysis

Excursus: PCA

Correlation matrix, R
Correlation coefficient (Pearson):

r_{x_1,x_2} = \frac{\sum_{i=1}^{N}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{N}(x_{1i} - \bar{x}_1)^2 \cdot \sum_{i=1}^{N}(x_{2i} - \bar{x}_2)^2}} \in [-1, 1]

N: number of objects (molecules)
(n: number of variables)

         x1     x2     x3     x4     x5     x6
  x1     1
  x2     0.97   1
  x3     0.93   0.99   1
  x4     0.92   0.98   0.97   1
  x5     0.14   0.19   0.32   0.08   1
  x6    -0.29  -0.17  -0.02  -0.21   0.87   1
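The correlation matrix above can be recomputed directly from the 4×6 data matrix X of the previous slide; a minimal NumPy sketch:

```python
import numpy as np

# Data matrix X from the excursus: 4 molecules (rows) x 6 variables (columns).
X = np.array([[2, 1, 1, 1, 3, 3],
              [1, 1, 2, 1, 6, 7],
              [6, 5, 6, 4, 6, 5],
              [5, 5, 6, 5, 4, 4]], dtype=float)

# Pearson correlation matrix R (6 x 6); rowvar=False treats columns as variables.
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 2))   # reproduces 0.97, 0.93, ... from the table above
```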

Excursus: PCA

Standardized variable matrix, Z:

z_{ji} = \frac{x_{ji} - \bar{x}_j}{s_j}, \qquad s_j = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{ji} - \bar{x}_j)^2}

s_j: standard deviation of variable j
→ mean of zero, standard deviation of 1

R = \frac{1}{N-1} Z \cdot Z^T

Z^T: transpose of the standardized data matrix

→ simpler computation of R
→ for standardized data, the correlation matrix and the variance-covariance matrix are identical:

s^2_{jk} = \mathrm{cov}(j,k) = \frac{1}{N-1}\sum_{i=1}^{N} z_{ji} z_{ki}

j and k: two of the n variables

Excursus: PCA

Principal Component Analysis (PCA) – model:

z_j = a_{j1} F_1 + a_{j2} F_2 + \dots + a_{jn} F_n, \qquad Z = A \cdot F

n: number of variables in X
F: uncorrelated components ("factors")

Each of the n variables in Z is described by n new, uncorrelated variables F.
A linear relationship between the factors and the variables is assumed!

Excursus: PCA

Principal Component Analysis (PCA):

Z = F \cdot A, \qquad X = S \cdot L^T

S: score vectors (→ new variable values)
L: loadings (→ correlation with the original variables)

Principal components of a matrix
• are orthogonal to each other ("perpendicular")
• do not change their direction under a linear transformation of the matrix

heuristic: NIPALS algorithm
exact: eigenvectors of the variance-covariance matrix
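A minimal sketch of the "exact" route, i.e. PCA via the eigenvectors of the variance-covariance matrix of the autoscaled data, reusing the 4×6 example matrix from above (the loadings convention shown is one common choice, not necessarily the one used in these slides):

```python
import numpy as np

# 4 x 6 example matrix X from the excursus slides.
X = np.array([[2, 1, 1, 1, 3, 3],
              [1, 1, 2, 1, 6, 7],
              [6, 5, 6, 4, 6, 5],
              [5, 5, 6, 5, 4, 4]], dtype=float)

# Autoscale -> standardized matrix Z; its covariance equals the correlation matrix R.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.cov(Z, rowvar=False)

# Exact PCA: eigenvectors of the variance-covariance (here: correlation) matrix.
eigval, eigvec = np.linalg.eigh(R)           # eigh: for symmetric matrices
order = np.argsort(eigval)[::-1]             # sort PCs by decreasing eigenvalue
eigval, eigvec = eigval[order], eigvec[:, order]

scores = Z @ eigvec                                    # S: new variable values per object
loadings = eigvec * np.sqrt(np.clip(eigval, 0, None))  # correlation of PCs with variables
explained = eigval / eigval.sum()                      # fraction of variance per PC
print(np.round(explained, 2))
```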

Excursus: Covariance

Covariance matrix of the random variables x1, x2, x3:

\Sigma = \begin{pmatrix} \mathrm{Cov}(x_1,x_1) & \mathrm{Cov}(x_1,x_2) & \mathrm{Cov}(x_1,x_3) \\ \mathrm{Cov}(x_2,x_1) & \mathrm{Cov}(x_2,x_2) & \mathrm{Cov}(x_2,x_3) \\ \mathrm{Cov}(x_3,x_1) & \mathrm{Cov}(x_3,x_2) & \mathrm{Cov}(x_3,x_3) \end{pmatrix}

with

\mathrm{Cov}(x,y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)

Determinant |Σ| of a matrix A: the determinant function det(A) assigns a unique value (a number) to a matrix. This value is called the "determinant" ("scaling factor of the volume") and equals the product of the eigenvalues.

"Rule of Sarrus" (valid only for 3×3 matrices!):

|\Sigma| = \det(A) = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}
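A quick numerical check (with an illustrative symmetric 3×3 matrix) that the Sarrus expansion, the product of the eigenvalues, and NumPy's determinant agree:

```python
import numpy as np

# Illustrative symmetric 3x3 matrix (stands in for a covariance matrix).
A = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])

# Rule of Sarrus (3x3 matrices only).
det_sarrus = (A[0, 0]*A[1, 1]*A[2, 2] + A[0, 1]*A[1, 2]*A[2, 0] + A[0, 2]*A[1, 0]*A[2, 1]
              - A[0, 2]*A[1, 1]*A[2, 0] - A[0, 0]*A[1, 2]*A[2, 1] - A[0, 1]*A[1, 0]*A[2, 2])

# Determinant as the product of the eigenvalues.
det_eig = np.prod(np.linalg.eigvalsh(A))

print(det_sarrus, det_eig, np.linalg.det(A))   # all three agree
```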

Excursus: PCA

NIPALS algorithm (S. Wold): nonlinear iterative partial least squares

Reproduces the data matrix X by a matrix product of S (score matrix) and L^T (transposed loading matrix); iterative estimation of s_1, s_2, etc.

1)  s := x_i                Choose an arbitrary column of the matrix X as the initial score vector s.
2)  l = X^T s / (s^T s)     Project X onto s to obtain the corresponding loading vector.
3)  l = l / |l|             Normalize l to length 1.
4)  s_old := s              Store the current score vector.
5)  s = X l / (l^T l)       Project X onto l to obtain the corresponding new score vector.
6)  d := |s_old - s|        Compute the difference d between the new and the old score vector.
7)  d < t?                  If d is larger than a defined threshold t (e.g. 10^-6), go back to step 2; otherwise continue with step 8.
8)  E := X - s l^T          Remove the contribution of the found principal component from the data matrix X.
9)  X := E                  To find further principal components, repeat the algorithm from step 1 with the new data matrix.
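The nine steps map almost one-to-one onto code; a sketch assuming a mean-centered data matrix X (function and variable names are illustrative):

```python
import numpy as np

def nipals(X, n_components, tol=1e-6, max_iter=500):
    """Extract principal components one at a time from a mean-centered matrix X,
    deflating X after each component (steps 1-9 above)."""
    X = X.copy()
    scores, loadings = [], []
    for _ in range(n_components):
        s = X[:, 0].copy()                       # 1) initial score: any column of X
        for _ in range(max_iter):
            l = X.T @ s / (s @ s)                # 2) project X onto s -> loading vector
            l /= np.linalg.norm(l)               # 3) normalize l to length 1
            s_old = s                            # 4) store current score
            s = X @ l / (l @ l)                  # 5) project X onto l -> new score
            if np.linalg.norm(s_old - s) < tol:  # 6/7) converged? else iterate again
                break
        X = X - np.outer(s, l)                   # 8/9) deflate X, continue with residual
        scores.append(s)
        loadings.append(l)
    return np.column_stack(scores), np.column_stack(loadings)
```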

PCA – Variance Explained / Eigenvalues

        Eigenvalue (EV)   % variance explained   cumulative EV   cumulative % variance
  PC1   1.9               63                     1.9             63
  PC2   0.6               20                     2.5             83
  PC3   0.5               17                     3.0             100

Selection criteria: eigenvalue-one criterion, scree plot

PCA – Scree Plot
• Selection of the relevant PC variables

[Figure: scree plot of eigenvalue vs. eigenvector rank; eigenvectors (PCs) with eigenvalues above 1 are considered]

A Practical Example
ACE inhibitors vs. MMP inhibitors
Descriptors: MW, clogP, SAS, ovality, boiling point

[Figure: scree plot (number of eigenvalues 1–5) and PC1 vs. PC2 scores plot]

A Practical Example
ACE inhibitors vs. cytotoxic agents
Descriptors: MW, clogP, SAS, ovality

[Figure: scree plot (number of eigenvalues 1–4) and PC1 vs. PC2 scores plot]

Homework! PCA Tutorial at Manchester Metropolitan University http://149.170.199.144/multivar/pca.htm

Hypothesis testing: Student's t-Test (1)
• Tests whether the means of two groups are statistically different from each other.
• The t-test judges the difference between two means relative to the spread (variability) of the two data groups.
• Assumes a Gaussian data distribution.

[Figure: groups with the same mean difference but lower variability appear intuitively "most different"]

http://www.socialresearchmethods.net/kb/stat_t.htm
http://www.physics.csbsju.edu/stats/t-test.html

Hypothesis testing: Student's t-Test (2)

t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between group means}}{\text{variability of groups}}

t = \frac{\bar{x}_{test} - \bar{x}_{control}}{\sqrt{\dfrac{\sigma^2_{test}}{N_{test}} + \dfrac{\sigma^2_{control}}{N_{control}}}}

[Figure: probability distributions P(x) of the test group and the control group]

Is t large enough to be significant?
• α-level: risk of being wrong (typically set to 0.05 = 5%)
• degrees of freedom = N_test + N_control − 2
• → lookup table
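A sketch of the same computation with SciPy; the explicit formula matches the unpooled expression above, and `ttest_ind` additionally returns the p-value to compare against the α-level (the data here are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
test = rng.normal(loc=5.0, scale=1.0, size=30)      # test group
control = rng.normal(loc=4.5, scale=1.0, size=30)   # control group

# Unpooled form, as in the formula above (per-group variances and sizes).
t = (test.mean() - control.mean()) / np.sqrt(test.var(ddof=1) / len(test)
                                             + control.var(ddof=1) / len(control))

# SciPy returns t and the p-value; compare p with the alpha level (e.g. 0.05).
t_scipy, p = stats.ttest_ind(test, control, equal_var=False)
print(t, t_scipy, p)
```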

Hypothesis testing: Kolmogorov-Smirnov Test (1)
• No assumption about the data distribution ("non-parametric") – in contrast to Student's t-test!
• KS testing for: discriminating features, normal distribution, similarity of two distributions
• The KS statistic is applicable to small sample sizes.
• Based on cumulative distribution analysis

http://www.physics.csbsju.edu/stats/KS-test.html

Hypothesis testing: Kolmogorov-Smirnov Test (2)
"Dissimilarity" between two property distributions:

K = \max_{-\infty < x < \infty} \left| P(x) - P^*(x) \right|

P(x): actual distribution; P^*(x): target (reference, control) distribution
K: maximum value of the absolute difference between the two distribution functions

Example:
http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Kolmogorov-Smi/basis-statistik-kolmogorov-smi.html
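A sketch of both KS use cases with SciPy (two-sample comparison and a test against a normal distribution); the data are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.lognormal(size=200)     # actual property distribution P(x)
reference = rng.normal(size=200)     # target / control distribution P*(x)

# Two-sample KS test: statistic K = max |P(x) - P*(x)| over the empirical CDFs.
K, p_value = stats.ks_2samp(sample, reference)

# One-sample variant: test the sample against a fitted normal distribution.
K_norm, p_norm = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std(ddof=1)))
print(K, p_value, K_norm, p_norm)
```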

Pause!

Classification Scenarios
The data representation must reflect a solution of the problem.

[Figure: three scenarios: a linear separator, a nonlinear separator, and a case with no obvious separator (?)]

Classification Methods
• Hierarchical clustering (e.g., Ward's method)
• Non-hierarchical methods (e.g., k-means clustering)

Tutorial at Manchester Metropolitan University: http://149.170.199.144/multivar/ca.htm

Data Preparation

[Figure: the available data as a subset of the data universe]

• Are the available data representative of the problem?
• Is the number of data points sufficient for meaningful statistics?

Split into training and test set
• supervised methods
• unsupervised methods

Experimental design: selection of "representative" data sets
• "D/S-optimal" design
• similarity/dissimilarity-based methods
• partition-based methods
• approximation by, e.g., the Kennard-Stone (MaxMin) algorithm

Ref.: Gillet et al. (1998) Similarity and dissimilarity methods for processing chemical structure databases. www3.oup.co.uk/computer_journal/subs/Volume_41/Issue_08/Gillet.pdf
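As one illustration of the similarity/dissimilarity-based selection methods listed above, here is a simplified greedy MaxMin sketch in the spirit of the Kennard-Stone algorithm (it seeds with the point farthest from the data mean, which is a simplification; the function name is illustrative):

```python
import numpy as np

def maxmin_select(X, n_select):
    """Greedy MaxMin selection: start from the point farthest from the data mean,
    then repeatedly add the candidate whose minimum distance to the already
    selected points is largest."""
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    selected = [int(np.argmax(dist_to_mean))]
    while len(selected) < n_select:
        d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2)
        min_d = d.min(axis=1)
        min_d[selected] = -np.inf        # never re-select a chosen point
        selected.append(int(np.argmax(min_d)))
    return selected

# Example: pick 5 "representative" training compounds from 100 random points.
rng = np.random.default_rng(2)
print(maxmin_select(rng.normal(size=(100, 3)), 5))
```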

Outlier Detection
• by visual inspection of the data distribution
• by a formal criterion, e.g. a 3σ threshold
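A minimal sketch of the 3σ criterion applied column-wise to autoscaled data (the function name is illustrative):

```python
import numpy as np

def sigma_outliers(x, k=3.0):
    """Flag values more than k standard deviations away from their column mean."""
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    return np.abs(z) > k

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 4))
data[0, 2] = 12.0                            # an obvious outlier
print(np.argwhere(sigma_outliers(data)))     # row/column indices of flagged values
```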



Software Tools
• SIMCA-P
• Statistica
• R
• …

Measuring Binary Classification Accuracy
• Training data (re-classification)
• Test data (classification)

Q_1 = \frac{P + N}{\#\text{predicted}} \qquad Q_2 = \frac{P}{P + O} \qquad Q_3 = \frac{P}{P + U}

Matthews correlation coefficient:

cc = \frac{P \cdot N - O \cdot U}{\sqrt{(N+U)(N+O)(P+U)(P+O)}} \in [-1, 1]

P: # positive correct
N: # negative correct
O: # false positives ("overprediction")
U: # false negatives ("underprediction")

Example: Matthews cc

Class 1 vs. Class 0, total = 11 predictions: P = 4, N = 4, O = 1, U = 2

cc = \frac{P \cdot N - O \cdot U}{\sqrt{(N+U)(N+O)(P+U)(P+O)}} = \frac{4 \cdot 4 - 1 \cdot 2}{\sqrt{(4+2)(4+1)(4+2)(4+1)}} = \frac{14}{\sqrt{900}} \approx 0.47
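The same calculation as a small helper function (the name is illustrative):

```python
def matthews_cc(P, N, O, U):
    """Matthews correlation coefficient from counts:
    P true positives, N true negatives, O false positives, U false negatives."""
    denom = ((N + U) * (N + O) * (P + U) * (P + O)) ** 0.5
    return (P * N - O * U) / denom if denom else 0.0

# Values from the example above: P = 4, N = 4, O = 1, U = 2.
print(round(matthews_cc(4, 4, 1, 2), 2))   # ~0.47
```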

Hierarchical Cluster Analysis
Nested classes – "phylogenetic" classification (dendrogram)

1. Data preprocessing (which variables, scaling, etc.)
2. Calculate the distance matrix, D
3. Find the two most similar objects in D
4. Fuse these two objects
5. Update D: calculate the distances between the new cluster and all other objects
6. Repeat from step 3 until all cases are in one cluster

Clustering Algorithms – Examples
Aggregation of clusters can be based on the minimal dissimilarity between clusters.

• Average linkage (lowest average distance) – recommended (UPGMA: Unweighted Pair-Group Method with Arithmetic Mean)
• Single linkage (nearest neighbor) – loose clusters, long "chains"
• Complete linkage (furthest neighbor) – compact, well-separated clusters

Example: Hierarchical Clustering Dendrogram

Data (Euclidean distance, average linkage):

  ID    x1   x2
  A1     1    1
  A2     2    1
  B1     4    5
  B2     5    7
  B3     7    7

[Figure: scatter plot of the five points and the resulting dendrogram (linkage distance 0–7); A1/A2 and B2/B3 fuse first, then B1 joins the B cluster, and finally the A and B clusters merge]
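The example can be reproduced with SciPy; a sketch (drawing the dendrogram additionally needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# The five points from the example: A1, A2, B1, B2, B3.
labels = ['A1', 'A2', 'B1', 'B2', 'B3']
points = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

# Euclidean distance matrix + average linkage (UPGMA), as in the example above.
Z = linkage(pdist(points, metric='euclidean'), method='average')
print(Z)   # each row: the two fused clusters, the linkage distance, the cluster size

# dendrogram(Z, labels=labels) draws the tree when a matplotlib figure is available.
```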

Clustering Algorithms – Ward's Method
• Variance approach! Cluster membership is assessed by calculating the total sum of squared deviations from the mean (centroid, marked "+") of a cluster.
• Produces "well-structured" dendrograms
• Tends to create small clusters

Example: "Phylogenetic" Tree
4 species with homologous sequences: ATCC, ATGC, TTCG, TCGG

Assumption: number of pair-wise mutations → dissimilarity

Hamming distance (sequences must have the same length)
≡ number of positions containing different symbols

  agtc
  cgta        Hamming distance = 2

Levenshtein distance (1965) ("edit distance")
≡ number of edit operations (deletion, insertion, change of a single symbol) converting one sequence into the other

  ag-tcc
  cgctca      Levenshtein distance = 3
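Both distances are easy to compute directly; a sketch (function names are illustrative; the Levenshtein version uses the standard dynamic-programming recursion):

```python
def hamming(a, b):
    """Number of positions with different symbols (sequences of equal length)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of edits (deletion, insertion, substitution) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (free if symbols match)
        prev = curr
    return prev[-1]

print(hamming('agtc', 'cgta'))         # 2, as in the example above
print(levenshtein('agtcc', 'cgctca'))  # 3, as in the example above
```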

Step 1) Distance matrix

          ATCC  ATGC  TTCG  TCGG
  ATCC      0     1     2     4
  ATGC            0     3     3
  TTCG                  0     2
  TCGG                        0

Step 2) Reduced distance matrix (UPGMA)

                {ATCC, ATGC}   TTCG   TCGG
  {ATCC, ATGC}        0
  TTCG          ½(2+3) = 2.5     0
  TCGG          ½(4+3) = 3.5     2      0

[Figure: UPGMA tree – ATCC and ATGC join with branch lengths 0.5, TTCG and TCGG join with branch lengths 1, and the two clusters merge at 1.5; branch length = distance between nodes]

Issues
• Branch lengths need not be proportional to the time since the separation of taxa
• Different evolution rates of branches
• Unweighted mutation / editing events

Cladistic Methods ("lines of descent")
1. Maximum parsimony (W. Fitch, 1971): number of evolutionary changes → min.
2. Maximum likelihood: log L for the tree topology → max.

Example: 4 homologous sequences: ATCG, ATGG, TCCA, TTCA

[Figure: two alternative tree topologies for the four sequences with inferred ancestral sequences (e.g. ATCA, ATCG, TTCG); the left tree requires 4 mutations, the right tree 7 mutations – maximum parsimony favors the tree with 4 mutations]
