Module 6: Data Preprocessing, PCA, Hypothesis Testing, Clustering
[Figure: two groups of objects (Class A, Class B) and an unknown sample "?" to be assigned]

Classification – "Work Flow": Raw Data → Preprocessing → Feature Selection → Classifier → Quality Assessment
Raw Data
• Typically vectors: each object is described by a feature vector x = (x1, x2, …, xN) and a target value y

ID     Features (x1, x2, …, xN)       Target value y
Mol1   12, 2, 1, 3, 4, 36, 7.5        1
Mol2   4, 5.6, 3, 8, 99, 1, 5.2       0
Mol3   2.1, 2.2, 4, 6, 3, 1, 11       1
…                                     …
MolM                                  yM

M data points in an N-dimensional space.
Preprocessing: Mean-Centering
• Eliminates a constant offset of the data

$x^*_{ik} = x_{ik} - \bar{x}_k$

($i$: row index, $k$: column index)
Preprocessing: Range Scaling
• Eliminates different column metrics

$x^*_{ik} = \dfrac{x_{ik} - x_k(\min)}{x_k(\max) - x_k(\min)}, \qquad 0 \le x^*_{ik} \le 1$

($i$: row index, $k$: column index)
Preprocessing: Autoscaling
• Eliminates different column metrics
• Yields data with zero mean and unit variance ("sphericization" of groups)

$x^*_{ik} = \dfrac{x_{ik} - \bar{x}_k}{s_k}, \qquad \text{where } s_k = \sqrt{\dfrac{\sum_{i=1}^{M} (x_{ik} - \bar{x}_k)^2}{M-1}}$
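A minimal numpy sketch of the three column-wise scaling variants (the small toy matrix is made up for illustration):

```python
import numpy as np

X = np.array([[12, 2, 1], [4, 5.6, 3], [2.1, 2.2, 4]], dtype=float)  # toy data, M x N

x_mean = X.mean(axis=0)                      # column means
x_min, x_max = X.min(axis=0), X.max(axis=0)  # column ranges
s = X.std(axis=0, ddof=1)                    # column standard deviations (M-1 divisor)

X_centered = X - x_mean                      # mean-centering: removes a constant offset
X_ranged = (X - x_min) / (x_max - x_min)     # range scaling: maps each column to [0, 1]
X_auto = (X - x_mean) / s                    # autoscaling: zero mean, unit variance
```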
Preprocessing / Feature Extraction: PCA
• Reduces the dimension of the data vectors
• Generates a new orthogonal co-ordinate system

[Figure: raw data in 3D (x1, x2, x3) projected by PCA onto 2D scores (PC1, PC2)]
PCA example: raw data and scores for Mol1–Mol10

        Raw data            Scores
ID      x1    x2    x3      PC1    PC2    PC3
Mol1    34    12     3      0.7   -1.4    1.6
Mol2     6     2     3      0.4    1.0   -1.5
Mol3     2    12    28     -1.1    0.4   -0.2
Mol4     9    17     4     -0.6   -1.7   -0.4
Mol5    20     5     9      0.5    0.4    0.2
Mol6     4    14    56     -1.9    1.1    …
…       (Mol7–Mol10)

Correlation of the original variables with the scores (loadings)

        PC1     PC2     PC3
x1      0.81   -0.16    0.57
x2     -0.76   -0.63    0.16
x3     -0.79    0.44    0.43
Excursus: PCA
Initial data matrix, X

      Mol1  Mol2  Mol3  Mol4
x1      2     1     6     5
x2      1     1     5     5
x3      1     2     6     6
x4      1     1     4     5
x5      3     6     6     4
x6      3     7     5     4

• Which underlying "basic properties" (factors) account for the six variables?
• Are all six variables needed to describe the molecules?

→ Correlation analysis to reveal the relationships among the variables → factor analysis
Excursus: PCA
Correlation matrix, R

Pearson correlation coefficient:

$r_{x_1,x_2} = \dfrac{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 \cdot \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2}} \;\in [-1, 1]$

N: number of objects (molecules); n: number of variables

        x1      x2      x3      x4      x5      x6
x1      1
x2      0.97    1
x3      0.93    0.99    1
x4      0.92    0.98    0.97    1
x5      0.14    0.19    0.32    0.08    1
x6     -0.29   -0.17   -0.02   -0.21    0.87    1
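A short numpy check of this correlation matrix for the data matrix X given above (rows = variables x1…x6, columns = Mol1…Mol4):

```python
import numpy as np

# initial data matrix X: rows = variables x1..x6, columns = Mol1..Mol4
X = np.array([
    [2, 1, 6, 5],   # x1
    [1, 1, 5, 5],   # x2
    [1, 2, 6, 6],   # x3
    [1, 1, 4, 5],   # x4
    [3, 6, 6, 4],   # x5
    [3, 7, 5, 4],   # x6
])

R = np.corrcoef(X)        # 6 x 6 Pearson correlation matrix
print(np.round(R, 2))     # compare with the table above (e.g. r(x1, x2) ≈ 0.97)
```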
Excursus: PCA
Standardized variable matrix, Z

$z_{ji} = \dfrac{x_{ji} - \bar{x}_j}{s_j}, \qquad s_j = \sqrt{\dfrac{1}{N-1}\sum_{i=1}^{N} (x_{ji} - \bar{x}_j)^2}$

$s_j$: standard deviation of variable $j$
→ mean equal to zero, standard deviation equal to 1

$R = \dfrac{1}{N-1}\, Z\, Z^T$

$Z^T$: transpose of the standardized initial matrix X
→ simpler computation of R; for standardized data the correlation matrix and the variance-covariance matrix are identical:

$s^2_{jk} = \mathrm{cov}(j,k) = \dfrac{1}{N-1}\sum_{i=1}^{N} z_{ji}\, z_{ki}$

$j$ and $k$: two of the $n$ variables
Excursus: PCA
Principal Component Analysis (PCA), model:

$z_j = a_{j1} F_1 + a_{j2} F_2 + \dots + a_{jn} F_n, \qquad Z = A \cdot F$

$n$: number of variables in X; $F$: uncorrelated components ("factors")

Each of the n variables in Z is described by n new, uncorrelated variables F. A linear relationship between the factors and the variables is assumed!
Excursus: PCA
Principal Component Analysis (PCA)

$Z = F \cdot A, \qquad X = S \cdot L^T$

S: score vectors (→ new variable values)
L: loadings (→ correlation with the original variables)

Principal components of a matrix
• are orthogonal ("perpendicular") to each other
• do not change their direction under a linear transformation of the matrix

Computation – heuristic: NIPALS algorithm; exact: eigenvectors of the variance-covariance matrix
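A minimal sketch of the exact route (eigendecomposition of the variance-covariance matrix of autoscaled data); the function name and the random example data are illustrative:

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the covariance matrix of autoscaled data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale the columns
    C = np.cov(Z, rowvar=False)                        # variance-covariance (= correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(C)               # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]                  # sort by decreasing eigenvalue
    eigvals, L = eigvals[order], eigvecs[:, order]     # L: loadings (one column per PC)
    S = Z @ L                                          # S: scores of the objects
    return S, L, eigvals

# example: random data, 10 objects x 3 variables
X = np.random.default_rng(0).normal(size=(10, 3))
scores, loadings, eigenvalues = pca_eig(X)
print(eigenvalues / eigenvalues.sum())                 # fraction of variance explained per PC
```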
Excursus: Covariance

Covariance matrix of the random variables $x_1, x_2, x_3$:

$\Sigma = \begin{pmatrix} \mathrm{Cov}(x_1,x_1) & \mathrm{Cov}(x_1,x_2) & \mathrm{Cov}(x_1,x_3) \\ \mathrm{Cov}(x_2,x_1) & \mathrm{Cov}(x_2,x_2) & \mathrm{Cov}(x_2,x_3) \\ \mathrm{Cov}(x_3,x_1) & \mathrm{Cov}(x_3,x_2) & \mathrm{Cov}(x_3,x_3) \end{pmatrix}$

with

$\mathrm{Cov}(x,y) = \dfrac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$

Determinant $|\Sigma|$ of a matrix A: the determinant function det(A) assigns a unique value (a number) to a matrix. This value is called the "determinant" ("scaling factor of the volume") and equals the product of the eigenvalues.

"Rule of Sarrus" (valid only for 3×3 matrices!):

$|\Sigma| = \det(A) = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}$
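A quick numpy illustration that the determinant of a covariance matrix equals the product of its eigenvalues (the data here are random and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # 50 samples of three random variables

Sigma = np.cov(X, rowvar=False)       # 3 x 3 covariance matrix
eigvals = np.linalg.eigvalsh(Sigma)   # eigenvalues of the symmetric matrix

print(np.linalg.det(Sigma))           # determinant ...
print(np.prod(eigvals))               # ... equals the product of the eigenvalues
```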
Excursus: PCA
NIPALS algorithm (S. Wold): "nonlinear iterative partial least squares"

Reproduces the data matrix X by a matrix product of S (score matrix) and $L^T$ (transposed loading matrix); the score vectors $s_1, s_2, \dots$ are estimated iteratively.

1) $s := x_i$ — choose an arbitrary column of the matrix X as the initial score vector s.
2) $l = \dfrac{X^T s}{s^T s}$ — project X onto s to obtain the corresponding loading vector l.
3) $l = \dfrac{l}{\lVert l \rVert}$ — normalize l to length 1.
4) $s_{old} := s$ — store the current score vector.
5) $s = \dfrac{X l}{l^T l}$ — project X onto l to obtain the corresponding new score vector.
6) $d := \lVert s_{old} - s \rVert$ — compute the difference d between the new and the old scores.
7) $d < t$? — If d is larger than a defined threshold t (e.g. $10^{-6}$), go back to step 2; otherwise continue with step 8.
8) $E := X - s\, l^T$ — remove the contribution of the found principal component from the data matrix X.
9) $X := E$ — to find further principal components, repeat the algorithm from step 1 with the new data matrix.
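A straightforward Python transcription of these nine steps (a sketch; the function name, iteration cap and convergence threshold are illustrative):

```python
import numpy as np

def nipals(X, n_components=2, t=1e-6, max_iter=500):
    """Extract principal components one at a time with the NIPALS iteration."""
    X = X.copy().astype(float)
    scores, loadings = [], []
    for _ in range(n_components):
        s = X[:, 0].copy()                    # 1) initial score: a column of X
        for _ in range(max_iter):
            l = X.T @ s / (s @ s)             # 2) loading vector
            l /= np.linalg.norm(l)            # 3) normalize loading to length 1
            s_old = s                         # 4) remember the current score
            s = X @ l / (l @ l)               # 5) new score vector
            d = np.linalg.norm(s_old - s)     # 6) change between iterations
            if d < t:                         # 7) converged?
                break
        E = X - np.outer(s, l)                # 8) deflate: remove this component
        X = E                                 # 9) continue with the residual matrix
        scores.append(s)
        loadings.append(l)
    return np.array(scores).T, np.array(loadings).T   # columns = components
```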
PCA – Variance Explained / Eigenvalues

                            PC1    PC2    PC3
Eigenvalue (EV)             1.9    0.6    0.5
% variance explained         63     20     17
cumulative EV               1.9    2.5    3.0
cumulative variance (%)      63     83    100

Selection criteria: eigenvalue-one criterion, scree plot
PCA – Scree Plot
• Selection of relevant PC-variables

[Scree plot: eigenvalue vs. eigenvector rank, indicating which eigenvectors (PCs) are to be considered]
A Practical Example: ACE inhibitors vs. MMP inhibitors
Descriptors: MW, clogP, SAS, ovality, boiling point

[Figures: scree plot (eigenvalue vs. number of eigenvalues, 1–5) and PC1/PC2 score plot of the two compound classes]
A Practical Example: ACE inhibitors vs. Cytotoxic agents
Descriptors: MW, clogP, SAS, ovality

[Figures: scree plot (eigenvalue vs. number of eigenvalues, 1–4) and PC1/PC2 score plot of the two compound classes]
Homework! PCA Tutorial at Manchester Metropolitan University http://149.170.199.144/multivar/pca.htm
Hypothesis testing: Student's t-Test (1)
Tests whether the means of two groups are statistically different from each other. The t-Test judges the difference between two means relative to the spread (variability) of the two data groups.
• assumes a Gaussian data distribution

[Figure: pairs of group distributions with the same mean difference but different overlap; the pair with the least overlap appears intuitively "most different"]

http://www.socialresearchmethods.net/kb/stat_t.htm
http://www.physics.csbsju.edu/stats/t-test.html
Hypothesis testing: Student's t-Test (2)

$t = \dfrac{\text{signal}}{\text{noise}} = \dfrac{\text{difference between group means}}{\text{variability of groups}} = \dfrac{\bar{x}_{test} - \bar{x}_{control}}{\sqrt{\dfrac{\sigma^2_{test}}{N_{test}} + \dfrac{\sigma^2_{control}}{N_{control}}}}$

[Figure: probability densities P(x) of the test group and the control group]

Is t large enough to be significant?
• α-level: risk of being wrong (typically set to 0.05 = 5%)
• degrees of freedom = $N_{test} + N_{control} - 2$
• → lookup table
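A small scipy sketch of this comparison (random example data; `equal_var=False` gives the Welch variant with separate group variances, as in the formula above, while `equal_var=True` corresponds to the classic pooled test with $N_{test}+N_{control}-2$ degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
test = rng.normal(loc=5.5, scale=1.0, size=20)      # test group
control = rng.normal(loc=5.0, scale=1.0, size=25)   # control group

# Welch's t-test: separate group variances, as in the formula above
t, p = stats.ttest_ind(test, control, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")                  # significant if p < alpha (e.g. 0.05)
```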
Hypothesis testing: Kolmogorov-Smirnov Test (1)
• makes no assumption about the data distribution ("non-parametric") – in contrast to Student's t-Test!
KS testing is used for:
• finding discriminating features
• testing for normal distribution
• testing the similarity of two distributions
The KS statistic is applicable to small sample sizes and is based on cumulative distribution analysis.

http://www.physics.csbsju.edu/stats/KS-test.html
Hypothesis testing: Kolmogorov-Smirnov Test (2)
"Dissimilarity" between two property distributions:

$K = \max_{-\infty < x < \infty} \left| P(x) - P^*(x) \right|$

P(x): actual distribution; P*(x): target (reference, control) distribution
K is the maximum value of the absolute difference between the two distribution functions.

Example:
http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Kolmogorov-Smi/basis-statistik-kolmogorov-smi.html
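A two-sample KS test sketch with scipy (random illustrative data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
actual = rng.normal(loc=0.0, scale=1.0, size=100)     # actual distribution P(x)
reference = rng.normal(loc=0.5, scale=1.2, size=100)  # target/reference distribution P*(x)

K, p = stats.ks_2samp(actual, reference)  # K = max |P(x) - P*(x)| over all x
print(f"K = {K:.3f}, p = {p:.3f}")
```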
Break!
Classification Scenarios
The data representation must allow a solution of the problem.

[Figure: class distributions requiring a linear separator, a nonlinear separator, or with no obvious separator (?)]
Classification Methods
• Hierarchical clustering (e.g., Ward's method)
• Non-hierarchical methods (e.g., k-means clustering)
Tutorial at Manchester Metropolitan University http://149.170.199.144/multivar/ca.htm
Data Preparation

[Figure: the available data as a subset of the data universe]

• Are the available data representative of the problem?
• Is the number of data points sufficient for meaningful statistics?

Split into training and test set (needed for supervised methods; unsupervised methods work without target values).

Experimental design: selection of "representative" data sets, e.g. by
• "D/S-optimal" design
• similarity/dissimilarity-based methods
• partition-based methods
• approximation, e.g. by the Kennard-Stone (MaxMin) algorithm – see the sketch below

Ref.: Gillet et al. (1998) Similarity and dissimilarity methods for processing chemical structure databases. www3.oup.co.uk/computer_journal/subs/Volume_41/Issue_08/Gillet.pdf
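A minimal sketch of a MaxMin-style selection of a representative subset; this is a generic, simplified illustration (function name and starting rule chosen here), not the exact algorithm of the reference:

```python
import numpy as np

def maxmin_select(X, n_select):
    """Greedy MaxMin selection: repeatedly pick the point farthest from the chosen set."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    chosen = [int(np.argmax(dist.sum(axis=1)))]                    # start with the most remote point
    while len(chosen) < n_select:
        d_to_chosen = dist[:, chosen].min(axis=1)                  # distance to the nearest chosen point
        d_to_chosen[chosen] = -1.0                                 # never re-pick a chosen point
        chosen.append(int(np.argmax(d_to_chosen)))                 # farthest remaining point
    return chosen

X = np.random.default_rng(0).normal(size=(100, 5))   # 100 compounds, 5 descriptors
training_ids = maxmin_select(X, 10)                  # 10 "representative" training compounds
```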
Outlier Detection
• by visual inspection of the data distribution
• by a formal criterion, e.g. a 3σ threshold (values more than three standard deviations from the mean) – see the sketch below
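A one-line numpy sketch of the 3σ criterion for a single variable (example data are random):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=200)  # example measurements
mean, sd = x.mean(), x.std(ddof=1)

outliers = x[np.abs(x - mean) > 3 * sd]   # points farther than 3 sigma from the mean
print(outliers)                           # may be empty for well-behaved data
```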
Software tools
• SIMCA-P
• Statistica
• R
• …
Measuring Binary Classification Accuracy
• Training data (re-classification)
• Test data (classification)

$Q_1 = \dfrac{P + N}{\#\,\text{predicted}} \qquad Q_2 = \dfrac{P}{P + O} \qquad Q_3 = \dfrac{P}{P + U}$

Matthews correlation coefficient:

$cc = \dfrac{P N - O U}{\sqrt{(N+U)(N+O)(P+U)(P+O)}} \;\in [-1, 1]$

P: # positive correct; N: # negative correct; O: # false positives ("overprediction"); U: # false negatives ("underprediction")
Example: Matthews cc (Class 1 vs. Class 0)

Total = 11, P = 4, N = 4, O = 1, U = 2

$cc = \dfrac{4 \cdot 4 - 1 \cdot 2}{\sqrt{(4+2)(4+1)(4+2)(4+1)}} = \dfrac{14}{\sqrt{900}} \approx 0.47$
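A small helper reproducing these quantities in Python (function and variable names are illustrative):

```python
from math import sqrt

def classification_quality(P, N, O, U):
    """Q1-Q3 and the Matthews correlation coefficient from the four count categories."""
    q1 = (P + N) / (P + N + O + U)      # overall accuracy
    q2 = P / (P + O)                    # penalizes overprediction (false positives)
    q3 = P / (P + U)                    # penalizes underprediction (false negatives)
    cc = (P * N - O * U) / sqrt((N + U) * (N + O) * (P + U) * (P + O))
    return q1, q2, q3, cc

print(classification_quality(P=4, N=4, O=1, U=2))   # cc ≈ 0.47 for the example above
```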
Hierarchical Cluster Analysis
Nested classes → "phylogenetic" classification (dendrogram)
1. Data preprocessing (which variables, scaling, etc.)
2. Calculate the distance matrix, D
3. Find the two most similar objects in D
4. Fuse these two objects
5. Update D: calculate the distances between the new cluster and all other objects
6. Repeat from step 3 until all cases are in one cluster
Clustering Algorithms – Examples
Aggregation of clusters can be based on the minimal dissimilarity between clusters.

• Single Linkage (nearest neighbor): loose clusters, long "chains"
• Complete Linkage (furthest neighbor): compact, well-separated clusters
• Average Linkage (lowest average distance): recommended (UPGMA: Unweighted Pair-Group Method with Arithmetic mean)
Example: Hierarchical Clustering Dendrogram

Data
ID    x1   x2
A1     1    1
A2     2    1
B1     4    5
B2     5    7
B3     7    7

Euclidean distances, average linkage.

[Figures: scatter plot of the five points in the (x1, x2) plane and the resulting dendrogram; linkage distance on the vertical axis, leaves ordered B3, B2, B1, A2, A1]
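This toy example can be reproduced with scipy's hierarchical clustering routines (plotting assumes matplotlib is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# the five points from the example above
labels = ["A1", "A2", "B1", "B2", "B3"]
X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

Z = linkage(X, method="average", metric="euclidean")  # UPGMA on Euclidean distances
dendrogram(Z, labels=labels)                          # dendrogram as in the figure
plt.ylabel("Linkage Distance")
plt.show()
```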
Clustering Algorithms – Ward's Method
• Variance approach: cluster membership is assessed by calculating the total sum of squared deviations from the mean (centroid, "+") of a cluster.
• Produces "well-structured" dendrograms
• Tends to create small clusters
Example: "Phylogenetic" Tree
4 species with homologous sequences: ATCC, ATGC, TTCG, TCGG

Assumption: number of pairwise mutations ≈ dissimilarity

Hamming distance (sequences must have the same length) ≡ number of positions containing different symbols
  agtc
  cgta      → Hamming distance = 2

Levenshtein distance (1965) ("edit distance") ≡ number of edit operations (deletion, insertion, change of a single symbol) converting one sequence into the other
  ag-tcc
  cgctca    → Levenshtein distance = 3
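Both distances in a few lines of Python (a sketch; the Levenshtein function uses the standard dynamic-programming recursion):

```python
def hamming(a, b):
    """Number of positions with different symbols (sequences of equal length)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimal number of deletions, insertions and substitutions converting a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (or match)
        prev = curr
    return prev[-1]

print(hamming("agtc", "cgta"))         # 2, as in the example above
print(levenshtein("agtcc", "cgctca"))  # 3, as in the example above
```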
Step 1) Distance matrix (Hamming distances)

        ATCC  ATGC  TTCG  TCGG
ATCC     0     1     2     4
ATGC           0     3     3
TTCG                 0     2
TCGG                       0

Step 2) Reduced distance matrix (UPGMA: average distances to the fused cluster)

               {ATCC, ATGC}   TTCG   TCGG
{ATCC, ATGC}        0
TTCG           ½(2+3) = 2.5    0
TCGG           ½(4+3) = 3.5    2      0

[Figure: resulting UPGMA tree; ATCC and ATGC join first (branch length 0.5 each), then TTCG and TCGG (branch length 1 each), and the two clusters are joined at 1.5; branch length = distance between nodes]
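The same tree can be obtained from the Hamming distance matrix with scipy's average-linkage (UPGMA) clustering; note that scipy reports full merge heights, i.e. twice the branch lengths drawn above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["ATCC", "ATGC", "TTCG", "TCGG"]
D = np.array([[0, 1, 2, 4],
              [1, 0, 3, 3],
              [2, 3, 0, 2],
              [4, 3, 2, 0]], dtype=float)     # pairwise Hamming distances

Z = linkage(squareform(D), method="average")  # UPGMA on the condensed distance matrix
print(Z)  # merge heights 1.0 (ATCC+ATGC), 2.0 (TTCG+TCGG), 3.0 (= twice the branch lengths 0.5, 1, 1.5)
```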
Issues
• branch lengths need not be proportional to the time since the separation of taxa
• different evolution rates of the branches
• unweighted mutation / editing events

Cladistic methods (reconstruct ancestry relationships):
1. Maximum parsimony (W. Fitch, 1971): number of evolutionary changes → min.
2. Maximum likelihood: log L for the tree topology → max.
Example: 4 homologous sequences: ATCG, ATGG, TCCA, TTCA

[Figure: two candidate tree topologies with reconstructed ancestral sequences at the internal nodes; the first topology requires 4 mutations, the alternative topology 7 mutations – maximum parsimony favours the tree with 4 mutations]