TIN HỌC TRONG CNTP (Informatics in Food Technology)
Nguyễn Hoàng Dũng, PhD. Trường Đại học Bách khoa Tp. HCM
Program & Bibliography
- 3(1,2): ~5 theory (301, B2) + 10 practice (Comp. Chem. Lab, by group)
- Website: www2.hcmut.edu.vn/~dzung/ (available from Sep 15)
- R: www.r-project.org
NHDzung – Lesson 1, slide 2
Problem
Foreign and Vietnamese cheeses: quality and preference?
How to conduct research
1. Sampling
2. Measurement
3. Collect data *
4. Analyze and present your results *
* Sensory practices

1-1. Samples and Populations
A population consists of the set of all measurements in which the investigator is interested.
A sample is a subset of the measurements selected from the population.
A census is a complete enumeration of every item in a population.
Simple Random Sample
Sampling from the population is often done randomly, such that every possible sample of equal size (n) has an equal chance of being selected. A sample selected in this way is called a simple random sample, or just a random sample.
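The definition above can be sketched in a few lines of Python. This is only an illustration: the population of 50 labelled cheese packs, the sample size of 8, and the helper name `simple_random_sample` are all assumptions made for the example, not part of the course material.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement.

    random.sample gives every subset of size n the same chance of
    being selected, which is the defining property of a simple
    random sample.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)


# Hypothetical population: 50 cheese packs labelled 1..50 (N = 50)
population = list(range(1, 51))

# Sample of size n = 8
sample = simple_random_sample(population, n=8, seed=1)
print(sample)
```

Fixing the seed is useful in class so that everyone obtains the same sample; in a real study the seed would normally be left unset or recorded in the protocol.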
Samples and Populations
Population (N)
Sample (n)
Measurements
• The assigning of numbers to the values of a variable (SS Stevens, Science 1946;103:677-80)
• Rules specify procedures to assign numbers to values

The criteria of “science”

Science                              Pseudoscience
Logic, experimental evidence         Belief, loyalty
Results are repeatable               Results are not repeatable
Falsifiability*                      Not falsifiable
Peer-reviewed journals               Not in peer-reviewed journals
Evolution / learning from mistakes   Constant, unchanged belief

* Capable of being tested (verified or falsified) by experiment or observation
Criteria of measurements
• Validity: measures what it purports to measure
• Reliability (consistency and repeatability)
• Sensitivity to important variation

Accuracy vs reliability (precision)
Accuracy: the degree of “truthfulness” of an attribute that is being measured.
Measurement error decreases the accuracy of measurement.
[Figure: precision compared with accuracy]
Some important concepts: Data - Variables - Scales
Qualitative - Categorical Frequency or Nominal:
Examples are
• Color
• Gender
• Nationality
Quantitative - Measurable or Countable:
Examples are
• Temperatures
• Humidity
• Gross compounds
• Preference points scored on a 100-point scale
GENERAL INFORMATION
1.1 Description of the respondent
1.1.1 Gender of the respondent?
   1. Male   2. Female
   Marital status:
   1. Single   2. Married
1.1.2 Age of the respondent?
   Under 25 / 25 – 30 / 31 – 54 / >55
1.1.3 What is your current occupation?
   Pupil or student / Doctor or teacher / Worker, hired laborer, or salesperson / Retired
1.1.4 Which of the following levels describes your household income?
   1. Low (≥ 2 million VND and < 5 million)
   2. Medium (≥ 5 million and < 8 million)
   3. High (≥ 8 million)
Some important concepts: Data - Variables - Scales
Variable
• Discrete variables
• Continuous variables
• Independent variables
• Dependent variables
Measurement scales
• Nominal scales (Label)
• Ordinal scales (Ranks in Army)
• Interval scales (Celsius, Fahrenheit)
• Ratio scales (true zero point, ratio)

Sensory study design
• 8 cheeses (EdamF, EdamH, GoudaH, m1, m2, m3, m4, m5)
• 11 tasters (experts)
• 3 replicates
• 15 descriptive terms: sour, bitterness, umami, salty, greasiness, butter_odor, milk_odor, acrid, rancid, lactic, cheese_flavor, acetic, full flavor, yellow, hard
• Unstructured scale from 0 to 100 mm
Types of measurement
• Qualitative: Nominal, Ordinal
• Quantitative: Interval, Ratio

Qualitative measurements
Nominal level
• Classification
• A set of objects can be classified into exhaustive, mutually exclusive categories, each with a unique symbol
• Ex: religion, sex, location, etc.
Ordinal level
• Classification + Ordering
• A set of objects can be assigned rank values and nothing more
• Ex: socio-economic status, education, levels of satisfaction, etc.

Quantitative measurements
Interval level
• Classification + Ordering + Standard distance
• A set of objects can be described by units that indicate how far one case is from another case
• Ex: temperature
Ratio level
• Classification + Ordering + Standard distance + Natural zero
• Quantitative variable with natural zero
• Ex: income, age, weight, bone mineral density
1.2.2. Which type of hard cheese do you usually consume?
   Cheddar / Gouda / Edam / Emental / Other (please specify) ……………………..
1.2.4. Please indicate your overall liking of semi-hard cheese products.
   1 2 3 4 5 6 7 8 9
1.2.5. How often do you consume semi-hard cheese products?
   > 3 times/week / 1 – 2 times/week / 1-3 times/month
1.2.6. How much semi-hard cheese do you consume per week?
   < 100 g / 100 – 300 g / > 300 g
1.2.7. In your opinion, what is hard cheese eaten with?
   Bread / Sandwich / Salad / Biscuits / Wine / Other (please specify) ………………………………
1.2.8. When choosing a hard cheese product to buy, how much do you care about the following factors? (1 = not at all concerned, 2 = not concerned, 3 = no opinion, 4 = concerned, 5 = very concerned)
   Price                                1 2 3 4 5
   Sensory properties of the product    1 2 3 4 5
   Familiarity                          1 2 3 4 5
   Convenience of use                   1 2 3 4 5
   Health benefits                      1 2 3 4 5
   Product weight                       1 2 3 4 5
Summary Measures
Population Parameters and Sample Statistics
Measures of Central Tendency
• Median
• Mode
• Mean
Measures of Variability
• Range
• Variance
• Standard Deviation

judge  session  product  sour  bitterness  umami  salty
S1     1        m1       50    18          0      40
S2     1        m1       100   65          40     100
S3     1        m1       32    11          35     4
S4     1        m1       30    10          25     1
S5     1        m1       60    23          30     29
S6     1        m1       30    35          25     50
S7     1        m1       50    32          45     64
S8     1        m1       32    23          40     40
S9     1        m1       78    27          45     21
S10    1        m1       55    30          34     18
S11    1        m1       62    21          43     32
1-3. Measures of Central Tendency or Location
• Median
  – Middle value when sorted in order of magnitude
  – 50th percentile
• Mode
  – Most frequently occurring value
• Mean
  – Average
Other summary measures:
  – Skewness
  – Kurtosis

Arithmetic Mean or Average
The mean of a set of observations is their average: the sum of the observed values divided by the number of observations.
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
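As a quick check of the sample mean formula, the sketch below applies it to the 11 sour scores recorded for product m1 in session 1 (judges S1 to S11). The variable names are chosen for this example only.

```python
# Sour scores of judges S1..S11 for product m1, session 1
sour = [50, 100, 32, 30, 60, 30, 50, 32, 78, 55, 62]

n = len(sour)               # number of observations
mean_sour = sum(sour) / n   # sample mean: sum of x_i divided by n

print(n, round(mean_sour, 2))
```

The 11 judges are treated here as a sample of possible tasters, which is why the divisor is written as n rather than N.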
Arithmetic Mean or Average
Affected by outliers
[Figure: two dot plots, one on an axis from 0 to 10 and one on an axis from 0 to 14; Mean = 5 and Mean = 6]

Median
Robust parameter of central tendency
Not affected by outliers
[Figure: the same two dot plots; Median = 5 in both]
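The contrast between the two measures can be reproduced with a small sketch. The data below are hypothetical (the points of the original dot plots are not recoverable); they were chosen so that the mean moves from 5 to 6 while the median stays at 5.

```python
from statistics import mean, median

# Hypothetical scores 0, 1, ..., 10: symmetric, no outlier
scores = list(range(11))

# Same data, but the largest value (10) is replaced by an outlier (21)
with_outlier = scores[:-1] + [21]

print(mean(scores), median(scores))              # mean and median agree
print(mean(with_outlier), median(with_outlier))  # only the mean shifts
```

Because the median depends only on the middle position of the sorted data, changing an extreme value leaves it untouched, whereas the mean uses every value.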
Mode
[Figure: two dot plots; in the first, Mode = 9; the second has no mode]

Measures of Central Tendency or Location
Mean (n is the sample size):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$$
For data grouped into k distinct values x_i, each observed n_i times:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i x_i = \frac{n_1 x_1 + n_2 x_2 + \dots + n_k x_k}{n}$$
Median (x_{(j)} is the j-th value in increasing order):
$$\operatorname{med}(x) = \begin{cases} x_{(p+1)} & \text{if } n = 2p + 1 \\ \dfrac{x_{(p)} + x_{(p+1)}}{2} & \text{if } n = 2p \end{cases}$$

Mean or Median ?
• Outliers: median
• Many ties (discrete variable): mean

Quartiles
The value of the boundary at the 25th, 50th, or 75th percentiles of a frequency distribution divided into four parts, each containing a quarter of the population.
[Diagram: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
Position of the i-th quartile:
$$\text{Position}(Q_i) = \frac{i\,(n + 1)}{4}$$
Data classified in increasing order: 11 12 13 16 16 17 18 21 22
$$\text{Position of } Q_1 = \frac{1\,(9 + 1)}{4} = 2.5, \qquad Q_1 = \frac{12 + 13}{2} = 12.5$$
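The quartile position rule can be turned into a short function. This is a minimal sketch, not the course's own code: the function name `quartile` is made up for the example, and using linear interpolation for fractional positions is an assumption (the slide only shows the case where the position falls exactly halfway between two values).

```python
def quartile(sorted_values, i):
    """Return the i-th quartile (i = 1, 2, 3) using position i*(n+1)/4.

    Positions are counted from 1. A fractional position is resolved
    by linear interpolation between the two neighbouring values
    (an assumption; for position 2.5 this is their average).
    """
    n = len(sorted_values)
    position = i * (n + 1) / 4
    lower = int(position)          # 1-based index of the value just below
    fraction = position - lower
    if lower < 1:                  # position before the first value
        return sorted_values[0]
    if lower >= n:                 # position at or after the last value
        return sorted_values[-1]
    low_value = sorted_values[lower - 1]
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22]  # already in increasing order

q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
print(q1, q2, q3)
```

Statistical software offers several quartile conventions, so results from other tools may differ slightly from this rule on small samples.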
1-4. Measures of Variability or Dispersion
Range
• Difference between maximum and minimum values
Variance
• Mean* squared deviation from the mean
Standard Deviation
• Square root of the variance
* Definitions of population variance and sample variance differ slightly.

Dispersion
Range:
$$\operatorname{Range}(x) = x_{(n)} - x_{(1)}$$
[Figure: two dot plots on an axis from 7 to 12; Range = 12 - 7 = 5 in both]
Interquartile range:
$$q_{0.75} - q_{0.25}$$
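Both dispersion measures can be computed directly. The sketch below reuses the nine ordered values from the quartile slide as example data; that reuse is a choice made here, not something the slides do. It relies on `statistics.quantiles` with `method="exclusive"`, which places quartiles using n + 1 and so matches the position rule i(n + 1)/4.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Quartiles (cut points for n=4 groups), then the interquartile range
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(data_range, iqr)
```

The range uses only the two extreme values, so a single outlier changes it, while the interquartile range describes the spread of the central half of the data.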
Mean (average)
Given a series of values x_i (i = 1, …, n): x_1, x_2, …, x_n, the mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Study 1: the color scores of 6 consumers are: 6, 7, 8, 4, 5, and 6. The mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{6 + 7 + 8 + 4 + 5 + 6}{6} = \frac{36}{6} = 6$$
Study 2: the color scores of 4 consumers are: 10, 2, 3, and 9. The mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{10 + 2 + 3 + 9}{4} = \frac{24}{4} = 6$$

Variation
The mean does not adequately describe the data. We need to know the variation in the data. An obvious measure is the sum of differences from the mean.
For study 1, the scores 6, 7, 8, 4, 5, and 6, we have:
(6-6) + (7-6) + (8-6) + (4-6) + (5-6) + (6-6)
= 0 + 1 + 2 - 2 - 1 + 0
= 0
NOT SATISFACTORY!
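The arithmetic above can be checked in a few lines. This sketch uses the two sets of color scores from the slides; the helper names are introduced only for the example. It shows that both studies have the same mean and that the deviations from the mean sum to zero in both.

```python
study1 = [6, 7, 8, 4, 5, 6]   # color scores, study 1
study2 = [10, 2, 3, 9]        # color scores, study 2


def mean(values):
    """Sum of the observed values divided by their number."""
    return sum(values) / len(values)


def sum_of_deviations(values):
    """Sum of (x_i - mean); positive and negative terms cancel."""
    m = mean(values)
    return sum(x - m for x in values)


print(mean(study1), mean(study2))
print(sum_of_deviations(study1), sum_of_deviations(study2))
```

The result is zero for any data set, which is why the raw sum of deviations carries no information about spread.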
Sum of squares
We need to make the differences positive by squaring them. This is called the “sum of squares” (SS).
For study 1: 6, 7, 8, 4, 5, 6, we have:
$$SS = (6-6)^2 + (7-6)^2 + (8-6)^2 + (4-6)^2 + (5-6)^2 + (6-6)^2 = 10$$
For study 2: 10, 2, 3, 9, we have:
$$SS = (10-6)^2 + (2-6)^2 + (3-6)^2 + (9-6)^2 = 50$$
This is better! But it does not take into account sample size n.

Variance
To account for sample size, we divide the SS. Dividing by n seems natural, but each squared term uses the mean, which is itself computed from the data, so we lose 1 degree of freedom. Therefore the correct denominator is n-1. This is called the variance (denoted by s^2):
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$
Or, in sum notation:
$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

1-5. Variance and Standard Deviation
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}, \qquad \sigma = \sqrt{\sigma^2}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}, \qquad s = \sqrt{s^2}$$

Variance - example
For study 1: 6, 7, 8, 4, 5, and 6, the variance is:
$$s^2 = \frac{(6-6)^2 + (7-6)^2 + (8-6)^2 + (4-6)^2 + (5-6)^2 + (6-6)^2}{6 - 1} = \frac{10}{5} = 2$$
For study 2: 10, 2, 3, 9, the variance is:
$$s^2 = \frac{(10-6)^2 + (2-6)^2 + (3-6)^2 + (9-6)^2}{4 - 1} = \frac{50}{3} = 16.7$$
The scores in study 2 were much more variable than those in study 1.
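The sum of squares and the sample variance for both studies can be verified with the sketch below. The function names are illustrative; `statistics.variance` from the Python standard library also divides by n - 1 and is used only as a cross-check.

```python
from statistics import variance

study1 = [6, 7, 8, 4, 5, 6]
study2 = [10, 2, 3, 9]


def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    """s^2: SS divided by n - 1 (one degree of freedom is lost)."""
    return sum_of_squares(values) / (len(values) - 1)


for name, scores in (("study 1", study1), ("study 2", study2)):
    ss = sum_of_squares(scores)
    s2 = sample_variance(scores)
    # The library result should agree with the hand-written formula
    print(name, ss, round(s2, 1), round(variance(scores), 1))
```

For a population variance the divisor would be N instead of n - 1; the standard library exposes that version as `statistics.pvariance`.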
Standard deviation
The problem with variance is that it is expressed in squared units, whereas the mean is in the actual unit. We need a way to convert variance back to the actual unit of measurement.
We take the square root of the variance; this is called the “standard deviation” (denoted by s).
For study 1, s = sqrt(2) = 1.41
For study 2, s = sqrt(16.7) = 4.1

Standard Deviation
[Figure: three dot plots on an axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
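A short sketch of the last step, using the two studies from the slides. `statistics.stdev` returns the square root of the sample variance (n - 1 divisor); the comparison with `math.sqrt` is there only to make the relationship explicit.

```python
import math
from statistics import stdev, variance

study1 = [6, 7, 8, 4, 5, 6]
study2 = [10, 2, 3, 9]

# Standard deviation: square root of the sample variance,
# expressed in the same unit as the scores themselves
s1 = stdev(study1)
s2 = stdev(study2)

print(round(s1, 2), round(s2, 1))
print(math.isclose(s1, math.sqrt(variance(study1))))
```

Both studies have a mean of 6, so the standard deviation is what distinguishes them: the scores of study 2 lie roughly three times further from the mean than those of study 1.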
1-6 Form indicators: Skewness & Kurtosis
Skewness
• Measure of asymmetry of a frequency distribution
  – Skewed to left
  – Symmetric or unskewed
  – Skewed to right
Kurtosis
• Measure of flatness or peakedness of a frequency distribution
  – Platykurtic (relatively flat)
  – Mesokurtic (normal)
  – Leptokurtic (relatively peaked)

Skewness
Skewed to left: Mean < median < mode
[Figure: histogram of a left-skewed distribution; axes: x, Frequency]
Kurtosis
Platykurtic - flat distribution
[Figure: histogram; axes: X, Frequency]

Kurtosis
Mesokurtic - not too flat and not too peaked
[Figure: histogram; axes: X, Frequency]
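The slides describe skewness and kurtosis qualitatively. One common way to put numbers on them is through central moments; the sketch below assumes that moment-based definition (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3), which is not stated in the slides, and uses small hypothetical data sets.

```python
from statistics import mean, median


def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)


def skewness(values):
    """Negative: skewed to left; zero: symmetric; positive: skewed to right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Negative: platykurtic; near zero: mesokurtic; positive: leptokurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]                # hypothetical, evenly spread
left_skewed = [1, 6, 7, 8, 8, 9, 9, 9]     # hypothetical, long left tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(left_skewed), mean(left_skewed), median(left_skewed))
```

In the left-skewed example the mean falls below the median, which in turn falls below the mode (9), matching the ordering given for a distribution skewed to the left. Statistical packages often apply small-sample corrections, so their values can differ from these simple moment ratios.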
Quantitative variable
Diagram
[Figure]

Quantitative variable
If we want to see more detail: the 21 observations between 1.65 m and 1.70 m split into 8 in [1.65; 1.675] and 13 in [1.675; 1.70].

Quantitative variable: boxplot (box-and-whisker plot)
From top to bottom, the plot marks:
• Largest value below q0.75 + 1.5(q0.75 - q0.25)
• q0.75
• Median
• q0.25
• Smallest value above q0.25 - 1.5(q0.75 - q0.25)
Individual points are marked with x.
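The five positions marked on a boxplot can be computed as follows. This is a sketch under stated assumptions: the function name `boxplot_summary` and the data are hypothetical, the quartiles come from `statistics.quantiles` with `method="exclusive"`, and values outside the 1.5 × IQR limits are reported separately as outliers.

```python
from statistics import median, quantiles


def boxplot_summary(values):
    """Quartiles, whisker ends and outliers for a box-and-whisker plot."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    upper_limit = q3 + 1.5 * iqr
    lower_limit = q1 - 1.5 * iqr
    # Whiskers end at the most extreme observations inside the limits
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "lower_whisker": inside[0],
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }


# Hypothetical data: nine ordinary values plus one extreme value
summary = boxplot_summary([11, 12, 13, 16, 16, 17, 18, 21, 22, 40])
print(summary)
```

Plotting libraries may use a different quartile convention, so whisker positions drawn by software can differ slightly from these numbers on small samples.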
Principles of a good « figure »
• Present complex results clearly, accurately, and efficiently
• Convey many ideas in the most efficient way
• Do not lie!

Form indicators
• γ1 < 0: asymmetry
• Symmetry
• γ1 > 0: asymmetry
[Figure: three boxplots showing the relative positions of Q1, Q2, and Q3 in each case]
A BAD figure
[Figure: bar chart titled "Digestion interactions of coral"; vertical axis "Freq.", 0 to 120; series: Wins, Losses]

A GOOD figure
[Figure: bar chart; vertical axis "Frequency", 0 to 60; horizontal axis "Taxon"; series: Wins, Losses]

Taxa shown in both figures: Acroporidae, Porites (M), Mussidae, Alcyonaceans, Porites (B), Algae, Faviidae, Sponges.
Figure 3. Digestion interactions for coral taxa sampled at Pioneer Bay, Orpheus Island