Liu, Smith 1
Jamie Liu and Adam Smith
6.825 – Project 2
11/4/2004
We learned a lot from this project. Enjoy.
1. Variable Elimination Functionality
After executing our variable elimination procedure, we obtained the following results for each of the queries below. To make the PropCost probability distributions obtained from the insurance network throughout this project easier to compare, we define the function f as the probability-weighted sum of the cost levels in the discrete domain, which yields a single scalar value representative of the overall cost. More specifically,
f = 1E5*PHundredThou + 1E6*PMillion + 1E4*PTenThou + 1E3*PThousand
where PHundredThou is the probability that PropCost = HundredThou, and likewise for the other three terms.
1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.7158281646356072
<[Burglary] = [true]> = 0.284171835364393
2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.8239331615949207
<[Earthquake] = [true]> = 0.17606683840507917
3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.1729786918964137
<[PropCost] = [Million]> = 0.02709352198178344
<[PropCost] = [TenThou]> = 0.3427002442093675
<[PropCost] = [Thousand]> = 0.45722754191243536
(f = 48275.62)
These results are consistent with those obtained by executing the given enumeration procedure, and those given in Table 1 of the project hand-out.
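For readers who want to see the mechanics behind query 1, the following is a minimal sketch of variable elimination on the burglary network. It is not the project's implementation. The CPT values are the standard textbook ones and are an assumption here (the handout's network may use different numbers), and the names `Factor`, `cpt`, `restrict`, `multiply`, `sum_out`, and `eliminate` are hypothetical helpers. Under these assumed CPTs the sketch gives roughly 0.284 for Burglary = true, in line with the first result above.

```python
from itertools import product


class Factor:
    """A table over Boolean variables: maps a tuple of values to a number."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)
        self.table = dict(table)


def cpt(child, parents, p_true):
    """Build the factor P(child | parents) from P(child = True | parents)."""
    table = {}
    for parent_values, p in p_true.items():
        table[(True,) + parent_values] = p
        table[(False,) + parent_values] = 1.0 - p
    return Factor((child,) + tuple(parents), table)


def restrict(factor, var, value):
    """Fix an evidence variable to its observed value and drop it."""
    if var not in factor.variables:
        return factor
    i = factor.variables.index(var)
    table = {k[:i] + k[i + 1:]: p for k, p in factor.table.items() if k[i] == value}
    return Factor(factor.variables[:i] + factor.variables[i + 1:], table)


def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    table = {}
    for values in product((True, False), repeat=len(variables)):
        row = dict(zip(variables, values))
        table[values] = (f.table[tuple(row[v] for v in f.variables)]
                         * g.table[tuple(row[v] for v in g.variables)])
    return Factor(variables, table)


def sum_out(factor, var):
    """Marginalize one variable out of a factor."""
    i = factor.variables.index(var)
    table = {}
    for k, p in factor.table.items():
        key = k[:i] + k[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(factor.variables[:i] + factor.variables[i + 1:], table)


def eliminate(factors, query, evidence, order):
    """Posterior over `query`, eliminating the hidden variables in `order`."""
    for var, value in evidence.items():
        factors = [restrict(f, var, value) for f in factors]
    for var in order:
        touching = [f for f in factors if var in f.variables]
        if not touching:
            continue
        rest = [f for f in factors if var not in f.variables]
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors = rest + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    assert result.variables == (query,)
    total = sum(result.table.values())
    return {k[0]: p / total for k, p in result.table.items()}


# Assumed textbook CPTs for the burglary network (not taken from the handout).
ALARM = [
    cpt("B", (), {(): 0.001}),
    cpt("E", (), {(): 0.002}),
    cpt("A", ("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                          (False, True): 0.29, (False, False): 0.001}),
    cpt("J", ("A",), {(True,): 0.90, (False,): 0.05}),
    cpt("M", ("A",), {(True,): 0.70, (False,): 0.01}),
]

posterior = eliminate(ALARM, "B", {"J": True, "M": True}, order=["E", "A"])
print(posterior)
```

The elimination order is passed in explicitly, which is the knob that Sections 3 and 4 vary.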
2. More Variable Elimination Exercise
A. Insurance Network Queries
1. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
If the MakeModel of the car in question is SportsCar then, based on the network illustrated in Figure 1 of the handout, we expect that the driver is less risk averse, that the driver has more money, and that the car is of higher value. All of these should cause the cost of insurance to “go up” relative to our previous query, which did not involve any evidence about the MakeModel of the car. An increase in the PropCost sense means that the probability distribution shifts towards the higher-cost elements of the domain (e.g. Million becomes more probable and Thousand less probable). Indeed, this is what happens. As can be seen below, f is about 3,750 dollars greater in this case than in Section 1.3.
<[PropCost] = [HundredThou]> = 0.17179333672003955
<[PropCost] = [Million]> = 0.03093877334365239
<[PropCost] = [TenThou]> = 0.34593039737969233
<[PropCost] = [Thousand]> = 0.45133749255661565
(f = 52028.74)
2. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
In this case, counter-intuitive as it may seem, if the driver is a GoodStudent, then the overall cost of insurance goes up. This follows from the network shown in Figure 1 of the project handout: GoodStudent is connected to the rest of the network only through its two parents, Age and SocioEcon. Since Age is an evidence variable, SocioEcon is the only node directly affected by adding GoodStudent to the evidence. More specifically, if the adolescent driver is a good student, they are likely to have more money, and thus to drive fancier cars, be less risk averse, et cetera. Variable elimination with this evidence bears this out: f is about 3,580 dollars greater in this case than in Section 1.3.
<[PropCost] = [HundredThou]> = 0.1837467917616061
<[PropCost] = [Million]> = 0.029748793596801583
<[PropCost] = [TenThou]> = 0.32771416728772235
<[PropCost] = [Thousand]> = 0.4587902473538701
(f = 51859.40)
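The f values quoted in Sections 1 and 2 can be rechecked directly from the reported distributions. The short sketch below applies the definition of f from Section 1; the names `COST`, `f`, `BASELINE`, `SPORTS_CAR`, and `GOOD_STUDENT` are introduced only for this illustration.

```python
# Cost assigned to each PropCost level, following the definition of f in Section 1.
COST = {"HundredThou": 1e5, "Million": 1e6, "TenThou": 1e4, "Thousand": 1e3}


def f(distribution):
    """Probability-weighted sum of the PropCost levels."""
    return sum(COST[level] * p for level, p in distribution.items())


# Distributions copied from the variable elimination results above.
BASELINE = {"HundredThou": 0.1729786918964137, "Million": 0.02709352198178344,
            "TenThou": 0.3427002442093675, "Thousand": 0.45722754191243536}
SPORTS_CAR = {"HundredThou": 0.17179333672003955, "Million": 0.03093877334365239,
              "TenThou": 0.34593039737969233, "Thousand": 0.45133749255661565}
GOOD_STUDENT = {"HundredThou": 0.1837467917616061, "Million": 0.029748793596801583,
                "TenThou": 0.32771416728772235, "Thousand": 0.4587902473538701}

for name, dist in [("baseline", BASELINE), ("sports car", SPORTS_CAR),
                   ("good student", GOOD_STUDENT)]:
    print(f"{name}: f = {f(dist):.2f}, change vs baseline = {f(dist) - f(BASELINE):.2f}")
```

Because the Million term carries a weight of 1E6, small shifts in that one probability dominate the change in f.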
B. Carpo Network Queries
1. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.9880400004226929
<[N112] = [1]> = 0.01195999957730707
2. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.899999996961172
<[N143] = [1]> = 0.10000000303882783
3. Random Elimination Ordering
A. Histograms
[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 1. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]
Figure 1. Histogram of Computation Time for P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar).
[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 2. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]
Figure 2. Histogram of Computation Time for P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True).
[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 3. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]
Figure 3. Histogram of Computation Time for P(N112 | N64 = "3", N113 = "1", N116 = "0").
[Figure: Histogram of Computation Time under Random Elimination Ordering: Problem 4. x-axis: Trials (1 to 10); y-axis range: 0 to 6000.]
Figure 4. Histogram of Computation Time for P(N143 | N146 = "1", N116 = "0", N121 = "1").
B. Discussion
Figure 1 through Figure 4 illustrate the running time of random-order variable elimination for each of the problems in Task 2 of the project handout. We ran the algorithm ten times for each problem. If a bar is stacked with a purple bar on top of it, then the heap ran out of memory during that execution. In this case, we know only that the execution would have taken at least the time illustrated by the blue bar, i.e. the time it ran before running out of memory. We suppose that each execution where the computer ran out of memory would have taken at least 5000 seconds to complete. It is worth noting that the time taken by the successful runs (the samples without a purple bar) is much lower than the time the unsuccessful runs took before they crashed; i.e. the successful blue bars tend to be shorter than the unsuccessful blue bars. This suggests that a random ordering tends to turn out either very good or very bad.
4. Greedy Elimination Ordering
A. Histograms
[Figure: Greedy Variable Elimination Runtimes. x-axis: Problem Number (1 to 4); y-axis range: 0 to 1.4.]
Figure 5. Greedy Variable Elimination Runtimes for 10 trials of running each of 4 problems.
Problem          Average Time (seconds)
Insurance – 1    0.629
Insurance – 2    1.086
Carpo – 1        0.088
Carpo – 2        0.087
Table 1. Average time of execution for variable elimination for the problems from Task 2. Averages are taken across ten independent runs each, which are illustrated in Figure 5.
B. Discussion
As can be seen from Table 1, variable elimination takes far less time with a greedy elimination ordering than with a random ordering. This makes sense, because a random ordering could happen to eliminate a parent of many children early, creating a huge factor that slows down the algorithm and eats up memory. Greedy ordering, by contrast, works very well. Even compared with the runs from Section 3 that did not run out of memory, the greedy algorithm tends to be about 100-200 times faster.
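The "parent of many children" effect can be made concrete with a toy example. The sketch below assumes a min-degree greedy heuristic (the report does not state which greedy criterion was used) and measures an ordering by the largest factor it creates, counted in variables. The functions `elimination_width` and `greedy_order` and the star-shaped graph are hypothetical illustrations, not the project code.

```python
def elimination_width(neighbors, order):
    """Largest number of variables in any factor created when eliminating in `order`."""
    graph = {v: set(ns) for v, ns in neighbors.items()}
    width = 0
    for v in order:
        ns = graph.pop(v)
        # The factor built for v spans v and all of its current neighbors.
        width = max(width, len(ns) + 1)
        for a in ns:
            graph[a].discard(v)
            graph[a] |= ns - {a}  # eliminating v links its neighbors together
    return width


def greedy_order(neighbors):
    """Min-degree greedy ordering: always eliminate the variable with fewest neighbors."""
    graph = {v: set(ns) for v, ns in neighbors.items()}
    order = []
    while graph:
        v = min(graph, key=lambda u: (len(graph[u]), u))  # ties broken by name
        ns = graph.pop(v)
        for a in ns:
            graph[a].discard(v)
            graph[a] |= ns - {a}
        order.append(v)
    return order


# One parent "P" with six children: the interaction graph is a star.
children = [f"C{i}" for i in range(1, 7)]
star = {"P": set(children), **{c: {"P"} for c in children}}

parent_first = ["P"] + children
print("parent first:", elimination_width(star, parent_first))
print("greedy:      ", elimination_width(star, greedy_order(star)))
```

With Boolean variables, a factor over k variables has 2^k entries, so the gap between the two orderings grows exponentially with the number of children.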
5. Likelihood Weighting and Gibbs Sampling Functionality
Each of our results below looks like it is in the right neighborhood. We give more explicit quality results in the problems that follow this one.
A. Basic Results – Likelihood Weighting
1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.5448387970739699
<[Burglary] = [true]> = 0.4551612029260302
2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.9997158283603297
<[Earthquake] = [true]> = 2.8417163967036946E-4
3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.17105091038203132
<[PropCost] = [Million]> = 0.021563876240368398
<[PropCost] = [TenThou]> = 0.35877461270610517
<[PropCost] = [Thousand]> = 0.44861060067149516
4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
<[PropCost] = [HundredThou]> = 0.16339257873401916
<[PropCost] = [Million]> = 0.030620517617711222
<[PropCost] = [TenThou]> = 0.35048331774243846
<[PropCost] = [Thousand]> = 0.4555035859058312
5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
<[PropCost] = [HundredThou]> = 0.20177159162635994
<[PropCost] = [Million]> = 0.032866049889275516
<[PropCost] = [TenThou]> = 0.30414914618811645
<[PropCost] = [Thousand]> = 0.46121321229624807
6. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.9910128302117664
<[N112] = [1]> = 0.00898716978823346
7. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.9172494563262301
<[N143] = [1]> = 0.08275054367376986
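As a reminder of what likelihood weighting computes, here is a minimal sketch on the burglary network. It is an illustration under assumptions rather than the code that produced the numbers above: the CPT values are the standard textbook ones (the handout's may differ), and `NETWORK` and `likelihood_weighting` are hypothetical names. Evidence variables are clamped and contribute to the sample's weight; all other variables are sampled from their CPTs in topological order.

```python
import random

# Each entry: (variable, parents, P(variable = True | parent values)).
# Listed in topological order. Values are assumed textbook CPTs.
NETWORK = [
    ("B", (), {(): 0.001}),
    ("E", (), {(): 0.002}),
    ("A", ("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    ("J", ("A",), {(True,): 0.90, (False,): 0.05}),
    ("M", ("A",), {(True,): 0.70, (False,): 0.01}),
]


def likelihood_weighting(query, evidence, n_samples, rng):
    """Estimate P(query | evidence) from n_samples weighted samples."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n_samples):
        sample, weight = {}, 1.0
        for var, parents, table in NETWORK:
            p_true = table[tuple(sample[p] for p in parents)]
            if var in evidence:
                # Clamp the evidence and weight by its likelihood given the parents.
                sample[var] = evidence[var]
                weight *= p_true if evidence[var] else 1.0 - p_true
            else:
                sample[var] = rng.random() < p_true
        totals[sample[query]] += weight
    norm = totals[True] + totals[False]
    return {value: w / norm for value, w in totals.items()}


estimate = likelihood_weighting("B", {"J": True, "M": True}, 20000, random.Random(0))
print(estimate)
```

Because Burglary = true is sampled from a prior of only 0.001 in this assumed network, few samples carry most of the weight, so small sample sizes can give estimates far from the exact answer.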
B. Basic Results – Gibbs Sampling
1. P(Burglary | JohnCalls = true, MaryCalls = true)
<[Burglary] = [false]> = 0.71
<[Burglary] = [true]> = 0.29
2. P(Earthquake | JohnCalls = true, Burglary = true)
<[Earthquake] = [false]> = 0.842
<[Earthquake] = [true]> = 0.158
3. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou)
<[PropCost] = [HundredThou]> = 0.06
<[PropCost] = [Million]> = 0.01
<[PropCost] = [TenThou]> = 0.355
<[PropCost] = [Thousand]> = 0.5750000000000001
4. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar)
<[PropCost] = [HundredThou]> = 0.09
<[PropCost] = [Million]> = 0.011
<[PropCost] = [TenThou]> = 0.34
<[PropCost] = [Thousand]> = 0.559
5. P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True)
<[PropCost] = [HundredThou]> = 0.213
<[PropCost] = [Million]> = 0.038
<[PropCost] = [TenThou]> = 0.372
<[PropCost] = [Thousand]> = 0.377
6. P(N112 | N64 = “3”, N113 = “1”, N116 = “0”)
<[N112] = [0]> = 0.97
<[N112] = [1]> = 0.03
7. P(N143 | N146 = “1”, N116 = “0”, N121 = “1”)
<[N143] = [0]> = 0.922
<[N143] = [1]> = 0.078
6. Ignoring Prefix of Samples in Gibbs Sampling
A. Results
[Figure: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away (0 to 1000); y-axis: KL Divergence.]
Figure 6. Quality (KL divergence) of estimates produced by the Gibbs sampler. Each run used 2000 samples and threw away the first x samples, where x is the independent variable shown on the x-axis.
[Figure: Prefix Throwaway in Gibbs Sampling. x-axis: Size of Prefix Thrown Away (0 to 1000); y-axis: Average KL Divergence.]
Figure 7. Averages for different prefix throwaway sizes from Figure 6.
B. Discussion
In this analysis, we ran the Gibbs sampler with 2000 samples on the same problem (Carpo – 1). For each iteration, we threw away a variable number of the first samples. The idea is that since Gibbs sampling is a Markov chain algorithm, each sample depends strongly on the sample before it. Since we choose a random initial value for each variable, it can take some “burn in” time before the chain settles toward the correct distribution. The results of our experiments are shown in Figure 6 and Figure 7. The average graph shows a fairly nice characteristic curve, with the only exception being the point where we threw away the first 600 samples. Looking at the individual runs, however, at x = 600 there was a single outlier with an extremely high KL divergence; we can discount it given the many runs that we did. It seems that the ideal “burn in” time, a tradeoff between good initialization and diversity of counted samples, is 800 samples.
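The role of the discarded prefix is easiest to see in code. The following is a minimal Gibbs sampler sketch with a `burn_in` parameter. It runs on the small burglary network with assumed textbook CPT values rather than on Carpo, and `NETWORK`, `joint`, and `gibbs` are hypothetical names, so it illustrates the technique rather than reproducing the experiment above.

```python
import random

# Each entry: (variable, parents, P(variable = True | parent values)).
# Values are assumed textbook CPTs for the burglary network.
NETWORK = [
    ("B", (), {(): 0.001}),
    ("E", (), {(): 0.002}),
    ("A", ("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    ("J", ("A",), {(True,): 0.90, (False,): 0.05}),
    ("M", ("A",), {(True,): 0.70, (False,): 0.01}),
]


def joint(state):
    """Joint probability of a complete assignment."""
    p = 1.0
    for var, parents, table in NETWORK:
        p_true = table[tuple(state[q] for q in parents)]
        p *= p_true if state[var] else 1.0 - p_true
    return p


def gibbs(query, evidence, n_samples, burn_in, rng):
    """Estimate P(query | evidence), ignoring the first `burn_in` samples."""
    if not 0 <= burn_in < n_samples:
        raise ValueError("burn_in must be in [0, n_samples)")
    state = dict(evidence)
    hidden = [var for var, _, _ in NETWORK if var not in evidence]
    for var in hidden:
        state[var] = rng.random() < 0.5  # random initialization
    counts = {True: 0, False: 0}
    for i in range(n_samples):
        for var in hidden:
            # Resample var from its conditional given every other variable.
            state[var] = True
            p_t = joint(state)
            state[var] = False
            p_f = joint(state)
            state[var] = rng.random() < p_t / (p_t + p_f)
        if i >= burn_in:
            counts[state[query]] += 1
    kept = n_samples - burn_in
    return {value: c / kept for value, c in counts.items()}


print(gibbs("B", {"J": True, "M": True}, 2000, 800, random.Random(0)))
```

Computing the full joint for each resampling step is wasteful on larger networks, where only the factors in the variable's Markov blanket are needed; it is used here only to keep the sketch short.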
7. Detailed Analysis – KL Divergences
A. Results
We present results indexed first by algorithm (Likelihood Weighting, then Gibbs Sampling) and then by problem. Within each problem we display two graphs: the first shows the results from ten iterations, and the second shows the KL divergence averaged across those iterations.
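For reference, the quality measure used in these graphs can be computed as in the sketch below. It assumes the exact (variable elimination) distribution is the first argument and the sampled estimate is the second, with natural logarithms; the report does not state the direction or the log base, so both are assumptions, and `kl_divergence` is a hypothetical name. The example distributions are the Carpo – 1 results from Sections 2.B and 5.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum over x of p(x) * ln(p(x) / q(x)), in nats."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        if q[x] == 0.0:
            return math.inf  # the estimate rules out an outcome that can occur
        total += px * math.log(px / q[x])
    return total


# P(N112 | N64 = "3", N113 = "1", N116 = "0") from the results above.
exact = {"0": 0.9880400004226929, "1": 0.01195999957730707}
likelihood_weighting_estimate = {"0": 0.9910128302117664, "1": 0.00898716978823346}
gibbs_estimate = {"0": 0.97, "1": 0.03}

print("likelihood weighting:", kl_divergence(exact, likelihood_weighting_estimate))
print("Gibbs sampling:      ", kl_divergence(exact, gibbs_estimate))
```

A value of zero means the estimate matches the exact distribution, and larger values mean a worse estimate.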
1. Likelihood Weighting
[Figure: Likelihood Weighting - Problem Insurance1. x-axis: Number of Samples; y-axis: KL Divergence.]
Figure 3. KL Divergences when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar).
[Figure: Likelihood Weighting: Average KL Divergence Problem Insurance1. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 4. Average KL Divergence when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) to sample sizes between 100 and 2000.
[Figure: Likelihood Weighting - Problem Insurance2. x-axis: Sample Size (100 to 2000); y-axis: KL Divergence.]
Figure 8. KL Divergences when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True).
[Figure: Likelihood Weighting: Average KL Divergence Problem Insurance2. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 9. Average KL Divergence when applying Likelihood Weighting to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True).
[Figure: Likelihood Weighting - Problem 3. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 10. KL Divergences when applying Likelihood Weighting to P(N112 | N64 = "3", N113 = "1", N116 = "0").
[Figure: Likelihood Weighting: Average KL Divergence Problem Carpo1. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 11. Average KL Divergence when applying Likelihood Weighting to P(N112 | N64 = "3", N113 = "1", N116 = "0").
[Figure: Likelihood Weighting - Problem 4. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 12. KL Divergences when applying Likelihood Weighting to P(N143 | N146 = "1", N116 = "0", N121 = "1").
[Figure: Likelihood Weighting: Average KL Divergence Problem Carpo2. x-axis: Number of Samples (100 to 2000); y-axis: KL Divergence.]
Figure 13. Average KL Divergence when applying Likelihood Weighting to P(N143 | N146 = "1", N116 = "0", N121 = "1").
2. Gibbs Sampling
[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 1. x-axis: Number of Samples (1000 to 25000); y-axis: KL Divergence.]
Figure 14. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 1. x-axis: Number of Samples (1000 to 25000); y-axis: Average KL Divergence.]
Figure 15. Average divergence resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, MakeModel = SportsCar) for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 2. x-axis: Number of Samples (1000 to 25000); y-axis: KL Divergence.]
Figure 16. Divergences resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True) for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 2. x-axis: Number of Samples (1000 to 25000); y-axis: Average KL Divergence.]
Figure 17. Average divergence resulting from Gibbs Sampling applied to P(PropCost | Age = Adolescent, Antilock = False, Mileage = FiftyThou, GoodStudent = True) for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 3. x-axis: Number of Samples (1000 to 25000); y-axis: KL Divergence.]
Figure 18. Divergences resulting from Gibbs Sampling applied to P(N112 | N64 = "3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 3. x-axis: Number of Samples (1000 to 25000); y-axis: Average KL Divergence.]
Figure 19. Average Divergence resulting from Gibbs Sampling applied to P(N112 | N64 = "3", N113 = "1", N116 = "0") for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: KL Divergences vs Number of Samples for Problem 4. x-axis: Number of Samples (1000 to 25000); y-axis: KL Divergence.]
Figure 20. Divergences resulting from Gibbs Sampling applied to P(N143 | N146 = "1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.
[Figure: Gibbs Sampling: Average KL Divergence vs Number of Samples for Problem 4. x-axis: Number of Samples (1000 to 25000); y-axis: Average KL Divergence.]
Figure 21. Average divergence resulting from Gibbs Sampling applied to P(N143 | N146 = "1", N116 = "0", N121 = "1") for sample sizes between 1000 and 25000.
B. Discussion of Results
Four interesting things:
1. Number of samples in Gibbs versus Likelihood Weighting
As seen from the figures in Section 7.A.1, likelihood weighting tends to converge after about 500 samples, and always by 1000 samples in our problems and analyses. We originally assumed that Gibbs sampling would converge in about the same time, if not faster. It turns out that Gibbs takes much longer; it typically converges by 5000 samples, a full order of magnitude more, as can be seen from the figures in Section 7.A.2. This is likely because of the Markov chain approach used: since each sample depends on the one before it, it can take many iterations before the algorithm settles into the global optimum, whereas in likelihood weighting each sample is drawn independently and carries the appropriate probability (i.e. weight) by construction.
2. Variance of time to converge can be high
The convergence of Likelihood Weighting in Problem 3, as illustrated in Figure 10 and Figure 11, exhibits very interesting properties. In the other problems, likelihood weighting runs tended to exhibit relatively low variance in time to convergence. Here, however, we see some runs that converged very quickly and others that took abnormally long. This high variance occurred consistently in this problem, and thus is likely induced by some characteristic of the problem; one likely explanation is that our query variable is a leaf node in a very polytree-like network.
3. Convergence is logarithmic
This is an evident feature of all of the graphs, and it has enormous implications for the choice of algorithm. The criterion for “completeness” of an algorithm is that it arrives at the right answer. The sampling methods that we surveyed unfortunately take infinite time to arrive at the right answer, whereas variable elimination always arrives at the exact answer. Thus, if a user needs completeness (i.e. the right answer), they should probably use variable elimination. However, even if they only need a certain level of completeness, i.e. they want to be x% right, they still cannot rely on sampling methods to deliver it on every run. This gives rise to the “x% correct y% of the time” metric. We certainly see this in our graphs.
4. Local optima in Gibbs sampling, but not in Likelihood Weighting
This is a very interesting point. In both Problems 3 and 4 from Task 2 under Gibbs sampling, one of the runs does not converge to zero. Instead, it seems to converge to a local optimum (which is not the global optimum). This can be seen in the pink line in Figure 18 and the jungle green line in Figure 20. This is probably more likely in some networks than others. We could probably construct a very simple network that would not provoke this behavior.
C. Computational Considerations – Sampling versus Variable Elimination
In comparing the computation time of sampling methods to variable elimination, we limit ourselves to greedy ordering variable elimination, since random ordering is very sub-optimal (see Section 4). It turns out that for the networks and queries that we considered, variable elimination is the champ on both accuracy and speed. As can be seen from Table 2, variable elimination ran in about a second or less on each problem, while Gibbs took about 15 seconds and Likelihood Weighting took around 5 seconds. This is with 1000 samples for the sampling algorithms, and effectively infinite samples for variable elimination. Our results might have been different if the networks involved were much more dense (i.e. connected) or much larger.
                       Task2. Insurance 1   Task2. Insurance 2   Task2. Carpo 1   Task2. Carpo 2
Variable Elimination   0.741                1.142                0.120            0.090
Gibbs Sampling         12.778               13.530               19.228           18.045
Likelihood Weighting   4.377                4.687                5.608            5.317
Table 2. Execution time, in seconds, of various algorithms on the four problems from Task 2.