Problem Set 3, CS181

Assignment 3
Jared Friedman
April 6, 2005

1.a. i. No such perceptron exists. Number the pixels one through nine, first by row and then by column, and number each weight according to its corresponding pixel. Since 7/9 > 3/4 and 6/9 < 3/4, the weights of any perceptron that performs this task must satisfy the following two relations:

−w0 + w1 + w2 + w3 + w4 + w5 + w6 − w7 − w8 − w9 < 0,
−w0 + w1 + w2 + w3 + w4 + w5 + w6 + w7 − w8 − w9 > 0.

Subtracting the first relation from the second gives 2w7 > 0, so w7 > 0. However, the weights must also satisfy

−w0 − w1 − w2 − w3 − w4 − w5 − w6 + w7 + w8 + w9 < 0,
−w0 − w1 − w2 − w3 − w4 − w5 − w6 − w7 + w8 + w9 > 0,

which together imply that w7 < 0. Thus we have a contradiction, and no such perceptron exists.

ii. Here is such a perceptron. Let w1 = 2, w2 = 2, w3 = 2, w4 = −1, w5 = −1, w6 = −1, w7 = −1, w8 = −1, w9 = −1, and let the threshold be 0.5.
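To make the part (ii) construction concrete, here is a minimal sketch that evaluates this perceptron on a 3 × 3 image. It assumes 0/1 pixel values listed in row-major order (the problem's exact input encoding is not reproduced above), and the two test images are purely illustrative.

```python
# Sketch: evaluate the part (ii) perceptron on a 3x3 binary image.
# Assumes 0/1 pixel inputs in row-major order (pixels 1..9, first by row,
# then by column); the example images below are illustrative only.

WEIGHTS = [2, 2, 2, -1, -1, -1, -1, -1, -1]   # w1..w9 from part (ii)
THRESHOLD = 0.5

def perceptron_output(pixels):
    """Return 1 if the weighted sum of the nine pixel values exceeds the threshold."""
    assert len(pixels) == 9
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, pixels))
    return 1 if weighted_sum > THRESHOLD else 0

if __name__ == "__main__":
    top_row_on = [1, 1, 1,
                  0, 0, 0,
                  0, 0, 0]
    bottom_row_on = [0, 0, 0,
                     0, 0, 0,
                     1, 1, 1]
    print(perceptron_output(top_row_on))     # weighted sum = 6  -> prints 1
    print(perceptron_output(bottom_row_on))  # weighted sum = -3 -> prints 0
```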

iii. No such perceptron exists. Assume that there is such a perceptron; I will derive a contradiction by considering the 2 × 2 upper-left corner. If such a perceptron exists, its weights must satisfy

+w1 + w2 + w4 − w5 − w0 − w3 − w6 − w7 − w8 − w9 > 0 and
−w1 + w2 + w4 − w5 − w0 − w3 − w6 − w7 − w8 − w9 < 0,

which together imply that w1 > 0. Similarly, we have that

−w1 − w2 − w4 + w5 − w0 − w3 − w6 − w7 − w8 − w9 > 0 and
+w1 − w2 − w4 + w5 − w0 − w3 − w6 − w7 − w8 − w9 < 0,

which together imply that w1 < 0. This is a contradiction, so no such perceptron exists.

b. i. A perceptron with 196 inputs has 196 + 1 = 197 weights, including the bias.

ii. I assume that the number of output nodes is 10, as with all networks for digit recognition. Then, following the theorem in the notes, an upper bound on the VC dimension is given by 2(197)(108) log2(e · 108) = 339134.

iii. The decision boundary of a decision stump with n inputs is a hyperplane in n dimensions. However, this hyperplane is subject to the additional restriction that it must be perpendicular to one of the basis vectors. Thus, an upper bound on its VC dimension is the VC dimension of perceptrons with the same number of inputs, which is n + 1, or 197.

v. Consider decision trees with one continuous input, and consider the set S consisting of the first N positive integers. For any dichotomy on S, the following algorithm constructs a decision tree that represents it (a sketch in code follows this paragraph). Each split is of the form X > a + 0.5, where a is an integer, starting with a = 1 at the root. At the split X > a + 0.5, the branch where the test fails is a leaf with a positive classification if a is a positive instance of the dichotomy and a negative classification if a is a negative instance (any example reaching this node satisfies X ≥ a, so failing the test means X = a); the branch where the test succeeds leads to the split X > (a + 1) + 0.5, and after the split for a = N − 1 it is a leaf labeled with the classification of N. Clearly, we can use this method to construct a decision tree that shatters a set of size N, for any N, so the VC dimension of decision trees is infinite. This is not a valid reason to reject decision trees, because it shows only that the upper bound we derived for the sample complexity in terms of the VC dimension is infinite; the lower bound we derived is still finite. Also, even if the sample complexity is not polynomial, decision trees could be a weak learner that performs well in practice.
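A minimal sketch of this construction, assuming the chain-of-splits form just described; the dichotomy used in the demo is arbitrary and only illustrative, not one taken from the problem set.

```python
# Sketch of the part (v) construction: a chain of splits "X > a + 0.5" that
# realizes an arbitrary dichotomy on S = {1, ..., N}.

def classify(x, labels):
    """Walk the chain of splits and return the label the tree assigns to x.

    labels[a] is the desired classification (True/False) of the integer a.
    """
    n = max(labels)
    for a in range(1, n):
        if not (x > a + 0.5):      # the test at the node for a fails, so x == a
            return labels[a]
    return labels[n]               # every test succeeded, so x == n

if __name__ == "__main__":
    # An arbitrary dichotomy on {1, ..., 6}: True = positive, False = negative.
    dichotomy = {1: True, 2: False, 3: False, 4: True, 5: True, 6: False}
    # The tree reproduces the dichotomy exactly, so the set is shattered.
    assert all(classify(x, dichotomy) == dichotomy[x] for x in dichotomy)
    print("dichotomy reproduced:", dichotomy)
```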

c. Multi-layer feed-forward neural networks are clearly the most appropriate algorithm for the task.

Decision trees with continuous attributes are a poor choice because the inductive bias of decision trees is to look for hypotheses in which the classification depends primarily on the values of only a few attributes. Except possibly for the pixels on the border, all the inputs are roughly equally important in determining the classification. Similarly, the domain of digit recognition shows translation independence (if all the pixel values are shifted over by one, the classification should not change), and representing this will require a very large decision tree. On the other hand, the fact that the VC dimension of decision trees with continuous attributes is infinite suggests that this is a highly expressive hypothesis space. Decision trees may well be able to represent a correct hypothesis; however, they will have to be very large to do so, and will likely be untrainable.

Boosted decision stumps are a better choice. For one, as we can see in part b, the VC dimension of boosted decision stumps is very similar to the VC dimension of neural networks with one hidden layer. However, the inductive bias is still wrong. There is no reason to think that any particular pixel will have particularly good discriminatory value in this domain, as would be required for decision stumps to be a suitable choice. Also, classification into digits will likely depend upon a fairly complex interaction between many pixels; boosted decision stumps, being a kind of weighted sum, will have trouble modeling this interaction.

Perceptrons are a reasonable choice with respect to their inductive bias, and will probably perform reasonably well, especially considering the short amount of time needed to train them. However, they are simply not expressive enough to achieve top performance. As we can see from part b, the VC dimension of perceptrons is much less than the VC dimension of a neural net with a sizable hidden layer.

Multi-layer feed-forward neural networks combine the inductive bias of perceptrons with the high dimensionality of the other two algorithms, and are therefore the best choice.

3.a. The error function F addresses the problem that the weights in a neural network tend to grow over a large amount of training and can become unreasonably large, which can lead to overfitting. To correct for this, F penalizes large weights: it is a weighted sum of the training-set error and the square of the Euclidean distance between the weight vector and the origin.
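For reference, one plausible way to write such an error function explicitly (the precise form is an assumption here, since the problem statement is not reproduced above, but it is consistent with the description and with the 2λw_ji term that appears in the gradient in part b):

```latex
% A plausible form of the weight-decay error function F described above.
% E_D is the usual training-set error; the penalty term is lambda times the
% squared Euclidean norm of the weight vector, whose derivative with respect
% to w_ji is 2*lambda*w_ji, matching part (b).
F(\mathbf{w}) \;=\; E_D(\mathbf{w}) \;+\; \lambda \sum_{j,\,i} w_{ji}^{2}
```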

b. To derive a weight update rule, all we need to do is to calculate ∂F/∂w_ji. Since F is a sum of two terms, we can take the partial derivatives of those terms separately, and we find that

$$\frac{\partial F}{\partial w_{ji}} \;=\; -\sum_{d \in D} a_{dj}\,\delta_{id} \;+\; 2\lambda w_{ji}.$$

Therefore, the update rule is

$$w_{ji} \;\leftarrow\; w_{ji} + \alpha\left(\sum_{d \in D} a_{dj}\,\delta_{id} \;-\; 2\lambda w_{ji}\right).$$
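To make the update concrete, here is a minimal numeric sketch for a single weight; the learning rate, λ, and the per-example products a_dj·δ_id are all made-up illustrative values rather than values from the assignment.

```python
# Sketch of the part 3(b) update for a single weight w_ji:
#   w_ji <- w_ji + alpha * ( sum over d in D of a_dj * delta_id  -  2 * lambda * w_ji )
# The per-example products a_dj * delta_id, alpha, and lam below are made-up
# numbers used only to illustrate the arithmetic.

alpha = 0.1   # learning rate
lam = 0.01    # weight-decay coefficient (lambda)

def update_weight(w_ji, a_delta_products):
    """Apply one batch update: gradient step on the error term plus weight decay."""
    gradient_term = sum(a_delta_products)      # sum over d in D of a_dj * delta_id
    return w_ji + alpha * (gradient_term - 2.0 * lam * w_ji)

if __name__ == "__main__":
    w = 0.5
    a_delta_products = [0.2, -0.05, 0.1]       # one entry per training example d
    w = update_weight(w, a_delta_products)
    print(w)                                   # 0.5 + 0.1 * (0.25 - 0.01) ≈ 0.524
```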
