Deep Learning
Kairit Sirts
Lecture in TUT, 19.12.2016
Outline
• What can be done with deep learning?
• Deep learning demystified
• How can you get started with deep learning?
Why deep learning?
Figure: performance of deep learning compared with random forest, gradient boosting, and a linear model (see the article below).
http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html
What can be done with deep learning?
Handwritten digit recognition
• MNIST benchmark dataset
• The best reported error rate is 0.21%
Street view number recognition
• Obtained from house numbers in Google Street View images
• Best error rate is 1.69%
Image classification
• 10 object classes
• 6,000 labeled instances per class
• Best accuracy so far: 96.53%
Image classification (fine-grained)
• 20 superclasses
• 100 fine-grained classes
• 600 labeled images per class
• Best classification accuracy: 75.72%
Detecting doodles
• https://quickdraw.withgoogle.com
• There are other simple and fun AI experiments launched by Google: https://aiexperiments.withgoogle.com
Image captioning
Image captioning – not so great results
Automatic colorization of images
http://richzhang.github.io/colorization/resources/images/teaser3.jpg
Automatic colorization of images – failure cases
DeepDream
https://deepdreamgenerator.com
Word embeddings
http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png
Word embeddings
Figure: 2D projection of word embeddings in which months, weekdays, and numbers each form their own cluster.
Word embeddings
• $W(\text{man}) - W(\text{woman}) \approx W(\text{king}) - W(\text{queen})$
• $W(\text{walking}) - W(\text{walked}) \approx W(\text{swimming}) - W(\text{swam})$
Automatic text generation – pseudo Shakespeare
http://karpathy.github.io/2015/05/21/rnn-effectiveness
Machine translation
• Google Translate app
Learning to play Atari arcade games
https://www.youtube.com/watch?v=cjpEIotvwFY
AlphaGo
https://www.youtube.com/watch?v=PQCrX1sQSzY
Other tasks tackled with deep neural networks
• Speech recognition
• Various tasks in robotics
• Log analysis / risk detection
• Recommendation systems
• Motion detection from videos
• Business and economics analytics
• Etc.
Deep learning demystified
How does deep learning work?
• Biological neuron
• Artificial neuron
http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
• Biological neural network
https://www.eeweb.com/blog/rob_riemen/deep-machine-learning-and-the-google-brain
• Artificial neural network
http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
What happens inside a neuron?
The neuron computes a weighted sum of its inputs:

$z = x_1 w_1 + x_2 w_2 + \cdots + x_n w_n = \sum_{i=1}^{n} x_i w_i$

Output: $h = f(z)$
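As a minimal sketch, the whole computation of one artificial neuron fits in a few lines of Python (the inputs, weights, and choice of sigmoid activation here are illustrative):

```python
import math

def neuron(x, w, f):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(x_i * w_i for x_i, w_i in zip(x, w))  # net input z
    return f(z)                                   # output h = f(z)

# Example with a sigmoid activation and made-up inputs/weights
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
print(neuron([1.0, 0.5], [0.3, -0.2], sigmoid))  # z = 0.2, output ~0.5498
```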
Activation function
• Threshold: $f(z) = \begin{cases} 1 & \text{if } z \ge \text{th} \\ 0 & \text{if } z < \text{th} \end{cases}$
• Sigmoid: $f(z) = \dfrac{1}{1 + e^{-z}}$
• Tanh: $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• ReLU: $f(z) = \max(0, z)$
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/neural_networks.html
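For concreteness, the four activation functions above written out in Python (the default threshold value is illustrative):

```python
import math

def threshold(z, th=0.0):
    """Step function: fires iff the net input reaches the threshold."""
    return 1.0 if z >= th else 0.0

def sigmoid(z):
    """Squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes the net input into (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """Passes positive net input through, clips negatives to zero."""
    return max(0.0, z)
```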
Single neuron logic gates • Threshold activation function
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
XOR gate
• Cannot be implemented with a single neuron
• A hidden layer is necessary
• The table below traces the construction XOR(x₁, x₂) = AND(OR(x₁, x₂), NOT AND(x₁, x₂)) with threshold units, where 𝕀 is the indicator function; a code sketch follows the table
x₁  x₂ | OR                       | NOT AND                          | AND (XOR output)
0   0  | 𝕀(0∙1 + 0∙1 > 0.5) = 0   | 𝕀(0∙(−1) + 0∙(−1) > −1.5) = 1    | 𝕀(0∙1 + 1∙1 > 1.5) = 0
0   1  | 𝕀(0∙1 + 1∙1 > 0.5) = 1   | 𝕀(0∙(−1) + 1∙(−1) > −1.5) = 1    | 𝕀(1∙1 + 1∙1 > 1.5) = 1
1   0  | 𝕀(1∙1 + 0∙1 > 0.5) = 1   | 𝕀(1∙(−1) + 0∙(−1) > −1.5) = 1    | 𝕀(1∙1 + 1∙1 > 1.5) = 1
1   1  | 𝕀(1∙1 + 1∙1 > 0.5) = 1   | 𝕀(1∙(−1) + 1∙(−1) > −1.5) = 0    | 𝕀(1∙1 + 0∙1 > 1.5) = 0

(The AND column takes the OR and NOT AND outputs as its two inputs.)
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
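The same construction in Python, a minimal sketch using exactly the weights and thresholds from the table:

```python
def gate(x, w, th):
    """Single threshold neuron: fires iff the weighted sum exceeds th."""
    return int(sum(xi * wi for xi, wi in zip(x, w)) > th)

def xor(x1, x2):
    h_or   = gate([x1, x2], [1, 1], 0.5)      # OR gate (hidden layer)
    h_nand = gate([x1, x2], [-1, -1], -1.5)   # NOT AND gate (hidden layer)
    return gate([h_or, h_nand], [1, 1], 1.5)  # AND of the hidden outputs

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))  # prints 0, 1, 1, 0
```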
How to assign weights?
8×9 + 9×9 + 9×9 + 9×4 = 270 weights
http://neuralnetworksanddeeplearning.com/
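The count follows directly from multiplying the sizes of adjacent layers; a quick sketch (the layer sizes 8, 9, 9, 9, 4 are read off the pictured network):

```python
layers = [8, 9, 9, 9, 4]  # assumed sizes of the pictured network
n_weights = sum(a * b for a, b in zip(layers, layers[1:]))
print(n_weights)  # 8*9 + 9*9 + 9*9 + 9*4 = 270
```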
Backpropagation
• The standard and efficient method for training neural networks
• The general idea:
  • Compute the error with a forward pass
  • Propagate the error back to change the weights such that the error becomes smaller

ERROR → ERROR′, where ERROR′ < ERROR
Diversion to calculus – the derivative
• $y' = f'(x)$
• The derivative is the slope of the tangent line
• It is the rate of change of the function at a point
Derivatives
• When $f'(x) = 0$, the point is a local or global maximum or minimum, or a saddle point
• When $f'(x) > 0$, the function is increasing
• When $f'(x) < 0$, the function is decreasing
Gradients
• Generalization of derivatives to multivariate functions
• The gradient is a vector pointing in the direction of steepest ascent
• $\nabla f(x, y) = \left( \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right)$
• $\dfrac{\partial f}{\partial x}$ and $\dfrac{\partial f}{\partial y}$ are partial derivatives: take the derivative with respect to one variable while treating all others as constants
Gradients and backpropagation
• Backpropagation is used to compute the gradients with respect to all parameters in a neural network.
• The gradients are then used in gradient descent, a general method for minimizing functions.
• We want to minimize the cost function that measures the error made by the neural network.
• To do that, we move in the direction of steepest descent, given by the negative gradient.
Gradient descent
• An iterative algorithm
• Start with initial parameter values $\theta^0$
• Update the parameters iteratively until convergence: $\theta^{t+1} := \theta^t - \alpha \nabla f(\theta^t)$
• $\alpha$ is the learning rate, which controls the step size
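A minimal sketch of this update rule on a toy function, $f(\theta) = \theta^2$, whose gradient is $2\theta$ (the starting point and learning rate are illustrative):

```python
def gradient_descent(grad_f, theta0, alpha=0.1, n_steps=100):
    """Repeatedly step against the gradient: theta <- theta - alpha * grad."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Minimize f(theta) = theta**2; its gradient is 2*theta
print(gradient_descent(lambda t: 2 * t, theta0=5.0))  # approaches the minimum at 0
```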
Deep learning demystified
How does backpropagation work?
Backpropagation explained
• Example from https://mattmazur.com/2015/03/17/
• 2 inputs
• 1 hidden layer with 2 neurons
• Bias terms in both the hidden and the output layer
• 2 outputs
Initial configuration
• Training values
• Initial weights: $w_1, \dots, w_8$
• Initial biases: $b_1, b_2$
Forward pass, step by step:
• First hidden unit
• Second hidden unit
• First output unit
• Second output unit
• Error of the first output
• Total output error
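The numeric computations on these slides follow the mattmazur.com walkthrough linked above; the sketch below reproduces the forward pass, assuming that example's standard starting values (the inputs, weights, biases, and targets are taken from the linked post, not from the slide text):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Starting values assumed from the mattmazur.com example
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden weights
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output weights
b1, b2 = 0.35, 0.60                       # hidden and output biases
t1, t2 = 0.01, 0.99                       # target outputs

# Forward pass through the hidden layer
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)          # ~0.593269992
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)          # ~0.596884378

# Forward pass through the output layer
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)  # ~0.751365070
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)  # ~0.772928465

# Squared error of each output, and the total
E1 = 0.5 * (t1 - out_o1) ** 2  # ~0.274811083
E2 = 0.5 * (t2 - out_o2) ** 2  # ~0.023560026
print(E1 + E2)                 # ~0.298371109, the error quoted on a later slide
```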
Backwards pass
• Consider $w_5$
• How much does a change in $w_5$ affect the total error?
• Apply the chain rule (explained on the next slide):
Chain rule
• Formula for computing the derivative of the composition of two or more functions
• $F(x) \equiv f(g(x)) \equiv (f \circ g)(x)$ – composition of functions $f$ and $g$
• $F'(x) = f'(g(x)) \, g'(x)$
• Example: $F(x) = e^{3x}$, with $g(x) = 3x$ and $f(g) = e^{g}$
• $F'(x) = f'(g(x)) \, g'(x) = e^{g(x)} \cdot (3x)' = e^{3x} \cdot 3 = 3e^{3x}$
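A quick numeric sanity check of this result using a central finite difference (the evaluation point and step size are arbitrary):

```python
import math

F = lambda x: math.exp(3 * x)
dF = lambda x: 3 * math.exp(3 * x)  # the chain-rule result

x, h = 0.7, 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)  # central difference approximation
print(numeric, dF(x))                      # the two values agree closely
```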
Backwards pass
• Returning to $w_5$: how much does a change in $w_5$ affect the total error?
• Apply the chain rule:
$\dfrac{\partial E_{total}}{\partial w_5} = \dfrac{\partial E_{total}}{\partial out_{o1}} \cdot \dfrac{\partial out_{o1}}{\partial net_{o1}} \cdot \dfrac{\partial net_{o1}}{\partial w_5}$
How much does error change wrt the output?
How much does output change wrt its net input?
Derivative of the sigmoid function
$f(z) = \dfrac{1}{1 + e^{-z}} \qquad f'(z) = f(z)\,(1 - f(z))$
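This identity can be checked numerically; a minimal sketch (the evaluation point and step size are arbitrary):

```python
import math

f = lambda z: 1.0 / (1.0 + math.exp(-z))

z, h = 0.5, 1e-6
numeric = (f(z + h) - f(z - h)) / (2 * h)  # finite-difference derivative
analytic = f(z) * (1 - f(z))               # f'(z) = f(z)(1 - f(z))
print(numeric, analytic)                   # both ~0.235004
```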
How much does output change wrt its net input?
How much does net input change wrt $w_5$?
Putting it all together
This is known as the delta rule
• The delta rule is the gradient-descent rule for updating the weights of the inputs to neurons in a single-layer neural network
Apply the delta rule to the output layer weights
Update the weights with gradient descent
• Set learning rate $\alpha = 0.5$

$\theta^{t+1} := \theta^t - \alpha \nabla f(\theta^t)$
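Continuing the forward-pass sketch above (same assumed values from the mattmazur.com example; this snippet runs when appended to it), the delta-rule update of $w_5$ looks like this:

```python
alpha = 0.5

# Chain rule: dE/dw5 = dE/dout_o1 * dout_o1/dnet_o1 * dnet_o1/dw5
dE_dout_o1 = out_o1 - t1                        # ~0.741365070
dout_dnet_o1 = out_o1 * (1 - out_o1)            # sigmoid derivative, ~0.186815602
dnet_dw5 = out_h1                               # net_o1 = w5*out_h1 + w6*out_h2 + b2
dE_dw5 = dE_dout_o1 * dout_dnet_o1 * dnet_dw5   # ~0.082167041

w5_new = w5 - alpha * dE_dw5                    # ~0.358916480, as in the example
print(w5_new)
```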
Backpropagation to the hidden layer
• Continue the backwards pass to calculate new values for $w_1$, $w_2$, $w_3$ and $w_4$
BP through hidden layer
• $out_{h1}$ affects both $o_1$ and $o_2$, so both must be taken into account:
$\dfrac{\partial E_{total}}{\partial out_{h1}} = \dfrac{\partial E_{o1}}{\partial out_{h1}} + \dfrac{\partial E_{o2}}{\partial out_{h1}}$
BP through hidden layer
• Consider one of those terms:
• The first factor can be calculated from values computed before
• The second factor is just $w_5$
BP through hidden layer
• Plug the values in
• Compute the same quantity for $o_2$
• Compute the total
BP through hidden layer
• Next we need $\dfrac{\partial out_{h1}}{\partial net_{h1}}$ and $\dfrac{\partial net_{h1}}{\partial w}$ for each weight
• Compute the partial derivative wrt a weight
BP through hidden layer
• Putting it together:
$\dfrac{\partial E_{total}}{\partial w_1} = \dfrac{\partial E_{total}}{\partial out_{h1}} \cdot \dfrac{\partial out_{h1}}{\partial net_{h1}} \cdot \dfrac{\partial net_{h1}}{\partial w_1}$
• We can now update $w_1$
BP through hidden layer
• Compute the partial derivatives in the same way for $w_2$, $w_3$ and $w_4$
• Update $w_2$, $w_3$ and $w_4$
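Again continuing the same sketch (it runs when appended to the snippets above), the gradient for the hidden-layer weight $w_1$ combines the contributions of both outputs; the final value ≈ 0.149780716 matches the referenced walkthrough:

```python
# Deltas of the two output units (original weights are used here,
# since all updates are applied only after the full backwards pass)
delta_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)  # ~0.138498562
delta_o2 = (out_o2 - t2) * out_o2 * (1 - out_o2)  # ~-0.038098236

# out_h1 feeds both outputs, so both deltas contribute
dE_dout_h1 = delta_o1 * w5 + delta_o2 * w7        # ~0.036350306
dout_dnet_h1 = out_h1 * (1 - out_h1)              # ~0.241300709
dnet_dw1 = i1                                     # net_h1 = w1*i1 + w2*i2 + b1

dE_dw1 = dE_dout_h1 * dout_dnet_h1 * dnet_dw1     # ~0.000438568
w1_new = w1 - alpha * dE_dw1                      # ~0.149780716
print(w1_new)
```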
After first update with backpropagation
Did the error decrease?
• Old error: 0.298371109
• Improvement: 0.007343335
• After 10,000 updates the error will be approximately 0.000035085
• The generated outputs will be 0.015912196 for the 0.01 target and 0.984065734 for the 0.99 target
In conclusion
• Neural networks consist of artificial neurons organized into layers and connected to each other with learnable weights.
• Backpropagation with gradient descent is the standard method for training neural networks.
• Backpropagation can be used to compute the gradients of a neural network, regardless of the depth of the network.
• Of course, there are other important tricks and tips, but this is the basis for understanding neural networks and deep learning.
Common neural network architectures
Feed-forward network
• The simplest type of neural network
• Connections between units do not form cycles
• Information always moves in one direction; it never goes backwards
https://upload.wikimedia.org/wikipedia/en/5/54/Feed_forward_neural_net.gif
Recurrent neural network
• Connections between units form cycles
• They possess internal memory: they “remember” past inputs
• Suitable for modeling sequential/temporal data, for instance text and language data
Convolutional neural networks
• Convolutional layers have neurons arranged in 3 dimensions
• Especially suitable for processing image data
http://parse.ele.tue.nl/education/cluster2
Autoencoders
• The output layer attempts to reconstruct the input
• Used for unsupervised feature learning
• The hidden layer typically has fewer neurons than the input, thus performing data compression
Getting started with neural networks
Courses and tutorials
• https://www.coursera.org/learn/machine-learning
  • Introductory course on machine learning; provides the necessary background
• https://www.coursera.org/learn/neural-networks
  • Course on neural networks; assumes knowledge of machine learning
• http://ufldl.stanford.edu/tutorial/
  • Tutorial on deep learning that also covers some simpler machine learning
• http://cs231n.stanford.edu/
  • Course on convolutional neural networks
• https://www.udacity.com/course/deep-learning--ud730
  • Course on deep learning
• There are many others … just google …
Books
• http://www.deeplearningbook.org/
• Deep Learning: A Practitioner’s Approach – not released yet
• Fundamentals of Deep Learning – not released yet
• See more from: • http://machinelearningmastery.com/deep-learning-books/
Low-level libraries
• Theano – http://deeplearning.net/software/theano/
• TensorFlow – https://www.tensorflow.org/get_started/
  • Python-based
  • Automatic differentiation
  • Can use CUDA for computing on a GPU
• Torch – http://torch.ch/
  • Based on Lua
  • Modular pieces that are easy to combine
  • Lots of pretrained models
• See more: https://deeplearning4j.org/compare-dl4j-torch7-pylearn
Higher-level libraries
• Keras – https://keras.io/
  • Runs on top of Theano and TensorFlow
  • Based on Python
  • Modular
  • Supports both convolutional and recurrent networks
  • Supports arbitrary connectivity
  • Runs on both CPU and GPU
Keras – example code
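The code screenshot on the original slide did not survive conversion to text. As a stand-in, here is a minimal sketch in the style of the Keras 1.x examples of that period; the model shape, random placeholder data, and all hyperparameters are illustrative assumptions, not the slide's original code:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random placeholder data standing in for e.g. MNIST (784 features, 10 classes)
X_train = np.random.rand(100, 784)
y_train = np.eye(10)[np.random.randint(0, 10, 100)]  # one-hot labels

# A small feed-forward classifier; layer sizes are illustrative
model = Sequential()
model.add(Dense(64, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(X_train, y_train, nb_epoch=10, batch_size=32)
```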
What else?
• Take the Machine Learning course in the spring semester
• Use neural networks for your thesis work
• Potential supervisors in UT:
  • Kairit Sirts (problems involving natural language)
  • Mark Fishel (machine translation)
  • Raul Vicente (computational neuroscience)
  • Ilya Kuzovkin (computational neuroscience)
• Potential supervisors in TUT:
  • Juhan Ernits
  • Tanel Alumäe (speech data)
  • There are possibly others
In conclusion – deep learning
• Can be used to solve very complex problems
• Is based on artificial neural networks with many hidden layers
• Each artificial neuron is a simple computational unit
• Neural networks are trained with the gradient descent algorithm
• The backpropagation algorithm is used to compute the gradients with respect to the tunable parameters
• There are many tutorials and online courses about deep learning
• There are various software libraries that make it relatively easy to get started with deep learning