













Neural networks and deep learning | Dmitry Kobak | Machine Learning I
Deep Learning by Goodfellow et al. defines ‘deep learning’ as algorithms enabling “the computer to learn complicated concepts by building them out of simpler ones”. This is implemented using neural networks that consist of hierarchically organized simple processing units, called neurons. The field has had a series of huge successes starting in ∼2012: classifying images, generating images, playing Go, writing text, folding proteins, etc., known together as the ‘deep learning revolution’. This lecture is about feed-forward neural networks for image classification.
We have already spent one entire lecture talking about a neural network classifier!
Logistic regression is a linear network that has an input layer and an output layer. We now add a hidden layer:

$$L = -\sum_i \Big[ y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big]$$

Linear: $h(x) = \dfrac{1}{1 + e^{-W^2 W^1 x}}$

Nonlinear: $h(x) = \dfrac{1}{1 + e^{-W^2 \phi(W^1 x)}}$
We want to use some nonlinear activation function ϕ that is easy to work with. The most common choice:
$\phi(x) = \max(0, x).$
Such neurons are called rectified linear units (ReLU).
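A minimal NumPy sketch of this two-layer network, assuming a 10-dimensional input, 32 hidden ReLU units, and random weights (all sizes are arbitrary choices for the example):

```python
import numpy as np

def relu(z):
    # phi(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hidden = 10, 32                      # assumed input and hidden-layer sizes
W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.1, size=(1, d_hidden))
x = rng.normal(size=d_in)

h_linear = sigmoid(W2 @ (W1 @ x))            # no activation: collapses to one linear map W2 W1
h_nonlinear = sigmoid(W2 @ relu(W1 @ x))     # hidden ReLU layer makes the model nonlinear
print(h_linear, h_nonlinear)
```

Without the activation $\phi$, the product $W^2 W^1$ is just another linear map, so the hidden layer adds nothing; the nonlinearity is what gives the network extra expressive power.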
What we had above was a shallow network. Here is a deeper one:
$$L = -\sum_i \Big[ y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big]$$

$$h(x) = \frac{1}{1 + e^{-W^4 \phi\big(W^3 \phi\big(W^2 \phi(W^1 x)\big)\big)}}$$
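A sketch of this deeper forward pass in NumPy, which is just repeated application of $\phi(W\,\cdot)$; the layer widths are chosen arbitrarily for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    # h(x) = sigmoid(W^4 phi(W^3 phi(W^2 phi(W^1 x)))) for a list of 4 weight matrices
    a = x
    for W in weights[:-1]:
        a = relu(W @ a)                  # hidden layers use the ReLU activation
    return sigmoid(weights[-1] @ a)      # output layer uses the sigmoid

rng = np.random.default_rng(0)
sizes = [10, 64, 32, 16, 1]              # assumed widths: input, three hidden layers, output
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])
print(forward(x, weights))
```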
Logistic regression gradient (see Lecture 5):
$$\nabla L = -\sum_i \big( y_i - h(x_i) \big)\, \nabla(\beta^\top x_i) = -\sum_i \big( y_i - h(x_i) \big)\, x_i.$$
The gradient for the deep network:
$$\nabla L = -\sum_i \big( y_i - h(x_i) \big)\, \nabla\Big[ W^4 \phi\Big( W^3 \phi\big( W^2 \phi(W^1 x_i)\big)\Big)\Big].$$
Let us write this term down with indices:
$$z = \sum_a W^4_{a}\, \phi\bigg[ \sum_b W^3_{ab}\, \phi\bigg( \sum_c W^2_{bc}\, \phi\Big( \sum_d W^1_{cd}\, x_{id} \Big)\bigg)\bigg].$$
Now we need to use the chain rule: if $h(x) = f(g(x))$, then $h'(x) = f'(g(x))\, g'(x)$.
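Applying the chain rule layer by layer is exactly what backpropagation does. A minimal NumPy sketch for the shallow two-layer network from above, with a finite-difference check of one gradient entry (the layer sizes and the check are assumptions for the example; deeper networks simply repeat the same backward step for each layer):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, W2):
    # Forward pass, keeping the intermediate values needed for the backward pass.
    z1 = W1 @ x                          # hidden pre-activations
    a1 = relu(z1)                        # hidden activations
    h = sigmoid(W2 @ a1)                 # predicted probability (length-1 array)
    loss = -(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Backward pass: repeated application of the chain rule.
    dz2 = h - y                          # dL/dz2 for sigmoid output + cross-entropy loss
    dW2 = np.outer(dz2, a1)              # dL/dW2
    da1 = W2.T @ dz2                     # dL/da1
    dz1 = da1 * (z1 > 0)                 # dL/dz1; the ReLU derivative is 0 or 1
    dW1 = np.outer(dz1, x)               # dL/dW1
    return loss.item(), dW1, dW2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(8, 5))
W2 = rng.normal(scale=0.1, size=(1, 8))
x, y = rng.normal(size=5), 1.0

loss, dW1, dW2 = forward_backward(x, y, W1, W2)

# Finite-difference check of a single entry of dW1.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
loss_p, _, _ = forward_backward(x, y, W1p, W2)
print(dW1[0, 0], (loss_p - loss) / eps)  # the two numbers should agree closely
```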
Using backpropagation, we can compute the gradient (partial derivatives with respect to each weight) and use gradient descent. Two notes:
If there are K classes, then the last softmax layer has K output neurons:
$$P(y = k) = \frac{e^{z_k}}{\sum_j e^{z_j}},$$

where $z_j$ are the pre-nonlinearity activations: $z = W^L \phi\big(W^{L-1} \phi(\ldots)\big)$. The cross-entropy loss function can be written as

$$L = -\sum_{i=1}^{n} \sum_{k=1}^{K} Y_{ik} \log P(y_i = k),$$

where $Y_{ik} = 1$ if $y_i = k$ and $0$ otherwise.
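A short NumPy sketch of the softmax layer and this loss, assuming a small batch and $K = 3$ classes (the sizes are arbitrary):

```python
import numpy as np

def softmax(Z):
    # Rows are examples, columns are classes; subtract the row max for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Z, y, K):
    # Z: (n, K) pre-softmax activations z; y: (n,) integer labels in 0..K-1
    P = softmax(Z)                        # P[i, k] = P(y_i = k)
    Y = np.eye(K)[y]                      # one-hot matrix: Y[i, k] = 1 iff y_i = k
    return -(Y * np.log(P)).sum()         # L = -sum_i sum_k Y_ik log P(y_i = k)

rng = np.random.default_rng(0)
n, K = 4, 3                               # assumed batch size and number of classes
Z = rng.normal(size=(n, K))
y = rng.integers(0, K, size=n)
print(cross_entropy(Z, y, K))
```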
[Figure: the input image has three input channels.]
[Figure: several feature maps.]
[Figure: first layer (Krizhevsky et al. 2012).]
[Figure: hidden layer (Olah et al. 2017).]
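Assuming these figures illustrate a convolutional layer (three input channels mapped to a number of feature maps), here is a small PyTorch sketch; the layer sizes are arbitrary and not those of the networks shown:

```python
import torch
import torch.nn as nn

# A batch of one RGB image: (batch, channels, height, width).
x = torch.randn(1, 3, 32, 32)

# A convolutional layer mapping the 3 input channels to 16 feature maps
# with 3x3 filters; each feature map has its own set of filters.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
feature_maps = torch.relu(conv(x))

print(feature_maps.shape)   # torch.Size([1, 16, 32, 32]): 16 feature maps of 32x32 each
```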
Ridge ($L_2$) regularization: a penalty $\lambda \lVert W^l \rVert^2$ on each layer $l$. This is also called weight decay (see Lecture 4). Another method is called early stopping.

Remark: for linear regression one can show that early stopping penalizes smaller singular values more strongly, just as the ridge penalty does.
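A sketch of both methods in a small PyTorch training loop; the toy data, model, and hyperparameters are placeholder assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy binary classification data (assumed, purely for illustration).
torch.manual_seed(0)
X = torch.randn(200, 20)
y = (X[:, 0] > 0).float().unsqueeze(1)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

# Ridge / L2 regularization: the `weight_decay` argument implements the L2 penalty
# (applied here to all parameters rather than per layer).
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-4)

# Early stopping: monitor the validation loss and stop once it no longer improves.
best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # validation loss has stopped improving
```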
Modern neural networks are typically used in the overparametrized regime, i.e. they can perfectly or near-perfectly fit the training data. This can be demonstrated by training them on randomly shuffled labels: they still achieve zero training loss. At the same time, generalization performance can be high: ‘benign’ overfitting due to implicit regularization. Sometimes one sees overfitting in the test loss but not in the test accuracy.
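A sketch of the shuffled-labels experiment on synthetic data (the data, architecture, and optimizer here are arbitrary assumptions; the point is only that the training loss can still be driven towards zero):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 32)
y_true = X[:, :10].argmax(dim=1)                    # labels that genuinely depend on the inputs
y_shuffled = y_true[torch.randperm(len(y_true))]    # shuffled labels: no relation to X anymore

# Overparametrized MLP: far more parameters than training points.
model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y_shuffled)
    loss.backward()
    optimizer.step()

print(loss.item())   # training loss on meaningless labels still approaches zero
```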