
Neural networks and deep learning

Dmitry Kobak | Machine Learning I

Deep Learning by Goodfellow et al. defines ‘deep learning’ as algorithms enabling “the computer to learn complicated concepts by building them out of simpler ones”. This is implemented using neural networks that consist of hierarchically organized simple processing units called neurons. This field has had a series of huge successes starting around 2012: classifying images, generating images, playing Go, writing text, folding proteins, etc., known together as the ‘deep learning revolution’. This lecture is about feed-forward neural networks for image classification.

We have already spent one entire lecture talking about a neural network classifier!

A hidden layer

Logistic regression is a linear network that has an input layer and an output layer. We now add a hidden layer:

$$ L = -\sum_i \Big[\, y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big] $$

Linear: $h(x) = \dfrac{1}{1 + e^{-W^2 W^1 x}}$

Nonlinear: $h(x) = \dfrac{1}{1 + e^{-W^2 \phi(W^1 x)}}$

Activation function

We want to use some nonlinear activation function ϕ that is easy to work with. The most common choice:

$$ \phi(x) = \max(0, x). $$

Such neurons are called rectified linear units (ReLU).
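To make this concrete, here is a minimal NumPy sketch of the linear and nonlinear versions above (the layer sizes, initialization, and input are arbitrary placeholders, not taken from the slides):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)              # phi(z) = max(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hidden = 10, 32                    # hypothetical layer sizes
W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.1, size=(1, d_hidden))
x = rng.normal(size=d_in)                  # one hypothetical input vector

h_linear    = sigmoid(W2 @ (W1 @ x))       # no hidden nonlinearity: W2 W1 collapses to one matrix
h_nonlinear = sigmoid(W2 @ relu(W1 @ x))   # the hidden ReLU layer makes the model nonlinear
```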

Going deeper

What we had above was a shallow network. Here is a deeper one:

$$ L = -\sum_i \Big[\, y_i \log h(x_i) + (1 - y_i) \log\big(1 - h(x_i)\big) \Big] $$

$$ h(x) = \frac{1}{1 + e^{-W^4 \phi\left(W^3 \phi\left(W^2 \phi(W^1 x)\right)\right)}} $$
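The same pattern extends to any depth. A small sketch, reusing the helpers and variables from the NumPy example above and with arbitrary weight shapes:

```python
def deep_forward(x, Ws):
    """h(x) = sigmoid(W4 relu(W3 relu(W2 relu(W1 x)))) for a list of weight matrices Ws."""
    a = x
    for W in Ws[:-1]:
        a = relu(W @ a)            # hidden layers: linear map followed by ReLU
    return sigmoid(Ws[-1] @ a)     # final sigmoid turns the score into P(y = 1 | x)

# Four hypothetical weight matrices, matching the W4 phi(W3 phi(W2 phi(W1 x))) pattern.
shapes = [(32, d_in), (24, 32), (16, 24), (1, 16)]
Ws = [rng.normal(scale=0.1, size=s) for s in shapes]
h = deep_forward(x, Ws)
```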

Gradient

Logistic regression gradient (see Lecture 5):

$$ \nabla L = -\sum_i \big(y_i - h(x_i)\big)\, \nabla(\beta x_i) = -\sum_i \big(y_i - h(x_i)\big)\, x_i. $$

The gradient for the deep network:

$$ \nabla L = -\sum_i \big(y_i - h(x_i)\big)\, \nabla\Big[\, W^4 \phi\big( W^3 \phi\big( W^2 \phi(W^1 x_i) \big) \big) \Big]. $$

Let us write this term down with indices:

$$ z = \sum_a W^4_a\, \phi\!\left[ \sum_b W^3_{ab}\, \phi\!\left( \sum_c W^2_{bc}\, \phi\!\left( \sum_d W^1_{cd}\, x_{id} \right) \right) \right]. $$

Now we need to use the chain rule: if $h(x) = f(g(x))$, then $h'(x) = f'(g(x))\, g'(x)$.
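Applying the chain rule layer by layer is exactly what backpropagation automates. As an illustration only (not the slides' derivation), here is a hand-written gradient for the one-hidden-layer network from earlier, reusing the relu and sigmoid helpers and the W1, W2, x placeholders:

```python
def grads_one_hidden(x, y, W1, W2):
    """Gradients of the cross-entropy loss for h(x) = sigmoid(W2 relu(W1 x))."""
    u = W1 @ x                  # hidden pre-activations
    a = relu(u)                 # hidden activations
    h = sigmoid(W2 @ a)         # predicted P(y = 1 | x)
    dz = h - y                  # dL/dz for sigmoid + cross-entropy (z = W2 a)
    dW2 = np.outer(dz, a)       # chain rule: dL/dW2 = dz * a^T
    da = W2.T @ dz              # push the error back through W2
    du = da * (u > 0)           # ReLU derivative is the indicator u > 0
    dW1 = np.outer(du, x)       # dL/dW1 = du * x^T
    return dW1, dW2

dW1, dW2 = grads_one_hidden(x, y=1.0, W1=W1, W2=W2)   # gradients for a single sample
```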

Stochastic gradient descent

Using backpropagation, we can compute the gradient (partial derivatives with respect to each weight) and use gradient descent. Two notes:

  1. In practice, the gradient descent algorithm is often used with some modifications: momentum, adaptive learning rates (e.g. Adam), etc.
  2. Gradient descent requires summation over all training samples at each step. In practice, the training data are split into batches that are processed one by one: stochastic gradient descent (SGD). One sweep through the entire training dataset is called an epoch (see the sketch below).
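A minimal mini-batch SGD loop, sketched with PyTorch (the data, architecture, and hyperparameters here are made-up placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical data: 1000 samples with 784 features (e.g. flattened 28x28 images), 10 classes.
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# torch.optim.Adam(model.parameters(), lr=1e-3) would add adaptive learning rates.

batch_size = 64
for epoch in range(5):                          # one epoch = one sweep through the data
    perm = torch.randperm(len(X))               # reshuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]    # indices of one mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()                         # backpropagation fills in the gradients
        optimizer.step()                        # one stochastic gradient step
```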

Multiclass classification

If there are K classes, then the last softmax layer has K output neurons:

$$ P(y = k) = \frac{e^{z_k}}{\sum_j e^{z_j}}, $$

where $z_k$ are the pre-nonlinearity activations: $z = W^L \phi\big(W^{L-1} \phi(\dots)\big)$. The cross-entropy loss function can be written as

$$ L = -\sum_{i=1}^{n} \sum_{k=1}^{K} Y_{ik} \log P(y_i = k), $$

where $Y_{ik} = 1$ if $y_i = k$ and $0$ otherwise.
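A small NumPy sketch of the softmax output and the corresponding loss for one sample (the activation values are made up for illustration):

```python
def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()           # P(y = k) = exp(z_k) / sum_j exp(z_j)

def cross_entropy(z, k):
    """-log P(y = k) for pre-softmax activations z and true class k."""
    return -np.log(softmax(z)[k])

z = np.array([2.0, -1.0, 0.5])   # hypothetical pre-softmax activations, K = 3 classes
p = softmax(z)                   # class probabilities, summing to 1
loss = cross_entropy(z, k=0)     # loss when the true class is k = 0
```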

Convolutional neural networks

The input image has three input channels. A convolutional layer produces several feature maps.
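A quick PyTorch sketch of a convolutional layer mapping the three input channels to several feature maps (the number of output channels, the kernel size, and the image size are arbitrary choices):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
images = torch.randn(8, 3, 32, 32)    # a batch of 8 RGB images, 32x32 pixels
feature_maps = conv(images)           # shape (8, 16, 32, 32): 16 feature maps per image
```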

What do CNNs learn?

First layer: Krizhevsky et al. 2012.

Hidden layer: Olah et al. 2017.

Overfitting and regularization

Ridge ($L_2$) regularization: $\lambda \lVert W_l \rVert^2$ on each layer $l$. This is also called weight decay (see Lecture 4). Another method is called early stopping.

Remark: for linear regression, one can show that early stopping penalizes directions with smaller singular values more strongly, just as the ridge penalty does.
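A sketch of both ideas, assuming the model, loss_fn, X, and y from the SGD example above plus a held-out validation split:

```python
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

# weight_decay implements the ridge / L2 penalty on the weights ("weight decay").
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

best_val, best_state, patience, since_best = float("inf"), None, 10, 0
for epoch in range(200):
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()     # full-batch step to keep the sketch short
    optimizer.step()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()   # monitor the validation loss
    if val < best_val:
        best_val, since_best = val, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        since_best += 1
        if since_best >= patience:                  # early stopping: no improvement for a while
            break
model.load_state_dict(best_state)                   # keep the best model seen during training
```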

Overparametrization

Modern neural networks are typically used in the overparametrized regime, i.e. they can perfectly or near-perfectly overfit the training data. This can be shown by training them on randomly shuffled labels: they can still achieve zero training loss. At the same time, generalization performance can be high: ‘benign’ overfitting due to implicit regularization. Sometimes one sees overfitting in the test loss but not in the test accuracy.
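The randomized-label check can be sketched with the same hypothetical data and setup as above: shuffle the labels so they carry no information about the inputs, then verify that training can still drive the training loss close to zero:

```python
y_shuffled = y[torch.randperm(len(y))]      # labels now carry no information about X

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(5000):                    # enough steps for the network to memorize
    optimizer.zero_grad()
    loss = loss_fn(model(X), y_shuffled)
    loss.backward()
    optimizer.step()
print(loss.item())                          # an overparametrized network can fit even random labels
```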