COS 511: Theoretical Machine Learning

Lecturer: Rob Schapire                                  Lecture #14
Scribe: Athindran Ramesh Kumar                          March 27, 2019

1 Recap: SVM for linearly separable data

In the previous lecture, we developed a method known as the support vector machine for obtaining the maximum margin separating hyperplane for data that is linearly separable, i.e., there exists at least one hyperplane that perfectly separates the positively and negatively labeled points. The inspiration for seeking maximum margin classifiers arose from our observation in boosting that increasing the number of rounds of boosting increased the margin of our weighted majority classifier and hence decreased the test error. Further, we note that:

  • VC-dim(linear threshold functions with margin δ) ≤ 1/δ² if all training and test examples have ‖x‖ ≤ 1
  • VC-dim(linear threshold functions with margin δ) ≤ (R/δ)² if all training and test examples have ‖x‖ ≤ R

In other words, the VC-dimension of the set of linear threshold functions with margin δ is bounded above by a quantity independent of the dimension of the space when the training and test examples have length at most R. The VC-dimension decreases as the margin increases. This gives us further motivation to explicitly find maximum margin classifiers.

[Figure: linearly separable points with the maximum margin separating hyperplane; the margin δ is marked on both sides of the hyperplane.]

The above figure shows a linearly separable set of points along with a maximum margin separating hyperplane with margin δ. We are given (x_1, y_1), ..., (x_m, y_m), where x_i ∈ R^n and y_i ∈ {−1, +1}. The primal problem for finding the maximum margin hyperplane can be formulated as:

$$\max\ \delta \qquad \text{s.t.}\quad \forall i:\ y_i(v \cdot x_i) \ge \delta, \qquad \|v\|_2 = 1$$

By introducing a new variable w = v/δ, we transformed the problem into:

$$\min\ \frac{\|w\|_2^2}{2} \qquad \text{s.t.}\quad \forall i:\ y_i(w \cdot x_i) \ge 1$$
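To see why the two formulations are equivalent, note that ‖v‖₂ = 1 and w = v/δ give

$$\|w\|_2 = \frac{\|v\|_2}{\delta} = \frac{1}{\delta}, \qquad y_i(v \cdot x_i) \ge \delta \iff y_i(w \cdot x_i) \ge 1,$$

so maximizing the margin δ is the same as minimizing ‖w‖₂, or equivalently minimizing ‖w‖₂²/2.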

w can be obtained from the training examples (specifically, the support vectors among the training examples) as

$$w = \sum_i \alpha_i y_i x_i.$$

The α_i's are obtained by solving the dual problem formulated below:

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad \text{s.t.}\quad \forall i:\ \alpha_i \ge 0$$

The dual problem can be solved by any non-linear programming method such as gradient descent. For predicting the label of a new example x, we use:

$$h(x) = \operatorname{sign}(w \cdot x) = \operatorname{sign}\Big(\sum_i \alpha_i y_i (x_i \cdot x)\Big)$$
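As a concrete illustration of solving this dual with a simple first-order method, here is a minimal sketch (assuming numpy, linearly separable data, and a hyperplane through the origin; the step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def svm_dual_gradient_ascent(X, y, steps=5000, lr=1e-3):
    """Sketch: maximize sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
    subject to alpha_i >= 0, by projected gradient ascent."""
    Z = y[:, None] * X                                  # row i is y_i x_i
    G = Z @ Z.T                                         # G[i, j] = y_i y_j (x_i . x_j)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - G @ alpha                          # gradient of the dual objective
        alpha = np.maximum(alpha + lr * grad, 0.0)      # project back onto alpha_i >= 0
    w = Z.T @ alpha                                     # w = sum_i alpha_i y_i x_i
    return alpha, w

def predict(w, X_new):
    return np.sign(X_new @ w)                           # h(x) = sign(w . x)
```

Each step moves α in the direction of the dual gradient and clips negative entries back to zero to maintain the constraint α_i ≥ 0.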

Observations

  • A crucial observation is that only inner products, between each pair of training examples and between each training example and each test example, are required for predicting the labels of the test examples. This observation will be used later in the class.
  • Another observation is that w is a linear combination of only the support vectors (the training examples with α_i > 0). Hence, the maximum margin separating hyperplane depends only on the support vectors.

The SVM has had a big impact on machine learning. In the SVM, we are able to formulate exactly what we want to optimize and solve the optimization problem fairly easily.

2 SVM for data that is not linearly separable

There are two strategies for dealing with data that is not linearly separable; in practice, the two are often combined.

2.1 Soft-margin SVM

This strategy can be used when the data is almost linearly separable, i.e., when only a few data points violate the separability requirement.
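For reference, a standard way to write the soft-margin primal (notation may differ slightly from the lecture's) introduces a slack variable ξ_i ≥ 0 for each example, allowing constraint violations that are penalized through a trade-off parameter C:

$$\min_{w,\,\xi}\ \frac{\|w\|_2^2}{2} + C \sum_{i=1}^{m} \xi_i \qquad \text{s.t.}\quad \forall i:\ y_i(w \cdot x_i) \ge 1 - \xi_i,\quad \xi_i \ge 0.$$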

Now, the equation for the separating hyperplane in 6 dimensions is given by:

$$v = (a, b, c, d, e, f), \qquad v \cdot \psi(x) = 0 \;\Longleftrightarrow\; a + b x_1 + c x_2 + d x_1 x_2 + e x_1^2 + f x_2^2 = 0$$

In the mapped space, R^6, this is the equation of a hyperplane. In the original space, R^2, this is the equation of an arbitrary conic section, which can separate the set of points in two dimensions.

Thus, after mapping to a higher-dimensional space, we have made the data linearly separable, and we can run the basic SVM in that space. This idea generalizes: if we start with n dimensions, we can form all monomial terms up to degree k, giving rise to a space of dimensionality O(n^k). Is this a good idea? Let us discuss further.
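To make the expansion concrete, here is a small sketch (plain Python; the helper name is hypothetical) that explicitly forms all monomials up to degree k. For n = 2 and k = 2 it produces the six coordinates of the R^6 map used in this section:

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_features(x, k):
    """List all monomials of the coordinates of x up to degree k
    (degree 0 gives the constant term 1); there are O(n^k) of them."""
    feats = []
    for degree in range(k + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))
    return feats

print(monomial_features([3, 5], 2))   # [1, 3, 5, 9, 15, 25], i.e. 1, x1, x2, x1^2, x1*x2, x2^2
```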

2.3 Problems with this idea

Problem 1

As the dimension of the space we map to increases, it would seem that we will need substantially more data, especially if mapping to a very high dimensional space. However, note that we said that the VC-dimension of linear threshold functions with margin δ is at most (R/δ)² if ‖x‖ ≤ R. This indicates that the VC-dimension does not scale with the number of dimensions and depends only on the margin. However, R could still increase as we increase the number of dimensions. Still, at least the VC-dimension does not explicitly depend on the dimension of the space we are mapping to, so the amount of training data needed will not necessarily increase.

Problem 2

The main problem with this idea is the computational burden incurred by operating in this high-dimensional space. We discussed earlier that we would be operating in a space of dimensionality O(n^k). If n = 100 and k = 6, that is on the order of 100^6 = 10^12, a trillion dimensions. Working directly in such a space would require an enormous amount of time and memory. This problem is overcome by using the kernel trick.

2.4 Kernel trick

Note that the mapping ψ from R^2 to R^6 is given by:

$$x = (x_1, x_2) \mapsto (1,\ x_1,\ x_2,\ x_1 x_2,\ x_1^2,\ x_2^2) = \psi(x)$$

We tweak the mapping by multiplying some of the coordinates by benign constants (factors of √2) for convenience:

$$x = (x_1, x_2) \mapsto (1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2) = \psi(x)$$
$$z = (z_1, z_2) \mapsto (1,\ \sqrt{2}\,z_1,\ \sqrt{2}\,z_2,\ \sqrt{2}\,z_1 z_2,\ z_1^2,\ z_2^2) = \psi(z)$$

An observation we made regarding SVMs is that only inner products between examples are used in the algorithm. Let us compute the inner product ψ(x) · ψ(z).

$$\psi(x) \cdot \psi(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + 2 (x_1 z_1)(x_2 z_2) + (x_1 z_1)^2 + (x_2 z_2)^2 = (1 + x_1 z_1 + x_2 z_2)^2 = (1 + x \cdot z)^2$$
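A quick numerical sanity check of this identity (a sketch assuming numpy; the two points are arbitrary):

```python
import numpy as np

def psi(u):
    """The tweaked degree-2 feature map above, for a 2-dimensional input."""
    u1, u2 = u
    return np.array([1.0, np.sqrt(2) * u1, np.sqrt(2) * u2,
                     np.sqrt(2) * u1 * u2, u1 ** 2, u2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(psi(x) @ psi(z))      # inner product computed explicitly in R^6
print((1 + x @ z) ** 2)     # the same value computed directly in R^2
```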

More generally, when we start in R^n and add all monomial terms up to degree k (again with suitable constants), we can write the inner product as:

$$\psi(x) \cdot \psi(z) = (1 + x \cdot z)^k$$

What the above equation implies is that we do not need to blow up to higher dimensions to compute the inner product in the high-dimensional space; we can compute it in the low-dimensional space itself. This observation is known as the kernel trick. In other words, instead of computing the mappings ψ(x) and ψ(z) and taking their inner product in the higher-dimensional space, we can compute the inner product ψ(x) · ψ(z) through an operation on x and z directly. In fact, we can define, design, and use the kernel ψ(x) · ψ(z) directly without worrying about the higher-dimensional space; the map can even be infinite dimensional.

In general, a kernel is defined as a real-valued function K(x, z) satisfying certain properties known as the Mercer conditions (namely, for any set of points x_i, the matrix M with M_{i,j} = K(x_i, x_j) is symmetric and positive semidefinite). Some common kernels include:

$$K(x, z) = (1 + x \cdot z)^k \quad \text{(polynomial kernel)}$$
$$K(x, z) = \exp\!\left(-c\,\|x - z\|_2^2\right) \quad \text{(radial basis kernel)}$$

The dual problem of the SVM can now be reformulated as:

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad \text{s.t.}\quad \forall i:\ \alpha_i \ge 0$$

For predicting the label of a new example x, we use:

$$h(x) = \operatorname{sign}\Big(\sum_i \alpha_i y_i K(x, x_i)\Big)$$
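Putting the pieces together, here is a minimal kernelized version of the earlier dual sketch (again assuming numpy; the kernel parameters, step size, and iteration count are illustrative):

```python
import numpy as np

def polynomial_kernel(x, z, k=2):
    return (1.0 + x @ z) ** k                      # K(x, z) = (1 + x . z)^k

def rbf_kernel(x, z, c=1.0):
    return np.exp(-c * np.sum((x - z) ** 2))       # K(x, z) = exp(-c ||x - z||^2)

def kernel_svm_dual(X, y, kernel, steps=5000, lr=1e-3):
    """Same projected gradient ascent as before, with every inner
    product x_i . x_j replaced by K(x_i, x_j)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    G = np.outer(y, y) * K
    alpha = np.zeros(len(y))
    for _ in range(steps):
        alpha = np.maximum(alpha + lr * (1.0 - G @ alpha), 0.0)
    return alpha

def kernel_predict(alpha, X, y, kernel, x_new):
    # h(x) = sign( sum_i alpha_i y_i K(x, x_i) )
    return np.sign(sum(a * yi * kernel(x_new, xi) for a, yi, xi in zip(alpha, y, X)))
```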

The design of kernels is more of an art, and domain knowledge comes into play. Note that if we use the polynomial kernel, we have to choose the parameter k. As k increases, both the margin δ and the term R in the VC-dimension increase, so we can expect the performance to improve up to some value of k and to degrade thereafter. This is reminiscent of the phenomenon of overfitting.

In the setting considered so far, the learning algorithm receives a batch of random training examples and spits out a hypothesis that generalizes and performs well on the test data, with some guarantees. Now, we are going to shift to the paradigm of online learning. In online learning, the algorithm gets one example at a time, makes a prediction, and then finds out whether the prediction was correct. Thus, training and testing happen at the same time, and the algorithm is judged by the number of mistakes it makes online. Further, we do away with the assumption that the data is random; in fact, the data can even be adversarial. One might think that dropping these assumptions makes the problem incredibly hard. However, we will obtain simple algorithms, with simple analysis techniques, for online learning, and these results are obtained without giving up on the performance or computational requirements of the algorithms.

4.1 Prediction from expert advice

One fundamental online learning problem is prediction from expert advice. The learner has to make a prediction at each time step, and to do so it gets suggestions from N experts. After making a prediction, the learner observes whether the prediction was correct. For example, each morning we might want to predict whether the stock market will go up or down that day, based on the predictions of experts; in the evening, we find out whether our prediction was correct. The goal is to make as few mistakes as possible. The setup for predicting whether the stock market will go up or down based on the advice of 4 experts is shown in the table below.

                        Experts              Learner
                    1     2     3     4      (master)   Outcome
Day 1               ↑     ↑     ↓     ↑         ↑          ↑
Day 2               ↓     ↑     ↑     ↓         ↓          ↑
...
No. of mistakes    37    12    67    50        18

The problem of prediction from expert advice can be summarized as follows:

N = # experts
for t = 1, ..., T:

  • Each expert i predicts ξ_i ∈ {0, 1}.
  • The learner predicts ŷ ∈ {0, 1}.
  • The outcome y ∈ {0, 1} is observed.
  • The learner makes a mistake if ŷ ≠ y.

At the very least in this setting, we should hope that the learner does not make too many more mistakes than the best expert. The experts could be humans or simple prediction rules or learning algorithms.

4.2 Halving algorithm and bounds

Assume that there exists an expert who makes no mistakes. In this setting, we can use the halving algorithm and bound the number of mistakes made by the algorithm. In the halving algorithm, we no longer listen to an expert once it makes a mistake. We make a prediction at each time as the majority vote of the surviving experts (that is, experts who have made no mistakes so far).

ŷ = majority vote of the predictions of the surviving experts
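A minimal sketch of the halving algorithm in plain Python (the data representation is an assumption: expert_predictions[t][i] is expert i's 0/1 prediction in round t, outcomes[t] is the true label, and ties in the vote are broken in favor of 1):

```python
def halving_algorithm(expert_predictions, outcomes):
    """Predict the majority vote of the surviving experts and permanently
    drop every expert that errs. Returns the learner's mistake count."""
    surviving = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        votes_for_one = sum(preds[i] for i in surviving)
        y_hat = 1 if 2 * votes_for_one >= len(surviving) else 0   # majority vote
        if y_hat != y:
            mistakes += 1
        surviving = {i for i in surviving if preds[i] == y}       # drop erring experts
    return mistakes
```

Under the assumption above that some expert never errs, the bound derived next guarantees this returns at most lg(N) mistakes.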

Let W be the number of surviving experts. We know that W ≥ 1, as there is one expert who will never make any mistake. Initially W = N. After 1 mistake, W ≤ N/2. This is because the learner makes its decision based on a majority vote, so if the learner makes a mistake, then at least half of the surviving experts have made a mistake and will drop out. Similarly, after 2 mistakes, W ≤ N/4, and after m mistakes, W ≤ N/2^m. Therefore,

$$1 \le W \le \frac{N}{2^m}$$

This implies that:

$$m \le \lg(N)$$

Thus, without making any statistical assumptions on the data, we were able to get an upper bound on the number of mistakes made by the algorithm. However, we made the strong assumption that there exists at least one expert who makes no mistake.