COS 511: Theoretical Machine Learning

Lecturer: Rob Schapire                                  Lecture #14
Scribe: Athindran Ramesh Kumar                          March 27, 2019

1 Recap: SVM for linearly separable data

In the previous lecture, we developed a method known as the support vector machine for obtaining the maximum margin separating hyperplane for data that is linearly separable, i.e., there exists at least one hyperplane that perfectly separates the positively and negatively labeled points. The inspiration for seeking maximum margin classifiers arose from our observation in boosting that increasing the number of rounds of boosting increased the margin of our weighted majority classifier and hence decreased the test error. Further, we note that:

  • VC-dim(linear threshold functions with margin δ) ≤ 1/δ² if all training and test examples have ‖x‖ ≤ 1
  • VC-dim(linear threshold functions with margin δ) ≤ (R/δ)² if all training and test examples have ‖x‖ ≤ R

In other words, the VC-dimension of the set of linear threshold functions with margin δ is bounded above by a quantity independent of the dimension of the space when the training and test examples have length at most R. The VC-dimension decreases as the margin increases. This gives us further motivation to explicitly find maximum margin classifiers.

[Figure: linearly separable points with the maximum margin separating hyperplane; the margin δ is marked on both sides of the hyperplane.]

The above figure shows a linearly separable set of points along with a maximum margin separating hyperplane with margin δ. We are given (x_1, y_1), ..., (x_m, y_m), where x_i ∈ R^n and y_i ∈ {−1, +1}. The primal problem for finding the maximum margin hyperplane can be formulated as:

$$\max\ \delta \qquad \text{s.t.}\quad \forall i:\ y_i(v \cdot x_i) \ge \delta, \qquad \|v\|_2 = 1$$

By introducing a new variable w = v/δ, we transformed the problem into:

$$\min\ \frac{\|w\|_2^2}{2} \qquad \text{s.t.}\quad \forall i:\ y_i(w \cdot x_i) \ge 1$$
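To see why the two formulations are equivalent, note that ‖v‖₂ = 1 and w = v/δ give

$$\|w\|_2 = \frac{\|v\|_2}{\delta} = \frac{1}{\delta}, \qquad y_i(v \cdot x_i) \ge \delta \iff y_i(w \cdot x_i) \ge 1,$$

so maximizing the margin δ is the same as minimizing ‖w‖₂, or equivalently minimizing ‖w‖₂²/2.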

w can be obtained from the training examples (specifically, the support vectors among the training examples) as

$$w = \sum_i \alpha_i y_i x_i.$$

The α_i's are obtained by solving the dual problem formulated below:

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad \text{s.t.}\quad \forall i:\ \alpha_i \ge 0$$

The dual problem can be solved by any non-linear programming method such as gradient descent. For predicting the label of a new example x, we use:

$$h(x) = \operatorname{sign}(w \cdot x) = \operatorname{sign}\Big(\sum_i \alpha_i y_i (x_i \cdot x)\Big)$$
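As a concrete illustration of solving this dual with a simple first-order method, here is a minimal sketch (assuming numpy, linearly separable data, and a hyperplane through the origin; the step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def svm_dual_gradient_ascent(X, y, steps=5000, lr=1e-3):
    """Sketch: maximize sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
    subject to alpha_i >= 0, by projected gradient ascent."""
    Z = y[:, None] * X                                  # row i is y_i x_i
    G = Z @ Z.T                                         # G[i, j] = y_i y_j (x_i . x_j)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - G @ alpha                          # gradient of the dual objective
        alpha = np.maximum(alpha + lr * grad, 0.0)      # project back onto alpha_i >= 0
    w = Z.T @ alpha                                     # w = sum_i alpha_i y_i x_i
    return alpha, w

def predict(w, X_new):
    return np.sign(X_new @ w)                           # h(x) = sign(w . x)
```

Each step moves α in the direction of the dual gradient and clips negative entries back to zero to maintain the constraint α_i ≥ 0.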

Observations

  • A crucial observation is that only inner products, between each pair of training examples and between each training example and each test example, are required for predicting the labels of the test examples. This observation will be used later in the class.
  • Another observation is that w is a linear combination of only the support vectors (the training examples with α_i > 0). Hence, the maximum margin separating hyperplane depends only on the support vectors.

The SVM has had a big impact on machine learning. In the SVM, we are able to formulate exactly what we want to optimize and solve the optimization problem fairly easily.

2 SVM for data that is not linearly separable

There are two strategies for dealing with data that is not linearly separable; in practice, the two are often combined.

2.1 Soft-margin SVM

This strategy can be used when the data is almost linearly separable, i.e., when only a few data points violate the separability requirement.
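For reference, a standard way to write the soft-margin primal (notation may differ slightly from the lecture's) introduces a slack variable ξ_i ≥ 0 for each example, allowing constraint violations that are penalized through a trade-off parameter C:

$$\min_{w,\,\xi}\ \frac{\|w\|_2^2}{2} + C \sum_{i=1}^{m} \xi_i \qquad \text{s.t.}\quad \forall i:\ y_i(w \cdot x_i) \ge 1 - \xi_i,\quad \xi_i \ge 0.$$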

Now, the equation for the separating hyperplane in 6 dimensions is given by:

$$v = (a, b, c, d, e, f), \qquad v \cdot \psi(x) = 0 \;\Longleftrightarrow\; a + b x_1 + c x_2 + d x_1 x_2 + e x_1^2 + f x_2^2 = 0$$

In the mapped space, R^6, this is the equation of a hyperplane. In the original space, R^2, this is the equation of an arbitrary conic section, which can separate the set of points in two dimensions.

Thus, after mapping to a higher-dimensional space, we have made the data linearly separable, and we can run the basic SVM in that space. This idea generalizes: if we start with n dimensions, we can form all monomial terms up to degree k, giving rise to a space of dimensionality O(n^k). Is this a good idea? Let us discuss further.
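To make the expansion concrete, here is a small sketch (plain Python; the helper name is hypothetical) that explicitly forms all monomials up to degree k. For n = 2 and k = 2 it produces the six coordinates of the R^6 map used in this section:

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_features(x, k):
    """List all monomials of the coordinates of x up to degree k
    (degree 0 gives the constant term 1); there are O(n^k) of them."""
    feats = []
    for degree in range(k + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))
    return feats

print(monomial_features([3, 5], 2))   # [1, 3, 5, 9, 15, 25], i.e. 1, x1, x2, x1^2, x1*x2, x2^2
```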

2.3 Problems with this idea

Problem 1

As the dimension of the space we map to increases, it would seem that we will need substantially more data, especially if mapping to a very high dimensional space. However, note that we said that the VC-dimension of linear threshold functions with margin δ is at most (R/δ)² if ‖x‖ ≤ R. This indicates that the VC-dimension does not scale with the number of dimensions and depends only on the margin. However, R could still increase as we increase the number of dimensions. Still, at least the VC-dimension does not explicitly depend on the dimension of the space we are mapping to, so the amount of training data needed will not necessarily increase.

Problem 2

The main problem with this idea is the computational burden incurred by operating in this high-dimensional space. We discussed earlier that we would be operating in a space of dimensionality O(n^k). If n = 100 and k = 6, that is on the order of 100^6 = 10^12, a trillion dimensions. Working directly in such a space would require an enormous amount of time and memory. This problem is overcome by using the kernel trick.

2.4 Kernel trick

Note that the mapping ψ from R^2 to R^6 is given by:

$$x = (x_1, x_2) \mapsto (1,\ x_1,\ x_2,\ x_1 x_2,\ x_1^2,\ x_2^2) = \psi(x)$$

We tweak the mapping by multiplying some of the coordinates by benign constants (factors of √2) for convenience:

$$x = (x_1, x_2) \mapsto (1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2) = \psi(x)$$
$$z = (z_1, z_2) \mapsto (1,\ \sqrt{2}\,z_1,\ \sqrt{2}\,z_2,\ \sqrt{2}\,z_1 z_2,\ z_1^2,\ z_2^2) = \psi(z)$$

An observation we made regarding SVMs is that only inner products between examples are used in the algorithm. Let us compute the inner product ψ(x) · ψ(z).

$$\psi(x) \cdot \psi(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + 2 (x_1 z_1)(x_2 z_2) + (x_1 z_1)^2 + (x_2 z_2)^2 = (1 + x_1 z_1 + x_2 z_2)^2 = (1 + x \cdot z)^2$$
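A quick numerical sanity check of this identity (a sketch assuming numpy; the two points are arbitrary):

```python
import numpy as np

def psi(u):
    """The tweaked degree-2 feature map above, for a 2-dimensional input."""
    u1, u2 = u
    return np.array([1.0, np.sqrt(2) * u1, np.sqrt(2) * u2,
                     np.sqrt(2) * u1 * u2, u1 ** 2, u2 ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(psi(x) @ psi(z))      # inner product computed explicitly in R^6
print((1 + x @ z) ** 2)     # the same value computed directly in R^2
```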

More generally, when we start in R^n and add all monomial terms up to degree k (again with suitable constants), we can write the inner product as:

$$\psi(x) \cdot \psi(z) = (1 + x \cdot z)^k$$

What the above equation implies is that we do not need to blow up to higher dimensions to compute the inner product in the high-dimensional space; we can compute it in the low-dimensional space itself. This observation is known as the kernel trick. In other words, instead of computing the mappings ψ(x) and ψ(z) and taking their inner product in the higher-dimensional space, we can compute the inner product ψ(x) · ψ(z) through an operation on x and z directly. In fact, we can define, design, and use the kernel ψ(x) · ψ(z) directly without worrying about the higher-dimensional space; the map can even be infinite dimensional.

In general, a kernel is defined as a real-valued function K(x, z) satisfying certain properties known as the Mercer conditions (namely, for any set of points x_i, the matrix M with M_{i,j} = K(x_i, x_j) is symmetric and positive semidefinite). Some common kernels include:

$$K(x, z) = (1 + x \cdot z)^k \quad \text{(polynomial kernel)}$$
$$K(x, z) = \exp\!\left(-c\,\|x - z\|_2^2\right) \quad \text{(radial basis kernel)}$$

The dual problem of the SVM can now be reformulated as:

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad \text{s.t.}\quad \forall i:\ \alpha_i \ge 0$$

For predicting the label of a new example x, we use:

$$h(x) = \operatorname{sign}\Big(\sum_i \alpha_i y_i K(x, x_i)\Big)$$
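Putting the pieces together, here is a minimal kernelized version of the earlier dual sketch (again assuming numpy; the kernel parameters, step size, and iteration count are illustrative):

```python
import numpy as np

def polynomial_kernel(x, z, k=2):
    return (1.0 + x @ z) ** k                      # K(x, z) = (1 + x . z)^k

def rbf_kernel(x, z, c=1.0):
    return np.exp(-c * np.sum((x - z) ** 2))       # K(x, z) = exp(-c ||x - z||^2)

def kernel_svm_dual(X, y, kernel, steps=5000, lr=1e-3):
    """Same projected gradient ascent as before, with every inner
    product x_i . x_j replaced by K(x_i, x_j)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    G = np.outer(y, y) * K
    alpha = np.zeros(len(y))
    for _ in range(steps):
        alpha = np.maximum(alpha + lr * (1.0 - G @ alpha), 0.0)
    return alpha

def kernel_predict(alpha, X, y, kernel, x_new):
    # h(x) = sign( sum_i alpha_i y_i K(x, x_i) )
    return np.sign(sum(a * yi * kernel(x_new, xi) for a, yi, xi in zip(alpha, y, X)))
```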

The design of kernels is more of an art, and domain knowledge comes into play. Note that if we use the polynomial kernel, we have to choose the parameter k. As k increases, both the margin δ and the term R in the VC-dimension increase, so we can expect the performance to improve up to some value of k and to degrade thereafter. This is reminiscent of the phenomenon of overfitting.

In the setting considered so far, the learning algorithm receives a batch of random training examples and spits out a hypothesis that generalizes and performs well on the test data, with some guarantees. Now, we are going to shift to the paradigm of online learning. In online learning, the algorithm gets one example at a time, makes a prediction, and then finds out whether the prediction was correct. Thus, training and testing happen at the same time, and the algorithm is judged by the number of mistakes it makes online. Further, we do away with the assumption that the data is random; in fact, the data can even be adversarial. One might think that dropping these assumptions makes the problem incredibly hard. However, we will obtain simple algorithms, with simple analysis techniques, for online learning, and these results are obtained without giving up on the performance or computational requirements of the algorithms.

4.1 Prediction from expert advice

One fundamental online learning problem is prediction from expert advice. The learner has to make a prediction at each time step, and to do so it gets suggestions from N experts. After making a prediction, the learner observes whether the prediction was correct. For example, each morning we might want to predict whether the stock market will go up or down that day, based on the predictions of experts; in the evening, we find out whether our prediction was correct. The goal is to make as few mistakes as possible. The setup for predicting whether the stock market will go up or down based on the advice of 4 experts is shown in the table below.

                        Experts              Learner
                    1     2     3     4      (master)   Outcome
Day 1               ↑     ↑     ↓     ↑         ↑          ↑
Day 2               ↓     ↑     ↑     ↓         ↓          ↑
...
No. of mistakes    37    12    67    50        18

The problem of prediction from expert advice can be summarized as follows:

N = # experts
for t = 1, ..., T:

  • Each expert i predicts ξ_i ∈ {0, 1}.
  • The learner predicts ŷ ∈ {0, 1}.
  • The outcome y ∈ {0, 1} is observed.
  • The learner makes a mistake if ŷ ≠ y.

At the very least in this setting, we should hope that the learner does not make too many more mistakes than the best expert. The experts could be humans or simple prediction rules or learning algorithms.

4.2 Halving algorithm and bounds

Assume that there exists an expert who makes no mistakes. In this setting, we can use the halving algorithm and bound the number of mistakes made by the algorithm. In the halving algorithm, we no longer listen to an expert once it makes a mistake. We make a prediction at each time as the majority vote of the surviving experts (that is, experts who have made no mistakes so far).

ŷ = majority vote of the predictions of the surviving experts
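A minimal sketch of the halving algorithm in plain Python (the data representation is an assumption: expert_predictions[t][i] is expert i's 0/1 prediction in round t, outcomes[t] is the true label, and ties in the vote are broken in favor of 1):

```python
def halving_algorithm(expert_predictions, outcomes):
    """Predict the majority vote of the surviving experts and permanently
    drop every expert that errs. Returns the learner's mistake count."""
    surviving = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        votes_for_one = sum(preds[i] for i in surviving)
        y_hat = 1 if 2 * votes_for_one >= len(surviving) else 0   # majority vote
        if y_hat != y:
            mistakes += 1
        surviving = {i for i in surviving if preds[i] == y}       # drop erring experts
    return mistakes
```

Under the assumption above that some expert never errs, the bound derived next guarantees this returns at most lg(N) mistakes.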

Let W be the number of surviving experts. We know that W ≥ 1, as there is one expert who will never make any mistake. Initially W = N. After 1 mistake, W ≤ N/2. This is because the learner makes its decision based on a majority vote, so if the learner makes a mistake, then at least half of the surviving experts have made a mistake and will drop out. Similarly, after 2 mistakes, W ≤ N/4, and after m mistakes, W ≤ N/2^m. Therefore,

$$1 \le W \le \frac{N}{2^m}$$

This implies that:

$$m \le \lg(N)$$

Thus, without making any statistical assumptions on the data, we were able to get an upper bound on the number of mistakes made by the algorithm. However, we made the strong assumption that there exists at least one expert who makes no mistake.