CS769 Spring 2010 Advanced Natural Language Processing
Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu
Many NLP problems can be formulated as classification, e.g., word sense disambiguation, spam filtering, sentiment analysis, document retrieval, speech recognition, etc. There are in general three ways to do classification: (i) generative models, which model the class-conditional $p(x|y)$ and the prior $p(y)$; (ii) discriminative probabilistic models, which model $p(y|x)$ directly, e.g., logistic regression; (iii) discriminant functions, which map $x$ to a label directly without a probabilistic model. SVMs belong to the third category.
We assume binary classification. The intuition of SVM is to put a hyperplane in the middle of the two classes, so that the distance to the nearest positive or negative example is maximized. Note this essentially ignores the class distribution p(x|y), and is more similar to logistic regression. The SVM discriminant function has the form
$$ f(x) = w^\top x + b, \qquad (1) $$
where $w$ is the parameter vector, and $b$ is the bias or offset scalar. The classification rule is $\mathrm{sign}(f(x))$, and the linear decision boundary is specified by $f(x) = 0$. The labels are $y \in \{-1, 1\}$. If $f$ separates the data, the geometric distance between a point $x$ and the decision boundary is
$$ \frac{y f(x)}{\|w\|}. \qquad (2) $$
To see this, note that $w^\top x$ is not itself the geometric distance between $x$'s projection onto $w$ and the origin: it must be normalized by the norm of $w$. Given training data $\{(x_i, y_i)\}_{i=1}^n$, we want to find a decision boundary $w, b$ that maximizes the geometric distance of the closest point, i.e.,
$$ \max_{w,b} \ \min_{i=1,\dots,n} \ \frac{y_i (w^\top x_i + b)}{\|w\|}. \qquad (3) $$
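To make objective (3) concrete, here is a minimal numpy sketch (the 2-D toy data and the two candidate hyperplanes are made up for illustration): the hyperplane that sits in the middle of the two classes attains the larger minimum geometric margin.

```python
import numpy as np

# Made-up 2-D toy data: two positive and two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

def min_geometric_margin(w, b, X, y):
    """Objective (3): the smallest geometric distance y_i (w^T x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w = np.array([1.0, 1.0])
print(min_geometric_margin(w, 0.0, X, y))    # boundary centered between the classes: ~2.83
print(min_geometric_margin(w, -1.5, X, y))   # same direction, shifted off-center: ~1.77
```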
Note this is the key difference between SVM and logistic regression: they optimize different objectives. The above objective is difficult to optimize directly. Here is a trick: notice that for any $\hat w, \hat b$, the objective is the same for $\kappa \hat w, \kappa \hat b$ for any nonzero scaling factor $\kappa$. That is to say, the optimization (3) is actually over equivalence classes of $(w, b)$ up to scaling. Therefore, we can reduce the redundancy by requiring the closest point to the decision boundary to satisfy:
$$ y f(x) = y (w^\top x + b) = 1, \qquad (4) $$
which implies that all points satisfy
$$ y f(x) = y (w^\top x + b) \ge 1. \qquad (5) $$
This converts the unconstrained but complex problem (3) into a constrained but simpler problem
$$ \max_{w,b} \ \frac{1}{\|w\|} \qquad (6) $$
$$ \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n. \qquad (7) $$
Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, but the latter will prove convenient later. Our problem now becomes
$$ \min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad (8) $$
$$ \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n. \qquad (9) $$
This is known as a quadratic programming (QP) problem: the objective is a quadratic function of the variable (in this case $w$), and the constraints are linear inequalities. Standard optimization packages can solve such a problem (but often slowly for high-dimensional $x$ and large $n$). However, we will next derive the dual optimization problem. The dual problem has two advantages: 1. it illustrates the reason behind the name "support vector"; 2. it can use the powerful kernel trick. The basic idea is to form the Lagrangian, minimize it with respect to the primal variables $w, b$, and maximize it with respect to the Lagrange multipliers (called dual variables). To this end, we introduce $\alpha_{1:n} \ge 0$ and define the Lagrangian
$$ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]. \qquad (10) $$
Setting ∂L(w, b, α)/∂w = 0 we obtain
$$ w = \sum_{i=1}^n \alpha_i y_i x_i. \qquad (11) $$
Setting ∂L(w, b, α)/∂b = 0 we obtain
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (12) $$
Substituting these back into the Lagrangian, we get the dual objective as a function of $\alpha$ only, which is to be maximized subject to the following constraints:
$$ \max_\alpha \ -\frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^n \alpha_i \qquad (13) $$
$$ \text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \dots, n \qquad (14) $$
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (15) $$
This is again a constrained quadratic programming problem. We call (9) the primal problem and (15) the dual problem. They are equivalent, but the primal has $D + 1$ variables, where $D$ is the dimensionality of $x$, while the dual has $n$ variables, where $n$ is the number of training examples. In general, one should pick the smaller problem to solve. However, as we will soon see, the dual form allows the so-called 'kernel trick'. If we solve the primal problem, our discriminant function is simply
$$ f(x) = w^\top x + b. \qquad (16) $$
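As a sketch of solving the hard-margin primal (8)-(9) directly with an off-the-shelf QP package (this assumes numpy and the cvxopt solver are installed; the toy data and the tiny ridge on the $b$ entry are choices made for this illustration, not part of the lecture), stack $z = (w, b)$ and hand the quadratic objective and the linear inequality constraints to the solver:

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

solvers.options['show_progress'] = False

# Made-up, linearly separable toy data in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, D = X.shape

# Variable z = (w, b); objective (8) is (1/2) z^T P z with P = diag(1, ..., 1, 0).
P = np.zeros((D + 1, D + 1))
P[:D, :D] = np.eye(D)
P[D, D] = 1e-8                       # tiny ridge on the b entry to keep the solver numerically happy
q = np.zeros(D + 1)

# Constraint (9): y_i (w^T x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1.
G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
h = -np.ones(n)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
w, b = np.array(sol['x']).ravel()[:D], float(sol['x'][D])
print("w =", w, " b =", b)
print("f(x) =", X @ w + b)           # discriminant (16): every y_i f(x_i) >= 1
```

The dual (13)-(15) can be handed to the same QP solver; the soft-margin sketch further below does exactly that.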
So far we have assumed the training data are linearly separable. When they are not, one introduces slack variables $\xi_i \ge 0$ that allow points to violate the margin, and solves the soft-margin problem
$$ \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n, \qquad (26) $$
where $C$ is a weight parameter that trades off margin size against margin violations and needs to be carefully set (e.g., by cross validation). We can similarly look at the dual problem of (26) by introducing Lagrange multipliers. We arrive at
$$ \max_\alpha \ -\frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^n \alpha_i \qquad (27) $$
$$ \text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \qquad (28) $$
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (29) $$
Note the only difference from the linearly separable dual problem (15) is the upper bound $C$ on the $\alpha$'s. As before, when $\alpha_i = 0$ the point is not a support vector and can be ignored. When $0 < \alpha_i < C$, it can be shown using complementarity that $\xi_i = 0$, i.e., the point is on the margin. When $\alpha_i = C$, the point is inside the margin if $\xi_i \le 1$, or on the wrong side of the decision boundary if $\xi_i > 1$. The discriminant function is again
$$ f(x) = \sum_{i=1}^n \alpha_i y_i x_i^\top x + b. \qquad (30) $$
The offset $b$ can be computed from support vectors with $0 < \alpha_i < C$, for which $y_i f(x_i) = 1$.
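Putting (27)-(30) together, here is a sketch that feeds the soft-margin dual to the same cvxopt QP solver (assumed installed), reads off the support vectors, recovers $w$ via (11), and computes $b$ from support vectors with $0 < \alpha_i < C$. The toy data, the value of $C$, and the numerical thresholds are made-up choices.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

solvers.options['show_progress'] = False

# Made-up toy data (linearly separable here, but the same code handles slack).
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)
C = 1.0                                          # arbitrary weight parameter

# Dual (27)-(29) as a minimization: (1/2) a^T Q a - 1^T a, with Q_ij = y_i y_j x_i^T x_j.
Q = (y[:, None] * X) @ (y[:, None] * X).T
G = np.vstack([-np.eye(n), np.eye(n)])           # encodes 0 <= alpha_i <= C
h = np.concatenate([np.zeros(n), C * np.ones(n)])
A = y.reshape(1, -1)                             # encodes sum_i alpha_i y_i = 0
sol = solvers.qp(matrix(Q), matrix(-np.ones(n)), matrix(G), matrix(h), matrix(A), matrix(0.0))
alpha = np.array(sol['x']).ravel()

sv = alpha > 1e-6                                # support vectors: alpha_i > 0
on_margin = sv & (alpha < C - 1e-6)              # 0 < alpha_i < C: points on the margin
w = ((alpha * y)[:, None] * X).sum(axis=0)       # equation (11)
b = np.mean(y[on_margin] - X[on_margin] @ w)     # y_i f(x_i) = 1 on the margin
print("support vectors:", np.where(sv)[0])
print("w =", w, " b =", b)                       # for this toy set: w ~ (0.5, 0.5), b ~ 0
print("f(x) =", X @ w + b)                       # discriminant (30), written here as w^T x + b
```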
The dual problem (29) only involves the dot products $x_i^\top x_j$ of examples, not the examples themselves, and so does the discriminant function (30). This allows SVM to be kernelized. Consider the dataset $\{(x_i, y_i)\}_{i=1}^3 = \{(-1, 1), (0, -1), (1, 1)\}$, where $x \in \mathbb{R}$. This is not a linearly separable dataset. However, if we map $x$ to a three-dimensional vector
$$ \phi(x) = (1, \sqrt{2}\, x, x^2)^\top, \qquad (31) $$
the dataset becomes linearly separable in the three-dimensional space (equivalently, we have a non-linear decision boundary in the original space). The map does not actually increase the intrinsic dimensionality of $x$: $\phi(x)$ lies on a one-dimensional manifold in the 3D space. Nonetheless, this suggests a general way to handle linearly non-separable data: map $x$ to some $\phi(x)$. This is complementary to the slack variables, so we can simply replace all $x$ with $\phi(x)$ in (29) and (30).

If $\phi(x)$ is very high dimensional, representing it and computing the inner product becomes an issue. This is where the kernel kicks in. Note the dual problem (29) and its solution (30) involve only inner products of feature vectors $\phi(x_i)^\top \phi(x_j)$. Thus it might be possible to use a feature representation $\phi(x)$ without explicitly representing it, as long as we can compute the inner product. For example, the inner product of (31) can be computed as
$$ k(x_i, x_j) = \phi(x_i)^\top \phi(x_j) = (1 + x_i x_j)^2. \qquad (32) $$
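A quick numerical check of (31)-(32) on the toy set $\{-1, 0, 1\}$ (a minimal numpy sketch): the explicit feature map and the kernel produce the same Gram matrix.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])                                        # the toy 1-D dataset
phi = np.stack([np.ones_like(x), np.sqrt(2) * x, x ** 2], axis=1)     # feature map (31)

G_explicit = phi @ phi.T                                              # phi(x_i)^T phi(x_j)
G_kernel = (1 + np.outer(x, x)) ** 2                                  # kernel (32), no explicit features
print(np.allclose(G_explicit, G_kernel))                              # True
```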
The computational savings are much bigger for polynomial kernels $k(x_i, x_j) = (1 + x_i x_j)^d$ with larger degree $d$, where the explicit feature vector has many more dimensions. For the so-called Radial Basis Function (RBF) kernel
$$ k(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right), \qquad (33) $$
the corresponding feature vector is infinite dimensional. Thus the kernel trick is to replace $\phi(x_i)^\top \phi(x_j)$ with a kernel function $k(x_i, x_j)$ in (29) and (30). What functions are valid kernels, i.e., correspond to some feature vector $\phi(x)$? They must be so-called Mercer kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where for any finite set of points $x_1, \dots, x_m$ the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ is symmetric and positive semi-definite.
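As a numerical illustration of the Mercer condition (the random points, the bandwidth $\sigma$, and the tolerance are arbitrary choices in this sketch), one can check that an RBF Gram matrix is symmetric and has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # arbitrary points in R^3
sigma = 1.0                                          # arbitrary bandwidth

# RBF kernel (33): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K, K.T))                           # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # positive semi-definite (up to round-off)
```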
An equivalent formulation to the SVM constrained optimization problem (26) is the unconstrained problem
$$ \min_{w,b} \ \sum_{i=1}^n \max\big(1 - y_i (w^\top x_i + b),\, 0\big) + \lambda \|w\|^2. \qquad (34) $$
If we call $y_i (w^\top x_i + b)$ the margin of $x_i$, the term $\max(1 - y_i (w^\top x_i + b), 0)$ wants the margin of every training point to be larger than 1, i.e., a confident prediction. This term is known as the hinge loss function. Note the above objective is very similar to L2-regularized logistic regression, just with a different loss function (the latter uses the negative log likelihood loss). There is no probabilistic interpretation of the margin of a point. There are heuristics to convert the margin into a probability $p(y|x)$, which sometimes work in practice but are not justified in theory.

There are many ways to extend binary SVM to multiclass classification. A heuristic method is 1-vs-rest: for a $K$-class problem, create $K$ binary classification subproblems (class 1 vs. classes 2–K, class 2 vs. classes 1, 3–K, and so on), solve each subproblem with a binary SVM, and classify $x$ to the class for which it has the largest positive margin. SVM can also be extended to regression by replacing the hinge loss with an $\epsilon$-insensitive loss.
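To tie the last two paragraphs together, here is a self-contained numpy sketch (not from the lecture): a linear SVM trained by subgradient descent on objective (34) (the toy data, step size, iteration count, and $\lambda$ are made-up choices), wrapped in the 1-vs-rest heuristic for a three-class toy problem.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, lr=0.01, iters=2000):
    """Subgradient descent on (34): sum_i max(1 - y_i (w^T x_i + b), 0) + lam * ||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        active = y * (X @ w + b) < 1                   # points currently violating the margin
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up 2-D data for three well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 20)

# 1-vs-rest: one binary SVM per class (class k gets label +1, the rest -1).
models = [train_linear_svm(X, np.where(labels == k, 1.0, -1.0)) for k in range(3)]

# Classify each point by the largest margin f_k(x) = w_k^T x + b_k.
scores = np.stack([X @ w + b for w, b in models], axis=1)
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```

The same wrapper works with any binary SVM trainer; only the $\pm 1$ labels and the real-valued margins $f_k(x)$ matter.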