CS769 Spring 2010 Advanced Natural Language Processing
Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu
Many NLP problems can be formulated as classification, e.g., word sense disambiguation, spam filtering, sentiment analysis, document retrieval, speech recognition, etc. There are in general three ways to do classification: (i) generative models, which model the class-conditional $p(x|y)$ and the prior $p(y)$; (ii) discriminative probabilistic models, which model $p(y|x)$ directly, e.g., logistic regression; (iii) discriminant functions, which map $x$ to a label directly without a probabilistic model. SVMs belong to the third category.
We assume binary classification. The intuition of SVM is to put a hyperplane in the middle of the two classes, so that the distance to the nearest positive or negative example is maximized. Note this essentially ignores the class distribution p(x|y), and is more similar to logistic regression. The SVM discriminant function has the form
$$ f(x) = w^\top x + b, \qquad (1) $$
where $w$ is the parameter vector, and $b$ is the bias or offset scalar. The classification rule is $\mathrm{sign}(f(x))$, and the linear decision boundary is specified by $f(x) = 0$. The labels are $y \in \{-1, 1\}$. If $f$ separates the data, the geometric distance between a point $x$ and the decision boundary is
$$ \frac{y f(x)}{\|w\|}. \qquad (2) $$
To see this, note that $w^\top x$ is not itself the geometric distance between $x$'s projection onto $w$ and the origin: it must be normalized by the norm of $w$. Given training data $\{(x_i, y_i)\}_{i=1}^n$, we want to find a decision boundary $w, b$ that maximizes the geometric distance of the closest point, i.e.,
$$ \max_{w,b} \ \min_{i=1,\dots,n} \ \frac{y_i (w^\top x_i + b)}{\|w\|}. \qquad (3) $$
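To make objective (3) concrete, here is a minimal numpy sketch (the 2-D toy data and the two candidate hyperplanes are made up for illustration): the hyperplane that sits in the middle of the two classes attains the larger minimum geometric margin.

```python
import numpy as np

# Made-up 2-D toy data: two positive and two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

def min_geometric_margin(w, b, X, y):
    """Objective (3): the smallest geometric distance y_i (w^T x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w = np.array([1.0, 1.0])
print(min_geometric_margin(w, 0.0, X, y))    # boundary centered between the classes: ~2.83
print(min_geometric_margin(w, -1.5, X, y))   # same direction, shifted off-center: ~1.77
```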
Note this is the key difference between SVM and logistic regression: they optimize different objectives. The above objective is difficult to optimize directly. Here is a trick: notice that for any $\hat w, \hat b$, the objective is the same for $\kappa \hat w, \kappa \hat b$ for any nonzero scaling factor $\kappa$. That is to say, the optimization (3) is actually over equivalence classes of $(w, b)$ up to scaling. Therefore, we can reduce the redundancy by requiring the closest point to the decision boundary to satisfy:
$$ y f(x) = y (w^\top x + b) = 1, \qquad (4) $$
which implies that all points satisfy
$$ y f(x) = y (w^\top x + b) \ge 1. \qquad (5) $$
This converts the unconstrained but complex problem (3) into a constrained but simpler problem
$$ \max_{w,b} \ \frac{1}{\|w\|} \qquad (6) $$
$$ \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n. \qquad (7) $$
Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, but the latter will prove convenient later. Our problem now becomes
$$ \min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad (8) $$
$$ \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \quad i = 1, \dots, n. \qquad (9) $$
This is known as a quadratic programming (QP) problem: the objective is a quadratic function of the variable (in this case $w$), and the constraints are linear inequalities. Standard optimization packages can solve such a problem (but often slowly for high-dimensional $x$ and large $n$). However, we will next derive the dual optimization problem. The dual problem has two advantages: 1. it illustrates the reason behind the name "support vector"; 2. it can use the powerful kernel trick. The basic idea is to form the Lagrangian, minimize it with respect to the primal variables $w, b$, and maximize it with respect to the Lagrange multipliers (called dual variables). To this end, we introduce $\alpha_{1:n} \ge 0$ and define the Lagrangian
$$ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]. \qquad (10) $$
Setting ∂L(w, b, α)/∂w = 0 we obtain
$$ w = \sum_{i=1}^n \alpha_i y_i x_i. \qquad (11) $$
Setting ∂L(w, b, α)/∂b = 0 we obtain
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (12) $$
Substituting these back into the Lagrangian, we get the dual objective as a function of $\alpha$ only, which is to be maximized subject to the following constraints:
$$ \max_\alpha \ -\frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^n \alpha_i \qquad (13) $$
$$ \text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \dots, n \qquad (14) $$
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (15) $$
This is again a constrained quadratic programming problem. We call (9) the primal problem and (15) the dual problem. They are equivalent, but the primal has $D + 1$ variables, where $D$ is the dimensionality of $x$, while the dual has $n$ variables, where $n$ is the number of training examples. In general, one should pick the smaller problem to solve. However, as we will soon see, the dual form allows the so-called 'kernel trick'. If we solve the primal problem, our discriminant function is simply
$$ f(x) = w^\top x + b. \qquad (16) $$
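As a sketch of solving the hard-margin primal (8)-(9) directly with an off-the-shelf QP package (this assumes numpy and the cvxopt solver are installed; the toy data and the tiny ridge on the $b$ entry are choices made for this illustration, not part of the lecture), stack $z = (w, b)$ and hand the quadratic objective and the linear inequality constraints to the solver:

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

solvers.options['show_progress'] = False

# Made-up, linearly separable toy data in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, D = X.shape

# Variable z = (w, b); objective (8) is (1/2) z^T P z with P = diag(1, ..., 1, 0).
P = np.zeros((D + 1, D + 1))
P[:D, :D] = np.eye(D)
P[D, D] = 1e-8                       # tiny ridge on the b entry to keep the solver numerically happy
q = np.zeros(D + 1)

# Constraint (9): y_i (w^T x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1.
G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
h = -np.ones(n)

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
w, b = np.array(sol['x']).ravel()[:D], float(sol['x'][D])
print("w =", w, " b =", b)
print("f(x) =", X @ w + b)           # discriminant (16): every y_i f(x_i) >= 1
```

The dual (13)-(15) can be handed to the same QP solver; the soft-margin sketch further below does exactly that.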
So far we have assumed the training data are linearly separable. When they are not, one introduces slack variables $\xi_i \ge 0$ that allow points to violate the margin, and solves the soft-margin problem
$$ \min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, n, \qquad (26) $$
where $C$ is a weight parameter that trades off margin size against margin violations and needs to be carefully set (e.g., by cross validation). We can similarly look at the dual problem of (26) by introducing Lagrange multipliers. We arrive at
$$ \max_\alpha \ -\frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^n \alpha_i \qquad (27) $$
$$ \text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \qquad (28) $$
$$ \sum_{i=1}^n \alpha_i y_i = 0. \qquad (29) $$
Note the only difference from the linearly separable dual problem (15) is the upper bound $C$ on the $\alpha$'s. As before, when $\alpha_i = 0$ the point is not a support vector and can be ignored. When $0 < \alpha_i < C$, it can be shown using complementarity that $\xi_i = 0$, i.e., the point is on the margin. When $\alpha_i = C$, the point is inside the margin if $\xi_i \le 1$, or on the wrong side of the decision boundary if $\xi_i > 1$. The discriminant function is again
$$ f(x) = \sum_{i=1}^n \alpha_i y_i x_i^\top x + b. \qquad (30) $$
The offset $b$ can be computed from support vectors with $0 < \alpha_i < C$, for which $y_i f(x_i) = 1$.
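Putting (27)-(30) together, here is a sketch that feeds the soft-margin dual to the same cvxopt QP solver (assumed installed), reads off the support vectors, recovers $w$ via (11), and computes $b$ from support vectors with $0 < \alpha_i < C$. The toy data, the value of $C$, and the numerical thresholds are made-up choices.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

solvers.options['show_progress'] = False

# Made-up toy data (linearly separable here, but the same code handles slack).
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)
C = 1.0                                          # arbitrary weight parameter

# Dual (27)-(29) as a minimization: (1/2) a^T Q a - 1^T a, with Q_ij = y_i y_j x_i^T x_j.
Q = (y[:, None] * X) @ (y[:, None] * X).T
G = np.vstack([-np.eye(n), np.eye(n)])           # encodes 0 <= alpha_i <= C
h = np.concatenate([np.zeros(n), C * np.ones(n)])
A = y.reshape(1, -1)                             # encodes sum_i alpha_i y_i = 0
sol = solvers.qp(matrix(Q), matrix(-np.ones(n)), matrix(G), matrix(h), matrix(A), matrix(0.0))
alpha = np.array(sol['x']).ravel()

sv = alpha > 1e-6                                # support vectors: alpha_i > 0
on_margin = sv & (alpha < C - 1e-6)              # 0 < alpha_i < C: points on the margin
w = ((alpha * y)[:, None] * X).sum(axis=0)       # equation (11)
b = np.mean(y[on_margin] - X[on_margin] @ w)     # y_i f(x_i) = 1 on the margin
print("support vectors:", np.where(sv)[0])
print("w =", w, " b =", b)                       # for this toy set: w ~ (0.5, 0.5), b ~ 0
print("f(x) =", X @ w + b)                       # discriminant (30), written here as w^T x + b
```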
The dual problem (29) only involves the dot products $x_i^\top x_j$ of examples, not the examples themselves, and so does the discriminant function (30). This allows SVM to be kernelized. Consider the dataset $\{(x_i, y_i)\}_{i=1}^3 = \{(-1, 1), (0, -1), (1, 1)\}$, where $x \in \mathbb{R}$. This is not a linearly separable dataset. However, if we map $x$ to a three-dimensional vector
$$ \phi(x) = (1, \sqrt{2}\, x, x^2)^\top, \qquad (31) $$
the dataset becomes linearly separable in the three-dimensional space (equivalently, we have a non-linear decision boundary in the original space). The map does not actually increase the intrinsic dimensionality of $x$: $\phi(x)$ lies on a one-dimensional manifold in the 3D space. Nonetheless, this suggests a general way to handle linearly non-separable data: map $x$ to some $\phi(x)$. This is complementary to the slack variables, so we can simply replace all $x$ with $\phi(x)$ in (29) and (30).

If $\phi(x)$ is very high dimensional, representing it and computing the inner product becomes an issue. This is where the kernel kicks in. Note the dual problem (29) and its solution (30) involve only inner products of feature vectors $\phi(x_i)^\top \phi(x_j)$. Thus it might be possible to use a feature representation $\phi(x)$ without explicitly representing it, as long as we can compute the inner product. For example, the inner product of (31) can be computed as
$$ k(x_i, x_j) = \phi(x_i)^\top \phi(x_j) = (1 + x_i x_j)^2. \qquad (32) $$
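A quick numerical check of (31)-(32) on the toy set $\{-1, 0, 1\}$ (a minimal numpy sketch): the explicit feature map and the kernel produce the same Gram matrix.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])                                        # the toy 1-D dataset
phi = np.stack([np.ones_like(x), np.sqrt(2) * x, x ** 2], axis=1)     # feature map (31)

G_explicit = phi @ phi.T                                              # phi(x_i)^T phi(x_j)
G_kernel = (1 + np.outer(x, x)) ** 2                                  # kernel (32), no explicit features
print(np.allclose(G_explicit, G_kernel))                              # True
```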
The computational savings are much bigger for polynomial kernels $k(x_i, x_j) = (1 + x_i x_j)^d$ with larger degree $d$, where the explicit feature vector has many more dimensions. For the so-called Radial Basis Function (RBF) kernel
$$ k(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right), \qquad (33) $$
the corresponding feature vector is infinite dimensional. Thus the kernel trick is to replace $\phi(x_i)^\top \phi(x_j)$ with a kernel function $k(x_i, x_j)$ in (29) and (30). What functions are valid kernels, i.e., correspond to some feature vector $\phi(x)$? They must be so-called Mercer kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where for any finite set of points $x_1, \dots, x_m$ the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ is symmetric and positive semi-definite.
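As a numerical illustration of the Mercer condition (the random points, the bandwidth $\sigma$, and the tolerance are arbitrary choices in this sketch), one can check that an RBF Gram matrix is symmetric and has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # arbitrary points in R^3
sigma = 1.0                                          # arbitrary bandwidth

# RBF kernel (33): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K, K.T))                           # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # positive semi-definite (up to round-off)
```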
An equivalent formulation to the SVM constrained optimization problem (26) is the unconstrained problem
$$ \min_{w,b} \ \sum_{i=1}^n \max\big(1 - y_i (w^\top x_i + b),\, 0\big) + \lambda \|w\|^2. \qquad (34) $$
If we call $y_i (w^\top x_i + b)$ the margin of $x_i$, the term $\max(1 - y_i (w^\top x_i + b), 0)$ wants the margin of every training point to be larger than 1, i.e., a confident prediction. This term is known as the hinge loss function. Note the above objective is very similar to L2-regularized logistic regression, just with a different loss function (the latter uses the negative log likelihood loss). There is no probabilistic interpretation of the margin of a point. There are heuristics to convert the margin into a probability $p(y|x)$, which sometimes work in practice but are not justified in theory.

There are many ways to extend binary SVM to multiclass classification. A heuristic method is 1-vs-rest: for a $K$-class problem, create $K$ binary classification subproblems (class 1 vs. classes 2–K, class 2 vs. classes 1, 3–K, and so on), solve each subproblem with a binary SVM, and classify $x$ to the class for which it has the largest positive margin. SVM can also be extended to regression by replacing the hinge loss with an $\epsilon$-insensitive loss.
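To tie the last two paragraphs together, here is a self-contained numpy sketch (not from the lecture): a linear SVM trained by subgradient descent on objective (34) (the toy data, step size, iteration count, and $\lambda$ are made-up choices), wrapped in the 1-vs-rest heuristic for a three-class toy problem.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, lr=0.01, iters=2000):
    """Subgradient descent on (34): sum_i max(1 - y_i (w^T x_i + b), 0) + lam * ||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        active = y * (X @ w + b) < 1                   # points currently violating the margin
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up 2-D data for three well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 20)

# 1-vs-rest: one binary SVM per class (class k gets label +1, the rest -1).
models = [train_linear_svm(X, np.where(labels == k, 1.0, -1.0)) for k in range(3)]

# Classify each point by the largest margin f_k(x) = w_k^T x + b_k.
scores = np.stack([X @ w + b for w, b in models], axis=1)
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```

The same wrapper works with any binary SVM trainer; only the $\pm 1$ labels and the real-valued margins $f_k(x)$ matter.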