Ch 7 — Sparse Kernel Machines
Support vector machines, the maximum-margin principle, soft margins, multi-class extensions, the relevance vector machine.
Kernel methods (Ch 6) are powerful but require evaluating the kernel at every training point at prediction time — expensive when \(N\) is large. Sparse kernel methods produce solutions where most of the kernel coefficients are zero, so prediction depends on only a handful of “important” training points. This chapter covers the two prototypes: the support vector machine (SVM) and the relevance vector machine (RVM).
Prerequisites
- Kernel methods (Ch 6) — dual representations, Mercer kernels.
- Constrained optimization & Lagrange multipliers — SVM training is a quadratic program with linear constraints.
- Logistic regression / probabilistic discriminative models (Ch 4) — context for what SVM/RVM gain and lose.
7.1 Maximum Margin Classifiers
For two linearly-separable classes \(\lbrace t_n \in \pm 1\rbrace\) and a linear classifier \(y(\mathbf{x}) = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}) + b\), the margin is the distance from the decision boundary to the closest training point. The maximum-margin solution maximizes that distance.
Among all separating hyperplanes, the maximum-margin one is provably the best in a generalization-bound sense (PAC-Bayes, VC-dimension), and empirically tends to generalize well.
After scaling \((\mathbf{w}, b)\) so that the closest points have \(t_n y(\mathbf{x}_n) = 1\), the problem becomes
\[\min_{\mathbf{w}, b} \;\tfrac{1}{2} \\|\mathbf{w}\\|^{2} \quad \text{subject to} \quad t_n (\mathbf{w}^{\top}\boldsymbol{\phi}_n + b) \geq 1 \;\forall n. \tag{7.1}\]Convex quadratic objective, linear constraints — a quadratic program with a unique global optimum.
The dual form
Introducing Lagrange multipliers \(a_n \geq 0\) for each constraint and eliminating \(\mathbf{w}\) gives the dual
\[\max_{\mathbf{a}} \;\sum_{n=1}^{N} a_n - \tfrac{1}{2} \sum_{n,m} a_n a_m\, t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m), \quad 0 \leq a_n,\; \sum_n a_n t_n = 0. \tag{7.2}\]Two important consequences:
- The data appear only inside \(k(\mathbf{x}_n, \mathbf{x}_m)\) — kernel trick again.
- KKT conditions force \(a_n = 0\) for any point with \(t_n y(\mathbf{x}_n) > 1\). Only points exactly on the margin (where \(t_n y(\mathbf{x}_n) = 1\)) get \(a_n > 0\). These are the support vectors, and they are the only points that contribute to predictions.
This is the sparsity property: prediction cost scales with the number of support vectors, not the full \(N\).
7.2 Soft Margin SVM
Real data are rarely separable. The soft-margin SVM relaxes the hard constraints by introducing slack variables \(\xi_n \geq 0\):
\[\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \tfrac{1}{2}\\|\mathbf{w}\\|^{2} + C \sum_{n} \xi_n, \quad t_n y(\mathbf{x}_n) \geq 1 - \xi_n,\; \xi_n \geq 0. \tag{7.3}\]\(C > 0\) is a hyperparameter trading margin width against mis-classification slack. \(C \to \infty\) recovers the hard-margin problem; \(C \to 0\) maximizes the margin while ignoring the data entirely.
The dual is identical to (7.2) except the constraints become \(0 \leq a_n \leq C\) — points that violate the margin saturate at \(a_n = C\). Three categories of training points:
- \(a_n = 0\): on the correct side of the margin, ignored at prediction time.
- \(0 < a_n < C\): exactly on the margin, real support vectors.
- \(a_n = C\): inside the margin or misclassified.
The hinge loss
Eliminating \(\xi_n\) shows the soft-margin SVM is equivalent to minimizing
\[\sum_n [1 - t_n y(\mathbf{x}_n)]_+ + \lambda \\|\mathbf{w}\\|^{2}, \qquad [u]_+ = \max(0, u),\]where \([1 - t y]_+\) is the hinge loss. Compared to logistic regression’s smooth log-loss, hinge is non-smooth and has a flat zero region — that flat region is precisely what gives sparsity (most points sit there with zero gradient, hence zero coefficient).
7.3 Multi-Class SVM
SVMs are intrinsically two-class; extending to \(K\) classes is awkward.
- One-vs-rest: train \(K\) binary classifiers, each separating one class from all others. Requires care with imbalanced classes; outputs aren’t probabilities.
- One-vs-one: train \(K(K-1)/2\) pairwise classifiers; predict by majority vote. More classifiers but each on smaller data.
- Crammer–Singer formulation: a single optimization problem maximizing the margin across all classes simultaneously. Cleaner but harder to optimize.
For most applications: train \(K\) one-vs-rest models, calibrate with Platt scaling if you need probabilities.
7.4 SVM for Regression
A regression analog: the \(\epsilon\)-insensitive loss is zero for residuals smaller than \(\epsilon\) and linear outside. This produces sparse solutions where only points outside the \(\epsilon\)-tube are support vectors. Training is a quadratic program; the dual depends on the data only through kernels.
7.5 The Relevance Vector Machine
The SVM’s flaws: outputs aren’t probabilities; the hyperparameter \(C\) (and any kernel parameters) must be cross-validated; sparsity comes from a non-smooth loss (mathematically inelegant).
The relevance vector machine addresses all three. Place a Gaussian prior on each coefficient \(w_n\), with per-coefficient precision \(\alpha_n\):
\[p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{n=1}^{N} \mathcal{N}(w_n \mid 0, \alpha_n^{-1}).\]Maximizing the marginal likelihood w.r.t. \(\boldsymbol{\alpha}\) — the same Ch 3 evidence procedure — drives many \(\alpha_n \to \infty\). The corresponding \(w_n\) collapse to zero, leaving only a few “relevance vectors” with active weights.
Compared to SVM:
| SVM | RVM | |
|---|---|---|
| Probabilistic outputs | No (not directly) | Yes |
| Hyperparameter tuning | Cross-validate \(C\), kernel | Marginal likelihood |
| Sparsity | Hinge loss | Automatic relevance determination |
| Kernel constraints | Mercer-positive | Any function |
| Training cost | Convex QP, fast | Non-convex evidence max, slower |
RVMs typically give more sparsity than SVMs and yield calibrated predictive distributions. They’re slower to train and the optimization is non-convex, so the SVM remains the more popular choice in practice — but the RVM is the cleaner Bayesian story.
Takeaways
- Maximum margin is the principled answer to “which separating hyperplane?”. The dual depends only on kernel evaluations.
- Support vectors are points exactly on (or violating) the margin; everything else has zero coefficient. Sparsity follows from the hinge loss.
- Soft margin + the hyperparameter \(C\) handle non-separable data; equivalent to L2-regularized hinge-loss minimization.
- RVM trades convexity for Bayesian elegance: probabilistic outputs, automatic hyperparameter selection, and even sparser solutions.
Forward to Ch 8 — Graphical Models.
Figures from Bishop, Pattern Recognition and Machine Learning (Springer, 2006), © Springer / C. M. Bishop. Reproduced for educational use.