Ch 4 — Linear Models for Classification

← Intro to ML

In regression, the target \(t\) is continuous; in classification, it’s a label from \(K\) discrete classes. Bishop organizes classifiers along a spectrum of how directly they go from inputs to labels:

Approach	Models	Pros	Cons
Discriminant function	\(f: \mathbf{x} \to k\) directly	Simple, often fast	No probabilities — no calibrated confidence
Discriminative probabilistic	\(p(\mathcal{C}_k \mid \mathbf{x})\) directly	Calibrated probabilities	Doesn’t model the input distribution
Generative	\(p(\mathbf{x} \mid \mathcal{C}_k)\) and \(p(\mathcal{C}_k)\)	Lets you sample inputs, handle missing data	Hardest to learn; bad in high dimensions

This chapter covers all three for linear decision boundaries.

Prerequisites

Linear algebra: projections, eigenvectors (for Fisher’s LDA).
Probability: Gaussians, the multivariate Gaussian’s log-density (Ch 2).
Calculus: the gradient and Hessian; Newton’s method.

4.1 Discriminant Functions

A linear discriminant for two classes is

\[y(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0,\]

with the decision rule “predict \(\mathcal{C}_1\) if \(y(\mathbf{x}) \geq 0\), else \(\mathcal{C}_2\).” The decision boundary \(y = 0\) is a hyperplane.

For \(K > 2\) classes, the one-vs-rest trick (one binary classifier per class) and the one-vs-one trick (\(K(K-1)/2\) pairwise classifiers) both leave ambiguous regions. The clean fix is to use \(K\) linear discriminants \(y_k(\mathbf{x}) = \mathbf{w}_k^{\top}\mathbf{x} + w_{k0}\) and assign \(\mathbf{x}\) to the class with the largest \(y_k\).

Least squares for classification

If we encode classes as one-hot targets and run least-squares regression on each output, we get a closed-form classifier. It’s mathematically tidy but highly sensitive to outliers — a few extreme points can rotate the decision boundary in a way that misclassifies large clusters. Don’t use it as your real classifier; use it as a baseline.

Fisher’s linear discriminant

A better approach for two classes: project onto a 1-D direction \(\mathbf{w}\) chosen to maximize between-class variance and minimize within-class variance. Defining \(\mathbf{S}_W\) (within-class scatter) and \(\mathbf{S}_B\) (between-class scatter), Fisher’s criterion is

\[J(\mathbf{w}) = \frac{\mathbf{w}^{\top}\mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\top}\mathbf{S}_W \mathbf{w}}, \qquad \mathbf{w}^{\star} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1), \tag{4.1}\]

a generalized-eigenvalue problem with a closed-form solution. For \(K\) classes, the same idea generalizes to projecting onto a \((K-1)\)-dim subspace.

The perceptron

Rosenblatt’s classical algorithm. For \(t_n \in \lbrace -1, +1\rbrace\), define the per-example loss

\[E_P(\mathbf{w}) = - \sum_{n \in \mathcal{M}} t_n\, \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_n),\]

summed only over misclassified examples \(\mathcal{M}\). Stochastic gradient descent gives the update \(\mathbf{w} \leftarrow \mathbf{w} + \eta\, t_n \boldsymbol{\phi}(\mathbf{x}_n)\) on each misclassified point. The perceptron convergence theorem says: if the data are linearly separable, the algorithm finds a separating hyperplane in finite steps. Otherwise it doesn’t terminate.

4.2 Probabilistic Generative Models

Model each class-conditional density and the prior, then apply Bayes:

\[p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(\mathbf{x} \mid \mathcal{C}_j)\, p(\mathcal{C}_j)}.\]

For two classes the posterior simplifies to

\[p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(a), \qquad a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1) p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2) p(\mathcal{C}_2)}, \qquad \sigma(a) = \frac{1}{1 + e^{-a}}.\]

The logistic sigmoid appears naturally — not as a hand-picked nonlinearity but as a consequence of Bayes’ rule. For \(K > 2\), the sigmoid generalizes to the softmax \(p(\mathcal{C}_k \mid \mathbf{x}) = \exp(a_k)/\sum_j \exp(a_j)\).

Gaussian class-conditionals

If each \(p(\mathbf{x} \mid \mathcal{C}_k)\) is Gaussian with a shared covariance \(\boldsymbol{\Sigma}\), the log-ratio is linear in \(\mathbf{x}\) and the decision boundary is a hyperplane (this is Linear Discriminant Analysis, LDA). With class-specific covariances, the log-ratio becomes quadratic — QDA, with curved decision boundaries.

Naive Bayes

If you assume the features of \(\mathbf{x}\) are conditionally independent given the class, \(p(\mathbf{x} \mid \mathcal{C}_k) = \prod_d p(x_d \mid \mathcal{C}_k)\), the model decouples per dimension. Astonishingly often a strong baseline despite the obviously wrong independence assumption.

4.3 Probabilistic Discriminative Models

Skip modeling \(p(\mathbf{x})\) — directly parameterize \(p(\mathcal{C}_k \mid \mathbf{x})\).

Logistic regression

For two classes,

\[p(\mathcal{C}_1 \mid \boldsymbol{\phi}) = \sigma(\mathbf{w}^{\top}\boldsymbol{\phi}), \qquad \boldsymbol{\phi} = \boldsymbol{\phi}(\mathbf{x}).\]

The negative log-likelihood (a.k.a. cross-entropy loss) is

\[E(\mathbf{w}) = - \sum_{n=1}^{N} \lbrace t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \rbrace, \qquad y_n = \sigma(\mathbf{w}^{\top}\boldsymbol{\phi}_n).\]

The gradient has the same form as in linear regression:

\[\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\top}(\mathbf{y} - \mathbf{t}). \tag{4.2}\]

No closed form, but the loss is convex, so any local minimum is global. Solve via iteratively reweighted least squares (IRLS) — Newton’s method specialized to logistic regression.

For \(K > 2\), softmax regression generalizes the same way and again has convex log-loss.

4.4 The Laplace Approximation and Bayesian Logistic Regression

Logistic regression’s posterior \(p(\mathbf{w} \mid \mathcal{D})\) is not Gaussian — there’s no conjugate prior. The Laplace approximation fits a Gaussian to the posterior at its mode:

\[q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{H}^{-1}),\]

where \(\mathbf{H} = -\nabla\nabla \ln p(\mathbf{w}\mid \mathcal{D})\big

_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}\) is the negative Hessian of the log-posterior. It captures local curvature; the predictive distribution then comes from integrating \(q\).

The Laplace approximation appears throughout PRML — Bayesian logistic regression here, neural network MAP later, and as a baseline against the variational methods of Ch 10.

Takeaways

The logistic sigmoid isn’t arbitrary — it falls out of applying Bayes to two Gaussian class-conditionals with shared covariance.
Generative vs discriminative: generative models are richer but harder to fit; discriminative models trade away the input-distribution model for sharper decisions on labels.
Logistic regression is the workhorse — convex log-loss, IRLS solver, easy to extend with basis functions, a clean Bayesian story via Laplace.
LDA / Fisher’s discriminant / Naive Bayes remain useful as fast baselines and dimensionality-reduction tools.

Forward to Ch 5 — Neural Networks.

Figures from Bishop, Pattern Recognition and Machine Learning (Springer, 2006), © Springer / C. M. Bishop. Reproduced for educational use.