Notes for CMU-18661 Introduction to Machine Learning for Engineers.
Lecture 1: Intro
Link to the course website.
- Homeworks (40%): Both math and programming problems
- Miniexams (15%): Two during the semester
- Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
- Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
- Gradescope Quizzes (5%): Short multiple-choice quizzes held in randomly chosen (possibly all) lectures to encourage class attendance and attention. The best 10 quiz scores are counted.
Lecture 2: MLE & MAP
Pending…
Lecture 3-4: Linear Regression
Singular Value Decomposition (SVD):
$$ A = U\Sigma V^{\top} $$
$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices
$\Sigma\in\mathbb{R}^{m\times n}$ is a rectangular diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.
The squared singular values $\sigma_i^2$ are the nonzero eigenvalues of $A^{\top}A$ and $AA^{\top}$.
The columns of $V$ are the right singular vectors of $A$, i.e., the eigenvectors of $A^{\top}A$.
The columns of $U$ are the left singular vectors of $A$, i.e., the eigenvectors of $AA^{\top}$.
Proof:
$$
\begin{aligned}
A^{\top}A&=V\Sigma^{\top} U^{\top} U\Sigma V^{\top}\\
&=V\Sigma^{\top} \Sigma V^{\top}\\
&=V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^{\top}
\end{aligned}
$$
The decomposition of $AA^{\top}$ follows similarly.
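A minimal numpy sketch verifying these relationships; the matrix `A` below is just an illustrative example:

```python
import numpy as np

# Example matrix (illustrative only), m = 3, n = 2
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# full_matrices=True gives U (m x m) and V (n x n); S holds the singular values
U, S, Vt = np.linalg.svd(A, full_matrices=True)

# Squared singular values match the nonzero eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)                 # ascending order
print(np.allclose(np.sort(S**2), np.sort(eigvals)))   # True

# Reconstruct A = U Sigma V^T
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(A, U @ Sigma @ Vt))                 # True
```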
Linear Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\mathbb{R}$
- Model: $f(\mathbf{x})=w_0+\mathbf{w}^{\top}\mathbf{x}=\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
Residual Sum of Squares (RSS):
$$ \text{RSS}(\mathbf{\widetilde{w}})=\sum_{n=1}^N(y_n-\mathbf{\widetilde{w}}^{\top}\mathbf{\widetilde{x}}_n)^2 $$
For $D=1$:
$$
\begin{aligned}
\frac{\partial RSS}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0\\
\frac{\partial RSS}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0\\
\therefore\sum y_n &= N w_0 + w_1 \sum x_n,\quad\sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2\\
\therefore w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x}
\end{aligned}
$$
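A short numpy sketch of these $D=1$ closed-form estimates; the synthetic data below are made up for illustration:

```python
import numpy as np

# Synthetic 1-D data (illustrative): y ≈ 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2 + 3 * x + rng.normal(0, 0.1, size=100)

# Closed-form least-squares estimates
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)  # close to 2 and 3
```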
For $D>1$:
$$
\begin{aligned}
RSS(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N (y_n - \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n)^2\\
&=\sum_n(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n)(y_n-\widetilde{\mathbf{x}}_n^{\top}\widetilde{\mathbf{w}})\\
&=\sum_n(- 2 y_n \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^{\top} \widetilde{\mathbf{w}}) + \text{const}\\
&=\widetilde{\mathbf{w}}^{\top} \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right)^{\top} \widetilde{\mathbf{w}} + \text{const}\\
\text{Set}\quad\nabla_{\widetilde{\mathbf{w}}} RSS &= 2 \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right) = 0\\
\text{Least Mean Squares (LMS) solution:}\quad\widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^{\top}\mathbf{y}
\end{aligned}
$$
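A hedged numpy sketch of this closed-form solution; the function name `fit_linear_regression` is an illustrative choice, and `np.linalg.lstsq` is used rather than forming $(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}})^{-1}$ explicitly, which is numerically preferable:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of y ≈ w0 + w^T x.

    X: (N, D) feature matrix, y: (N,) targets.
    Returns the augmented weight vector [w0, w1, ..., wD].
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    # Solves min ||X_tilde w - y||^2 directly
    w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    return w_tilde
```

The solver handles the rank-deficient case gracefully, whereas inverting $\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}$ fails when the features are collinear.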
Why minimize RSS?
Noisy observation model:
$$ Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\rightarrow\quad Y\mid X=x\sim\mathcal{N}(w_0+w_1x,\sigma^2) $$
Conditional Likelihood:
$$ P(y_n|\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right) $$
Log-Likelihood of training data:
$$
\begin{aligned}
\log P(\mathcal{D})&=\log\prod_{n=1}^N P(y_n|\mathbf{x}_n)\\
&=\sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right)\\
&=-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\
\end{aligned}
$$
Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\leftrightarrow \min \text{RSS}$
- This gives a solid footing to our intuition: minimizing $RSS(\widetilde{\mathbf{w}})$ is a sensible thing based on reasonable modeling assumptions.
Maximize over $\sigma^2$:
$$
\begin{aligned}
\frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \text{RSS}(\widetilde{\mathbf{w}})\\
\therefore \sigma^{2*} &= \frac{1}{N} \text{RSS}(\widetilde{\mathbf{w}})
\end{aligned}
$$
- Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.
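A small numpy sketch of this noise estimate on synthetic data; the fit uses `np.polyfit`, which solves the same least-squares problem, and the $\pm 2\sigma$ interval is a rough illustration:

```python
import numpy as np

# Illustrative 1-D data with known noise level sigma = 0.5
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=200)

w1, w0 = np.polyfit(x, y, deg=1)                  # least-squares fit (minimizes RSS)
sigma2_mle = np.mean((y - (w0 + w1 * x)) ** 2)    # RSS / N
print(np.sqrt(sigma2_mle))                        # close to 0.5

# A rough ~95% predictive interval at a new input
x_new = 2.0
y_hat = w0 + w1 * x_new
print(y_hat - 2 * np.sqrt(sigma2_mle), y_hat + 2 * np.sqrt(sigma2_mle))
```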
Gradient Descent Method:
pending…
Lecture 5: Overfitting
Pending…
Lecture 6: Naive Bayes
Naive Bayes:
$$ P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} $$
Spam example:
$$
\begin{aligned}
P(\mathbf{X}=\mathbf{x}, Y=c)&=P(Y=c)P(\mathbf{X}=\mathbf{x}|Y=c) \\
&=P(Y=c)\prod_{k=1}^K P(\text{word}_k|Y=c)^{x_k} \\
&=\pi_c\prod_{k=1}^K \theta_{c,k}^{x_k},
\end{aligned}
$$
where $\pi_c = P(Y = c)$ is the prior probability of class $c$, $x_k$ is the number of occurrences of the $k$th word, and $\theta_{c,k} = P(\text{word}_k|Y = c)$ is the probability of the $k$th word under class $c$.
Training Data:
$$ \mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N=\{(\{x_{n,k}\}_{k=1}^K,y_n)\}_{n=1}^N $$
Goal:
Learn $\pi_c$ and $\theta_{c,k}$ from the training data $\mathcal{D}$.
Log-Likelihood:
$$
\begin{aligned}
\mathcal{L}(\pi, \theta) &= \sum_{n=1}^N \log P(\mathbf{x}_n, y_n)\\
&= \sum_{n=1}^N \log \left(\pi_{y_n} \prod_{k=1}^K \theta_{y_n,k}^{x_{n,k}}\right)\\
&= \sum_{n=1}^N \log \pi_{y_n} + \sum_{n=1}^N \sum_{k=1}^K x_{n,k} \log \theta_{y_n,k}
\end{aligned}
$$
Optimization:
Note that $\pi_c$ and $\theta_{c,k}$ can be optimized separately.
$$
\begin{aligned}
\pi_c^* &= \frac{N_c}{N}\quad\rightarrow\text{see Lecture 2}\\
\theta_{c,k}^* &= \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c)}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c)}\\
&=\frac{\text{\# of times word $k$ appears in documents of class $c$}}{\text{total \# of words in documents of class $c$}}\quad\rightarrow\text{see Lecture 2}
\end{aligned}
$$
Classification:
$$
\begin{aligned}
y^* &= \arg\max_c P(y=c|\mathbf{x})\\
&= \arg\max_c P(\mathbf{x}|y=c)P(y=c)\\
&= \arg\max_c \pi_c \prod_{k=1}^K \theta_{c,k}^{x_k}\\
&= \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k}
\end{aligned}
$$
Problem: if $\theta_{c,k}^* = 0$ for some $c$ and $k$, then any test document containing the $k$th word will be classified as not belonging to class $c$.
Solution: Laplace Smoothing
$$ \theta_{c,k}^* = \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c) + \alpha}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c) + K\alpha} $$
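A minimal numpy sketch of these estimators with Laplace smoothing; the function names and count-matrix format are illustrative assumptions, with `X[n, k]` holding the number of times word $k$ appears in document $n$:

```python
import numpy as np

def train_naive_bayes(X, y, num_classes, alpha=1.0):
    """X: (N, K) word-count matrix, y: (N,) class labels in {0, ..., C-1}."""
    N, K = X.shape
    pi = np.zeros(num_classes)
    theta = np.zeros((num_classes, K))
    for c in range(num_classes):
        X_c = X[y == c]
        pi[c] = len(X_c) / N                                      # pi_c = N_c / N
        counts = X_c.sum(axis=0)                                  # word counts within class c
        theta[c] = (counts + alpha) / (counts.sum() + K * alpha)  # Laplace smoothing
    return pi, theta

def predict_naive_bayes(X, pi, theta):
    """Return argmax_c [ log pi_c + sum_k x_k log theta_{c,k} ] for each row of X."""
    log_scores = np.log(pi) + X @ np.log(theta).T   # shape (N, C)
    return np.argmax(log_scores, axis=1)
```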
Lecture 7: Logistic Regression
Recall for Naive Bayes:
$$ y^* = \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k} $$
For binary classification, the label is based on the sign of
$$ \log\pi_1+\sum_{k=1}^K x_k \log \theta_{1,k} - \log\pi_2 - \sum_{k=1}^K x_k \log \theta_{2,k}, $$
which is a linear function of the input features $\mathbf{x}$.
Logistic Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\{0,1\}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
- Model: $P(y=1|\mathbf{x})=\sigma(\mathbf{w}^{\top}\mathbf{x})=g(\mathbf{x})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function
Likelihood:
$$
\begin{aligned}
p(y_n|\mathbf{x}_n;\mathbf{w})&=g(\mathbf{x}_n)^{y_n}(1-g(\mathbf{x}_n))^{1-y_n}\\
\mathcal{L}(\mathbf{w})&=\prod_{n=1}^N p(y_n|\mathbf{x}_n;\mathbf{w})\\
\log \mathcal{L}(\mathbf{w})&=\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right)
\end{aligned}
$$
We use the negative log-likelihood, which is also called the cross-entropy loss:
$$ \varepsilon(\mathbf{w}) = -\log \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right) $$
Gradient:
$$
\begin{aligned}
\frac{d\sigma(z)}{dz} &= \sigma(z)(1-\sigma(z))\\
\therefore \nabla_{\mathbf{w}} \varepsilon(\mathbf{w}) &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n) + (1-y_n) \frac{-1}{1-g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n)\right)\\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) \nabla_{\mathbf{w}} g(\mathbf{x}_n)\\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) g(\mathbf{x}_n)(1-g(\mathbf{x}_n)) \mathbf{x}_n\\
&=-\sum_{n=1}^N \left(y_n - g(\mathbf{x}_n)\right) \mathbf{x}_n\\
&=\sum_{n=1}^N \left(g(\mathbf{x}_n) - y_n\right) \mathbf{x}_n,
\end{aligned}
$$
where $\left(g(\mathbf{x}_n) - y_n\right)$ is the training error for the $n$th training example.
Training:
pending… (page 31)
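A minimal batch gradient-descent sketch that plugs in the gradient derived above; the learning rate and iteration count are arbitrary illustrative choices, and the bias term is assumed to be folded into $\mathbf{x}$ as a constant feature:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy loss.

    X: (N, D) features (append a column of ones for a bias term),
    y: (N,) labels in {0, 1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        g = sigmoid(X @ w)          # g(x_n) for all n
        grad = X.T @ (g - y)        # sum_n (g(x_n) - y_n) x_n
        w -= lr * grad / N          # averaged gradient for a stable step size
    return w
```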
Non-linear Decision Boundary:
Solution: map the inputs into a high-dimensional feature space (which increases the risk of overfitting, so we also add regularization).
Classification Metric:
- True Positive (TP): number of positive examples correctly classified as positive
- False Positive (FP): number of negative examples incorrectly classified as positive
- True Negative (TN): number of negative examples correctly classified as negative
- False Negative (FN): number of positive examples incorrectly classified as negative
- Sensitivity (Recall): $\frac{TP}{TP + FN}$, the proportion of positive examples that are correctly classified as positive
- Specificity: $\frac{TN}{TN + FP}$, the proportion of negative examples that are correctly classified as negative
- Precision: $\frac{TP}{TP + FP}$, the proportion of positive predictions that are correct
Receiver Operating Characteristic (ROC) Curve: plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings.
AUROC: the area under the ROC curve, which measures the overall performance of a binary classifier. A higher AUROC indicates better performance, with a value of $1$ representing a perfect classifier and a value of $0.5$ representing a random classifier.
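A small sketch computing these metrics from hard predictions, plus AUROC from scores via `sklearn.metrics.roc_auc_score` (assuming scikit-learn is available); the labels and scores are made up:

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # assumes scikit-learn is installed

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # made-up labels
scores = np.array([0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1])   # e.g. P(y=1|x)
y_pred = (scores >= 0.5).astype(int)                          # threshold at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)        # recall / true positive rate
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
auroc = roc_auc_score(y_true, scores)   # area under the ROC curve
print(sensitivity, specificity, precision, auroc)
```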
Lecture 8: Multiclass Logistic Regression
Solution: use the softmax function instead of the sigmoid.
Lecture 9-10: Support Vector Machines (SVM)
Advantages of SVM
- Maximizes distance of training points from the boundary.
- Only requires a subset of the training points.
- Is less sensitive to outliers.
- Scales better with high-dimensional data.
- Generalizes well to many nonlinear models.
Idea: Maximize the margin between the decision boundary and the training points.
Distance from a point $\mathbf{x}$ to the decision boundary $\mathbf{w}^{\top}\mathbf{x} + w_0 = 0$:
$$ d_{\mathcal{H}}(\mathbf{x}) = \frac{|\mathbf{w}^{\top}\mathbf{x} + w_0|}{\|\mathbf{w}\|_2} $$
To remove the absolute value, we let $y = +1$ denote the positive label and $y = -1$ the negative label; then, for correctly classified points, $(\mathbf{w}^{\top}\mathbf{x} + w_0)$ and the label $y$ have the same sign. So we get
$$
\begin{aligned}
d_{\mathcal{H}}(\mathbf{x}) &= \frac{y(\mathbf{w}^{\top}\mathbf{x} + w_0)}{\|\mathbf{w}\|_2}\\
\text{MARGIN}(\mathbf{w}, w_0) &= \min_{n} d_{\mathcal{H}}(\mathbf{x}_n)
\end{aligned}
$$
To solve the SVM, we want to maximize the margin.
We can scale $\mathbf{w}$ and $w_0$ by any positive constant without changing the decision boundary, so we can scale them such that $\min_n y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) = 1$, i.e., $\text{MARGIN}(\mathbf{w}, w_0) = \frac{1}{\|\mathbf{w}\|_2}$. This gives the following optimization problem:
$$ \max_{\mathbf{w}, w_0} \text{MARGIN}(\mathbf{w}, w_0) = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|_2} \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1, \forall n $$
This is equivalent to:
$$ \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1, \forall n $$
SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints.
Soft Margin SVM:
Problem: what if the data are not linearly separable?
Solution: introduce slack variables $\xi_n$ and relax the margin constraints:
$$ \xi_n \geq 0, \quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n, \quad \forall n $$
Which gives us the following optimization problem:
$$
\begin{aligned}
\min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n \quad\text{s.t.}\quad &y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n\\
&\xi_n \geq 0, \forall n
\end{aligned}
$$
Where $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the classification error (same idea as the regularization parameter in logistic regression).
Hinge Loss Form:
We can eliminate the slack variables from the constraints: combining $\xi_n \geq 0$ with $y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n$ gives
$$ \xi_n \geq \max(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)), $$
and at the optimum this inequality holds with equality.
By setting $\lambda = \frac{1}{C}$, we can rewrite the optimization problem as:
$$ \min_{\mathbf{w}, w_0} \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$
The first term is called the hinge loss. This can be optimized using gradient descent. There is no penalty for points that are correctly classified and outside the margin, a linear penalty for points that are correctly classified but inside the margin, and a linear penalty for points that are misclassified.
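A rough subgradient-descent sketch for this hinge-loss objective; the learning rate and iteration count are arbitrary choices, and labels are assumed to be $\pm 1$:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.01, num_iters=1000):
    """Subgradient descent on sum_n max(0, 1 - y_n (w^T x_n + w0)) + (lam/2) ||w||^2.

    X: (N, D) features, y: (N,) labels in {-1, +1}.
    """
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(num_iters):
        margins = y * (X @ w + w0)
        active = margins < 1                   # points with nonzero hinge loss
        # Subgradient of the hinge term plus gradient of the L2 penalty
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + lam * w
        grad_w0 = -y[active].sum()
        w -= lr * grad_w / N                   # dividing by N is just a step-size scaling
        w0 -= lr * grad_w0 / N
    return w, w0
```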
Dual Form:
Optimization problem (soft-margin primal):
$$ \min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n $$
Restrictions (canonical form):
$$
\begin{aligned}
-\xi_n &\leq 0, \forall n\\
1-y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n &\leq 0, \forall n
\end{aligned}
$$
Lagrangian:
$$
\begin{aligned}
\mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\}) &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n (1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n)\\
\end{aligned}
$$
under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$ for all $n$.
Goal: $\min_{\mathbf{w}, w_0, \xi_n} \max_{\alpha_n \geq 0, \lambda_n \geq 0}\mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\})$
Idea: if any constraint is violated, the inner maximization can drive the corresponding $\alpha_n$ or $\lambda_n$ to infinity, sending the Lagrangian to $+\infty$; so the min-max is equivalent to the original constrained problem.
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n = 0 &\rightarrow \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n\\
\frac{\partial \mathcal{L}}{\partial w_0} &= -\sum_{n=1}^N \alpha_n y_n = 0 &\rightarrow \sum_{n=1}^N \alpha_n y_n = 0\\
\frac{\partial \mathcal{L}}{\partial \xi_n} &= C - \lambda_n - \alpha_n = 0 &\rightarrow \lambda_n = C - \alpha_n\rightarrow \alpha_n \leq C
\end{aligned}
$$
Properties of the dual:
- It is independent of the dimension $D$ of $\mathbf{x}$: SVM scales better for high-dimensional features.
- It may seem to have many optimization variables when $N$ is large, but many of the $\alpha_n$ turn out to be zero: $\alpha_n$ is nonzero only if the $n$th point is a support vector, so the SVM depends only on a subset of the training points (the support vectors).
If $\alpha_n < C$, then $\lambda_n = C - \alpha_n > 0$, so by complementary slackness $\xi_n = 0$. pending… (page 62)
Learning:
$$ \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n $$
For $(\mathbf{x}_n,y_n)$ such that $0<\alpha_n<C$, we have
$$
\begin{aligned}
y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) &= 1\\
w_0 &= y_n - \mathbf{w}^{\top}\mathbf{x}_n\\
&= y_n - \sum_{m=1}^N \alpha_m y_m \mathbf{x}_m^{\top} \mathbf{x}_n
\end{aligned}
$$
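In practice the dual quadratic program is handled by a solver; a small sketch with scikit-learn's `SVC` (assumed installed), on made-up separable data, showing the support vectors and recovering $\mathbf{w}$ and $w_0$:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Tiny made-up 2-D dataset, labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_n * y_n for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_    # w = sum_n alpha_n y_n x_n
w0 = clf.intercept_
print(clf.support_)                          # indices of the support vectors
print(w, w0, clf.coef_)                      # clf.coef_ matches w for a linear kernel
```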
Kernel SVM:
Similar to the linear regression case, we can use a non-linear basis function $\phi(\mathbf{x})$ to map the input features into a higher-dimensional space. The Lagrangian becomes:
$$
\begin{aligned}
\mathcal{L} &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n (1 - y_n(\mathbf{w}^{\top}\phi(\mathbf{x}_n) + w_0) - \xi_n)\\
&=\frac{1}{2}\|\mathbf{w}\|_2^2 - \sum \alpha_n y_n \mathbf{w}^{\top} \phi(\mathbf{x}_n) + \sum \alpha_n\quad\text{(the other terms cancel after substituting $\lambda_n = C - \alpha_n$ and $\sum_n \alpha_n y_n = 0$)}\\
&=\sum \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)\quad\text{(Substitute $\mathbf{w}=\sum \alpha_n y_n \phi(\mathbf{x}_n)$)}\\
&=\sum \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\quad\text{(Define $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)$)}
\end{aligned}
$$
where $k(\mathbf{x}_i, \mathbf{x}_j)$ is called the kernel function.
$k$ is a valid kernel function if it is symmetric and positive semi-definite. Some popular examples:
- Dot product kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$
- Dot product with positive-definite matrix: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} A \mathbf{x}_j$
- Polynomial kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^{\top} \mathbf{x}_j + 1)^d,d\in\mathbb{Z}^+$
- Radial basis function (RBF) kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2),\gamma>0$
Note that we do not need to know the explicit form of $\phi(\mathbf{x})$ to compute $k(\mathbf{x}_i, \mathbf{x}_j)$, which allows us to work with very high-dimensional feature spaces without incurring a large computational cost. This is called the kernel trick.
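A small numpy illustration of the kernel trick: for a degree-2 polynomial kernel on 2-D inputs, the kernel value matches the inner product under an explicit (and much larger) feature map $\phi$, yet the kernel itself never forms $\phi$; the vectors below are arbitrary examples:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x^T z + 1)^d, computed without any feature map."""
    return (x @ z + 1.0) ** d

def phi_degree2(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi_degree2(x) @ phi_degree2(z))  # both equal 4.0
```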
But how do we predict?
$$
\begin{aligned}
y &= \text{sign}(\mathbf{w}^{\top}\phi(\mathbf{x}) + w_0)\\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n
\phi(\mathbf{x}_n)^{\top} \phi(\mathbf{x}) + w_0\right)\\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n k(\mathbf{x}_n, \mathbf{x}) + w_0\right)
\end{aligned}
$$
pending…