Notes for CMU-18661 Introduction to Machine Learning for Engineers.
Lecture 1: Intro
Link to the course website.
- Homeworks (40%): Both math and programming problems
- Miniexams (15%): Two during the semester
- Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
- Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
- Gradescope Quizzes (5%): Short multiple-choice question quizzes conducted in random (possibly all) lectures to encourage class attendance and attention. We will take the best 10 quiz scores.
Lecture 2: MLE & MAP
Pending…
Lecture 3-4: Linear Regression
Singular Value Decomposition (SVD):
$$ A = U\Sigma V^{\top} $$
$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices
$\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.
The squared singular values $\sigma_i^2$ are the nonzero eigenvalues of $A^{\top}A$ and $AA^{\top}$.
The columns of $V$ are the right singular vectors of $A$ and the eigenvectors of $A^{\top}A$.
The columns of $U$ are the left singular vectors of $A$ and the eigenvectors of $AA^{\top}$.
Proof:
$$
\begin{aligned}
A^{\top}A&=V\Sigma^{\top} U^{\top} U\Sigma V^{\top}\\
&=V\Sigma^{\top} \Sigma V^{\top}\\
&=V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^{\top}
\end{aligned}
$$
The argument for $U$ and $AA^{\top}$ is similar.
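A quick numerical sanity check of these relations in NumPy (the matrix `A` is an arbitrary example):

```python
import numpy as np

# Arbitrary example matrix (m x n).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Full SVD: A = U @ Sigma @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Squared singular values match the nonzero eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)       # ascending order
print(np.allclose(np.sort(s**2), eigvals))  # True

# Reconstruct A from its SVD.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))       # True
```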
Linear Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\mathbb{R}$
- Model: $f(\mathbf{x})=w_0+\mathbf{w}^{\top}\mathbf{x}=\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}$, where $\widetilde{\mathbf{x}}=[1,\ \mathbf{x}^{\top}]^{\top}$ and $\widetilde{\mathbf{w}}=[w_0,\ \mathbf{w}^{\top}]^{\top}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
Residual Sum of Squares (RSS):
$$ \text{RSS}(\mathbf{\widetilde{w}})=\sum_{n=1}^N(y_n-\mathbf{\widetilde{w}}^{\top}\mathbf{\widetilde{x}}_n)^2 $$
For $D=1$:
$$
\begin{aligned}
\frac{\partial RSS}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0\\
\frac{\partial RSS}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0\\
\therefore\sum y_n &= N w_0 + w_1 \sum x_n,\quad\sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2\\
\therefore w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x}
\end{aligned}
$$
For $D>1$ (stack the $\widetilde{\mathbf{x}}_n^{\top}$ as the rows of $\widetilde{\mathbf{X}}$ and the targets into the vector $\mathbf{y}$):
$$
\begin{aligned}
RSS(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N (y_n - \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n)^2\\
&=\sum_n(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n)(y_n-\widetilde{\mathbf{x}}_n^{\top}\widetilde{\mathbf{w}})\\
&=\sum_n(- 2 y_n \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^{\top} \widetilde{\mathbf{w}}) + \text{const}\\
&=\widetilde{\mathbf{w}}^{\top} \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right)^{\top} \widetilde{\mathbf{w}} + \text{const}\\
\text{Set}\quad\nabla_{\widetilde{\mathbf{w}}} RSS &= 2 \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \widetilde{\mathbf{X}}^{\top}\mathbf{y} = 0\\
\text{Least Mean Squares solution:}\quad\widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^{\top}\mathbf{y}
\end{aligned}
$$
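A minimal NumPy sketch of this closed-form solution (solving the normal equations with `np.linalg.solve` rather than forming an explicit inverse; the synthetic data are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x + noise (illustrative only).
N, D = 100, 1
X = rng.normal(size=(N, D))
y = 2.0 + X @ np.array([3.0]) + 0.5 * rng.normal(size=N)

# Augment with a column of ones so that w_tilde = [w_0, w].
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: (X^T X) w_tilde = X^T y.
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_tilde)  # approximately [2, 3]
```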
Why minimize RSS?
Noisy observation model:
$$ Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\rightarrow\quad Y\mid X=x\ \sim\ \mathcal{N}(w_0+w_1x,\sigma^2) $$
Conditional Likelihood:
$$ P(y_n|\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right) $$
Log-Likelihood of training data:
$$
\begin{aligned}
\log P(\mathcal{D})&=\log\prod_{n=1}^N P(y_n|\mathbf{x}_n)\\
&=\sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right)\\
&=-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\
\end{aligned}
$$
Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\leftrightarrow \min \text{RSS}$
- This gives a solid footing to our intuition: minimizing $RSS(\widetilde{\mathbf{w}})$ is a sensible thing based on reasonable modeling assumptions.
Maximize over $\sigma^2$:
$$
\begin{aligned}
\frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \text{RSS}(\widetilde{\mathbf{w}})\\
\therefore \sigma^{2*} &= \frac{1}{N} \text{RSS}(\widetilde{\mathbf{w}})
\end{aligned}
$$
- Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.
Gradient Descent Method:
pending…
Lecture 5: Overfitting
Pending…
Lecture 6: Naive Bayes
Naive Bayes:
$$ P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} $$
Spam example:
$$
\begin{aligned}
P(\mathbf{X}=\mathbf{x}, Y=c)&=P(Y=c)P(\mathbf{X}=\mathbf{x}|Y=c) \\
&=P(Y=c)\prod_{k=1}^K P(\text{word}_k|Y=c)^{x_k} \\
&=\pi_c\prod_{k=1}^K \theta_{c,k}^{x_k},
\end{aligned}
$$
where $\pi_c = P(Y = c)$ is the prior probability of class $c$, $x_k$ is the number of occurrences of the $k$th word, and $\theta_{c,k} = P(\text{word}_k|Y = c)$ is the probability of the $k$th word under class $c$.
Training Data:
$$ \mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N=\{(\{x_{n,k}\}_{k=1}^K,y_n)\}_{n=1}^N $$
Goal:
Learn $\pi_c$ and $\theta_{c,k}$ from the training data $\mathcal{D}$.
Log-Likelihood:
$$
\begin{aligned}
\mathcal{L}(\pi, \theta) &= \sum_{n=1}^N \log P(\mathbf{x}_n, y_n)\\
&= \sum_{n=1}^N \log \left(\pi_{y_n} \prod_{k=1}^K \theta_{y_n,k}^{x_{n,k}}\right)\\
&= \sum_{n=1}^N \log \pi_{y_n} + \sum_{n=1}^N \sum_{k=1}^K x_{n,k} \log \theta_{y_n,k}
\end{aligned}
$$
Optimization:
Note that $\pi_c$ and $\theta_{c,k}$ can be optimized separately.
$$
\begin{aligned}
\pi_c^* &= \frac{N_c}{N}\quad\rightarrow\text{see Lecture 2}\\
\theta_{c,k}^* &= \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c)}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c)}\\
&=\frac{\#\text{ times word } k \text{ appears in documents of class } c}{\#\text{ total words in documents of class } c}\quad\rightarrow\text{see Lecture 2}
\end{aligned}
$$
Classification:
$$
\begin{aligned}
y^* &= \arg\max_c P(y=c|\mathbf{x})\\
&= \arg\max_c P(\mathbf{x}|y=c)P(y=c)\\
&= \arg\max_c \pi_c \prod_{k=1}^K \theta_{c,k}^{x_k}\\
&= \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k}
\end{aligned}
$$
Problem: if $\theta_{c,k}^* = 0$ for some $c$ and $k$, then any test document containing the $k$th word will be classified as not belonging to class $c$.
Solution: Laplace Smoothing
$$ \theta_{c,k}^* = \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c) + \alpha}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c) + K\alpha} $$
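A minimal NumPy sketch of these estimators and the resulting classifier (the function names, the word-count matrix layout, and the default $\alpha=1$ are assumptions for illustration):

```python
import numpy as np

def train_nb(X, y, n_classes, alpha=1.0):
    """X: (N, K) word-count matrix, y: (N,) labels in {0, ..., n_classes-1}."""
    N, K = X.shape
    log_pi = np.zeros(n_classes)
    log_theta = np.zeros((n_classes, K))
    for c in range(n_classes):
        Xc = X[y == c]
        log_pi[c] = np.log(len(Xc) / N)          # pi_c = N_c / N
        counts = Xc.sum(axis=0)                  # count of each word in class c
        # Laplace-smoothed theta_{c,k}
        log_theta[c] = np.log((counts + alpha) / (counts.sum() + K * alpha))
    return log_pi, log_theta

def predict_nb(X, log_pi, log_theta):
    # argmax_c [ log pi_c + sum_k x_k log theta_{c,k} ]
    scores = X @ log_theta.T + log_pi
    return scores.argmax(axis=1)

# Usage: log_pi, log_theta = train_nb(X_train, y_train, n_classes=2)
#        y_hat = predict_nb(X_test, log_pi, log_theta)
```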
Lecture 7: Logistic Regression
Recall for Naive Bayes:
$$ y^* = \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k} $$
For binary classification, the label is based on the sign of
$$ \log\pi_1+\sum_{k=1}^K x_k \log \theta_{1,k} - \log\pi_2 - \sum_{k=1}^K x_k \log \theta_{2,k}, $$
which is a linear function of the input features $\mathbf{x}$.
Logistic Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\{0,1\}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
- Model: $P(y=1|\mathbf{x})=\sigma(\mathbf{w}^{\top}\mathbf{x})=g(\mathbf{x})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function
Likelihood:
$$
\begin{aligned}
p(y_n|\mathbf{x}_n;\mathbf{w})&=g(\mathbf{x}_n)^{y_n}(1-g(\mathbf{x}_n))^{1-y_n}\\
\mathcal{L}(\mathbf{w})&=\prod_{n=1}^N p(y_n|\mathbf{x}_n;\mathbf{w})\\
\log \mathcal{L}(\mathbf{w})&=\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right)
\end{aligned}
$$
We use the negative log-likelihood, which is also called the cross-entropy loss:
$$ \varepsilon(\mathbf{w}) = -\log \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right) $$
Gradient:
$$
\begin{aligned}
\frac{d\sigma(z)}{dz} &= \sigma(z)(1-\sigma(z))\\
\therefore \nabla_{\mathbf{w}} \varepsilon(\mathbf{w}) &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n) + (1-y_n) \frac{-1}{1-g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n)\right)\\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) \nabla_{\mathbf{w}} g(\mathbf{x}_n)\\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) g(\mathbf{x}_n)(1-g(\mathbf{x}_n)) \mathbf{x}_n\\
&=-\sum_{n=1}^N \left(y_n - g(\mathbf{x}_n)\right) \mathbf{x}_n\\
&=\sum_{n=1}^N \left(g(\mathbf{x}_n) - y_n\right) \mathbf{x}_n,
\end{aligned}
$$
where $\left(g(\mathbf{x}_n) - y_n\right)$ is the training error for the $n$th training example.
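A minimal batch gradient-descent sketch based on this gradient (the learning rate, iteration count, and gradient averaging are practical choices, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """X: (N, D) features, y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = sigmoid(X @ w)          # g(x_n) for all n
        grad = X.T @ (g - y)        # sum_n (g(x_n) - y_n) x_n
        # Averaging the gradient is a practical step-size choice.
        w -= lr * grad / len(y)
    return w
```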
Training:
pending… (page 31)
Non-linear Decision Boundary:
Solution: map the inputs into a high-dimensional feature space (which increases the risk of overfitting, so we also add regularization).
Classification Metric:
- True Positive (TP): number of positive examples correctly classified as positive
- False Positive (FP): number of negative examples incorrectly classified as positive
- True Negative (TN): number of negative examples correctly classified as negative
- False Negative (FN): number of positive examples incorrectly classified as negative
- Sensitivity (Recall): $\frac{TP}{TP + FN}$, the proportion of positive examples that are correctly classified as positive
- Specificity: $\frac{TN}{TN + FP}$, the proportion of negative examples that are correctly classified as negative
- Precision: $\frac{TP}{TP + FP}$, the proportion of positive predictions that are correct
Receiver Operating Characteristic (ROC) Curve: plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings.
AUROC: the area under the ROC curve, which measures the overall performance of a binary classifier. A higher AUROC indicates better performance, with a value of $1$ representing a perfect classifier and a value of $0.5$ representing a random classifier.
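A minimal sketch of these metrics, assuming scikit-learn is available for the AUROC (note that `roc_auc_score` takes continuous scores, not thresholded labels):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def confusion_metrics(y_true, scores, threshold=0.5):
    """y_true: (N,) labels in {0, 1}; scores: (N,) predicted probabilities."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision

# AUROC is threshold-free and uses the scores directly:
# auroc = roc_auc_score(y_true, scores)
```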
Lecture 8: Multiclass Logistic Regression
Solution: use the softmax function, the multiclass generalization of the sigmoid.
Lecture 9-10: Support Vector Machines (SVM)
Advantages of SVM
- Maximizes distance of training points from the boundary.
- Only requires a subset of the training points.
- Is less sensitive to outliers.
- Scales better with high-dimensional data.
- Generalizes well to many nonlinear models.
Idea: Maximize the margin between the decision boundary and the training points.
Distance from a point $\mathbf{x}$ to the decision boundary $\mathbf{w}^{\top}\mathbf{x} + w_0 = 0$:
$$ d_{\mathcal{H}}(\mathbf{x}) = \frac{|\mathbf{w}^{\top}\mathbf{x} + w_0|}{\|\mathbf{w}\|_2} $$
To remove the absolute value, we use $y = +1$ for the positive label and $y = -1$ for the negative label, so that $(\mathbf{w}^{\top}\mathbf{x} + w_0)$ and the label $y$ have the same sign whenever the point is correctly classified. So we get
$$
d_{\mathcal{H}}(\mathbf{x}) = \frac{y(\mathbf{w}^{\top}\mathbf{x} + w_0)}{\|\mathbf{w}\|_2}\\
\text{MARGIN}(\mathbf{w}, w_0) = \min_{n} d_{\mathcal{H}}(\mathbf{x}_n)
$$
To solve the SVM, we want to maximize the margin.
We can scale $\mathbf{w}$ and $w_0$ by any positive constant without changing the decision boundary, so we can rescale them such that $\min_n y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) = 1$. The margin then equals $1/\|\mathbf{w}\|_2$, which gives us the following optimization problem:
$$ \max_{\mathbf{w}, w_0} \text{MARGIN}(\mathbf{w}, w_0) = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|_2} \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1, \forall n $$
This is further equivalent to:
$$ \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1, \forall n $$
SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints.
Soft Margin SVM:
Problem: what if the data are not fully linearly separable?
Solution: Slack variable
$$ \xi_n \geq 0 \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n $$
This gives us the following optimization problem:
$$
\begin{aligned}
\min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n \quad\text{s.t.}\quad &y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n\\
&\xi_n \geq 0, \forall n
\end{aligned}
$$
where $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the classification error (the same idea as the regularization parameter in logistic regression).
Hinge Loss Form:
We want to extract the slack variable $\xi_n$ from the constraints, which gives us:
$$ y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n \leftrightarrow \xi_n \geq \max(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)) $$
By setting $\lambda = \frac{1}{C}$, we can rewrite the optimization problem as:
$$ \min_{\mathbf{w}, w_0} \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$
The first term is called the hinge loss. This can be optimized using gradient descent. There is no penalty for points that are correctly classified and outside the margin, a linear penalty for points that are correctly classified but inside the margin, and a linear penalty for points that are misclassified.
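A minimal (sub)gradient-descent sketch for this hinge-loss objective (labels are assumed to be in $\{-1,+1\}$; the step size, iteration count, and gradient scaling are arbitrary practical choices):

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.1, n_iters=1000):
    """X: (N, D), y: (N,) labels in {-1, +1}.
    Minimizes sum_n max(0, 1 - y_n (w^T x_n + w0)) + (lam/2) ||w||^2."""
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + w0)
        viol = margins < 1                       # inside the margin or misclassified
        # Subgradient of the hinge loss plus the regularizer (w0 is not regularized).
        grad_w = -(X[viol].T @ y[viol]) + lam * w
        grad_w0 = -np.sum(y[viol])
        # Scaling by N is just a step-size choice.
        w -= lr * grad_w / N
        w0 -= lr * grad_w0 / N
    return w, w0
```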
Dual Form:
Optimization problem:
$$ \begin{aligned} \min_{\mathbf{w}, w_0} \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 \end{aligned} $$
Restrictions (canonical form):
$$
\begin{aligned}
-\xi_n &\leq 0, \forall n\\
1-y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n &\leq 0, \forall n
\end{aligned}
$$
Lagrangian:
$$
\begin{aligned}
\mathcal{L}(\mathbf{w}, w_0, {\xi_n}, {\alpha_n}, {\lambda_n}) &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n (1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n)\\
\end{aligned}
$$
under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$ for all $n$.
Goal: $\min_{\mathbf{w}, w_0, \xi_n} \max_{\alpha_n \geq 0, \lambda_n \geq 0}\mathcal{L}(\mathbf{w}, w_0, {\xi_n}, {\alpha_n}, {\lambda_n})$
Idea: if any constraint is violated, the inner maximization over the unbounded $\alpha_n$ and $\lambda_n$ drives the Lagrangian to infinity, so the min-max recovers the original constrained problem.
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n = 0 &\rightarrow \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n\\
\frac{\partial \mathcal{L}}{\partial w_0} &= -\sum_{n=1}^N \alpha_n y_n = 0 &\rightarrow \sum_{n=1}^N \alpha_n y_n = 0\\
\frac{\partial \mathcal{L}}{\partial \xi_n} &= C - \lambda_n - \alpha_n = 0 &\rightarrow \lambda_n = C - \alpha_n\rightarrow \alpha_n \leq C
\end{aligned}
$$
Features:
- Independent of the dimension $d$ of $\mathbf{x}$: SVM scales better for high-dimensional features.
- May seem like a lot of optimization variables when $N$ is large, but many of the $\alpha_n$ become zero: $\alpha_n$ is non-zero only if the $n$th point is a support vector, so the SVM solution depends only on a subset of the training points (the support vectors).
$\alpha_n<C$ only when $\xi_n=0$, pending… (page 62)
Learning:
$$ \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n $$
For $(\mathbf{x}_n,y_n)$ such that $0<\alpha_n<C$, we have
$$
\begin{aligned}
y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) &= 1\\
w_0 &= y_n - \mathbf{w}^{\top}\mathbf{x}_n\\
&= y_n - \sum_{m=1}^N \alpha_m y_m \mathbf{x}_m^{\top} \mathbf{x}_n
\end{aligned}
$$
Kernel SVM:
Similar to the linear regression case, we can use a non-linear basis function $\phi(\mathbf{x})$ to map the input features into a higher-dimensional space. The optimization problem becomes:
$$
\begin{aligned}
\mathcal{L} &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n (1 - y_n(\mathbf{w}^{\top}\phi(\mathbf{x}_n) + w_0) - \xi_n)\\
&=\frac{1}{2}\|\mathbf{w}\|_2^2 - \sum \alpha_n y_n \mathbf{w}^{\top} \phi(\mathbf{x}_n) + \sum \alpha_n\quad\text{(other terms sum to zero after applying the stationarity conditions)}\\
&=\sum \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)\quad\text{(Substitute $\mathbf{w}=\sum \alpha_n y_n \phi(\mathbf{x}_n)$)}\\
&=\sum \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\quad\text{(Define $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)$)}
\end{aligned}
$$
where $k(\mathbf{x}_i, \mathbf{x}_j)$ is called the kernel function.
$k$ is a valid kernel function if it is symmetric and positive semi-definite. Some popular examples:
- Dot product kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$
- Dot product with positive-definite matrix: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} A \mathbf{x}_j$
- Polynomial kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^{\top} \mathbf{x}_j + 1)^d,d\in\mathbb{Z}^+$
- Radial basis function (RBF) kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2),\gamma>0$
Note that we do not need to know the explicit form of $\phi(\mathbf{x})$ to compute $k(\mathbf{x}_i, \mathbf{x}_j)$, which allows us to work with very high-dimensional feature spaces without incurring a large computational cost. This is called the kernel trick.
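A small numerical check of the kernel trick for the polynomial kernel with $d=2$: the kernel value matches the dot product of an explicit degree-2 feature map (the map `phi` below is one standard construction, written out only for this check):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

def poly_kernel(a, b):
    # Polynomial kernel of degree 2: (a^T b + 1)^2
    return (a @ b + 1.0) ** 2

def phi(v):
    # Explicit degree-2 feature map for 2-d input.
    return np.array([1.0,
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     v[0] ** 2, v[1] ** 2,
                     np.sqrt(2) * v[0] * v[1]])

print(np.isclose(poly_kernel(x, z), phi(x) @ phi(z)))  # True
```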
But how do we predict?
$$
\begin{aligned}
y &= \text{sign}(\mathbf{w}^{\top}\phi(\mathbf{x}) + w_0)\\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n
\phi(\mathbf{x}_n)^{\top} \phi(\mathbf{x}) + w_0\right)\\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n k(\mathbf{x}_n, \mathbf{x}) + w_0\right)
\end{aligned}
$$
Lecture 11: Nearest Neighbors
Two crucial choices for NN
- Choosing $K$, i.e., the number of nearest neighbors (default is $1$)
- Choosing the right distance measure: Euclidean distance, $L_1$ or $L_p$ distance, feature scaling, …
- In practice, these are hyperparameters that need to be tuned using a validation set or cross-validation.
- When $K$ increases, the decision boundary becomes smoother and less susceptible to outliers (i.e., lower variance).
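A minimal $K$-nearest-neighbors prediction sketch using Euclidean distance (function names and defaults are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Majority vote over the k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])       # majority vote
    return np.array(preds)
```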
pending…
Lecture 12: Decision Trees
Learning a Tree Model, things to learn:
- The structure of the tree.
- The split rule at each node.
- Which feature to consider?
- What is the threshold $\theta_i$ ?
- What about categorical features?
- The predicted values for the leaves.
First split: use Information Gain
If a random variable $Y$ takes $K$ different values, $a_1, a_2, \ldots, a_K$, then its entropy is
$$ H(Y) = -\sum_{k=1}^K p(a_k) \log p(a_k) $$
where $p(a_k)$ is the probability of $Y$ taking the value $a_k$.
Conditional entropy of $Y$ given $X$:
$$ H(Y|X) = \sum_{x} p(x) H(Y|X=x) $$
Information Gain:
$$ \text{IG}(X; Y) = H(Y) - H(Y|X) $$
Measures the reduction in entropy (i.e., the reduction of uncertainty in $Y$) when we also consider $X$.
For continuous features, Information Gain:
$$ \text{IG}(X; Y) = H(Y) - H(Y|\text{whether }X \leq \theta) $$
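A minimal sketch of entropy and information gain for a categorical feature (the base-2 logarithm is a convention assumed here; for a continuous feature, replace $x$ with the indicator of whether $X \leq \theta$):

```python
import numpy as np

def entropy(y):
    # H(Y) = -sum_k p(a_k) log p(a_k)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    # IG(X; Y) = H(Y) - sum_x p(x) H(Y | X = x)
    values, counts = np.unique(x, return_counts=True)
    cond = sum((c / len(x)) * entropy(y[x == v]) for v, c in zip(values, counts))
    return entropy(y) - cond
```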
Strategies to avoid overfitting:
- Stop growing when data split is not statistically significant.
- Acquire more training data.
- Remove irrelevant attributes (manual process: not always possible).
- Grow full tree, then post-prune.
- Use tree ensembles (random forest, boosting, to be covered next lecture).
Lecture 13: Midterm Exam
No content.
Lecture 14: Ensemble Methods
Motivation:
- Shallow trees have low variance (don’t overfit) but high bias (underfit).
- Can we combine several shallow trees to reduce bias (without sacrificing variance)?
- Deep trees have low bias, but high variance.
- Can we combine several deep trees to reduce variance (without sacrificing bias)?
Bagging (used in random forests): train each tree on a random bootstrap sample of the training data (drawn with replacement)
- Limitation of Bagging: If one or more features are very informative, they will be selected first by every tree in the bag, which increases the correlation between the trees (thus harms our goal of creating “random variation” in the training process to reduce variance).
- Key idea behind Random Forests: reduce correlation between trees by also restricting each split to a random subset of the features.
Boosting: train each tree on a re-weighted version of the training data, where the weights are updated based on the performance of the previous tree.
- Initialize weights $w_1(n) = \frac{1}{N}$ for every training sample $n$
- For $t = 1$ to $T$ (number of rounds, hyperparameter):
- Train a weak classifier $h_t(x)$ using weights $w_t(n)$, by minimizing
$$\epsilon_t = \sum_{n} w_t(n) 1[y_n \neq h_t(x_n)]$$
(the weighted classification error)
- Compute contribution for this classifier: $\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$ (the smaller the error $\epsilon_t$, the larger the contribution $\beta_t$)
- Update weights on each training sample $n$
$$w_{t+1}(n) \propto w_t(n) e^{-\beta_t y_n h_t(x_n)}$$
($w_t(n)$ is decreased if $y_n = h_t(x_n)$ and increased if $y_n \neq h_t(x_n)$), and normalize them such that $\sum_{n} w_{t+1}(n) = 1$.
- Output the final classifier:
$$H(x) = \text{sign}\left(\sum_{t=1}^T \beta_t h_t(x)\right)$$
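A minimal AdaBoost sketch following these steps, using depth-1 decision trees (stumps) from scikit-learn as the weak classifiers (labels are assumed to be in $\{-1,+1\}$; the choice of stump is an illustration, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """X: (N, D), y: (N,) labels in {-1, +1}. Returns weak learners and weights beta."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # initialize w_1(n) = 1/N
    learners, betas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w * (pred != y))            # weighted classification error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against perfect/terrible stumps
        beta = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-beta * y * pred)         # down-weight correct, up-weight incorrect
        w /= w.sum()                             # normalize the weights
        learners.append(h)
        betas.append(beta)
    return learners, betas

def adaboost_predict(X, learners, betas):
    scores = sum(b * h.predict(X) for h, b in zip(learners, betas))
    return np.sign(scores)
```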
Why this works? pending…
Lecture 15-17: Neural Networks
Lecture 18: PyTorch
No content.