Lecture 1: Intro

  • Homeworks (40%): Both math and programming problems
  • Miniexams (15%): Two during the semester
  • Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
  • Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
  • Gradescope Quizzes (5%): Short multiple-choice quizzes given in randomly chosen (possibly all) lectures to encourage class attendance and attention. We will take the best 10 quiz scores.

Lecture 2: MLE & MAP

Pending…

Lecture 3-4: Linear Regression

Singular Value Decomposition (SVD):

$$ A = U\Sigma V^T $$

$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices

$\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.

The squared singular values $\sigma_i^2$ are the eigenvalues of $A^TA$ and $AA^T$.

The columns of $V$ are the right singular vectors of $A$, and they are also eigenvectors of $A^TA$.

The columns of $U$ are the left singular vectors of $A$, and they are also eigenvectors of $AA^T$.

Proof:

$$ \begin{aligned} A^TA&=V\Sigma^T U^T U\Sigma V^T \\
&=V\Sigma^T \Sigma V^T \\
&=V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^T \end{aligned} $$

The argument for $AA^T$ (and hence for $U$) is analogous.
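As a quick numerical sanity check (a sketch, not from the lecture), NumPy's SVD on a random matrix confirms these relationships; all names below are illustrative.

```python
import numpy as np

# Illustrative check of the SVD relationships on a random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U @ Sigma @ V^T

# Squared singular values match the leading eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eigvals[: len(s)]))        # True

# Reconstruct A from U, Sigma, V^T.
Sigma = np.zeros((5, 3))
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))              # True
```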

Linear Regression:

Setup:

  • Input: $\mathbf{x}\in\mathbb{R}^D$
  • Output: $y\in\mathbb{R}$
  • Model: $f(\mathbf{x})=w_0+\mathbf{w}^T\mathbf{x}=\widetilde{\mathbf{w}}^T\widetilde{\mathbf{x}}$
  • Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$

Residual Sum of Squares (RSS):

$$ \text{RSS}(\mathbf{\widetilde{w}})=\sum_{n=1}^N(y_n-\mathbf{\widetilde{w}}^T\mathbf{\widetilde{x}}_n)^2 $$

For $D=1$:

$$ \begin{aligned} \frac{\partial \text{RSS}}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0 \\
\frac{\partial \text{RSS}}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0 \\
\therefore\quad \sum y_n &= N w_0 + w_1 \sum x_n,\quad\sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2 \\
\therefore\quad w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x} \end{aligned} $$
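A minimal NumPy sketch of the $D=1$ closed-form solution on synthetic data (illustrative names and values):

```python
import numpy as np

# Synthetic 1-D data: y = 2 + 3x + noise (illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(100)

# Closed-form least-squares estimates for D = 1.
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # approximately 2 and 3
```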

For $D>1$:

$$ \begin{aligned} \text{RSS}(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N (y_n - \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n)^2 \\
&=\sum_n(y_n-\widetilde{\mathbf{w}}^T\widetilde{\mathbf{x}}_n)(y_n-\widetilde{\mathbf{x}}_n^T\widetilde{\mathbf{w}}) \\
&=\sum_n\left(- 2 y_n \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^T \widetilde{\mathbf{w}}\right) + \text{const} \\
&=\widetilde{\mathbf{w}}^T \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2\,\widetilde{\mathbf{w}}^T\left(\widetilde{\mathbf{X}}^T\mathbf{y}\right) + \text{const} \\
\text{Set}\quad \nabla_{\widetilde{\mathbf{w}}} \text{RSS} &= 2 \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^T\mathbf{y}\right) = 0 \\
\text{Least Mean Squares solution:}\quad \widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^T\mathbf{y} \end{aligned} $$
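A matching sketch for $D>1$ using the normal equations (illustrative; solving the linear system rather than forming an explicit inverse is the usual practice):

```python
import numpy as np

# Synthetic data with D = 2 features (illustrative).
rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.standard_normal((N, D))
w_true = np.array([1.0, -2.0, 0.5])            # [w0, w1, w2]
y = w_true[0] + X @ w_true[1:] + 0.1 * rng.standard_normal(N)

# Augment with a column of ones so w0 is absorbed into w-tilde.
X_tilde = np.hstack([np.ones((N, 1)), X])

# Normal equations: (X^T X) w = X^T y, solved directly.
w_lms = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(w_lms)   # approximately [1, -2, 0.5]
```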

Why minimize RSS?

Noisy observation model:

$$ Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\rightarrow\quad Y\sim\mathcal{N}(w_0+w_1x,\sigma^2) $$

Conditional Likelihood:

$$ P(y_n|\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right) $$

Log-Likelihood of training data:

$$ \begin{aligned} \log P(\mathcal{D})&=\log\prod_{n=1}^N P(y_n|\mathbf{x}_n) \\
&=\sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right) \\
&=-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2 \end{aligned} $$

Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\leftrightarrow \min \text{RSS}$

  • This gives a solid footing to our intuition: minimizing $RSS(\widetilde{\mathbf{w}})$ is a sensible thing based on reasonable modeling assumptions.

Maximize over $\sigma^2$:

$$ \begin{aligned} \frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2 \\
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \text{RSS}(\widetilde{\mathbf{w}}) = 0 \\
\therefore\quad \sigma^{2*} &= \frac{1}{N} \text{RSS}(\widetilde{\mathbf{w}}) \end{aligned} $$

  • Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.
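A short sketch of this noise-variance estimate and the resulting approximate 95% interval around a prediction (Gaussian-noise assumption; the data and the `np.polyfit` line fit are illustrative, not part of the lecture):

```python
import numpy as np

# Illustrative: fit a line, then estimate the noise variance sigma^2 = RSS / N.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + 0.5 * rng.standard_normal(100)

w1, w0 = np.polyfit(x, y, deg=1)                       # least-squares line fit
sigma2 = np.sum((y - (w0 + w1 * x)) ** 2) / len(y)     # MLE of the noise variance

# Approximate 95% interval around a new prediction.
x_new = 0.3
y_hat = w0 + w1 * x_new
print(f"{y_hat:.2f} +/- {1.96 * np.sqrt(sigma2):.2f}")
```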

Gradient Descent Method:

pending…

Lecture 5: Overfitting

Pending…

Lecture 6: Naive Bayes

Naive Bayes: (spam example)

$$ P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} $$

$$ \begin{aligned} P(\mathbf{X}=\mathbf{x}, Y=c)&=P(Y=c)P(\mathbf{X}=\mathbf{x}|Y=c) \\
&=P(Y=c)\prod_{k=1}^K P(\text{word}_k|Y=c)^{x_k} \\
&=\pi_c\prod_{k=1}^K \theta_{c,k}^{x_k}, \end{aligned} $$

where $\pi_c = P(Y = c)$ is the prior probability of class $c$, $x_k$ is the number of occurrences of the $k$-th word, and $\theta_{c,k} = P(\text{word}_k | Y = c)$ is the weight of the $k$-th word for the $c$-th class.

Training Data:

$$ \mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N=\{(\{x_{n,k}\}_{k=1}^K,y_n)\}_{n=1}^N $$

Goal:

Learn $\pi_c$ and $\theta_{c,k}$ from the training data $\mathcal{D}$.

Log-Likelihood:

$$ \begin{aligned} \mathcal{L}(\pi, \theta) &= \sum_{n=1}^N \log P(\mathbf{x}_n, y_n) \\
&= \sum_{n=1}^N \log \left(\pi_{y_n} \prod_{k=1}^K \theta_{y_n,k}^{x_{n,k}}\right) \\
&= \sum_{n=1}^N \log \pi_{y_n} + \sum_{n=1}^N \sum_{k=1}^K x_{n,k} \log \theta_{y_n,k} \end{aligned} $$

Optimization:

Note that $\pi_c$ and $\theta_{c,k}$ can be optimized separately.

$$ \begin{aligned} \pi_c^* &= \frac{N_c}{N}\quad\rightarrow\text{see Lecture 2} \\
\theta_{c,k}^* &= \frac{\sum_{n=1}^N x_{n,k}\, \mathbb{I}(y_n = c)}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'}\, \mathbb{I}(y_n = c)} \\
&=\frac{\text{\# times word }k\text{ appears in documents of class }c}{\text{\# total words in documents of class }c}\quad\rightarrow\text{see Lecture 2} \end{aligned} $$

Classification:

$$ \begin{aligned} y^* &= \arg\max_c P(y=c|\mathbf{x}) \\
&= \arg\max_c P(\mathbf{x}|y=c)P(y=c) \\
&= \arg\max_c \pi_c \prod_{k=1}^K \theta_{c,k}^{x_k} \\
&= \arg\max_c \left(\log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k}\right) \end{aligned} $$

Problem: if $\theta_{c,k}^* = 0$ for some $c$ and $k$, then any test document containing the $k$-th word will be classified as not belonging to class $c$.

Solution: Laplace Smoothing

$$ \theta_{c,k}^* = \frac{\sum_{n=1}^N x_{n,k}\, \mathbb{I}(y_n = c) + \alpha}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'}\, \mathbb{I}(y_n = c) + K\alpha} $$
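A compact NumPy sketch of these estimators and the resulting classifier (illustrative; `X` is assumed to be an $N\times K$ word-count matrix and `y` a vector of class labels, with names chosen for illustration):

```python
import numpy as np

def train_naive_bayes(X, y, num_classes, alpha=1.0):
    """X: (N, K) word-count matrix, y: (N,) class labels in {0..C-1}."""
    N, K = X.shape
    pi = np.zeros(num_classes)
    theta = np.zeros((num_classes, K))
    for c in range(num_classes):
        Xc = X[y == c]
        pi[c] = len(Xc) / N                                            # pi_c = N_c / N
        theta[c] = (Xc.sum(axis=0) + alpha) / (Xc.sum() + K * alpha)   # Laplace smoothing
    return pi, theta

def predict(X, pi, theta):
    # argmax_c  log pi_c + sum_k x_k log theta_{c,k}
    log_post = np.log(pi) + X @ np.log(theta).T        # (N, C)
    return np.argmax(log_post, axis=1)

# Tiny illustrative example: 2 classes, 3-word vocabulary.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
y = np.array([0, 0, 1, 1])
pi, theta = train_naive_bayes(X, y, num_classes=2)
print(predict(X, pi, theta))                           # expected: [0 0 1 1]
```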

Lecture 7: Logistic Regression

Recall for Naive Bayes:

$$ y^* = \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k} $$

For binary classification, the label is based on the sign of

$$ \log\pi_1+\sum_{k=1}^K x_k \log \theta_{1,k} - \log\pi_2 - \sum_{k=1}^K x_k \log \theta_{2,k}, $$

which is a linear function of the input features $\mathbf{x}$.

Logistic Regression:

Setup:

  • Input: $\mathbf{x}\in\mathbb{R}^D$
  • Output: $y\in\{0,1\}$
  • Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
  • Model: $P(y=1|\mathbf{x})=\sigma(\mathbf{w}^T\mathbf{x})=g(\mathbf{x})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function

Likelihood:

$$ \begin{aligned} p(y_n|\mathbf{x}_n;\mathbf{w})&=g(\mathbf{x}_n)^{y_n}(1-g(\mathbf{x}_n))^{1-y_n} \\
\mathcal{L}(\mathbf{w})&=\prod_{n=1}^N p(y_n|\mathbf{x}_n;\mathbf{w}) \\
\log \mathcal{L}(\mathbf{w})&=\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right) \end{aligned} $$

We use the negative log-likelihood, which is also called the cross-entropy loss:

$$ \varepsilon(\mathbf{w}) = -\log \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right) $$

Gradient:

$$ \begin{aligned} \frac{d\sigma(z)}{dz} &= \sigma(z)(1-\sigma(z)) \\
\therefore\quad \nabla_{\mathbf{w}} \varepsilon(\mathbf{w}) &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n) + (1-y_n) \frac{-1}{1-g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n)\right) \\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) \nabla_{\mathbf{w}} g(\mathbf{x}_n) \\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) g(\mathbf{x}_n)(1-g(\mathbf{x}_n))\, \mathbf{x}_n \\
&=-\sum_{n=1}^N \left(y_n - g(\mathbf{x}_n)\right) \mathbf{x}_n \\
&=\sum_{n=1}^N \left(g(\mathbf{x}_n) - y_n\right) \mathbf{x}_n, \end{aligned} $$

where $\left(g(\mathbf{x}_n) - y_n\right)$ is the training error for the $n$-th training example.
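A small sketch that implements this gradient and checks it against finite differences (illustrative data and names; not the lecture's training procedure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    g = sigmoid(X @ w)
    return -np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

def gradient(w, X, y):
    # nabla_w eps(w) = sum_n (g(x_n) - y_n) x_n
    return X.T @ (sigmoid(X @ w) - y)

# Illustrative check of the formula against central finite differences.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.standard_normal(3)

eps = 1e-6
numeric = np.array([
    (cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(w, X, y), numeric, atol=1e-4))   # True
```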

Training:

pending… (page 31)

Non-linear Decision Boundary:

Solution: map the inputs to a high-dimensional feature space, which increases the risk of overfitting, so we also need to add regularization.

Classification Metric:

  • True Positive (TP): number of positive examples correctly classified as positive
  • False Positive (FP): number of negative examples incorrectly classified as positive
  • True Negative (TN): number of negative examples correctly classified as negative
  • False Negative (FN): number of positive examples incorrectly classified as negative
  • Sensitivity (Recall): $\frac{TP}{TP + FN}$, the proportion of positive examples that are correctly classified as positive
  • Specificity: $\frac{TN}{TN + FP}$, the proportion of negative examples that are correctly classified as negative
  • Precision: $\frac{TP}{TP + FP}$, the proportion of positive predictions that are correct

Receiver Operating Characteristic (ROC) Curve: plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings.

AUROC: the area under the ROC curve, which measures the overall performance of a binary classifier. A higher AUROC indicates better performance, with a value of $1$ representing a perfect classifier and a value of $0.5$ representing a random classifier.
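A short sketch computing these metrics and an AUROC via a threshold sweep and the trapezoidal rule (illustrative scores and labels; real pipelines typically use a library routine):

```python
import numpy as np

def confusion_counts(y_true, y_score, threshold):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp, fp, tn, fn

# Illustrative labels and scores (positives tend to score higher).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = y_true * 0.4 + rng.uniform(size=500) * 0.6

tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold=0.5)
print("recall", tp / (tp + fn), "specificity", tn / (tn + fp), "precision", tp / (tp + fp))

# ROC curve by sweeping the threshold from 1 down to 0; AUROC by the trapezoidal rule.
thresholds = np.linspace(1, 0, 101)
tpr = np.array([confusion_counts(y_true, y_score, t)[0] for t in thresholds]) / np.sum(y_true == 1)
fpr = np.array([confusion_counts(y_true, y_score, t)[1] for t in thresholds]) / np.sum(y_true == 0)
auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print("AUROC", auroc)
```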

Lecture 8: Multiclass Logistic Regression

Solution: use the softmax function instead of the sigmoid.

Lecture 9-10: Support Vector Machines (SVM)

Advantages of SVM

  1. Maximizes distance of training points from the boundary.
  2. Only requires a subset of the training points.
  3. Is less sensitive to outliers.
  4. Scales better with high-dimensional data.
  5. Generalizes well to many nonlinear models.

Idea: Maximize the margin between the decision boundary and the training points.

Distance from a point $\mathbf{x}$ to the decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$:

$$ d_{\mathcal{H}}(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + w_0|}{\|\mathbf{w}\|_2} $$

To remove the absolute value, we use $y = +1$ for the positive label and $y = -1$ for the negative label; then, for correctly classified points, $(\mathbf{w}^T\mathbf{x} + w_0)$ and the label $y$ have the same sign. So we get

$$ d_{\mathcal{H}}(\mathbf{x}) = \frac{y(\mathbf{w}^T\mathbf{x} + w_0)}{\|\mathbf{w}\|_2},\qquad \text{MARGIN}(\mathbf{w}, w_0) = \min_{n}\, d_{\mathcal{H}}(\mathbf{x}_n) $$

To solve the SVM, we want to maximize the margin.

We can scale $\mathbf{w}$ and $w_0$ by any positive constant without changing the decision boundary, so we can scale them so that $\min_n y_n(\mathbf{w}^T\mathbf{x}_n + w_0) = 1$. The margin is then $1/\|\mathbf{w}\|_2$, which gives us the following optimization problem:

$$ \max_{\mathbf{w}, w_0} \text{MARGIN}(\mathbf{w}, w_0) = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|_2} \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1, \forall n $$

This is equivalent to:

$$ \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1, \forall n $$

SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints.

Soft Margin SVM:

Problem: Not fully linear separable data?

Solution: Slack variable

$$ \xi_n \geq 0 \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n $$

This gives us the following optimization problem:

$$ \begin{aligned} \min_{\mathbf{w}, w_0, \xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n \quad\text{s.t.}\quad &y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n \\
&\xi_n \geq 0, \forall n \end{aligned} $$

Where $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the classification error (same idea as the regularization parameter in logistic regression).

Hinge Loss Form:

We want to extract the slack variable $\xi_n$ from the constraints, which gives us:

$$ \xi_n \geq 0,\ \ y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n \quad\leftrightarrow\quad \xi_n \geq \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) $$

By setting $\lambda = \frac{1}{C}$, we can rewrite the optimization problem as:

$$ \min_{\mathbf{w}, w_0}\ \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$

The first term is called the hinge loss. This can be optimized using gradient descent. There is no penalty for points that are correctly classified and outside the margin, a linear penalty for points that are correctly classified but inside the margin, and a linear penalty for points that are misclassified.
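A minimal subgradient-descent sketch for this hinge-loss objective (illustrative data, step size, and $\lambda$; the bias $w_0$ is left unregularized, matching the objective above):

```python
import numpy as np

# Synthetic linearly separable data with labels in {-1, +1} (illustrative).
rng = np.random.default_rng(0)
N, D = 200, 2
X = rng.standard_normal((N, D))
y = np.sign(X @ np.array([1.5, -1.0]) + 0.3)

lam, lr = 0.01, 0.1
w, w0 = np.zeros(D), 0.0
for _ in range(1000):
    margins = y * (X @ w + w0)
    active = margins < 1                     # points inside the margin or misclassified
    # Subgradient of sum_n max(0, 1 - y_n(w^T x_n + w0)) + (lambda/2)||w||^2
    grad_w = -X[active].T @ y[active] + lam * w
    grad_w0 = -np.sum(y[active])
    w -= lr * grad_w / N                     # step scaled by N for stability
    w0 -= lr * grad_w0 / N
print(w, w0)
```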

Dual Form:

Optimization problem:

$$ \min_{\mathbf{w}, w_0}\ \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$

Restrictions (canonical form):

$$ \begin{aligned} -\xi_n &\leq 0, \quad\forall n \\
1-y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - \xi_n &\leq 0, \quad\forall n \end{aligned} $$

Lagrangian:

$$ \mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\}) = C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - \xi_n\right) $$

under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$ for all $n$.

Goal: $\min_{\mathbf{w}, w_0, \xi_n} \max_{\alpha_n \geq 0, \lambda_n \geq 0}\ \mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\})$

Idea: if any constraint is violated, the inner maximization can drive $\alpha_n$ or $\lambda_n$ to infinity and send the Lagrangian to infinity, so the outer minimization is forced to respect the constraints.

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n = 0 \quad\rightarrow\quad \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n \\
\frac{\partial \mathcal{L}}{\partial w_0} &= -\sum_{n=1}^N \alpha_n y_n = 0 \quad\rightarrow\quad \sum_{n=1}^N \alpha_n y_n = 0 \\
\frac{\partial \mathcal{L}}{\partial \xi_n} &= C - \lambda_n - \alpha_n = 0 \quad\rightarrow\quad \lambda_n = C - \alpha_n \rightarrow \alpha_n \leq C \end{aligned} $$

Features:

  • Independent of the dimension $D$ of $\mathbf{x}$: the dual depends on the inputs only through inner products, so SVM scales better with high-dimensional features.
  • May seem like a lot of optimization variables when $N$ is large, but many of the $\alpha_n$ turn out to be zero: $\alpha_n$ is non-zero only if the $n$-th point is a support vector, so the SVM solution depends only on a subset of the training points (the support vectors).

$\alpha_n<C$ only when $\xi_n=0$, pending… (page 62)

Learning:

$$ \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n $$

For $(\mathbf{x}_n,y_n)$ such that $0<\alpha_n<C$, we have

$$ \begin{aligned} y_n(\mathbf{w}^T\mathbf{x}_n + w_0) &= 1 \\
w_0 &= y_n - \mathbf{w}^T\mathbf{x}_n \\
&= y_n - \sum_{m=1}^N \alpha_m y_m \mathbf{x}_m^T \mathbf{x}_n \end{aligned} $$

Kernel SVM:

Similar to the linear regression case, we can use a non-linear basis function $\phi(\mathbf{x})$ to map the input features into a higher-dimensional space. The optimization problem becomes:

$$ \begin{aligned} \mathcal{L} &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^T\phi(\mathbf{x}_n) + w_0) - \xi_n\right) \\
&=\frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_n \alpha_n y_n \mathbf{w}^T \phi(\mathbf{x}_n) + \sum_n \alpha_n\quad\text{(the other terms sum to zero)} \\
&=\sum_n \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\quad\text{(substitute $\mathbf{w}=\sum_n \alpha_n y_n \phi(\mathbf{x}_n)$)} \\
&=\sum_n \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\quad\text{(define $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$)} \end{aligned} $$

where $k(\mathbf{x}_i, \mathbf{x}_j)$ is called the kernel function.

$k$ is a valid kernel function if it is symmetric and positive semi-definite. Some popular examples:

  • Dot product kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
  • Dot product with positive-definite matrix: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T A \mathbf{x}_j$
  • Polynomial kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + 1)^d,d\in\mathbb{Z}^+$
  • Radial basis function (RBF) kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|_2^2),\gamma>0$

Note that we do not need to know the explicit form of $\phi(\mathbf{x})$ to compute $k(\mathbf{x}_i, \mathbf{x}_j)$, which allows us to work with very high-dimensional feature spaces without incurring a large computational cost. This is called the kernel trick.

But how do we predict?

$$ \begin{aligned} y^* &= \text{sign}(\mathbf{w}^T \phi(\mathbf{x}^*) + w_0) \\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n \phi(\mathbf{x}_n)^T \phi(\mathbf{x}^*) + w_0\right) \\
&= \text{sign}\left(\sum_{n=1}^N \alpha_n y_n k(\mathbf{x}_n, \mathbf{x}^*) + w_0\right) \end{aligned} $$
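A small sketch of this prediction rule with an RBF kernel (illustrative; it assumes the dual variables `alpha`, the labels, and the bias `w0` have already been obtained from some solver, which is not shown):

```python
import numpy as np

def rbf_kernel(Xa, Xb, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs.
    sq_dists = np.sum(Xa**2, axis=1)[:, None] + np.sum(Xb**2, axis=1)[None, :] - 2 * Xa @ Xb.T
    return np.exp(-gamma * sq_dists)

def svm_predict(X_train, y_train, alpha, w0, X_test, gamma=1.0):
    # sign( sum_n alpha_n y_n k(x_n, x*) + w0 )
    K = rbf_kernel(X_train, X_test, gamma)        # (N_train, N_test)
    return np.sign((alpha * y_train) @ K + w0)

# Illustrative usage with made-up dual variables (normally produced by a QP solver).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.3, 0.2])                 # hypothetical values
w0 = 0.1                                          # hypothetical value
print(svm_predict(X_train, y_train, alpha, w0, np.array([[0.1, 0.1], [1.9, 1.8]])))
```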