Lecture 1: Intro
- Homeworks (40%): Both math and programming problems
- Miniexams (15%): Two during the semester
- Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
- Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
- Gradescope Quizzes (5%): Short multiple-choice quizzes held in randomly chosen (possibly all) lectures to encourage class attendance and attention. We will take the best 10 quiz scores.
Lecture 2: MLE & MAP
Pending…
Lecture 3-4: Linear Regression
Singular Value Decomposition (SVD):
$$ A = U\Sigma V^T $$
$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices
$\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.
The squared singular values $\sigma_i^2$ are the eigenvalues of $A^TA$ and $AA^T$.
The columns of $V$ are the right singular vectors of $A$ and the eigenvectors of $A^TA$.
The columns of $U$ are the left singular vectors of $A$ and the eigenvectors of $AA^T$.
Proof:
$$
\begin{aligned}
A^TA&=V\Sigma^T U^T U\Sigma V^T \\
&=V\Sigma^T \Sigma V^T \\
&=V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^T
\end{aligned}
$$
The argument for $AA^T$ (and hence for $U$) is analogous.
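As a quick numerical check of these facts, here is a minimal NumPy sketch (the matrix `A` is arbitrary illustrative data):

```python
import numpy as np

# Arbitrary example matrix (illustrative only), shape (m, n) = (3, 2)
A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])

# Full SVD: U is m x m, Vt is n x n, s holds the singular values (descending)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Squared singular values should match the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sort descending to match s
print(np.allclose(s**2, eigvals))             # True (up to numerical error)

# Reconstruct A from U, Sigma, V^T
Sigma = np.zeros_like(A)
Sigma[:A.shape[1], :A.shape[1]] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))         # True
```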
Linear Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\mathbb{R}$
- Model: $f(\mathbf{x})=w_0+\mathbf{w}^T\mathbf{x}=\widetilde{\mathbf{w}}^T\widetilde{\mathbf{x}}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
Residual Sum of Squares (RSS):
$$ \text{RSS}(\mathbf{\widetilde{w}})=\sum_{n=1}^N(y_n-\mathbf{\widetilde{w}}^T\mathbf{\widetilde{x}}_n)^2 $$
For $D=1$:
$$
\begin{aligned}
\frac{\partial \text{RSS}}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0 \\
\frac{\partial \text{RSS}}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0 \\
\therefore\sum y_n &= N w_0 + w_1 \sum x_n,\quad\sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2 \\
\therefore w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x}
\end{aligned}
$$
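A short NumPy sketch of this $D=1$ closed form (the synthetic data, true coefficients, and noise level are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.3, size=100)   # true w0 = 1.5, w1 = 0.8

# Closed-form solution from the normal equations above
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # should be close to 1.5 and 0.8
```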
For $D>1$:
$$
\begin{aligned}
\text{RSS}(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N (y_n - \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n)^2 \\
&=\sum_n(y_n-\widetilde{\mathbf{w}}^T\widetilde{\mathbf{x}}_n)(y_n-\widetilde{\mathbf{x}}_n^T\widetilde{\mathbf{w}}) \\
&=\sum_n\left(- 2 y_n \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^T \widetilde{\mathbf{w}}\right) + \text{const} \\
&=\widetilde{\mathbf{w}}^T \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^T\mathbf{y}\right)^T \widetilde{\mathbf{w}} + \text{const} \\
\text{Set}\quad\nabla_{\widetilde{\mathbf{w}}} \text{RSS} &= 2 \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \widetilde{\mathbf{X}}^T\mathbf{y} = 0 \\
\text{Least Mean Squares solution:}\quad\widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^T\mathbf{y}
\end{aligned}
$$
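For general $D$, a minimal sketch of this closed-form solution, assuming the design matrix $\widetilde{\mathbf{X}}$ is built by prepending a column of ones (the helper name `fit_linear_regression` is illustrative):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least-squares fit; X is (N, D), y is (N,)."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend bias column
    # Normal equations: w = (X^T X)^{-1} X^T y (solve is preferred over inv)
    w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
    return w_tilde                                            # [w0, w1, ..., wD]

# Note: np.linalg.lstsq(X_tilde, y, rcond=None) is a numerically safer
# alternative when X^T X is ill-conditioned or singular.
```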
Why minimize RSS?
Noisy observation model:
$$ Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\rightarrow\quad Y\sim\mathcal{N}(w_0+w_1x,\sigma^2) $$
Conditional Likelihood:
$$ P(y_n|\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right) $$
Log-Likelihood of training data:
$$
\begin{aligned}
\log P(\mathcal{D})&=\log\prod_{n=1}^N P(y_n|\mathbf{x}_n) \\
&=\sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right) \\
&=-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2
\end{aligned}
$$
Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\leftrightarrow \min \text{RSS}$
- This gives a solid footing to our intuition: minimizing $RSS(\widetilde{\mathbf{w}})$ is a sensible thing based on reasonable modeling assumptions.
Maximize over $\sigma^2$:
$$
\begin{aligned}
\frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2 \\
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \text{RSS}(\widetilde{\mathbf{w}}) = 0 \\
\therefore \sigma^{2*} &= \frac{1}{N} \text{RSS}(\widetilde{\mathbf{w}})
\end{aligned}
$$
- Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.
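A one-function sketch of this variance estimate, assuming `w_tilde` comes from a least-squares fit like the sketch above:

```python
import numpy as np

def noise_variance_mle(X, y, w_tilde):
    """MLE of sigma^2 = RSS / N under the Gaussian noise model above."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # same augmentation as before
    residuals = y - X_tilde @ w_tilde
    return np.mean(residuals ** 2)                       # RSS / N
```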
Gradient Descent Method:
pending…
Lecture 5: Overfitting
Pending…
Lecture 6: Naive Bayes
Naive Bayes: (spam example)
$$ P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} $$
$$
\begin{aligned}
P(\mathbf{X}=\mathbf{x}, Y=c)&=P(Y=c)P(\mathbf{X}=\mathbf{x}|Y=c) \\
&=P(Y=c)\prod_{k=1}^K P(\text{word}_k|Y=c)^{x_k} \\
&=\pi_c\prod_{k=1}^K \theta_{c,k}^{x_k},
\end{aligned}
$$
where $\pi_c = P(Y = c)$ is the prior probability of class $c$, $x_k$ is the number of occurrences of the $k$-th word, and $\theta_{c,k} = P(\text{word}_k | Y = c)$ is the weight of the $k$-th word for the $c$-th class.
Training Data:
$$ \mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N=\{(\{x_{n,k}\}_{k=1}^K,y_n)\}_{n=1}^N $$
Goal:
Learn $\pi_c$ and $\theta_{c,k}$ from the training data $\mathcal{D}$.
Log-Likelihood:
$$
\begin{aligned}
\mathcal{L}(\pi, \theta) &= \sum_{n=1}^N \log P(\mathbf{x}_n, y_n) \\
&= \sum_{n=1}^N \log \left(\pi_{y_n} \prod_{k=1}^K \theta_{y_n,k}^{x_{n,k}}\right) \\
&= \sum_{n=1}^N \log \pi_{y_n} + \sum_{n=1}^N \sum_{k=1}^K x_{n,k} \log \theta_{y_n,k}
\end{aligned}
$$
Optimization:
Note that $\pi_c$ and $\theta_{c,k}$ can be optimized separately.
$$
\begin{aligned}
\pi_c^* &= \frac{N_c}{N}\quad\rightarrow\text{see Lecture 2} \\
\theta_{c,k}^* &= \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c)}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c)} \\
&=\frac{\text{\# times word $k$ appears in class-$c$ documents}}{\text{total \# of words in class-$c$ documents}}\quad\rightarrow\text{see Lecture 2}
\end{aligned}
$$
Classification:
$$
\begin{aligned}
y^* &= \arg\max_c P(y=c|\mathbf{x}) \\
&= \arg\max_c P(\mathbf{x}|y=c)P(y=c) \\
&= \arg\max_c \pi_c \prod_{k=1}^K \theta_{c,k}^{x_k} \\
&= \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k}
\end{aligned}
$$
Problem: if $\theta_{c,k}^* = 0$ for some $c$ and $k$, then any test document containing the $k$-th word will be classified as not belonging to class $c$.
Solution: Laplace Smoothing
$$ \theta_{c,k}^* = \frac{\sum_{n=1}^N x_{n,k} \mathbb{I}(y_n = c) + \alpha}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'} \mathbb{I}(y_n = c) + K\alpha} $$
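A compact NumPy sketch of these estimates and the decision rule, with Laplace smoothing built in (the function names and the assumption that `X` is an $N\times K$ word-count matrix are illustrative, not from the lecture):

```python
import numpy as np

def train_naive_bayes(X, y, n_classes, alpha=1.0):
    """X: (N, K) word-count matrix, y: (N,) labels in {0, ..., n_classes-1}.
    Returns log priors (C,) and log word probabilities (C, K) with Laplace smoothing."""
    N, K = X.shape
    log_prior = np.zeros(n_classes)
    log_theta = np.zeros((n_classes, K))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(Xc.shape[0] / N)        # pi_c = N_c / N
        counts = Xc.sum(axis=0) + alpha               # word counts + smoothing
        log_theta[c] = np.log(counts / counts.sum())  # denominator includes K * alpha
    return log_prior, log_theta

def predict_naive_bayes(X, log_prior, log_theta):
    """argmax_c  log pi_c + sum_k x_k log theta_{c,k}."""
    scores = X @ log_theta.T + log_prior              # (N, C)
    return scores.argmax(axis=1)
```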
Lecture 7: Logistic Regression
Recall for Naive Bayes:
$$ y^* = \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k} $$
For binary classification, the label is based on the sign of
$$ \log\pi_1+\sum_{k=1}^K x_k \log \theta_{1,k} - \log\pi_2 - \sum_{k=1}^K x_k \log \theta_{2,k}, $$
which is a linear function of the input features $\mathbf{x}$.
Logistic Regression:
Setup:
- Input: $\mathbf{x}\in\mathbb{R}^D$
- Output: $y\in\{0,1\}$
- Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
- Model: $P(y=1|\mathbf{x})=\sigma(\mathbf{w}^T\mathbf{x})=g(\mathbf{x})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function
Likelihood:
$$
\begin{aligned}
p(y_n|\mathbf{x}_n;\mathbf{w})&=g(\mathbf{x}_n)^{y_n}(1-g(\mathbf{x}_n))^{1-y_n} \\
\mathcal{L}(\mathbf{w})&=\prod_{n=1}^N p(y_n|\mathbf{x}_n;\mathbf{w}) \\
\log \mathcal{L}(\mathbf{w})&=\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right)
\end{aligned}
$$
We use the negative log-likelihood, which is also called the cross-entropy loss:
$$ \varepsilon(\mathbf{w}) = -\log \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log (1-g(\mathbf{x}_n))\right) $$
Gradient:
$$
\begin{aligned}
\frac{d\sigma(z)}{dz} &= \sigma(z)(1-\sigma(z)) \\
\therefore \nabla_{\mathbf{w}} \varepsilon(\mathbf{w}) &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n) + (1-y_n) \frac{-1}{1-g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n)\right) \\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) \nabla_{\mathbf{w}} g(\mathbf{x}_n) \\
&=-\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) g(\mathbf{x}_n)(1-g(\mathbf{x}_n)) \mathbf{x}_n \\
&=-\sum_{n=1}^N \left(y_n - g(\mathbf{x}_n)\right) \mathbf{x}_n \\
&=\sum_{n=1}^N \left(g(\mathbf{x}_n) - y_n\right) \mathbf{x}_n,
\end{aligned}
$$
where $\left(g(\mathbf{x}_n) - y_n\right)$ is the training error for the $n$-th training example.
Training:
pending… (page 31)
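The training discussion is still pending above; purely as a placeholder, a minimal batch gradient-descent sketch built directly from the gradient derived earlier might look like this (learning rate and iteration count are arbitrary illustrative choices; a bias term can be handled by appending a constant-one feature, as in linear regression):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy loss; X is (N, D), y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = sigmoid(X @ w)               # g(x_n) for all n
        grad = X.T @ (g - y)             # sum_n (g(x_n) - y_n) x_n
        w -= lr * grad / X.shape[0]      # averaged gradient for a stable step size
    return w
```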
Non-linear Decision Boundary:
Solution: map the inputs into a high-dimensional feature space (which increases the risk of overfitting, so we also add regularization).
Classification Metric:
- True Positive (TP): number of positive examples correctly classified as positive
- False Positive (FP): number of negative examples incorrectly classified as positive
- True Negative (TN): number of negative examples correctly classified as negative
- False Negative (FN): number of positive examples incorrectly classified as negative
- Sensitivity (Recall): $\frac{TP}{TP + FN}$, the proportion of positive examples that are correctly classified as positive
- Specificity: $\frac{TN}{TN + FP}$, the proportion of negative examples that are correctly classified as negative
- Precision: $\frac{TP}{TP + FP}$, the proportion of positive predictions that are correct
Receiver Operating Characteristic (ROC) Curve: plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings.
AUROC: the area under the ROC curve, which measures the overall performance of a binary classifier. A higher AUROC indicates better performance, with a value of $1$ representing a perfect classifier and a value of $0.5$ representing a random classifier.
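A small sketch of these counts and ratios (division-by-zero guards omitted for brevity). Sweeping the decision threshold on $P(y=1|\mathbf{x})$ and plotting each resulting (1 - specificity, sensitivity) pair traces out the ROC curve.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """y_true, y_pred: arrays of 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
    }
```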
Lecture 8: Multiclass Logistic Regression
Solution: Softmax instead of Sigmoid
Lecture 9-10: Support Vector Machines (SVM)
Advantages of SVM
- Maximizes distance of training points from the boundary.
- Only requires a subset of the training points.
- Is less sensitive to outliers.
- Scales better with high-dimensional data.
- Generalizes well to many nonlinear models.
Idea: Maximize the margin between the decision boundary and the training points.
Distance from a point $\mathbf{x}$ to the decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
$$ d_{\mathcal{H}}(\mathbf{x}) = \frac{|\mathbf{w}^T\mathbf{x} + w_0|}{\|\mathbf{w}\|_2} $$
To remove the absolute value, we use $y = +1$ for the positive label and $y = -1$ for the negative label, so that $(\mathbf{w}^T\mathbf{x} + w_0)$ and the label $y$ have the same sign whenever a point is correctly classified. So we get
$$
d_{\mathcal{H}}(\mathbf{x}) = \frac{y(\mathbf{w}^T\mathbf{x} + w_0)}{\|\mathbf{w}\|_2},\qquad
\text{MARGIN}(\mathbf{w}, w_0) = \min_{n} d_{\mathcal{H}}(\mathbf{x}_n)
$$
To solve the SVM, we want to maximize the margin.
We can scale $\mathbf{w}$ and $w_0$ by any positive constant without changing the decision boundary, so we can fix the scale by requiring $\min_n y_n(\mathbf{w}^T\mathbf{x}_n + w_0) = 1$. The margin then equals $\frac{1}{\|\mathbf{w}\|_2}$, which gives us the following optimization problem:
$$ \max_{\mathbf{w}, w_0} \text{MARGIN}(\mathbf{w}, w_0) = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|_2} \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1, \forall n $$
This is equivalent to:
$$ \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1, \forall n $$
SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints.
Soft Margin SVM:
Problem: What if the data is not fully linearly separable?
Solution: Slack variable
$$ \xi_n \geq 0 \quad\text{s.t.}\quad y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n $$
Which gives us the following optimization problem:
$$
\begin{aligned}
\min_{\mathbf{w}, w_0, \xi} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n \quad\text{s.t.}\quad &y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n, \forall n \\
&\xi_n \geq 0, \forall n
\end{aligned}
$$
where $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the classification error (same idea as the regularization parameter in logistic regression).
Hinge Loss Form:
We want to extract the slack variable $\xi_n$ from the constraints, which gives us:
$$ y_n(\mathbf{w}^T\mathbf{x}_n + w_0) \geq 1 - \xi_n \leftrightarrow \xi_n \geq \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) $$
By setting $\lambda = \frac{1}{C}$, we can rewrite the optimization problem as:
$$ \min_{\mathbf{w}, w_0} \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$
The first term is called the hinge loss. This can be optimized using gradient descent. There is no penalty for points that are correctly classified and outside the margin, a linear penalty for points that are correctly classified but inside the margin, and a linear penalty for points that are misclassified.
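A minimal subgradient-descent sketch of this hinge-loss objective (labels are assumed to be in $\{-1,+1\}$; the step size, $\lambda$, and iteration count are arbitrary illustrative choices, and in practice a library solver would be used):

```python
import numpy as np

def train_svm_hinge(X, y, lam=0.01, lr=0.001, n_iters=1000):
    """Subgradient descent on sum_n max(0, 1 - y_n (w^T x_n + w0)) + (lam/2) ||w||^2."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + w0)
        active = margins < 1                 # points with non-zero hinge loss
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)
        grad_w0 = -y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```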
Dual Form:
Optimization problem:
$$ \min_{\mathbf{w}, w_0} \sum_{n=1}^N \max(0, 1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0)) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 $$
Constraints (canonical form):
$$
\begin{aligned}
-\xi_n &\leq 0, \forall n \\
1-y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - \xi_n &\leq 0, \forall n
\end{aligned}
$$
Lagrangian:
$$
\begin{aligned}
\mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\}) &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^T\mathbf{x}_n + w_0) - \xi_n\right)
\end{aligned}
$$
under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$ for all $n$.
Goal: $\min_{\mathbf{w}, w_0, \xi_n} \max_{\alpha_n \geq 0, \lambda_n \geq 0}\mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\})$
Idea: if any constraint is violated, the inner maximization can drive the corresponding $\alpha_n$ or $\lambda_n$ to infinity, sending the Lagrangian to infinity; so the min-max is equivalent to the original constrained problem.
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n = 0 &\rightarrow \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n \\
\frac{\partial \mathcal{L}}{\partial w_0} &= -\sum_{n=1}^N \alpha_n y_n = 0 &\rightarrow \sum_{n=1}^N \alpha_n y_n = 0 \\
\frac{\partial \mathcal{L}}{\partial \xi_n} &= C - \lambda_n - \alpha_n = 0 &\rightarrow \lambda_n = C - \alpha_n\rightarrow \alpha_n \leq C
\end{aligned}
$$
Features:
- Independent of the dimension $d$ of $\mathbf{x}$: SVM scales better for high-dimensional features.
- May seem like a lot of optimization variables when $N$ is large, but many of the $\alpha_n$ become zero. $\alpha_n$ is non-zero only if the $n$-th point is a support vector, so the SVM depends only on a subset of the training points (the support vectors).
$\alpha_n<C$ only when $\xi_n=0$, pending… (page 62)
Learning:
$$ \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n $$
For $(\mathbf{x}_n,y_n)$ such that $0<\alpha_n<C$, we have
$$
\begin{aligned}
y_n(\mathbf{w}^T\mathbf{x}_n + w_0) &= 1 \\
w_0 &= y_n - \mathbf{w}^T\mathbf{x}_n \\
&= y_n - \sum_{m=1}^N \alpha_m y_m \mathbf{x}_m^T \mathbf{x}_n
\end{aligned}
$$
Kernel SVM:
Similar to the linear regression case, we can use a non-linear basis function $\phi(\mathbf{x})$ to map the input features into a higher-dimensional space. The optimization problem becomes:
$$
\begin{aligned}
\mathcal{L} &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^T\phi(\mathbf{x}_n) + w_0) - \xi_n\right) \\
&=\frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_n \alpha_n y_n \mathbf{w}^T \phi(\mathbf{x}_n) + \sum_n \alpha_n\quad\text{(other terms sum to zero)} \\
&=\sum_n \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\quad\text{(substitute $\mathbf{w}=\sum_n \alpha_n y_n \phi(\mathbf{x}_n)$)} \\
&=\sum_n \alpha_n -\frac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\quad\text{(define $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$)}
\end{aligned}
$$
where $k(\mathbf{x}_i, \mathbf{x}_j)$ is called the kernel function.
$k$ is a valid kernel function if it is symmetric and positive semi-definite. Some popular examples:
- Dot product kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
- Dot product with positive-definite matrix: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T A \mathbf{x}_j$
- Polynomial kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + 1)^d,d\in\mathbb{Z}^+$
- Radial basis function (RBF) kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2),\gamma>0$
Note that we do not need to know the explicit form of $\phi(\mathbf{x})$ to compute $k(\mathbf{x}_i, \mathbf{x}_j)$, which allows us to work with very high-dimensional feature spaces without incurring a large computational cost. This is called the kernel trick.
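As a small illustration of the kernel trick, the RBF Gram matrix can be computed directly from pairwise squared distances, without ever forming $\phi(\mathbf{x})$ (the default `gamma` is an arbitrary illustrative value):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2); no explicit feature map needed."""
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)
```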
But how do we predict?
$$
\begin{aligned}
y^* &= \text{sign}(\mathbf{w}^T \phi(\mathbf{x}^*) + w_0) \\
&=\text{sign}\left(\sum_{n=1}^N \alpha_n y_n \phi(\mathbf{x}_n)^T \phi(\mathbf{x}^*) + w_0\right) \\
&= \text{sign}\left(\sum_{n=1}^N \alpha_n y_n k(\mathbf{x}_n, \mathbf{x}^*) + w_0\right)
\end{aligned}
$$
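A sketch of this decision rule in code, assuming the dual variables `alpha` and the bias `w0` have already been obtained from a dual solver (not shown) and reusing the `rbf_kernel` helper sketched above:

```python
import numpy as np

def kernel_svm_predict(X_train, y_train, alpha, w0, X_test, kernel):
    """sign(sum_n alpha_n y_n k(x_n, x*) + w0) for each test point x*."""
    K = kernel(X_train, X_test)          # (N_train, N_test) Gram matrix
    scores = (alpha * y_train) @ K + w0  # weighted kernel votes of the support vectors
    return np.sign(scores)
```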