林空鹿饮溪的博客 | DeerInForest's Blog

CMU 18661 Machine Learning | Notes

Notes for CMU-18661 Introduction to Machine Learning for Engineers.

Lecture 1: Intro

Link to the course website.

  • Homeworks (40%): Both math and programming problems
  • Miniexams (15%): Two during the semester
  • Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
  • Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
  • Gradescope Quizzes (5%): Short multiple-choice question quizzes conducted in random (possibly all) lectures to encourage class attendance and attention. We will take the best 10 quiz scores.

Lecture 2: MLE & MAP

Pending…

Lecture 3-4: Linear Regression

SVD Decomposition:

$$A = U\Sigma V^{\top}$$

$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices.

$\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.

The squared singular values $\sigma_i^2$ are the eigenvalues of $A^{\top}A$ and $AA^{\top}$.

The columns of $V$ are the right singular vectors of $A$, as well as the eigenvectors of $A^{\top}A$.

The columns of $U$ are the left singular vectors of $A$, as well as the eigenvectors of $AA^{\top}$.

Proof:

$$\begin{aligned} A^{\top}A &= V\Sigma^{\top} U^{\top} U\Sigma V^{\top}\\ &= V\Sigma^{\top}\Sigma V^{\top}\\ &= V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^{\top} \end{aligned}$$

A similar argument on $AA^{\top} = U\Sigma\Sigma^{\top}U^{\top}$ gives the claims for $U$.

Linear Regression:

Setup:

  • Input: $\mathbf{x}\in\mathbb{R}^D$
  • Output: $y\in\mathbb{R}$
  • Model: $f(\mathbf{x})=w_0+\mathbf{w}^{\top}\mathbf{x}=\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}$
  • Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$

Residual Sum of Squares (RSS):

$$\text{RSS}(\widetilde{\mathbf{w}})=\sum_{n=1}^N\left(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n\right)^2$$

For $D=1$:

$$\begin{aligned} \frac{\partial \text{RSS}}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0\\ \frac{\partial \text{RSS}}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0\\ \therefore\sum y_n &= N w_0 + w_1 \sum x_n,\quad \sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2\\ \therefore w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x} \end{aligned}$$

For $D>1$:

$$\begin{aligned} \text{RSS}(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N \left(y_n - \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n\right)^2\\ &=\sum_n\left(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n\right)\left(y_n-\widetilde{\mathbf{x}}_n^{\top}\widetilde{\mathbf{w}}\right)\\ &=\sum_n\left(-2 y_n \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^{\top} \widetilde{\mathbf{w}}\right) + \text{const}\\ &=\widetilde{\mathbf{w}}^{\top} \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right)^{\top} \widetilde{\mathbf{w}} + \text{const}\\ \text{Set}\quad \nabla_{\widetilde{\mathbf{w}}} \text{RSS} &= 2 \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right) = 0\\ \text{Least Mean Squares solution:}\quad \widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^{\top}\mathbf{y} \end{aligned}$$
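The closed-form solution above is easy to check numerically (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

# Compute the closed-form least-squares solution
# w = (X^T X)^{-1} X^T y and check it against numpy's solver.
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.standard_normal((N, D))
X_tilde = np.hstack([np.ones((N, 1)), X])   # prepend the bias column
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X_tilde @ w_true + 0.01 * rng.standard_normal(N)

# Solve the normal equations (solve() is preferred over an explicit inverse).
w_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

assert np.allclose(w_hat, np.linalg.lstsq(X_tilde, y, rcond=None)[0])
```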

Why minimize RSS?

Noisy observation model:

$$Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\rightarrow\quad Y\sim\mathcal{N}(w_0+w_1x,\sigma^2)$$

Conditional Likelihood:

$$P(y_n|\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right)$$

Log-Likelihood of training data:

$$\begin{aligned} \log P(\mathcal{D}) &= \log\prod_{n=1}^N P(y_n|\mathbf{x}_n)\\ &= \sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right)\\ &= -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2 \end{aligned}$$

Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\leftrightarrow \min \text{RSS}$

  • This gives a solid footing to our intuition: minimizing $\text{RSS}(\widetilde{\mathbf{w}})$ is a sensible thing to do based on reasonable modeling assumptions.

Maximize over $\sigma^2$:

$$\begin{aligned} \frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\ &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \text{RSS}(\widetilde{\mathbf{w}})\\ \therefore \sigma^{2*} &= \frac{1}{N} \text{RSS}(\widetilde{\mathbf{w}}) \end{aligned}$$
  • Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.

Gradient Descent Method:

pending…

Lecture 5: Overfitting

Pending…

Lecture 6: Naive Bayes

Naive Bayes:

$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})}$$

Spam example:

$$\begin{aligned} P(\mathbf{X}=\mathbf{x}, Y=c) &= P(Y=c)P(\mathbf{X}=\mathbf{x}|Y=c)\\ &= P(Y=c)\prod_{k=1}^K P(\text{word}_k|Y=c)^{x_k}\\ &= \pi_c\prod_{k=1}^K \theta_{c,k}^{x_k}, \end{aligned}$$

where $\pi_c = P(Y = c)$ is the prior probability of class $c$, $x_k$ is the number of occurrences of the $k$th word, and $\theta_{c,k} = P(\text{word}_k|Y = c)$ is the weight of the $k$th word for the $c$th class.

Training Data:

$$\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N=\left\{\left(\{x_{n,k}\}_{k=1}^K,\, y_n\right)\right\}_{n=1}^N$$

Goal:

Learn $\pi_c$ and $\theta_{c,k}$ from the training data $\mathcal{D}$.

Log-Likelihood:

$$\begin{aligned} \mathcal{L}(\pi, \theta) &= \sum_{n=1}^N \log P(\mathbf{x}_n, y_n)\\ &= \sum_{n=1}^N \log \left(\pi_{y_n} \prod_{k=1}^K \theta_{y_n,k}^{x_{n,k}}\right)\\ &= \sum_{n=1}^N \log \pi_{y_n} + \sum_{n=1}^N \sum_{k=1}^K x_{n,k} \log \theta_{y_n,k} \end{aligned}$$

Optimization:

Note that $\pi_c$ and $\theta_{c,k}$ can be optimized separately.

$$\begin{aligned} \pi_c^* &= \frac{N_c}{N}\quad\rightarrow\text{see Lecture 2}\\ \theta_{c,k}^* &= \frac{\sum_{n=1}^N x_{n,k}\, \mathbb{I}(y_n = c)}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'}\, \mathbb{I}(y_n = c)}\\ &= \frac{\#\text{ times word } k \text{ shows up in data for class } c}{\#\text{ words in data for class } c}\quad\rightarrow\text{see Lecture 2} \end{aligned}$$

Classification:

$$\begin{aligned} y^* &= \arg\max_c P(y=c|\mathbf{x})\\ &= \arg\max_c P(\mathbf{x}|y=c)P(y=c)\\ &= \arg\max_c \pi_c \prod_{k=1}^K \theta_{c,k}^{x_k}\\ &= \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k} \end{aligned}$$

Problem: if $\theta_{c,k}^* = 0$ for some $c$ and $k$, then any test document containing the $k$th word will be classified as not belonging to class $c$.

Solution: Laplace Smoothing

$$\theta_{c,k}^* = \frac{\sum_{n=1}^N x_{n,k}\, \mathbb{I}(y_n = c) + \alpha}{\sum_{n=1}^N \sum_{k'=1}^K x_{n,k'}\, \mathbb{I}(y_n = c) + K\alpha}$$
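The training and classification equations above fit in a few lines of NumPy (a minimal sketch; the function names and toy data are illustrative):

```python
import numpy as np

# Estimate pi_c and theta_{c,k} from word-count data with Laplace
# smoothing, then classify by argmax_c log(pi_c) + sum_k x_k log(theta_{c,k}).
def fit_naive_bayes(X, y, num_classes, alpha=1.0):
    N, K = X.shape
    pi = np.zeros(num_classes)
    theta = np.zeros((num_classes, K))
    for c in range(num_classes):
        mask = (y == c)
        pi[c] = mask.sum() / N
        counts = X[mask].sum(axis=0)          # total word counts in class c
        theta[c] = (counts + alpha) / (counts.sum() + K * alpha)
    return pi, theta

def nb_predict(pi, theta, x):
    scores = np.log(pi) + x @ np.log(theta).T
    return int(np.argmax(scores))

# Toy data: word counts for two classes.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [0, 3, 3]])
y = np.array([0, 0, 1, 1])
pi, theta = fit_naive_bayes(X, y, num_classes=2)
assert nb_predict(pi, theta, np.array([4, 0, 1])) == 0
assert nb_predict(pi, theta, np.array([0, 5, 2])) == 1
```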

Lecture 7: Logistic Regression

Recall for Naive Bayes:

$$y^* = \arg\max_c \log \pi_c + \sum_{k=1}^K x_k \log \theta_{c,k}$$

For binary classification, the label is based on the sign of

$$\log\pi_1+\sum_{k=1}^K x_k \log \theta_{1,k} - \log\pi_2 - \sum_{k=1}^K x_k \log \theta_{2,k},$$

which is a linear function of the input features $\mathbf{x}$.

Logistic Regression:

Setup:

  • Input: $\mathbf{x}\in\mathbb{R}^D$
  • Output: $y\in\{0,1\}$
  • Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
  • Model: $P(y=1|\mathbf{x})=\sigma(\mathbf{w}^{\top}\mathbf{x})=g(\mathbf{x})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function

Likelihood:

$$\begin{aligned} p(y_n|\mathbf{x}_n;\mathbf{w}) &= g(\mathbf{x}_n)^{y_n}\left(1-g(\mathbf{x}_n)\right)^{1-y_n}\\ \mathcal{L}(\mathbf{w}) &= \prod_{n=1}^N p(y_n|\mathbf{x}_n;\mathbf{w})\\ \log \mathcal{L}(\mathbf{w}) &= \sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log \left(1-g(\mathbf{x}_n)\right)\right) \end{aligned}$$

We use the negative log-likelihood, which is also called the cross-entropy loss:

$$\varepsilon(\mathbf{w}) = -\log \mathcal{L}(\mathbf{w}) = -\sum_{n=1}^N \left(y_n \log g(\mathbf{x}_n) + (1-y_n) \log \left(1-g(\mathbf{x}_n)\right)\right)$$

Gradient:

$$\begin{aligned} \frac{d\sigma(z)}{dz} &= \sigma(z)(1-\sigma(z))\\ \therefore \nabla_{\mathbf{w}} \varepsilon(\mathbf{w}) &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n) + (1-y_n) \frac{-1}{1-g(\mathbf{x}_n)} \nabla_{\mathbf{w}} g(\mathbf{x}_n)\right)\\ &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) \nabla_{\mathbf{w}} g(\mathbf{x}_n)\\ &= -\sum_{n=1}^N \left(y_n \frac{1}{g(\mathbf{x}_n)} - (1-y_n) \frac{1}{1-g(\mathbf{x}_n)}\right) g(\mathbf{x}_n)\left(1-g(\mathbf{x}_n)\right) \mathbf{x}_n\\ &= -\sum_{n=1}^N \left(y_n - g(\mathbf{x}_n)\right) \mathbf{x}_n\\ &= \sum_{n=1}^N \left(g(\mathbf{x}_n) - y_n\right) \mathbf{x}_n, \end{aligned}$$

where $\left(g(\mathbf{x}_n) - y_n\right)$ is the training error for the $n$th training example.
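The gradient above drops straight into batch gradient descent (a minimal sketch; the names, step size, and toy data are illustrative):

```python
import numpy as np

# Batch gradient descent on the cross-entropy loss, using the
# gradient sum_n (g(x_n) - y_n) x_n derived above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = sigmoid(X @ w)
        w -= lr * (X.T @ (g - y)) / len(y)   # averaged gradient step
    return w

# Toy linearly separable data; the bias is folded in as a constant feature.
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 1, 1])
w = train_logreg(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)
assert (preds == y).all()
```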

Training:

pending… (page 31)

Non-linear Decision Boundary:

Solution: map the input into a high-dimensional feature space (which increases the risk of overfitting, so we also need to add regularization).

Classification Metric:

  • True Positive (TP): number of positive examples correctly classified as positive
  • False Positive (FP): number of negative examples incorrectly classified as positive
  • True Negative (TN): number of negative examples correctly classified as negative
  • False Negative (FN): number of positive examples incorrectly classified as negative
  • Sensitivity (Recall): TPTP+FN\frac{TP}{TP + FN}, the proportion of positive examples that are correctly classified as positive
  • Specificity: TNTN+FP\frac{TN}{TN + FP}, the proportion of negative examples that are correctly classified as negative
  • Precision: TPTP+FP\frac{TP}{TP + FP}, the proportion of positive predictions that are correct

Receiver Operating Characteristic (ROC) Curve: plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings.

AUROC: the area under the ROC curve, which measures the overall performance of a binary classifier. A higher AUROC indicates better performance, with a value of $1$ representing a perfect classifier and a value of $0.5$ representing a random classifier.
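The metric definitions above are mechanical to compute from predictions (a minimal sketch; the names and toy labels are illustrative):

```python
# Compute sensitivity, specificity, and precision from true and
# predicted binary labels.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # fraction of positives caught
specificity = tn / (tn + fp)   # fraction of negatives rejected
precision = tp / (tp + fp)     # fraction of positive predictions correct
```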

Lecture 8: Multiclass Logistic Regression

Solution: Softmax instead of the sigmoid.

Lecture 9-10: Support Vector Machines (SVM)

Advantages of SVM

  1. Maximizes distance of training points from the boundary.
  2. Only requires a subset of the training points.
  3. Is less sensitive to outliers.
  4. Scales better with high-dimensional data.
  5. Generalizes well to many nonlinear models.

Idea: Maximize the margin between the decision boundary and the training points.

Distance from a point $\mathbf{x}$ to the decision boundary $\mathbf{w}^{\top}\mathbf{x} + w_0 = 0$:

$$d_{\mathcal{H}}(\mathbf{x}) = \frac{|\mathbf{w}^{\top}\mathbf{x} + w_0|}{\|\mathbf{w}\|_2}$$

To remove the absolute value, we use $y = +1$ for the positive label and $y = -1$ for the negative label; then $(\mathbf{w}^{\top}\mathbf{x} + w_0)$ and the label $y$ must have the same sign. So we get

$$\begin{aligned} d_{\mathcal{H}}(\mathbf{x}) &= \frac{y(\mathbf{w}^{\top}\mathbf{x} + w_0)}{\|\mathbf{w}\|_2}\\ \text{MARGIN}(\mathbf{w}, w_0) &= \min_{n} d_{\mathcal{H}}(\mathbf{x}_n) \end{aligned}$$

To solve the SVM, we want to maximize the margin.

We can scale $\mathbf{w}$ and $w_0$ by any positive constant without changing the decision boundary, so we can set $\min_n y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) = 1$, which gives us the following optimization problem:

$$\max_{\mathbf{w}, w_0} \text{MARGIN}(\mathbf{w}, w_0) = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|_2} \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1,\ \forall n$$

Which is further equivalent to:

$$\min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad\text{s.t.}\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1,\ \forall n$$

SVM is called a max margin (or large margin) classifier. The constraints are called large margin constraints.

Soft Margin SVM:

Problem: what if the data is not fully linearly separable?

Solution: Slack variable

$$\xi_n \geq 0,\quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n,\ \forall n$$

Which gives us the following optimization problem:

$$\begin{aligned} \min_{\mathbf{w}, w_0, \xi}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{n=1}^N \xi_n \quad\text{s.t.}\quad &y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n,\ \forall n\\ &\xi_n \geq 0,\ \forall n \end{aligned}$$

where $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the classification error (the same idea as the regularization parameter in logistic regression).

Hinge Loss Form:

We want to extract the slack variable $\xi_n$ from the constraints, which gives us:

$$y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) \geq 1 - \xi_n \leftrightarrow \xi_n \geq \max\left(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)\right)$$

By setting $\lambda = \frac{1}{C}$, we can rewrite the optimization problem as:

$$\min_{\mathbf{w}, w_0} \sum_{n=1}^N \max\left(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$

The first term is called the hinge loss. This can be optimized using gradient descent. There is no penalty for points that are correctly classified and outside the margin, a linear penalty for points that are correctly classified but inside the margin, and a linear penalty for points that are misclassified.
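The hinge-loss objective can be minimized by subgradient descent (a minimal sketch; the names, hyperparameters, and toy data are illustrative):

```python
import numpy as np

# Subgradient descent on the objective
# sum_n max(0, 1 - y_n (w.x_n + w0)) + (lambda/2) ||w||^2.
def train_linear_svm(X, y, lam=0.01, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(steps):
        margins = y * (X @ w + w0)
        active = margins < 1   # points inside the margin or misclassified
        # Subgradient of the hinge term, plus the L2 penalty on w.
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + lam * w
        grad_w0 = -y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy separable data with labels in {-1, +1}.
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, 1, 1])
w, w0 = train_linear_svm(X, y)
assert (np.sign(X @ w + w0) == y).all()
```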

Dual Form:

Optimization problem:

$$\min_{\mathbf{w}, w_0} \sum_{n=1}^N \max\left(0, 1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0)\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$

Restrictions (canonical form):

$$\begin{aligned} -\xi_n &\leq 0,\ \forall n\\ 1-y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n &\leq 0,\ \forall n \end{aligned}$$

Lagrangian:

$$\mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\}) = C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) - \xi_n\right)$$

under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$ for all $n$.

Goal: $\min_{\mathbf{w}, w_0, \xi_n} \max_{\alpha_n \geq 0, \lambda_n \geq 0} \mathcal{L}(\mathbf{w}, w_0, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\})$

Idea: if we break the constraints, then the unbounded $\alpha_n$ and $\lambda_n$ will drive the Lagrangian to infinity.

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{w}} &= \mathbf{w} - \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n = 0 &\rightarrow\quad &\mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n\\ \frac{\partial \mathcal{L}}{\partial w_0} &= -\sum_{n=1}^N \alpha_n y_n = 0 &\rightarrow\quad &\sum_{n=1}^N \alpha_n y_n = 0\\ \frac{\partial \mathcal{L}}{\partial \xi_n} &= C - \lambda_n - \alpha_n = 0 &\rightarrow\quad &\lambda_n = C - \alpha_n \rightarrow \alpha_n \leq C \end{aligned}$$

Features:

  • Independent of the dimension $d$ of $\mathbf{x}$: SVM scales better for high-dimensional features.
  • May seem like a lot of optimization variables when $N$ is large, but many of the $\alpha_n$ become zero. $\alpha_n$ is non-zero only if the $n$th point is a support vector, so the SVM depends on only a subset of the training points (the support vectors).

$\alpha_n < C$ only when $\xi_n = 0$; pending… (page 62)

Learning:

$$\mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n$$

For $(\mathbf{x}_n, y_n)$ such that $0<\alpha_n<C$, we have

$$\begin{aligned} y_n(\mathbf{w}^{\top}\mathbf{x}_n + w_0) &= 1\\ w_0 &= y_n - \mathbf{w}^{\top}\mathbf{x}_n\\ &= y_n - \sum_{m=1}^N \alpha_m y_m \mathbf{x}_m^{\top} \mathbf{x}_n \end{aligned}$$

Kernel SVM:

Similar to the linear regression case, we can use a non-linear basis function ϕ(x)\phi(\mathbf{x}) to map the input features into a higher-dimensional space. The optimization problem becomes:

$$\begin{aligned} \mathcal{L} &= C\sum_{n=1}^N \xi_n + \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \lambda_n (-\xi_n) + \sum_{n=1}^N \alpha_n \left(1 - y_n(\mathbf{w}^{\top}\phi(\mathbf{x}_n) + w_0) - \xi_n\right)\\ &= \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_n \alpha_n y_n \mathbf{w}^{\top} \phi(\mathbf{x}_n) + \sum_n \alpha_n\quad\text{(other terms sum to zero)}\\ &= \sum_n \alpha_n - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)\quad\text{(substitute } \mathbf{w}=\textstyle\sum_n \alpha_n y_n \phi(\mathbf{x}_n)\text{)}\\ &= \sum_n \alpha_n - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)\quad\text{(define } k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)\text{)} \end{aligned}$$

where $k(\mathbf{x}_i, \mathbf{x}_j)$ is called the kernel function.

$k$ is a valid kernel function if it is symmetric and positive semi-definite. Some popular examples:

  • Dot product kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} \mathbf{x}_j$
  • Dot product with a positive-definite matrix: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{\top} A \mathbf{x}_j$
  • Polynomial kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^{\top} \mathbf{x}_j + 1)^d$, $d\in\mathbb{Z}^+$
  • Radial basis function (RBF) kernel: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, $\gamma>0$

Note that we do not need to know the explicit form of ϕ(x)\phi(\mathbf{x}) to compute k(xi,xj)k(\mathbf{x}_i, \mathbf{x}_j), which allows us to work with very high-dimensional feature spaces without incurring a large computational cost. This is called the kernel trick.
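The kernel trick can be seen concretely for the polynomial kernel with $d=2$ on 2-D inputs: one explicit feature map whose inner product reproduces the kernel is $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2)$. A minimal numerical check (the function names are illustrative):

```python
import numpy as np

# Verify that phi(x) . phi(z) = (x . z + 1)^2, i.e. the polynomial
# kernel computes an inner product in feature space without forming phi.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), poly_kernel(x, z))
```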

But how do we predict?

$$\begin{aligned} y &= \operatorname{sign}\left(\mathbf{w}^{\top}\phi(\mathbf{x}) + w_0\right)\\ &= \operatorname{sign}\left(\sum_{n=1}^N \alpha_n y_n \phi(\mathbf{x}_n)^{\top} \phi(\mathbf{x}) + w_0\right)\\ &= \operatorname{sign}\left(\sum_{n=1}^N \alpha_n y_n k(\mathbf{x}_n, \mathbf{x}) + w_0\right) \end{aligned}$$

Lecture 11: Nearest Neighbors

Two crucial choices for NN

  • Choosing $K$, i.e., the number of nearest neighbors (default is $1$)
  • Choosing the right distance measure: Euclidean distance, $L_1$ or $L_p$ distance, feature scaling, …
  • In practice, these are hyperparameters that need to be tuned on a validation dataset or by cross-validation.
  • When $K$ increases, the decision boundary becomes smoother and less susceptible to outliers (i.e., lower variance).
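The basic prediction rule fits in a few lines (a minimal sketch; the names and toy data are illustrative):

```python
import numpy as np
from collections import Counter

# K-nearest-neighbor classification with Euclidean distance
# and a majority vote over the K closest training points.
def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
assert knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3) == 0
assert knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3) == 1
```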

pending…

Lecture 12: Decision Trees

Learning a Tree Model, things to learn:

  • The structure of the tree.
  • The split rule at each node.
    • Which feature to consider?
    • What is the threshold $\theta_i$?
    • What about categorical features?
  • The predicted values for the leaves.

First split: use Information Gain

If a random variable $Y$ takes $K$ different values $a_1, a_2, \ldots, a_K$, then its entropy is

$$H(Y) = -\sum_{k=1}^K p(a_k) \log p(a_k)$$

where $p(a_k)$ is the probability of $Y$ taking the value $a_k$.

Conditional entropy of $Y$ given $X$:

$$H(Y|X) = \sum_{x} p(x)\, H(Y|X=x)$$

Information Gain:

$$\text{IG}(X; Y) = H(Y) - H(Y|X)$$

Measures the reduction in entropy (i.e., the reduction of uncertainty in $Y$) when we also consider $X$.

For continuous features, Information Gain:

$$\text{IG}(X; Y) = H(Y) - H(Y\,|\,\text{whether } X \leq \theta)$$
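The entropy and information-gain formulas above can be computed directly (a minimal sketch; the names and toy data are illustrative):

```python
import math
from collections import Counter

# Entropy (in bits) of a list of labels, and information gain
# of a binary threshold split on a continuous feature.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys, theta):
    left = [y for x, y in zip(xs, ys) if x <= theta]
    right = [y for x, y in zip(xs, ys) if x > theta]
    h_cond = (len(left) / len(ys)) * entropy(left) \
           + (len(right) / len(ys)) * entropy(right)
    return entropy(ys) - h_cond

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
# A threshold that separates the classes perfectly removes all uncertainty,
# so the gain equals the full label entropy of 1 bit.
assert math.isclose(information_gain(xs, ys, theta=5.0), 1.0)
```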

Strategies to avoid overfitting:

  • Stop growing when data split is not statistically significant.
  • Acquire more training data.
  • Remove irrelevant attributes (manual process: not always possible).
  • Grow full tree, then post-prune.
  • Use tree ensembles (random forest, boosting, to be covered next lecture).

Lecture 13: Midterm Exam

No content.

Lecture 14: Ensemble Methods

Motivation:

  • Shallow trees have low variance (don’t overfit) but high bias (underfit).
  • Can we combine several shallow trees to reduce bias (without sacrificing variance)?
  • Deep trees have low bias, but high variance.
  • Can we combine several deep trees to reduce variance (without sacrificing bias)?

Bagging (random forests): train each tree on a random subset of the training data

  • Limitation of Bagging: If one or more features are very informative, they will be selected first by every tree in the bag, which increases the correlation between the trees (thus harms our goal of creating “random variation” in the training process to reduce variance).
  • Key idea behind random forests: reduce the correlation between trees by adding randomness to the training process (e.g., each split considers only a random subset of the features).

Boosting: train each tree on a re-weighted version of the training data, where the weights are updated based on the performance of the previous tree.

  • Initialize weights $w_1(n) = \frac{1}{N}$ for every training sample $n$
  • For $t = 1$ to $T$ (number of rounds, a hyperparameter):
  1. Train a weak classifier $h_t(x)$ using weights $w_t(n)$, by minimizing the weighted classification error

$$\epsilon_t = \sum_{n} w_t(n)\, \mathbb{1}[y_n \neq h_t(x_n)]$$

  2. Compute the contribution of this classifier: $\beta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$ (smaller $\epsilon_t$, i.e., smaller error, gives larger $\beta_t$)
  3. Update the weight of each training sample $n$:

$$w_{t+1}(n) \propto w_t(n)\, e^{-\beta_t y_n h_t(x_n)}$$

  ($w_t(n)$ is decreased if $y_n = h_t(x_n)$ and increased if $y_n \neq h_t(x_n)$), then normalize so that $\sum_{n} w_{t+1}(n) = 1$.

  • Output the final classifier:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^T \beta_t h_t(x)\right)$$
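The steps above can be sketched with decision stumps as the weak classifiers (a minimal sketch; the names, the stump search, and the toy data are illustrative):

```python
import numpy as np

# AdaBoost with decision stumps that threshold a single feature.
def stump_predict(X, feat, theta, sign):
    return sign * np.where(X[:, feat] <= theta, 1, -1)

def fit_stump(X, y, w):
    best = None
    for feat in range(X.shape[1]):          # brute-force stump search
        for theta in X[:, feat]:
            for sign in (1, -1):
                err = w @ (stump_predict(X, feat, theta, sign) != y)
                if best is None or err < best[0]:
                    best = (err, feat, theta, sign)
    return best

def adaboost(X, y, T=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                 # step 0: uniform weights
    ensemble = []
    for _ in range(T):
        err, feat, theta, sign = fit_stump(X, y, w)      # step 1
        beta = 0.5 * np.log((1 - err) / max(err, 1e-12)) # step 2
        pred = stump_predict(X, feat, theta, sign)
        w = w * np.exp(-beta * y * pred)                 # step 3
        w /= w.sum()
        ensemble.append((beta, feat, theta, sign))
    return ensemble

def boost_predict(ensemble, X):
    score = sum(b * stump_predict(X, f, t, s) for b, f, t, s in ensemble)
    return np.sign(score)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1, 1, -1, -1])
ens = adaboost(X, y, T=5)
assert (boost_predict(ens, X) == y).all()
```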

Why does this work? Pending…

Lecture 15-17: Neural Networks

Lecture 18: PyTorch

No content.

Lecture 19: Distributed Synchronous SGD