February 17, 2026
CMU 18661 Machine Learning | Notes
Notes for CMU-18661 Introduction to Machine Learning for Engineers.
Lecture 1: Intro
Link to the course website.
Homeworks (40%): Both math and programming problems
Miniexams (15%): Two during the semester
Midterm Exam (15%): Linear and Logistic Regression, Naive Bayes, SVMs (subject to change)
Final Exam (25%): Everything else (Nearest Neighbors, Neural Networks, Decision Trees, Boosting, Clustering, PCA, etc.), plus pre-midterm topics (subject to change)
Gradescope Quizzes (5%): Short multiple-choice question quizzes conducted in random (possibly all) lectures to encourage class attendance and attention. We will take the best 10 quiz scores.
Lecture 2: MLE & MAP
Pending…
Lecture 3-4: Linear Regression
Singular Value Decomposition (SVD):
$$A = U\Sigma V^{\top}$$

$U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal matrices.
$\Sigma\in\mathbb{R}^{m\times n}$ is a diagonal matrix with singular values $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r>0$ on the diagonal, where $r$ is the rank of $A$.
The squared singular values $\sigma_i^2$ are the eigenvalues of $A^{\top}A$ and $AA^{\top}$.
The columns of $V$ are the right singular vectors of $A$, as well as the eigenvectors of $A^{\top}A$.
The columns of $U$ are the left singular vectors of $A$, as well as the eigenvectors of $AA^{\top}$.
Proof:
$$\begin{aligned}
A^{\top}A&=V\Sigma^{\top} U^{\top} U\Sigma V^{\top}\\
&=V\Sigma^{\top} \Sigma V^{\top}\\
&=V\begin{bmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_r^2\end{bmatrix}V^{\top}
\end{aligned}$$
The same argument applied to $AA^{\top}=U\Sigma\Sigma^{\top}U^{\top}$ gives the claim for $U$.
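As a quick numerical sanity check of these facts, here is a small NumPy sketch (the matrix entries are made-up illustrative values, not from the lecture):

```python
import numpy as np

# Toy matrix with m = 3, n = 2; entries are arbitrary illustrative values.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Full SVD: U is m x m, Vt is n x n, s holds singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Squared singular values equal the eigenvalues of A^T A (sorted descending).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s**2, eigvals)

# U and V are orthogonal.
assert np.allclose(U @ U.T, np.eye(3))
assert np.allclose(Vt.T @ Vt, np.eye(2))

# Reconstruct A = U Sigma V^T by padding the singular values to an m x n matrix.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)
```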
Linear Regression:
Setup:
Input: $\mathbf{x}\in\mathbb{R}^D$
Output: $y\in\mathbb{R}$
Model: $f(\mathbf{x})=w_0+\mathbf{w}^{\top}\mathbf{x}=\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}$, where $\widetilde{\mathbf{x}}=[1,\mathbf{x}]$ and $\widetilde{\mathbf{w}}=[w_0,\mathbf{w}]$ absorb the bias term
Training data: $\mathcal{D}=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$
Residual Sum of Squares (RSS):
$$\text{RSS}(\widetilde{\mathbf{w}})=\sum_{n=1}^N(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n)^2$$
For $D=1$:
$$\begin{aligned}
\frac{\partial \text{RSS}}{\partial w_0} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)=0\\
\frac{\partial \text{RSS}}{\partial w_1} &= -2\sum_{n=1}^N\left(y_n-(w_0+w_1x_n)\right)x_n=0\\
\therefore\sum y_n &= N w_0 + w_1 \sum x_n,\quad\sum y_n x_n = w_0 \sum x_n + w_1 \sum x_n^2\\
\therefore w_1^* &= \frac{\sum (x_n-\bar{x})(y_n-\bar{y})}{\sum (x_n-\bar{x})^2},\quad w_0^* = \bar{y} - w_1^* \bar{x}
\end{aligned}$$
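The closed-form $D=1$ solution is easy to verify numerically; a minimal sketch (the synthetic data and true parameters $w_0=2$, $w_1=0.5$ are assumptions for illustration):

```python
import numpy as np

# Synthetic 1-D data from y = 2 + 0.5 x + noise; all values are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, size=100)

# Closed-form estimates from the derivation above.
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w0, w1)  # approximately 2.0 and 0.5
```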
For $D>1$:
$$\begin{aligned}
\text{RSS}(\widetilde{\mathbf{w}}) &= \sum_{n=1}^N (y_n - \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n)^2\\
&=\sum_n(y_n-\widetilde{\mathbf{w}}^{\top}\widetilde{\mathbf{x}}_n)(y_n-\widetilde{\mathbf{x}}_n^{\top}\widetilde{\mathbf{w}})\\
&=\sum_n\left(-2 y_n \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n + \widetilde{\mathbf{w}}^{\top} \widetilde{\mathbf{x}}_n \widetilde{\mathbf{x}}_n^{\top} \widetilde{\mathbf{w}}\right) + \text{const}\\
&=\widetilde{\mathbf{w}}^{\top} \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2 \left(\widetilde{\mathbf{X}}^{\top}\mathbf{y}\right)^{\top} \widetilde{\mathbf{w}} + \text{const}\\
\text{Set}\quad\nabla_{\widetilde{\mathbf{w}}} \text{RSS} &= 2 \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right) \widetilde{\mathbf{w}} - 2\,\widetilde{\mathbf{X}}^{\top}\mathbf{y} = 0\\
\text{Least Mean Squares solution}\quad\widetilde{\mathbf{w}}^{\text{LMS}} &= \left(\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\right)^{-1} \widetilde{\mathbf{X}}^{\top}\mathbf{y}
\end{aligned}$$

(assuming $\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}$ is invertible, i.e. $\widetilde{\mathbf{X}}$ has full column rank)
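The normal-equations solution can be checked against NumPy's built-in least-squares routine; a sketch with assumed toy data:

```python
import numpy as np

# Assumed toy data: N = 200 points in D = 3 dimensions with known true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = 4.0 + X @ w_true + rng.normal(0, 0.05, size=200)

# Augment with a column of ones so that w_tilde = [w0, w1, w2, w3].
X_tilde = np.hstack([np.ones((200, 1)), X])

# Normal equations; np.linalg.solve is preferred over forming an explicit inverse.
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
assert np.allclose(w_tilde, w_lstsq)
```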
Noisy observation model:
$$Y=w_0+w_1X+\eta,\quad \eta\sim\mathcal{N}(0,\sigma^2)\quad\Rightarrow\quad Y\mid X=x\;\sim\;\mathcal{N}(w_0+w_1x,\sigma^2)$$
Conditional Likelihood:
$$P(y_n\mid\mathbf{x}_n) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_n - (w_0 + w_1 x_n))^2}{2\sigma^2}\right)$$
Log-Likelihood of training data:
$$\begin{aligned}
\log P(\mathcal{D})&=\log\prod_{n=1}^N P(y_n\mid\mathbf{x}_n)\\
&=\sum_{n=1}^N \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_n - w_0 - w_1 x_n)^2}{2\sigma^2}\right)\\
&=-\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2
\end{aligned}$$
Maximize over $w_0$ and $w_1$: $\max \log P(\mathcal{D})\;\Leftrightarrow\;\min \text{RSS}$
This puts our intuition on solid footing: minimizing $\text{RSS}(\widetilde{\mathbf{w}})$ is a sensible objective that follows from reasonable modeling assumptions.
Maximize over $\sigma^2$:
$$\begin{aligned}
\frac{\partial \log P(\mathcal{D})}{\partial \sigma^2} &= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^N (y_n - w_0 - w_1 x_n)^2\\
&= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\,\text{RSS}(\widetilde{\mathbf{w}}) = 0\\
\therefore \sigma^{2*} &= \frac{1}{N}\,\text{RSS}(\widetilde{\mathbf{w}})
\end{aligned}$$
Estimating $\sigma^{2*}$ tells us how much noise there is in our predictions. For example, it allows us to place confidence intervals around our predictions.
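Putting the pieces together, a sketch of estimating $\sigma^{2*}$ and using it for a rough 95% predictive interval (the data, the true noise level $\sigma=0.3$, and the query point are all assumed for illustration; parameter uncertainty is ignored):

```python
import numpy as np

# Assumed 1-D data generated with noise standard deviation 0.3.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=500)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=500)

# Least-squares fit, then the MLE of the noise variance: sigma^2* = RSS / N.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
rss = np.sum((y - (w0 + w1 * x)) ** 2)
sigma2_mle = rss / len(x)  # should be close to 0.3**2 = 0.09

# Rough 95% predictive interval at a new point x0, ignoring parameter
# uncertainty (negligible here since N is large).
x0 = 2.5
pred = w0 + w1 * x0
half_width = 1.96 * np.sqrt(sigma2_mle)
print(pred - half_width, pred + half_width)
```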
Gradient Descent Method:
Pending…
Lecture 5: Overfitting
Pending…
Lecture 6: Naive Bayes
Naive Bayes:
$$P(y\mid\mathbf{x}) = \frac{P(\mathbf{x}\mid y)\,P(y)}{P(\mathbf{x})}$$
Spam example:
$$\begin{aligned}
P(\mathbf{X}=\mathbf{x}, Y=c)&=P(Y=c)\,P(\mathbf{X}=\mathbf{x}\mid Y=c) \\
&=P(Y=c)\prod_{k=1}^K P(\text{word}_k\mid Y=c)^{x_k} \\
&=\pi_c\prod_{k=1}^K \theta_{c,k}^{x_k}
\end{aligned}$$
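In log space this joint factorizes into a sum, which is how the classification is done in practice; a sketch with hypothetical parameter values (the priors, word probabilities, and counts below are made up, not from the notes):

```python
import numpy as np

# Hypothetical parameters: 2 classes (ham = 0, spam = 1), vocabulary of K = 4
# words. All numbers are illustrative assumptions.
pi = np.array([0.7, 0.3])                    # class priors pi_c
theta = np.array([[0.40, 0.30, 0.20, 0.10],  # P(word_k | Y = ham)
                  [0.10, 0.10, 0.30, 0.50]]) # P(word_k | Y = spam)

def log_joint(x, c):
    """log P(X = x, Y = c) = log pi_c + sum_k x_k * log theta_{c,k}."""
    return np.log(pi[c]) + x @ np.log(theta[c])

# Word-count vector for one email.
x = np.array([0, 1, 2, 3])

# Predict the class with the larger log joint probability.
scores = [log_joint(x, c) for c in (0, 1)]
print(int(np.argmax(scores)))  # prints 1 (spam) for these counts
```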