Stanford CS229 (taught by Andrew Ng) Study Notes (3)
CS229-notes1-part3
- Overview
- Main content
  - Problem Set #1: Supervised learning
    - 1. Newton's method for computing least squares
    - 5. Exponential family and the geometric distribution
Overview
This note covers the third part of the cs229-notes1 handout and corresponds to the "04 Newton's Method" video on Bilibili. It mainly supplements some of the derivations in the handout and records the key points; solutions to the corresponding problem-set exercises and C++ implementations of the algorithms are also included.
See "Stanford CS229 (taught by Andrew Ng) Study Notes (1)" for the course videos, handouts, and other materials.
Main content
Problem Set #1: Supervised learning
1. Newton’s method for computing least squares
The original problem statement is as follows:

Solution:
(a)
Assume the dimensions:
X: m\times n,\quad \theta: n\times 1,\quad \vec y: m\times 1
\begin{aligned} J(\theta)=&\frac{1}{2}\sum^m_{i=1}(\theta^Tx^{(i)}-y^{(i)})^2\\ =&\frac{1}{2}(X\theta-\vec y)^T(X\theta-\vec y)\\ =&\frac{1}{2}\operatorname{tr}[(X\theta-\vec y)^T(X\theta-\vec y)]\\ =&\frac{1}{2}\operatorname{tr}[\theta^TX^TX\theta-(\vec y^TX\theta)^T-\vec y^TX\theta+\vec y^T \vec y] \end{aligned}
Therefore, applying the product rule to the two occurrences of \theta (the subscript c marks the copy of \theta held constant during differentiation),
\begin{aligned} \nabla_\theta J(\theta)=&\frac{1}{2}\nabla_\theta \operatorname{tr}[\theta^TX^TX\theta-(\vec y^TX\theta)^T-\vec y^TX\theta+\vec y^T \vec y]\\ =&\frac{1}{2}[\nabla_\theta \operatorname{tr}\,\theta^TX^TX\theta-2\nabla_\theta \operatorname{tr}\,\vec y^TX\theta]\\ =&\frac{1}{2}[\nabla_\theta \operatorname{tr}\,\theta^T_cX^TX\theta+\nabla_\theta \operatorname{tr}\,\theta^TX^TX\theta_c]-(\vec y^TX)^T\\ =&X^TX\theta-X^T\vec y \end{aligned}
so the Hessian is
\begin{aligned} H=\nabla_\theta\nabla_\theta J(\theta)=&\nabla_\theta( X^TX\theta-X^T\vec y)=X^TX \end{aligned}
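A side note not in the handout: this Hessian is positive semidefinite, since for any vector z
\begin{aligned} z^TX^TXz=(Xz)^T(Xz)=\|Xz\|^2\geq0 \end{aligned}
so J(\theta) is convex, and the Newton step in part (b) is well defined whenever X^TX is invertible.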
(b)
By the Newton-Raphson update rule,
\begin{aligned} \theta :=&\theta-H^{-1}\nabla_\theta J(\theta)\\ :=&\theta-(X^TX)^{-1}(X^TX\theta-X^T\vec y)\\ :=&\theta-(X^TX)^{-1}X^TX\theta+(X^TX)^{-1}X^T\vec y\\ :=&\theta-\theta+(X^TX)^{-1}X^T\vec y\\ :=&(X^TX)^{-1}X^T\vec y \end{aligned}
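Since J(\theta) is quadratic, this shows that a single Newton step from any starting point lands exactly on the normal-equations solution (X^TX)^{-1}X^T\vec y. Below is a minimal C++ sketch of that step; it assumes the Eigen linear-algebra library, and the data values are illustrative only.

```cpp
// One Newton step for least squares: theta := theta - H^{-1} * grad J,
// with grad J = X^T X theta - X^T y and H = X^T X, as derived above.
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Toy design matrix (m = 4 examples, n = 2 features; first column
    // is the intercept term) and target vector; values are made up.
    Eigen::MatrixXd X(4, 2);
    X << 1, 1,
         1, 2,
         1, 3,
         1, 4;
    Eigen::VectorXd y(4);
    y << 6, 5, 7, 10;

    Eigen::VectorXd theta = Eigen::VectorXd::Zero(2);  // arbitrary start

    Eigen::MatrixXd H = X.transpose() * X;                 // Hessian
    Eigen::VectorXd grad = H * theta - X.transpose() * y;  // gradient
    theta -= H.ldlt().solve(grad);  // solve H*d = grad instead of inverting

    // theta now equals (X^T X)^{-1} X^T y, the least-squares solution.
    std::cout << theta << std::endl;
    return 0;
}
```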
5. Exponential family and the geometric distribution
The original problem statement is as follows:

Solution:
(a)
Geometric distribution: in a sequence of independent Bernoulli trials, the probability that the first success occurs on trial y, i.e. that the first y-1 trials all fail and the y-th trial succeeds.
\begin{aligned} p(y;\phi)=&(1-\phi)^{y-1}\phi\\ =&\exp(\log((1-\phi)^{y-1}\phi))\\ =&\exp((y-1)\log(1-\phi)+\log\phi)\\ =&\exp(y\log(1-\phi)-\log(1-\phi)+\log\phi)\\ =&\exp(y\log(1-\phi)+\log\frac{\phi}{1-\phi}) \end{aligned}
Matching this against the exponential-family form p(y;\eta)=b(y)\exp(\eta^TT(y)-a(\eta)), where \eta is called the natural parameter (or canonical parameter) and y the response variable, and noting that \eta=\log(1-\phi) inverts to \phi=1-e^{\eta}, we read off
\begin{aligned} b(y)=&1\\ \eta=&\log(1-\phi)\\ T(y)=&y\\ a(\eta)=&-\log\frac{\phi}{1-\phi}=-\log\frac{1-e^\eta}{e^\eta}=\eta-\log(1-e^\eta) \end{aligned}
(b)
By assumption, y follows a geometric distribution, so y is called the geometric response variable.
By the definition of the canonical response function,
\begin{aligned} g(\eta)=&E[T(y);\eta]\\ =&E[y;\eta]\\ =&\frac{1}{\phi}\\ =&\frac{1}{1-e^\eta} \end{aligned}
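As a cross-check not in the handout, the standard exponential-family identity E[T(y);\eta]=a'(\eta) gives the same mean: with a(\eta)=\eta-\log(1-e^\eta) from part (a),
\begin{aligned} a'(\eta)=1+\frac{e^\eta}{1-e^\eta}=\frac{1}{1-e^\eta}=\frac{1}{\phi} \end{aligned}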
(c)
By assumption, y follows a geometric distribution given x. How, then, do we build a geometric-distribution GLM for a training set (logistic regression, used earlier in the handout, is itself a GLM)? First, recall a few passages from the handout.

In short, given x, the goal is to predict the expected value of y with the hypothesis h(x), i.e. h(x)=E[y|x], and the natural parameter depends linearly on x: \eta=\theta^Tx. By part (b), the mean of the geometric distribution is \frac{1}{1-e^{\eta}}, so
h(x)=\frac{1}{1-e^{\eta}}=\frac{1}{1-e^{\theta^Tx}}
Since the problem asks for stochastic gradient ascent, which processes one training example at a time, the per-example log-likelihood is
\begin{aligned} \ell_i(\theta)=\log p(y^{(i)};\phi)=\log p(y^{(i)};\eta)=&\log p(y^{(i)}|x^{(i)};\theta)\\ =&\log(\exp(y^{(i)}\log(1-\phi)+\log\frac{\phi}{1-\phi}))\\ =&\log(\exp(y^{(i)}\log(1-(1-e^\eta))+\log\frac{1-e^\eta}{1-(1-e^\eta)}))\\ =&\log(\exp(y^{(i)}\eta-\log\frac{e^\eta}{1-e^\eta}))\\ =&\log(\exp(y^{(i)}\theta^Tx^{(i)}-\log\frac{e^{\theta^Tx^{(i)}}}{1-e^{\theta^Tx^{(i)}}}))\\ =&y^{(i)}\theta^Tx^{(i)}-\log\frac{e^{\theta^Tx^{(i)}}}{1-e^{\theta^Tx^{(i)}}}\\ =&y^{(i)}\theta^Tx^{(i)}+\log(e^{-\theta^Tx^{(i)}}-1) \end{aligned}
The stochastic gradient ascent rule is \theta_j:=\theta_j+\alpha \frac{\partial\ell_i(\theta)}{\partial \theta_j}, and the partial derivative works out to
\begin{aligned} \frac{\partial\ell_i(\theta)}{\partial \theta_j}=&x_j^{(i)}y^{(i)}+\frac{e^{-\theta^Tx^{(i)}}}{e^{-\theta^Tx^{(i)}}-1}(-x_j^{(i)})=(y^{(i)}-\frac{1}{1-e^{\theta^Tx^{(i)}}})x_j^{(i)} \end{aligned}
Hence the update rule is \theta_j:=\theta_j+\alpha(y^{(i)}-\frac{1}{1-e^{\theta^Tx^{(i)}}})x_j^{(i)}.
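Here is a minimal C++ sketch of this update using only the standard library; the training data, starting point, and learning rate below are hypothetical. Note that a valid geometric model needs \phi=1-e^{\theta^Tx}\in(0,1), i.e. \theta^Tx<0, which this sketch does not enforce.

```cpp
// Stochastic gradient ascent for the geometric-response GLM:
//   theta_j := theta_j + alpha * (y_i - 1/(1 - exp(theta^T x_i))) * x_ij
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical training set; each x starts with an intercept feature of 1.
    std::vector<std::vector<double>> xs = {{1, -2.0}, {1, -1.5}, {1, -3.0}};
    std::vector<double> ys = {2, 4, 1};       // geometric responses, y >= 1
    std::vector<double> theta = {-0.5, 0.5};  // chosen so theta^T x < 0
    const double alpha = 0.01;                // learning rate

    for (int epoch = 0; epoch < 100; ++epoch) {
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double eta = 0.0;                 // eta = theta^T x^{(i)}
            for (std::size_t j = 0; j < theta.size(); ++j)
                eta += theta[j] * xs[i][j];
            double h = 1.0 / (1.0 - std::exp(eta));  // h(x) = E[y|x]
            for (std::size_t j = 0; j < theta.size(); ++j)
                theta[j] += alpha * (ys[i] - h) * xs[i][j];
        }
    }
    std::printf("theta = (%f, %f)\n", theta[0], theta[1]);
    return 0;
}
```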
