【Machine Learning】Optimization
These notes are based on the gradient-descent portion of the lecture notes for Tsinghua University's Machine Learning course, and were essentially written as a cheat sheet in the day or two before the exam. They cover a lot of ground without much detail, and are mainly intended as review and memorization material.
Smoothness assumption
Upper Bound for $\nabla^2 f(w)$: $\left\|\nabla^2 f(w)\right\|\le L$
$f(w')\le f(w)+\langle \nabla f(w), w'-w\rangle+\frac{L}{2}\left\|w'-w\right\|^2$
equivalent to $\left\|\nabla f(w)-\nabla f(w')\right\|\le L\left\|w-w'\right\|$
Proof:
* $\Rightarrow$: $\left\|\nabla f(w)-\nabla f(w')\right\|\le \sup_{\xi\in[w,w']}\left\|\nabla^2 f(\xi)\right\|\left\|w-w'\right\|\le L\left\|w-w'\right\|$
* $\Leftarrow$: $f(w')-f(w)-\langle\nabla f(w),w'-w\rangle=\int_0^1\langle\nabla f(w+s(w'-w))-\nabla f(w),w'-w\rangle\,ds\le\int_0^1 Ls\left\|w'-w\right\|^2\,ds=\frac{L}{2}\left\|w'-w\right\|^2$
Two-sided version: $\left|f(w')-f(w)-\langle \nabla f(w), w'-w\rangle\right|\le \frac{L}{2}\left\|w'-w\right\|^2$
i.e., it gives both an upper and a lower quadratic bound
When $w'=w-\eta\nabla f(w)$ with $\eta=\frac{1}{L}$, this guarantees
$f(w')-f(w)\le -\frac{1}{2L}\left\|\nabla f(w)\right\|^2<0$ whenever $\nabla f(w)\ne 0$.
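Plugging the gradient step into the smoothness upper bound shows where this constant comes from:
$f(w-\eta\nabla f(w))\le f(w)-\eta\left\|\nabla f(w)\right\|^2+\frac{L\eta^2}{2}\left\|\nabla f(w)\right\|^2=f(w)-\eta\left(1-\frac{L\eta}{2}\right)\left\|\nabla f(w)\right\|^2$
and the right-hand side is minimized over $\eta$ at $\eta=\frac{1}{L}$, giving a decrease of $\frac{1}{2L}\left\|\nabla f(w)\right\|^2$.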
Convex Function
Lower Bound for $\nabla^2 f(w)$
$f(w')\ge f(w)+\nabla f(w)^T(w'-w)$
* $\lambda_{\min}\left(\nabla^2 f(w)\right)\ge 0$
Strongly convex function: $f(w')\ge f(w)+\nabla f(w)^T(w'-w)+\frac{\mu}{2}\left\|w'-w\right\|^2$
* $\lambda_{\min}\left(\nabla^2 f(w)\right)\ge \mu > 0$
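A quick numerical sanity check of these definitions, as a minimal sketch: the quadratic $f(w)=\frac{1}{2}w^TAw$ below is a hypothetical example (not from the lecture notes), whose Hessian is $A$ everywhere, so $\mu$ and $L$ are simply its extreme eigenvalues.

```python
import numpy as np

# Hypothetical example: quadratic f(w) = 0.5 * w^T A w with symmetric positive definite A.
# Its Hessian is A everywhere, so mu = lambda_min(A) and L = lambda_max(A).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

mu, L = np.linalg.eigvalsh(A)[[0, -1]]   # smallest / largest eigenvalue

rng = np.random.default_rng(0)
for _ in range(1000):
    w, w2 = rng.normal(size=2), rng.normal(size=2)
    # Strong-convexity (lower) and smoothness (upper) quadratic bounds.
    lower = f(w) + grad(w) @ (w2 - w) + 0.5 * mu * np.linalg.norm(w2 - w) ** 2
    upper = f(w) + grad(w) @ (w2 - w) + 0.5 * L * np.linalg.norm(w2 - w) ** 2
    assert lower - 1e-9 <= f(w2) <= upper + 1e-9

print(f"mu = {mu:.3f}, L = {L:.3f}")
```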
Convergence Analysis
$f$ is $L$-smooth
The iterates are $w_0,w_1,\dots,w_t$ and the minimizer is $w^*$
$w_i=w_{i-1}-\eta \nabla f(w_{i-1})$, with step size $\eta<\frac{2}{L}$
Gradient descent makes progress: for $w'=w-\eta\nabla f(w)$,
$f(w')-f(w)\le -\eta\left(1-\frac{L\eta}{2}\right)\left\|\nabla f(w)\right\|^2$
- Convex function: $1/t$ convergence rate
$f(w_t)-f(w^*)\le \frac{\left\|w_0-w^*\right\|^2}{2t\eta}$
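A minimal gradient-descent sketch on a hypothetical least-squares problem (the data, objective, and iteration count are illustrative choices, not from the course), using $\eta=1/L$ so the progress bound above applies; the printed suboptimality should shrink at least as fast as $1/t$.

```python
import numpy as np

# Hypothetical smooth convex objective: least squares f(w) = 0.5 * ||Xw - y||^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)

L = np.linalg.eigvalsh(X.T @ X).max()        # smoothness constant of f
eta = 1.0 / L                                # satisfies eta < 2/L
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

w = np.zeros(5)
for t in range(1, 201):
    w = w - eta * grad(w)                    # w_t = w_{t-1} - eta * grad f(w_{t-1})
    if t % 50 == 0:
        # Suboptimality is bounded by ||w_0 - w*||^2 / (2 * t * eta), i.e. O(1/t).
        print(t, f(w) - f(w_star))
```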
Stochastic Gradient Descent (SGD)
$f$ is $L$-smooth and convex
$w_{t+1}=w_t-\eta G_t$, where $E[G_t]=\nabla f(w_t)$ and $E\left\|G_t-\nabla f(w_t)\right\|^2\le \sigma^2$
$E\,f(\overline{w}_t)-f(w^*)\le \frac{\left\|w_0-w^*\right\|^2}{2t\eta}+\eta \sigma^2$, where $\overline{w}_t$ is the average of the iterates
- Convergence rate $1/\sqrt{t}$, obtained by choosing $\eta\propto 1/\sqrt{t}$ to balance the two terms
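A minimal SGD sketch under the same assumptions: single-example gradients are unbiased estimates $G_t$ of the full gradient, and the averaged iterate $\bar{w}_t$ is what the bound above controls. The least-squares data and the constant step size are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
full_grad = lambda w: X.T @ (X @ w - y) / len(y)   # gradient of the average loss

eta, w = 0.01, np.zeros(5)                         # small constant step size
avg = np.zeros(5)
for t in range(1, 5001):
    i = rng.integers(len(y))                       # sample one example uniformly
    G = X[i] * (X[i] @ w - y[i])                   # E[G] = full gradient (unbiased)
    w = w - eta * G                                # w_{t+1} = w_t - eta * G_t
    avg += (w - avg) / t                           # running average, i.e. the averaged iterate

print(np.linalg.norm(full_grad(avg)))              # should end up small (noise-ball level)
```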
SVRG (Stochastic Variance Reduced Gradient)
Proof: To read.
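The notes do not spell SVRG out, so as a reminder here is a minimal sketch of the standard SVRG update (a full-gradient snapshot per epoch, variance-reduced stochastic gradients in the inner loop); the least-squares problem, step size, and epoch count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
grad_i = lambda w, i: X[i] * (X[i] @ w - y[i])     # per-example gradient
full_grad = lambda w: X.T @ (X @ w - y) / len(y)   # gradient of the average loss

eta, w = 0.02, np.zeros(5)
for epoch in range(20):                            # outer loop
    w_snap = w.copy()
    g_snap = full_grad(w_snap)                     # full gradient at the snapshot
    for _ in range(len(y)):                        # inner loop: one pass worth of steps
        i = rng.integers(len(y))
        # Variance-reduced gradient: still unbiased, but its variance
        # shrinks as both w and w_snap approach the optimum.
        v = grad_i(w, i) - grad_i(w_snap, i) + g_snap
        w = w - eta * v

print(np.linalg.norm(full_grad(w)))                # should be close to zero
```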
Mirror Descent
After $k$ iterations, $f(\bar{x})-f(u)\le \frac{1}{k}\sum_{t=0}^{k-1}\langle \nabla f(x_t),x_t-u\rangle$, where $\bar{x}=\frac{1}{k}\sum_{t=0}^{k-1}x_t$; the right-hand side (also called the regret) shrinks as $k$ grows.
$f$ is $\rho$-Lipschitz, i.e. $\left\|\nabla f(x)\right\|\le \rho$. After $T=O\!\left(\frac{\rho^2}{\epsilon^2}\right)$ iterations, $f(\bar{x})-f(x^*)\le \epsilon$, i.e. a $1/\sqrt{t}$ convergence rate.
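A minimal mirror-descent sketch, assuming the negative-entropy mirror map over the probability simplex (so the mirror step becomes the multiplicative, exponentiated-gradient update); the objective, step size, and iteration count here are hypothetical choices, not from the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(3)
A, b = rng.normal(size=(20, 10)), rng.normal(size=20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)       # convex objective over the simplex
grad = lambda x: A.T @ (A @ x - b)

eta = 0.1
x = np.ones(10) / 10                               # start at the uniform distribution
x_sum = np.zeros(10)
for t in range(500):
    # Negative-entropy mirror map -> multiplicative update, then renormalize
    # so the iterate stays on the probability simplex.
    x = x * np.exp(-eta * grad(x))
    x = x / x.sum()
    x_sum += x

x_bar = x_sum / 500                                # averaged iterate
print(f(x_bar))
```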
Linear Coupling: $1/t^2$ convergence rate, i.e. $t\ge \Omega\!\left(\left(\frac{1}{\epsilon}\right)^{1/2}\right)$ iterations suffice for error $\epsilon$.
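As a rough reminder of what the coupled update looks like (this sketch follows the linear-coupling formulation of Allen-Zhu and Orecchia; the symbols $\tau$, $\alpha$ and the Bregman divergence $V_{z}(\cdot)$ are not defined in the lecture notes), each iteration mixes a gradient step and a mirror step:
* Coupling: $x_{k+1}=\tau z_k+(1-\tau)y_k$
* Gradient step: $y_{k+1}=x_{k+1}-\frac{1}{L}\nabla f(x_{k+1})$
* Mirror step: $z_{k+1}=\arg\min_z\left\{\alpha\langle \nabla f(x_{k+1}),z\rangle+V_{z_k}(z)\right\}$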
