
L20 Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent

airfoil4755 download
Link: https://pan.baidu.com/s/1YEtNjJ0_G9eeH6A6vHXhnA
Extraction code: dwjq

Gradient Descent

(Boyd & Vandenberghe, Convex Optimization)

    %matplotlib inline
    import numpy as np
    import torch
    import time
    from torch import nn, optim
    import math
    import sys
    sys.path.append('/home/kesci/input')
    import d2lzh1981 as d2l

Gradient Descent in One Dimension

Proof that moving the variable in the direction opposite to the gradient decreases the function value.

Taylor expansion:

f(x+\epsilon)=f(x)+\epsilon f^{\prime}(x)+\mathcal{O}\left(\epsilon^{2}\right)

Substituting a step of -\eta f^{\prime}(x) along the negative gradient direction, i.e. \epsilon = -\eta f^{\prime}(x) with learning rate \eta > 0:

f\left(x-\eta f^{\prime}(x)\right)=f(x)-\eta f^{\prime 2}(x)+\mathcal{O}\left(\eta^{2} f^{\prime 2}(x)\right)

For a sufficiently small \eta > 0, the first-order term -\eta f^{\prime 2}(x) dominates the remainder, so

f\left(x-\eta f^{\prime}(x)\right) \lesssim f(x)

This motivates the update rule

x \leftarrow x-\eta f^{\prime}(x)

e.g.

f(x) = x^2

    def f(x):
        return x**2  # Objective function

    def gradf(x):
        return 2 * x  # Its derivative

    def gd(eta):
        x = 10
        results = [x]
        for i in range(10):
            x -= eta * gradf(x)
            results.append(x)
        print('epoch 10, x:', x)
        return results

    res = gd(0.2)

    epoch 10, x: 0.06046617599999997

    def show_trace(res):
        n = max(abs(min(res)), abs(max(res)))
        f_line = np.arange(-n, n, 0.01)
        d2l.set_figsize((3.5, 2.5))
        d2l.plt.plot(f_line, [f(x) for x in f_line], '-')
        d2l.plt.plot(res, [f(x) for x in res], '-o')
        d2l.plt.xlabel('x')
        d2l.plt.ylabel('f(x)')


    show_trace(res)

Learning Rate

    show_trace(gd(0.05))

    epoch 10, x: 3.4867844009999995

    show_trace(gd(1.1))

    epoch 10, x: 61.917364224000096

Local Minima

e.g.

f(x) = x \cos(cx)

    c = 0.15 * np.pi

    def f(x):
        return x * np.cos(c * x)

    def gradf(x):
        return np.cos(c * x) - c * x * np.sin(c * x)

    show_trace(gd(2))

    epoch 10, x: -1.528165927635083

Gradient Descent in Multiple Dimensions

The gradient \nabla f(\mathbf{x}) at a point \mathbf{x} is the vector of partial derivatives of f:

\nabla f(\mathbf{x})=\left[\frac{\partial f(\mathbf{x})}{\partial x_{1}}, \frac{\partial f(\mathbf{x})}{\partial x_{2}}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_{d}}\right]^{\top}

The first-order Taylor expansion becomes:

f(\mathbf{x}+\boldsymbol{\epsilon})=f(\mathbf{x})+\boldsymbol{\epsilon}^{\top} \nabla f(\mathbf{x})+\mathcal{O}\left(\|\boldsymbol{\epsilon}\|^{2}\right)

\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f(\mathbf{x})

    def train_2d(trainer, steps=20):
        x1, x2 = -5, -2
        results = [(x1, x2)]
        for i in range(steps):
            x1, x2 = trainer(x1, x2)
            results.append((x1, x2))
        print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
        return results

    def show_trace_2d(f, results):
        d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
        x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1), np.arange(-3.0, 1.0, 0.1))
        d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
        d2l.plt.xlabel('x1')
        d2l.plt.ylabel('x2')

f(x) = x_1^2 + 2x_2^2

    eta = 0.1

    def f_2d(x1, x2):  # objective function
        return x1 ** 2 + 2 * x2 ** 2

    def gd_2d(x1, x2):
        return (x1 - eta * 2 * x1, x2 - eta * 4 * x2)

    show_trace_2d(f_2d, train_2d(gd_2d))

    epoch 20, x1 -0.057646, x2 -0.000073

Adaptive Methods

Newton's Method

Taylor expansion at \mathbf{x}+\boldsymbol{\epsilon}:

f(\mathbf{x}+\boldsymbol{\epsilon})=f(\mathbf{x})+\boldsymbol{\epsilon}^{\top} \nabla f(\mathbf{x})+\frac{1}{2} \boldsymbol{\epsilon}^{\top} \boldsymbol{H}_{f} \boldsymbol{\epsilon}+\mathcal{O}\left(\|\boldsymbol{\epsilon}\|^{3}\right)

where \boldsymbol{H}_{f} denotes the Hessian matrix of f at \mathbf{x}.

At a minimum we have \nabla f(\mathbf{x})=0, and we want the perturbed point \mathbf{x}+\boldsymbol{\epsilon} to satisfy this condition as well. Taking the gradient of the expansion above with respect to \boldsymbol{\epsilon} and ignoring the higher-order terms gives:

\nabla f(\mathbf{x})+\boldsymbol{H}_{f} \boldsymbol{\epsilon}=0 \quad \Longrightarrow \quad \boldsymbol{\epsilon}=-\boldsymbol{H}_{f}^{-1} \nabla f(\mathbf{x})
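
As an illustrative sketch (the helper newton_step and the small quadratic below are assumptions for demonstration, not utilities from the notes), a single multivariate Newton step can be computed by solving the linear system \boldsymbol{H}_{f} \boldsymbol{\epsilon}=-\nabla f(\mathbf{x}) instead of forming the inverse Hessian explicitly:

    import numpy as np

    def newton_step(x, gradf, hessf):
        # Solve H_f @ eps = -grad f(x) rather than inverting the Hessian
        eps = np.linalg.solve(hessf(x), -gradf(x))
        return x + eps

    # Small quadratic for illustration: f(x) = x1^2 + 2 * x2^2, Hessian = diag(2, 4)
    gradf_q = lambda x: np.array([2 * x[0], 4 * x[1]])
    hessf_q = lambda x: np.diag([2.0, 4.0])
    print(newton_step(np.array([-5.0, -2.0]), gradf_q, hessf_q))  # one step reaches [0. 0.]

For a quadratic objective the Hessian is constant, so a single Newton step lands exactly on the minimizer; the one-dimensional examples below iterate the same update with a scalar Hessian.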

    c = 0.5

    def f(x):
        return np.cosh(c * x)  # Objective

    def gradf(x):
        return c * np.sinh(c * x)  # Derivative

    def hessf(x):
        return c**2 * np.cosh(c * x)  # Hessian

    # Hide learning rate for now
    def newton(eta=1):
        x = 10
        results = [x]
        for i in range(10):
            x -= eta * gradf(x) / hessf(x)
            results.append(x)
        print('epoch 10, x:', x)
        return results

    show_trace(newton())

    epoch 10, x: 0.0

    c = 0.15 * np.pi

    def f(x):
        return x * np.cos(c * x)

    def gradf(x):
        return np.cos(c * x) - c * x * np.sin(c * x)

    def hessf(x):
        return - 2 * c * np.sin(c * x) - x * c**2 * np.cos(c * x)

    show_trace(newton())

    epoch 10, x: 26.83413291324767

    show_trace(newton(0.5))

    epoch 10, x: 7.269860168684531

Convergence Analysis

We only consider the convergence rate when the function is convex and f''(x^*) > 0 at the minimizer:

Let x_k denote the value of x after the k-th iteration and e_{k}:=x_{k}-x^{*} the distance from x_k to the minimizer x^{*}. Since f'(x^{*}) = 0, a Taylor expansion of f' around x_k gives:

0=f^{\prime}\left(x_{k}-e_{k}\right)=f^{\prime}\left(x_{k}\right)-e_{k} f^{\prime \prime}\left(x_{k}\right)+\frac{1}{2} e_{k}^{2} f^{\prime \prime \prime}\left(\xi_{k}\right), \quad \xi_{k} \in\left[x_{k}-e_{k}, x_{k}\right]

Dividing both sides by f''(x_k) gives:

e_{k}-\frac{f^{\prime}\left(x_{k}\right)}{f^{\prime \prime}\left(x_{k}\right)}=\frac{1}{2} e_{k}^{2} \frac{f^{\prime \prime \prime}\left(\xi_{k}\right)}{f^{\prime \prime}\left(x_{k}\right)}

Substituting the update equation x_{k+1} = x_{k} - f^{\prime}\left(x_{k}\right) / f^{\prime \prime}\left(x_{k}\right), we obtain:

x_{k}-x^{*}-\frac{f^{\prime}\left(x_{k}\right)}{f^{\prime \prime}\left(x_{k}\right)}=\frac{1}{2} e_{k}^{2} \frac{f^{\prime \prime \prime}\left(\xi_{k}\right)}{f^{\prime \prime}\left(x_{k}\right)}

That is, the error satisfies e_{k+1}=\frac{1}{2} e_{k}^{2} \frac{f^{\prime \prime \prime}\left(\xi_{k}\right)}{f^{\prime \prime}\left(x_{k}\right)}.

If there is a constant c such that \frac{1}{2} f^{\prime \prime \prime}\left(\xi_{k}\right) / f^{\prime \prime}\left(x_{k}\right) \leq c, then

e_{k+1} \leq c e_{k}^{2}
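
A quick numeric check of this bound (an illustrative sketch that reuses the convex objective f(x) = \cosh(cx) with c = 0.5 from the Newton example above, whose minimizer is x^* = 0):

    import numpy as np

    c_newton = 0.5            # same constant as in the cosh example above
    x = 10.0
    errors = []
    for _ in range(8):
        # Newton update for f(x) = cosh(c_newton * x)
        x -= (c_newton * np.sinh(c_newton * x)) / (c_newton**2 * np.cosh(c_newton * x))
        errors.append(abs(x))  # e_k = |x_k - x^*| and x^* = 0
    print(errors)

Far from the minimizer the error shrinks slowly, but once x_k is close to x^* the error is roughly squared at every iteration, as the bound predicts.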

Preconditioning (Hessian-aided gradient descent)

\mathbf{x} \leftarrow \mathbf{x}-\eta \operatorname{diag}\left(\boldsymbol{H}_{f}\right)^{-1} \nabla f(\mathbf{x})
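
Below is a minimal sketch of this update on the quadratic f_2d(x1, x2) = x1^2 + 2*x2^2 defined earlier; its Hessian is diag(2, 4), so the diagonal preconditioner rescales each coordinate's gradient and both coordinates then contract at the same rate. The step size eta_pc = 0.5 is an arbitrary illustrative choice.

    eta_pc = 0.5  # illustrative step size for the preconditioned update

    def precond_gd_2d(x1, x2):
        g1, g2 = 2 * x1, 4 * x2   # gradient of f_2d
        h1, h2 = 2.0, 4.0         # diagonal of the Hessian of f_2d
        return (x1 - eta_pc * g1 / h1, x2 - eta_pc * g2 / h2)

    show_trace_2d(f_2d, train_2d(precond_gd_2d))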

Gradient Descent with Line Search (conjugate gradient method)
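
The notes only name this topic, so here is a minimal sketch of gradient descent with a backtracking (Armijo) line search on the same quadratic f_2d: the step length is chosen automatically at each iteration instead of fixing eta. The constants alpha = 0.3 and beta = 0.8 are illustrative assumptions, and the full conjugate gradient method would additionally make successive search directions conjugate, which is not shown here.

    def gd_line_search(x1, x2, steps=20, alpha=0.3, beta=0.8):
        results = [(x1, x2)]
        for _ in range(steps):
            g1, g2 = 2 * x1, 4 * x2   # gradient of f_2d
            t = 1.0                   # start from a unit step length
            # Shrink t until the Armijo sufficient-decrease condition holds
            while f_2d(x1 - t * g1, x2 - t * g2) > f_2d(x1, x2) - alpha * t * (g1**2 + g2**2):
                t *= beta
            x1, x2 = x1 - t * g1, x2 - t * g2
            results.append((x1, x2))
        return results

    show_trace_2d(f_2d, gd_line_search(-5, -2))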

Stochastic Gradient Descent

Parameter Update in Stochastic Gradient Descent

Given a training dataset containing n examples, the objective function is the average of the per-example losses:

f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} f_{i}(\mathbf{x})

Its gradient is:

\nabla f(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})

A single update using this gradient has time complexity \mathcal{O}(n).

The stochastic gradient descent update, with a randomly sampled index i, costs only \mathcal{O}(1) per step:

\mathbf{x} \leftarrow \mathbf{x}-\eta \nabla f_{i}(\mathbf{x})

and the sampled gradient is an unbiased estimate of the full gradient:

\mathbb{E}_{i} \nabla f_{i}(\mathbf{x})=\frac{1}{n} \sum_{i=1}^{n} \nabla f_{i}(\mathbf{x})=\nabla f(\mathbf{x})
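
A quick numeric sanity check of this identity (illustrative only; the per-example losses f_i(x) = (x - a_i)^2 and the values a_i are hypothetical): averaging the per-example gradients reproduces the gradient of the full objective.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical per-example targets
    x = 0.5
    per_example_grads = 2 * (x - a)      # gradient of each f_i(x) = (x - a_i)^2
    full_grad = 2 * (x - a.mean())       # gradient of f(x) = (1/n) * sum_i f_i(x)
    print(per_example_grads.mean(), full_grad)  # both equal -4.0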

e.g.

f(x_1, x_2) = x_1^2 + 2 x_2^2

    def f(x1, x2):
        return x1 ** 2 + 2 * x2 ** 2  # Objective

    def gradf(x1, x2):
        return (2 * x1, 4 * x2)  # Gradient

    def sgd(x1, x2):  # Simulate noisy gradient
        global lr  # Learning rate scheduler
        (g1, g2) = gradf(x1, x2)  # Compute gradient
        (g1, g2) = (g1 + np.random.normal(0.1), g2 + np.random.normal(0.1))
        eta_t = eta * lr()  # Learning rate at time t
        return (x1 - eta_t * g1, x2 - eta_t * g2)  # Update variables

    eta = 0.1
    lr = (lambda: 1)  # Constant learning rate
    show_trace_2d(f, train_2d(sgd, steps=50))

    epoch 50, x1 -0.027566, x2 0.137605

Dynamic Learning Rate

\begin{array}{ll}{\eta(t)=\eta_{i} \text { if } t_{i} \leq t \leq t_{i+1}} & {\text { piecewise constant }} \\ {\eta(t)=\eta_{0} \cdot e^{-\lambda t}} & {\text { exponential }} \\ {\eta(t)=\eta_{0} \cdot(\beta t+1)^{-\alpha}} & {\text { polynomial }}\end{array}
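
The exponential and polynomial schedules are implemented in the cells below. As a complement, here is a minimal sketch of the piecewise constant schedule, following the same lr() convention used by sgd above; the breakpoints (200, 500) and the returned values are illustrative assumptions, not tuned settings.

    def piecewise_constant():
        global ctr
        ctr += 1
        if ctr < 200:
            return 1.0    # eta_0
        elif ctr < 500:
            return 0.1    # eta_1
        else:
            return 0.01   # eta_2

    # Usage follows the same pattern as the schedules below, e.g.:
    # ctr = 1; lr = piecewise_constant; show_trace_2d(f, train_2d(sgd, steps=1000))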

    def exponential():
        global ctr
        ctr += 1
        return math.exp(-0.1 * ctr)

    ctr = 1
    lr = exponential  # Set up learning rate
    show_trace_2d(f, train_2d(sgd, steps=1000))

    epoch 1000, x1 -0.677947, x2 -0.089379

    def polynomial():
        global ctr
        ctr += 1
        return (1 + 0.1 * ctr)**(-0.5)

    ctr = 1
    lr = polynomial  # Set up learning rate
    show_trace_2d(f, train_2d(sgd, steps=50))

    epoch 50, x1 -0.095244, x2 -0.041674

Mini-batch Stochastic Gradient Descent

Reading the Data

    def get_data_ch7():  # This function is saved in the d2lzh_pytorch package for later use
        data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
        data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize
        return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
               torch.tensor(data[:1500, -1], dtype=torch.float32)  # first 1500 examples (each with 5 features)

    features, labels = get_data_ch7()
    features.shape

    torch.Size([1500, 5])

    import pandas as pd
    df = pd.read_csv('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t', header=None)
    df.head(10)

         0    1       2     3         4        5
    0   800  0.0  0.3048  71.3  0.002663  126.201
    1  1000  0.0  0.3048  71.3  0.002663  125.201
    2  1250  0.0  0.3048  71.3  0.002663  125.951
    3  1600  0.0  0.3048  71.3  0.002663  127.591
    4  2000  0.0  0.3048  71.3  0.002663  127.461
    5  2500  0.0  0.3048  71.3  0.002663  125.571
    6  3150  0.0  0.3048  71.3  0.002663  125.201
    7  4000  0.0  0.3048  71.3  0.002663  123.061
    8  5000  0.0  0.3048  71.3  0.002663  121.301
    9  6300  0.0  0.3048  71.3  0.002663  119.541

Implementation from Scratch

    def sgd(params, states, hyperparams):
        for p in params:
            p.data -= hyperparams['lr'] * p.grad.data

    # This function is saved in the d2lzh_pytorch package for later use
    def train_ch7(optimizer_fn, states, hyperparams, features, labels,
                  batch_size=10, num_epochs=2):
        # Initialize the model
        net, loss = d2l.linreg, d2l.squared_loss

        w = torch.nn.Parameter(torch.tensor(np.random.normal(0, 0.01, size=(features.shape[1], 1)), dtype=torch.float32),
                               requires_grad=True)
        b = torch.nn.Parameter(torch.zeros(1, dtype=torch.float32), requires_grad=True)

        def eval_loss():
            return loss(net(features, w, b), labels).mean().item()

        ls = [eval_loss()]
        data_iter = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)

        for _ in range(num_epochs):
            start = time.time()
            for batch_i, (X, y) in enumerate(data_iter):
                l = loss(net(X, w, b), y).mean()  # use the mean loss

                # reset gradients
                if w.grad is not None:
                    w.grad.data.zero_()
                    b.grad.data.zero_()

                l.backward()
                optimizer_fn([w, b], states, hyperparams)  # update model parameters
                if (batch_i + 1) * batch_size % 100 == 0:
                    ls.append(eval_loss())  # record the training loss every 100 examples
        # Print the result and plot
        print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
        d2l.set_figsize()
        d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
        d2l.plt.xlabel('epoch')
        d2l.plt.ylabel('loss')

    def train_sgd(lr, batch_size, num_epochs=2):
        train_ch7(sgd, None, {'lr': lr}, features, labels, batch_size, num_epochs)

Comparison

The three runs below compare full-batch gradient descent (batch size 1500, 6 epochs), stochastic gradient descent (batch size 1), and mini-batch stochastic gradient descent (batch size 10).

    train_sgd(1, 1500, 6)

    loss: 0.244373, 0.009881 sec per epoch

    train_sgd(0.005, 1)

    loss: 0.245968, 0.463836 sec per epoch

    train_sgd(0.05, 10)

    loss: 0.243900, 0.065017 sec per epoch

Concise Implementation

    # Unlike the original book, the first argument here is an optimizer function
    # rather than the optimizer's name,
    # e.g.: optimizer_fn=torch.optim.SGD, optimizer_hyperparams={"lr": 0.05}
    def train_pytorch_ch7(optimizer_fn, optimizer_hyperparams, features, labels,
                          batch_size=10, num_epochs=2):
        # Initialize the model
        net = nn.Sequential(
            nn.Linear(features.shape[-1], 1)
        )
        loss = nn.MSELoss()
        optimizer = optimizer_fn(net.parameters(), **optimizer_hyperparams)

        def eval_loss():
            return loss(net(features).view(-1), labels).item() / 2

        ls = [eval_loss()]
        data_iter = torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(features, labels), batch_size, shuffle=True)

        for _ in range(num_epochs):
            start = time.time()
            for batch_i, (X, y) in enumerate(data_iter):
                # Divide by 2 to stay consistent with train_ch7,
                # since squared_loss divides by 2
                l = loss(net(X).view(-1), y) / 2

                optimizer.zero_grad()
                l.backward()
                optimizer.step()
                if (batch_i + 1) * batch_size % 100 == 0:
                    ls.append(eval_loss())
        # Print the result and plot
        print('loss: %f, %f sec per epoch' % (ls[-1], time.time() - start))
        d2l.set_figsize()
        d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)
        d2l.plt.xlabel('epoch')
        d2l.plt.ylabel('loss')

    train_pytorch_ch7(optim.SGD, {"lr": 0.05}, features, labels, 10)

    loss: 0.243770, 0.047664 sec per epoch
