
Adam (Adaptive Moment Estimation)

Adam (Adaptive Moment Estimation) is an adaptive-learning-rate optimization algorithm that combines the strengths of momentum and RMSProp. It tracks both the first moment of the gradient (as in momentum) and the second moment (as in RMSProp), and uses them to adapt the effective learning rate, making parameter updates more stable and efficient.

How the Adam optimizer works

Adam updates the parameters through the following steps:

Compute the exponential moving average of the gradient (the first-moment estimate)
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
where:
- m_t is the first-moment (momentum) estimate of the gradient.
- g_t is the current gradient.
- \beta_1 is the momentum hyperparameter, typically 0.9.

Compute the exponential moving average of the squared gradient (the second-moment estimate)
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
where:
- v_t is the second-moment estimate of the gradient.
- \beta_2 is the RMSProp hyperparameter, typically 0.999.

Bias correction
Because both moment estimates are initialized at zero and are therefore biased toward zero in early iterations, Adam applies a bias correction:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update the parameters
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
where:
- \theta_t is the parameter vector at iteration t.
- \alpha is the learning rate.
- \epsilon is a small constant that prevents division by zero, typically 10^{-8}.
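
To make these four steps concrete, here is a minimal NumPy sketch of a single Adam update; the function name adam_step and its argument layout are illustrative choices for this article, not any library's API:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v
```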

A worked numerical example

Suppose we have a simple linear regression problem with the following training data:

x y
1 2
2 3
3 4
4 5

The linear model to fit is h_\theta(x) = \theta_0 + \theta_1 x.

Step 1: Initialize the parameters

Assume \theta_0 = 0 and \theta_1 = 0, with learning rate \alpha = 0.01, \beta_1 = 0.9, \beta_2 = 0.999, and \epsilon = 10^{-8}; initialize the moment estimates m_0 = 0 and v_0 = 0.

Step 2: Compute the gradient

The loss function J(\theta) is the mean squared error (MSE):
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2
where m is the number of training samples. Since we update on a single randomly chosen sample at a time (stochastic gradient descent), each step uses the per-sample loss \frac{1}{2}(h_\theta(x_i) - y_i)^2.

For the first sample (x_1, y_1) = (1, 2), the model prediction is:
h_\theta(x_1) = \theta_0 + \theta_1 x_1 = 0

Compute the gradients of the loss with respect to the parameters:
\frac{\partial J}{\partial \theta_0} = h_\theta(x_1) - y_1 = 0 - 2 = -2
\frac{\partial J}{\partial \theta_1} = (h_\theta(x_1) - y_1) x_1 = -2 \cdot 1 = -2

Step 3: Update the moment estimates

Update the first-moment estimate:
m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times 0 + 0.1 \times (-2) = -0.2
m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times 0 + 0.1 \times (-2) = -0.2

Update the second-moment estimate:
v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004
v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0 + 0.001 \times (-2)^2 = 0.004

Bias correction:
\hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2
\hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.2}{1 - 0.9^1} = \frac{-0.2}{0.1} = -2
\hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4
\hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.004}{1 - 0.999^1} = \frac{0.004}{0.001} = 4

Update the parameters:
\theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) = 0.01
\theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0 - \frac{0.01}{\sqrt{4} + 10^{-8}} \times (-2) = 0.01
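
As a sanity check on the hand arithmetic, the following sketch reproduces this first update with the adam_step helper defined above:

```python
theta = np.zeros(2)                       # [theta_0, theta_1], initialized to 0
m, v = np.zeros(2), np.zeros(2)           # moment estimates, initialized to 0
x1, y1 = 1.0, 2.0                         # first training sample
h = theta[0] + theta[1] * x1              # prediction: 0
grad = np.array([h - y1, (h - y1) * x1])  # gradients: [-2, -2]
theta, m, v = adam_step(theta, m, v, grad, t=1)
print(theta)                              # -> [0.01 0.01] (up to the tiny eps term)
```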

Second iteration

Suppose the next randomly selected sample is (x_2, y_2) = (2, 3).

Compute the new prediction:
h_\theta(x_2) = \theta_0 + \theta_1 x_2 = 0.01 + 0.01 \times 2 = 0.03

Compute the new gradients:
\frac{\partial J}{\partial \theta_0} = h_\theta(x_2) - y_2 = 0.03 - 3 = -2.97
\frac{\partial J}{\partial \theta_1} = (h_\theta(x_2) - y_2) x_2 = -2.97 \times 2 = -5.94

Update the first-moment estimate:
m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times (-0.2) + 0.1 \times (-2.97) = -0.477
m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times (-0.2) + 0.1 \times (-5.94) = -0.774

Update the second-moment estimate:
v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0.004 + 0.001 \times (-2.97)^2 \approx 0.0128
v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0.004 + 0.001 \times (-5.94)^2 \approx 0.0393

Bias correction:
\hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.477}{1 - 0.9^2} \approx -2.51
\hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-0.774}{1 - 0.9^2} \approx -4.07
\hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.0128}{1 - 0.999^2} \approx 6.41
\hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.0393}{1 - 0.999^2} \approx 19.65

Update the parameters:
\theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0.01 - \frac{0.01}{\sqrt{6.41} + 10^{-8}} \times (-2.51) \approx 0.02
\theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0.01 - \frac{0.01}{\sqrt{19.65} + 10^{-8}} \times (-4.07) \approx 0.02

Third iteration

Suppose the next randomly selected sample is (x_3, y_3) = (3, 4).

Compute the new prediction (carrying forward the rounded values \theta_0 = \theta_1 = 0.02):
h_\theta(x_3) = \theta_0 + \theta_1 x_3 = 0.02 + 0.02 \times 3 = 0.08

Compute the new gradients:
\frac{\partial J}{\partial \theta_0} = h_\theta(x_3) - y_3 = 0.08 - 4 = -3.92
\frac{\partial J}{\partial \theta_1} = (h_\theta(x_3) - y_3) x_3 = -3.92 \times 3 = -11.76

Update the first-moment estimate:
m_{t,0} = \beta_1 m_{t-1,0} + (1 - \beta_1) g_{t,0} = 0.9 \times (-0.477) + 0.1 \times (-3.92) \approx -0.821
m_{t,1} = \beta_1 m_{t-1,1} + (1 - \beta_1) g_{t,1} = 0.9 \times (-0.774) + 0.1 \times (-11.76) \approx -1.873

Update the second-moment estimate:
v_{t,0} = \beta_2 v_{t-1,0} + (1 - \beta_2) g_{t,0}^2 = 0.999 \times 0.0128 + 0.001 \times (-3.92)^2 \approx 0.0282
v_{t,1} = \beta_2 v_{t-1,1} + (1 - \beta_2) g_{t,1}^2 = 0.999 \times 0.0393 + 0.001 \times (-11.76)^2 \approx 0.1776

Bias correction:
\hat{m}_{t,0} = \frac{m_{t,0}}{1 - \beta_1^t} = \frac{-0.821}{1 - 0.9^3} \approx -3.03
\hat{m}_{t,1} = \frac{m_{t,1}}{1 - \beta_1^t} = \frac{-1.873}{1 - 0.9^3} \approx -6.91
\hat{v}_{t,0} = \frac{v_{t,0}}{1 - \beta_2^t} = \frac{0.0282}{1 - 0.999^3} \approx 9.41
\hat{v}_{t,1} = \frac{v_{t,1}}{1 - \beta_2^t} = \frac{0.1776}{1 - 0.999^3} \approx 59.26

Update the parameters:
\theta_{t,0} = \theta_{t-1,0} - \frac{\alpha}{\sqrt{\hat{v}_{t,0}} + \epsilon} \hat{m}_{t,0} = 0.02 - \frac{0.01}{\sqrt{9.41} + 10^{-8}} \times (-3.03) \approx 0.03
\theta_{t,1} = \theta_{t-1,1} - \frac{\alpha}{\sqrt{\hat{v}_{t,1}} + \epsilon} \hat{m}_{t,1} = 0.02 - \frac{0.01}{\sqrt{59.26} + 10^{-8}} \times (-6.91) \approx 0.03
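
After three iterations both parameters have moved from 0 to about 0.03, and continuing the same procedure drives them toward the true line y = x + 1, i.e. \theta_0 \to 1 and \theta_1 \to 1. A minimal sketch of the full loop, again using the hypothetical adam_step helper (cycling through the samples in order rather than sampling randomly):

```python
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 3.0, 4.0, 5.0])
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
t = 0
for epoch in range(2000):
    for x, y in zip(X, Y):
        t += 1                                 # bias correction needs a global step count
        h = theta[0] + theta[1] * x
        grad = np.array([h - y, (h - y) * x])  # single-sample MSE gradient
        theta, m, v = adam_step(theta, m, v, grad, t)
print(theta)                                   # converges to approximately [1. 1.]
```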

Summary

Adam combines the strengths of momentum and RMSProp: by tracking both the first and second moments of the gradient, it adapts the effective step size per parameter, which makes the updates more stable and efficient. The worked example shows, iteration by iteration, how Adam accumulates the moment estimates, applies bias correction, and updates the parameters to speed up convergence.
