
Stanford CS231n Course Notes: Assignment 2, FullyConnectedNets


Contents

  • Purpose of the Assignment
  • Layer Implementations
  • Optimization Methods
  • Inline Questions
  • References

I. Purpose of the Assignment

The previous assignment implemented a two-layer neural network and wrapped it up in a single program. In that implementation the loss function and the backward pass were written inside one monolithic function, with no separation of responsibilities, which makes it hard to build and extend more complex networks. The goal of this exercise is therefore to design and implement each component as an independent, reusable module, and along the way to explore how to construct more complex network architectures.

II. Layer Implementations

1.affine_layer(layers.py)

1.1 affine_forward

Inputs:
- x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
- w: A numpy array of weights, of shape (D, M)
- b: A numpy array of biases, of shape (M,)

Returns a tuple of:
- out: output, of shape (N, M)
- cache: (x, w, b)

def affine_forward(x, w, b):
    out = np.dot(x.reshape((x.shape[0], -1)), w) + b
    cache = (x, w, b)
    return out, cache

The forward pass is simple and can be written as out = x * w + b. After reshaping, x has shape (N, D) and w has shape (D, M), so the output out has shape (N, M). Note that adding the bias b relies on NumPy broadcasting.
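As a quick sanity check of the shapes and the broadcasting, here is a minimal sketch (the toy sizes below are made up for illustration and assume affine_forward as defined above):

import numpy as np

# Toy input: N = 2 samples, each of shape (3, 4) -> flattened D = 12.
x = np.random.randn(2, 3, 4)
w = np.random.randn(12, 5)   # D = 12, M = 5
b = np.random.randn(5)

out, cache = affine_forward(x, w, b)
print(out.shape)             # (2, 5): b of shape (5,) is broadcast over the N rows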

1.2 affine_backward

Inputs:
- dout: Upstream derivative, of shape (N, M)
- cache: Tuple of:
  - x: Input data, of shape (N, d_1, ..., d_k)
  - w: Weights, of shape (D, M)
  - b: Biases, of shape (M,)

Returns a tuple of:
- dx: Gradient with respect to x
- dw: Gradient with respect to w
- db: Gradient with respect to b

def affine_backward(dout, cache):
    x, w, b = cache
    dw = np.dot(x.reshape((x.shape[0], -1)).T, dout)
    db = dout.sum(axis=0)
    dx = np.dot(dout, w.T)
    dx = dx.reshape(x.shape)
    return dx, dw, db

The main thing to watch in the backward pass is the shapes:

dw = x.T · dout                               (D, M) = (D, N) · (N, M)
db = column-wise sum of dout (sum over axis 0)   (M,) = sum over axis 0 of (N, M)
dx = dout · w.T                               (N, D) = (N, M) · (M, D)
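A quick way to verify these formulas is a centered-difference numerical gradient check. Below is a minimal sketch; the num_grad helper is written inline for self-containment rather than taken from the course's gradient-check utilities:

import numpy as np

def num_grad(f, x, df, h=1e-5):
    # Centered-difference numerical gradient of f at x, contracted with the
    # upstream gradient df (same idea as the course's gradient-check helpers).
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * df) / (2 * h)
    return grad

x = np.random.randn(4, 3, 2)
w = np.random.randn(6, 5)
b = np.random.randn(5)
dout = np.random.randn(4, 5)

out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

dx_num = num_grad(lambda v: affine_forward(v, w, b)[0], x, dout)
dw_num = num_grad(lambda v: affine_forward(x, v, b)[0], w, dout)
print(np.max(np.abs(dx - dx_num)), np.max(np.abs(dw - dw_num)))  # both should be tiny (~1e-9)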

2.ReLU layer(layers.py)

2.1 relu_forward

Inputs:
- x: Input, of any shape

Returns a tuple of:
- out: Output, of the same shape as x
- cache: x

def relu_forward(x):
    out = x * (x > 0)
    cache = x
    return out, cache

(x > 0) is a boolean mask; multiplying by it keeps the entries of x that are greater than 0 and zeros out everything else.

2.2 relu_backward

Inputs:
- dout: Upstream derivatives, of any shape
- cache: Input x, of same shape as dout

Returns:
- dx: Gradient with respect to x

def relu_backward(dout, cache):
    dx, x = None, cache
    dx = dout * (x > 0)
    return dx

Likewise, the gradient only flows through the entries where x > 0; everywhere else it is zero.
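A tiny example on a hand-picked array (the values are chosen only for illustration, using the two functions above):

import numpy as np

x = np.array([[-2.0, 0.0, 3.0]])
dout = np.array([[10.0, 10.0, 10.0]])

out, cache = relu_forward(x)      # negatives are zeroed out: [-0., 0., 3.]
dx = relu_backward(dout, cache)   # gradient passes only where x > 0: [0., 0., 10.]
print(out)
print(dx)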

3.Loss layers: Softmax and SVM(layers.py)

def svm_loss(x, y):
    """
    Computes the loss and gradient for multiclass SVM classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y]
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0
    loss = np.sum(margins) / N
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    dx[np.arange(N), y] -= num_pos
    dx /= N
    return loss, dx


def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    shifted_logits = x - np.max(x, axis=1, keepdims=True)
    Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
    log_probs = shifted_logits - np.log(Z)
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx
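A useful sanity check on both losses (a minimal sketch; the toy sizes are arbitrary): with small random scores and C classes, the softmax loss should be close to log(C) and the SVM loss close to C - 1.

import numpy as np

np.random.seed(0)
N, C = 100, 10
x = 0.001 * np.random.randn(N, C)   # small random scores
y = np.random.randint(C, size=N)

svm_l, _ = svm_loss(x, y)
sm_l, _ = softmax_loss(x, y)
print(svm_l)   # ~ C - 1 = 9
print(sm_l)    # ~ log(C) ≈ 2.3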

4.Two-layer network(fc_net.py)

Previous assignment: Two-Layer Neural Network

5. Solver

In the previous assignment, the training logic was embedded in the model itself. This assignment takes a more modular approach: the training logic is factored out into its own Solver class.
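A typical way to use it looks roughly like the sketch below. It assumes the assignment's TwoLayerNet and Solver classes and a data dictionary with keys 'X_train', 'y_train', 'X_val', 'y_val'; the argument values are placeholders to tune, not recommended settings.

model = TwoLayerNet(hidden_dim=100, reg=1e-3)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95,
                num_epochs=10,
                batch_size=100,
                print_every=100)
solver.train()

# After training, solver.loss_history / train_acc_history / val_acc_history
# can be plotted as in the snippet below.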

Some plotting code:

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()

Layout adjustments
Two problems often come up when drawing subplots:

Overlapping axes and labels: call tight_layout() and tune its parameters:

  • pad controls the padding around the figure border as a whole
  • w_pad controls the horizontal padding between subplots
  • h_pad controls the vertical padding between subplots

Overly cramped figures: call set_size_inches() to set the actual figure size in inches.

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

fig = plt.gcf()
fig.set_size_inches(10, 8)

6.Multilayer network

Design a fully-connected network with a variable number of intermediate layers.

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3 * 32 * 32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        for i in range(self.num_layers - 1):
            # Layer i + 1 maps from prev_dim to hidden_dims[i].
            prev_dim = input_dim if i == 0 else hidden_dims[i - 1]
            self.params['W%s' % (i + 1)] = np.random.normal(
                0, weight_scale, size=(prev_dim, hidden_dims[i]))
            self.params['b%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
            if self.normalization is not None:
                self.params['gamma%s' % (i + 1)] = np.ones(shape=(hidden_dims[i]))
                self.params['beta%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
        self.params['W%s' % self.num_layers] = np.random.normal(
            0, weight_scale, size=(hidden_dims[-1], num_classes))
        self.params['b%s' % self.num_layers] = np.zeros(shape=(num_classes))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = X
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        caches = list()
        for i in range(self.num_layers - 1):
            # Each hidden layer: affine -> [batch/layer norm] -> relu -> [dropout].
            cache = list()
            scores, fc_cache = affine_forward(scores, self.params['W%s' % (i + 1)],
                                              self.params['b%s' % (i + 1)])
            cache.append(fc_cache)
            if self.normalization == 'batchnorm':
                scores, bn_cache = batchnorm_forward(scores, self.params['gamma%s' % (i + 1)],
                                                     self.params['beta%s' % (i + 1)],
                                                     self.bn_params[i])
                cache.append(bn_cache)
            elif self.normalization == 'layernorm':
                scores, ln_cache = layernorm_forward(scores, self.params['gamma%s' % (i + 1)],
                                                     self.params['beta%s' % (i + 1)],
                                                     self.bn_params[i])
                cache.append(ln_cache)
            scores, relu_cache = relu_forward(scores)
            cache.append(relu_cache)
            if self.use_dropout:
                scores, dropout_cache = dropout_forward(scores, self.dropout_param)
                cache.append(dropout_cache)
            caches.append(cache)
        # Last layer: affine only; its scores feed into the softmax loss below.
        scores, fc_cache = affine_forward(scores, self.params['W%s' % self.num_layers],
                                          self.params['b%s' % self.num_layers])
        caches.append(fc_cache)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dx = softmax_loss(scores, y)
        # Add the L2 regularization term for every weight matrix.
        for i in range(1, self.num_layers + 1):
            loss += 0.5 * self.reg * np.sum(self.params['W%s' % i] ** 2)

        for i in range(self.num_layers, 0, -1):
            if i == self.num_layers:
                # Last layer: plain affine; its cache was stored directly, not in a list.
                dx, dw, db = affine_backward(dx, caches[i - 1])
                grads['W%s' % i] = dw + self.reg * self.params['W%s' % i]
                grads['b%s' % i] = db
            else:
                # Hidden layers: undo dropout -> relu -> [norm] -> affine, in reverse order.
                if self.use_dropout:
                    dx = dropout_backward(dx, caches[i - 1][-1])
                if self.normalization is not None:
                    dx = relu_backward(dx, caches[i - 1][2])
                    if self.normalization == 'batchnorm':
                        dx, dgamma, dbeta = batchnorm_backward_alt(dx, caches[i - 1][1])
                    elif self.normalization == 'layernorm':
                        dx, dgamma, dbeta = layernorm_backward(dx, caches[i - 1][1])
                    else:
                        raise ValueError("No such normalization")
                    grads['gamma%s' % i] = dgamma
                    grads['beta%s' % i] = dbeta
                else:
                    dx = relu_backward(dx, caches[i - 1][1])
                dx, dw, db = affine_backward(dx, caches[i - 1][0])
                grads['W%s' % i] = dw + self.reg * self.params['W%s' % i]
                grads['b%s' % i] = db
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
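With the layers in place, a quick end-to-end check looks roughly like the sketch below. It assumes the notebook's usual data dictionary with keys 'X_train', 'y_train', 'X_val', 'y_val' and the Solver class from above; the hyperparameters are placeholders to tune, not recommended values.

# Overfit a tiny subset first; if the net cannot reach ~100% training accuracy
# on 50 images, something in the forward/backward pass is likely wrong.
small_data = {
    'X_train': data['X_train'][:50],
    'y_train': data['y_train'][:50],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

model = FullyConnectedNet([100, 100, 100, 100],   # a 5-layer net
                          weight_scale=1e-1, dtype=np.float64)
solver = Solver(model, small_data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-2},
                num_epochs=20, batch_size=25,
                print_every=10)
solver.train()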

III. Optimization Methods

Summary of optimization methods
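For reference, here is a minimal sketch of the Adam update in the same style as the assignment's optim.py (the config keys and defaults follow that convention, but treat the exact values as assumptions):

import numpy as np

def adam(w, dw, config=None):
    # Adam: exponentially decaying averages of the gradient (m) and of the
    # squared gradient (v), with bias correction for the early iterations.
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)

    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dw ** 2)
    m_hat = config['m'] / (1 - config['beta1'] ** config['t'])   # bias correction
    v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
    next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
    return next_w, config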

IV. Inline Questions

Inline Question 1:
We only asked you to implement ReLU, but there are a number of different activation functions that one could use in a neural network, each with its pros and cons. In particular, a problem commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one-dimensional case, what types of input would lead to this behavior?

Sigmoid
ReLU
Leaky ReLU

Answer:
Sigmoid suffers from this for inputs of large magnitude (|x| large), where it saturates and its gradient is close to zero; ReLU has exactly zero gradient for x < 0. Leaky ReLU keeps a small non-zero slope for x < 0, so it largely avoids the problem.

Inline Question 2:
Did you notice anything about the comparative difficulty of training the three-layer net vs. training the five-layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale, and why do you think that is the case?

Although both the 3-layer and the 5-layer networks can reach 100% accuracy on the small training set, the 3-layer network generalizes better on the validation set. The 5-layer network is noticeably more sensitive to the weight initialization scale, for two main reasons: first, it is more complex and has far more parameters than the 3-layer network; second, its loss surface is also more complicated, so optimization is more likely to get stuck in poor local minima.

Inline Question 3: AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)
John notices that when he trains a network with AdaGrad, the updates become very small and the network learns slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

This is probably because cache accumulates the squared gradients at every step (especially quickly when dw is large early in training), so it grows monotonically and can become very large. The effective step size

\frac{\text{learning\_rate}}{\sqrt{\text{cache}} + \varepsilon}

therefore keeps shrinking, so the updates become tiny and convergence stalls. Adam does not suffer from this, because it replaces the ever-growing sum with an exponentially decaying average of the squared gradients, and it also applies bias correction to the first-moment estimate:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

This can be read as a momentum-style average of the gradient; when t is small, the correction scales \hat{m}_t up, so the early updates are not artificially small and the problem does not occur.
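To make the contrast concrete, here is a tiny toy comparison of the two denominators under a constant gradient (the numbers are arbitrary and for illustration only):

import numpy as np

dw = 1.0                      # pretend the gradient is constant
cache_ada, cache_rms = 0.0, 0.0
beta2 = 0.999

for t in range(1, 10001):
    cache_ada += dw ** 2                                   # AdaGrad: monotonically growing sum
    cache_rms = beta2 * cache_rms + (1 - beta2) * dw ** 2  # Adam/RMSProp: decaying average

# AdaGrad's effective step keeps shrinking; the decaying average levels off near dw**2.
print(1.0 / np.sqrt(cache_ada))   # ~0.01 after 10k steps, and still falling
print(1.0 / np.sqrt(cache_rms))   # ~1.0, roughly constant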

References

[1] Tijmen Tieleman and Geoffrey Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude", COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization", ICLR 2015.

[3] https://juejin.im/post/5ad41530f265da2386705937 Matplotlib Tips

[4] https://cs231n.github.io/neural-networks-3/#update Parameter updates
