Part 4: Deep Learning for Computer Vision: A Comprehensive Review
Author: 禅与计算机程序设计艺术
1. Introduction
Deep learning has garnered significant attention in the domains of computer vision and natural language processing due to its capability to extract intricate features from raw data, thereby enabling machine learning algorithms to execute high-level tasks such as object detection and image classification with remarkable success. This review article aims to explore the foundational concepts of deep learning models applied to computer vision, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. We will delve into the applications of deep learning models across various computer vision tasks, such as object recognition tasks, action recognition tasks, and scene understanding tasks. Finally, we will outline a comprehensive roadmap for future research and development in this domain, offering insights into current challenges, emerging trends, and future directions.
2. Terminology
Before engaging in discussions about the core elements of deep learning models in the domain of computer vision, it is advisable to establish a clear understanding of some fundamental terms and concepts. Below are detailed definitions and explanations of these key terminologies:
- Convolutional Neural Network (CNN): This is a deep learning model constructed using convolution operations. It employs filters or kernel matrices to extract relevant patterns from input images. Commonly, CNNs consist of multiple feature channels generated through various filters, which are merged across different levels to produce comprehensive image representations. The convolution operation enables the network to capture spatial relationships between adjacent pixels and aids in identifying features such as edges, corners, and textures (a minimal code sketch of this operation follows this list).
- Recurrent Neural Network (RNN): RNNs are designed for processing sequential data, as in natural language tasks. They consist of multiple hidden units that sequentially handle one element at a time. At each step, the output from the previous state and the input vector are fed into the next hidden unit, along with additional information provided by feedback connections.
- Transformer: The Transformer architecture, introduced in 2017, has achieved state-of-the-art performance in text translation and various NLP tasks. It utilizes self-attention mechanisms to capture long-range dependencies in text and produces accurate word representations without explicitly modeling their ordering. The self-attention mechanism enables the model to focus on relevant parts of the input sequence, allowing it to learn complex patterns and dependencies effectively.
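To make the convolution operation described above concrete, here is a minimal sketch, assuming PyTorch is available; the tensor shapes and filter counts are illustrative only, not tied to any model in this article.

```python
import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels (RGB), 16 learned filters,
# 3x3 kernels. Each filter slides over the image and responds to local
# patterns such as edges, corners, and textures.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 32, 32)   # a dummy 32x32 RGB image
feature_maps = conv(image)          # -> shape (1, 16, 32, 32): one map per filter
print(feature_maps.shape)
```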
 
3. Convolutional Neural Networks (CNNs)
3.1 Model Architectures
3.1.1 LeNet-5
The original LeNet-5 [1] model, initially proposed for recognizing handwritten digits, featured only a small number of hidden layers and required careful training to attain satisfactory accuracy. In comparison, modern architectures such as AlexNet, VGG, and ResNet have demonstrated superior performance and faster convergence. Even so, large CNN models can still overfit when trained on relatively small datasets such as MNIST. To address this challenge, researchers have investigated techniques such as improved weight initialization protocols, dropout regularization, and batch normalization. These innovations have collectively improved the accuracy of numerous models while mitigating overfitting and improving the efficiency and stability of training.
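As an illustration of these ideas, here is a minimal LeNet-5-style classifier for 28 × 28 grayscale digits with batch normalization and dropout added. This is a sketch assuming PyTorch; the layer sizes follow the LeNet-5 spirit but are not the exact original configuration.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """A small LeNet-5-style CNN with batch norm and dropout (illustrative only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.BatchNorm2d(6),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),             # -> 10x10
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Dropout(p=0.5),                           # dropout regularization
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetStyle()
logits = model(torch.randn(8, 1, 28, 28))                # batch of 8 digit images
```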
3.1.2 AlexNet
AlexNet [2] serves as one of the earliest notable CNN models designed primarily for the ImageNet classification task. It contains five convolutional layers followed by three fully connected layers. Its architecture diagram depicts the operational mechanism of the model.
A principal difference between AlexNet and earlier models was its much larger input resolution (AlexNet used 227 × 227 pixel inputs, compared with the 28 × 28 pixel inputs of earlier digit-recognition models such as LeNet-5). Furthermore, its ReLU activations helped mitigate the vanishing-gradient issue, while local response normalization improved generalization.
Another significant enhancement in AlexNet was the use of dropout, which mitigated overfitting and made the model less dependent on specific training examples. Batch Normalization [3] was introduced later and adopted in subsequent architectures to accelerate convergence and stabilize training.
AlexNet set the state of the art on ImageNet when it was introduced, but its results were soon surpassed by the deeper architectures described below.
3.1.3 VGG
VGG [4] is named after the Visual Geometry Group at the University of Oxford, which introduced it in 2014. One distinctive feature of VGG models is their simple, uniform architecture: they stack small 3 × 3 convolutional layers interleaved with max pooling layers, instead of the larger and more varied filter sizes used in earlier models, which makes the networks straightforward to implement and train.
VGG comes in several depths: VGG16 contains 16 weight layers (13 convolutional and 3 fully connected), while VGG19 contains 19. An illustration of VGG16 follows.
As the illustration shows, VGG16 comprises 13 convolutional layers grouped into five stages. Every convolution uses a 3 × 3 kernel with stride 1, and each stage ends with a 2 × 2 max pooling layer that halves the spatial resolution. The number of filters doubles from stage to stage (64, 128, 256, 512), each convolutional layer carries a bias term and is followed by a ReLU activation, and three fully connected layers complete the network.
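The pattern of stacked 3 × 3 convolutions followed by pooling can be expressed compactly. The following is a rough sketch of a VGG-style stage, assuming PyTorch; `vgg_stage` is a helper name of my own, not part of any library.

```python
import torch.nn as nn

def vgg_stage(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    """One VGG-style stage: num_convs 3x3 conv+ReLU layers, then 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve spatial size
    return nn.Sequential(*layers)

# VGG16 stacks five such stages with 2, 2, 3, 3, 3 convolutions (13 in total).
features = nn.Sequential(
    vgg_stage(3, 64, 2),
    vgg_stage(64, 128, 2),
    vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3),
    vgg_stage(512, 512, 3),
)
```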
3.1.4 GoogLeNet
GoogLeNet [5], also known as Inception v1, has shown strong results in image classification. It combines several innovative ideas, including parallel paths within the same network, 1 × 1 convolutions for dimensionality reduction, and auxiliary classifiers. The core idea of the Inception module is to apply several different convolutional filters to the same input in parallel and then merge their outputs. By stacking many such modules, GoogLeNet surpassed previous state-of-the-art methods on the ILSVRC challenge dataset.
Here is the architecture of GoogLeNet:
GoogLeNet is composed of four distinct module types: the standard convolution block, the Inception module, the reduction module, and global average pooling. The standard convolution block is a sequence of convolutional layers, each followed by a non-linearity such as ReLU. The Inception module applies several parallel branches (1 × 1, 3 × 3, and 5 × 5 convolutions plus a pooling branch) to the same input, and the concatenation of these branch outputs serves as the input to subsequent layers. The reduction modules decrease the spatial resolution and keep the channel count manageable as the network deepens. Finally, the global average pooling layer averages the feature maps over the spatial dimensions, yielding a fixed-length representation irrespective of the input size.
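A simplified Inception module can be sketched as follows, assuming PyTorch; the branch widths are illustrative and do not match the exact GoogLeNet configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, 5x5 and pooling branches whose outputs are concatenated."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),              # 1x1 reduction
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # All branches preserve spatial size, so outputs concatenate on channels.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionModule(64)(torch.randn(1, 64, 28, 28))   # -> (1, 128, 28, 28)
```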
3.1.5 ResNet
ResNet [6] is regarded as a milestone convolutional architecture for image classification. Residual blocks were proposed to address the vanishing-gradient problem that plagues very deep conventional CNNs. Instead of learning a full transformation, each block learns only a residual mapping of its input and adds it to the output of the previous layer to produce the new output. This eases optimization and makes it practical to train far deeper networks. With its carefully designed skip connections, ResNet can match or exceed the performance of plain networks of comparable depth.
Here is the basic architecture of ResNet:
As depicted in the figure, ResNet is constructed by stacking multiple residual blocks, with shortcut connections carrying identity mappings between them. The convolutional layers inside each block resemble those of conventional CNNs, with matching numbers of filters, strides, and padding, and the block's output is obtained by summing the input signal with the transformed output of its last convolutional layer. Each convolutional layer typically uses a rectified linear unit (ReLU) as its activation, while the final classification layer applies a softmax to produce class probabilities.
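The residual mapping described above can be written in a few lines. Here is a minimal sketch of a basic ResNet block, assuming PyTorch and ignoring the strided/projection-shortcut variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x): the block learns only the residual F(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)       # shortcut: add the identity mapping

block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))     # output has the same shape as the input
```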
3.2 Data Preprocessing
Data augmentation represents a crucial element in the development of effective deep learning models for computer vision tasks. It incorporates diverse preprocessing techniques such as rotation, scaling, flipping, cropping, and contrast adjustment to significantly expand the dataset's diversity. The primary objective of data augmentation is to enhance the model's generalization capability by augmenting the dataset with a variety of variations when trained on limited labeled data. Various data augmentation techniques include random shifting, rotating, flipping, zooming, shearing, adjusting brightness, and modifying contrast.
For instance, in the case of a binary classification task, we might perform random transformations on approximately half of the positive samples, leaving the remaining samples unchanged. Alternatively, if we are dealing with a multi-class classification task involving ten distinct classes, we might randomly perform transformations on any sample belonging to any of those classes. In the case of bounding box prediction problems, we might randomly perform perturbation on the existing bounding boxes to generate new ones.
Common practices of data augmentation involve employing predefined transformation parameters for each iteration, incorporating horizontal and vertical flips, and randomly selecting various transform combinations to increase the diversity of the dataset.
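A typical augmentation pipeline along these lines might look as follows. This is a sketch assuming torchvision is available; the specific parameter values are arbitrary choices for illustration.

```python
from torchvision import transforms

# Randomly flip, rotate, crop and jitter colors each time an image is loaded,
# so the model sees a slightly different variant on every epoch.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```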
Additionally, certain data augmentation techniques can also lead to slightly altered label annotations. For instance, in face detection tasks, we might introduce a small random shift to both the x and y coordinates of each bounding box representing a face annotation, resulting in minor adjustments to the actual face's position. Consequently, it is crucial to carefully assess the effectiveness of data augmentation techniques in relation to their impact on downstream tasks, particularly when dealing with highly imbalanced datasets or scenarios characterized by strong inter-annotator agreements.
4. Computer Vision Tasks and Applications
In this section, we will delve into methods for approaching common computer vision tasks and identify the appropriate deep learning models for their resolution. Specifically, we will explore the following topics:
- Object detection and classification
  - R-CNN, SSD, YOLO
  - Faster R-CNN, Mask R-CNN
- Action recognition
  - LSTM, GRU, ConvLSTM
  - TSN (Temporal Segment Networks)
  - I3D (Inflated 3D ConvNets)
- Scene understanding
  - Graph Convolutional Networks, Wavelet Neural Networks, U-Net
 
 
 
4.1 Object Detection and Classification
4.1.1 R-CNN, SSD, YOLO
Object detection represents a fundamental task in computer vision, aimed at identifying and categorizing objects within images. Conventional approaches primarily involve searching for target objects within extensive databases or manually establishing anchor points around them, processes that are laborious and often result in low recall rates. To address these limitations, numerous researchers have developed object detection methods grounded in deep learning frameworks. Among the prominent deep-learning-based variants are region-based convolutional neural networks (R-CNN), single-shot detectors (SSD), and the widely used YOLO [7]. In this section, we will delve into these three approaches.
R-CNN
R-CNN [8] marks the initial phase of the object detection revolution. This method employs a two-stage framework, utilizing the selective search algorithm to generate candidate regions, followed by a convolutional neural network for classification tasks. One of R-CNN's key contributions is its systematic approach to region proposal generation, which effectively enhances both the precision and recall rates. However, the method's limitation lies in its relatively slow inference speed, primarily due to its iterative computational process.
Here is the overall architecture of R-CNN:
SSD
SSD [9] addresses several drawbacks of R-CNN with a single convolutional network that generates predictions across multiple feature-map scales. While R-CNN classifies proposed regions one by one, SSD detects objects of various sizes in a single forward pass, greatly improving efficiency and scalability. SSD directly predicts bounding boxes, class labels, and confidence scores for each detected object, removing the need for a separate region-proposal stage; a lightweight non-maximum suppression step is still applied to merge overlapping detections.
Here is the overall architecture of SSD:
YOLO
YOLO [10] is a fast and robust object detector designed for real-time performance. It uses a single neural network that processes the entire image once, jointly predicting class probabilities and bounding boxes. The original YOLO divides the image into a grid of cells, and each cell regresses box coordinates and confidence scores relative to itself rather than relying on a separate proposal stage; later versions add predefined anchor boxes to stabilize these regressions. Framing detection as a single regression problem is central to YOLO's speed while still yielding accurate bounding boxes.
Here is the overall architecture of YOLO:
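Regardless of which detector is used, evaluation and post-processing usually revolve around the intersection-over-union (IoU) of boxes. The following plain-Python sketch shows IoU and a simple greedy non-maximum suppression; boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names are my own.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the best box, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```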
4.1.2 Faster R-CNN, Mask R-CNN
Both Faster R-CNN [11] and Mask R-CNN [12] represent significant advances in object detection. Faster R-CNN replaces the selective-search proposal stage of the original R-CNN with a learned Region Proposal Network (RPN) and shares the backbone feature maps between proposal generation and detection, which greatly improves speed. Mask R-CNN builds on Faster R-CNN by adding a mask head that predicts a segmentation mask for each detected object, which also helps in scenes where objects such as human bodies occlude one another, and by replacing RoI Pooling with the more precise RoI Align operation.
Here is the overall architecture of Faster RCNN:
While the above architecture demonstrates the effectiveness of the Faster RCNN detector system with shared feature channels, here is the analogous architecture of Mask RCNN.
The two architectures differ chiefly in the third branch of Mask R-CNN, the mask head, which estimates segmentation masks. The mask head consumes the shared feature map and predicts a foreground probability map for each proposal. During training, these predictions are supervised with a pixel-wise cross-entropy loss that compares the predicted probability maps against the ground-truth segmentation masks; at inference time, the predicted maps are simply thresholded to obtain the final masks.
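The per-pixel supervision of the mask head can be sketched as a binary cross-entropy between predicted and ground-truth masks. This assumes PyTorch, and the shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

# Predicted mask logits for one proposal and its ground-truth binary mask,
# both at the fixed mask-head resolution (e.g. 28x28).
pred_logits = torch.randn(1, 1, 28, 28)
gt_mask = (torch.rand(1, 1, 28, 28) > 0.5).float()

# Pixel-wise binary cross-entropy used to train the mask head.
mask_loss = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
```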
4.2 Action Recognition
4.2.1 LSTM, GRU, ConvLSTM
Action recognition represents a challenging task in computer vision, aimed at detecting and monitoring actions performed by humans in videos. Numerous researchers have sought to devise diverse models to address this challenge. Some of the prominent approaches include the Long Short-Term Memory (LSTM) model [13], the Gated Recurrent Unit (GRU) [14], and the Convolutional LSTM (ConvLSTM) [15]. Here, we will provide a brief overview of these approaches.
LSTM
LSTM [13] is an example of a recurrent neural network (RNN) that effectively captures temporal dependencies within sequential data. It is composed of memory cells designed to store information and a gate mechanism responsible for controlling the flow of information over time. LSTM networks are capable of capturing long-term dependencies and can process input sequences of varying lengths. Below, we present a detailed overview of the LSTM cell architecture.
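Here is a minimal usage sketch of an LSTM over a sequence of frame-level features, assuming PyTorch; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# A sequence of 16 frame-level feature vectors, each of dimension 512,
# for a batch of 4 videos.
features = torch.randn(4, 16, 512)

lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
outputs, (h_n, c_n) = lstm(features)   # outputs: (4, 16, 256), h_n: (1, 4, 256)

# The final hidden state can serve as a clip-level embedding for classification.
clip_embedding = h_n[-1]               # -> (4, 256)
```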
GRU
GRU [14] is a later development in the LSTM family that achieves comparable or better accuracy with fewer parameters. GRUs merge the LSTM's input and forget gates into a single update gate and combine the cell state and hidden state into one, so there is no separate long-term memory to track; this results in a simpler implementation and often faster convergence.
Here is the overall architecture of a GRU cell:
ConvLSTM
ConvLSTM [15] augments the LSTM with convolutional operations, enabling it to process video sequences. Each video frame is transformed into spatiotemporal feature maps that are fed into the LSTM cells, allowing the model to capture the dynamics of the sequence. By bringing frame-level convolutional processing into the standard LSTM architecture, the ConvLSTM module markedly improves the model's ability to handle video data.
Here is the overall architecture of a ConvLSTM cell:
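PyTorch does not ship a ConvLSTM layer, so the following is a rough sketch of a single ConvLSTM cell in which the LSTM's matrix multiplications are replaced by convolutions over feature maps; the class name and sizes are my own illustrative choices.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gate transformations are 2D convolutions over feature maps."""
    def __init__(self, in_ch: int, hidden_ch: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=padding)
        self.hidden_ch = hidden_ch

    def forward(self, x, h, c):
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_next = f * c + i * g             # update the spatial cell state
        h_next = o * torch.tanh(c_next)    # new hidden feature map
        return h_next, c_next

cell = ConvLSTMCell(in_ch=3, hidden_ch=16)
h = torch.zeros(1, 16, 64, 64); c = torch.zeros(1, 16, 64, 64)
for frame in torch.randn(8, 1, 3, 64, 64):   # 8 video frames, processed in order
    h, c = cell(frame, h, c)
```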
4.2.2 TSN, Temporal Segment Network
Time-series analysis (TSA) studies how data evolve over time, and it plays a vital role in fields such as biology, finance, medicine, and health science. In action recognition, a typical task is to recognize actions that occur within a continuous motion sequence. To capture such patterns, Temporal Segment Networks (TSNs) [16] were introduced.
TSN models are designed to learn discriminative features from multiple segments of an input sequence to obtain robust temporal embeddings. Each segment is represented by a concise sequence of learned feature representations, which capture the inherent motion patterns. The features are then aggregated into a video-level embedding vector by taking the average of all segments' features.
Here is the overall architecture of TSN:
In summary, TSN models are used to fuse several video segments to extract distinctive features for motion recognition tasks.
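The segment-level aggregation can be sketched as follows, assuming PyTorch. Here `backbone` is a placeholder for any 2D CNN feature extractor; both it and the number of segments are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholder backbone: any 2D CNN mapping a frame to a feature vector.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                         nn.AdaptiveAvgPool2d(1),
                         nn.Flatten())

# Sample one frame from each of K segments of the video (here K = 3).
segments = [torch.randn(1, 3, 224, 224) for _ in range(3)]

# Extract per-segment features and average them into a video-level embedding.
segment_features = torch.stack([backbone(seg) for seg in segments], dim=0)
video_embedding = segment_features.mean(dim=0)    # consensus by averaging
```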
4.2.3 I3D, Inflated 3D ConvNet
Action recognition in videos primarily depends on spatial and temporal features. When dealing with complex motion scenes featuring multiple moving objects, the difficulty increases significantly. To address this limitation, the Inflated 3D ConvNet (I3D) method [17] was developed.
I3D models exploit the spatiotemporal context of a video clip by processing stacks of consecutive frames rather than individual images. The model inflates pretrained 2D convolutional networks into 3D ones, so that each clip is encoded into spatiotemporal features capturing both appearance and motion cues. These features are then aggregated into a video-level embedding, typically through averaging or max pooling over time.
Here is the overall architecture of I3D:
Like TSN, I3D extracts both spatial and temporal cues for action recognition; however, whereas TSN samples sparse individual frames from each segment of the video, I3D processes dense stacks of consecutive frames with 3D convolutions.
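The core "inflation" idea, copying a pretrained 2D filter along the time axis to obtain a 3D filter, can be sketched like this, assuming PyTorch; `inflate_conv2d` is an illustrative helper of my own, not a library function.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a 2D convolution into a 3D one by repeating its kernel over time."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       padding=(time_dim // 2, *conv2d.padding))
    with torch.no_grad():
        # Repeat the 2D weights along the new temporal dimension and rescale
        # so activations keep roughly the same magnitude.
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        conv3d.weight.copy_(weight3d / time_dim)
        conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d)
clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
out = conv3d(clip)
```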
4.3 Scene Understanding
4.3.1 Graph Convolutional Net, Wavelet Neural Net, U-Net
Scene understanding is the process of extracting meaningful and informative visual features from a given input image. Graph Convolutional Networks (GCNs) [18] and Wavelet Neural Networks (WNNs) [19] are two promising approaches to semantic segmentation of RGB images. U-Net [20] is a convolutional encoder-decoder architecture that delivers competitive accuracy with relatively few parameters.
Graph Convolutional Net
Graph Convolutional Networks (GCNs) [18] utilize pixel arrangements to calculate feature descriptors that capture detailed geometric information of an image. The concept of GCN originates from the observation that vertices in a graph stand for visual entities such as pixels, and edges represent the interactions between these vertices. GCNs acquire weights that indicate the significance of each edge or node based on their closeness in the graph, allowing them to extract contextual and structural relationships between pixels.
Here is the overall architecture of a simple GCN:
The graph convolutional network produces a descriptor for each pixel by computing a weighted sum over the features of its neighbors. This aggregation is defined through a dot-product operator whose exact form depends on whether the model works in the spatial or the spectral domain: in the spatial domain, each pixel is connected to its k nearest neighbors, whereas in the spectral domain each pixel is associated with a set of k eigenvalues and eigenvectors of the covariance matrix.
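A minimal spatial graph-convolution step, aggregating neighbor features with a normalized adjacency matrix, might look like this. It assumes PyTorch, and the toy 4-node graph and class name are illustrative only.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj_norm, node_features):
        # Weighted sum over each node's neighbors, then a learned projection.
        return torch.relu(self.linear(adj_norm @ node_features))

# Toy graph: 4 nodes (e.g. pixels or superpixels) with 8-dimensional features.
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
deg = adj.sum(dim=1, keepdim=True)
adj_norm = adj / deg                      # simple row normalization
features = torch.randn(4, 8)
out = SimpleGraphConv(8, 16)(adj_norm, features)   # -> (4, 16)
```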
Wavelet Neural Net
Wavelet Neural Networks (WNNs) [19] extend traditional Graph Convolutional Networks (GCNs) by incorporating wavelets to model image patches. Wavelets offer a hierarchical decomposition of images, allowing them to effectively capture both low- and high-frequency components within the image. The wavelet coefficients derived from these decompositions are then processed by a shallow neural network, whose outputs serve as the foundation for extracting semantic information from images.
Here is the overall architecture of a simple WNN:
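A rough sketch of this pipeline decomposes an image with a 2D wavelet transform and feeds the coefficients to a shallow network. This assumes PyWavelets and PyTorch are available; the Haar wavelet, two decomposition levels, and network size are arbitrary choices for illustration.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

image = np.random.rand(64, 64).astype(np.float32)    # stand-in for a grayscale image

# Two-level 2D wavelet decomposition: a low-frequency approximation plus
# horizontal / vertical / diagonal detail coefficients at each level.
coeffs = pywt.wavedec2(image, wavelet="haar", level=2)
coeff_array, _ = pywt.coeffs_to_array(coeffs)         # pack coefficients into one array
flat = torch.from_numpy(coeff_array).float().flatten()

# A shallow network maps the wavelet coefficients to a semantic prediction.
mlp = nn.Sequential(nn.Linear(flat.numel(), 128), nn.ReLU(), nn.Linear(128, 10))
logits = mlp(flat.unsqueeze(0))
```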
U-Net
U-Net [20] uses a contracting path to downsample the image and an expanding path to upsample it back to full resolution. Its encoder and decoder units rely on 3 × 3 convolutions and on skip connections that link corresponding stages. This design lets U-Net localize and segment objects in an image precisely.
Here is the overall architecture of a simple U-Net:
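Below is a heavily reduced U-Net sketch with a single downsampling and upsampling stage that keeps the characteristic skip connection. It assumes PyTorch; a real U-Net uses four stages and many more channels, and the class name is my own.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with a single skip connection (illustrative only)."""
    def __init__(self, in_ch: int = 3, num_classes: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                                   # contracting path
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)  # expanding path
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)          # per-pixel classes

    def forward(self, x):
        skip = self.enc(x)
        x = self.bottleneck(self.down(skip))
        x = self.up(x)
        x = self.dec(torch.cat([x, skip], dim=1))   # skip connection from the encoder
        return self.head(x)

masks = TinyUNet()(torch.randn(1, 3, 64, 64))        # -> (1, 2, 64, 64)
```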
5. Conclusion
To summarize, this article examined five prominent CNN architectures for computer vision, namely LeNet-5, AlexNet, VGG, GoogLeNet, and ResNet. It compared the distinctions between CNNs and RNNs, explained what constitutes an object detection model, and highlighted recent advances in object detection. It then covered several models for action recognition, including Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), Convolutional LSTMs, Temporal Segment Networks (TSNs), and Inflated 3D ConvNets (I3D), and finally introduced semantic segmentation models based on Graph Convolutional Networks (GCNs), Wavelet Neural Networks (WNNs), and U-Net.
