
Deep Cascaded Bi-Network for Face Hallucination: Reading Notes


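We present a novel framework for hallucinating faces in unconstrained poses and at very low resolution (face size as small as 5pxIOD).

The paper proposes a face hallucination approach for unconstrained poses and extremely low resolution. "Face size" is defined as the inter-ocular distance in pixels (pxIOD), and it can be as small as 5 pixels.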

We alternately optimize two complementary tasks, namely face hallucination and dense correspondence field estimation, within a single unified framework.

The two tasks that are optimized in turn are ordinary face super-resolution (i.e. face hallucination) and dense correspondence field estimation. Each task supports the other, which improves the quality of the reconstructed image.

Furthermore, we introduce a novel gated deep bi-network comprising two functionality-specialized branches that recover different levels of texture detail.

In addition, the paper proposes a gated bi-network with two functionality-specialized branches, used to recover different levels of texture detail.

In this study, we extend the notion of prior to a pixel-wise dense face correspondence field.

Here "prior" refers to two kinds of prior information: face structure and facial spatial configuration. The paper extends this notion to a pixel-wise dense face correspondence field.

Here, the dense correspondence field is needed to describe the spatial configuration because of two properties: it is pixel-wise (not based on facial landmarks) and it is a correspondence (not a face parsing).

The dense correspondence field describes the spatial configuration at the pixel level together with point-to-point correspondence.


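Fig. 1. (a) The original high-resolution image. (b) The low-resolution input with a size of 5pxIOD. (c) The result of bicubic interpolation. (d) An overview of the proposed face hallucination framework. The solid arrows indicate the hallucination step guided by the dense correspondence field. The dashed arrows indicate the step that estimates the dense correspondence field.

The figure shows the original high-resolution image (a), the low-resolution image (b) with an inter-ocular distance of 5 pixels, the high-resolution image produced by bicubic interpolation (c), and the framework proposed in the paper (d). The solid arrows hallucinate the face using spatial cues, i.e. the dense correspondence field; the dashed arrows re-estimate the dense correspondence field.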

Consequently, we face a chicken-and-egg problem: face hallucination is better guided by the face spatial configuration, whereas estimating that configuration in turn requires a high-resolution face. This issue has mostly been ignored or bypassed in previous work.

This is a chicken-and-egg problem, and earlier work has largely ignored it.

The two tasks at hand, high-level face correspondence estimation and low-level face hallucination, are complementary and can be optimized alternately with each guiding the other.

The framework therefore involves two key tasks: high-level face correspondence estimation and low-level face hallucination.

Within a cascaded iteration framework, the dense correspondence field becomes progressively finer as the face resolution increases. Conversely, the image resolution is adaptively upscaled under the guidance of the finer dense correspondence field.

When the two tasks are executed alternately, the dense correspondence field is refined step by step as the face resolution increases, and the refined field in turn drives the adaptive increase of the image resolution.

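To better recover different levels of texture detail on faces, we propose a new gated deep bi-network architecture for the face hallucination step in each cascade.

To recover different levels of texture detail more effectively, the paper introduces a gated bi-network architecture for the face hallucination step.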

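Unlike previous work, the proposed network consists of two functionality-specialized branches that are trained end-to-end. The first branch, referred to as the common branch, conservatively recovers texture details that are detectable from the low-resolution input alone. The second branch, referred to as the high-frequency branch, super-resolves faces with the additional high-frequency prior warped by the face correspondence field estimated in the current cascade. Thanks to the guidance of this prior, the high-frequency branch can recover and synthesize texture details that are not revealed in the overly low-resolution input image.

This describes the network structure built in the paper: the common branch reconstructs texture details from the low-resolution image alone, while the high-frequency branch additionally uses the spatial prior warped by the estimated face correspondence field. With this spatial prior, the high-frequency branch can recover and supplement texture details that are hidden in the low-resolution image.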

A pixel-level gated network is trained to combine the outputs of the two branches.

A pixel-wise gate network fuses the outputs of the two branches.

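Although the high-frequency branch synthesizes the facial parts that are occluded (the eyes behind the sunglasses), the gate network automatically favours the results from the common branch during fusion.

Fig. 2. Examples visualizing the effect of the proposed gated deep bi-network. (a) Bicubic interpolation of the input. (b) Results with only the common branch enabled. (c) Results with only the high-frequency branch enabled. (d) Results with both branches enabled. (e) The original high-resolution image. Best viewed by zooming in on the electronic version.

(a) The result obtained with bicubic interpolation.
(b) The output of the common branch.
(c) The output of the high-frequency branch.
(d) The result of fusing the two branches.
Note that the high-frequency branch synthesizes the eyes that are occluded by the sunglasses.
It is worth noting that during fusion the gate network favours the common branch result in that region.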

Let I be a face image represented as a matrix. We use x ∈ ℝ² to denote the (x, y) coordinates of a pixel in image I.

The dense face correspondence field defines a pixel-wise correspondence between M ⊂ ℝ², the 2D face region in the mean face template, and the face region in image I. The dense correspondence field is represented by a warping function [38], x = W(z): M → ℝ², which maps each coordinate z ∈ M to its corresponding position x in image I. See Fig. 3(a,b) for an illustration.

Fig. 3. (a,b) The mean face template M and a face image I, where the grid denotes the dense correspondence field W(z). The warping function maps each point z in the template to a point x in the image. (c,d) The high-frequency prior before and after warping, for the sample image in (b). Both priors have C channels, and each channel contains only one "contour line". For visualization, the channel dimension is reduced to a single channel with a max operation. All indices k are omitted for clarity. Best viewed in the electronic version.

The dense face correspondence field discussed here is the pixel-level correspondence from the mean face template to the target image I: x = W(z).

Following [39], the warping residual W(z) − z is modeled as a linear combination of dense facial deformation bases:

$$W(z) = z + B(z)\,p,$$

where $p = [p_1, \dots, p_N]^\top \in \mathbb{R}^{N}$ denotes the deformation coefficients and $B(z) = [b_1(z), \dots, b_N(z)] \in \mathbb{R}^{2 \times N}$ denotes the deformation bases.

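N is the number of deformation bases. The bases are chosen in the AAMs manner [40]: 4 of the N bases are given by the similarity transform and the remaining ones handle non-rigid deformations. Note that the bases are pre-defined in the training stage and shared by all samples. Hence the dense field is actually controlled by the deformation coefficients p of each sample. When p = 0, the dense field equals the mean face template.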

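This part explains how W(z) is obtained, with p as the key parameter. The deformation basis matrix B, built in the AAMs manner, is computed in advance and shared by all images, so the dense field is determined entirely by the deformation coefficients p. In particular, p = 0 corresponds to the mean face template.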
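The relation between the bases, the coefficients and the field can be illustrated with a small NumPy sketch. It assumes the linear form x = W(z) = z + B(z)p implied by the description above; the array layout, the function name `dense_field` and the random toy bases are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dense_field(template_coords, bases, p):
    """Return x = W(z) = z + B(z) p for every template pixel z.

    template_coords: (M, 2) array of template coordinates z.
    bases:           (M, 2, N) array, one 2xN basis matrix B(z) per pixel.
    p:               (N,) deformation coefficients.
    """
    return template_coords + bases @ p

# Toy data (hypothetical): 4 template pixels and N = 3 bases.
rng = np.random.default_rng(0)
z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = rng.normal(size=(4, 2, 3))

# With p = 0 the field reduces to the mean template itself.
x_mean = dense_field(z, B, np.zeros(3))
# A non-zero p displaces every template pixel along the bases.
x_warp = dense_field(z, B, np.array([0.5, -0.2, 0.1]))
```

Because the bases are shared by all samples, only the N-vector p has to be estimated per image.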

Our framework consists of K cascaded iterations, as illustrated in Fig. 1(d). Each iteration updates the prediction via

$$p_k = p_{k-1} + f_k(I_{k-1};\, p_{k-1}), \qquad W_k(z) = z + B_k(z)\,p_k, \tag{2}$$

$$I_k = \uparrow I_{k-1} + g_k(\uparrow I_{k-1};\, W_k(z)), \quad \forall z \in M_k, \tag{3}$$

where k ranges from 1 to K. Equation (2) is the dense field update step, and Equation (3) is the spatially guided face hallucination step in each cascade. '↑' denotes the upscaling operation (2× upscaling with bicubic interpolation in our implementation).

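This passage specifies how the deformation coefficients, the dense face correspondence field, and the upsampled high-resolution image I are updated at the k-th iteration.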

The framework starts from I₀ and p₀. Here, I₀ is the input low-resolution face image, and p₀ is a zero vector, i.e. the deformation coefficients of the mean face template. The final hallucinated face image is I_K.

In other words, the whole pipeline starts from the initial state I₀ and p₀ = 0, where I₀ is the low-resolution input face image and p₀ is a zero vector. The final output is I_K, the image after the K-th iteration.

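Our model is composed of the functions f_k (dense field estimation) and g_k (face hallucination with spatial cues). The deformation bases B_k are pre-defined for each cascade and kept fixed throughout training and testing.

The function f performs dense correspondence field estimation, the function g performs face hallucination guided by spatial cues, and the bases B are set in advance for every cascade and stay unchanged during training and testing.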
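The alternation between the two steps can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `upscale2x` uses nearest-neighbour repetition as a stand-in for bicubic interpolation, the per-stage functions are dummy callables, and the coefficients p are passed to the hallucination step in place of the full field W_k(z). All names are hypothetical.

```python
import numpy as np

def upscale2x(img):
    # Stand-in for the 2x bicubic upscaling: nearest-neighbour repetition.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def run_cascade(I0, n_coeffs, f_steps, g_steps):
    """Alternate dense-field updates and hallucination steps.

    f_steps[k](I, p)    returns an update to the coefficients p.
    g_steps[k](I_up, p) returns an image residual at the upscaled size.
    """
    I, p = I0, np.zeros(n_coeffs)      # p_0 = 0 is the mean face template
    for f_k, g_k in zip(f_steps, g_steps):
        p = p + f_k(I, p)              # dense field update step
        I_up = upscale2x(I)
        I = I_up + g_k(I_up, p)        # spatially guided hallucination step
    return I, p

# Dummy steps (hypothetical) that only exercise the control flow.
f_dummy = lambda I, p: np.full_like(p, 0.1)
g_dummy = lambda I_up, p: np.zeros_like(I_up)

I_K, p_K = run_cascade(np.ones((8, 8)), n_coeffs=5,
                       f_steps=[f_dummy] * 3, g_steps=[g_dummy] * 3)
```

With three stages of 2× upscaling, an 8×8 input ends up at 64×64, which shows how the resolution grows with the number of cascades.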

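We propose a gated bi-network for the face hallucination step in each cascade. For the k-th cascade, it takes the upscaled input image ↑I_{k−1} and the current estimated dense correspondence field W_k(z), and predicts the image residual G = I_k − ↑I_{k−1}.

For each cascade, at the k-th iteration, the network directly predicts the image residual from the image I_{k−1} and the currently estimated dense correspondence field W_k(z).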

We combine the two branches with a gate network. Specifically, denoting the outputs of the common branch (A) and the high-frequency branch (B) as $G_A$ and $G_B$ respectively, we combine them as

$$G = (1 - G_\lambda) \otimes G_A + G_\lambda \otimes G_B,$$

where $G$ is the predicted image residual $I_k - \uparrow I_{k-1}$ (i.e., the difference between the stage-k image and the upscaled image from the previous stage), $G_\lambda$ is a pixel-wise soft gate map that controls the blending of $G_A$ and $G_B$, and $\otimes$ denotes element-wise multiplication.

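The output of the common branch is denoted G_A and the output of the high-frequency branch is denoted G_B. The gate network combines the two into G, which is also the residual between I_k and the upscaled I_{k−1}. G_λ is the pixel-wise soft gate map that controls how much of G_A and G_B goes into the result.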
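The pixel-wise blending can be sketched in a few lines of NumPy. The sketch assumes a convex combination in which a gate value of 0 selects the common branch and 1 selects the high-frequency branch; that orientation and the name `fuse` are assumptions made for illustration.

```python
import numpy as np

def fuse(G_A, G_B, G_lambda):
    """Blend the two branch outputs with a per-pixel soft gate in [0, 1]."""
    # Assumed convention: gate 0 keeps G_A, gate 1 keeps G_B.
    return (1.0 - G_lambda) * G_A + G_lambda * G_B

G_A = np.full((2, 2), 1.0)                        # common-branch residual
G_B = np.full((2, 2), 3.0)                        # high-frequency residual
G_lambda = np.array([[0.0, 1.0], [0.5, 0.25]])    # soft gate map
G = fuse(G_A, G_B, G_lambda)
```

In the sunglasses example, the gate would take values close to the common-branch side over the occluded eye region, so the synthesized eyes from the high-frequency branch are suppressed there.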

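Fig. 4 gives an overview of the gated bi-network architecture. Three convolutional sub-networks predict G_A, G_B and G_λ, respectively. The common branch sub-network (blue in Fig. 4) takes only the interpolated low-resolution image ↑I_{k−1} to predict G_A, while the high-frequency branch sub-network (red in Fig. 4) takes both ↑I_{k−1} and the warped high-frequency prior E^{W}_k (warped according to the estimated dense correspondence field). All the inputs, together with G_A and G_B, are fed into the gate sub-network (green in Fig. 4) to predict G_λ and obtain the final high-resolution output G.

Fig. 4. Architecture of the proposed deep bi-network (for the k-th cascade). It consists of three components: a common branch, a high-frequency branch, and a gate network.

The figure shows the inputs and outputs of the three sub-networks that produce G_A, G_B and G_λ.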

//2017/5/2

On the face hallucination side

We define the high-frequency prior as an indication of the locations that carry high-frequency details.

That is, the high-frequency prior is defined as a marker of where the high-frequency details are.

In this work, we generate high-frequency prior maps to enforce spatial guidance for hallucination.

The purpose of generating the high-frequency prior maps is to provide spatial guidance for hallucination.

The prior maps are obtained from the mean face template domain.

The prior maps are obtained in the mean face template domain.

For each training image, we compute the residual map between the original image Î and its bicubic interpolation, and then warp this residual map into the mean face template domain.

For each training image, the residual map between the original image and its bicubically interpolated version is computed and then warped into the shape of the mean face template.

We average the magnitude of the warped residual maps over all training images to form a preliminary high-frequency map.

Over the whole training set, the magnitudes of the warped residual maps are averaged, and the result is used as the preliminary high-frequency map.

To suppress noise and provide a semantically meaningful prior, we cluster the preliminary high-frequency map into C continuous contours (C = 10 in our implementation). This gives a C-channel map in which each channel carries one contour. We refer to this map as our high-frequency prior and denote it as $E_k(z): M_k \rightarrow \mathbb{R}^C$. An illustration of the prior is shown in Fig. 3(c).

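To suppress noise and keep the prior semantically meaningful, the preliminary high-frequency map is clustered into a C-channel map in which each channel corresponds to one contour region. This C-channel map is the high-frequency prior and is denoted E_k(z).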
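The construction of the prior can be sketched as follows. The notes do not say which clustering method produces the C contours, so this sketch substitutes simple quantile banding of the averaged magnitude map; the function name `build_prior` and the random residuals are hypothetical, and the residual maps are assumed to be already warped into the template domain.

```python
import numpy as np

def build_prior(warped_residuals, C):
    """Build a C-channel prior from warped residual maps.

    warped_residuals: (num_images, H, W), already in the template domain.
    Returns a (C, H, W) binary map with one magnitude band per channel.
    """
    # Preliminary high-frequency map: mean magnitude over all images.
    mean_mag = np.abs(warped_residuals).mean(axis=0)
    # Quantile edges split the pixels into C bands
    # (a stand-in for the contour clustering step).
    edges = np.quantile(mean_mag, np.linspace(0.0, 1.0, C + 1))
    band = np.searchsorted(edges, mean_mag, side="right") - 1
    band = np.clip(band, 0, C - 1)
    return np.stack([(band == c).astype(np.float64) for c in range(C)])

rng = np.random.default_rng(0)
residuals = rng.normal(size=(20, 16, 16))   # toy warped residual maps
E = build_prior(residuals, C=10)
```

Each pixel falls into exactly one channel here, which mirrors the idea that every channel carries a single contour.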

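To train the common branch, we use the following loss over all training samples:

$$Loss_A = \sum_i \big\| \hat{I}^{(i)}_k - \uparrow I^{(i)}_{k-1} - G^{(i)}_A \big\|_F^2,$$

where $\hat{I}_k$ denotes the ground-truth image at the resolution of stage k and i indexes the training samples.

The formula above is the loss function of the common branch.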

The high-frequency branch takes two inputs: $\uparrow I_{k-1}$ and the warped high-frequency prior $E^{W}_k$ (see Fig. 3(d) for an illustration). The two inputs are concatenated along the channel dimension to form a (1 + C)-channel input. We use the following loss over all training samples:

$$Loss_B = \sum_i \sum_{c=1}^{C} \big\| E^{W,(i)}_{k,c} \otimes \big( \hat{I}^{(i)}_k - \uparrow I^{(i)}_{k-1} - G^{(i)}_B \big) \big\|_F^2,$$

where $E^{W}_{k,c}$ denotes the c-th channel of the warped high-frequency prior maps and $\hat{I}_k$ the ground-truth image at stage k. Compared with the common branch, we additionally take the prior as input and penalize errors only in the high-frequency regions.

This formula is the loss function of the high-frequency branch. The penalty is not applied everywhere: each channel only penalizes the region it marks.

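Learning to predict the gate map $G_\lambda$ is supervised by the final loss

$$Loss = \sum_i \big\| \hat{I}^{(i)}_k - \uparrow I^{(i)}_{k-1} - G^{(i)} \big\|_F^2,$$

where $\hat{I}_k$ denotes the ground-truth image at stage k and $G$ is the fused residual.

The formula above is the loss for the gate map, i.e. the final loss.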
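The difference between the three losses can be made concrete with a single-sample NumPy sketch. It assumes squared-error losses on the residual target (ground truth minus the upscaled input) and a high-frequency loss that weights the error by each warped prior channel; the exact norm and the function names are assumptions, and the sum over training samples is omitted.

```python
import numpy as np

def loss_common(target, G_A):
    # Error is penalized at every pixel.
    return np.sum((target - G_A) ** 2)

def loss_high_freq(target, G_B, E_warped):
    # E_warped: (C, H, W). The error is masked by each prior channel,
    # so only pixels marked as high-frequency contribute.
    return np.sum((E_warped * (target - G_B)[None]) ** 2)

def loss_final(target, G):
    # Supervises the gate map through the fused residual G.
    return np.sum((target - G) ** 2)

target = np.ones((2, 2))              # ground truth minus upscaled input
zero_pred = np.zeros((2, 2))          # a prediction that outputs nothing
E_warped = np.zeros((2, 2, 2))        # C = 2 prior channels
E_warped[0, 0, 0] = 1.0               # a single high-frequency pixel

l_a = loss_common(target, zero_pred)
l_b = loss_high_freq(target, zero_pred, E_warped)
l_f = loss_final(target, zero_pred)
```

For the same error map, the common branch is charged for all four pixels while the high-frequency branch is charged only for the single pixel marked by the prior, which is why it is free to synthesize details elsewhere.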

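Specifically, we simultaneously obtain two sets of N deformation bases: $B_k(z) \in \mathbb{R}^{2 \times N}$ for the dense field and $S_k(l) \in \mathbb{R}^{2 \times N}$ for the landmarks, where l is the landmark index. The two sets of bases are one-to-one related, i.e. they share the same deformation coefficients $p_k \in \mathbb{R}^{N}$:

$$W_k(z) = z + B_k(z)\,p_k, \qquad x_k(l) = \bar{x}_k(l) + S_k(l)\,p_k,$$

where $x_k(l) \in \mathbb{R}^2$ denotes the coordinates of the l-th landmark, and $\bar{x}_k(l)$ denotes its mean location.

Two sets of deformation bases are obtained at the same time, and both are driven by the same deformation coefficients p.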

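To estimate the deformation coefficients $p_k$ at each cascade k, we use a cascaded regression approach [23]. In each iteration k we learn a Gauss-Newton steepest descent regression matrix $R_k$ that maps the observed appearance to an update of the deformation coefficients:

$$p_k = p_{k-1} + R_k \Big( \phi\big(I_{k-1};\, x_{k-1}(l)\big|_{l=1,\dots,L}\big) - \bar{\phi} \Big),$$

where φ is the shape-indexed feature [27,2] that concatenates the local appearance around all L landmarks, and φ̄ is its average over all training samples.

To predict the deformation coefficients p_k in each cascade k, the paper uses an efficient cascaded regression method. In each iteration it learns a Gauss-Newton steepest descent regression matrix R_k that maps the observed appearance to an update of the deformation coefficients. Here φ is the shape-indexed feature that concatenates the local appearance around all L facial landmarks, and φ̄ is the mean feature over all training samples.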
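A single regression step can be sketched as a matrix-vector update. The sketch assumes the linear form p_k = p_{k-1} + R_k(φ − φ̄) suggested by the description; the feature extraction itself is not shown, and the toy matrix, feature dimension and function name are hypothetical.

```python
import numpy as np

def update_coeffs(p_prev, R_k, phi, phi_mean):
    """One cascaded-regression step: p_k = p_{k-1} + R_k (phi - phi_mean)."""
    return p_prev + R_k @ (phi - phi_mean)

N, D = 3, 4                                        # coefficients, feature size
R_k = np.arange(N * D, dtype=float).reshape(N, D)  # toy regression matrix
phi_mean = np.ones(D)                              # mean training feature

# A feature equal to the training mean produces no update.
p_same = update_coeffs(np.zeros(N), R_k, phi_mean, phi_mean)

# A unit deviation in the first feature dimension moves p along
# the first column of R_k.
phi = phi_mean + np.array([1.0, 0.0, 0.0, 0.0])
p_moved = update_coeffs(np.zeros(N), R_k, phi, phi_mean)
```

Since R_k is learned offline for each cascade, the per-image cost at test time is one feature extraction and one matrix-vector product per stage.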

To learn the Gauss-Newton steepest descent regression matrix R_k, we follow [23]: we first learn the Jacobian J_k and then obtain R_k by constructing the project-out Hessian. We refer readers to [23] for more details.

This part describes how the Gauss-Newton steepest descent regression matrix is obtained.
