[Paper] Semi-Supervised Classification with Graph Convolutional Networks
- 1 Introduction
- 2 Fast Approximate Convolutions on Graphs
    - 2.1 Spectral Graph Convolutions
    - 2.2 Layer-Wise Linear Model
- 3 Semi-Supervised Node Classification
    - 3.1 Example
We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.
We present
- a scalable approach
- for semi-supervised learning
- on graph-structured data
- based on an efficient variant of convolutional neural networks
- which operate directly on graphs.
Questions:
1. How does the paper achieve scalability?
2. How does it achieve efficiency?
3. Why emphasize "directly on graphs"? Is there such a thing as operating on a graph indirectly?
We motivate the choice of our convolutional architecture
- via a localized first-order approximation
- of spectral graph convolutions.
Questions:
4. What does "localized first-order approximation" mean?
Our model
- scales linearly in the number of graph edges
- and learns hidden layer representations that encode
- both local graph structure
- and features of nodes.
Questions:
5. What does "scales linearly in the number of graph edges" mean?
6. How does the model encode both local graph structure and node features?
In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.
1 Introduction
We consider the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes. This problem can be framed as graph-based semi-supervised learning, where label information is smoothed over the graph via some form of explicit graph-based regularization (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2006; Weston et al., 2012), e.g. by using a graph Laplacian regularization term in the loss function:
Consider node classification on a graph where labels are known only for a subset of nodes, i.e., graph-based semi-supervised learning. The loss function typically adds a regularization term that smooths label information over the graph, e.g., a graph Laplacian regularization term:
L = L_0 + \lambda L_{reg}, \quad with \quad L_{reg} = \sum_{i,j} A_{ij} \| f(X_i) - f(X_j) \|^2 = f(X)^\top \Delta f(X)  (Eq. 1)

Here L_0 denotes the supervised loss on the labeled part of the graph, \lambda a weighting factor, X the matrix of node feature vectors, A the adjacency matrix, and \Delta = D - A the unnormalized graph Laplacian.
This term minimizes the distance between the predicted labels of connected nodes, based on the assumption that adjacent nodes are likely to have similar labels.
(On the Laplacian operator for images: <>)
(On the graph Laplacian, a fairly accessible introduction: https://zhuanlan.zhihu.com/p/50742283)
A thought: why add this regularization term at all? Because this is a semi-supervised problem: we only have labels for some of the nodes, yet need to make predictions for all of them. With L_0 alone as the objective, we could only fit the labeled nodes; the unlabeled nodes would not participate in the computation at all, so there would be no gradient information with which to learn predictions for them. Once the regularizer is added, the predicted values of unlabeled nodes also enter the objective (through a meaningful prior: adjacent nodes tend to have similar labels). This paper proposes a different approach that can predict the unlabeled nodes without this regularization term.
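To make the regularizer concrete, here is a minimal NumPy sketch (my own toy example, not from the paper) that evaluates both forms of L_reg on a 4-node graph and checks that the pairwise sum matches the quadratic form up to the factor of 2 that comes from counting each edge twice in the double sum:

```python
import numpy as np

# Toy 4-node undirected graph with edges (0,1), (0,2), (1,3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
D = np.diag(A.sum(axis=1))           # degree matrix
Delta = D - A                        # unnormalized graph Laplacian

f = np.array([0.9, 0.8, 0.1, 0.3])   # hypothetical per-node predictions

# Pairwise form: penalizes differing predictions on connected nodes.
pairwise = sum(A[i, j] * (f[i] - f[j]) ** 2
               for i in range(4) for j in range(4))

# Quadratic form f^T Delta f: equals the pairwise sum up to a factor
# of 2, since the double sum visits every edge in both directions.
quadratic = f @ Delta @ f
assert np.isclose(pairwise, 2 * quadratic)
```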
Questions:
7. What method does the paper use so that, with labels for only a subset of nodes, it can make predictions for all nodes?
In this work, we encode the graph structure directly using a neural network model f(X,A) and train on a supervised target L_0 for all nodes with labels, thereby avoiding explicit graph-based regularization in the loss function. Conditioning f(·) on the adjacency matrix of the graph will allow the model to distribute gradient information from the supervised loss L_0 and will enable it to learn representations of nodes both with and without labels.
The paper instead classifies nodes directly with a neural network f(X, A), building the objective L_0 from the labeled nodes only, so the Laplacian regularization term is no longer needed. The trained model can output predicted labels not only for the labeled nodes but also for the unlabeled ones (so it is still a semi-supervised model), because during computation the adjacency matrix A distributes gradient information across the whole graph, including the unlabeled nodes.
The paper says gradient information is distributed over the entire graph via the adjacency matrix A; how exactly does that work?
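A partial answer via a toy illustration (my own construction; the variable names are mine): each layer multiplies by the normalized adjacency matrix, so after k layers a node's output depends on nodes up to k hops away. With a two-layer GCN, the prediction at a labeled node already depends on the features of its unlabeled 2-hop neighbors, so the supervised loss at that node sends gradient into the weights that process those unlabeled nodes.

```python
import numpy as np

# Path graph 0-1-2-3; suppose only node 0 is labeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)                     # add self-loops
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))   # D~^-1/2 (A + I) D~^-1/2

# Support of A_hat^k = receptive field of node 0 after k layers.
print((A_hat != 0).astype(int)[0])                             # [1 1 0 0]
print((np.linalg.matrix_power(A_hat, 2) != 0).astype(int)[0])  # [1 1 1 0]
```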
Our contributions are two-fold.
- Firstly, we introduce a simple and well-behaved layer-wise propagation rule for neural network models which operate directly on graphs and show how it can be motivated from a first-order approximation of spectral graph convolutions (Hammond et al., 2011).
- Secondly, we demonstrate how this form of a graph-based neural network model can be used for fast and scalable semi-supervised classification of nodes in a graph.
- Experiments on a number of datasets demonstrate that our model compares favorably both in classification accuracy and efficiency (measured in wall-clock time) against state-of-the-art methods for semi-supervised learning.
2 Fast Approximate Convolutions on Graphs
H^{(l+1)} = \sigma( \tilde D^{-1/2} \tilde A \tilde D^{-1/2} H^{(l)} W^{(l)} )  (Eq. 2)

Here \tilde A = A + I_N is the adjacency matrix with added self-connections, \tilde D_{ii} = \sum_j \tilde A_{ij}, W^{(l)} is a layer-specific trainable weight matrix, \sigma(\cdot) is an activation function such as ReLU, and H^{(0)} = X.
The paper proposes a GCN model of the form above: a stack of many layers, where each layer's output propagates to the next through the rule above.
This model can be interpreted from two angles. The first views it as an approximation of traditional spectral GCN methods: "the form of this propagation rule can be motivated via a first-order approximation of localized spectral filters on graphs" (Section 2.1); the second interpretation is given in the appendix.
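As a sketch of what one such layer computes, here is a minimal NumPy implementation of the propagation rule (Eq. 2); the function name and toy dimensions are mine, not the paper's:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: sigma(D~^-1/2 A~ D~^-1/2 H W), Eq. 2."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0)        # ReLU non-linearity

# Toy usage: 4 nodes, 3 input features, 2 hidden units.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H0 = rng.normal(size=(4, 3))                   # H^(0) = X
W0 = rng.normal(size=(3, 2))
H1 = gcn_layer(A, H0, W0)                      # shape (4, 2)
```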
2.1 Spectral Graph Convolutions
g_\theta \star x = U g_\theta U^\top x  (Eq. 3)
The convolution operation on a graph can be written as above. x \in R^N is an input signal, g_\theta is a filter (parameterized by \theta \in R^N), \star denotes convolution, and (g_\theta \star x) \in R^N is the output signal. An intuitive reading: first map the input signal into the Fourier domain via the graph Fourier transform U^\top, apply the filter there, then map back with the inverse transform U. The U^\top and U in the formula come from the eigendecomposition of the normalized graph Laplacian L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^\top, so U is the matrix of eigenvectors of L and \Lambda the diagonal matrix of its eigenvalues.
The author's starting point is to reduce the cost of this convolution. First, eigendecomposing the graph Laplacian L is prohibitively expensive for large graphs (O(N^3)). Second, each Fourier transform is an (N, N) by (N, 1) matrix multiplication, costing O(N^2). How can this be approximated more cheaply?
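For concreteness, a sketch (mine, on a random toy graph) of the exact spectral convolution of Eq. 3, showing where the O(N^3) eigendecomposition and the O(N^2) transforms enter:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
A = (rng.random((N, N)) < 0.5).astype(float)
A = np.triu(A, 1)
A[np.arange(N - 1), np.arange(1, N)] = 1     # chain edges: no isolated nodes
A = A + A.T                                  # random undirected graph
d = A.sum(axis=1)
L = np.eye(N) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian

Lam, U = np.linalg.eigh(L)                   # L = U diag(Lam) U^T, O(N^3)
theta = rng.normal(size=N)                   # free spectral filter g_theta
x = rng.normal(size=N)                       # input graph signal

x_hat = U.T @ x                              # graph Fourier transform, O(N^2)
out = U @ (theta * x_hat)                    # filter, then inverse transform
```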
To circumvent this problem, it was suggested in Hammond et al. (2011) that g_θ(Λ) can be well-approximated by a truncated expansion in terms of Chebyshev polynomials T_k (x) up to K^{th} order:

g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde\Lambda)  (Eq. 4)

with the rescaled \tilde\Lambda = \frac{2}{\lambda_{max}} \Lambda - I_N; \theta' \in R^{K+1} is a vector of Chebyshev coefficients, and the Chebyshev polynomials are defined recursively by T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x), with T_0(x) = 1 and T_1(x) = x.
The author treats g_θ as a function of \Lambda; how should we understand this? I think g_θ itself can be chosen freely (any N-dimensional vector will do), so it can be written as a function of any N-dimensional quantity, and in particular as a function of \Lambda. The point of viewing it specifically as a function of \Lambda is that it can then be approximated with Chebyshev polynomials, and, as shown next, the \Lambda in the formula can be converted into L (thereby eliminating the expensive Fourier transforms).
Substituting Eq. 4 into Eq. 3, and noting that \tilde L = \frac{2}{\lambda_{max}} L - I_N = U \tilde\Lambda U^\top implies

T_k(\tilde\Lambda) = T_k(U^\top \tilde L U) = U^\top T_k(\tilde L) U,

we can simplify to:
g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde L) x  (Eq. 5)
Notice that the Fourier transform U cancels out completely! The expression is K-localized (it depends only on nodes at most K steps away from the central node), and since each T_k(\tilde L) x can be computed with sparse matrix-vector products, evaluating it costs O(|\mathcal{E}|), linear in the number of edges (much faster than O(N^2)). This answers Question 5: "scales linearly in the number of graph edges".
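A sketch (my own) of evaluating Eq. 5 via the recurrence T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x): only sparse matrix-vector products, no eigendecomposition. The function name and toy usage are mine; it assumes K = len(theta) - 1 >= 1, and lam_max defaults to 2, which the paper itself adopts later:

```python
import numpy as np
import scipy.sparse as sp

def cheb_filter(L, x, theta, lam_max=2.0):
    """Approximate g_theta * x = sum_k theta_k T_k(L~) x  (Eq. 5)."""
    N = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - sp.eye(N)  # rescale spectrum into [-1, 1]
    T_prev, T_curr = x, L_tilde @ x            # T_0 x and T_1 x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2 * (L_tilde @ T_curr) - T_prev
        out += theta[k] * T_curr
    return out                                 # K sparse matvecs: O(K|E|)

# Toy usage on a 4-node path graph.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
d = np.asarray(A.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
L = sp.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian
out = cheb_filter(L, np.ones(4), theta=[0.5, 0.3, 0.2])
```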
2.2 Layer-Wise Linear Model
A neural network model based on graph convolutions can therefore be built by stacking multiple convolutional layers of the form of Eq. 5, each layer followed by a point-wise non-linearity. Now, imagine we limited the layer-wise convolution operation to K = 1 (see Eq. 5), i.e. a function that is linear w.r.t. L and therefore a linear function on the graph Laplacian spectrum.
In the previous section we defined an approximate convolution via Chebyshev polynomials; how do we use it to build a deep neural network? There are two options. One is to increase K: the larger K, the higher the order of neighborhood covered and the higher the order of the polynomial in L. The other is to use a very small K per convolutional layer (e.g. K = 1) and stack multiple such layers, each layer's output serving as the next layer's input. The author takes the second approach:
We intuitively expect that such a model can alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions, such as social networks, citation networks, knowledge graphs and many other real-world graph datasets. Additionally, for a fixed computational budget, this layer-wise linear formulation allows us to build deeper models, a practice that is known to improve modeling capacity on a number of domains (He et al., 2016).
Benefits of stacking multiple such layers:
- it can alleviate overfitting to local neighborhood structures, which suits the complex graphs of the real world;
- composing several cheap linear layers keeps the computational cost low, so deeper networks can be built.
Having settled on stacking linear convolutional layers (K = 1), the author makes three further approximations:
Approximation 1: treat \lambda_{max} \approx 2 as a fixed value. This value only amounts to a rescaling, and under the supervision of the objective the trainable parameters \theta can adapt to absorb this scaling:
g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N) x = \theta'_0 x - \theta'_1 D^{-1/2} A D^{-1/2} x  (Eq. 6)
Approximation 2: use a single parameter \theta = \theta'_0 = -\theta'_1 in place of \theta'_0 and \theta'_1. This both reduces the number of parameters, guarding against overfitting, and halves the matrix multiplications:
g_\theta \star x \approx \theta (I_N + D^{-1/2} A D^{-1/2}) x  (Eq. 7)
Approximation 3 (the renormalization trick): the matrix I_N + D^{-1/2} A D^{-1/2} has eigenvalues in the range [0, 2], so applying it repeatedly over many layers can lead to vanishing or exploding gradients and numerical instabilities. A renormalization is therefore applied:
I_N + D^{-1/2} A D^{-1/2} \;\rightarrow\; \tilde D^{-1/2} \tilde A \tilde D^{-1/2}, \quad \tilde A = A + I_N, \quad \tilde D_{ii} = \sum_j \tilde A_{ij}
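A quick numerical check of this motivation (my own sketch, on a random toy graph): the eigenvalues of I_N + D^{-1/2} A D^{-1/2} span [0, 2], while after the renormalization trick they fall into (-1, 1], so repeated multiplication across layers stays bounded:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
A = (rng.random((N, N)) < 0.2).astype(float)
A = np.triu(A, 1)
A[np.arange(N - 1), np.arange(1, N)] = 1       # chain edges: no isolated nodes
A = A + A.T                                    # random undirected graph

d = A.sum(axis=1)
S = A / np.sqrt(np.outer(d, d))                # D^-1/2 A D^-1/2
before = np.linalg.eigvalsh(np.eye(N) + S)     # eigenvalues in [0, 2]

A_tilde = A + np.eye(N)
d_t = A_tilde.sum(axis=1)
after = np.linalg.eigvalsh(A_tilde / np.sqrt(np.outer(d_t, d_t)))  # in (-1, 1]

print(before.min(), before.max())              # approx 0 ... approx 2
print(after.min(), after.max())                # > -1  ... 1
```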
Generalizing to an input X \in R^{N\times C} with C channels and F filters, the final simplified model is:
Z = \tilde D^{-1/2} \tilde A \tilde D^{-1/2} X \Theta  (Eq. 8)

where \Theta \in R^{C\times F} is the matrix of filter parameters and Z \in R^{N\times F} the convolved signal matrix.
3 Semi-Supervised Node Classification
Having introduced a simple, yet flexible model f(X,A) for efficient information propagation on graphs, we can return to the problem of semi-supervised node classification. As outlined in the introduction, we can relax certain assumptions typically made in graph-based semi-supervised learning by conditioning our model f(X,A) both on the data X and on the adjacency matrix A of the underlying graph structure. We expect this setting to be especially powerful in scenarios where the adjacency matrix contains information not present in the data X, such as citation links between documents in a citation network or relations in a knowledge graph. The overall model, a multi-layer GCN for semi-supervised learning, is schematically depicted in Figure 1.
3.1 Example
Z = f(X, A) = softmax( \hat A \, ReLU( \hat A X W^{(0)} ) W^{(1)} ), \quad \hat A = \tilde D^{-1/2} \tilde A \tilde D^{-1/2}  (Eq. 9)
Taking a two-layer GCN as an example: X \in R^{N\times C} is the input, Z \in R^{N\times F} is the output, \hat A \in R^{N\times N} is the preprocessed adjacency matrix, W^{(0)} \in R^{C\times H} is the input-to-hidden weight matrix, and W^{(1)} \in R^{H\times F} is the hidden-to-output weight matrix.
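Putting the pieces together, a minimal NumPy sketch (mine; helper names and toy sizes are assumptions) of the full Eq. 9 forward pass:

```python
import numpy as np

def normalize_adj(A):
    """A_hat = D~^-1/2 (A + I) D~^-1/2 (renormalization trick)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return A_tilde / np.sqrt(np.outer(d, d))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1), Eq. 9."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0)   # hidden layer, shape (N, H)
    return softmax(A_hat @ H @ W1)      # class probabilities, shape (N, F)

# Toy usage: N=4 nodes, C=3 features, H=8 hidden units, F=2 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = rng.normal(size=(4, 3))
Z = gcn_forward(A, X, rng.normal(size=(3, 8)), rng.normal(size=(8, 2)))
```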
\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}  (Eq. 10)

where \mathcal{Y}_L is the set of node indices that have labels.
Cross-entropy over the labeled nodes only is used as the objective.
The weight matrices W^{(0)} and W^{(1)} are trained with full-batch gradient descent; the adjacency matrix can be stored as a sparse matrix, requiring O(|\mathcal{E}|) memory. The forward pass (Eq. 9) costs O(|\mathcal{E}|CHF); since C, H, and F are small constants, the complexity is linear in the number of edges.
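Finally, a sketch (mine) of the masked cross-entropy of Eq. 10, evaluated only over the labeled nodes; the function name and toy data are assumptions:

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    """Eq. 10: -sum over labeled nodes l and classes f of Y_lf * ln Z_lf."""
    eps = 1e-12                                   # numerical safety for log
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + eps))

# Toy usage: 4 nodes, 2 classes, only nodes 0 and 2 labeled.
Z = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]])  # predictions
Y = np.array([[1, 0],     [0, 0],     [0, 1],     [0, 0]], dtype=float)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 2])
```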