GNN-CS224W: 1-2 Introduction; Traditional Methods for machine learning in Graphs
network相比其他的数据有很多特点让它难以处理:
arbitrary size and complex topological structure
Applications of Graph ML
task level
Node-level
Predict a property of a node
Example: Categorize online users / items
Alpha Fold
Objective: Compute the 3D structure of a protein using only its amino acid sequence information.
Key idea: "Spatial graph" is used to model the three-dimensional structure of proteins by representing them as spatial graphs. Nodes represent amino acids within a protein sequence, and edges connect these nodes based on their spatial proximity and interactions in three-dimensional space.
模型:基于给定Amino acids的位置以及它们之间的edge proximity relationships among them的基础上推导出新的Amino acids new positions(新出现的氨基酸位置)或者 Amino acids positional changes(氨基酸位置的变化),进而推断出amino acids positions and protein final structure(蛋白质各氨基酸的位置及其整体结构)。
Edge-level
Determine if there are unconnected edges within a pair of nodes. Example: Knowledge graph construction.
Recommender system
Node entity includes user accounts and item entities (digital goods and physical products). Edge dynamics represent the engagement interactions between users and items. The system's objective is to offer tailored item suggestions that align with individual user preferences.
Drug side effects–Biomedical graph link prediction
有些人可能服用多种药物,并非为了测试所有可能的药物组合之间的相互作用影响结果。
由于药物种类过多而难以测试所有药物组合及其副作用的影响结果,则应用机器学习技术来预测。
任务:通过给定的一对药物预测潜在的副作用
节点:药物与蛋白质
边:药物与蛋白质之间的相互作用、蛋白质之间的相互作用、药物之间的相互作用
节点之间的边中有一些是已知的,并利用模型预测缺失的边
Community (subgraph) level
Identify whether nodes belong to a group. Example: Social circle detection can be seen as identifying tightly-knit subgroups within social networks. Equivalent to cluster analysis, it aims to detect groups of interconnected individuals within larger networks.
Traffic prediction
将道路网络表示为图中的一种结构形式。其中,每个节点代表道路路段;而边则表示连接这些路段的道路连接。具体来说,在这个图中完成的任务是确定从一节点至另一节点的路径规划。
Graph-level
Categorize different graphs
Example: Molecule property prediction
Drug discover
Antibiotics represent small molecular graphs.其中Nodes代表Atom. Edges代表Chemical bonds. Task is to identify whether a molecule can be considered as a drug.
大量的由原子组成的分子数量庞大,在逐一进行药物测试方面存在巨大挑战,因此有必要利用机器学习来预测那些可能有价值的分子。
Graph generation – Drug discovery
还是将原子视为节点单元以及化学键链作为连接关系
Graph evolution – Physical simulation
Physical simulation can be modeled as a graph. Nodes represent particles, and edges denote the interactions among particles. The task is to simulate the future states of these particles.
Choice of graph representation, how do we define a graph?
the choice of what the nodes are and what the links are are very important.
相同的数据集提供者可以选择多种节点和边类型作为node和link的具体选择,并且这种选择对于模型性能至关重要。
Object集合:包含节点nodes和顶点vertices N
Directed and Undirected Graph
Undirected Graph
node之间的关系是 symmetrical(对称的), reciprocal(相互的)
Examples: Collaborations, Friendship on Facebook
Directed Graph
每个链接都具有方向性、来源和目的地。
例如:电话呼叫、在Twitter上转发信息。
Node degrees
Node degree refers to the count of connected edges attached to each node, specifically including both in-degree and out-degree. For instance, if k_i=4, it means node i has four connected edges. The average degree is calculated as the sum of all nodes' degrees divided by the total number of nodes, resulting in \frac{1}{N}\sum\limits^{N}_{i=1}k_i=\frac{2E}{N}.
directed graph node has In degree and Out degree
Bipartite graph

如图所示,满足如下特性:
- All nodes can be separated into two distinct groups, denoted as U and V.
- Each link links a node from U to a corresponding node in V.
- U and V are independent sets.
Examples:
§ 学者-correspond-to-论文(這些學者 曾 publication)
§ 演員-correspond-to-电影 (這些演員 曾在 电影中 出現过)
Folded/Projected Bipartite graph

如上所示,通过折叠操作,Bipartite graph能够将一边与另一侧连接.假设上图展示的是作者与论文之间的关系网络,左边圆圈代表作者,右边方框代表论文.在未进行折叠操作的情况下, bipartite graph表明,编号为1、2、3的作者共同撰写了论文A;编号为2和5的作者共同撰写了论文B.同样地, 对另一侧进行折叠操作也可以揭示类似的联系.
Representing graphs
Adjacent matrix


Many real-life networks are sparse, which leads to the adjacency matrix also being sparse. Take common social networks as an example, the global population stands at 7 billion people. However, an individual can only form a limited number of connections with others, leaving vast portions of the network unconnected.
Edge list

将graph表示为一个二维的matrix
缺点是不能做任何graph manipulation和analysis of the graph
Adjacency list

优点:
- 支持对图的修改和分析
- 相较于邻接矩阵而言, 该方法节省了大量存储空间。
使用二维矩阵来进行存储, 由于边的数量极少, 导致列的数量相对较少。
Node, edge and graph attributes
Possible options:
Weight(例如沟通频率)权重可以直接体现为邻接矩阵中的元素
- 排名(核心朋友、次要核心朋友等)
- 类型(个人关系类型:包括朋友、亲属关系类型以及同事关系类型)
- 符号:正向关系vs负面关系、信任vs不信任
- 依赖于图结构的属性:共同好友数量
more types of graphs
节点具有指向自身的连接,在adjacency矩阵中表现为对角线位置上的元素均为1
两个节点之间不止一条边。在某些情况下,当两条边上所承载的含义相同时,则可将两条边的数量视为权重;但在另一些情况下,则不可将两条边的数量视为权重。
连通图和非连通图
Undirected
连通图和非连通图的差别可以提现在adjacent matrix,如下图

Directed
Strongly connected directed graph (强连通图)不仅存在从任意两个节点之间都可以相互到达的情况,并且这种关系是相互的(例如说A到B以及B到A的道路)
A weakly connected directed graph becomes connected when ignoring edges' orientations.
Strongly connected components (SCCs) (强连通分量)
Traditional Methods for machine learning in Graphs
Traditional ML pipeline
Design aspects for nodes, links, and graphs. The node's attributes: such as protein characteristics. For instance, the position of a node within the network and its surrounding structural features.
特征可以被分为2种:
1. network structure
2. nodes/links/graphs properties
Obtain features for all training data
Node-level tasks and features
Objective: Analyze the topology and location of a node within the network, commonly characterized by four key attributes.
Node degree
该方法计算邻居数量时忽略了其重要性
Node centrality
Node centrality(中心度)用于衡量一个node在graph中是否为中心及其程度如何
有如下几种方式
Engienvector(特征向量) centrality
A node v is defined as a central node whose neighboring nodes u\in N(v) are all key in the network structure.
node v的centrality的计算方式:
c_v=\frac{1}{\lambda}\sum\limits_{u\in N(v)}c_u
其中\lambda为一个正数常量,并被称作eigenvalue(特征值)。该方程组可以用矩阵形式表示为:
\lambda c = Ac
其中A为adjacent matrix, c称为为eigenvector
问题
- 由 Perron-Frobenius 定理可知,矩阵的最大特征值 λ_max 始终为正且唯一(Perron-Frobenius 定理)。最大特征向量 c_max 被用于衡量网络中心性。
- 每个网络中的顶点都有可能影响到其相邻顶点的行为或属性,从而影响整个网络的行为或属性分布模式。
如果没有预先设定某些初始顶点具有特定的行为特征,则所有其他顶点都需要通过依赖其相邻顶点逐步计算得出其行为特征。
这种情况下,位于外围边缘区域的孤立顶点将无法获得外部信息以确定自身的行为特征来源。
Betweenness centrality
Intuition: A node is considered central if it appears in a significant number of the shortest paths among other nodes.
c_v = \sum_{s \neq v \neq t} \frac{\#(\text{the shortest path(s) connecting }s \text{ and }t \text{ through node }v)}{\#(\text{the shortest path(s) connecting }s \text{ and }t)}
"#"表示the number of
Closeness centrality
Intuition: According to the concept, a node is considered significant if its minimal shortest paths to all other nodes are relatively short.
c_v等于1除以从所有u不等于v的最短路径长度之和,在u和v之间。
and many others…
Clustering coefficient
Intuition: Reflects the connectedness of neighboring nodes. Essentially, the extent to which the neighbors of a node are connected.
e_v= \frac{\#(edges\ among\ neighboring\ nodes)}{C_{k_v}^2}
其中分子代表了node v的所有邻居节点之间实际存在的边的数量;符号k_v定义为node v的度数;而分母则计算了从node v的所有邻居节点中选取两个节点形成边的可能性数量。
Another view

Graphlets
The number of triangles can be generalized by enumerating pre-specified subgraphs, such as graphlets.
什么是graphlet
如下图为包括2个到5个node的graphlets
isomorphic(同构)

需要注意的是,在每个graphlet中有一些节点被标记了数字这些被标记数字的node表示我们需要获取其特征这些标记的目的在于明确该node在图中的具体位置同时还需要区分不同情况具体来说当一个graphlet具有三个节点时会有两种不同的结构这是因为对于某些特定的位置排列我们只需要考虑其中一种情况而无需重复计算另外一种情况因此对于具有两个不同排列方向的情况我们将其视为同一类而具有不同排列方向的情况则需要分别计算因此对于有两个不同排列方向的情况我们将其视为两种不同的case
Graphlet Degree Vector (GDV)
如图所示,在GDV中,每个节点都对应着一个向量。这个向量中的每一个维度都对应于一种graphlet,并且数值表示该graphlet出现的次数。

Considering graphlets containing between 2 and 5 nodes, we obtain: A vector comprising 73 coordinates serves as an identifier for each node, encapsulating the topological features characterizing its local structure. It captures the interconnectivities extending up to four hops away and also highlights that five nodes require four edges to form their structure.
作用
The purpose of the graphlet degree vector is to represent an assessment of a node's local network structure.
Comparing vectors of two nodes demonstrates a greater detail metric for evaluating the topological structure proximity than node degrees or clustering coefficients.
Node features 分类总结
Importance-based features
capture the importance of a node is in a graph
- Node degree
- Different node centrality measures
Valuable in foreseeing key players within a network structure, as demonstrated by cases such as identifying opinion leaders on an online platform.
Structure-based features
Capture topological properties of local neighborhood around a node
- Node degree
- Clustering coefficient
- Graphlet count vector
Can be used to predict the specific function of nodes within a graph: For instance, it can be used to predict the activity of proteins within a protein-protein interaction network.
Link-level prediction tasks and features
Task: predict new links based on existing links.
方式:未预先存在的所有链接都被排序,并据此预测出前k个节点对。
Key: design features for a pair of nodes.
只用node feature的话将会缺少link信息
Two formulations of the link prediction task
Links missing at random:
Remove a random set of links and then aim to predict them
更适合static network,例如protein-protein network
Links over time
基于一个动态发展的网络结构(如社会网络或引用网络)
Consider the network G[t_0,t'_0], which represents the edges up to time t'_0. The task is to provide a sorted list L of links (excluding edges from G[t_0,t'_0]) that are predicted to appear in the extended period up to time t'_1.
Evaluación criterios: El sistema adquiere las aristas nuevas que surgen durante el intervalo de prueba [t₁, t₁'] y las compara con las aristas previamente predichas.
只适用于dynamic network,例如protein-protein network
此处未明确阐述t_0,t'_1之间的关系这一问题,请问该图是否代表时间段t_0到t'_1内的时间序列及其对应的graph?或者仅仅是一个单一时间点的graph?这为何无法将预测连接线包含在该时间段内生成的时间序列及其对应的graph中?
Methodology 常见步骤
Each node pair (x, y), calculate the score value c(x, y). As an illustration, this score can be calculated as the count...
Arrange pairs (x, y) based on their decreasing scores, c(x, y). Forecast the top n pairs to be introduced as new connections. Investigate whether these predicted links are present in G[t_1, t_1'].
Link-level features
Objective: introduce two nodes, characterize their interconnection, and subsequently predict or determine whether a connection exists between these nodes.
Distance-based feature

Local neighborhood overlap feature
Captures number of neighboring nodes shared between two nodes v_1 and v_2:

在Adamic-Adar指数中, 当一个节点(k_u)的度数显著高时, \log{k_u}明显增大, 因此\frac{1}{\log{k_u}}相应减小
这表明在一个节点对共享邻居的影响程度上, 度数较高的邻居具有较弱的影响, 而度数较低的邻居则具有较强的影响
比如在social network中,并非拥有相同明星作为粉丝的人必然具备某种社会关系;相反地,在某些情况下两者均为小众明星的追随者反而更能体现出潜在的社会联系。
缺点:Metric will always be zero when two nodes lack any shared neighbors.However, despite this, the two nodes might still have a chance to connect in the future.As shown in the following figure for example

此处的local意为仅局限于v_1和v_2的邻居分析, 而网络中其他节点未被纳入分析
Global neighborhood overlap feature
相比Local neighborhood overlap feature,考虑了 the entire graph
基于全局节点邻域重叠的一种度量方法;Katz指标;用于量化任意一对节点之间所有路径的数量。
employ the power of the graph adjacency matrix to calculate the number of paths with a specific length between two specific nodes.
powers of the graph adjacency matrix


在矩阵A^{(k)}中,幂k代表了路径的长度。当元素A_{uv}^{(k)} \neq 0时,则表明节点u与节点v之间具有长度为k的路径,并且该路径的数量等于对应的元素值。
现在继续看Katz index
the Katz index between v_1 and v_2 is determined by a summation over all paths, with each path contributing a weighted value based on its length. The formula for this calculation is given by the infinite series:
S_{v_1 v_2} = \sum\limits_{l=1}^{\infty} \beta^l (A^l)_{v_1 v_2}
其中0<\beta<1被称为discount factor。该式子表明,在v_1和v_2之间的最短路径具有更高的权重值,在较长距离路径上的权重则会降低相应的数值系数倍数关系。当S_{v_1 v_2}值越大时,则表示两个节点之间的联系更为紧密。
Katz index matrix is computed using an analytical solution:
S=\sum\limits_{i=1}^{\infty} \beta^i A^i=(I-\beta A)^{-1}-I
该计算基于矩阵序列的几何级数进行推导,并由下式体现:
(I-\beta A)^{-1} = \sum_{i=0}^\infty \beta^i A^i
Graph-level features and graph kernals
Kernel methods
widely-used for traditional ML for graph-level prediction
Idea: 而不是从特征向量出发设计核函数 普通方法首先从图中提取特征向量作为模型输入 核方法同样从图中提取特征向量进行计算 然而 在计算两个图之间的相似性时 并不需要进一步建模
Kernel K(G, G')\in R measures similarity between data
The kernel matrix \textbf{K}=(K(G, G'))_{G, G'} is necessarily positive semidefinite (i.e., it has positive eigenvalues).
There exists a feature representation \phi(\cdot) such that K(G, G')=\phi(G)^T\phi(G')
After defining the kernel, off-the-shelf ML models, such as kernel SVMs, are capable of making predictions.
Graph-level features: Graph kernals
Objective: Develop feature descriptors that encode the structural properties of a complete graph in order to compute similarity.
Graph kernal: Key Idea
这里是Graph kernal 方法的一个简单例子
特征1:Bag-of-Words (BoW) for a graph
类似于基于词袋模型的文本表示方法,在图神经网络领域中,则通过一个特征向量来编码graph结构信息。该特征向量的长度等于所有possible node的数量,在这种编码方案下,每个元素都标识对应节点在特定graph中是否存在
这种方法能包含的信息容量非常有限, 会使得大量具有不同结构的图被表示为相同的形式, 例如下图

其中\phi表示graph的特征向量
特征2:Bag of node degrees
使用node degrees生成特征向量,包含的信息比特征1更多,如下图所示:

这里的特征1和特征2属于较为基础的Graph kernel方法,下面两个属于更为高阶的Graph kernel方法
Graphlet Kernel
Key idea: Count the number of different graphlets in a graph.
Here, the definition of graphlets differs slightly from that of node-level features.
Node(s) in the graphlets are not required to be connected, which permits the presence of isolated nodes. These graphlets lack a root structure.




其中Sum(f_G)即代表特征向量所有元素之和,在上述示例中f_G=(1,3,6,0)^T时计算结果为10
Limitations
Counting graphlets is expensive!
Weisfeiler-Lehman Kernel
more computing efficient than Graphlet Kernel
Concept: exploit the neighborhood structure to recurrently enhance node vocabulary
实现方法:Color refinement
Color refinement
算法描述:

如图所示,在初始状态下以1标记color节点。首先将所有邻居加入到该节点中;接着利用hash函数对节点及其邻居的数据特征进行映射处理,并生成新的color值。这即完成一次aggregate操作,在此过程中每个节点的颜色字段整合了自身及直接邻居的数据特征


按照下图所示的步骤进行第二次aggregate操作,在此过程中对于当前状态中的每一个node而言,在此之前该节点已经整合了与之直接相连的所有邻居节点的信息(即一步邻居)。经过该操作后,每个节点不仅整合了与其直接相连的所有邻居节点的信息(即一步邻居),还整合了通过两次连接到达的所有间接邻居节点的信息(即两步邻居)。




理解
可以有效的集合大量的各种距离的neighbor信息
在第一步操作中收集了与节点距离为1的所有相关节点信息,在第二步操作中收集了与
节点距离为2的所有相关连接信息;经过K次操作后
,每个
节点整合并存储了与其相距
K
的所有相关连接信息
time complexity
-
在每个步骤中,时间复杂度与边的数量成正比。
-
颜色种类数最多等于节点总数。
-
颜色是可以复用的(即在不同步骤中可以重复使用),但必须满足每一步使用的颜色在全局上是独立的(即不会与其他步骤中的颜色冲突)。因此,在这种情况下最多只需要使用与节点数量相当的颜色种类。
-
统计颜色的时间复杂度为线性时间相对于节点数目。(此步骤)在构建特征向量的过程中统计每一种颜色的数量,并且仅需节点数目次计算
In total, time complexity is linear in #(edges).
Other kernels
(beyond the scope of this lecture)
- Random-walk kernel
- Shortest-path graph kernel
And many more…
