GNN-CS224W: 1-2 Introduction； Traditional Methods for machine learning in Graphs

阅读量：

network相比其他的数据有很多特点让它难以处理：

arbitrary size and complex topological structure

Applications of Graph ML

task level

Node-level

Predict a property of a node
Example: Categorize online users / items

Alpha Fold

Objective: Compute the 3D structure of a protein using only its amino acid sequence information.

Key idea: "Spatial graph" is used to model the three-dimensional structure of proteins by representing them as spatial graphs. Nodes represent amino acids within a protein sequence, and edges connect these nodes based on their spatial proximity and interactions in three-dimensional space.

模型：基于给定Amino acids的位置以及它们之间的edge proximity relationships among them的基础上推导出新的Amino acids new positions（新出现的氨基酸位置）或者 Amino acids positional changes（氨基酸位置的变化），进而推断出amino acids positions and protein final structure（蛋白质各氨基酸的位置及其整体结构）。

Edge-level

Determine if there are unconnected edges within a pair of nodes. Example: Knowledge graph construction.

Recommender system

Node entity includes user accounts and item entities (digital goods and physical products). Edge dynamics represent the engagement interactions between users and items. The system's objective is to offer tailored item suggestions that align with individual user preferences.

Drug side effects–Biomedical graph link prediction

有些人可能服用多种药物，并非为了测试所有可能的药物组合之间的相互作用影响结果。
由于药物种类过多而难以测试所有药物组合及其副作用的影响结果，则应用机器学习技术来预测。

任务：通过给定的一对药物预测潜在的副作用
节点：药物与蛋白质
边：药物与蛋白质之间的相互作用、蛋白质之间的相互作用、药物之间的相互作用
节点之间的边中有一些是已知的，并利用模型预测缺失的边

Community (subgraph) level

Identify whether nodes belong to a group. Example: Social circle detection can be seen as identifying tightly-knit subgroups within social networks. Equivalent to cluster analysis, it aims to detect groups of interconnected individuals within larger networks.

Traffic prediction

将道路网络表示为图中的一种结构形式。其中，每个节点代表道路路段；而边则表示连接这些路段的道路连接。具体来说，在这个图中完成的任务是确定从一节点至另一节点的路径规划。

Graph-level

Categorize different graphs
Example: Molecule property prediction

Drug discover

Antibiotics represent small molecular graphs.其中Nodes代表Atom. Edges代表Chemical bonds. Task is to identify whether a molecule can be considered as a drug.

大量的由原子组成的分子数量庞大，在逐一进行药物测试方面存在巨大挑战，因此有必要利用机器学习来预测那些可能有价值的分子。

Graph generation – Drug discovery

还是将原子视为节点单元以及化学键链作为连接关系

Graph evolution – Physical simulation

Physical simulation can be modeled as a graph. Nodes represent particles, and edges denote the interactions among particles. The task is to simulate the future states of these particles.

Choice of graph representation, how do we define a graph?

the choice of what the nodes are and what the links are are very important.

相同的数据集提供者可以选择多种节点和边类型作为node和link的具体选择，并且这种选择对于模型性能至关重要。

Object集合：包含节点nodes和顶点vertices N

Directed and Undirected Graph

Undirected Graph

node之间的关系是 symmetrical(对称的), reciprocal(相互的)

Examples: Collaborations, Friendship on Facebook

Directed Graph

每个链接都具有方向性、来源和目的地。
例如：电话呼叫、在Twitter上转发信息。

Node degrees

Node degree refers to the count of connected edges attached to each node, specifically including both in-degree and out-degree. For instance, if $k_i=4$ , it means node i has four connected edges. The average degree is calculated as the sum of all nodes' degrees divided by the total number of nodes, resulting in $\frac{1}{N}\sum\limits^{N}_{i=1}k_i=\frac{2E}{N}$ .

directed graph node has In degree and Out degree

Bipartite graph

如图所示，满足如下特性：

All nodes can be separated into two distinct groups, denoted as U and V.
Each link links a node from U to a corresponding node in V.
U and V are independent sets.

Examples:
§ 学者-correspond-to-论文（這些學者曾 publication）
§ 演員-correspond-to-电影（這些演員曾在电影中出現过）

Folded/Projected Bipartite graph

如上所示,通过折叠操作,Bipartite graph能够将一边与另一侧连接.假设上图展示的是作者与论文之间的关系网络,左边圆圈代表作者,右边方框代表论文.在未进行折叠操作的情况下, bipartite graph表明,编号为1、2、3的作者共同撰写了论文A;编号为2和5的作者共同撰写了论文B.同样地, 对另一侧进行折叠操作也可以揭示类似的联系.

Representing graphs

Adjacent matrix

Many real-life networks are sparse, which leads to the adjacency matrix also being sparse. Take common social networks as an example, the global population stands at 7 billion people. However, an individual can only form a limited number of connections with others, leaving vast portions of the network unconnected.

Edge list

将graph表示为一个二维的matrix

缺点是不能做任何graph manipulation和analysis of the graph

Adjacency list

优点：

支持对图的修改和分析
相较于邻接矩阵而言, 该方法节省了大量存储空间。
使用二维矩阵来进行存储, 由于边的数量极少, 导致列的数量相对较少。

Node, edge and graph attributes

Possible options:

Weight（例如沟通频率）权重可以直接体现为邻接矩阵中的元素

排名（核心朋友、次要核心朋友等）
类型（个人关系类型：包括朋友、亲属关系类型以及同事关系类型）
符号：正向关系vs负面关系、信任vs不信任
依赖于图结构的属性：共同好友数量

more types of graphs

节点具有指向自身的连接，在adjacency矩阵中表现为对角线位置上的元素均为1

两个节点之间不止一条边。在某些情况下，当两条边上所承载的含义相同时，则可将两条边的数量视为权重；但在另一些情况下，则不可将两条边的数量视为权重。

连通图和非连通图

Undirected

连通图和非连通图的差别可以提现在adjacent matrix，如下图

Directed

Strongly connected directed graph (强连通图)不仅存在从任意两个节点之间都可以相互到达的情况，并且这种关系是相互的（例如说A到B以及B到A的道路）

A weakly connected directed graph becomes connected when ignoring edges' orientations.

Strongly connected components (SCCs) (强连通分量)

Traditional Methods for machine learning in Graphs

Traditional ML pipeline

Design aspects for nodes, links, and graphs. The node's attributes: such as protein characteristics. For instance, the position of a node within the network and its surrounding structural features.

特征可以被分为2种：

复制代码

1. network structure
2. nodes/links/graphs properties

Obtain features for all training data

Node-level tasks and features

Objective: Analyze the topology and location of a node within the network, commonly characterized by four key attributes.

Node degree

该方法计算邻居数量时忽略了其重要性

Node centrality

Node centrality（中心度）用于衡量一个node在graph中是否为中心及其程度如何

有如下几种方式

Engienvector(特征向量) centrality

A node $v$ is defined as a central node whose neighboring nodes $u\in N(v)$ are all key in the network structure.

node $v$ 的centrality的计算方式：

$c_v=\frac{1}{\lambda}\sum\limits_{u\in N(v)}c_u$

其中 $\lambda$ 为一个正数常量，并被称作eigenvalue（特征值）。该方程组可以用矩阵形式表示为：

$\lambda c = Ac$

其中A为adjacent matrix, c称为为eigenvector

问题

由 Perron-Frobenius 定理可知,矩阵的最大特征值 λ_max 始终为正且唯一（Perron-Frobenius 定理）。最大特征向量 c_max 被用于衡量网络中心性。
每个网络中的顶点都有可能影响到其相邻顶点的行为或属性,从而影响整个网络的行为或属性分布模式。
如果没有预先设定某些初始顶点具有特定的行为特征,则所有其他顶点都需要通过依赖其相邻顶点逐步计算得出其行为特征。
这种情况下,位于外围边缘区域的孤立顶点将无法获得外部信息以确定自身的行为特征来源。

Betweenness centrality

Intuition: A node is considered central if it appears in a significant number of the shortest paths among other nodes.
$c_v = \sum_{s \neq v \neq t} \frac{\#(\text{the shortest path(s) connecting }s \text{ and }t \text{ through node }v)}{\#(\text{the shortest path(s) connecting }s \text{ and }t)}$

"#"表示the number of

Closeness centrality

Intuition: According to the concept, a node is considered significant if its minimal shortest paths to all other nodes are relatively short.

$c_v$ 等于1除以从所有u不等于v的最短路径长度之和，在u和v之间。

and many others…

Clustering coefficient

Intuition: Reflects the connectedness of neighboring nodes. Essentially, the extent to which the neighbors of a node are connected.

$e_v= \frac{\#(edges\ among\ neighboring\ nodes)}{C_{k_v}^2}$

其中分子代表了node v的所有邻居节点之间实际存在的边的数量；符号 $k_v$ 定义为node v的度数；而分母则计算了从node v的所有邻居节点中选取两个节点形成边的可能性数量。

Another view

Graphlets

The number of triangles can be generalized by enumerating pre-specified subgraphs, such as graphlets.

什么是graphlet

如下图为包括2个到5个node的graphlets
isomorphic(同构)

需要注意的是，在每个graphlet中有一些节点被标记了数字这些被标记数字的node表示我们需要获取其特征这些标记的目的在于明确该node在图中的具体位置同时还需要区分不同情况具体来说当一个graphlet具有三个节点时会有两种不同的结构这是因为对于某些特定的位置排列我们只需要考虑其中一种情况而无需重复计算另外一种情况因此对于具有两个不同排列方向的情况我们将其视为同一类而具有不同排列方向的情况则需要分别计算因此对于有两个不同排列方向的情况我们将其视为两种不同的case

Graphlet Degree Vector (GDV)

如图所示，在GDV中，每个节点都对应着一个向量。这个向量中的每一个维度都对应于一种graphlet，并且数值表示该graphlet出现的次数。

Considering graphlets containing between 2 and 5 nodes, we obtain: A vector comprising 73 coordinates serves as an identifier for each node, encapsulating the topological features characterizing its local structure. It captures the interconnectivities extending up to four hops away and also highlights that five nodes require four edges to form their structure.

作用

The purpose of the graphlet degree vector is to represent an assessment of a node's local network structure.

Comparing vectors of two nodes demonstrates a greater detail metric for evaluating the topological structure proximity than node degrees or clustering coefficients.

Node features 分类总结

Importance-based features

capture the importance of a node is in a graph

Node degree
Different node centrality measures

Valuable in foreseeing key players within a network structure, as demonstrated by cases such as identifying opinion leaders on an online platform.

Structure-based features

Capture topological properties of local neighborhood around a node

Node degree
Clustering coefficient
Graphlet count vector

Can be used to predict the specific function of nodes within a graph: For instance, it can be used to predict the activity of proteins within a protein-protein interaction network.

Link-level prediction tasks and features

Task: predict new links based on existing links.

方式：未预先存在的所有链接都被排序，并据此预测出前k个节点对。

Key: design features for a pair of nodes.

只用node feature的话将会缺少link信息

Two formulations of the link prediction task

Links missing at random:

Remove a random set of links and then aim to predict them

更适合static network，例如protein-protein network

Links over time

基于一个动态发展的网络结构（如社会网络或引用网络）

Consider the network $G[t_0,t'_0]$ , which represents the edges up to time $t'_0$ . The task is to provide a sorted list L of links (excluding edges from $G[t_0,t'_0]$ ) that are predicted to appear in the extended period up to time $t'_1$ .

Evaluación criterios: El sistema adquiere las aristas nuevas que surgen durante el intervalo de prueba [t₁, t₁'] y las compara con las aristas previamente predichas.

只适用于dynamic network，例如protein-protein network

此处未明确阐述 $t_0,t'_1$ 之间的关系这一问题，请问该图是否代表时间段 $t_0$ 到 $t'_1$ 内的时间序列及其对应的graph？或者仅仅是一个单一时间点的graph？这为何无法将预测连接线包含在该时间段内生成的时间序列及其对应的graph中？

Methodology 常见步骤

Each node pair (x, y), calculate the score value c(x, y). As an illustration, this score can be calculated as the count...

Arrange pairs $(x, y)$ based on their decreasing scores, $c(x, y)$ . Forecast the top $n$ pairs to be introduced as new connections. Investigate whether these predicted links are present in $G[t_1, t_1']$ .

Link-level features

Objective: introduce two nodes, characterize their interconnection, and subsequently predict or determine whether a connection exists between these nodes.

Distance-based feature

Local neighborhood overlap feature

Captures number of neighboring nodes shared between two nodes $v_1$ and $v_2$ :

在Adamic-Adar指数中, 当一个节点（ $k_u$ ）的度数显著高时, $\log{k_u}$ 明显增大, 因此 $\frac{1}{\log{k_u}}$ 相应减小
这表明在一个节点对共享邻居的影响程度上, 度数较高的邻居具有较弱的影响, 而度数较低的邻居则具有较强的影响

比如在social network中，并非拥有相同明星作为粉丝的人必然具备某种社会关系；相反地，在某些情况下两者均为小众明星的追随者反而更能体现出潜在的社会联系。

缺点：Metric will always be zero when two nodes lack any shared neighbors.However, despite this, the two nodes might still have a chance to connect in the future.As shown in the following figure for example

此处的local意为仅局限于 $v_1$ 和 $v_2$ 的邻居分析, 而网络中其他节点未被纳入分析

Global neighborhood overlap feature

相比Local neighborhood overlap feature，考虑了 the entire graph

基于全局节点邻域重叠的一种度量方法；Katz指标；用于量化任意一对节点之间所有路径的数量。

employ the power of the graph adjacency matrix to calculate the number of paths with a specific length between two specific nodes.

powers of the graph adjacency matrix

在矩阵 $A^{(k)}$ 中，幂 $k$ 代表了路径的长度。当元素 $A_{uv}^{(k)} \neq 0$ 时，则表明节点u与节点v之间具有长度为 $k$ 的路径，并且该路径的数量等于对应的元素值。

现在继续看Katz index

the Katz index between $v_1$ and $v_2$ is determined by a summation over all paths, with each path contributing a weighted value based on its length. The formula for this calculation is given by the infinite series:

S_{v_1 v_2} = \sum\limits_{l=1}^{\infty} \beta^l (A^l)_{v_1 v_2}

其中 $0<\beta<1$ 被称为discount factor。该式子表明，在 $v_1$ 和 $v_2$ 之间的最短路径具有更高的权重值，在较长距离路径上的权重则会降低相应的数值系数倍数关系。当 $S_{v_1 v_2}$ 值越大时，则表示两个节点之间的联系更为紧密。

Katz index matrix is computed using an analytical solution:
$S=\sum\limits_{i=1}^{\infty} \beta^i A^i=(I-\beta A)^{-1}-I$

该计算基于矩阵序列的几何级数进行推导，并由下式体现：

(I-\beta A)^{-1} = \sum_{i=0}^\infty \beta^i A^i

Graph-level features and graph kernals

Kernel methods

widely-used for traditional ML for graph-level prediction

Idea: 而不是从特征向量出发设计核函数普通方法首先从图中提取特征向量作为模型输入核方法同样从图中提取特征向量进行计算然而在计算两个图之间的相似性时并不需要进一步建模

Kernel $K(G, G')\in R$ measures similarity between data

The kernel matrix $\textbf{K}=(K(G, G'))_{G, G'}$ is necessarily positive semidefinite (i.e., it has positive eigenvalues).

There exists a feature representation $\phi(\cdot)$ such that $K(G, G')=\phi(G)^T\phi(G')$

After defining the kernel, off-the-shelf ML models, such as kernel SVMs, are capable of making predictions.

Graph-level features: Graph kernals

Objective: Develop feature descriptors that encode the structural properties of a complete graph in order to compute similarity.

Graph kernal: Key Idea

这里是Graph kernal 方法的一个简单例子

特征1：Bag-of-Words (BoW) for a graph

类似于基于词袋模型的文本表示方法，在图神经网络领域中，则通过一个特征向量来编码graph结构信息。该特征向量的长度等于所有possible node的数量，在这种编码方案下，每个元素都标识对应节点在特定graph中是否存在

这种方法能包含的信息容量非常有限, 会使得大量具有不同结构的图被表示为相同的形式, 例如下图

其中 $\phi$ 表示graph的特征向量

特征2：Bag of node degrees

使用node degrees生成特征向量，包含的信息比特征1更多，如下图所示：

这里的特征1和特征2属于较为基础的Graph kernel方法,下面两个属于更为高阶的Graph kernel方法

Graphlet Kernel

Key idea: Count the number of different graphlets in a graph.

Here, the definition of graphlets differs slightly from that of node-level features.

Node(s) in the graphlets are not required to be connected, which permits the presence of isolated nodes. These graphlets lack a root structure.

其中 $Sum(f_G)$ 即代表特征向量所有元素之和，在上述示例中 $f_G=(1,3,6,0)^T$ 时计算结果为10

Limitations

Counting graphlets is expensive!

Weisfeiler-Lehman Kernel

more computing efficient than Graphlet Kernel

Concept: exploit the neighborhood structure to recurrently enhance node vocabulary

实现方法：Color refinement

算法描述：

如图所示，在初始状态下以1标记color节点。首先将所有邻居加入到该节点中；接着利用hash函数对节点及其邻居的数据特征进行映射处理，并生成新的color值。这即完成一次aggregate操作，在此过程中每个节点的颜色字段整合了自身及直接邻居的数据特征

按照下图所示的步骤进行第二次aggregate操作，在此过程中对于当前状态中的每一个node而言，在此之前该节点已经整合了与之直接相连的所有邻居节点的信息（即一步邻居）。经过该操作后，每个节点不仅整合了与其直接相连的所有邻居节点的信息（即一步邻居），还整合了通过两次连接到达的所有间接邻居节点的信息（即两步邻居）。

理解

可以有效的集合大量的各种距离的neighbor信息

在第一步操作中收集了与节点距离为1的所有相关节点信息，在第二步操作中收集了与
节点距离为2的所有相关连接信息；经过K次操作后
，每个
节点整合并存储了与其相距
K
的所有相关连接信息

time complexity

在每个步骤中，时间复杂度与边的数量成正比。
颜色种类数最多等于节点总数。
颜色是可以复用的（即在不同步骤中可以重复使用），但必须满足每一步使用的颜色在全局上是独立的（即不会与其他步骤中的颜色冲突）。因此，在这种情况下最多只需要使用与节点数量相当的颜色种类。
统计颜色的时间复杂度为线性时间相对于节点数目。(此步骤)在构建特征向量的过程中统计每一种颜色的数量，并且仅需节点数目次计算

In total, time complexity is linear in #(edges).

Other kernels

(beyond the scope of this lecture)

Random-walk kernel
Shortest-path graph kernel
And many more…

全部评论 (0)

还没有任何评论哟~

GNN-CS224W: 1-2 Introduction； Traditional Methods for machine learning in Graphs

network相比其他的数据有很多特点让它难以处理： arbitrarysizeandcomplextopologicalstructure ApplicationsofGraphML tasklev...

An Introduction to Kernel Methods for Machine Learning:

作者：禅与计算机程序设计艺术 1.简介 Inthisarticle,wewillgiveabriefintroductiontokernelmethodsinmachinelearning,inclu...

CS224W: Machine Learning with Graphs - 03 Node Embeddings

NodeEmbeddings 1\.GraphRepresnetationLearning Graphrepresnetationlearningalleviatestheneedtodofeatur...

Introduction to Machine Learning in Production

文章目录一、机器学习生命周期和部署概述 1\.概述 2.机器学习的生命周期 3\.机器学习模型部署的挑战 4\.项目部署模式 5\.项目监控二、建模 1.模型训练的挑战 2\.误差分析 3\.性能...

图神经网络（CS224w）学习笔记1 Introduction：Machine Learning for Graphs

文章目录前言一、为什么要选择图进行机器学习？（起源） 1.现代深度学习工具的不足二、图神经网络定义 1.有监督机器学习全流程图 2.图表示机器学习三、图机器学习的应用四、图表示的选择 1.图...

Introduction to Machine Learning for Absolute Beginners

作者：禅与计算机程序设计艺术 1.简介机器学习（英文MachineLearning）是一门跨学科的科学研究领域，涉及概率论、统计学、计算理论、信息论等多个领域。它旨在让计算机系统能够“学习”从数据中...

ECE598HZ: Advanced Topics in Machine Learning and Formal Methods

ECE598HZ:AdvancedTopicsinMachineLearningandFormalMethods Fall2024 Homework1 DueSep2311:59pmCT Typese...

图神经网络（CS224w）学习笔记2Traditional Methods for ML on Graphs

文章目录前言一、传统基于节点特征方法： 1.1、节点的度Nodedegree 1.2、节点中心性Nodecentrality 1.2.1、特征向量中心Eigenvectorcentrality: ...

Supervised machine learning methods

目录 1\.Overview 2\.Process 2.1Overallprocess 2.2Howthemodelslearns 3\.Models 3.1LinearRegression 3.1....

Chapter01.introduction（Machine Learning）

Chapter01.introduction（MachineLearning）文章目录 Chapter01.introduction（MachineLearning） 1Welcome 1.1Mac...

是否确定退出登录?

GNN-CS224W: 1-2 Introduction； Traditional Methods for machine learning in Graphs

Applications of Graph ML

task level

Node-level

Alpha Fold

Edge-level

Recommender system

Drug side effects–Biomedical graph link prediction

Community (subgraph) level

Traffic prediction

Graph-level

Drug discover

Graph generation – Drug discovery

Graph evolution – Physical simulation

Choice of graph representation, how do we define a graph?

Directed and Undirected Graph

Undirected Graph

Directed Graph

Node degrees

Bipartite graph

Folded/Projected Bipartite graph

Representing graphs

Adjacent matrix

Edge list

Adjacency list

Node, edge and graph attributes

more types of graphs

连通图和非连通图

Undirected

Directed

Strongly connected components (SCCs) (强连通分量)

Traditional Methods for machine learning in Graphs

Traditional ML pipeline

Node-level tasks and features

Node degree

Node centrality

Engienvector(特征向量) centrality

问题

Betweenness centrality

Closeness centrality

and many others…

Clustering coefficient

Graphlets

什么是graphlet

Graphlet Degree Vector (GDV)

作用

Node features 分类总结

Importance-based features

Structure-based features

Link-level prediction tasks and features

Two formulations of the link prediction task

Links missing at random:

Links over time

Methodology 常见步骤

Link-level features

Distance-based feature

Local neighborhood overlap feature

Global neighborhood overlap feature

powers of the graph adjacency matrix

Graph-level features and graph kernals

Kernel methods

Graph-level features: Graph kernals

Graph kernal: Key Idea

Graphlet Kernel

Limitations

Weisfeiler-Lehman Kernel

Color refinement

理解

time complexity

Other kernels

全部评论 (0)

相关文章推荐

GNN-CS224W: 1-2 Introduction； Traditional Methods for machine learning in Graphs

An Introduction to Kernel Methods for Machine Learning:

CS224W: Machine Learning with Graphs - 03 Node Embeddings

Introduction to Machine Learning in Production

图神经网络（CS224w）学习笔记1 Introduction：Machine Learning for Graphs

Introduction to Machine Learning for Absolute Beginners

ECE598HZ: Advanced Topics in Machine Learning and Formal Methods