CS224W: Machine Learning with Graphs - 03 Node Embeddings
Node Embeddings
1. Graph Representation Learning
Graph representation learning alleviates the need to do feature engineering every single time (the features are learned automatically)
Goal : efficient task-independent feature learning for machine learning with graphs
Why embedding?
- Similarity of embeddings between nodes indicates their similarity in the network
- Encode network information
- Potentially used for many downstream predictions (node classification, link prediction, graph classification, anomalous node detection, clustering…)
2. Node Embeddings: Encoder and Decoder
Goal : encode nodes so that similarity in the embedding space approximates similarity in the graph
a) Encoder ENC maps from nodes to embeddings (a low-dimensional vector)
b) Define a node similarity function (i.e., a measure of similarity in the original network)
c) Decoder DEC maps from embeddings to the similarity score
d) Optimize the parameters of the encoder so that similarity(u, v)\approx z_v^Tz_u
1). “Shallow” Encoding
Simplest encoding approach: encoder is just an embedding-lookup so each node is assigned a unique embedding vector
\text{ENC}(v)=z_v=Z \cdot v
where Z \in \mathbb{R}^{d\times |V|} is a matrix whose columns are the node embeddings and v is an indicator vector with all zeroes except a one in the column indicating node v
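A minimal NumPy sketch of this lookup (the node count, dimensionality, and random Z below are made-up illustration values, not values from the lecture):
```python
import numpy as np

# Hypothetical toy setup: 4 nodes embedded in d = 2 dimensions
num_nodes, d = 4, 2
Z = np.random.rand(d, num_nodes)    # embedding matrix: column i is the embedding of node i

def enc(v_idx: int) -> np.ndarray:
    """Shallow encoder ENC(v) = Z . v, i.e. a simple column lookup."""
    v = np.zeros(num_nodes)
    v[v_idx] = 1.0                  # indicator (one-hot) vector for node v
    return Z @ v                    # equivalent to Z[:, v_idx]

z_u, z_v = enc(0), enc(1)
similarity = z_u @ z_v              # decoder: dot product z_u^T z_v
```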
Methods : DeepWalk, node2vec
3. Random Walk Approaches for Node Embeddings
- Vector z_u is the embedding of node u
- Probability P(v|z_u) is the (predicted) probability of visiting node v on random walks starting from node u
- Random walk : given a graph and a starting point, we select one of its neighbors at random and move to this neighbor; then we select a neighbor of this point at random and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph.
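A short sketch of such an unbiased walk on a toy adjacency list (the graph below is made up purely for illustration):
```python
import random

def random_walk(adj: dict, start, length: int) -> list:
    """Unbiased random walk: repeatedly hop to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:            # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

# Toy graph as an adjacency list
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(adj, start=0, length=5))   # e.g. [0, 2, 3, 2, 1, 0]
```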
1). Random-walk Embeddings
z_u^Tz_v \approx \text{probability that } u \text{ and } v \text{ co-occur on a random walk over the graph}
- Estimate probability P_R(v|u) of visiting node v on a random walk starting from node u using the random walk strategy R
- Optimize embeddings to encode these random walk statistics
Why random walks?
- Expressivity : flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information (If a random walk starting from node u visits v with high probability, u and v are similar)
- Efficiency : do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks
2). Unsupervised Feature Learning
Intuition : find embedding of nodes in d-dimensional space that preserves similarity
Idea : learn node embedding such that nearby nodes are close together in the network
N_R(u): neighborhood of u obtained by the strategy R
Goal: learn a mapping f: u \to \mathbb{R}^d: f(u) = z_u
Log-likelihood objective:
\underset{f}{\max}\sum_{u\in V} \log P(N_R(u)|z_u)
Given node u, we want to learn feature representations that are predictive of the nodes in its random walk neighborhood N_R(u)
3). Random-walk Optimization
a) Run short fixed-length random walks starting from each node u in the graph using the random walk strategy R
b) For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u
c) Optimize embeddings
\underset{f}{\max}\sum_{u\in V} \log P(N_R(u)|z_u)
Parameterize P(v|z_u) using softmax
P(v|z_u)=\frac{\exp(z_u^Tz_v)}{\sum_{n\in V}\exp(z_u^Tz_n)}
Let
L=\sum_{u\in V}\sum_{v\in N_R(u)}-\log P(v|z_u)=\sum_{u\in V}\sum_{v\in N_R(u)}-\log(\frac{\exp(z_u^Tz_v)}{\sum_{n\in V}\exp(z_u^Tz_n)})
Optimizing random walk embeddings = finding embeddings z_u that minimize L
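A sketch of evaluating this loss naively (here Z is assumed to be a |V| × d array of embeddings indexed by node id, and neighborhoods maps each node u to the multiset N_R(u) collected in step b); the inner normalization over all nodes is what makes this expensive:
```python
import numpy as np

def naive_loss(Z: np.ndarray, neighborhoods: dict) -> float:
    """Negative log-likelihood L with a full softmax over all nodes."""
    loss = 0.0
    for u, N_u in neighborhoods.items():
        scores = Z @ Z[u]                          # z_u^T z_n for every node n
        log_denom = np.log(np.exp(scores).sum())   # log sum_n exp(z_u^T z_n)
        for v in N_u:
            loss -= Z[u] @ Z[v] - log_denom        # -log P(v | z_u)
    return loss
```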
Time complexity : O(|V|^2), because the softmax normalization in the denominator sums over all nodes for every node u
Solution: Negative sampling
\log\left(\frac{\exp(z_u^Tz_v)}{\sum_{n\in V}\exp(z_u^Tz_n)}\right) \approx \log (\sigma(z_u^Tz_v)) - \sum_{i=1}^k \log(\sigma(z_u^Tz_{n_i})), n_i\sim P_V
where \sigma(\cdot) is the sigmoid function and the negative samples n_i are drawn from a random distribution P_V over nodes
Instead of normalizing w.r.t. all nodes, just normalize against k random “negative samples” n_i
Sample k negative nodes each with probability proportional to its degree (k=5\sim 20 in practice)
To minimize L, we can use gradient descent (GD) or stochastic gradient descent (SGD)
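A NumPy sketch of this negative-sampling approximation for a single (u, v) pair (here degrees is assumed to be an array of node degrees used to build the sampling distribution P_V, Z a |V| × d embedding array, and k an illustrative choice):
```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def approx_log_prob(z_u, z_v, Z, degrees, k=5):
    """Approximate log P(v | z_u) as log sigma(z_u^T z_v) - sum_i log sigma(z_u^T z_{n_i}),
    with k negatives sampled proportionally to node degree."""
    probs = degrees / degrees.sum()
    neg = rng.choice(len(Z), size=k, p=probs)          # k negative samples n_i ~ P_V
    return np.log(sigmoid(z_u @ z_v)) - np.log(sigmoid(Z[neg] @ z_u)).sum()
```
In a full implementation, the negative of this quantity summed over all (u, v) walk pairs would be minimized with GD/SGD, as noted above.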
How to randomly walk
Simplest idea: just run fixed-length, unbiased random walks starting from each node
4). Overview of Node2vec
- Goal: embed nodes with similar network neighborhoods close in the feature space
- Key observation: flexible notion of network neighborhood N_R(u) of node u leads to rich node embeddings
- Develop biased 2^{nd} order random walk R to trade off between local (BFS) and global (DFS) views of the network
Interpolating BFS and DFS
Two parameters:
- Return parameter p: return back to the previous node
- In-out parameter q (ratio of BFS vs DFS): moving outwards (DFS) or inwards (BFS)
Biased Random Walks
Idea: remember where the walk came from
- BFS-like walk: low value of p
- DFS-like walk: low value of q
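A minimal sketch of one step of such a 2nd-order biased walk (the weighting below follows the usual node2vec scheme: 1/p for returning to the previous node, 1 for neighbors at distance 1 from it, 1/q for moving further out; `adj` is assumed to be a dict of neighbor lists and `prev` the node visited just before `curr`):
```python
import random

def biased_step(adj: dict, prev, curr, p: float, q: float):
    """One step of a node2vec-style 2nd-order random walk."""
    weights = []
    for x in adj[curr]:
        if x == prev:
            weights.append(1.0 / p)      # return parameter: go back to prev
        elif x in adj[prev]:
            weights.append(1.0)          # stays at distance 1 from prev (BFS-like)
        else:
            weights.append(1.0 / q)      # moves outward to distance 2 from prev (DFS-like)
    return random.choices(adj[curr], weights=weights, k=1)[0]
```
The very first step of each walk has no previous node and is taken uniformly at random.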
Node2vec Algorithm
a) compute random walk probabilities
b) simulate r random walks of length l starting from each node u
c) optimize the node2vec objective using SGD
Time Complexity: linear
Steps are individually parallelizable
5). Other Random Walk Ideas
- Different kinds of biased random walks: based on node attributes/learned weights
- Alternative optimization schemes: directly optimize based on 1-hop and 2-hop random walk probabilities
- Network preprocessing techniques: run random walks on modified versions of the original network
4. Embedding Entire Graphs
Goal: embed a subgraph or an entire graph G
1). Approach 1
Run a standard graph embedding technique on the (sub)graph G and then just sum the node embeddings in the (sub)graph G
z_G=\sum_{v\in G}z_v
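A trivial sketch of this aggregation (Z is assumed to be a |V| × d array of node embeddings trained with any of the methods above):
```python
import numpy as np

def graph_embedding(Z: np.ndarray, nodes) -> np.ndarray:
    """Embed a (sub)graph by summing the embeddings of its nodes."""
    return Z[list(nodes)].sum(axis=0)
```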
2). Approach 2
Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique
3). Approach 3: Anonymous Walk Embeddings
- States in anonymous walks correspond to the index of the first time we visited the node in a random walk (agnostic to the identity of the nodes visited)
- Simulate anonymous walks w_i of l steps, record their counts, and represent the graph as a probability distribution over these walks
- Number of anonymous walks grows exponentially
- Sampling anonymous walks
Generate independently a set of m random walks
Represent the graph as a probability distribution over these walks; for the estimate to have error of more than \epsilon with probability less than \delta, the number of walks needed is
m=\left\lceil\frac{2}{\epsilon^2}(\log(2^{\eta}-2)-\log(\delta))\right\rceil
where \eta is the total number of anonymous walks of length l (see the sketch below)
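A small sketch of anonymizing walks and building the distribution (the 0-based indexing and helper names here are illustrative choices, not from the lecture):
```python
from collections import Counter

def anonymize(walk: list) -> tuple:
    """Replace each node by the index of its first visit, e.g. [A, B, A, C] -> (0, 1, 0, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen)) for node in walk)

def anonymous_walk_distribution(walks: list) -> dict:
    """Represent a graph as the empirical distribution over sampled anonymous walks."""
    counts = Counter(anonymize(w) for w in walks)
    total = sum(counts.values())
    return {aw: c / total for aw, c in counts.items()}
```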
5. How to Use Embeddings
- Clustering/community detection: cluster points z_i
- Node classification: predict label of node i based on z_i
- Link prediction: predict edge (i, j) based on (z_i, z_j) (where we can concatenate, average, take the product of, or take the difference between the embeddings; see the sketch after this list)
- Graph classification: graph embedding z_G via aggregating node embeddings or anonymous random walks. Predict label based on graph embedding z_G
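For the link-prediction bullet above, a small sketch of turning a node-embedding pair into an edge feature (the function name and mode labels are illustrative; the element-wise Hadamard product is one common choice for "product"):
```python
import numpy as np

def edge_feature(z_i: np.ndarray, z_j: np.ndarray, mode: str = "hadamard") -> np.ndarray:
    """Combine two node embeddings into a feature vector for edge (i, j)."""
    if mode == "concat":
        return np.concatenate([z_i, z_j])
    if mode == "average":
        return (z_i + z_j) / 2
    if mode == "hadamard":
        return z_i * z_j              # element-wise product
    if mode == "difference":
        return np.abs(z_i - z_j)
    raise ValueError(f"unknown mode: {mode}")
```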
