Differential geometry and algebraic topology are not encountered very frequently in mainstream machine learning. In this series of posts, I show how tools from these fields can be used to reinterpret Graph Neural Networks and address some of their common plights in a principled way.
“Symmetry, as wide or as narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty, and perfection.”
This somewhat poetic description by Hermann Weyl [1] underlines the cornerstone role of symmetry in science. Felix Klein’s 1872 “Erlangen Programme” [2] characterised geometries through their symmetry groups. Not only was this a breakthrough in mathematics, unifying the “zoo of geometries,” but it also led to the development of modern physical theories that can be entirely derived from the first principles of symmetry [3]. Similar principles have emerged in machine learning under the umbrella of Geometric Deep Learning, a general blueprint for deriving the majority of popular neural network architectures through group invariance and equivariance [4].
The Geometric Deep Learning Blueprint can be applied to different domains, such as grids, meshes, or graphs [5]. Yet, while the former two have clear continuous analogues (grids can be considered a discretisation of Euclidean or, more generally, homogeneous spaces such as the sphere, and meshes are a common discretisation of two-dimensional manifolds), there is no immediate continuous analogue for graphs [6]. This discrepancy is somewhat disturbing and motivates us to take a closer look at continuous models for learning on graphs.
Graph Neural Diffusion. Graph Neural Networks (GNNs) learn by performing some form of message passing on the graph, whereby features are passed from node to node across the edges. Such a mechanism is related to diffusion processes on graphs that can be expressed in the form of a partial differential equation (PDE) called “diffusion equation”. In a recent paper [7], we showed that the discretisation of such PDEs with nonlinear learnable diffusivity functions (referred to as “Graph Neural Diffusion” or very modestly, GRAND) generalises a broad class of GNN architectures such as Graph Attention Networks (GAT) [8].
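To make the connection concrete, here is a minimal sketch (my own simplification, not the implementation of [7]) of message passing viewed as the explicit Euler discretisation of a graph diffusion equation, with a toy dot-product attention playing the role of the learnable diffusivity:

```python
# A minimal sketch of the idea behind GRAND: message passing as explicit Euler
# steps of the graph diffusion equation dX/dt = (A(X) - I) X, where A(X) is a
# row-stochastic attention (diffusivity) matrix computed from the features.
import numpy as np

def attention_matrix(X, adj, temperature=1.0):
    """Toy dot-product attention restricted to graph edges (the diffusivity)."""
    scores = X @ X.T / temperature
    scores = np.where(adj > 0, scores, -np.inf)       # attend only along edges
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return scores / scores.sum(axis=1, keepdims=True)

def grand_diffusion(X, adj, step_size=0.1, n_steps=20):
    """Explicit Euler discretisation of dX/dt = (A(X) - I) X; one step ~ one GNN layer."""
    for _ in range(n_steps):
        A = attention_matrix(X, adj)
        X = X + step_size * (A @ X - X)
    return X

# Example: adj is a dense {0,1} adjacency matrix with self-loops (so every row of
# the attention matrix is well defined); X holds the node features.
rng = np.random.default_rng(0)
adj = (rng.random((10, 10)) < 0.3).astype(float)
adj = np.maximum(adj, adj.T)                          # symmetrise
np.fill_diagonal(adj, 1.0)                            # add self-loops
X = rng.normal(size=(10, 4))
X_out = grand_diffusion(X, adj)
```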
The PDE mindset offers multiple advantages, such as the possibility to exploit efficient numerical solvers (e.g. implicit, multistep, adaptive, and multigrid schemes) with guaranteed stability and convergence properties. Some of these solvers do not have an immediate analogy among the zoo of popular GNN architectures, potentially leading to new and interesting Graph Neural Network designs. Such architectures might also be somewhat more interpretable than typical ones, owing to the fact that the diffusion PDE we consider can be seen as the gradient flow [9] of an associated energy.
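Schematically (the notation below is mine rather than copied from [7, 9]), the diffusion equation and the Dirichlet-type energy whose gradient flow it realises (exactly so when the diffusivity is held fixed) look as follows:

```latex
% Schematic notation (assumed, not taken verbatim from [7, 9]): graph diffusion
% with diffusivity a, and the Dirichlet-type energy whose gradient flow it
% realises when the edge diffusivities a_{uv} are held fixed.
\frac{\partial \mathbf{X}(t)}{\partial t}
  = \mathrm{div}\!\big( a(\mathbf{X}(t))\, \nabla \mathbf{X}(t) \big),
\qquad
E(\mathbf{X}) = \tfrac{1}{2} \sum_{(u,v)\in\mathcal{E}} a_{uv}\, \lVert \mathbf{x}_u - \mathbf{x}_v \rVert^2,
\qquad
\frac{\partial \mathbf{X}}{\partial t} = -\nabla_{\mathbf{X}} E(\mathbf{X}).
```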
At the same time, while the GRAND model offers continuous time in place of the layers of traditional GNNs, the spatial part of the equation remains discrete and relies on the input graph. Importantly, in this diffusion model, the domain (the graph) is fixed, and a property defined on it (the features) evolves.
A different concept commonly used in differential geometry is that of geometric flows, which evolve the properties of the domain itself [10]. This idea was adopted by my PhD advisor Ron Kimmel and his co-authors in the 1990s in the field of image processing [11]. They modelled images as manifolds embedded in a joint positional and colour space and evolved them by a PDE minimising the harmonic energy of the embedding [12]. Such a PDE, named the Beltrami flow, has the form of isotropic non-Euclidean diffusion and produces edge-preserving image denoising.
We adapted this paradigm to graphs in the “Beltrami Neural Diffusion” (BLEND) framework [13]. The nodes of the graph are now characterised by positional and feature coordinates, both of which are evolved and both of which determine the diffusivity. In this mindset, the graph itself is relegated to an auxiliary role: it can be generated from the positional coordinates (e.g. as a k-nearest-neighbour graph) and rewired throughout the evolution. The following Figure illustrates this simultaneous evolution process:
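In code, the idea might look roughly as follows. This is a deliberately simplified sketch of my own, not the BLEND implementation of [13]; the Gaussian kernel, step sizes, and rewiring schedule are illustrative assumptions:

```python
# A rough sketch of Beltrami-style evolution: nodes carry positional coordinates U
# and feature coordinates X; both diffuse with a shared diffusivity computed from
# the joint coordinates, and the graph is periodically rebuilt as a k-NN graph.
import numpy as np

def knn_adjacency(U, k=5):
    """Symmetric adjacency of the k-nearest-neighbour graph built from positions U."""
    d = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros(d.shape, dtype=bool)
    adj[np.repeat(np.arange(len(U)), k), nbrs.ravel()] = True
    return adj | adj.T

def diffusivity(U, X, adj, sigma=1.0):
    """Row-normalised Gaussian kernel on the joint positional + feature coordinates."""
    Z = np.concatenate([U, X], axis=1)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    W = np.where(adj, np.exp(-d2 / (2 * sigma ** 2)), 0.0)
    return W / W.sum(axis=1, keepdims=True)

def blend_like_evolution(U, X, k=5, step_size=0.1, n_steps=20, rewire_every=5):
    """Jointly evolve positions and features; the graph plays only an auxiliary role."""
    for t in range(n_steps):
        if t % rewire_every == 0:
            adj = knn_adjacency(U, k)                 # rewire from current positions
        A = diffusivity(U, X, adj)
        U = U + step_size * (A @ U - U)               # evolve positional coordinates...
        X = X + step_size * (A @ X - X)               # ...and features with shared diffusivity
    return U, X

rng = np.random.default_rng(0)
U_out, X_out = blend_like_evolution(rng.normal(size=(30, 2)), rng.normal(size=(30, 5)))
```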
Significant attention has been devoted in recent works to the expressive power of Graph Neural Networks. Message-passing GNNs are equivalent in expressive power to the Weisfeiler-Lehman graph isomorphism test [14–16], a classical method that attempts to determine whether two graphs are structurally equivalent (“isomorphic”) by means of iterative colour refinement. Passing this test is a necessary but insufficient condition for isomorphism: in fact, some non-isomorphic graphs are deemed equivalent by Weisfeiler-Lehman. The following Figure illustrates what message-passing GNNs “see”: the two highlighted nodes appear indistinguishable, even though the graphs clearly have a different structure:
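For concreteness, here is a minimal sketch of the colour-refinement procedure behind the (1-dimensional) Weisfeiler-Lehman test; the relabelling scheme and iteration count are illustrative choices, not those of [14–16]:

```python
# 1-WL colour refinement: repeatedly relabel each node by its own colour together
# with the multiset of its neighbours' colours, then compare colour histograms.
from collections import Counter

def wl_colour_histogram(adj_list, n_iters=3):
    """adj_list: dict mapping each node to an iterable of its neighbours."""
    colours = {v: 0 for v in adj_list}                          # uniform initial colouring
    for _ in range(n_iters):
        signatures = {v: (colours[v], tuple(sorted(colours[u] for u in adj_list[v])))
                      for v in adj_list}
        relabel = {s: i for i, s in enumerate(sorted(set(signatures.values())))}
        colours = {v: relabel[signatures[v]] for v in adj_list}
    return Counter(colours.values())

# Different histograms certify non-isomorphism; equal histograms are inconclusive,
# which is exactly why the test (and hence message-passing GNNs) can be fooled:
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_colour_histogram(hexagon) == wl_colour_histogram(two_triangles))   # True
```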
Positional Encoding. A common remedy to this problem is to “colour” the nodes by assigning to them some additional features representing the role or “position” of the node in the graph. Popularised in Transformers [17] (which are a special case of attentional GNNs operating on a complete graph [4]), positional encoding methods have become a common way of increasing the expressive power of Graph Neural Networks.
Perhaps the most straightforward approach is to endow each node with a random feature [18]; however, while being more expressive, such an approach would provide poor generalisation (since it is impossible to reproduce random features across two graphs). The eigenvectors of the graph Laplacian operator [19] provide a neighbourhood-preserving embedding of the graph and have been successfully used as positional encoding. Finally, we showed in our paper with Giorgos Bouritsas and Fabrizio Frasca [20] that graph substructure counting can be used as a form of positional or “structural” encoding that can be made provably more powerful than the basic Weisfeiler-Lehman test.
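As an example of the second option, a Laplacian positional encoding can be computed in a few lines. This is only a sketch; the choice of the normalised Laplacian, the number of eigenvectors, and the example graph are assumptions of mine:

```python
# Laplacian positional encoding: use the eigenvectors of the (normalised) graph
# Laplacian with the smallest non-zero eigenvalues as extra node features.
# The well-known sign ambiguity of eigenvectors is ignored in this sketch.
import numpy as np
import networkx as nx

def laplacian_positional_encoding(G, k=4):
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                        # skip the trivial 0-eigenvalue eigenvector

G = nx.karate_club_graph()
pos_enc = laplacian_positional_encoding(G, k=4)       # shape: (num_nodes, 4)
```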
However, with a variety of choices for positional encoding, there is no clear recipe for choosing one, and no clear answer to the question of which method works better in which cases. I believe that geometric flows such as BLEND can be interpreted in the light of this question: by evolving the positional coordinates of the graph through non-Euclidean diffusion, the positional encoding is adapted to the downstream task. The answer is, therefore, “it depends”: the best positional encoding is a function of the data and the task at hand.
Higher-order Message Passing. An alternative take on expressivity is to stop thinking about graphs purely in terms of nodes and edges. Graphs are examples of objects called cell complexes, one of the main objects of study in algebraic topology. In this terminology, nodes are 0-cells and edges are 1-cells. One does not have to stop there: we can also construct 2-cells (faces), as shown in the following Figure, which make the two graphs from our previous example perfectly distinguishable:
In two recent papers coauthored with Cristian Bodnar and Fabrizio Frasca [21–22], we show that it is possible to construct a “lifting transformation” that augments the graph with such higher-order cells, on which one can perform a more complex form of hierarchical message passing. This scheme can be made provably more expressive than the Weisfeiler-Lehman test and has shown promising results in computational chemistry, where many molecules exhibit structures that are better modelled as cell complexes than as graphs.
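A toy illustration of the lifting idea is below. This is not the construction of [21–22], which uses induced cycles up to a fixed size together with a hierarchical message-passing scheme on top; here a networkx cycle basis serves as a crude stand-in for the rings:

```python
# Toy "lifting" of a graph to a cell complex: keep nodes as 0-cells and edges as
# 1-cells, and attach small cycles ("rings") as 2-cells.
import networkx as nx

def lift_to_cell_complex(G, max_ring_size=6):
    """Return (0-cells, 1-cells, 2-cells) of a toy cell complex built from G."""
    rings = [tuple(c) for c in nx.cycle_basis(G) if len(c) <= max_ring_size]
    return list(G.nodes), list(G.edges), rings

# A 6-cycle and two disjoint triangles are indistinguishable to 1-WL (both are
# 2-regular), but after lifting they carry different 2-cells:
hexagon = nx.cycle_graph(6)
two_triangles = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
print(lift_to_cell_complex(hexagon)[2])          # one 6-ring
print(lift_to_cell_complex(two_triangles)[2])    # two 3-rings
```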
Over-squashing, Bottlenecks, and Graph Rewiring. Another common plight of GNNs is the “over-squashing” phenomenon: the failure of message passing to propagate information efficiently due to certain structural characteristics of the input graph (“bottlenecks”) [23]. Over-squashing typically occurs in graphs with exponential volume growth, such as small-world networks [24], and in problems dependent on long-range information. Put differently, the input graph on which a GNN operates is not necessarily friendly for message passing.
Empirically, it was observed that decoupling the input graph from the computational graph and passing messages on a different graph helps alleviate the problem; such techniques are generally referred to as “graph rewiring”.
It is fair to say that many popular GNN architectures implement some form of graph rewiring, which can take the form of neighbourhood sampling (originally proposed in GraphSAGE to cope with scalability [25]) or multi-hop filters [26]. The topological message passing discussed above can also be seen as a form of rewiring, whereby information flow between distant nodes can be “shortcut” through higher-order cells. Alon and Yahav [23] showed that even as simplistic an approach as using a fully connected graph may help improve over-squashing in graph ML problems. Klicpera and coauthors enthusiastically proclaimed that “diffusion improves graph learning”, proposing a universal preprocessing step for GNNs (named “DIGL”) that consists of denoising the connectivity of the graph by means of a diffusion process [27]. Overall, despite significant empirical study, the over-squashing phenomenon has remained elusive and insufficiently understood.
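As a rough illustration of diffusion-based rewiring (a simplification of my own rather than the DIGL algorithm of [27]; the random-walk normalisation and the threshold value are assumptions of this sketch), one can replace the adjacency matrix by a sparsified Personalised PageRank diffusion of it:

```python
# Diffusion-based rewiring sketch: diffuse the connectivity with Personalised
# PageRank and threshold the result to obtain a denoised graph.
import numpy as np

def ppr_rewire(adj, alpha=0.15, eps=0.01):
    """Rewire by thresholding a Personalised PageRank diffusion of the adjacency."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    T = adj / deg                                              # random-walk transition matrix
    S = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * T)     # closed-form PPR diffusion
    np.fill_diagonal(S, 0.0)                                   # drop self-loops
    # Note: S need not be symmetric here; DIGL uses a symmetric normalisation and
    # a more careful sparsification than this simple threshold.
    return (S > eps).astype(float)
```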
In a recent paper [28], we show that bottlenecks resulting in over-squashing can be attributed to the local geometric properties of the graph. Specifically, by defining a graph analogue of Ricci curvature, we show that negatively curved edges are the culprits. This interpretation leads to a graph rewiring procedure akin to a “backward Ricci flow” that surgically removes problematic edges and produces a graph that is more message-passing-friendly while remaining structurally similar to the input one.
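A rough sketch of the curvature diagnostics (not the Balanced Forman curvature or the rewiring algorithm of [28]): score each edge with a simple Forman-style curvature proxy, 4 − deg(u) − deg(v) + 3·#triangles(u, v), which is an assumption of this sketch, and flag the most negatively curved edges as the bottleneck candidates that a rewiring procedure would then target:

```python
# Curvature-based bottleneck detection with a simple Forman-style proxy.
import networkx as nx

def forman_like_curvature(G, u, v):
    triangles = len(set(G[u]) & set(G[v]))            # common neighbours of u and v
    return 4 - G.degree(u) - G.degree(v) + 3 * triangles

def bottleneck_edges(G, n_edges=3):
    """Edges sorted from most negative (bottleneck-like) to most positive curvature."""
    curv = {(u, v): forman_like_curvature(G, u, v) for u, v in G.edges}
    return sorted(curv, key=curv.get)[:n_edges]

# Example: two dense cliques joined by a single bridge; the bridge is the only
# negatively curved edge, i.e. exactly where over-squashing occurs.
G = nx.barbell_graph(10, 0)
print(bottleneck_edges(G))
```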
These examples show that differential geometry and algebraic topology bring a new perspective to important and challenging problems in graph machine learning. In the following posts in this series, I will show in further detail how tools from these fields can be used to address the aforementioned problems of Graph Neural Networks. Part II will discuss how algebraic topology can improve the expressive power of GNNs. Part III will deal with geometric diffusion PDEs. Part IV will show how the over-squashing phenomenon can be related to graph curvature, and offer a geometric approach to graph rewiring inspired by the Ricci flow.