Graph Language Models
The Graph Language Model (GLM) integrates language models and graph neural networks: it inherits a pretrained LM's grasp of concepts and relations, outperforms existing models in relation classification, and processes both text and structured graph data.
The paper "Graph Language Models" by Moritz Plenz and Anette Frank presents a new type of language model that integrates the capabilities of traditional language models (LMs) and graph neural networks (GNNs). While LMs are widely used in natural language processing (NLP), their interaction with structured knowledge graphs (KGs) remains an area of active research. Current methods either linearize graphs for embedding with LMs, which leads to a loss of structural information, or utilize GNNs, which struggle to represent text features effectively. The proposed Graph Language Model (GLM) addresses these limitations by initializing its parameters from a pretrained LM, enhancing its understanding of graph concepts and relationships. The architecture of the GLM is designed to incorporate graph biases, facilitating effective knowledge distribution. This allows the GLM to process both graphs and text, as well as combinations of the two. Empirical evaluations demonstrate that GLM embeddings outperform both LM and GNN baselines in relation classification tasks, showcasing their versatility in supervised and zero-shot settings. The findings suggest that GLMs could significantly advance the integration of structured knowledge into NLP applications.
- The Graph Language Model (GLM) combines strengths of language models and graph neural networks.
- GLM parameters are initialized from pretrained language models to improve understanding of graph concepts.
- The GLM architecture incorporates graph biases to facilitate effective knowledge distribution (illustrated in the sketch after this list).
- Empirical evaluations show GLM embeddings outperform existing LM and GNN baselines.
- GLMs can process both text and structured graph data effectively.
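As a rough illustration of two of the bullets above (initializing from a pretrained LM and incorporating graph biases), here is a minimal sketch in which self-attention scores are biased by the graph distance between tokens. The bias scheme, the toy path graph, and the randomly generated weights are illustrative assumptions, not the authors' implementation; the paper's actual architecture should be taken from the original.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_biased_attention(x, w_q, w_k, w_v, graph_dist, bias_per_hop=-1.0):
    """Single-head self-attention with an additive bias that decays with
    graph distance between tokens (a stand-in for the paper's graph biases).

    x           : (n_tokens, d_model) token embeddings
    w_q/w_k/w_v : (d_model, d_head) projections; in a GLM-style setup these
                  would be copied from a pretrained LM checkpoint
    graph_dist  : (n_tokens, n_tokens) shortest-path distances in the input
                  graph, np.inf for disconnected pairs
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Tokens far apart in the graph attend to each other less;
    # disconnected pairs are masked out entirely.
    bias = np.where(np.isinf(graph_dist), -1e9, bias_per_hop * graph_dist)
    return softmax(scores + bias) @ v

# Toy example: 4 tokens forming a path graph 0-1-2-3.
rng = np.random.default_rng(0)
d_model, d_head, n = 16, 8, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
dist = np.array([[0, 1, 2, 3],
                 [1, 0, 1, 2],
                 [2, 1, 0, 1],
                 [3, 2, 1, 0]], dtype=float)
print(graph_biased_attention(x, w_q, w_k, w_v, dist).shape)  # (4, 8)
```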
Related
Basically, language models are already graph neural networks. To understand why, go back to Word2Vec/GloVe: word embedding distances represent the co-occurrence frequency of words in a sentence.
Note how this is the same as a graph embedding problem: words are nodes, and the edge weight is co-occurrence frequency. You embed the graph nodes. In fact, this is stated in formal math in the GloVe paper.
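To make that concrete, here is a minimal sketch: build a word co-occurrence graph from a toy corpus, then factorize its log-weighted adjacency matrix to get node (word) embeddings. GloVe fits essentially this matrix with a weighted least-squares objective rather than an SVD; the corpus, window size, and embedding dimension below are made up for illustration.

```python
import numpy as np
from collections import Counter

# Toy corpus; in the graph view, words are nodes and co-occurrence counts
# within a window are edge weights.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
window = 3

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            a, b = idx[w], idx[words[j]]
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

# Weighted adjacency matrix of the co-occurrence graph.
n = len(vocab)
A = np.zeros((n, n))
for (a, b), c in cooc.items():
    A[a, b] = c

# Embed the graph nodes: factorize log co-occurrence counts with a truncated
# SVD (GloVe instead fits this with a weighted least-squares loss).
U, S, _ = np.linalg.svd(np.log1p(A))
dim = 2
embeddings = U[:, :dim] * S[:dim]

for w in ("cat", "dog", "mat"):
    print(w, np.round(embeddings[idx[w]], 2))
```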
The LLM architecture is basically doing the same thing, except the graph encodes conditional occurrence given the preceding context words.
This setup makes for a graph with a truly astronomical number of nodes (word|context) and edges. This huge graph exists only in the land of abstract math, but it also shows why LLMs require so many parameters to perform well.
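A quick sketch of why that graph blows up: counting distinct context nodes and (context, next-word) edges in even a toy token sequence shows the growth as the context length increases. With a real vocabulary and long contexts, the counts become astronomical.

```python
from collections import Counter

# Count distinct context nodes and (context, next-word) edges in a toy
# sequence for increasing context lengths. The sequence is made up for
# illustration; real corpora and vocabularies make these counts explode.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for ctx_len in (1, 2, 3, 4):
    contexts = Counter(
        tuple(tokens[i:i + ctx_len]) for i in range(len(tokens) - ctx_len)
    )
    edges = {
        (tuple(tokens[i:i + ctx_len]), tokens[i + ctx_len])
        for i in range(len(tokens) - ctx_len)
    }
    print(f"context length {ctx_len}: "
          f"{len(contexts)} context nodes, {len(edges)} edges")
```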
In any case, 4 years on, I'm still pretty lukewarm on the current gen of graph neural network architectures.
Case in point: the OP paper is pretty much the classic ML paper-mill setup of "take some existing algorithm, add some stuff on top, spend a ton of time on hyperparameter search for your algo, and show it beats some 2-year-old baseline".
It is like the difference between concrete and abstract syntax. LLMs frequently generate code that won't compile, since they predict tokens not AST nodes. They are underconstrained for the task.
How to address this? You can train a single model to handle both representations, as the authors did, or you can enforce constraints while decoding.
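Here is a minimal sketch of the second option, constrained decoding: at each step, tokens that would break the grammar are masked out before sampling. The toy vocabulary, the balanced-expression grammar, and the random stand-in for LM logits are all assumptions for illustration; real systems apply the same masking to an actual model's logits using a parser or grammar automaton.

```python
import numpy as np

VOCAB = ["(", ")", "x", "+", "<eos>"]

def is_valid_prefix(tokens):
    """Check that tokens form a valid prefix of expressions like (x+x),
    using a tiny hand-rolled state machine: parentheses must balance and
    operators must sit between terms."""
    depth, expect_term = 0, True
    for t in tokens:
        if t == "(":
            if not expect_term:
                return False
            depth += 1
        elif t == "x":
            if not expect_term:
                return False
            expect_term = False
        elif t == "+":
            if expect_term:
                return False
            expect_term = True
        elif t == ")":
            if expect_term or depth == 0:
                return False
            depth -= 1
        elif t == "<eos>":
            return not expect_term and depth == 0
    return True

def constrained_decode(max_len=12, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(max_len):
        logits = rng.normal(size=len(VOCAB))   # stand-in for LM logits
        for i, tok in enumerate(VOCAB):        # mask grammar-invalid tokens
            if not is_valid_prefix(out + [tok]):
                logits[i] = -np.inf
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tok = rng.choice(VOCAB, p=probs)
        if tok == "<eos>":
            break
        out.append(tok)
    return "".join(out)

print(constrained_decode())
```

Because only grammar-valid continuations ever receive probability mass, the sampled output never contains a syntax error, though it can of course still be semantically wrong.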