Working paper coming soon!

Abstract:

In recent years, deep neural networks have revolutionized natural language processing and have since set the new state of the art on most language modeling benchmarks. Transformer architectures such as BERT or GPT-x have been particularly successful, generating flexible, context-aware representations of text inputs for downstream tasks. Much less is known, however, about how researchers can use these models to analyze existing text. This question matters because much of the information available to applied researchers is contained in written or spoken language. Making use of these models, with their ability to capture sophisticated linguistic relations, is thus eminently desirable; yet considerable uncertainty about how they operate remains, calling their use for inferential text analysis into question. We propose a novel method that combines transformer models with network analysis to form a self-referential representation of language use in a corpus of interest. This approach sidesteps many of the difficulties of interpreting the internal workings of the deep neural network. As a result, it yields linguistic relations that are strongly consistent with the underlying model, together with mathematically well-defined operations on them. In an analysis of a random sample of news publications from 1990 to 2018, we find that ties in our network track the semantics of discourse over time, while higher-order structures allow us to identify clusters of semantic and syntactic relations. This new approach offers several advantages over the direct use of contextual word embeddings and gives researchers a new tool to make sense of language use while reducing the number of discretionary choices of representation and distance measures. We discuss how this method can also complement and inform analyses of the behavior of deep learning models.
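
To give a concrete sense of how transformer outputs can be turned into a network, the sketch below shows one possible construction: each word in a sentence is masked in turn, a masked language model proposes likely substitutes, and the resulting substitution probabilities are aggregated into directed, weighted ties. This is a minimal illustration under stated assumptions, not the paper's actual procedure; the model name (bert-base-uncased), the top-k cutoff, and the toy sentences are hypothetical placeholders.

```python
# Hypothetical sketch: derive a word-to-word network from a masked language model.
# Model name, top_k, and corpus are illustrative placeholders, not the paper's setup.
import torch
import networkx as nx
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: any masked LM could stand in here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def substitution_ties(sentence: str, top_k: int = 5):
    """Mask each token in turn and record which words the model would put in its place.

    Returns (focal_word, substitute_word, probability) triples, which can be read
    as directed, weighted ties between words as they are used in context.
    """
    encoding = tokenizer(sentence, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    ties = []
    for position in range(1, len(input_ids) - 1):  # skip [CLS] and [SEP]
        focal_word = tokenizer.decode([input_ids[position]]).strip()
        masked = input_ids.clone()
        masked[position] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, position]
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_ids = probs.topk(top_k)
        for p, token_id in zip(top_probs.tolist(), top_ids.tolist()):
            substitute = tokenizer.decode([token_id]).strip()
            if substitute != focal_word:
                ties.append((focal_word, substitute, p))
    return ties

# Aggregate ties from a toy corpus into a directed, weighted network.
graph = nx.DiGraph()
for sent in ["The senator proposed a new bill.", "The committee rejected the bill."]:
    for focal, substitute, weight in substitution_ties(sent):
        prev = graph.get_edge_data(focal, substitute, default={"weight": 0.0})["weight"]
        graph.add_edge(focal, substitute, weight=prev + weight)
```

On a network built this way, standard graph operations (clustering, centrality, comparisons across time slices of a corpus) are mathematically well defined, which is the kind of property the abstract refers to when contrasting the approach with ad hoc choices of representation and distance measures.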