Reportage can be boiled down to an abstract representation of entities (person, place, organization, ideology; it's quite broad) and relationships (an action, a dependence; it's admittedly vague). As such, the purely linguistic generation of misinformation is a combinatorial exercise: an incorrect (and maybe unfalsifiable) representation of entities and the relationships between them. What makes one piece of misinformation more plausible than another is that its combination or sequence of entities & relationships sits closer, in some abstract (networked) space, to a pre-existing combination, i.e. a fact. We (Aabir & I) argue that such a piece of news feels plausible because something similar has happened before, and you wouldn't be surprised if it actually had.
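To make that "closeness in an abstract space" idea concrete, here's a toy sketch: score a candidate (subject, relation, object) triple by how similar it is to the nearest known triple in a shared embedding space. The vectors and triples below are made up purely for illustration; they're not from any real corpus or from our model.

```python
# Toy sketch: plausibility of a candidate triple as its similarity
# to the nearest known (factual) triple in an embedding space.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["senator", "proposes", "bill", "company", "acquires", "startup"]
emb = {w: rng.normal(size=8) for w in vocab}  # stand-in word vectors

def triple_vec(s, r, o):
    # Represent a triple as the concatenation of its part embeddings.
    return np.concatenate([emb[s], emb[r], emb[o]])

def plausibility(candidate, known_triples):
    # Max cosine similarity between the candidate and any known triple.
    c = triple_vec(*candidate)
    sims = []
    for t in known_triples:
        k = triple_vec(*t)
        sims.append(c @ k / (np.linalg.norm(c) * np.linalg.norm(k)))
    return max(sims)

known = [("senator", "proposes", "bill"), ("company", "acquires", "startup")]
print(plausibility(("senator", "acquires", "startup"), known))
```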
This brings us to a host of fascinating epistemic questions about mental models, our expectations of reality, and even how the scientific enterprise works. In fact, Prof. James Evans has work exploring how much of scientific discovery is the result of combinatorial tinkering. He and others frame it as the exploration-exploitation trade-off: local combinatorial tinkering exploits a set of known relationships, while exploration is an ostensibly global search over potential, as-yet-unknown relationships.
Being the wannabe ML nerds that we are, we tried to boil this down into a feasible ML problem. The entity-relationship structure we refer to is called a knowledge graph (KG) representation, and there has been a ton of work on graph neural network models that try to capture this structure for a given KG. But all of them center on a limited space of tuples, like subject-verb-object triples or entity links from the Wikipedia knowledge base. We wanted to extend this to a live knowledge corpus like news or all of Wikipedia, but quickly realized that was, well, too much: there are full teams at Stanford and elsewhere working on just the construction of the knowledge base. Semantic role extraction & labelling is still an unsolved task in NLU.
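For a sense of why even the "easy" part of that pipeline is hard, here's a rough sketch of subject-verb-object extraction using spaCy's dependency parse (assuming the en_core_web_sm model is installed). It's a crude heuristic, nowhere near real semantic role labelling, and not a description of anyone's production pipeline.

```python
# Rough SVO extraction from a dependency parse: for each verb, pair its
# subject children with its object children. Misses passives, clauses,
# coreference, and most of what makes SRL genuinely hard.
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(svo_triples("The senator proposed a controversial bill."))
```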
So, we settled on proving out our hypothesis with a relatively simple question: given a subject and an object, do pre-trained GloVe word embeddings do a better job of predicting the relationship, or does a learned knowledge graph embedding (KGE) model?
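The sketch below shows the shape of that comparison, with made-up vectors standing in for both the pre-trained GloVe embeddings and a learned, TransE-style KGE; the actual evaluation setup is more involved. In both cases, we predict the relation whose vector best explains the offset between object and subject.

```python
# Given (subject, object), predict the relation r minimizing ||subj + r - obj||,
# once with word vectors (GloVe stand-ins) and once with a learned KGE
# (TransE-style stand-ins). All vectors here are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
dim = 50
relations = ["acquires", "sues", "funds"]

# glove[w] would come from pre-trained GloVe files; kge_ent / kge_rel would be
# learned by minimizing ||s + r - o|| over known triples from a knowledge graph.
glove = {w: rng.normal(size=dim) for w in ["company", "startup"] + relations}
kge_ent = {w: rng.normal(size=dim) for w in ["company", "startup"]}
kge_rel = {r: rng.normal(size=dim) for r in relations}

def predict_relation(subj, obj, ent_emb, rel_emb):
    # TransE-style scoring: the best relation is the one closest to (obj - subj).
    diff = ent_emb[obj] - ent_emb[subj]
    return min(relations, key=lambda r: np.linalg.norm(rel_emb[r] - diff))

print("GloVe baseline:", predict_relation("company", "startup", glove, glove))
print("Learned KGE:   ", predict_relation("company", "startup", kge_ent, kge_rel))
```

The point of framing it this way is to ask whether distributional word vectors, never trained on triples at all, encode enough relational structure to compete with embeddings learned directly from a knowledge graph.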
Aabir and I went back and forth a lot on whether our evaluation criteria were fair. But for a quarter-long course, I think we at least proved something out. We plan to keep working on this, even if the epistemic questions will distract me more than the model itself.