tl;dr: Use codebooks, move away from auto-generated unsupervised topics.
This idea comes entirely from Benjamin Guinaudeau's great tweet:
What is the new German coalition agreement about? I trained a text-classifier, which classifies paragraphs following the #CAP coding scheme. pic.twitter.com/IeducxSAww
— Benjamin Guinaudeau (@Ben_Guinaudeau) November 24, 2021
TIL that the Comparative Agendas Project has a datasets and codebooks section, and it tickled my NLP & PoliSci brain.
It provides very granular breakdowns of political topics of interest. At the highest level, it is stuff like Defense, Health Issues, Crime, and Religion, but each of these has several sub-levels. One of their datasets maps a random sample of NYT article descriptions to these high-level labels.
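To make that concrete, here's roughly what the data looks like once loaded — a minimal sketch, assuming a CSV export from the CAP site. The file name, the `description`/`majortopic` column names, and the topic-code subset are my assumptions; check the codebook for the real mapping.

```python
import pandas as pd

# Assumed CSV export of the CAP NYT sample; file name and column names
# ("description", "majortopic") are guesses -- verify against the codebook.
df = pd.read_csv("cap_nyt_sample.csv")

# CAP major topics are numeric codes. An illustrative subset of the mapping:
major_topics = {3: "Health", 12: "Law and Crime", 16: "Defense"}
df["topic_name"] = df["majortopic"].map(major_topics)
print(df[["description", "majortopic", "topic_name"]].head())
```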
The way I think I could use this: fine-tune a pre-trained BERT-like model to map these descriptions to topics, then use that model to generate labels for paragraphs of text from Reddit or .win data. I want to see how communities shift from talking about crime under certain conditions, and whether I can use this conceptual classifier to understand distributions of topics and text in discussions. Instead of building an unsupervised model and then inferring topic labels, I'd have a much more robust classifier, and one with immediate purchase in the PoliSci community.
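A minimal sketch of the fine-tuning step I have in mind, using the Hugging Face `transformers` Trainer. The file name, column names, base model, and hyperparameters are all placeholder assumptions, not tested choices:

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed CSV with "description" (text) and "majortopic" (CAP code) columns.
df = pd.read_csv("cap_nyt_sample.csv")
labels = sorted(df["majortopic"].unique())
label2id = {code: i for i, code in enumerate(labels)}
id2label = {i: str(code) for code, i in label2id.items()}
df["label"] = df["majortopic"].map(label2id)

# Build a train/test split from the dataframe.
dataset = Dataset.from_pandas(df[["description", "label"]])
dataset = dataset.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["description"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    id2label=id2label,
    label2id={str(code): i for code, i in label2id.items()},
)

args = TrainingArguments(
    output_dir="cap-topic-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
trainer.save_model("cap-topic-classifier")  # so it can be reloaded for inference
```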
Off to the races now as I try to make a huggingface module using this.
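If that works out, applying the classifier to new text — say, a Reddit paragraph — should be a one-liner with the `pipeline` API. Again a sketch, with a made-up checkpoint path and example sentence:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint saved above (path is a placeholder) and
# label an arbitrary paragraph, e.g. a Reddit comment.
classifier = pipeline("text-classification", model="cap-topic-classifier")
print(classifier("The city council voted to expand the police budget after a rise in burglaries."))
# -> something like [{'label': '12', 'score': 0.87}]  (CAP code 12 = Law and Crime)
```

And since the Trainer supports `push_to_hub`, sharing the checkpoint as a proper Hugging Face model afterwards should be straightforward.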