Learning from Error

An old-school blog by Adarsh Mathew

Pretrained Comparative Agendas

Last Modified at — Nov 24, 2021
tl;dr: Use codebooks, move away from auto-generated unsupervised topics.

This idea is completely due to Benjamin Guinaudeau’s great tweet.

TIL that the Comparative Agendas Project has a dataset and codebook section and it tickled my NLP & PoliSci brain.

It provides very granular breakdowns of political topics of interest. At the highest level, it is stuff like Defense, Health Issues, Crime, Religion, but each of these has several sub-levels. One of their datasets maps a random sample of NYT article descriptions to these high-level labels.

The way I think I could use this: fine-tune a pre-trained BERT-like model to map these descriptions to topics, then use it to generate labels for paragraphs of text from Reddit or .win data. I want to see how communities shift toward talking about crime under certain conditions, and whether I can use this conceptual classifier to understand distributions of topics and text in discussions. Instead of building an unsupervised model and then inferring topic labels from it, this would give me a much more robust classifier, and it would have immediate purchase with the PoliSci community.
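A minimal sketch of what that fine-tuning setup could look like, assuming the `transformers` library and a CAP-style dataset with a description column and a major-topic code column (the column names, the model name, and the specific topic codes below are my illustrative assumptions, not anything fixed by the CAP data). One wrinkle worth showing: CAP major-topic codes are not contiguous integers, so they need remapping to `0..n-1` label ids before they can feed a classification head.

```python
def make_label_maps(major_codes):
    """Map non-contiguous CAP major-topic codes to contiguous 0..n-1 ids,
    as sequence-classification heads expect, and keep the inverse map."""
    codes = sorted(set(major_codes))
    code2id = {c: i for i, c in enumerate(codes)}
    id2code = {i: c for c, i in code2id.items()}
    return code2id, id2code


def build_trainer(texts, major_codes, model_name="distilbert-base-uncased"):
    """Assemble a Trainer for the description -> major-topic task.
    Imports are kept inside the function so the pure-Python label helper
    above works even without transformers/torch installed."""
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    code2id, _ = make_label_maps(major_codes)
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = tok(list(texts), truncation=True, padding=True, return_tensors="pt")
    labels = torch.tensor([code2id[c] for c in major_codes])

    class CapDataset(torch.utils.data.Dataset):
        def __len__(self):
            return len(labels)

        def __getitem__(self, i):
            item = {k: v[i] for k, v in enc.items()}
            item["labels"] = labels[i]
            return item

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(code2id))
    args = TrainingArguments(output_dir="cap-classifier",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    return Trainer(model=model, args=args, train_dataset=CapDataset())


# Toy check of the code remapping (codes here are an illustrative subset):
c2i, i2c = make_label_maps([16, 3, 12, 3])
print(c2i)  # {3: 0, 12: 1, 16: 2}
```

After `trainer.train()`, mapping predicted ids back through `id2code` recovers the original CAP codes, which keeps the output directly comparable to the project's own coding.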

Off to the races now as I try to make a huggingface module using this.
