Learning from Error

An old-school blog by Adarsh Mathew

Visualizing Reddit circa 2017

Last Modified at — Mar 6, 2021

While working with the Reddit data dumps for Ideolect, Jeremiah and I have often used the metaphor of the Reddit Universe: groups of subreddits form their own solar systems, with some long-range connections between them. The metaphor feels even more apt when we refer to Pokemon Reddit or Football Reddit. Extending it, I speculated that there could exist a framing where multiple small subreddits ‘revolve’ around a larger one that anchors and influences them.

Underpinning both these notions is the idea of subreddit ‘similarity’, either in topics or in user overlap. At my end, I’ve been referring to user overlap as structural similarity. It can be visualized as a graph of subreddits, where the weight of the edge between two subreddits is the similarity between them. Now, is this a directed or an undirected graph? Can I detect neighborhoods of similarity? That depends on the measure of similarity itself.

One straightforward way to measure similarity is the Jaccard coefficient of users shared between subreddits. It’s simple and intuitive, but it lacks directionality: an undirected link between a large and a small subreddit doesn’t tell you which one is the centre and which the periphery. An alternative measure that preserves directionality is the proportion of users contributed: if subreddit B has 100 users, of which 15 are shared with subreddit A, B’s ‘contribution’ (for lack of a better term) to A is $\frac{15}{100}$. Conversely, if subreddit A has 1000 users and the same 15 are shared, A’s contribution to B is $\frac{15}{1000}$. The order-of-magnitude difference between the two drives home the idea of a larger parent subreddit and a smaller one.
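The two measures can be sketched in a few lines of Python. This is only an illustration, assuming each subreddit is represented as a set of user ids; the function names `jaccard` and `contribution` and the toy user ids are made up here, not taken from the actual pipeline.

```python
def jaccard(users_a, users_b):
    """Symmetric similarity: shared users divided by all users in either subreddit."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)


def contribution(users_src, users_other):
    """Directed measure: fraction of users_src's users who also appear in users_other."""
    if not users_src:
        return 0.0
    return len(users_src & users_other) / len(users_src)


# Toy data matching the numbers in the text: A has 1000 users, B has 100,
# and exactly 15 users (ids 985..999) appear in both.
subreddit_a = set(range(0, 1000))
subreddit_b = set(range(985, 1085))

print("Jaccard:", jaccard(subreddit_a, subreddit_b))             # same in both directions
print("B -> A:", contribution(subreddit_b, subreddit_a))         # 15 / 100
print("A -> B:", contribution(subreddit_a, subreddit_b))         # 15 / 1000
```

The Jaccard value is identical whichever subreddit comes first, while the two directed values differ by a factor of ten, which is the asymmetry that separates centre from periphery.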

I tried visualizing these graphs (Jaccard similarity) as networks, but you end up with the dreaded hairball. Even after adding thresholds for similarity, the visual never really improved. This set me on the path of translating the graph into a neat 2-D visualization, ideally with node embeddings. I first tried UMAP, which is supposed to be a big improvement over t-SNE, and I liked it initially. But it had a major issue: there was no clear demarcation between clusters. You can see that in the visualization below: the big blue blob in the middle contains the majority of the subreddits I study. To be fair, the embeddings generated by UMAP could be good, and the blob problem might lie with my clustering algorithm, DBSCAN. There is some weak anecdotal evidence to support this: topically similar subreddits do cluster together.

In other iterations of this, I used Spectral Clustering, and that worked really well in generating (mostly) meaningful clusters. That should be expected, since the algorithm treats similarity as a graph and then solves a bisection (partitioning) problem on it, which closely matches how my data was generated. That said, the visualization was less than instructive, since plotting the first two dimensions of the spectral embedding did not result in neat boundaries between clusters. That bit needs some more work. I’ll either reduce the number of final dimensions to 2, or feed the spectral embedding into a dimension reduction algorithm to bring the number down to 2; the latter should better preserve the influence of the other dimensions. Until then, I have this UMAP-DBSCAN visual.
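The graph-bisection idea behind spectral methods can be sketched without any libraries. What follows is a toy under stated assumptions, not the setup used for this post: it builds the unnormalized graph Laplacian, approximates its Fiedler vector (the eigenvector of the second-smallest eigenvalue) by power iteration, and splits the nodes by sign. Library implementations of spectral clustering typically use several eigenvectors followed by k-means instead. The six-node graph and all function names are hypothetical.

```python
def fiedler_vector(weights, iterations=200):
    """Approximate the Fiedler vector of a weighted, undirected, connected graph.

    `weights` is a symmetric list-of-lists similarity matrix with a zero diagonal.
    """
    n = len(weights)
    degrees = [sum(row) for row in weights]
    # Twice the largest degree is an upper bound on the Laplacian's eigenvalues,
    # so (shift * I - L) has no negative eigenvalues.
    shift = 2.0 * max(degrees)

    vec = [float(i) for i in range(n)]  # any non-constant starting vector
    for _ in range(iterations):
        # Remove the component along the trivial all-ones eigenvector.
        mean = sum(vec) / n
        vec = [v - mean for v in vec]
        # Apply L = D - W, then (shift * I - L).
        laplacian_vec = [
            degrees[i] * vec[i] - sum(weights[i][j] * vec[j] for j in range(n))
            for i in range(n)
        ]
        vec = [shift * vec[i] - laplacian_vec[i] for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


def bisect(weights):
    """Split nodes into two groups by the sign of the Fiedler vector."""
    return [0 if v < 0 else 1 for v in fiedler_vector(weights)]


def toy_graph():
    """Two tightly knit triangles (0-1-2 and 3-4-5) joined by one weak edge."""
    n = 6
    w = [[0.0] * n for _ in range(n)]
    for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
        w[i][j] = w[j][i] = 1.0
    w[2][3] = w[3][2] = 0.1  # weak long-range connection
    return w


labels = bisect(toy_graph())
print(labels)
```

On this toy graph the cut falls on the weak edge, so each triangle ends up in its own group. Plotting only a couple of such eigenvector coordinates discards whatever the remaining ones encode, which is one plausible reason a two-dimensional spectral plot can look muddier than the clusters themselves.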

tl;dr: I represented subreddit similarity as an adjacency matrix of the user overlap between any two of the top ~5000 subreddits in 2017, and visualized it in 2-D. The dimensions in the plot were derived using UMAP, the ‘clusters’ using DBSCAN.

The pair of numbers shown when you hover over a point is:

  1. Subreddit size, or the number of unique commenters on that subreddit for the year.

  2. A scaling factor for the radius of the circle, given by: $\frac{\log_{10}(\text{size})}{100}$
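As a quick worked example of the scaling factor, here is a small sketch; the function name `marker_radius` and the sample sizes are made up for illustration.

```python
import math


def marker_radius(size):
    """Radius scaling factor for a subreddit with `size` unique commenters."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return math.log10(size) / 100


for size in (100, 10_000, 1_000_000):
    print(size, marker_radius(size))
```

Because of the logarithm, a subreddit with 1,000,000 commenters gets a scaling factor only three times that of one with 100, which keeps the largest subreddits from swamping the plot.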

[Interactive plot: UMAP embedding of the top ~5000 subreddits of 2017, with DBSCAN clusters]