Quantifying Affective Polarization on Social Media

A couple of years ago, I worked with Marilena Hohmann and Karel Devriendt on a method to estimate ideological polarization on social media: the tendency of people to have more extreme opinions and to avoid contact with people holding a different opinion. Studying ideological polarization is interesting, but it misses a crucial piece of the puzzle: what happens when differing opinions – which may or may not be trying to avoid each other – collide? Are people actually having a debate and an exchange of ideas, or are they escalating to name-calling and generally toxic behavior?

Answering that question requires a method to estimate affective polarization, rather than merely ideological polarization. Once Marilena and I were done working with the latter, we rolled up our sleeves to work on the former. The result is the paper “Estimating affective polarization on a social network”, which appeared a few days ago in PLoS One.

The objective appears simple: to quantify what people with differing opinions do when they interact. Unpacking this objective requires some care, though. One could think that this is a simple correlation test: if the more people disagree, the more they use toxic language, then affective polarization is high.

Such an approach, however, ignores that people might hate each other so much that they refuse to communicate altogether, or they are forcibly separated. An example is r/the_donald. For a time, it was one of the most active subreddits on Reddit, creating a strongly polarized environment. At some point, the Reddit admins decided to ban the subreddit altogether, which resulted in an exodus of users. In the data, one would see a decrease in affective polarization, because there was less toxicity. In reality, discourse had become so toxic it had to stop, which we argue is a sign of growing, not decreasing, affective polarization.

The two components of affective polarization: above, the higher the correlation between disagreement and toxicity, the higher affective polarization. Below, the more separated the camps are, the higher affective polarization is.

So we still need to track the network of interactions, just like we did for ideological polarization, because ideology and affect are intertwined. Marilena and I spent a lot of blood and tears trying to be smart about finding a solution, but in the end – as is often the case – the simple route was the best one. We decided to add the affective component to the ideological polarization measure we already had. The older measure captures the social separation, while the correlation between disagreement and toxicity captures the affective component.

Once we had such a measure, we ran a case study, analyzing the evolution of the social discourse about COVID-19 on former Twitter (RIP). We used data from February to July 2020, filtered using a set of keywords from the early pandemic debate. Initially, the results were a bit confusing: while we did find a modest rise in affective polarization levels, the overall trend was mostly a flat line.

The overall level of affective polarization (y axis) over time (x axis).

This ran a bit counter to our expectations, but analyzing the social separation and affective components separately told an insightful story (thanks, reviewer #1, for prodding us in this direction; we owe you big time 🙂 ).

Above: the affective component. Below: the social segregation component.

The clear pattern was that, in the first couple of weeks, there was low social segregation but a high affective component. After this initial shock, social segregation skyrocketed, plateauing by week 9, while the affective component went down.

This is consistent with a narrative of a new topic coming onto the scene. As the topic is new, no one knows where they stand exactly, so everyone tends to interact with everyone (low social segregation). However, feelings run high, both because of the emergency itself and – possibly – because of previous conflicts between the users, which led to renewed toxicity. As people get used to the new scenario and clear factions emerge and stabilize, social segregation suddenly kicks in, and the factions stop talking, which also reduces the chances of using toxic language against the opposing side.

I think this exemplifies beautifully why the measure is useful. If we didn’t have a network measure, we would conclude that affective polarization was low after the first few weeks of the pandemic, because there was no correlation between disagreement and toxicity. Instead, affective polarization was still growing, and we failed to see the correlation because polarization was so high people weren’t even talking to each other any more.

There’s more work to do, of course, because we only tested a tiny scenario. Marilena and I are working on the final piece of our polarization trilogy, where all these great tools we built are finally put to use. Stay tuned!

Pearson Correlations for Networks

We all know that correlation doesn’t imply causation.

And yet, we calculate correlations all the time, because knowing when two things correlate is still pretty darn useful, even if there is no causal link at all. For instance, it’d be great to know whether reading makes you love reading more. Part of the answer could start by correlating the number of books you read with the number of books you want to read.

The very important questions the Pearson correlation coefficient allows you to ask: will consuming cheese bring upon you the doom of dying by suffocating in your bedsheets? source: https://www.tylervigen.com/spurious-correlations

As a network scientist, you might think that you could calculate correlations of variables attached to the nodes of your network. Unfortunately, you cannot do this, because normal correlation measures assume that nodes do not influence each other — the measures are basically assuming the network doesn’t exist. Well, you couldn’t, until I decided to make a correlation coefficient that works on networks. I describe it in the paper “Pearson Correlations on Complex Networks,” which appeared in the Journal of Complex Networks earlier this week.

The formula you normally use to calculate the correlation between two variables is the Pearson correlation coefficient. What I realized is that this formula is the special case of a more general formula that can be applied to networks.

In Pearson, you compare two vectors, which are just two sequences of numbers. One could be the number of books each person in our sample has read, and the other their ages. In this example, you might expect that older people have had more time to read more books. To test this, you check each entry in the two vectors in order: every time you consider a new person, if their age is higher than the average person’s, then the number of books they have read should also be higher.
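As a refresher, the classical coefficient is just the covariance of the two centered vectors divided by the product of their norms. A minimal sketch, with made-up numbers:

```python
# Made-up sample: ages and number of books read for five people.
ages = [25.0, 34.0, 41.0, 58.0, 63.0]
books = [12.0, 20.0, 25.0, 41.0, 48.0]

def pearson(x, y):
    """Classical Pearson correlation: center both vectors, then divide
    their dot product by the product of their norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    cov = sum(a * b for a, b in zip(xc, yc))
    return cov / (sum(a * a for a in xc) * sum(b * b for b in yc)) ** 0.5

print(round(pearson(ages, books), 3))
```

On this made-up sample the coefficient comes out close to +1, since the data is almost perfectly linear: older people have read more books.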

If you are in a network, each entry of these vectors is the value of a node. In our book-reading case, you might have a social network: for each person you know who their friends are. Now you shouldn’t look at each person in isolation, because the numbers of books and the ages of people also correlate in different parts of the network — this is known as homophily. Some older people might be pressured into reading more books by their book-addicted older friends. Thus, leaving out the network might cause us to miss something: that a person’s age tells us not just about the number of books they have read, but it also allows us to predict the number of books their friends have read.

This is the type of networks you are forced to work with when you use the Pearson correlation. That’s just silly, isn’t it?

To put it simply, the classical Pearson correlation coefficient assumes that there is a very special network behind the data: a network in which each node is isolated and only connects to itself — see the image above. When we slightly modify the math behind its formula, it can take into account how close two nodes are in the network — for instance, by calculating their shortest path length.
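The exact estimator is in the paper, but to give a flavor of the generalization, here is a from-scratch sketch of the idea: swap the implicit identity matrix of the classical formula for a proximity matrix W whose entries are the inverse shortest path lengths between nodes (with 1 on the diagonal). Both the matrix choice and the example graph here are my illustrative picks, not the paper’s exact construction. With no edges at all, W is the identity and you recover plain Pearson.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph,
    given as an adjacency dict {node: [neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def network_pearson(adj, x, y):
    """Sketch of a network-aware correlation: the classical formula's
    implicit identity matrix is replaced by a proximity matrix W with
    W[i][j] = 1 / shortest_path(i, j) and W[i][i] = 1. This illustrates
    the idea, not the exact estimator from the paper."""
    nodes = sorted(adj)
    n = len(nodes)
    idx = {u: i for i, u in enumerate(nodes)}
    W = [[0.0] * n for _ in range(n)]
    for u in nodes:
        for v, d in bfs_distances(adj, u).items():
            W[idx[u]][idx[v]] = 1.0 if d == 0 else 1.0 / d
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    quad = lambda a, b: sum(a[i] * W[i][j] * b[j]
                            for i in range(n) for j in range(n))
    return quad(xc, yc) / (quad(xc, xc) * quad(yc, yc)) ** 0.5

# A 4-node path where both variables grow along the path: the proximity-
# weighted coefficient is a perfect +1.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(network_pearson(adj, [1, 2, 3, 4], [1, 2, 3, 4]))
```

One caveat of this naive W: on some graphs it is not positive semidefinite, so the quadratic forms under the square root can misbehave; a real implementation needs a proximity matrix that keeps them positive.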

You can interpret the results from this network correlation coefficient the same way you do with the Pearson one. The maximum value of +1 means that there is a perfect positive relationship: for every extra year of age, you read a certain number of additional books. The minimum of -1 means that there is a perfect negative relationship: a weird world where the oldest people have not read much. The midpoint of 0 means that the two variables have no relation at all.

Is the network correlation coefficient useful? Two answers. First: how dare you, asking me if the stuff I do has any practical application. The nerve of some people. Second: Yes! To begin with, in the paper I build a bunch of artificial cases in which I show how the Pearson coefficient would miss correlations that are actually present in a network. But you’re not here for synthetic data: you’re a data science connoisseur, you want the real deal, actual real world data. Above you can see a line chart, showing the vanilla Pearson (in blue) and the network-flavored (in red) correlations for a social network of book readers as they evolve over time.

The data comes from Anobii, a social network for bibliophiles. The plot is a correlation between number of books read and number of books in the wishlist of a user. These two variables are positively correlated: the more you have read, the more you want to read. However, the Pearson correlation coefficient greatly underestimates the true correlation, at 0.25, while the network correlation is beyond 0.6. This is because bookworms like each other and connect with each other, thus the number of books you have read also correlates with the wishlist size of your friends.

This other beauty of a plot, instead, shows the correlation between the age of a user and the number of tags they used to tag books. What is interesting here is that, for Pearson, there practically isn’t a correlation: the coefficient is almost zero and not statistically significant. Instead, when we consider the network, there is a significant negative correlation at around -0.11. Older users are less inclined to tag the books they read – it’s just a fad kids do these days – and they are even less inclined if their older friends do not tag either. If you were to hypothesize a link between age and tag activity and all you had was lousy Pearson, you’d miss this relationship. Luckily, you know good ol’ Michele.

If this makes you want to mess around with network correlations, you can do it because all the code I wrote for the paper is open and free to use. Don’t forget to like and subscrib… I mean, cite my paper if you are fed up with the Pearson correlation coefficient and you find it useful to estimate network correlations properly.