20 September 2019 ~ 0 Comments

Who will Cluster the Cluster Makers?

If you follow this blog, you know that I periodically talk about community discovery. The problem seems so deceptively simple: finding groups of nodes densely connected in a network. If it is so simple, why have I been talking about it since 2012? The reason is that it isn’t so simple, and people have tried to organize the literature explaining how the thousands of different algorithms work, how they perform, what definition of “community” they use. After all that work, I’m still left with one question. Which two algorithms return the same gosh darned communities in a network? That’s what we’re going to discover today.

What I want to do is to take as many community discovery algorithms as I can, test them on a set of networks, and compare their results to estimate how similar they are. This gives me a similarity matrix of algorithms, which I can transform into a network by keeping only similarities that are statistically significant. Once I have the network, I can discover groups of algorithms returning a coherent set of results, because they’re all significantly related to each other. How do I find these groups? Well, by ehm… er… performing… community… discovery? on the… network of… community discovery… algorithms… This is exactly the outline of my paper Discovering Communities of Community Discovery, which I presented at ASONAM last month.

“As many community discovery algorithms as I can” turned out to be 73, all implemented in different languages, taking input in different ways, and providing different output formats. I ran them on more than 1,500 networks (real world ones from ICON and synthetic benchmarks). It was a… difficult month for me. I compared their partitions by estimating their mutual information: given that I know the results of algorithm A, how much can I infer about the results of algorithm B? For each network where two algorithms result in a lot of mutual information, I increase their similarity count by one. Once I have all similarity counts, I can extract the backbone of this matrix, controlling for the fact that some algorithms tend to be more peculiar than others, while others tend to be more mainstream.
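The counting step can be sketched in a few lines of Python. This is only an illustration, not the code used for the paper: it assumes disjoint partitions stored as node-to-label dictionaries, it uses one common normalization of mutual information, and the `threshold` for "a lot of mutual information" is a made-up parameter. The function names are hypothetical, and the backbone extraction that follows the counting is not shown.

```python
from collections import Counter
from itertools import combinations
from math import log


def normalized_mutual_information(part_a, part_b):
    """NMI between two partitions given as {node: label} dicts over the same nodes."""
    nodes = list(part_a)
    n = len(nodes)
    count_a = Counter(part_a[v] for v in nodes)
    count_b = Counter(part_b[v] for v in nodes)
    count_ab = Counter((part_a[v], part_b[v]) for v in nodes)
    # Mutual information: how much knowing A's labels tells us about B's labels.
    mi = sum((c / n) * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in count_ab.items())
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one partition is a single community.
        return 1.0 if h_a == h_b else 0.0
    return mi / ((h_a + h_b) / 2)


def similarity_counts(results, threshold=0.5):
    """results: {network: {algorithm: partition}}. Threshold is an assumption."""
    counts = Counter()
    for partitions in results.values():
        for a, b in combinations(sorted(partitions), 2):
            if normalized_mutual_information(partitions[a], partitions[b]) >= threshold:
                counts[(a, b)] += 1  # one more network where a and b agree
    return counts
```

Overlapping algorithms need a variant of mutual information that accepts nodes with several labels, which this sketch does not cover.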

And this is the result:

I love this network, because it has well defined groups and they all make sense. There’s the group of modularity maximization algorithms (in green), there’s the ones based on percolation / random walks (in blue), and the ones using neighbor similarity (in purple) as the guiding principle.

Then there’s a lump of algorithms that allow communities to share nodes (in red). The only thing these algorithms have in common is that they allow communities to share nodes, which is not a good enough common characteristic. The ways they find communities are as diverse as the ones you find in the rest of the network. But that’s the beauty of my approach: I can select a subset of nodes (say, all overlapping community discovery algorithms) and re-apply the test of statistical significance with a more stringent threshold. This allows me to zoom in and see if there are meaningful structures inside the community. Lo and behold:

Here you can see meaningful groups of overlapping algorithms. There’s the ones achieving overlap by clustering edges instead of nodes (in blue), and the ones applying the percolation / random walk strategy (in green), but allowing for node sharing.

Why is this work significant? First, because it proves that there really are different — and valid — definitions of what communities are in complex networks. If there weren’t, this network would be more homogeneous, without distinct groups.

Second, as I mentioned, some of the networks I tested the algorithms on are standard benchmarks: LFR networks. These benchmarks grow a network with a planted community structure: the real latent structure the algorithm is supposed to find. Yet, this “ground truth” is well embedded in one of the clusters: the percolation/random walk community (in blue). LFR benchmarks follow that specific definition and not others. If you are developing a new community discovery algorithm which has a different community definition, you should not use the LFR benchmark to test it. Moreover, if you are developing a percolation/random walk algorithm and you’re correctly testing it on an LFR benchmark, you cannot test it against algorithms that are not part of the blue community. Otherwise the test would be unfair, because those algorithms are looking for something else: of course they’ll perform poorly on LFR benchmarks!

You can get the full list of algorithms that I tested, with proper references, from the official page of the project. From there, you can also download the network and use it for your purposes. This is necessarily an eternal work in progress: there are more than 73 community discovery algorithms out there. But I am but a man[citation needed] and I cannot spend my entire time scouting for implementations on the web. I’ve got to put bread (or, preferably, pasta) on the table as well. Thus, if you think I really should have included your algorithm in this structure, you can mail me and send me a working implementation of it, and I’ll gladly run it on my benchmarks.

What’s next? I’d be delighted to inaugurate the field of meta-research. So join me, as I develop new projects such as:

  • Predicting links of link predictors: which link prediction algorithms will become more similar in the future?
  • Spreading epidemics of epidemic spreading: which researchers will cave in to peer pressure and publish a paper studying the diffusion of some phenomenon?
  • Modeling the growth of growth models: how has the Barabasi and Albert model evolved over time? Which features were added? What about Watts & Strogatz’s model?

(In case you were wondering, I’m joking. In network science, sometimes that might be hard to tell.)

(Or am I?)

(It’s settled: the best among these papers will receive the hereby instituted Escher prize, awarded by Douglas Hofstadter himself)


20 March 2014 ~ 0 Comments

When Dimensions Collide

The literature about community discovery, which deals with the problem of finding related groups of nodes in a network, is vast, interesting and full of potential practical applications. However, if I had to give one critique of it, it would be about its self-referential character. Most community discovery papers I read in computer science and physics journals are mainly about finding communities. Not much time is spent thinking about what to do with them, or what they mean. My first post in this blog was about a community discovery algorithm. Recently, an extended version of that paper has been accepted in a computer science journal. Since that first post, I (mainly) added some crucial modifications and features to the algorithm. I don’t want to talk about those here: they are boring. I also didn’t bring up this paper to boast about it. Okay, maybe a little. I did it because the paper touches upon the issue I am talking about here: it tries to do something with communities, it tries to explain something about them. Namely, it asks: why do communities overlap?

First of all: communities do overlap. When trying to detect them, many researchers realized that hard partitions, where each node can belong to one and only one community, are not always a good idea. Most of them found this a problem. Others were actually very happy: the problem gets harder! Nice! (Researchers are weird). Blinded by their enthusiasm, they started developing algorithms to deal with this overlap. Not many asked the question I am trying to answer here: why do communities overlap? As a result, some of these algorithms detect this overlap, but using approaches that do not really mean anything in real life: they are just a mathematical trick. Others, instead, build the algorithm around a core hypothesis.

This hypothesis is nothing unheard of. Communities overlap because people have complex lives. Some of your college mates also attend your yoga class. And you know your significant other’s colleagues, which puts you in their community. All these communities have you as a common member, and probably some more people too. The beauty of this is that it is not only intuitive: it works well in finding communities in real world social networks. So well that it is the assumption of my approach and of many other outstanding algorithms (this and this are the first two that pop into my mind, but there are probably many more). Another beautiful thing about it is that it is almost obvious, and so it is probably true. But here we hit a wall.

The fact that it is simple, reasonable and it works well in practice proves nothing about its property of being true. There are things that are neither simple nor reasonable, but nevertheless true (hello quantum physics!). And there is practical knowledge that does not quite correspond to how things work (in my opinion, most computer science is a patch and nobody really knows why it works). Unless we test it, we cannot say that this nice practical principle actually corresponds to something happening in reality. So how do we go on and prove it? In the paper I proposed a first step.

This brings me back to another old love of mine. Multidimensional networks. They are networks in which we put multiple relations in a cage together in mating season and see what happens (research is fun). The idea behind the paper is that multidimensional networks give us the perfect tool to test the hypothesis. In monodimensional networks you have no clue why two people are connecting besides the obvious “they know each other”. In a multidimensional network, you know why they know each other: it’s information embedded in the type of the relation. So, the hypothesis is that different types of relations are the cause of the community overlap, and with multidimensional networks we can look at how communities distribute over relations. First, let us take a look at what two overlapping communities look like in a multidimensional network.

We collected a multidimensional social network putting together relationships between users in Facebook, Twitter and Foursquare. We then used DEMON to extract overlapping communities from each dimension. We then took two communities with extensive overlap in the Facebook dimension (picture below).

We then looked at the very same set of nodes, but now in the Foursquare network. In the picture below, we kept the edges, and the node positioning, of the Facebook network to make the comparison easier, but keep in mind that the edges in the Foursquare dimension are different, and they are the ones that decide to what community the nodes belong.

Very interesting. The communities look a lot alike, although the shared (and non-shared) nodes are slightly different. Now node 7369 is shared (it wasn’t in Facebook) while node 8062 isn’t (whereas it was before). Let’s put another nail in the coffin and see the communities these nodes belong to in Twitter (same disclaimer applies):

Surprise surprise, in Twitter there is actually only one community, which brings together the majority of the nodes of the two communities. So here’s where our overlap comes from: common affiliations in different dimensions! Now, I’m going to deal with that voice in your head that is screaming “Anecdotal! Anecdotal!”. (You don’t hear it? Did I already mention that researchers are weird? In any case “Anecdotal” refers to a type of evidence that bears no value in scientifically proving a point if not backed by more solid proofs). Put in a more general way: the more two communities overlap in some dimension, the more likely it is we can find a dimension in which these communities are actually a single community. This involves boring details you can find in the paper which ultimately generate this plot:

Does this plot prove our theory beyond any reasonable doubt? Maybe, or not really. There are still things to check. But science is made of tiny steps forward. And this is certainly one.


16 January 2014 ~ 2 Comments

The Eternal Struggle between Partitions and Overlapping Coverages

New year, old topic. I could make a lot of resolutions for this new year, but to stop talking about community discovery is certainly not among them. At least this time I tried to turn it up a notch in the epicness of the title. My aim is to give some substance to one of the many typical filler phrases in science writing. The culprit sentence in this case is “different application scenarios demand different approaches”. Bear with me for a metaphoric example.

When presenting a new toaster, it is difficult to prove that it toasts everything better from any point of view, under any circumstances. It usually does most toasts okay, and for one kind of toast it really shines. Or its toasts really suck, but it can toast underwater. That’s fine. We are all grown up here, we don’t believe in the fairy tales of the silver bullets any more. At this point, our toaster salesman is forced to say it. “Different application scenarios demand different approaches”. In some cases this is a shameful fig leaf, but in many others it is simply true. Problem is: nobody really checks.


I decided to check. At least one of them. Teaming up with Diego Pennacchioli and Dino Pedreschi, I put the spotlight on one of the strongest dichotomies in community discovery. As you may remember, community discovery algorithms can force every node to belong to just one community, or allow them to be in many of them. The former approach is called “graph partitioning”, whilst the latter aims to find an “overlapping coverage”. Do these two strategies yield interesting, yet completely different, results? This question is dissected in the paper “Overlap Versus Partition: Marketing Classification and Customer Profiling in Complex Networks of Products”, which will be presented at one of the workshops of the 2014 edition of the International Conference on Data Engineering. Let me refresh your memory about overlaps and partitions.

Above you have the nec plus ultra scenario for a partitioning algorithm. If a partitioning algorithm sees the graph on the left, it would just die of happiness. In the graph, in fact, it appears very clearly that each node belongs to a very specific community. And it can’t belong to any other. If we assume that our algorithm works on edge strength (e.g. the inverse of the edge betweenness), then what the algorithm really sees is the graph on the right. It then proceeds to group together the nodes for which the edge strength is maximal, et voilà.

Here we have an example that’s a bit more complex. The picture has too many overlapping parts, so let me describe the connection pattern. In the graph on the left there are several groups of 6 nodes, each node connected to all other members of the group. In practice, each diagonal is completely connected to the two neighbouring diagonals. Clearly, here there is no way we can put each node in a disjoint group. Why put together nodes 0,1,2 with 3,4,5 and not with 9,10,11? But at that point, why should 9,10,11 be in a community with them and not with 6,7,8? The correct approach is just to allow every completely connected group to be a community, thus letting nodes be part of more than one community. Some overlapping algorithms see the graph as it has been depicted on the right, with an edge colour per densely connected group.

Time to test which one of these approaches is The Right One! For our data quest we focused on supermarket transactions. We created a network of products that you can buy in supermarkets. To be connected, two products have to be bought together by the same customers a significant number of times. What does that mean? By pure intuition, bread and water aren’t going to be connected: both of them are bought very frequently, but they have little to do with each other, thus they are expected to be in the same shopping cart by chance. Eggs and flour are very popular too, but they probably end up together more often than chance would predict, since there are a lot of things you can do with them together. Therefore they are connected. Other specific pairs of products, say bacon flavoured lipstick and liquorice shoelaces, may end up in the same, quite weird, shopping cart. But we don’t connect them, as their volume of sales is too low (or at least I hope so).
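Here is a rough sketch of how such a product network could be built. It is a simplification under loud assumptions: the post does not say which significance test was used, so this version stands in a plain "lift" ratio (observed co-purchases divided by the number expected by chance) plus a minimum number of co-purchases. The function name and both thresholds are invented for the example.

```python
from collections import Counter
from itertools import combinations


def product_network(carts, min_support=2, min_lift=1.5):
    """Return {(product_a, product_b): co_purchases} for pairs bought together
    more often than chance. min_support and min_lift are illustrative values."""
    n = len(carts)
    item_count = Counter()
    pair_count = Counter()
    for cart in carts:
        items = sorted(set(cart))          # ignore duplicates within a cart
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    edges = {}
    for (a, b), together in pair_count.items():
        # Co-purchases expected if a and b were bought independently.
        expected = item_count[a] * item_count[b] / n
        if together >= min_support and together / expected >= min_lift:
            edges[(a, b)] = together
    return edges


carts = [
    ["bread", "water", "eggs", "flour"],
    ["bread", "water"],
    ["bread", "eggs", "flour"],
    ["water", "eggs", "flour"],
    ["bread", "water"],
    ["bread"],
    ["water"],
    ["lipstick", "shoelaces"],
]
print(product_network(carts))  # only eggs and flour end up connected
```

In this toy data bread and water share three carts, but that is about what chance predicts for two products this popular, so no edge is created. The lipstick and shoelaces pair is dropped because a single cart is below the minimum support.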

Here are some of the facts we found. First. The overlapping approach* tends to return relatively more communities with a larger number of nodes than the partition approach**. In absolute terms that’s obvious, since the same node is counted more than once, but here the key term is “relatively”. See the plot above on the right, where we graph the probability (y axis) of finding a community with a given number of nodes (x axis). Second. The overlapping approach returns more “messy” communities. Our messiness measure checks how many different product categories are grouped together on average in the same community. Again, larger communities are expected to be messier, but the messiness measure that we used controls for community size. See the plot on the right, again the probability (y axis) of finding a community with a given entropy (x axis, “entropy” is the fancy scientific term for “messiness”). Third. The partition approach returned denser communities, whose link strength (the number of people buying the products together) is higher.
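For the curious, this is roughly what an entropy-based messiness score looks like in Python. Take it as a sketch: the Shannon entropy part is standard, but dividing by the logarithm of the community size is only one possible way to control for size, assumed here for illustration. The post does not say how the paper does it, and the function names are made up.

```python
from collections import Counter
from math import log2


def category_entropy(community, category_of):
    """Shannon entropy (in bits) of the product categories inside a community."""
    counts = Counter(category_of[product] for product in community)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def normalized_entropy(community, category_of):
    """Entropy divided by its maximum for this community size (an assumption)."""
    if len(community) < 2:
        return 0.0
    return category_entropy(community, category_of) / log2(len(community))


category_of = {"eggs": "fresh", "milk": "fresh", "flour": "pantry", "sugar": "pantry"}
tidy = ["eggs", "milk"]                     # one category only
messy = ["eggs", "milk", "flour", "sugar"]  # two categories, evenly split
```

A community made of a single category scores zero, and the score grows as more categories are mixed in equal proportions.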

What is the meaning of all this? In our opinion, the two algorithms are aiming to do something completely different. The partition approach is aiming to create a new marketing classification. It more or less coincides with the established one (thus lower messiness), most customers buy those products together (high link strength) and there are very few giant categories (most communities are small). The overlapping approach, instead, wants to do customer profiling. A customer rarely buys all products of a marketing category (thus increasing messiness), she has specific needs (that not many people have, thus lowering edge weight) and she usually needs a bunch of stuff (thus larger communities, on average).

Who’s right? That’s the catch: both. The fact that two results are incompatible, in this case, does not mean that one is right and one is wrong. They are just different applications. Which was exactly what I wanted to prove, in this narrow and very specific, probably unsurprising, scenario. Now you should feel better: I gave you a small proof that the hours you spend choosing the perfect toaster for you are really worth your time!


* As overlapping approach, we used the Hierarchical Link Clustering.

** As partitioning approach, we used Infomap.

 


14 November 2013 ~ 2 Comments

What is a “Community”?

The four of you who follow this blog regularly will know that I have a thing for something called “community discovery”. That’s because no matter what you call it, it always sounds damn cool. “Discovering Communities” or “Detecting the functional modules” or “Uncovering node clusters”. These are all names given to the task of finding groups of nodes in a network that are very similar to each other. And they make you feel like some kind of wizard. Adding to that, there are countless applications in epidemiology, sociology, immunology, marketing.

Far from being original, I share this passion with at least a thousand researchers. Being as smart as they are, they quickly realized that there are many ways in which you can group nodes based on their similarity. On the one hand, this is good news, as we basically have an algorithm for any possible community you want to find in your network. On the other hand, this made a lot of people freak out, as too many algorithms and solutions that are too different are usually a big red flag in computer science. A flag that says: “You have no idea what you are doing!” (although a computer scientist would put it in the cold and rational “Your problem is not formally defined”: it means the same).

Yes, my signature “Community Discovery Picture” strikes again!

I personally think that the plus side is more predominant than the minus side, and you can get rid of the latter with a bit of work. Work that I have done with Dino Pedreschi and Fosca Giannotti in our paper “A classification for Community Discovery Methods in Complex Networks”. The trick is very simple. It just consists in noticing what’s wrong with the starting point. “Finding groups of nodes in a network that are very similar to each other”. Exactly what is “similar”? It is an umbrella term that can be interpreted in many different ways. After all, we already do this outside of network science. People can be very similar because they look alike. Or because they like the same things. So why can’t we just have different definitions of communities, based on how we understand similarity?

Well, because at the beginning of community discovery we thought that the problem was well defined. The first definition of community was something like: “A community is a group of nodes that are densely connected, and they have few edges connecting them to nodes outside the community”. Which is fine. In some cases. In others, we discovered that it doesn’t really make sense. For example, we discovered that many social networks have a pervasive overlap. It means that nodes are densely connected with many different groups, disproving the definition: now, the area outside the community could be just as dense as the community itself! And this is just one example: you take a hundred community discovery algorithms in the literature and you’ll get a hundred different community results on the same network.
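That classic definition is easy to turn into numbers. The sketch below counts the edges inside a candidate community and the edges leaving it, using an undirected edge list. The function names are illustrative and it is not taken from any specific algorithm.

```python
def internal_external_edges(edges, community):
    """Count edges with both endpoints inside the community, and edges crossing it."""
    community = set(community)
    internal = sum(1 for u, v in edges if u in community and v in community)
    external = sum(1 for u, v in edges if (u in community) != (v in community))
    return internal, external


def internal_density(edges, community):
    """Fraction of possible internal edges that actually exist."""
    k = len(set(community))
    possible = k * (k - 1) / 2
    if possible == 0:
        return 0.0
    internal, _ = internal_external_edges(edges, community)
    return internal / possible


# Two triangles joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(internal_external_edges(edges, {0, 1, 2}))  # (3, 1)
```

With pervasive overlap, the external count stops being small: a node that sits in several dense groups contributes many edges that leave any single one of them.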

Overlap in the infamous Zachary Karate Club network, you can even win a prize if you mention it!

So now researchers in the community discovery… well… community were divided into three factions. We had those who thought that the problem was ill defined, thus everything done so far was just a royal mess. Then there were those who still thought that the problem was well defined, because their definition of community was the only one standing on solid ground and everybody else was just running around like a headless chicken. And then there were people like me and Sune Lehmann (whom I thank for the useful discussions). Our point was that there were many different definitions of communities, and the incompatible results are just the output of incompatible definitions of community.

This is the main take-away message of the paper. We then moved on and tried to actually spot and categorize all different community definitions (for 90s kids: think of a Pokédex for algorithms). Some choices were easy, some others weren’t. I personally think that more than an established classification, this is just a conversation starter. Also because the boundaries between community definitions are at least as fuzzy as the boundaries between the communities themselves. Algorithms in one category may also satisfy conditions imposed by another category. And to me that’s fine: I don’t really like to put things in separate boxes, I just want to have an insight about them.

I put tags, not classes.

So here you go, the classification we made includes the following “community types” (names are slightly changed from the paper, but it should be obvious which is which):

  • Common Features: in this definition, each node has a number of attributes. If we are in a social network and the nodes are people, these attributes may well be the social connections, the movies you like, the songs you listen to. Communities are groups of nodes with similar attributes.
  • Internal Density: the classical starting point of community discovery. Here we are interested in just maximizing the number of edges inside the communities.
  • External Sparsity: a subtle variant of the Internal Density class. The focus of this definition is on considering communities as islands of nodes, not necessarily densely connected.
  • Action Communities: this is a very dynamic definition of communities. Nodes are not just static entities, but they perform actions. Again, in a social network you not only like a particular artist: you listen to her songs. If your listening happens with the same, or similar, dynamics of other people, then you might as well form a community with them.
  • Proximal Nodes: here we want the edges inside the communities to make it easy for a node to be connected to all other nodes in the community. Or: to get to any other node in the community I have to follow just a few edges.
  • Fixed Structure: this is a very demanding community definition. It says that the algorithm knows what a community looks like and it just has to find that structure in the network.
  • Link Communities: one of my favorites, because it revolutionizes the idea of community. Here we think that we need to group the edges, not the nodes. In a social network, we know different people for different reasons: family, work, free time, … The reason why you know somebody is the community. And you belong to many of them: to all the communities your edges belong to.
  • Others: in any decent classification there must be a miscellaneous category! Some algorithms do not really follow a particular definition, either because they just add features to other community discovery algorithms or because they let the user define their communities and then try to find them.
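To see why the Link Communities idea gives overlap for free, here is a tiny sketch. It assumes the hard part (grouping the edges) has already been done by some algorithm and only shows the last step: a node inherits the communities of all its edges. The function name and the toy labels are invented for the example.

```python
from collections import defaultdict


def node_memberships(edge_community):
    """Map each node to the set of communities its edges belong to.

    edge_community: {(u, v): community_label}, one label per edge.
    """
    memberships = defaultdict(set)
    for (u, v), label in edge_community.items():
        memberships[u].add(label)
        memberships[v].add(label)
    return dict(memberships)


edge_community = {
    ("you", "mom"): "family",
    ("mom", "dad"): "family",
    ("you", "boss"): "work",
    ("you", "teammate"): "free time",
}
memberships = node_memberships(edge_community)
print(memberships["you"])  # the three communities you belong to
```

Each edge sits in exactly one community, yet the node "you" ends up in three of them.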

And now just a shortlist of readily available community discovery algorithms you can find on the Web:

That’s it! I hope I created a couple of new community discovery aficionados!
