Archive | Community Discovery

16 January 2014 ~ 2 Comments

The Eternal Struggle between Partitions and Overlapping Coverages

New year, old topic. I could make a lot of resolutions for this new year, but for sure to stop talking about community discovery is not among them. At least this time I tried to turn it up a notch in the epicness of the title. My aim is to give some substance to one of the many typical filler phrases in science writing. The culprit sentence in this case is “different application scenarios demand different approaches”. Bear with me for a metaphoric example.

When presenting a new toaster, it is difficult to prove that it toasts everything better under any point of view, under any circumstances. It usually does most toasts okay, and for one kind of toasts it really shines. Or its toasts really suck, but it can toast underwater. That’s fine. We are all grown up here, we don’t believe in the fairy tales of the silver bullets any more. At this point, our toaster salesman is forced to say it. “Different application scenarios demand different approaches”. In some cases this is a shameful fig leaf, but in many others it is simply true. Problem is: nobody really checks.

domo_toaster

I decided to check. At least one of them. Teaming up with Diego Pennacchioli and Dino Pedreschi, I put the spotlight on one of the strongest dichotomies in community discovery.  As you may remember, community discovery algorithms can force every node to belong to just one community, or allow them to be in many of them. The former approach is called “graph partitioning”, whilst the latter aims to find an “overlapping coverage”. Are these two strategies yielding interesting, yet completely different, results? This question has been dissected in the paper: “Overlap Versus Partition: Marketing Classification and Customer Profiling in Complex Networks of Products“, that will be presented in one workshop of the 2014 edition of the International Conference of Data Engineering.  Let me refresh your mind about overlaps and partitions.

Above you have the nec plus ultra scenario for a partitioning algorithm. If a partitioning algorithm sees the graph on the left, it would just die of happiness. In the graph, in fact, it appears very clearly that each node belongs to a very specific community. And it can’t belong to any other. If we assume that our algorithm works on edge strength (e.g. the inverse of the edge betweenness), then what the algorithm really sees is the graph on the right. It then proceeds to group together the nodes for which the edge strength is maximal, et voilà.

Here we have an example that’s a bit more complex. The picture has too many overlapping parts, so let me describe the connection pattern. In the graph on the left there are several groups of 6 nodes, each node connected to all other members of the group. In practice, each diagonal is completely connected to the two neighbouring diagonals. Clearly, here there is no way we can put each node in a disjoint group. Why put together nodes 0,1,2 with 3,4,5 and not with 9,10,11? But at that point, why 9,10,11 should be in a community with them and not with 6,7,8? The correct approach is just to allow every completely connected group to be a community, thus letting nodes to be part of more than a community. Some overlapping algorithms see the graph as it has been depicted on the right, with an edge colour per densely connected group.

Time to test which one of these approaches is The Right One! For our data quest we focused on supermarket transactions. We created a network of products that you can buy in supermarkets. To be connected, two products have to be bought together by the same customers in a significant number of times. What does that mean? By pure intuition, bread and water aren’t going to be connected: both of them are bought very frequently, but they have little to do with each other, thus they are expected to be in the same shopping cart by chance. Eggs and flour are too very popular, but probably more than chance, since there are a lot of things you can do with them together. Therefore they are connected. Other specific pairs of products, say bacon flavoured lipstick and liquorice shoelaces, may ended up in the same, quite weird, shopping cart. But we don’t connect them, as their volume of sales is too low (or at least I hope so).

Here are some of the facts we found. First. The overlapping approach* tends to return relatively more communities with a larger amount of nodes than the partition approach**. In absolute terms that’s obvious, since the same node is counted more than once, but here the key term is “relatively”. See the plot above on the right, where we graph the probability (y axis) of finding a community with a given number of nodes (x axis). Second. The overlapping approach returns more “messy” communities. Our messiness measure checks how many different product categories are grouped together on average in the same community. Again, larger communities are expected to be messier, but the messiness measure that we used controls for community size. See the plot on the right, again the probability (y axis) of finding a community with a given entropy (x axis, “entropy” is the fancy scientific term for “messiness”). Third. The partition approach returned denser communities, whose link strength (the number of people buying the products together) is higher.

What is the meaning of all this? In our opinion, the two algorithms are aiming to do something completely different. The partition approach is aiming to create a new marketing classification. It more or less coincides with the established one (thus lower messiness), most customers buy those products together (high link strength) and there are very few giant categories (most communities are small). The overlapping approach, instead, wants to do customer profiling. A customer rarely buys all products of a marketing category (thus increasing its messiness), it has specific needs (that not many people have, thus lowering edge weight) and she usually needs a bunch of stuff (thus larger communities, on average).

Who’s right? That’s the catch: both. The fact that two results are incompatible, in this case, does not mean that one is right and one is wrong. They are just different applications. Which was exactly what I wanted to prove, in this narrow and very specific, probably unsurprising, scenario. Now you should feel better: I gave you a small proof that the hours you spend to choose the perfect toaster for you are really worth your time!


* As overlapping approach, we used the Hierarchical Link Clustering.

** As partitioning approach, we used Infomap.

 

Continue Reading

14 November 2013 ~ 2 Comments

What is a “Community”?

The four of you who follow this blog regularly will know that I have a thing for something called “community discovery“. That’s because no matter how you call it, it always sounds damn cool. “Discovering Communities” or “Detecting the functional modules” or “Uncovering node clusters”. These are all names given to the task of finding groups of nodes in a network that are very similar to each other. And they make you feel like some kind of wizard. Adding to that, there are countless applications in epidemiology, sociology, immunology, marketing.

Far from being original, I share this passion with at least a thousand researchers. Being as smart as they are, they quickly realized that there are many ways in which you can group nodes based on their similarity. On the one hand, this is good news, as we basically have an algorithm for any possible community you want to find in your network. On the other hand, this made a lot of people freak out, as too many algorithms and too different solutions are usually a big red flag in computer science. A flag that says: “You have no idea what you are doing!” (although a computer scientist would put it in the cold and rational “Your problem is not formally defined”: it means the same).

Yes, my signature "Community Discovery Picture" strikes again!

Yes, my signature “Community Discovery Picture” strikes again!

I personally think that the plus side is more predominant than the minus side, and you can get rid of the latter with a bit of work. Work that I have done with Dino Pedreschi and Fosca Giannotti in our paper “A classification for Community Discovery Methods in Complex Networks“. The trick is very simple. It just consists in noticing what’s wrong with the starting point. “Finding groups of nodes in a network that are very similar to each other”. Exactly what is “similar“? It is an umbrella term that can be interpreted in many different ways. After all, we already do this outside of network science. People can be very similar because they look alike. Or because they like the same things. So why can’t we just have different definitions of communities, based on how we intend similarity?

Well, because at the beginning of community discovery we thought that the problem was well defined. The first definition of community was something like: “A community is a group of nodes that are densely connected, and they have few edges connecting them to nodes outside the community”. Which is fine. In some cases. In others, we discovered that it doesn’t really make sense. For example, we discovered that many social networks have a pervasive overlap. It means that nodes are densely connected with many different groups, disproving the definition: now, the area outside the community could be just as dense as the community itself! And this is just one example: you take a hundred community discovery algorithms in literature and you’ll get a hundred different community results on the same network.

Overlap in the infamous Zachary Karate Club network.

Overlap in the infamous Zachary Karate Club network, you can even win a prize if you mention it!

So now researchers in the community discovery… well… community were divided in three factions. We had those who thought that the problem was ill defined, thus everything done so far was just a royal mess. Then there were those who still thought that the problem was well defined, because their definition of community was the only one standing on solid ground and everybody else was just running around like a headless chicken. And then there were people like me and Sune Lehmann (whom I thank for the useful discussions). Our point was that there were many different definitions of communities, and the incompatible results are just the output of incompatible definitions of community.

This is the main take-away message of the paper. We then moved on and tried to actually spot and categorize all different community definitions (for 90s kids: think of a Pokédex for algorithms). Some choices were easy, some others weren’t. I personally think that more than an established classification, this is just a conversation starter. Also because the boundaries between community definitions are at least as fuzzy as the boundaries between the communities themselves. Algorithms in one category may also satisfy conditions imposed by another category. And to me that’s fine: I don’t really like to put things in separate boxes, I just want to have an insight about them.

I put tags, not classes.

I put tags, not classes.

So here you go, the classification we made includes the following “community types” (names are slightly changed from the paper, but it should be obvious which is which):

  • Common Features: in this definition, each node has a number of attributes. If we are in a social network and the nodes are people, these attributes may well be the social connections, the movies you like, the songs you listen to. Communities are groups of nodes with similar attributes.
  • Internal Density: the classical starting point of community discovery. Here we are interested in just maximizing the number of edges inside the communities.
  • External Sparsity: a subtle variant of the Internal Density class. The focus of this definition is on considering communities as islands of nodes, not necessarily densely connected.
  • Action Communities: this is a very dynamic definition of communities. Nodes are not just static entities, but they perform actions. Again, in a social network you not only like a particular artist: you listen to her songs. If your listening happens with the same, or similar, dynamics of other people, then you might as well form a community with them.
  • Proximal Nodes: here we want the edges inside the communities to make it easy for a node to be connected to all other nodes in the community. Or: to get to any other node in the community I have to follow just a few edges.
  • Fixed Structure: this is a very demanding community definition. It says that the algorithm knows what a community looks like and it just has to find that structure in the network.
  • Link Communities: one of my favorites, because it revolutionizes the idea of community. Here we think that we need to group the edges, not the nodes. In a social network, we know different people for different reasons: family, work, free time, … The reason why you know somebody is the community. And you belong to many of them: to all the communities your edges belong to.
  • Others: in any decent classification there must be a miscellaneous category! Some algorithms do not really follow a particular definition, whether because they just add features to other community discovery algorithms or because they let the user define their communities and then try to find them.

And now just a shortlist of readily available community discovery algorithms you can find on the Web:

That’s it! I hope I created a couple of new community discovery aficionados!

Continue Reading

04 January 2013 ~ 2 Comments

Data-Driven Borders

What defines the human division of territory? Think about it: cities are placed in particular areas for a number of good reasons: communication routes, natural resources, migration flows. But once cities are located in a given spot, who decides where one city ends and another begins? Likewise, who decides on the borders of a region or a nation and how? This decision, more often than not, is quite random.

Sometimes administrative borders are defined by natural barriers like mountains and rivers. This makes practical sense, although it is not always clear why the border should be that particular mountain or that particular river. In fact, the main criterion is usually historical: it’s because some dynasty of dudes conquered that area and then got lazy and didn’t go on (this may be the official version: unofficially, maybe, it’s because they found somebody who kicked their asses all day long, just like the complicated relationship of the Romans with the Parthians).

Of course, the borders of states or regions are sometimes re-arranged to better fit practical administrative purposes. In any case, these are nothing else than sub-optimal adjustments of a far-from-optimal process. Network analysis can be useful in this context, because it can provide an objective way to divide the territory according to a particular theory (and it can provide pretty pictures too).

The theory here is very simple: two territories are related if a lot of people travel regularly from one to the other. If people constantly travel back and forth between two territories, then it probably makes sense to combine these territories into one administrative unit. So, how do we determine which territories should be merged, and which shouldn’t be? This problem is easily solvable in network theory, because it contains a network in its very basic definition: two areas are strongly connected if many people travel from one to the other. What we aim for is a grouping of territories. This looks really familiar to the eyes of some readers of this website: grouping nodes in a network. Yes! Community discovery!

I am not claiming to be the first one to see the problem this way. There is a number of people who already worked on it: the two most important that I can think of are Brockmann et al. and Ratti et al. However, I am reporting this because I also have a paper on the topic. And, of course, I think it’s better than the alternatives, for a number of reasons that I won’t report because it’s boring for non nerd people. But then again, I am a narcissist, so I can’t resist giving you the short list:

  • The previous works are based on not so perfect data: Brockmann et al. work with the banknotes trajectories recorded by the “Where’s George?” website (an awesome idea, take a look at it), while Ratti et al. use cellphone mobility data. Both are not exact representations about how people move and contain critical error terms. In our work, we use GPS trajectories with very high frequency and precision: we are studying the real thing.
  • The previous works use outdated methods for community discovery which cannot detect small communities: we use a more up-to-date method that is considered the state-of-the-art of community discovery. For example, in Brockmann et al. the entire west part of the United States is apparently one single area, grouping California and Montana and creating a region of 60-something million people.
  • We actually create a framework that establishes the correct methodology to approach the problem in general, instead of just studying one particular case.

But enough blabbering! I promised pretty pictures and I’ll give you pretty pictures. The general shared methodology is the following (in the pictures, the example of  mobility in Tuscany, Italy):

1) We divide the territory in cells (either a regular grid or very fine grained census cells);

2) We connect the nodes according to how many cars went from one cell to the other;

3) We forget about geography and we obtain a complex network (here, the node layout has nothing to do with their location on the map);

4) We apply community discovery, grouping set of nodes (territories) that are visited by the same people;

5) We put the nodes back in their geographical positions, obtaining the borders we were yearning for.

Funnily enough, Italy is undergoing a re-organization process of its regions and provinces. The results in Tuscany are very similar to the insights of our work (not perfectly similar, as the current process is just a merge of the existing provinces and not a real re-design):

On the left the new provinces (colors) on top of the old ones (lines), on the right our clusters (click for a larger resolution).

The match suggests that our data-driven borders follow the general intuition about what the borders should look like. However, they are not just a blind merge of the existing provinces, such as the one made by the policy-makers, making them more connected with reality. Hurrah for network analysis!

Continue Reading

17 August 2012 ~ 2 Comments

Democratic Community Discovery

When thinking about our social life, we instinctively recall the reason why we know the people we know. There is my sister, there’s the friend with whom I shared all my experiences – from stealing snacks at the primary school to the bachelor thesis we wrote together – and finally there’s that obscure guy who I don’t really know but he added me on Facebook and I don’t want to reduce my friend counter. A long time ago a very nice app called Nexus allowed Facebook users to visualize this concept. Nexus then died, it was replaced by Social Graph, which died too. Now you can use Touch Graph, but it is not nearly as good and, besides this, it’s a different story from the one I want to tell.

Nexus’ visualizations looked more or less like this:

The grey dots are my friends on Facebook and they are connected if they are friends on Facebook too. It’s clear that people you know for the same reason are densely connected to each other, because they are very likely to know each other too. The structure is pretty spectacular, I have to admit. It also seems to make a lot of sense and these different dense areas (“communities”) do not look too hard to be extracted automatically by an algorithm.

It’s on the shoulders of this gigantic positivist naïvety that countless authors decided to write such algorithm. Soon enough, a new branch of complex network analysis was born, called Community Discovery. The aim of  community discovery is to group nodes that are densely connected to each other. The groups are then called “communities” and nodes are said to be “members” of a community. Community discovery has a long, long history made by initial successes, coups de théâtre, drama and romance, but it is a complicated story and not in the scope of this post. Basically, many authors realized that they were dealing with a larger mess than previously thought. The main reason is that most, if not all, real world networks do not look like the above example. At. All. They look like this:

An ugly and meaningless hairball. Most people scratched their heads and decided to stick with their definition of community. This is the wrong way because it implies that we just give up in analyzing the second case and we consider the two examples as completely different phenomena. Guess what: they are not. The second network, too, is a sample of the Facebook friendship network and it contains the first. The sole difference between the two is the scale: the first only considers the immediate neighborhood of a node (me), the second just samples 15,000 users and the connections between them.

It’s on the basis of these considerations that I wrote a paper on community discovery, together with my co-authors at the KDDLab in Italy. We developed an algorithm that exploits the order at the local level around a node to find a sense of the connective mess at the global scale. It is called DEMON: Democratic Estimate of the Modular Organization of a Network.

The details of the implementation are included in the paper accepted at the SIGKDD 2012 conference on Data Mining. It was recently presented there by my co-author Giulio Rossetti who also happens to have written its implementation, freely available (and Open Source!). In this post, I’ll just give you a quick idea about how DEMON works.

Let’s consider a simple, messy, network:

It looks more like the messy hairball, doesn’t it? Let’s now select a node:

And then all the other nodes that are directly connected to it:

Now, let us ignore all the other nodes of the network, included the first selected. We just create a network only using the green nodes and the connections between them. What does this network look like? Like this:

Surprise! We’ve fallen back into the first, neatly divided, example. Now what we need to do is to just apply a very simple, old-school, community discovery algorithm to this sample and we have an idea about the communities surrounding the yellow node. We apply this operation to all the nodes of the network and then we merge together the communities that share an extensive portion of their members. I won’t bother you with the details and the proofs about how awesome this method is and how it outperforms the current state-of-the-art community discovery algorithms because everything is in the paper and I don’t like to brag (twice).

Continue Reading