20 March 2014 ~ 0 Comments

When Dimensions Collide

The literature about community discovery, which deals with the problem of finding related groups of nodes in a network, is vast, interesting and full of potential practical applications. However, if I would have to give one critique of it, it would be about its self referential character. Most community discovery papers I read in computer science and physics journals are mainly about finding communities. Not much time is spent thinking about what to do with them, or what they mean. My first post in this blog was about a community discovery algorithm. Recently, an extended version of that paper has been accepted in a computer science journal. Since that first post, I (mainly) added some crucial modifications and features to the algorithm. I don’t want to talk about those here: they are boring. I also didn’t bring up this paper to boast about it. Okay, maybe a little. I did it because the paper touches upon the issue I am talking about here: it tries to do something with communities, it tries to explain something about them. Namely, it asks: why do communities overlap?

First of all: communities do overlap. When trying to detect them, many researchers realized that hard partitions, where each node can belong to one and only one community, are not always a good idea. Most of them found this a problem. Others were actually very happy: the problem gets harder! Nice! (Researchers are weird). Blinded by their enthusiasm, they started developing algorithms to deal with this overlap. Not many asked the question I am trying to answer here: why do communities overlap? As a result, some of these algorithms detect this overlap, but using approaches that do not really mean anything in real life, it’s just a mathematical trick. Others, instead, build the algorithm around a core hypothesis.

This hypothesis is nothing unheard of. Communities overlap because people have complex lives. Some of your college mates also attend your yoga class. And you know your significant other’s colleagues, which puts you in their community. All these communities have you as common member, and probably some more people too. The beauty of this is that it is not only intuitive: it works well in finding communities in real world social networks. So well that it is the assumption of my approach and of many others outstanding algorithms (this and this are the first two that pop into my mind, but there are probably many more). Another beautiful thing about it is that it is almost obvious, and so it is probably true. But here we hit a wall.

The fact that it is simple, reasonable and it works well in practice proves nothing about its property of being true. There are things that are not simple nor reasonable, but nevertheless true (hello quantum physics!). And there is practical knowledge that does not quite correspond to how things work (in my opinion, most computer science is a patch and nobody really knows why it works). Unless we test it, we cannot say that this nice practical principle actually corresponds to something happening in reality. So how do we go on and prove it? In the paper I proposed a first step.

This brings me back to another old love of mine. Multidimensional networks. They are networks in which we put multiple relations in a cage together in mating season and see what happens (research is fun). The idea behind the paper is that multidimensional networks give us the perfect tool to test the hypothesis. In monodimensional networks you have no clue why two people are connecting besides the obvious “they know each other”. In a multidimensional network, you know why they know each other, it’s information embedded in the type of the relation. So, the hypothesis is that different types of relations are the cause of the community overlap, and with multidimensional networks we can look at how communities distribute over relations. First, let us take a look at what two overlapping communities look like in a multidimensional network.

We collected a multidimensional social network putting together relationships between users in Facebook, Twitter and Foursquare. We then used DEMON to extract overlapping communities from each dimension. We then took two communities with extensive overlap in the Facebook dimension (picture below).

We then looked at the very same set of nodes, but now in the Foursquare network. In the picture below, we kept the edges, and the node positioning, of the Facebook network to make the comparison easier, but keep in mind that the edges in the Foursquare dimension are different, and they are the ones that decide to what community the nodes belong.

Very interesting. The communities look a lot alike, although the shared (and non shared) nodes are slightly different. Now node 7369 is shared (it wasn’t in Facebook) while node 8062 isn’t (whereas it was before). Let’s put another nail on the coffin and see the communities these nodes belong in Twitter (same disclaimer applies):

Surprise surprise, in Twitter there is actually only one community, which brings together the majority of the nodes of the two communities. So here’s where our overlap comes from: common affiliations in different dimensions! Now, I’m going to deal with that voice in your head that is screaming “Anecdotal! Anecdotal!”. (You don’t hear it? Did I already mention that researchers are weird? In any case “Anecdotal” refers to a type of evidence that bears no value in scientifically proving a point if not backed by more solid proofs). Put in a more general way: the more two communities overlap in some dimension, the more likely it is we can find a dimension in which these communities are actually a single community. This involves boring details you can find in the paper which ultimately generate this plot:

Does this plot prove our theory without leaving out any reasonable doubt? Maybe, or not really. There are still things to check. But science is made by tiny steps forward. And this is certainly one.

Continue Reading

16 January 2014 ~ 2 Comments

The Eternal Struggle between Partitions and Overlapping Coverages

New year, old topic. I could make a lot of resolutions for this new year, but for sure to stop talking about community discovery is not among them. At least this time I tried to turn it up a notch in the epicness of the title. My aim is to give some substance to one of the many typical filler phrases in science writing. The culprit sentence in this case is “different application scenarios demand different approaches”. Bear with me for a metaphoric example.

When presenting a new toaster, it is difficult to prove that it toasts everything better under any point of view, under any circumstances. It usually does most toasts okay, and for one kind of toasts it really shines. Or its toasts really suck, but it can toast underwater. That’s fine. We are all grown up here, we don’t believe in the fairy tales of the silver bullets any more. At this point, our toaster salesman is forced to say it. “Different application scenarios demand different approaches”. In some cases this is a shameful fig leaf, but in many others it is simply true. Problem is: nobody really checks.

domo_toaster

I decided to check. At least one of them. Teaming up with Diego Pennacchioli and Dino Pedreschi, I put the spotlight on one of the strongest dichotomies in community discovery.  As you may remember, community discovery algorithms can force every node to belong to just one community, or allow them to be in many of them. The former approach is called “graph partitioning”, whilst the latter aims to find an “overlapping coverage”. Are these two strategies yielding interesting, yet completely different, results? This question has been dissected in the paper: “Overlap Versus Partition: Marketing Classification and Customer Profiling in Complex Networks of Products“, that will be presented in one workshop of the 2014 edition of the International Conference of Data Engineering.  Let me refresh your mind about overlaps and partitions.

Above you have the nec plus ultra scenario for a partitioning algorithm. If a partitioning algorithm sees the graph on the left, it would just die of happiness. In the graph, in fact, it appears very clearly that each node belongs to a very specific community. And it can’t belong to any other. If we assume that our algorithm works on edge strength (e.g. the inverse of the edge betweenness), then what the algorithm really sees is the graph on the right. It then proceeds to group together the nodes for which the edge strength is maximal, et voilà.