11 December 2018 ~ 0 Comments

How to Sample Networks Using Social Media APIs

If you were a social network analyst in the 1920s, 90% of your work would be to go around bugging people so that they could tell you whom they were friends with — we were, and still are, no fun at parties. Today, instead, we live in the land of plenty: 400 million new Instagram stories per day, 330 million monthly active users on Twitter, a gazillion Facebook profiles! What is there not to love? Well, to start, the fact that you do not have a download button, for very good and real reasons. That means that now 90% of your work is trying to figure out how to interface with these online media to sample the best possible representation of their social networks. So today I’m talking about how to sample a network via a social media API.

Let’s define our terms here. “Sampling a network” means to extract a part of it whose characteristics are as similar as possible to the entire structure. “API” is short for “Application Programming Interface.” It is the program in the server which interacts with the one you wrote to collect data. If you want to know the connections of user X, you ask the API and it will tell you. Most of the time. After a bit of nagging.

A good sample would look like the original network. A good sample like they wanted :’)

There are many approaches to sample networks, and many people have studied them to understand which one works best. But none of these studies actually made an experiment simulating their relationship with the actual API systems they have to work on. The comparisons made so far assume you can know all the connections of a user in one go, and that you can move to the other users as soon as you’re done exploring the current one. Sadly, the real world doesn’t remotely work that way. Thus we need to know how different API systems will influence different sampling methods. With Luca Rossi I wrote a paper about that, “Benchmarking API Costs of Network Sampling Strategies“, which I’ll present this month at the International Conference on Big Data.

An API system will put itself in the way of your noble sampling quest in three ways: (i) by returning only a maximum of n connections per request (i.e. by paginating the results), (ii) by making you wait a certain amount of time between requests, and (iii) by taking some time to respond (i.e. by having latency). The reason why considering the API hurdles is important is that they have a non-trivial relationship with your sampling method.

To illustrate my point consider two API systems. The first system, A1, gives you 100 connections per request, but imposes you to wait two seconds between requests. The second system, A2, gives you only 10 connections per request, but allows you a request per second. A2 is a better system to get all users with fewer than 10 connections — because you are done with only one request and you get one user per second –, and A1 is a better system in all other cases — because you make far fewer requests, for instance only one for a node with 50 connections, while in A2 you’d need five requests.

It seems trivial that A1 is a better system than A2, because it gives you 50 connections per second instead of 10 (we’re ignoring latency here). However, that’s not necessarily the case. Real world networks are far from equal: there are a few superstars with millions of followers, while your average user only has a handful (but I’m catching up with you, Katy!). This means that there are way way way way way way way way more users with 10 or fewer connections than there are with more than 10. In the case represented by the figure, sampling the full network via A2 will actually take half as much time as via A1, even if theoretically we thought we were going to be five times slower.

How many users (y-axis) have this many connections (x-axis). The blue area is where A2 works best — one user per second — while the purple area is where A1 works best. But there are 492.5k users in the blue area (the Michele Coscias), and only 7.5k in the purple (the Katy Perrys)!

With Luca, I created a benchmarking system — which you can freely use — that allows you to simulate network sampling by letting you define:

So now we get to the point of the post where I tell you which sampling method is the very best and you should always use it and it will solve world peace and stuff. And that method is…

…none of them. Unfortunately we realized that, in the world of network sampling, there is no free lunch. The space of possible characteristics of interest, API systems, networks on which you work, and budget constraints is so vast that each sampling method is the best — or worst — in some combinations of these factors. We ran a bazillion tests, but none of them summarizes the results better than these two plots.

On the left you see how badly we get the degree distribution wrong (y-axis, lower is better) at different budget levels (x-axis, from left to right we increase the amount of time we spend sampling the network). If we don’t have much time, the best methods are a variant of Random Walks (MHRW) or Snowball sampling, while the worst method is DFS. But surprise surprise, if we have tons of time, DFS is the best method, and MHRW and Snowball are the worst. By a long margin. No free lunch. On the right we have another instance of the same problem: here we want to know how well we identify central nodes in the network (y-axis, higher is better). The difference at increasing budget levels is ridiculous: the rankings you get when you have a lot of time to sample are practically the opposite of the one you get when you’re in a hurry!

This means that you really need to be careful when you extract networks from social media. You cannot barge in and grab whatever you can, however you can. You need to know which characteristics of the network are important to you. You need to know what the underlying network might look like. You need to know how much time you have to sample the network, compared to its size. You need to know how their APIs work. Otherwise you’re going to run in circles in a very very mad world. And you thought that they had it worse in the 1920s.

Continue Reading

28 May 2018 ~ 0 Comments

Mapping the International Aid Community

A few years ago (2013, God I’m old), I was talking to you on how to give a “2.0” flavor to international aid: at CID we created an Aid Explorer to better coordinate the provision of humanitarian work. I’m happy to report that “Aid Explorer” has been adopted — directly or indirectly —  by multiple international organizations, for instance USAID and the European Union. The World Bank’s Independent Evaluation Group contacted me to make an updated version, focused on estimating the World Bank’s position in the global health arena. The result is a paper, “Mapping the international health aid community using web data“, recently published in EPJ Data Science, and the product of a great collaboration with Katsumasa Hamaguchi, Maria Elena Pinglo, and Antonio Giuffrida.

The idea is to collect all the webpages of a hundred international aid organizations, looking for specific keywords and for hyperlinks to the other organizations — differently from the old Aid Explorer in which we relied on the index from Google. The aim is to create different networks of co-occurrences connecting:

  • Aid organizations co-mentioned in the same page;
  • Aid organizations mentioned or linked by another;
  • Issues co-mentioned in the same page;
  • Countries co-mentioned in the same page.

We then analyze these structures to learn something about the community as a whole.

One thing I didn’t expect was that organizations cluster by type. The “type” here is the force behind the organization — private philanthropy, UN system, bilateral (a single country’s aid extension of the foreign ministry), multilateral (international co-operations acting globally), etc. In the picture above (click on the image to enlarge), we encode the agency type in the node color. Organizations are overwhelmingly co-mentioned with organizations of the same type, which is curious because bilaterals often have nothing in common with each other besides the fact they are bilaterals: they work on different issues, with different developed and developing partners.

We can make a similar map connecting issues if they are co-mentioned in a web page. The map is useful as validation because it connects some “synonyms”, for instance “TB” and “Tubercolosis”. However, you can do much more with it. For instance, you can use color to show where an organization is most often cited. Below (click on the image to enlarge) you see the issue map for the World Bank, with the red nodes showing the issues strongly co-mentioned with the World Bank. Basically, the node color is the edge weight in a organization-issue bipartite network, where the organization is the World Bank. To give you an idea, the tiny “Infant Survival” node on the right saw the World Bank mentioned in 9% of the pages in which it was discussed. The World Bank was mentioned in 3.8% of web pages discussing AIDS.

This can lead to interesting discussions. While the World Bank does indeed discuss a lot about some of the red issues above — for instance about “Health Market” and “Health Reform” — its doesn’t say much about “Infant Survival”, relatively speaking at least. It’s intriguing that other organizations mention this particular issue so often in conjunction with the World Bank.

This difference between the global speech about issues and the one specific to another organization allows us to calculate two measures we call “Alignment” and “Impact”. By analyzing how similar the issue co-occurrence network of an organization is with the global one — a simple correlation of edge weights — we can estimate how “Aligned” it is with the global community. On the other hand, an “Impactful” organization is one that, were it to disappear, would dramatically change the global issue network: issues would not be co-mentioned that much.

In the plot above, we have Alignment and Impact on the x and y axis, respectively. The horizontal and vertical lines cutting through the plot above show the median of each measure. The top-right quadrant are organization both impactful and aligned: the organizations that have probably been setting the discourse of the international aid community. No wonder the World Health Organization is there. On the top left we have interesting mavericks, the ones which are not aligned to the community at large, and yet have an impact on it. They are trying to shape the international aid community into something different than what it is now.

A final fun — if a bit loose — analysis regards the potential for an organization to spread a message through the international aid network. What would be the reach of a message if it originated from a specific organization? We can use the Susceptible-Infected model from epidemiology. A message is a “virus” and it is more likely to infect an agency if more than a x% of the agency’s incoming links come from other infected agencies.

This depends on the issue, as shown above. In the figures we see the fraction of “infected” agencies (on the y-axis) given an original “patient zero” organization which starts spreading the message. To the left we see the result of the simulation aggregating all issues. The World Bank reaches saturation faster than UNICEF, and USAID is only heard by a tiny fraction of the network. However, if we only consider web pages talking about “Nurses” (right), then USAID is on par with the top international aid organizations — and UNICEF beats the World Bank. This happens because the discourse on the topic is relatively speaking more concentrated in USAID than average.

As with the Aid Explorer, this is a small step forward improving the provision of international aid. We do not have an interactive website this time, but you can download both the data and the code to create your own maps. Ideally, what we did only for international aid keywords can be extended for all other topics of interest in the humanitarian community: economic development, justice, or disaster relief.

Continue Reading

27 February 2018 ~ 0 Comments

Network Hierarchies and the Michelangelo Principle

The Michelangelo Principle — almost certainly apocryphal — states that:

If you want to transform a marble block into David, just chip away the stone that doesn’t look like David.

This seems to be a great idea to solve literally every problem in the world: if you want to fix evil, just remove from the world everything that is evil. But I don’t want to solve every problem in the world. I want to publish papers on network analysis. And so I apply it to network analysis. In this case, I use it to answer the question whether a directed network has a hierarchical structure or not. The result has been published in the paper “Using arborescences to estimate hierarchicalness in directed complex networks“, which appeared last month in PLoS One.

To determine whether a network is hierarchical is useful in a number of applications. Maybe you want to know how resilient your organization is to the removal of key elements in the team. Or you want to prove that a system you mapped does indeed have a head, instead of being a messy blob. Or you want to know whether it is wise to attempt a takedown on the structure. In all these scenarios, you really desire a number between 0 and 1 that tells you how hierarchical the network is. This is the aim of the paper.

Following the Michelangelo Principle, I decide to take the directed network and chip away from it everything that does not look like a perfect hierarchy. Whatever I’m left with is, by definition, a perfect hierarchy. If I’m left with a significant portion of the original network, then it was a pretty darn good hierarchy to begin with. If I’m left with nothing, well, then it wasn’t a hierarchy at all. Easy. Let’s translate this into an algorithm. To do so, we need to answer a couple of questions:

  1. What’s a “perfect hierarchy”?
  2. How do we do the chipping?
  3. How do we quantify the amount we chipped?

The first question is the one where we have most wiggle room. People might agree or disagree with the definition of perfect hierarchy that I came up with. Which is: a perfect hierarchy is a system where everybody answers to a single boss, except the boss of them all, who answers to no one. I like this definition because it solves a few problems.

Consider the picture above. In the leftmost example, if we assume nodes 1 and 2 give contradictory orders, node 5 doesn’t really know what to do, and the idea of a hierarchy breaks down. In the example in the middle, we don’t even know who’s the boss of them all: is it node 0 or node 3? The rightmost example leaves us no doubt about who’s boss, and there’s no tension. For those curious, network scientists call that particular topology an “arborescence“, and that’s the reason this exotic word is in the paper title. Since this is a well defined concept, we know exactly what to remove from an arbitrary directed network to make it into an arborescence.

Time to chip! Arbitrary directed networks contain strongly connected components: they have paths that can lead you back to your origin if you follow the edge directions. An arborescence is a directed acyclic graph, meaning that it cannot have such components. So our first step is to collapse them (highlighted in yellow above) into a single node. Think of strongly connected components as headless teams where all the collaborators are at the same level. They are a node in the hierarchy. We don’t care how a node organizes itself internally. As long as it answers to a boss and gives direction to its underlings, it can do it in whichever way it wants.

Second chipping step: in an arborescence, all nodes have at most one incoming connection, and only one node has zero. So we need to remove all offending remaining edges (highlighted in orange above). Once we complete both steps, we end up with an arborescence, and we’re done. (There are edge cases in which you’ll end up with multiple weakly connected components. That’s fine. If you follow the procedure, each of these components is an arborescence. Technically speaking, this is an “arborescence forest”, and it is an acceptable output)

We can now answer the final question: quantifying how much we chipped. I decide to focus on the number of edges removed. Above, you see that the original graph (left) had twenty edges, and that (right) nine edges survived. So the “hierarchicalness” of the original graph is 9 / 20 = .45.

Now the question is: why would someone use this method to estimate a network’s degree of hierarchicalness and not one of the many already proposed in the literature? The other methods all have small downsides. I build some toy examples where I can show that arborescence is the way to go. For instance, you can’t find a more perfect hierarchy than a balanced tree (leftmost example above). However, Global Reach Centrality would fail to give it a perfect score — since it thinks only a star graph is a perfect hierarchy. Agony and Flow Hierarchy aren’t fooled in this case, but give perfect scores in many other scenarios: a wheel graph with a single flipped edge (example in the middle), or a case where there are more bosses than underlings (rightmost example). Those who have been in a team with more bosses than workers know that the arrangement could be described in many ways, but “perfect” ain’t one of them.

Arborescence is also able to better distinguish between purely random graphs and graphs with a hierarchy — such as a preferential attachment with edges going from the older to the newer nodes (details in the paper). Before you start despairing, it’s not that I’m saying we’ve been doing hierarchy detection wrong for all these years. In most real world scenarios, these measures agree. But, when they disagree, arborescence is the one that most often sides with the domain experts, who have well informed opinions whether the system represented by the network should be a hierarchy or not.

To conclude, this method has several advantages over the competitors. It’s intuitive: it doesn’t give perfect ratings to imperfect hierarchies and vice versa. It’s graphic: you can easily picture what’s going on in it, as I did in this blog post. It’s conservative: it doesn’t make the outlandish claim that “everyone before me was a fool”. It’s rich: it gives you not only a score, but also a picture of the hierarchy itself. So… Give it a try! The code is freely available, and it plays nicely with networkx.

Continue Reading

25 January 2017 ~ 0 Comments

Network Backboning with Noisy Data

Networks are a fantastic tool for understanding an interconnected world. But, to paraphrase Spider Man, with networks’ expressive power come great headaches. Networks lure us in with their promise of clearly representing complex phenomena. However, once you start working with them, all you get is a tangled mess. This is because, most of the time, there’s noise in the data and/or there are too many connections: you need to weed out the spurious ones. The process of shaving the hairball by keeping only the significant connections — the red ones in the picture below —  is called “network backboning”. The network backbone represents the true relationships better and will play much nicer with other network algorithms. In this post, I describe a backboning method I developed with Frank Neffke, from the paper “Network Backboning with Noisy Data” accepted for publication in the International Conference on Data Engineering (the code implementing the most important backbone algorithms is available here).

bb1

Network backboning is as old as network analysis. The first solution to the problem was to keep edges according to their weight. If you want to connect people who read the same books, pairs who have few books in common are out. Serrano et al. pointed out that edge weight distributions can span many orders of magnitude — as shown in the figure below (left). Even with a small threshold, we are throwing away a lot of edges. This might not seem like a big deal — after all we’re in the business of making the network sparser — except that the weights are not distributed randomly. The weight of an edge is correlated with the weights of the edges sharing a node with it — as shown by the figure below (right). It is easy to see why: if you have a person who read only one book, all its edges can have at most weight one.

Their weights might be low in comparison with the rest of the network, but they are high for their nodes, given their propensity to connect weakly. Isolating too many nodes because we accidentally removed all their edges is a no-no, so Serrano and coauthors developed the Disparity Filter (DF): a technique to estimate the significance of one node’s connections given its typical edge weight, regardless of what the rest of the network says.

bb2

This sounds great, but DF and other network backboning approaches make imprecise assumptions about the possibility of noise in our estimate of edge weights. In our book example, noise means that a user might have accidentally said that she read a book she didn’t, maybe because the titles were very similar. One thing DF gets wrong is that, when two nodes are not connected in the raw network data, it would say that measurement error is absent. This is likely incorrect, and it screams for a more accurate estimate of noise. I’m going to leave the gory math details in the paper, but the bottom line is that we used Bayes’ rule. The law allows us to answer the question: how surprising is the weight of this edge, given the weights of the two connected nodes? How much does it defy my expectation?

The expectation here can be thought of as an extraction without replacement, much like Bingo (which statisticians — notorious for being terrible at naming things — would call a “hypergeometric” one). Each reader gets to extract a given number of balls (n, the total number of books she read), drawing from a bin in which all balls are the other users. If a user read ten books, then there are ten balls representing her in the bin. This is a good way to have an expectation for zero edge weights (nodes that are not connected), because we can estimate the probability of never extracting a ball with a particular label.

bb4

I highlighted the words one and two, because they’re a helpful key to understand the practical difference between the approaches. Consider the toy example below. In it, each edge’s thickness is proportional to its weight. Both DF and our Noise Corrected backbone (NC) select the black edges: they’re thick and important. But they have different opinions about the blue and red edges. DF sees that nodes 2 and 3 have mostly weak connections, meaning their thick connection to node 1 stands out. So, DF keeps the blue edges and it drops the red edge. It only ever looks at one node at a time.

bb5

NC takes a different stance. It selects the red edge and drops the blue ones. Why? Because for NC what matters more is the collaboration between the two nodes. Sure, the blue connection is thicker than the red one. But node 1 always has strong connections, and its blue edges are actually particularly weak. On the other hand, node 3 usually has weak connections. Proportionally speaking, the red edge is more important for it, and so it gets saved.

To sum up, NC:

  1. Refines our estimate of noise in the edge weights;
  2. Sees an edge as the collaboration between two nodes rather that an event happening to one of them;
  3. Uses a different model exploiting Bayes’ law to bake these aspects together.

bb6

How does that work for us in practice? Above you see some simulations made with artificial networks, of which we know the actual underlying structure, plus some random noise — edges thrown in that shouldn’t exist. The more noise we add the more difficult it is to recover the original structure. When there is little noise, DF (in blue) is better. NC (in red) starts to shine as we increase the amount of noise, because that’s the scenario we were targeting.

In the paper we also show that NC backbones have a comparable stability with DF, meaning that extracting the backbone from different time snapshots of the same phenomenon usually does not yield wildly different results. Coverage — the number of nodes that still have at least one edge in the network — is also comparable. Then we look at quality. When we want to predict some future relationship among the nodes, we expect noisy edges to introduce errors in the estimates. Since a backbone aims at throwing them away, it should increase our predictive power. The table below (click it to enlarge) shows that, in different country-country networks, the predictive quality (R2) using an NC backbone is higher than the one we get using the full noisy network. The quality of prediction can get as high as twice the baseline (the table reports the quality ratio: R2 of the backbone over R2 of the full network, for different methods).

bb8

The conclusion is that, when you are confident about the measurement of your network, you should probably extract its backbone using DF. However, in cases of uncertainty, NC is the better option. You can test it yourself!

Continue Reading

07 July 2016 ~ 0 Comments

Building Data-Driven Development

A few weeks ago I had to honor to speak at my group’s  “Global Empowerment Meeting” about my research on data science and economic development. I’m linking here the Youtube video of my talk and my transcript for those who want to follow it. The transcript is not 100% accurate given some last minute edits — and the fact that I’m a horrible presenter 🙂 — but it should be good enough. Enjoy!


We think that the big question of this decade is on data. Data is the building blocks of our modern society. We think in development we are not currently using enough of these blocks, we are not exploiting data nearly as much as we should. And we want to fix that.

Many of the fastest growing companies in the world, and definitely the ones that are shaping the progress of humanity, are data-intensive companies. Here at CID we just want to add the entire world to the party.

So how do we do it? To fix the data problem development literature has, we focus on knowing how the global knowledge building looks like. And we inspect three floors: how does knowledge flow between countries? What lessons can we learn inside these countries? What are the policy implications?

To answer these questions, we were helped by two big data players. The quantity and quality of the data they collect represent a revolution in the economic development literature. You heard them speaking at the event: they are MasterCard – through their Center for Inclusive Growth – and Telefonica.

Let’s start with MasterCard, they help us with the first question: how does knowledge flow between countries? Credit card data answer to that. Some of you might have a corporate issued credit card in your wallet right now. And you are here, offering your knowledge and assimilating the knowledge offered by the people sitting at the table with you. The movements of these cards are movements of brains, ideas and knowledge.

When you aggregate this at the global level you can draw the map of international knowledge exchange. When you have a map, you have a way to know where you are and where you want to be. The map doesn’t tell you why you are where you are. That’s why CID builds something better than a map.

We are developing a method to tell why people are traveling. And reasons are different for different countries: equity in foreign establishments like the UK, trade partnerships like Saudi Arabia, foreign greenfield investments like Taiwan.

Using this map, it is easy to envision where you want to go. You can find countries who have a profile similar to yours and copy their best practices. For Kenya, Taiwan seems to be the best choice. You can see that, if investments drive more knowledge into a country, then you should attract investments. And we have preliminary results to suggest whom to attract: the people carrying the knowledge you can use.

The Product Space helps here. If you want to attract knowledge, you want to attract the one you can more easily use. The one connected to what you already know. Nobody likes to build cathedrals in a desert. More than having a cool knowledge building, you want your knowledge to be useful. And used.

There are other things you can do with international travelers flows. Like tourism. Tourism is a great export: for many countries it is the first export. See these big portion of the exports of Zimbabwe or Spain? For them tourism would look like this.

Tourism is hard to pin down. But it is easier with our data partners. We can know when, where and which foreigners spend their money in a country. You cannot paint pictures as accurate as these without the unique dataset MasterCard has.

Let’s go to our second question: what lessons can we learn from knowledge flows inside a country? Telefonica data is helping answering this question for us. Here we focus on a test country: Colombia. We use anonymized call metadata to paint the knowledge map of Colombia, and we discover that the country has its own knowledge departments. You can see them here, where each square is a municipality, connecting to the ones it talks to. These departments correlate only so slightly with the actual political boundaries. But they matter so much more.

In fact, we asked if these boundaries could explain the growth in wages inside the country. And they seem to be able to do it, in surprisingly different ways. If you are a poor municipality in a rich state in Colombia, we see your wage growth penalized. You are on a path of divergence.

However, if you are a poor municipality and you talk to rich ones, we have evidence to show that you are on a path of convergence: you grow faster than you expect to. Our preliminary results seem to suggest that being in a rich knowledge state matters.

So, how do you use this data and knowledge? To do so you have to drill down at the city level. We look not only at communication links, but also at mobility ones. We ask if a city like Bogota is really a city, or different cities in the same metropolitan area. With the data you can draw four different “mobility districts”, with a lot of movements inside them, and not so many across them.

The mobility districts matter, because combining mobility and economic activities we can map the potential of a neighborhood, answering the question: if I live here, how productive can I be? A lot in the green areas, not so much in the red ones.

With this data you can reshape urban mobility. You know where the entrance barriers to productivity are, and you can destroy them. You remodel your city to include in its productive structure people that are currently isolated by commuting time and cost. These people have valuable skills and knowhow, but they are relegated in the informal sector.

So, MasterCard data told us how knowledge flows between countries. Telefonica data showed the lessons we can learn inside a country. We are left with the last question: what are the policy implications?

So far we have mapped the landscape of knowledge, at different levels. But to hike through it you need a lot of equipment. And governments provide part of that equipment. Some in better ways than others.

To discover the policy implications, we unleashed a data collector program on the Web. We wanted to know how the structure of the government in the US looks like. Our program returned us a picture of the hierarchical organization of government functions. We know how each state structures its own version of this hierarchy. And we know how all those connections fit together in the union, state by state. We are discovering that the way a state government is shaped seems to be the result of two main ingredients: where a state is and how its productive structure looks like.

We want to establish that the way a state expresses its government on the Web reflects the way it actually performs its functions. We seem to find a positive answer: for instance having your environmental agencies to talk with each other seems to work well to improve your environmental indicators, as recorded by the EPA. Wiring organization when we see positive feedback and rethinking them when we see a negative one is a direct consequence of this Web investigation.

I hope I was able to communicate to you the enthusiasm CID discovered in the usage of big data. Zooming out to gaze at the big picture, we start to realize how the knowledge building looks like. As the building grows, so does our understanding of the world, development and growth. And here’s the punchline of CID: the building of knowledge grows with data, but the shape it takes is up to what we make of this data. We chose to shape this building with larger doors, so that it can be used to ensure a more inclusive world.


By the way, the other presentations of my session were great, and we had a nice panel after that. You can check out the presentations in the official Center for International Development Youtube channel. I’m embedding the panel’s video below:

Continue Reading

29 May 2015 ~ 0 Comments

Networks of Networks – NetSci 2015

The time has finally come! The NetSci conference—the place to be if you are interested in complex networks—is happening next week, from June 1st to June 5th. The venue is in Zaragoza, Spain. You can get all the information you need about the event from the official website. For the third year, I am co-organizing one of its satellite events: the Multiple Network Modeling, Analysis and Mining symposium, this year held jointly with Networks of Networks. The satellite will take place on June 2nd. As I previously said, unfortunately I am not going to be physically present in Span, and that makes me sad, because we have a phenomenal program this year.

We have four great invited speakers: Giovanni Sansavini, Rui Carvalho, Arunabha Sen and Katharina Zweig. It is a perfect mix between the infrastructure focus of the networks of networks crowd and the multidisciplinary approach of multiple networks. Sansavini works on reliability and risk engineering, while Carvalho focuses on characterizing and modeling networks in energy. Sen and Zweig provide their outstanding experience in the fields of computer networks and graph theory.

Among the contributed talks I am delighted to see that many interesting names from the network analysis crowd decided to send their work to be presented in our event. Among the highlights we have a contribution from the group of Mason Porter, who won last year’s Erdos Prize as one of the most outstanding young network scientists. I am also happy to see contributions from the group of Cellai and Gleeson, with whom I share not only an interest on multiplex networks, but also on internet memes. Contributions from groups lead by heavyweights like Schweitzer and Havlin are another sign of the attention that this event has captured.

I hope many of you will attend this seminar. You’ll be in good hands: Gregorio D’Agostino, Przemyslaw Kazienko and Antonio Scala will be much better hosts than I can ever be. I am copying here the full program of the event. Enjoy Spain!

NoN’15 Program

Session I

9.00 – 9.30 Speaker Set Up

9.30 – 9.45 Introduction: Welcome from the organizers, presentation of the program

9.45 – 10.15 Keynote I: Giovanni Sansavini. Systemic risk in critical infrastructures

10.15 – 10.35 Contributed I: Davide Cellai and Ginestra Bianconi. Multiplex networks with heterogeneous activities of the nodes

10.35 – 10.55 Contributed II: Mikko Kivela and Mason Porter. Isomorphisms in Multilayer Networks

10.55 – 11.30 Coffee Break

Session II

11.30 – 12.00 Keynote II: Rui Carvalho, Lubos Buzna, Richard Gibbens and Frank Kelly. Congestion control in charging of electric vehicles

12.00 – 12.20 Contributed III: Saray Shai, Dror Y. Kenett, Yoed N. Kenett, Miriam Faust, Simon Dobson and Shlomo Havlin. A critical tipping point in interconnected networks

12.20 – 12.40 Contributed IV: Adam Hackett, Davide Cellai, Sergio Gomez, Alex Arenas and James Gleeson. Bond percolation on multiplex networks

12.40 – 13.00 Contributed V: Marco Santarelli, Mario Beretta, Giorgio D’Urbano, Lorenzo Spina, Renato De Leone and Emilia Marchitto. Soccer and networks: changing the way of playing soccer through GPS, video analysis and social networks

13.00 – 14.30 Lunch

Session III

14.30 – 15.00 Keynote III: Arunabha Sen. Strategic Analysis and Design of Robust and Resilient Interdependent Power and Communication Networks with a New Model of Interdependency

15.00 – 15.20 Invited I: Alfonso Damiano,Univ. di Cagliari – Electric Market – Italy; Antonio Scala CNR-ICS, IMT, LIMS

15.20 – 15.40 Contributed VI: Rebekka Burkholz, Antonios Garas, Matt V Leduc, Ingo Scholtes and Frank Schweitzer. Cascades on Multiplexes with Threshold Feedback

15.40 – 16.00 Contributed VII: Soumajit Pramanik, Maximilien Danisch, Qinna Wang, Jean-Loup Guillaume and Bivas Mitra. Analyzing the Impact of Mentioning in Twitter

16.00 – 16.30 Coffee Break

Session IV

16.00 – 16.30 Keynote IV: Katharina Zweig. Science-theoretic musings on the analysis of networks (of networks)

16.30 – 16.50 Contributed VIII: Vinko Zladic, Sebastian Krause, Michael Danziger. Avoidable colors percolation

16.50 – 17.10 Contributed IX: Borut Sluban, Jasmina Smailovic, Igor Mozetic and Stefano Battiston. Sentiment Leaning of Influential Communities in Social Networks

17.10 – 17.30 Invited II: one speaker from the CI2C project (confirmed, yet to be defined)

17.30   Planning Netonets Future Activities

Continue Reading

22 January 2015 ~ 0 Comments

Surprising Facts About Shortest Paths

Maybe it’s the new year, maybe it’s the fact that I haven’t published anything new recently, but today I wanted to take a look at my publication history. This, for a scientist, is something not unlike a time machine, bringing her back to an earlier age. What was I thinking in 2009? What sparked my interest and what were the tools further refined to get to the point where I am now? It’s usually a humbling (not to say embarrassing) operation, as past papers always look so awful – mine, at least. But small interesting bits can be found, like the one I retrieved today, about shortest paths in communication networks.

A shortest path in a network is the most efficient way to go from one node to another. You start from your origin and you choose to follow an edge to another node. Then you choose again an edge and so on until you get to your destination. When your choices are perfect and you used the minimum possible number of edges to follow, that’s a shortest path (it’s A shortest path and not THE shortest path because there might be alternative paths of the same length). Now, in creating this path, you obviously visited some nodes in between, unless your origin and destination are directly connected. Turns out that there are some nodes that are crossed by a lot of shortest paths, it’s a characteristic of real world networks. This is interesting, so scientists decided to create a measure called betweenness centrality. For each node, betweenness centrality is the share of all possible shortest paths in the network that pass through them.

Intuitively, these nodes are important. Think about a rail network, where the nodes are the train stations. High betweenness stations see a lot of trains passing through them. They are big and important to make connections run faster: if they didn’t exist every train would have to make detours and would take longer to bring you home. A good engineer would then craft rail networks in such a way to have these hubs and make her passengers happy. However, it turns out that this intuitive rule is not universally applicable. For example some communication networks aren’t willing to let this happen. Michele Berlingerio, Fosca Giannotti and I stumbled upon this interesting result while working on a paper titled Mining the Temporal Dimension of the Information Propagation.

tas2

We built two communication networks. One is corporate-based: it’s the web of emails exchanged across the Enron employee ecosystem. The email record has been publicly released for the investigation about the company’s financial meltdown. An employee is connected to all the employees she emailed. The second is more horizontal in nature, with no work hierarchies. We took users from different email newsgroups and connected them if they sent a message to the same thread. It’s the nerdy version of commenting on the same status update on Facebook. Differently from most communication network papers, we didn’t stop there. Every edge still carries some temporal information, namely the moment in which the email was sent. Above you have an extract of the network for a particular subject, where we have the email timestamp next to each edge.

Here’s where the magic happens. With some data mining wizardry, we are able to tell the characteristic reaction times of different nodes in the network. We can divide these nodes in classes: high degree nodes, nodes inside a smaller community where everybody replies to everybody else and, yes, nodes with high betweenness centrality, our train station hubs. For every measure (characteristic), nodes are divided in five classes. Let’s consider betweenness. Class 1 contains all nodes which have betweenness 0, i.e. those through which no shortest path passes. From class 2 to 5 we have nodes of increasing betweenness. So, nodes in class 3 have a medium-low betweenness centrality and nodes in class 5 are the most central nodes in the network. At this point, we can plot the average reaction times for nodes belonging to different classes in the two networks. (Click on the plots to enlarge them)

tas1

The first thing that jumps to the eye is that Enron’s communications (on the left) are much more dependent on the node’s characteristics (whether the characteristic is degree or betweenness it doesn’t seem to matter) than Newsgroup’s ones, given the higher spread. But the interesting bit, at least for me, comes when you only look at betweenness centrality – the dashed line with crosses. Nodes with low (class 2) and medium-low (class 3) betweenness centrality have low reaction times, while more central nodes have significantly higher reaction times. Note that the classes have the same number of nodes in them, so we are not looking at statistical aberrations*. This does not happen in Newsgroups, due to the different nature of the communication in there: corporate in Enron versus topic-driven in Newsgroup.

The result carries some counter intuitive implications. In a corporate communication network the shortest path is not the fastest. In other words, don’t let your train pass through the central hub for a shortcut, ’cause it’s going to stay there for a long long time. It looks like people’s brains are less elastic than our train stations. You can’t add more platforms and personnel to make more things passing through them: if your communication network has large hubs, they are going to work slower. Surprisingly, this does not hold for the degree (solid line): it doesn’t seem to matter with how many people you interact, only that you are the person through which many shortest paths pass.

I can see myself trying to bring this line of research back from the dead. This premature paper needs quite some sanity checks (understatement alert), but it can go a long way. It can be part of the manual on how to build an efficient communication culture in your organization. Don’t overload people. Don’t create over-important nodes in the network, because you can’t allow all your communications to pass through them. Do keep in mind that your team is not a clockwork, it’s a brain-work. And brains don’t work like clocks.


* That’s also the reason to ditch class 1: it contains outliers and it is not comparable in size to the other classes.

 

Continue Reading

25 September 2014 ~ 0 Comments

Digital Humanities @ KU: Report

Earlier this month I had the pleasure of being invited to hold a workshop with Isabel Meirelles on complex network visualization and analysis at the Digital Humanities 2014 Forum, held at Kansas University, Lawrence. I figured that this is a good occasion to report on my experience, since it was very interesting and, being quite different from my usual venues, it adds a bit of diversity. The official page of the event is useful to get an overall picture of what was going on. It will be helpful also for everything I will not touch upon in this post.

net_ex19

I think that one of the main highlights of the event was the half of our workshop curated by Isabel, with the additional keynote that she gave. Isabel is extremely skilled both in the know-how and in the know-what about information visualization: she is not only able to create wonderful visualizations, but she also has a powerful critical sense of what works and what doesn’t. I think that the best piece of supporting evidence for this statement is her latest book, which you can find here. As for my part of the workshop, it was focused on a very basic introduction to the simplest metrics of network analysis. You can take a glance at the slides here, but if you are already somewhat proficient in network terminology do not expect your world to be shattered by it.

The other two keynotes were equally fascinating. The first one was from Steven Jones. His talk gravitated around the concept of the eversion of the virtual into reality. Many works of science fiction imagined human beings ending up in some more or less well defined “virtual reality”, where everything is possible as long as you can program it. See for example Gibson’s “Neuromancer”, or “The Matrix”, just to give a popular example that most people would know. But what is happening right now, observes Jones, is exactly the opposite. We see more and more examples of virtual reality elements being introduced, mostly playfully, into reality. Think about qonqr, where team of people have to physically “fight” to keep virtual control of an actual neighborhood. A clever artistic way to depict eversion is also:

The last keynote was from Scott Weingart. Scott is a smart guy and he is particularly interested in studying the history of science. In the (too few!) interactions we had during the forum we touched upon many topics also included in his talk: ethic responsibility of usage of data about people, the influence of the perspective you use to analyze human activities and, a must exchange between a historian of science and yours truly formed as scientist in Pisa, Galileo Galilei. I feel I cannot do justice to his very eloquent and thought-provoking keynote in this narrow space. So I redirect you to its transcript, hosted on Scott’s blog. It’s a good read.

Then, the contributed talks. Among all the papers you can explore from the official forum page, I’d like to focus particularly on two. The first is the Salons project, presented by Melanie Conroy. The idea is to map the cultural exchange happening in Europe during the Enlightenment years. A great role for this exchange was played by Salons, where wealthy people were happy to give intellectuals a place for gathering and discussing. You can find more information on the Salons project page. I liked it because it fits with the idea of knowledge creation and human advancement as a collective process, where an equal contribution is given by both intellect and communication. By basing itself on richly annotated data, projects like these can help us understanding where breakthroughs come from, or to understand that there is no such a thing as a breakthrough, only a progressive interconnection of ideas. Usually, we realize it only after the fact, and that’s why we think it happened all of a sudden.

Another talk I really enjoyed was from Hannah Jacobs. Her talk described a visualization tool to explore the evolution of the concept of “New Woman”, one of the first examples of feminism. I am currently unable to find an online link to the tool. What I liked about it was the seamless way in which different visualizations are used to tell the various points of view on the story. The whole point of information visualization is that when there is too much data to show at the same time, one has to select what to highlight and what to discard. But in this framework, with a wise choice of techniques, one can jump into different magnifying glasses and understand one part of the story of the term “New Woman” at a time.

Many other things were cool, from the usage of the Unity 3D engine to recreate historic view, to the charming visualizations of “Enchanters of Men”. But my time here is up, and I’m left with the hope of being invited also to the 2015 edition of the forum.

Continue Reading

28 August 2014 ~ 0 Comments

The Curious World of Network Mapping

Complex networks can come in different flavors. As you know if you follow this blog, my signature dish is multilayer/multidimensional networks: networks with multiple edge types. One of the most popular types is bipartite networks. In bipartite networks, you have two types of nodes. For example, you can connect users of Netflix to the movies they like. As you can see from this example, in bipartite networks we allow only edges going from one type of nodes to the other. Users connect to movies, but not to other users, and movies can’t like other movies (movies are notoriously mean to each other).

m1

Many things (arguably almost everything) can be represented as a bipartite network. An occupation can be connected to the skills and/or tasks it requires, an aid organization can be connected to the countries and/or the topics into which it is interested, a politician is connected to the bills she sponsored. Any object has attributes. And so it can be represented as an object-attribute bipartite network. However, most of the times you just want to know how similar two nodes of the same type are. For example, given a movie you like, you want to know a similar movie you might like too. This is called link prediction and there are two ways to do this. You could focus on predicting a new user-movie connection, or focus instead on projecting the bipartite network to discover the previously unknown movie-movie connections. The latter is the path I chose, and the result is called “Network Map”.

It is clearly the wrong choice, as the real money lies in tackling the former challenge. But if I wanted to get rich I wouldn’t have chosen a life in academia anyway. The network map, in fact, has several advantages over just predicting the bipartite connections. By creating a network map you can have a systemic view of the similarities between entities. The Product Space, the Diseasome, my work on international aid. These are all examples of network maps, where we go from a bipartite network to a unipartite network that is much easier to understand for humans and to analyze for computers.

ps4

Creating a network map, meaning going from a user-movie bipartite network to a movie-movie unipartite network, is conceptually easy. After all, we are basically dealing with objects with attributes. You just calculate a similarity between these attributes and you are done. There are many similarities you can use: Jaccard, Pearson, Cosine, Euclidean distances… the possibilities are endless. So, are we good? Not quite. In a paper that was recently accepted in PLoS One, Muhammed Yildirim and I showed that real world networks have properties that make the general application of any of these measures quite troublesome.

For example, bipartite networks have power-law degree distributions. That means that a handful of attributes are very popular. It also means that most objects have very few attributes. You put the two together and, with almost 100% probability, the many objects with few attributes will have the most popular attributes. This causes a great deal of problems. Most statistical techniques aren’t ready for this scenario. Thus they tend to clutter the network map, because they think that everything is similar to everything else. The resulting network maps are quite useless, made of poorly connected dense areas and lacking properties of real world networks, such as power-law degree distributions and short average path length, as shown in these plots:

m2

m3

Of course sometimes some measure gets it right. But if you look closely at the pictures above, the only method that consistently give the shortest paths (above, when the peak is on the left we are good) and the broadest degree distributions (below, the rightmost line at the end in the lower-right part of the plot is the best one) is the red line of “BPR”. BPR stands for “Bipartite Projection via Random-walks” and it happens to be the methodology that Muhammed and I invented. BPR is cool not only because its network maps are pretty. It is also achieving higher scores when using the network maps to predict the similarity between objects using ground truth, meaning that it gives the results we expect when we actually already know the answers, that are made artificially invisible to test the methodology. Here we have the ROC plots, where the highest line is the winner:

m4

So what makes BPR so special? It all comes down to the way you discount the popular attributes. BPR does it in a “network intelligent” way. We unleash countless random walkers on the bipartite network. A random walker is just a process that starts from a random object of the network and then it jumps from it to one of its attributes. The target attribute is chosen at random. And then the walker jumps back to an object possessing that attribute, again choosing it at random. And then we go on. At some point, we start from scratch with a new random walk. We note down how many times two objects end up in the same random walk and that’s our similarity measure. Why does it work? Because when the walker jumps back from a very popular attribute, it could essentially go to any object of the network. This simple fact makes the contribution of the very popular attributes quite low.

BPR is just the latest proof that random walks are one of the most powerful tools in network analysis. They solve node ranking, community discovery, link prediction and now also network mapping. Sometimes I think that all of network science is founded on just one algorithm, and that’s random walks. As a final note, I point out that you can create your own network maps using BPR. I put the code online (the page still bears the old algorithm’s name, YCN). That’s because I am a generous coder.

Continue Reading

16 January 2014 ~ 2 Comments

The Eternal Struggle between Partitions and Overlapping Coverages

New year, old topic. I could make a lot of resolutions for this new year, but for sure to stop talking about community discovery is not among them. At least this time I tried to turn it up a notch in the epicness of the title. My aim is to give some substance to one of the many typical filler phrases in science writing. The culprit sentence in this case is “different application scenarios demand different approaches”. Bear with me for a metaphoric example.

When presenting a new toaster, it is difficult to prove that it toasts everything better under any point of view, under any circumstances. It usually does most toasts okay, and for one kind of toasts it really shines. Or its toasts really suck, but it can toast underwater. That’s fine. We are all grown up here, we don’t believe in the fairy tales of the silver bullets any more. At this point, our toaster salesman is forced to say it. “Different application scenarios demand different approaches”. In some cases this is a shameful fig leaf, but in many others it is simply true. Problem is: nobody really checks.

domo_toaster

I decided to check. At least one of them. Teaming up with Diego Pennacchioli and Dino Pedreschi, I put the spotlight on one of the strongest dichotomies in community discovery.  As you may remember, community discovery algorithms can force every node to belong to just one community, or allow them to be in many of them. The former approach is called “graph partitioning”, whilst the latter aims to find an “overlapping coverage”. Are these two strategies yielding interesting, yet completely different, results? This question has been dissected in the paper: “Overlap Versus Partition: Marketing Classification and Customer Profiling in Complex Networks of Products“, that will be presented in one workshop of the 2014 edition of the International Conference of Data Engineering.  Let me refresh your mind about overlaps and partitions.

Above you have the nec plus ultra scenario for a partitioning algorithm. If a partitioning algorithm sees the graph on the left, it would just die of happiness. In the graph, in fact, it appears very clearly that each node belongs to a very specific community. And it can’t belong to any other. If we assume that our algorithm works on edge strength (e.g. the inverse of the edge betweenness), then what the algorithm really sees is the graph on the right. It then proceeds to group together the nodes for which the edge strength is maximal, et voilà.

Here we have an example that’s a bit more complex. The picture has too many overlapping parts, so let me describe the connection pattern. In the graph on the left there are several groups of 6 nodes, each node connected to all other members of the group. In practice, each diagonal is completely connected to the two neighbouring diagonals. Clearly, here there is no way we can put each node in a disjoint group. Why put together nodes 0,1,2 with 3,4,5 and not with 9,10,11? But at that point, why 9,10,11 should be in a community with them and not with 6,7,8? The correct approach is just to allow every completely connected group to be a community, thus letting nodes to be part of more than a community. Some overlapping algorithms see the graph as it has been depicted on the right, with an edge colour per densely connected group.

Time to test which one of these approaches is The Right One! For our data quest we focused on supermarket transactions. We created a network of products that you can buy in supermarkets. To be connected, two products have to be bought together by the same customers in a significant number of times. What does that mean? By pure intuition, bread and water aren’t going to be connected: both of them are bought very frequently, but they have little to do with each other, thus they are expected to be in the same shopping cart by chance. Eggs and flour are too very popular, but probably more than chance, since there are a lot of things you can do with them together. Therefore they are connected. Other specific pairs of products, say bacon flavoured lipstick and liquorice shoelaces, may ended up in the same, quite weird, shopping cart. But we don’t connect them, as their volume of sales is too low (or at least I hope so).

Here are some of the facts we found. First. The overlapping approach* tends to return relatively more communities with a larger amount of nodes than the partition approach**. In absolute terms that’s obvious, since the same node is counted more than once, but here the key term is “relatively”. See the plot above on the right, where we graph the probability (y axis) of finding a community with a given number of nodes (x axis). Second. The overlapping approach returns more “messy” communities. Our messiness measure checks how many different product categories are grouped together on average in the same community. Again, larger communities are expected to be messier, but the messiness measure that we used controls for community size. See the plot on the right, again the probability (y axis) of finding a community with a given entropy (x axis, “entropy” is the fancy scientific term for “messiness”). Third. The partition approach returned denser communities, whose link strength (the number of people buying the products together) is higher.

What is the meaning of all this? In our opinion, the two algorithms are aiming to do something completely different. The partition approach is aiming to create a new marketing classification. It more or less coincides with the established one (thus lower messiness), most customers buy those products together (high link strength) and there are very few giant categories (most communities are small). The overlapping approach, instead, wants to do customer profiling. A customer rarely buys all products of a marketing category (thus increasing its messiness), it has specific needs (that not many people have, thus lowering edge weight) and she usually needs a bunch of stuff (thus larger communities, on average).

Who’s right? That’s the catch: both. The fact that two results are incompatible, in this case, does not mean that one is right and one is wrong. They are just different applications. Which was exactly what I wanted to prove, in this narrow and very specific, probably unsurprising, scenario. Now you should feel better: I gave you a small proof that the hours you spend to choose the perfect toaster for you are really worth your time!


* As overlapping approach, we used the Hierarchical Link Clustering.

** As partitioning approach, we used Infomap.

 

Continue Reading