28 May 2018 ~ 0 Comments

Mapping the International Aid Community

A few years ago (2013, God I’m old), I told you about how to give a “2.0” flavor to international aid: at CID we created an Aid Explorer to better coordinate the provision of humanitarian work. I’m happy to report that “Aid Explorer” has been adopted, directly or indirectly, by multiple international organizations, for instance USAID and the European Union. The World Bank’s Independent Evaluation Group contacted me to make an updated version, focused on estimating the World Bank’s position in the global health arena. The result is a paper, “Mapping the international health aid community using web data”, recently published in EPJ Data Science, and the product of a great collaboration with Katsumasa Hamaguchi, Maria Elena Pinglo, and Antonio Giuffrida.

The idea is to collect all the webpages of a hundred international aid organizations, looking for specific keywords and for hyperlinks to the other organizations. This differs from the old Aid Explorer, in which we relied on Google’s index. The aim is to create different networks of co-occurrences connecting:

  • Aid organizations co-mentioned in the same page;
  • Aid organizations mentioned or linked by another;
  • Issues co-mentioned in the same page;
  • Countries co-mentioned in the same page.

We then analyze these structures to learn something about the community as a whole.
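To make the idea of a co-occurrence network concrete, here is a minimal sketch. It assumes the crawl has already been reduced to one set of mentioned organizations per page; the function name `cooccurrence_network` and the toy pages are hypothetical, not taken from the paper’s code.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(pages):
    """Count, for every unordered pair of items, the pages mentioning both."""
    weights = Counter()
    for mentioned in pages:
        # Sorting gives each pair a single canonical orientation.
        for a, b in combinations(sorted(set(mentioned)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy input: the organizations found on each crawled page.
pages = [
    {"WHO", "UNICEF", "World Bank"},
    {"WHO", "UNICEF"},
    {"USAID"},
    {"World Bank", "WHO"},
]
network = cooccurrence_network(pages)
```

The same function works for issues or countries: only the content of the sets changes. The edge weight is simply the number of pages in which the two items appear together.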

One thing I didn’t expect was that organizations cluster by type. The “type” here is the force behind the organization: private philanthropy, UN system, bilateral (the aid arm of a single country’s foreign ministry), multilateral (international cooperations acting globally), etc. In the picture above, we encode the agency type in the node color. Organizations are overwhelmingly co-mentioned with organizations of the same type, which is curious because bilaterals often have nothing in common with each other besides the fact that they are bilaterals: they work on different issues, with different developed and developing partners.

We can make a similar map connecting issues if they are co-mentioned in a web page. The map is useful as validation because it connects some “synonyms”, for instance “TB” and “Tuberculosis”. However, you can do much more with it. For instance, you can use color to show where an organization is most often cited. Below you see the issue map for the World Bank, with the red nodes showing the issues strongly co-mentioned with the World Bank. Basically, the node color is the edge weight in an organization-issue bipartite network, where the organization is the World Bank. To give you an idea, the tiny “Infant Survival” node on the right saw the World Bank mentioned in 9% of the pages in which it was discussed. The World Bank was mentioned in 3.8% of web pages discussing AIDS.
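The node color described above boils down to a ratio: of all the pages discussing an issue, how many also mention the organization? A small sketch of that computation follows, assuming each page has been reduced to a pair of sets (issues, organizations). The function name `mention_share` and the toy pages are made up for illustration.

```python
def mention_share(pages, organization):
    """For each issue, the fraction of its pages that also mention `organization`."""
    issue_pages = {}
    issue_hits = {}
    for issues, organizations in pages:
        for issue in issues:
            issue_pages[issue] = issue_pages.get(issue, 0) + 1
            if organization in organizations:
                issue_hits[issue] = issue_hits.get(issue, 0) + 1
    return {i: issue_hits.get(i, 0) / n for i, n in issue_pages.items()}

# Hypothetical toy pages: (issues mentioned, organizations mentioned).
pages = [
    ({"AIDS"}, {"WHO"}),
    ({"AIDS", "Health Reform"}, {"World Bank"}),
    ({"Health Reform"}, {"World Bank", "WHO"}),
    ({"AIDS"}, set()),
    ({"AIDS"}, {"UNICEF"}),
]
shares = mention_share(pages, "World Bank")
```

In this toy data the World Bank appears in one of the four “AIDS” pages and in both “Health Reform” pages, so “Health Reform” would be the redder node.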

This can lead to interesting discussions. While the World Bank does indeed talk a lot about some of the red issues above, for instance “Health Market” and “Health Reform”, it doesn’t say much about “Infant Survival”, relatively speaking at least. It’s intriguing that other organizations mention this particular issue so often in conjunction with the World Bank.

This difference between the global discourse about issues and the discourse specific to a single organization allows us to calculate two measures we call “Alignment” and “Impact”. By analyzing how similar the issue co-occurrence network of an organization is to the global one (a simple correlation of edge weights), we can estimate how “Aligned” it is with the global community. On the other hand, an “Impactful” organization is one that, were it to disappear, would dramatically change the global issue network: issues would not be co-mentioned that much.
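Here is a rough sketch of the two measures. Alignment follows the description above: a Pearson correlation between the organization’s edge weights and the global ones. For Impact the text only gives the intuition, so the formula below (the share of global edge weight that vanishes when the organization’s pages are removed) is my assumption for illustration, not necessarily the paper’s definition. All names and numbers are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def alignment(org_weights, global_weights):
    """Correlation of edge weights between one organization and the whole."""
    edges = sorted(global_weights)
    return pearson([org_weights.get(e, 0) for e in edges],
                   [global_weights[e] for e in edges])

def impact(org_weights, global_weights):
    """ASSUMED proxy: share of global edge weight lost without the organization."""
    total = sum(global_weights.values())
    lost = sum(min(org_weights.get(e, 0), w) for e, w in global_weights.items())
    return lost / total

# Hypothetical issue co-occurrence counts (pages mentioning both issues).
global_weights = {("AIDS", "TB"): 10, ("AIDS", "Malaria"): 6, ("Malaria", "TB"): 2}
org_weights = {("AIDS", "TB"): 5, ("AIDS", "Malaria"): 3, ("Malaria", "TB"): 1}

org_alignment = alignment(org_weights, global_weights)
org_impact = impact(org_weights, global_weights)
```

In this toy case the organization talks about issue pairs in exactly the same proportions as everybody else, so it is perfectly aligned, and it accounts for half of all co-mentions.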

In the plot above, we have Alignment and Impact on the x and y axes, respectively. The horizontal and vertical lines cutting through the plot show the median of each measure. The top-right quadrant contains organizations that are both impactful and aligned: the ones that have probably been setting the discourse of the international aid community. No wonder the World Health Organization is there. On the top left we have interesting mavericks, the ones that are not aligned with the community at large and yet have an impact on it. They are trying to shape the international aid community into something different from what it is now.

A final fun, if a bit loose, analysis regards the potential for an organization to spread a message through the international aid network. What would be the reach of a message if it originated from a specific organization? We can use the Susceptible-Infected model from epidemiology. A message is a “virus”, and it is more likely to infect an agency if more than x% of the agency’s incoming links come from other infected agencies.

This depends on the issue, as shown above. In the figures we see the fraction of “infected” agencies (on the y-axis) given an original “patient zero” organization which starts spreading the message. To the left we see the result of the simulation aggregating all issues. The World Bank reaches saturation faster than UNICEF, and USAID is only heard by a tiny fraction of the network. However, if we only consider web pages talking about “Nurses” (right), then USAID is on par with the top international aid organizations — and UNICEF beats the World Bank. This happens because the discourse on the topic is relatively speaking more concentrated in USAID than average.
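A toy version of this kind of simulation might look like the sketch below. It is a deterministic simplification of what is described above: an agency becomes infected as soon as the infected share of its incoming links exceeds the threshold, rather than merely becoming more likely to be infected. The function name `spread`, the five-agency network, and the threshold value are all made up.

```python
def spread(in_links, patient_zero, threshold, max_steps=100):
    """Return the fraction of infected agencies after each step.

    in_links maps every agency to the list of agencies linking to it.
    """
    infected = {patient_zero}
    history = [len(infected) / len(in_links)]
    for _ in range(max_steps):
        newly = set()
        for agency, sources in in_links.items():
            if agency in infected or not sources:
                continue
            # Share of this agency's incoming links coming from infected agencies.
            share = sum(s in infected for s in sources) / len(sources)
            if share > threshold:
                newly.add(agency)
        if not newly:
            break  # nobody new got infected: the message stopped spreading
        infected |= newly
        history.append(len(infected) / len(in_links))
    return history

# Hypothetical network: agency -> agencies that link to it.
in_links = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
    "D": ["C", "E"],
    "E": ["D"],
}
history = spread(in_links, "A", threshold=0.4)
```

Starting from “A” the message eventually reaches everybody, while starting from “E” it stalls after reaching “D”. That mirrors the point of the figures: the reach depends heavily on who patient zero is.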

As with the Aid Explorer, this is a small step toward improving the provision of international aid. We do not have an interactive website this time, but you can download both the data and the code to create your own maps. Ideally, what we did only for international aid keywords can be extended to all other topics of interest in the humanitarian community: economic development, justice, or disaster relief.


22 January 2015 ~ 0 Comments

Surprising Facts About Shortest Paths

Maybe it’s the new year, maybe it’s the fact that I haven’t published anything new recently, but today I wanted to take a look at my publication history. This, for a scientist, is something not unlike a time machine, bringing her back to an earlier age. What was I thinking in 2009? What sparked my interest, and which tools did I refine to get to where I am now? It’s usually a humbling (not to say embarrassing) operation, as past papers always look so awful – mine, at least. But small interesting bits can be found, like the one I retrieved today, about shortest paths in communication networks.

A shortest path in a network is the most efficient way to go from one node to another. You start from your origin and you choose to follow an edge to another node. Then you choose another edge and so on until you get to your destination. When your choices are perfect and you followed the minimum possible number of edges, that’s a shortest path (it’s A shortest path and not THE shortest path because there might be alternative paths of the same length). Now, in creating this path, you obviously visited some nodes in between, unless your origin and destination are directly connected. It turns out that some nodes are crossed by a lot of shortest paths: it’s a characteristic of real world networks. This is interesting, so scientists decided to create a measure called betweenness centrality. For each node, betweenness centrality is the share of all possible shortest paths in the network that pass through it.
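For the curious, here is a minimal pure-Python sketch of betweenness centrality for an unweighted, undirected network. It uses Brandes’ algorithm, a standard way to compute the measure, and not necessarily what we used back then. The function name and the toy rail network are made up, and the normalization (dividing by the number of ordered origin-destination pairs) is one common convention.

```python
from collections import deque

def betweenness(adj):
    """Normalized betweenness for an unweighted, undirected graph.

    adj maps each node to the list of its neighbors.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    pairs = (n - 1) * (n - 2)  # ordered origin-destination pairs excluding the node
    return {v: (score[v] / pairs if pairs else 0.0) for v in adj}

# Hypothetical rail network: every small station connects only to "Central".
rail = {
    "Central": ["North", "South", "East", "West"],
    "North": ["Central"],
    "South": ["Central"],
    "East": ["Central"],
    "West": ["Central"],
}
centrality = betweenness(rail)
```

In this star-shaped network every trip between two small stations has to go through “Central”, so it gets the maximum score while all the other stations get zero.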

Intuitively, these nodes are important. Think about a rail network, where the nodes are the train stations. High betweenness stations see a lot of trains passing through them. They are big and important to make connections run faster: if they didn’t exist, every train would have to make detours and would take longer to bring you home. A good engineer would then craft rail networks in such a way as to have these hubs and make her passengers happy. However, it turns out that this intuitive rule is not universally applicable. For example, some communication networks aren’t willing to let this happen. Michele Berlingerio, Fosca Giannotti and I stumbled upon this interesting result while working on a paper titled Mining the Temporal Dimension of the Information Propagation.


We built two communication networks. One is corporate-based: it’s the web of emails exchanged across the Enron employee ecosystem. The email record was publicly released as part of the investigation into the company’s financial meltdown. An employee is connected to all the employees she emailed. The second is more horizontal in nature, with no work hierarchies. We took users from different email newsgroups and connected them if they sent a message to the same thread. It’s the nerdy version of commenting on the same status update on Facebook. Unlike most communication network papers, we didn’t stop there. Every edge still carries some temporal information, namely the moment in which the email was sent. Above you have an extract of the network for a particular subject, where we have the email timestamp next to each edge.

Here’s where the magic happens. With some data mining wizardry, we are able to tell the characteristic reaction times of different nodes in the network. We can divide these nodes into classes: high degree nodes, nodes inside a smaller community where everybody replies to everybody else and, yes, nodes with high betweenness centrality, our train station hubs. For every measure (characteristic), nodes are divided into five classes. Let’s consider betweenness. Class 1 contains all nodes which have betweenness 0, i.e. those through which no shortest path passes. From class 2 to 5 we have nodes of increasing betweenness. So, nodes in class 3 have a medium-low betweenness centrality and nodes in class 5 are the most central nodes in the network. At this point, we can plot the average reaction times for nodes belonging to different classes in the two networks.
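One way to build such classes is sketched below. The zero-betweenness rule for class 1 comes from the description above. Splitting the remaining nodes into four equally sized groups by rank is my assumption about the binning (the paper may handle ties and leftovers differently), and the function name and scores are invented.

```python
def betweenness_classes(scores):
    """Assign class 1 to zero-betweenness nodes, classes 2-5 to the rest by rank."""
    classes = {v: 1 for v in scores if scores[v] == 0}
    # Remaining nodes, from the least to the most central.
    positive = sorted((v for v in scores if scores[v] > 0), key=lambda v: scores[v])
    for rank, v in enumerate(positive):
        # ASSUMED binning: four groups of (nearly) equal size.
        classes[v] = 2 + rank * 4 // len(positive)
    return classes

# Hypothetical betweenness scores for ten nodes.
scores = {"a": 0, "b": 0, "c": 0.1, "d": 0.2, "e": 0.3,
          "f": 0.4, "g": 0.5, "h": 0.6, "i": 0.7, "j": 0.8}
classes = betweenness_classes(scores)
```

With these toy scores, “a” and “b” land in class 1, and each of the classes from 2 to 5 gets exactly two nodes, with “i” and “j” as the most central ones.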


The first thing that jumps out is that Enron’s communications (on the left) are much more dependent on the node’s characteristics (whether the characteristic is degree or betweenness doesn’t seem to matter) than the Newsgroup ones, given the higher spread. But the interesting bit, at least for me, comes when you only look at betweenness centrality – the dashed line with crosses. Nodes with low (class 2) and medium-low (class 3) betweenness centrality have low reaction times, while more central nodes have significantly higher reaction times. Note that the classes have the same number of nodes in them, so we are not looking at statistical aberrations*. This does not happen in Newsgroups, due to the different nature of the communication in there: corporate in Enron versus topic-driven in Newsgroup.

The result carries some counterintuitive implications. In a corporate communication network the shortest path is not the fastest. In other words, don’t let your train pass through the central hub for a shortcut, ’cause it’s going to stay there for a long long time. It looks like people’s brains are less elastic than our train stations. You can’t add more platforms and personnel to make more things pass through them: if your communication network has large hubs, they are going to work slower. Surprisingly, this does not hold for the degree (solid line): it doesn’t seem to matter how many people you interact with, only whether you are the person through which many shortest paths pass.

I can see myself trying to bring this line of research back from the dead. This premature paper needs quite some sanity checks (understatement alert), but it can go a long way. It can be part of the manual on how to build an efficient communication culture in your organization. Don’t overload people. Don’t create over-important nodes in the network, because you can’t allow all your communications to pass through them. Do keep in mind that your team is not clockwork, it’s brain-work. And brains don’t work like clocks.


* That’s also the reason to ditch class 1: it contains outliers and it is not comparable in size to the other classes.

 
