28 May 2018 ~ 0 Comments

Mapping the International Aid Community

A few years ago (2013, God I’m old), I was talking to you on how to give a “2.0” flavor to international aid: at CID we created an Aid Explorer to better coordinate the provision of humanitarian work. I’m happy to report that “Aid Explorer” has been adopted — directly or indirectly —  by multiple international organizations, for instance USAID and the European Union. The World Bank’s Independent Evaluation Group contacted me to make an updated version, focused on estimating the World Bank’s position in the global health arena. The result is a paper, “Mapping the international health aid community using web data“, recently published in EPJ Data Science, and the product of a great collaboration with Katsumasa Hamaguchi, Maria Elena Pinglo, and Antonio Giuffrida.

The idea is to collect all the webpages of a hundred international aid organizations, looking for specific keywords and for hyperlinks to the other organizations — differently from the old Aid Explorer in which we relied on the index from Google. The aim is to create different networks of co-occurrences connecting:

  • Aid organizations co-mentioned in the same page;
  • Aid organizations mentioned or linked by another;
  • Issues co-mentioned in the same page;
  • Countries co-mentioned in the same page.

We then analyze these structures to learn something about the community as a whole.

One thing I didn’t expect was that organizations cluster by type. The “type” here is the force behind the organization — private philanthropy, UN system, bilateral (a single country’s aid extension of the foreign ministry), multilateral (international co-operations acting globally), etc. In the picture above (click on the image to enlarge), we encode the agency type in the node color. Organizations are overwhelmingly co-mentioned with organizations of the same type, which is curious because bilaterals often have nothing in common with each other besides the fact they are bilaterals: they work on different issues, with different developed and developing partners.

We can make a similar map connecting issues if they are co-mentioned in a web page. The map is useful as validation because it connects some “synonyms”, for instance “TB” and “Tubercolosis”. However, you can do much more with it. For instance, you can use color to show where an organization is most often cited. Below (click on the image to enlarge) you see the issue map for the World Bank, with the red nodes showing the issues strongly co-mentioned with the World Bank. Basically, the node color is the edge weight in a organization-issue bipartite network, where the organization is the World Bank. To give you an idea, the tiny “Infant Survival” node on the right saw the World Bank mentioned in 9% of the pages in which it was discussed. The World Bank was mentioned in 3.8% of web pages discussing AIDS.

This can lead to interesting discussions. While the World Bank does indeed discuss a lot about some of the red issues above — for instance about “Health Market” and “Health Reform” — its doesn’t say much about “Infant Survival”, relatively speaking at least. It’s intriguing that other organizations mention this particular issue so often in conjunction with the World Bank.

This difference between the global speech about issues and the one specific to another organization allows us to calculate two measures we call “Alignment” and “Impact”. By analyzing how similar the issue co-occurrence network of an organization is with the global one — a simple correlation of edge weights — we can estimate how “Aligned” it is with the global community. On the other hand, an “Impactful” organization is one that, were it to disappear, would dramatically change the global issue network: issues would not be co-mentioned that much.

In the plot above, we have Alignment and Impact on the x and y axis, respectively. The horizontal and vertical lines cutting through the plot above show the median of each measure. The top-right quadrant are organization both impactful and aligned: the organizations that have probably been setting the discourse of the international aid community. No wonder the World Health Organization is there. On the top left we have interesting mavericks, the ones which are not aligned to the community at large, and yet have an impact on it. They are trying to shape the international aid community into something different than what it is now.

A final fun — if a bit loose — analysis regards the potential for an organization to spread a message through the international aid network. What would be the reach of a message if it originated from a specific organization? We can use the Susceptible-Infected model from epidemiology. A message is a “virus” and it is more likely to infect an agency if more than a x% of the agency’s incoming links come from other infected agencies.

This depends on the issue, as shown above. In the figures we see the fraction of “infected” agencies (on the y-axis) given an original “patient zero” organization which starts spreading the message. To the left we see the result of the simulation aggregating all issues. The World Bank reaches saturation faster than UNICEF, and USAID is only heard by a tiny fraction of the network. However, if we only consider web pages talking about “Nurses” (right), then USAID is on par with the top international aid organizations — and UNICEF beats the World Bank. This happens because the discourse on the topic is relatively speaking more concentrated in USAID than average.

As with the Aid Explorer, this is a small step forward improving the provision of international aid. We do not have an interactive website this time, but you can download both the data and the code to create your own maps. Ideally, what we did only for international aid keywords can be extended for all other topics of interest in the humanitarian community: economic development, justice, or disaster relief.

Continue Reading

28 August 2014 ~ 0 Comments

The Curious World of Network Mapping

Complex networks can come in different flavors. As you know if you follow this blog, my signature dish is multilayer/multidimensional networks: networks with multiple edge types. One of the most popular types is bipartite networks. In bipartite networks, you have two types of nodes. For example, you can connect users of Netflix to the movies they like. As you can see from this example, in bipartite networks we allow only edges going from one type of nodes to the other. Users connect to movies, but not to other users, and movies can’t like other movies (movies are notoriously mean to each other).

m1

Many things (arguably almost everything) can be represented as a bipartite network. An occupation can be connected to the skills and/or tasks it requires, an aid organization can be connected to the countries and/or the topics into which it is interested, a politician is connected to the bills she sponsored. Any object has attributes. And so it can be represented as an object-attribute bipartite network. However, most of the times you just want to know how similar two nodes of the same type are. For example, given a movie you like, you want to know a similar movie you might like too. This is called link prediction and there are two ways to do this. You could focus on predicting a new user-movie connection, or focus instead on projecting the bipartite network to discover the previously unknown movie-movie connections. The latter is the path I chose, and the result is called “Network Map”.

It is clearly the wrong choice, as the real money lies in tackling the former challenge. But if I wanted to get rich I wouldn’t have chosen a life in academia anyway. The network map, in fact, has several advantages over just predicting the bipartite connections. By creating a network map you can have a systemic view of the similarities between entities. The Product Space, the Diseasome, my work on international aid. These are all examples of network maps, where we go from a bipartite network to a unipartite network that is much easier to understand for humans and to analyze for computers.

ps4

Creating a network map, meaning going from a user-movie bipartite network to a movie-movie unipartite network, is conceptually easy. After all, we are basically dealing with objects with attributes. You just calculate a similarity between these attributes and you are done. There are many similarities you can use: Jaccard, Pearson, Cosine, Euclidean distances… the possibilities are endless. So, are we good? Not quite. In a paper that was recently accepted in PLoS One, Muhammed Yildirim and I showed that real world networks have properties that make the general application of any of these measures quite troublesome.

For example, bipartite networks have power-law degree distributions. That means that a handful of attributes are very popular. It also means that most objects have very few attributes. You put the two together and, with almost 100% probability, the many objects with few attributes will have the most popular attributes. This causes a great deal of problems. Most statistical techniques aren’t ready for this scenario. Thus they tend to clutter the network map, because they think that everything is similar to everything else. The resulting network maps are quite useless, made of poorly connected dense areas and lacking properties of real world networks, such as power-law degree distributions and short average path length, as shown in these plots:

m2

m3

Of course sometimes some measure gets it right. But if you look closely at the pictures above, the only method that consistently give the shortest paths (above, when the peak is on the left we are good) and the broadest degree distributions (below, the rightmost line at the end in the lower-right part of the plot is the best one) is the red line of “BPR”. BPR stands for “Bipartite Projection via Random-walks” and it happens to be the methodology that Muhammed and I invented. BPR is cool not only because its network maps are pretty. It is also achieving higher scores when using the network maps to predict the similarity between objects using ground truth, meaning that it gives the results we expect when we actually already know the answers, that are made artificially invisible to test the methodology. Here we have the ROC plots, where the highest line is the winner:

m4

So what makes BPR so special? It all comes down to the way you discount the popular attributes. BPR does it in a “network intelligent” way. We unleash countless random walkers on the bipartite network. A random walker is just a process that starts from a random object of the network and then it jumps from it to one of its attributes. The target attribute is chosen at random. And then the walker jumps back to an object possessing that attribute, again choosing it at random. And then we go on. At some point, we start from scratch with a new random walk. We note down how many times two objects end up in the same random walk and that’s our similarity measure. Why does it work? Because when the walker jumps back from a very popular attribute, it could essentially go to any object of the network. This simple fact makes the contribution of the very popular attributes quite low.

BPR is just the latest proof that random walks are one of the most powerful tools in network analysis. They solve node ranking, community discovery, link prediction and now also network mapping. Sometimes I think that all of network science is founded on just one algorithm, and that’s random walks. As a final note, I point out that you can create your own network maps using BPR. I put the code online (the page still bears the old algorithm’s name, YCN). That’s because I am a generous coder.

Continue Reading

15 April 2013 ~ 0 Comments

Aid 2.0

After the era of large multinational empires (British, Spanish, Portuguese  French), the number of sovereign states exploded. The international community realized that many states were being left behind in their development efforts. A new problem, international development, was created and nobody really had a clue about how to solve it. Eventually, the solution started by international organizations such as the UN or the World Bank culminated on the Millennium Development Goals (MDGs): a set of general objectives that humanity decided to achieve. The MDGs are obviously very noble. Nobody can argue against eradicating hunger or promoting gender equality. The real problem is that the logic that produced them is quite flawed. Some thousands of people met around 2000 and decided that those eight points were the most important global issues. That was probably even true, but what about particular countries, where none of the eight MDGs is crucial, but a ninth is? More importantly: why the hell am I talking about this?

I am talking about this because, not surprisingly, network science can provide a useful perspective on this topic. And it did, in a paper that I co-authored with Ricardo Hausmann and César Hidalgo, at the Center for International Development in Boston. In the paper we explain that the logic behind MDGs is a classical top-down, or strictly hierarchical, one: there are few centers where all information is collected and these centers direct all efforts towards the most important problems. This implies that (see the above picture):

  1. The information generated at the bottom level passes through several steps to get to the top, in a perverted telephone game where some information is lost and some noise is introduced;
  2. If some organization at the bottom level wants to coordinate with somebody else at the same level, it has to pass through several levels even before starting, instead of just creating a direct link.

In this world, if all funds for health are allocated to fighting HIV and child mortality, countries that do not have these problems but face, say, a cholera or a malaria epidemic are doomed to be left behind.

What it is really necessary is a mechanism with which aid organizations can self-organize, by focusing on the issues they are related to and on the places where they are really needed, without broad and inefficient programs. In this world, a small world, everybody can establish a weak link to connect to anybody else, instead of relying on a cumbersome hierarchy. In an editorial in the Financial Times, Ricardo Hausmann used the Encyclopedia Britannica as a metaphor for representing the top-down approach of the MDGs, against the Wikipedia of a self-organized and distributed system.

The question now is: is it really possible to enable the self-organization of international aid? Or: how do we know what country is related to what development issue, and which organization has an expertise on it? Well, it is not an easy question to answer, but in our paper we try to address it. In the paper we describe a system, based on web crawling (i.e. systematically downloading web pages), that capture the number of times each aid organization mentions an issue or a country in its public documents. That is no different from what Google does with the entire web: creating a global knowledge index that is at your fingertips.

Using this strategy, we can create network maps, like the one above (click to see a higher resolution version), to understand what is the current structure of aid development. We are also able to match aid organizations, developing countries and development issues according to how closely they are related to each other. The possible combinations are still quite high, so to actually use our results it is necessary to create a nice visualization tool. And that’s another thing we did: the Aid Explorer (developed and designed by yours truly).

In the Aid Explorer you can confront organizations, countries and issues and see if they are coordinating as they should. For example, you can check what are the issues related to Nordic Fund. Apparently, Microenterprise is a top priority. So, you can check how Nordic Fund relates to countries, according to how they are related to Microenterprise. That’s a good positive correlation! It means that indeed the Nordic Fund really relates most to the countries that are very related to Microenterprise. If we would have found a negative correlation that would have been bad, because it would have meant that Nordic Fund relates with the wrong countries. A general picture over all issues (or over all countries) of Nordic Fund can also be generated. Summing up these general pictures, we can generate rankings of organizations, countries and issues: the more high relevance and high correlation we observe together, the better.

Hopefully, this is the first step toward an ever more powerful Aid Explorer, that can help organizations to get the maximum bang for their buck and countries to get more visibility for their peculiar issues, without being overlooked by the international community because they are not acting in line with the MDG agenda.

Continue Reading

11 October 2012 ~ 0 Comments

The Product Space and Country Prosperity

As reported in different parts of this website, the group I am currently working in is called “Center for International Development”. The mission of this group, in the words of its head Ricardo Hausmann, is quite trivial and unambitious: to eradicate poverty from the world. Knowing how to do it is far from easy and there are different schools of thought about it. The one that Hausmann chose starts with understanding how production and economic growth work, i.e. why some industries are successful in a country and not in another.

To address this question, Hausmann (together with César A. Hidalgo, Bailey Klinger and Albert-Laslo Barabasi) developed the Product Space. The original idea has been published in this paper in 2007, far before I joined the group, but my (overestimated) expertise was heavily exploited to generate its current implementation (the Atlas of Economic Complexity, a book freely available in electronic format). For this reason I feel no shame in writing a post about it in this blog (and to share the Product Space itself in my Dataset page).

The fundamental assumption of the Product Space is the following: countries are able to have a comparative advantage in exporting a given product because they have the capabilities to export it. In abstract, it means: “I do this, because I can“. Quite reasonable. In pictures, with the awesome “A capability is a piece of LEGO” metaphor created by Hidalgo:

From this assumption, it follows that if a country can export two different products, it is because it has the capabilities to export both. Also this step is quite easy. The conclusion is immediate: a country’s development success is lead by its capabilities. The more capabilities the country has (meaning that its LEGO box is big and it contains a lot of pieces), the more capabilities it will be able to acquire, the faster it will grow.

There is a small problem with this conclusion: we can’t observe the capabilities. Of course, if we could then they would be blatantly obvious, so every country could employ its growth strategy based on them. (There is actually another problem: capabilities are tacit knowledge, as Nonaka and Takeuchi would say, so you really can’t teach them, but we will come to this problem later)

What we can observe is simply which countries export what:

But remember: if two products are exported by the same country, then there may be common capabilities needed for their productions; while if two countries export the same products, then they share at least part of the same capabilities. In mathematical terms, it means that the observed country export picture is actually the multiplication of the two halves of the second picture of the post, or:

This is nice because it means that we can collapse the country-product relationships into product-product relationships and then mapping which product is related to which other product, because it requires the same capabilities. The advice for countries is then: if you are exporting product x, then you are likely to have most of the capabilities to export all the products that are connected to x. And this is how the Product Space was born, a single picture expressing all these relationships:

(click on the picture for a higher resolution, or just browse the Atlas website, that is also dynamic).

In the picture the nodes are colored according to the community they belong to (for more information about communities, see a previous post). The communities make sense because they group products that intuitively require the same extended set of capabilities: in cyan we have the electronic products, light blue is machinery, green is garments and so on and so forth.

Is the structure of the Product Space telling us something reliable? Yes, countries are way more likely to start export products that are close, in the Product Space, to the products they already export. Also, it is important to know where the export products of a country are in the Product Space. The more present a country is in the denser cores of the Product Space, the more complex it is said to be. This measure of complexity is a better predictor of GDP growth than classical measures used in political economy like average years of schooling.

Is the structure of the Product Space telling us something interesting? Hell yeah, although it’s not nice to hear. The Product Space has communities, so if you export a product belonging to a community then you have a lot of options to expand inside the community. However, many products are outside the communities, and they are very weakly connected with the rest, often through long chains. The meaning? If you are only exporting those products, you are doomed to not grow, because there is no way that you’ll suddenly start exporting products of a community from nothing (because this would require tacit knowledge that you cannot learn and is not close to what you know). And guess what products the poorest countries are currently exporting.

To conclude, a couple of pictures to provide one proof of the reasoning above. In 1970, Peru had more average years of schooling, more land and twice the GDP per capita of South Korea. Traditional political economy would say that Peru was strong and there was no way that South Korea could catch up. Where South Korea is today is evident. What was the difference between the two countries in terms of Product Space?

(again, click for higher resolution. The black square border indicates which products the country is exporting). There is not a lot of difference in quantities (and this explains why South Korea was poorer). However, South Korea had those two or three products in a very valuable position, while Peru had only products in the branches: a long way to the core. In 2003, this was the result:

Peru is still mainly on the edges, South Korea occupies the center. And that’s all.

Continue Reading