Archive | Development

30 November 2016 ~ 0 Comments

Exploring the Uncharted Export

Exporting goods is great for countries: it is a way to attract foreign currency. Exports are also fairly easy to analyze, since they are put in big crates and physically shipped through borders, where they are usually triple checked*. However, there is another way to attract foreign currency that escapes this analytical convenience. And it is a huge one. Tourism. When tourists get inside your country, you are effectively exporting something: anything that they buy. Finding out exactly what and how much you’re exporting is tricky. Some things are easy: hotels, vacation resorts, and the like. Does that cover all they buy? Probably not.

Investigating this question is what I decided to do with Ricardo Hausmann and Frank Neffke in our new CID Working Paper “Exploring the Uncharted Export: An Analysis of Tourism-Related Foreign Expenditure with International Spend Data“. The paper analyzes tourism with a new and unique dataset. The MasterCard Center for Inclusive Growth endowed us with a data grant, sharing with us anonymized and aggregated transaction data giving us insights about the spend behavior of foreigners inside two countries, Colombia and the Netherlands.


The first thing to clear is the question: does tourism really matter? Tourism might be huge for some countries — Seychelles or Bahamas come to mind** — but does it matter for any other country? Using World Bank estimates — which we’ll see they are probably underestimations — we can draw tourism as the number one export of many countries. Above you see two treemaps (click on them to enlarge) showing the composition of the export basket of Zimbabwe and Spain. The larger the square the more the country makes exporting that product. Tourism would add a square larger than tobacco for Zimbabwe, and twice as big as cars for Spain. Countries make a lot of money out of tourism, so it is crucial to have a more precise way to investigate it.


How do we measure tourism? As said before, we’re working with anonymized and aggregated transaction data. In practice, for each postal code of the country of destination we can know how many cards and how much expenditure in total happened in different retail sectors. We focus on cards which were issued outside the country we are analyzing. This way we can be confident we are capturing mostly foreign expenditures. There are many caveats to keep in mind which affect our results: we do not see cash expenditures, we have only a non-random sample from MasterCard cards, and so on. However, when we look at maps showing us the dollar intensity in the various parts of the country (above for Colombia and the Netherlands — click on them to enlarge), we find comforting validation with external data: the top six tourism destinations as reported by Trip Advisor always correspond to areas where we see a lot of activity also in our data.


We also see an additional thing, and it turns out to be related to the advantage of our data over traditional tourism reports. A lot is happening on the border. In fact, the second most popular Colombian city after Bogotà is Cucuta. If you never heard of Cucuta it just means that you are not from Colombia, or Venezuela: Cucuta is a city on the northeastern border of the country. It is the place where many Venezuelan cross the border to do shopping, representing a huge influx of cash for Colombia. Until the border got closed, at least (the data is from before this happened, now it’s open again). In the Netherlands, you can cluster municipalities according to the dominant foreign countries observed there — see map above. You will find a Belgian cluster, for instance (in purple). This cluster is dominated by grocery and shopping.


While these Belgian shoppers are probably commuters rather than tourists, they are nevertheless bringing foreign currency to local coffers, so that’s great. And it is something not really captured by alternative methodologies. We classify a merchant type as “commuting” if it is predominant in the purple cluster, because it is more popular for “local” Belgian travelers. Everything else is either “tourism” — if it is predominant in the other non-border municipalities –, or “other” if there is no clear dominance anywhere. In the tourism cluster you find things like “Accommodations” and “Travel Agencies and Tour Operators”; in the commuting cluster you have merchants classified under “Automotive Retail” and “Pet Stores”. When you look at the share of expenditures going to the commuting cluster (above in green), you realize how significant this is. One out of four foreign dollars spent in the Netherlands go to non-tourism related activities. The share for Colombia goes up to 30%.


A post in this blog would not be complete without a gratuitous network visualization, so here we are. What you see is what we call “Origin Space” for Colombia (above) and the Netherlands (below). Nodes are countries of origin, and they are connected if the tourists from these countries behave similarly in what and where they make their purchases. The color of the node tells you the continent of the country. The size is the presence of tourists in the country of destination, relative to the origin’s GDP. The size and color of the edge is proportional to how similar the two origins are (orange = very similar; blue = similar, but not so much). We can see that Colombia has a lot of large red nodes — from the Americas — and the Netherlands is strong in blue nodes — from Europe.

If you click on the picture and zoom into the Colombia network you will see why this kind of visualization is useful. Colombia is fairly well-placed in the Australian market: the corresponding node is quite large. A thing then jumps to the eye. Australia has a large and very orange connection. To New Zealand. No surprise: Australians and New Zealanders are similar. Yet, the New Zealand node is much smaller than Australia. It shouldn’t be: these are relative expenditures. This means that, for some reason, Colombia is not currently an appealing destination for New Zealanders, even if it should, based on their similarity with Australians. New Zealand should then be a target of further investigation, which might lead to the untapping of a new potential market for Colombian tourism.

And this concludes the reasons why this data is so amazing to study tourism. To wrap up the message, we have first validated the data, showing that it agrees with what we expect being the most important tourism destinations of a country. Then, we unleashed its potential: the ability to detect “non-tourism” foreign cash inflows, and the promising initial development of tools to discover potential missing opportunities.

* The process is not foolproof, thought. I don’t remember where I read it, but it seems that if you sum all declared exports of countries and all declared imports, and subtract the two, you get a quite high positive number. I wonder where all these extra exports are going. Mars?

** When I was told we were doing a tourism project I got high hopes. I’m still waiting for a fully paid work mission to be approved by management. Any day now.

Continue Reading

07 July 2016 ~ 0 Comments

Building Data-Driven Development

A few weeks ago I had to honor to speak at my group’s  “Global Empowerment Meeting” about my research on data science and economic development. I’m linking here the Youtube video of my talk and my transcript for those who want to follow it. The transcript is not 100% accurate given some last minute edits — and the fact that I’m a horrible presenter :-) — but it should be good enough. Enjoy!

We think that the big question of this decade is on data. Data is the building blocks of our modern society. We think in development we are not currently using enough of these blocks, we are not exploiting data nearly as much as we should. And we want to fix that.

Many of the fastest growing companies in the world, and definitely the ones that are shaping the progress of humanity, are data-intensive companies. Here at CID we just want to add the entire world to the party.

So how do we do it? To fix the data problem development literature has, we focus on knowing how the global knowledge building looks like. And we inspect three floors: how does knowledge flow between countries? What lessons can we learn inside these countries? What are the policy implications?

To answer these questions, we were helped by two big data players. The quantity and quality of the data they collect represent a revolution in the economic development literature. You heard them speaking at the event: they are MasterCard – through their Center for Inclusive Growth – and Telefonica.

Let’s start with MasterCard, they help us with the first question: how does knowledge flow between countries? Credit card data answer to that. Some of you might have a corporate issued credit card in your wallet right now. And you are here, offering your knowledge and assimilating the knowledge offered by the people sitting at the table with you. The movements of these cards are movements of brains, ideas and knowledge.

When you aggregate this at the global level you can draw the map of international knowledge exchange. When you have a map, you have a way to know where you are and where you want to be. The map doesn’t tell you why you are where you are. That’s why CID builds something better than a map.

We are developing a method to tell why people are traveling. And reasons are different for different countries: equity in foreign establishments like the UK, trade partnerships like Saudi Arabia, foreign greenfield investments like Taiwan.

Using this map, it is easy to envision where you want to go. You can find countries who have a profile similar to yours and copy their best practices. For Kenya, Taiwan seems to be the best choice. You can see that, if investments drive more knowledge into a country, then you should attract investments. And we have preliminary results to suggest whom to attract: the people carrying the knowledge you can use.

The Product Space helps here. If you want to attract knowledge, you want to attract the one you can more easily use. The one connected to what you already know. Nobody likes to build cathedrals in a desert. More than having a cool knowledge building, you want your knowledge to be useful. And used.

There are other things you can do with international travelers flows. Like tourism. Tourism is a great export: for many countries it is the first export. See these big portion of the exports of Zimbabwe or Spain? For them tourism would look like this.

Tourism is hard to pin down. But it is easier with our data partners. We can know when, where and which foreigners spend their money in a country. You cannot paint pictures as accurate as these without the unique dataset MasterCard has.

Let’s go to our second question: what lessons can we learn from knowledge flows inside a country? Telefonica data is helping answering this question for us. Here we focus on a test country: Colombia. We use anonymized call metadata to paint the knowledge map of Colombia, and we discover that the country has its own knowledge departments. You can see them here, where each square is a municipality, connecting to the ones it talks to. These departments correlate only so slightly with the actual political boundaries. But they matter so much more.

In fact, we asked if these boundaries could explain the growth in wages inside the country. And they seem to be able to do it, in surprisingly different ways. If you are a poor municipality in a rich state in Colombia, we see your wage growth penalized. You are on a path of divergence.

However, if you are a poor municipality and you talk to rich ones, we have evidence to show that you are on a path of convergence: you grow faster than you expect to. Our preliminary results seem to suggest that being in a rich knowledge state matters.

So, how do you use this data and knowledge? To do so you have to drill down at the city level. We look not only at communication links, but also at mobility ones. We ask if a city like Bogota is really a city, or different cities in the same metropolitan area. With the data you can draw four different “mobility districts”, with a lot of movements inside them, and not so many across them.

The mobility districts matter, because combining mobility and economic activities we can map the potential of a neighborhood, answering the question: if I live here, how productive can I be? A lot in the green areas, not so much in the red ones.

With this data you can reshape urban mobility. You know where the entrance barriers to productivity are, and you can destroy them. You remodel your city to include in its productive structure people that are currently isolated by commuting time and cost. These people have valuable skills and knowhow, but they are relegated in the informal sector.

So, MasterCard data told us how knowledge flows between countries. Telefonica data showed the lessons we can learn inside a country. We are left with the last question: what are the policy implications?

So far we have mapped the landscape of knowledge, at different levels. But to hike through it you need a lot of equipment. And governments provide part of that equipment. Some in better ways than others.

To discover the policy implications, we unleashed a data collector program on the Web. We wanted to know how the structure of the government in the US looks like. Our program returned us a picture of the hierarchical organization of government functions. We know how each state structures its own version of this hierarchy. And we know how all those connections fit together in the union, state by state. We are discovering that the way a state government is shaped seems to be the result of two main ingredients: where a state is and how its productive structure looks like.

We want to establish that the way a state expresses its government on the Web reflects the way it actually performs its functions. We seem to find a positive answer: for instance having your environmental agencies to talk with each other seems to work well to improve your environmental indicators, as recorded by the EPA. Wiring organization when we see positive feedback and rethinking them when we see a negative one is a direct consequence of this Web investigation.

I hope I was able to communicate to you the enthusiasm CID discovered in the usage of big data. Zooming out to gaze at the big picture, we start to realize how the knowledge building looks like. As the building grows, so does our understanding of the world, development and growth. And here’s the punchline of CID: the building of knowledge grows with data, but the shape it takes is up to what we make of this data. We chose to shape this building with larger doors, so that it can be used to ensure a more inclusive world.

By the way, the other presentations of my session were great, and we had a nice panel after that. You can check out the presentations in the official Center for International Development Youtube channel. I’m embedding the panel’s video below:

Continue Reading

18 November 2015 ~ 0 Comments

Evaluating Prosperity Beyond GDP

When reporting on economics, news outlets very often refer to what happens to the GDP. How is policy X going to affect our GDP? Is the national debt too high compared to GDP? How does my GDP compare to yours? The concept lurking behind those three letters is the Gross Domestic Product, the measure of the gross value added by all domestic producers in a country. In principle, the idea of using GDP to take the pulse of an economy isn’t bad: we count how much we can produce, and this is more or less how well we are doing. In practice, today I am jumping on the huge bandwagon of people who despise GDP for its meaningless, oversimplified and frankly suspicious nature. I will talk about a paper in which my co-authors and I propose to use a different measure to evaluate a country’s prosperity. The title is “Going Beyond GDP to Nowcast Well-Being Using Retail Market Data“, my co-authors are Riccardo Guidotti, Dino Pedreschi and Diego Pennacchioli, and the paper will be presented at the Winter edition of the Network Science Conference.

GDP is gross for several reasons. What Simon Kuznets said resonates strongly with me, as already in the 30s he was talking like a complexity scientist:

The valuable capacity of the human mind to simplify a complex situation in a compact characterization becomes dangerous when not controlled in terms of definitely stated criteria. With quantitative measurements especially, the definiteness of the result suggests, often misleadingly, a precision and simplicity in the outlines of the object measured. Measurements of national income are subject to this type of illusion and resulting abuse, especially since they deal with matters that are the center of conflict of opposing social groups where the effectiveness of an argument is often contingent upon oversimplification.


In short, GDP is an oversimplification, and as such it cannot capture something as complex as an economy, or the multifaceted needs of a society. In our paper, we focus on some of its specific aspects. Income inequality skews the richness distribution, so that GDP doesn’t describe how the majority of the population is doing. But more importantly, it is not possible to quantify well-being just with the number of dollars in someone’s pocket: she might have dreams, aspirations and sophisticated needs that bear little to no correlation with the status of her wallet. And even if GDP was a good measure, it’s very hard to calculate: it takes months to estimate it reliably. Nowcasting it would be great.

And so we tried to hack our way out of GDP. The measure we decided to use is the one of customer sophistication, that I presented several times in the past. In practice, the measure is a summary of the connectivity of a node in a bipartite network*. The bipartite network connects customers to the products they buy. The more variegated the set of products a customer buys, the more complex she is. Our idea was to create an aggregated version at the network level, and to see if this version was telling us something insightful. We could make a direct correlation with the national GDP of Italy, because the data we used to calculate it comes from around a half million customers from several Italian regions, which are representative of the country as a whole.


The argument we made goes as follows. GDP stinks, but it is not 100% bad, otherwise nobody would use it. Our sophistication is better, because it is connected to the average degree with which a person can satisfy her needs**. Income inequality does not affect it either, at least not in trivial ways as it does it with GDP. Therefore, if sophistication correlates with GDP it is a good measure of well-being: it captures part of GDP and adds something to it. Finally, if the correlation happens with some anticipated temporal shift it is even better, because GDP pundits can just use it as instantaneous nowcasting of GDP.

We were pleased when our expectations met reality. We tested several versions of the measure at several temporal shifts — both anticipating and following the GDP estimate released by the Italian National Statistic Institute (ISTAT). When we applied the statistical correction to control for the multiple hypothesis testing, the only surviving significant and robust estimate was our customer sophistication measure calculated with a temporal shift of -2, i.e. two quarters before the corresponding GDP estimate was released. Before popping our champagne bottles, let me write an open letter to the elephant in the room.


As you see from the above chart, there are some wild seasonal fluctuations. This is rather obvious, but controlling for them is not easy. There is a standard approach — the X-13-Arima method — which is more complicated than simply averaging out the fluctuations. It takes into account a parameter tuning procedure including information we simply do not have for our measure, besides requiring observation windows longer than what we have (2007-2014). It is well possible that our result could disappear. It is also possible that the way we calculated our sophistication index makes no sense economically: I am not an economist and I do not pretend for a moment that I can tell them how to do their job.

What we humbly report is a blip on the radar. It is that kind of thing that makes you think “Uh, that’s interesting, I wonder what it means”. I would like someone with a more solid skill set in economics to take a look at this sophistication measure and to do a proper stress-test with it. I’m completely fine with her coming back to tell me I’m a moron. But that’s the risk of doing research and to try out new things. I just think that it would be a waste not to give this promising insight a chance to shine.

* Even if hereafter I talk only about the final measure, it is important to remark that it is by no means a complete substitute of the analysis of the bipartite network. Meaning that I’m not simply advocating to substitute a number (GDP) for another (sophistication), rather to replace GDP with a fully-blown network analysis.

** Note that this is a revealed measure of sophistication as inferred by the products actually bought and postulating that each product satisfies one or a part of a “need”. If you feel that the quality of your life depends on you being able to bathe in the milk of a virgin unicorn, the measure will not take into account the misery of this tacit disappointment. Such are the perils of data mining.

Continue Reading

31 July 2014 ~ 0 Comments

The (Not So) Little Shop of Horrors

For this end of July, I want to report some juicy facts about a work currently under development. Why? Because I think that these facts are interesting. And they are a bit depressing too. So instead of crying I make fun of them, because as the “Panic! At the Disco” would put it: I write sins, not tragedies.

So, a bit of background. Last year I got involved in an NSF project, with my boss Ricardo Hausmann and my good friend/colleague Prof. Stephen Kosack. Our aim is to understand what governments do. One could just pull out some fact-sheets and budgets, but in our opinion those data sources do not tell the whole story. Governments are complex systems and as complex systems we need to understand their emergent properties as collection of interacting parts. Long story short, we decided to collect data by crawling the websites of all public agencies for each US state government. As for why, you’ll have to wait until we publish something: the aim of this post is not to convince you that this is a good idea. It probably isn’t, at least in some sense.


It isn’t a good idea not because that data does not make sense. Au contraire, we already see it is very interesting. No, it is a tragic idea because crawling the Web is hard, and it requires some effort to do it properly. Which wouldn’t be necessarily a problem if there wasn’t an additional hurdle. We are not just crawling the Web. We are crawling government websites. (This is when in a bad horror movie you would hear a thunder nearby).

To make you understand the horror of this proposition is exactly the aim of this post. First, how many government websites are really out there? How to collect them? Of course I was not expecting a single directory for all US states. And I was wrong! Look for yourselves this beauty: The “About” page is pure poetry:

State and Local Government on the Net is the onle (sic) frequently updated directory of links to government sponsored and controlled resources on the Internet.

So up-to-date that their footer gets only to 2010 and their news section only includes items from 2004. It also points to:

SLGN Notes, a weblog, [that] was added to the site in June 2004. Here, SLGN’s editors comment on new, redesigned or updated state and local government websites, pointing out interesting or fun features for professional and consumer audiences alike and occasionally cover related news.

Yeah, go ahead and click the link, that is not the only 404 page you’ll see here.


Enough compliments to these guys! Let’s go back to work. I went straight to the 50 different state government’s websites and found in all of them an agency directory. Of course asking that these directories shared the same structure and design is too much. “What are we? Organizations whose aim is to make citizen’s life easier or governments?”. In any case from them I was able to collect the flabbergasting amount of 61,584 URLs. Note that this is six times as much as from, and it took me a week. Maybe I should start my own company :-)

Awesome! So it works! Not so fast. Here we hit the first real wall of government technological incompetence. Out of those 61,584, only 50,999 actually responded to my pings. Please note that I already corrected all the redirects: if the link was outdated but the agency redirected you to the new URL, then that connection is one of the 50,999. Allow me to rephrase it in poetry: in the state government directories there are more than ten thousand links that are pure, utter, hopeless garbage nonsense. More than one out of six links in those directories will land you exactly nowhere.


Oh, but let’s stay positive! Let’s take a look at the ones that actually lead you somewhere:

  • Inconsistent spaghetti-like design? Check. Honorable mention for the good ol’ frameset webdesign of
  • Making your website an image and use <area> tag for links? Check. That’s some solid ’95 school.
  • Links to websites left to their own devices and purchased by someone else? Check. Passing it through Google Translate, it provides pearls of wisdom like: “To say a word and wipe, There are various wipe up for the wax over it because wipe from”. Maybe I can get a half dozen of haiku from that page. (I have more, if you want it)
  • ??? Check. That’s a Massachusetts town I do not want to visit.
  • Maintenance works due to finish some 500 days ago? Check.
  • Websites mysteriously redirected somewhere else? Check. The link should go to, but alas it does not. I’m not even sure what the heck these guys are selling.
  • These aren’t the droids you’re looking for? Check.
  • The good old “I forgot to renew the domain contract”? Check.

Bear in mind that this stuff is part of the 50,999 “good” URLs (these are scare quotes at their finest). At some point I even gave up noting down this stuff. I saw hacked webpages that had been there since years. I saw an agency providing a useful Google Maps of their location, which according to them was the middle of the North Pole. But all those things will be lost in time, like tears in the rain.

Continue Reading

15 April 2013 ~ 0 Comments

Aid 2.0

After the era of large multinational empires (British, Spanish, Portuguese  French), the number of sovereign states exploded. The international community realized that many states were being left behind in their development efforts. A new problem, international development, was created and nobody really had a clue about how to solve it. Eventually, the solution started by international organizations such as the UN or the World Bank culminated on the Millennium Development Goals (MDGs): a set of general objectives that humanity decided to achieve. The MDGs are obviously very noble. Nobody can argue against eradicating hunger or promoting gender equality. The real problem is that the logic that produced them is quite flawed. Some thousands of people met around 2000 and decided that those eight points were the most important global issues. That was probably even true, but what about particular countries, where none of the eight MDGs is crucial, but a ninth is? More importantly: why the hell am I talking about this?

I am talking about this because, not surprisingly, network science can provide a useful perspective on this topic. And it did, in a paper that I co-authored with Ricardo Hausmann and César Hidalgo, at the Center for International Development in Boston. In the paper we explain that the logic behind MDGs is a classical top-down, or strictly hierarchical, one: there are few centers where all information is collected and these centers direct all efforts towards the most important problems. This implies that (see the above picture):

  1. The information generated at the bottom level passes through several steps to get to the top, in a perverted telephone game where some information is lost and some noise is introduced;
  2. If some organization at the bottom level wants to coordinate with somebody else at the same level, it has to pass through several levels even before starting, instead of just creating a direct link.

In this world, if all funds for health are allocated to fighting HIV and child mortality, countries that do not have these problems but face, say, a cholera or a malaria epidemic are doomed to be left behind.

What it is really necessary is a mechanism with which aid organizations can self-organize, by focusing on the issues they are related to and on the places where they are really needed, without broad and inefficient programs. In this world, a small world, everybody can establish a weak link to connect to anybody else, instead of relying on a cumbersome hierarchy. In an editorial in the Financial Times, Ricardo Hausmann used the Encyclopedia Britannica as a metaphor for representing the top-down approach of the MDGs, against the Wikipedia of a self-organized and distributed system.

The question now is: is it really possible to enable the self-organization of international aid? Or: how do we know what country is related to what development issue, and which organization has an expertise on it? Well, it is not an easy question to answer, but in our paper we try to address it. In the paper we describe a system, based on web crawling (i.e. systematically downloading web pages), that capture the number of times each aid organization mentions an issue or a country in its public documents. That is no different from what Google does with the entire web: creating a global knowledge index that is at your fingertips.

Using this strategy, we can create network maps, like the one above (click to see a higher resolution version), to understand what is the current structure of aid development. We are also able to match aid organizations, developing countries and development issues according to how closely they are related to each other. The possible combinations are still quite high, so to actually use our results it is necessary to create a nice visualization tool. And that’s another thing we did: the Aid Explorer (developed and designed by yours truly).

In the Aid Explorer you can confront organizations, countries and issues and see if they are coordinating as they should. For example, you can check what are the issues related to Nordic Fund. Apparently, Microenterprise is a top priority. So, you can check how Nordic Fund relates to countries, according to how they are related to Microenterprise. That’s a good positive correlation! It means that indeed the Nordic Fund really relates most to the countries that are very related to Microenterprise. If we would have found a negative correlation that would have been bad, because it would have meant that Nordic Fund relates with the wrong countries. A general picture over all issues (or over all countries) of Nordic Fund can also be generated. Summing up these general pictures, we can generate rankings of organizations, countries and issues: the more high relevance and high correlation we observe together, the better.

Hopefully, this is the first step toward an ever more powerful Aid Explorer, that can help organizations to get the maximum bang for their buck and countries to get more visibility for their peculiar issues, without being overlooked by the international community because they are not acting in line with the MDG agenda.

Continue Reading

04 November 2012 ~ 0 Comments

Destroying Drug Traffic, One Query at a Time

in·tel·li·gence NOUN: a. The capacity to acquire and apply knowledge.

The intelligence process, like in Central Intelligence Agency, is the process any person or organization should go through when making important operative decisions. But this is a description of a perfect world. In reality, organizations have to face phenomena that are very complex. When the organization itself is significantly smaller than the complexity it has to face, its members have to rely on intuition, art or not solidly grounded decisions.

This is usually the case for the local police when facing organized crime. Large crime organizations like the Italian Camorra or the drug cartels in Mexico are usually international. If you read Roberto Saviano’s Gomorrah, you’ll realize that Camorra operates as far as Germany or Scotland, while drug cartels usually span from Colombia to the US passing through Mexico. On the other hand, most of their activities happen at the local level: kidnappings, killings, drug traffic. Their main adversary is not usually a broadly operating institution like the FBI, but the local police. But for the local police, to gather a satisfying amount of information to face them is usually hopeless.

With this problem in mind, I teamed up with a Mexican colleague of mine at Harvard, Viridiana Rios. Our aim was to develop a system to enable a cheap and cost-effective way to gather intelligence operations about criminal activities.

The problem with criminal activities is that they are not only part of the complex organism of organized crime. They are also usually hidden from the public. Of course, no head of a mafia family wants to conduct his business en plein air. However, whether he likes it or not, some of these activities reach the general public anyway. This happens because, “luckily”, bad news sells a lot of newspapers. Criminal activities usually leave a clear footprint in the news. Mexican drug traffic in this is also particular, for the tradition of leaving the so called narcomensajes. These messages are writings painted on walls or on highway billboards. They are used by the criminal organizations to threaten each other or the government and the police. A narcomensaje looks like this:

When we design a system for tracking the activities of crime organizations, we want this system to be as automatic as possible. Therefore, we use some computer science tricks and we rely on the information present on the websites of newspapers. Web knowledge has a lot of problems: it’s big, it’s about many different things and it’s subject to reliability concerns. However, Google News deals with most of these problems by carefully selecting topics and reliable sources. What was left for us to do, was to systematically query the system with its APIs and clean the results. The details of this process are in a paper presented by me this week at the Conference for Information and Knowledge Management (CIKM 2012).

We did not have any way to understand if our queries connecting drug traffickers to Mexican municipalities were capturing real connections. For this reason, we performed the very same task using Mexican state governors. With our great surprise, we were able to detect with high accuracy their real patterns of activities. (Not that we are drawing a parallel between organized crime and politics, just to be clear!) This indicates that our method of tracking people’s activities by using Google News data is valid. Here are some maps of some state governors. In red the municipalities where they are detected and with a large black border their state:

What did we find?

Mexican drug traffic follows a fat-tail distribution. The meaning? There is an incredible amount of municipalities with a weak drug traffic presence and some others are an explosive factory where the employees have to carry flamethrowers. Moreover, it really looks like a hydra: destroying one hub is likely just to generate another hub, or ten smaller hubs.

And the system is growing fast, jumping from one order of magnitude to a larger one in less than a decade.

We are also able to classify cartels with several features: how much they like to compete or to explore the territory. In the future, this may be used to predict where and when we will see a spike of activity for a particular drug cartel in a particular municipality (in the picture, the migration patter of the Los Zetas cartel).

Apart from the insights, our methodology really aids the intelligence problem, whenever there are no sufficient resources to perform an actual intelligence task. We used the case of criminal activities, but the system is fairly general: by using a list of something other than a drug cartel and something other than a Mexican municipality, you can bend the system to give you information about your favorite events.

Continue Reading

11 October 2012 ~ 0 Comments

The Product Space and Country Prosperity

As reported in different parts of this website, the group I am currently working in is called “Center for International Development”. The mission of this group, in the words of its head Ricardo Hausmann, is quite trivial and unambitious: to eradicate poverty from the world. Knowing how to do it is far from easy and there are different schools of thought about it. The one that Hausmann chose starts with understanding how production and economic growth work, i.e. why some industries are successful in a country and not in another.

To address this question, Hausmann (together with César A. Hidalgo, Bailey Klinger and Albert-Laslo Barabasi) developed the Product Space. The original idea has been published in this paper in 2007, far before I joined the group, but my (overestimated) expertise was heavily exploited to generate its current implementation (the Atlas of Economic Complexity, a book freely available in electronic format). For this reason I feel no shame in writing a post about it in this blog (and to share the Product Space itself in my Dataset page).

The fundamental assumption of the Product Space is the following: countries are able to have a comparative advantage in exporting a given product because they have the capabilities to export it. In abstract, it means: “I do this, because I can“. Quite reasonable. In pictures, with the awesome “A capability is a piece of LEGO” metaphor created by Hidalgo:

From this assumption, it follows that if a country can export two different products, it is because it has the capabilities to export both. Also this step is quite easy. The conclusion is immediate: a country’s development success is lead by its capabilities. The more capabilities the country has (meaning that its LEGO box is big and it contains a lot of pieces), the more capabilities it will be able to acquire, the faster it will grow.

There is a small problem with this conclusion: we can’t observe the capabilities. Of course, if we could then they would be blatantly obvious, so every country could employ its growth strategy based on them. (There is actually another problem: capabilities are tacit knowledge, as Nonaka and Takeuchi would say, so you really can’t teach them, but we will come to this problem later)

What we can observe is simply which countries export what:

But remember: if two products are exported by the same country, then there may be common capabilities needed for their productions; while if two countries export the same products, then they share at least part of the same capabilities. In mathematical terms, it means that the observed country export picture is actually the multiplication of the two halves of the second picture of the post, or:

This is nice because it means that we can collapse the country-product relationships into product-product relationships and then mapping which product is related to which other product, because it requires the same capabilities. The advice for countries is then: if you are exporting product x, then you are likely to have most of the capabilities to export all the products that are connected to x. And this is how the Product Space was born, a single picture expressing all these relationships:

(click on the picture for a higher resolution, or just browse the Atlas website, that is also dynamic).

In the picture the nodes are colored according to the community they belong to (for more information about communities, see a previous post). The communities make sense because they group products that intuitively require the same extended set of capabilities: in cyan we have the electronic products, light blue is machinery, green is garments and so on and so forth.

Is the structure of the Product Space telling us something reliable? Yes, countries are way more likely to start export products that are close, in the Product Space, to the products they already export. Also, it is important to know where the export products of a country are in the Product Space. The more present a country is in the denser cores of the Product Space, the more complex it is said to be. This measure of complexity is a better predictor of GDP growth than classical measures used in political economy like average years of schooling.

Is the structure of the Product Space telling us something interesting? Hell yeah, although it’s not nice to hear. The Product Space has communities, so if you export a product belonging to a community then you have a lot of options to expand inside the community. However, many products are outside the communities, and they are very weakly connected with the rest, often through long chains. The meaning? If you are only exporting those products, you are doomed to not grow, because there is no way that you’ll suddenly start exporting products of a community from nothing (because this would require tacit knowledge that you cannot learn and is not close to what you know). And guess what products the poorest countries are currently exporting.

To conclude, a couple of pictures to provide one proof of the reasoning above. In 1970, Peru had more average years of schooling, more land and twice the GDP per capita of South Korea. Traditional political economy would say that Peru was strong and there was no way that South Korea could catch up. Where South Korea is today is evident. What was the difference between the two countries in terms of Product Space?

(again, click for higher resolution. The black square border indicates which products the country is exporting). There is not a lot of difference in quantities (and this explains why South Korea was poorer). However, South Korea had those two or three products in a very valuable position, while Peru had only products in the branches: a long way to the core. In 2003, this was the result:

Peru is still mainly on the edges, South Korea occupies the center. And that’s all.

Continue Reading