Archive | Development

11 August 2020 ~ 0 Comments

The Effect of Shutting Down International Travels

Face-to-face interactions are a key component of knowledge transfer. Learning-by-doing, imitation, and tutoring are necessary tools for the acquisition of tacit knowledge: everything you need to know that cannot be encoded in a tool or in a manual. If you want to create something, the best way to do so is to be in direct contact with the people who can already do it. The proof is in business travels. Why would businesses spend a fortune — 1 trillion dollars in 2017 — to send their employees around the world ignoring the fact that we’re living in a telecommunication golden era? Because remote meetings don’t work. There are no substitutes for direct interactions. We cannot do without them. Except now we are forced to. So what’s the effect of shutting down international travels?

This is a question I set out to answer together with Frank Neffke and Ricardo Hausmann in the paper “Knowledge Diffusion in the Network of International Business Travel“, which has been published on Nature Human Behaviour. Of course, none of us took the hypothetical “international travel screeching to a halt” scenario seriously: we’re not precogs, it was merely an academic thought experiment. I must have, at some point, accidentally knocked over the lever that switched from simulation to reality. Oops.

We wanted to understand the effect of business travel on the development of new industrial activities in the countries receiving them. We did so by partnering with the MasterCard Center for Inclusive Growth, which provided access to aggregated and anonymized data based on foreign corporate card expenditures. The data allowed us to see how many corporate-issued spending cards were observed making transactions abroad in the 2011-2016 period. If a corporate card issued in Mexico made an expenditure in Colombia we can infer that it was due to a business travel — after some important data cleaning steps.*

Our problem was that we needed to gauge how many travelers from an industry reached a country, but we only had information about the country of origin. We solved this with a simple mathematical trick. We just assumed that the industries of a country were all equally likely to send out business travelers. Thus, if 20% of firms in Japan are car manufacturing plants, then 20% of business travelers from Japan are associated with the car manufacturing industry. This is rather naive and probably wrong — some industries are more likely to send travelers. But — if anything — this would dampen our results: we’re confident that, if we see any signal, that would actually be an underestimation of what’s really going on.

So, are business travels really contributing to the development of new industries in the country of destination? Yes! Our estimates show that, if we were to double the number of business travelers, we would expect a growth in industrial activity of around 6-14%. We have good reasons to believe that this effect is causal: it’s not simply that business travels happen because of a blossoming industrial activity in the destination. We test this by comparing different pairs of countries with different visa regimes between them — full details in the paper.

Click figure for a high resolution version.

Who are the largest contributors to this growth? To answer this question we ran that hypothetical scenario: how much would global GDP shrink if a country would completely cease to send out business travelers forever? We found out that the most impactful country would be Germany, contributing a staggering 4.82% of global GDP with its business travelers (see image above). Canada and the US are in second and third place, both impacting more than 1%. Not great news in light of renewed travel bans that are making this nightmare scenario all too real.

Click figure for a high resolution version.

Who benefits the most from the knowledge flowing with business travelers? This is where our study unveils some uncomfortable truths. To answer our question we should first see how the global network of business travel looks like (figure above). Its most striking feature is how geographically clustered it looks. You can clearly see an Americas cluster. The European cluster includes some countries in the near East and North Africa. Asia is split in middle and far East. This isn’t great news. Current patterns of economic inequality hint at the fact that tacit knowledge is concentrated in some rich countries. If that’s true, such strong geographical clustering of business travel means that tacit knowledge will find a hard time spreading globally.

The map below shows which countries are comparatively receiving more knowledge,. Western Europe and North America are clear winners, because they tap into the large reservoirs of knowhow that are Germany, Canada, the US, the UK. The rest of the world, outside these tightly-knit clusters, is left scrambling for scraps.

Click figure for a high resolution version.

So what are the lessons learned from this exercise? First, we need to solve the pandemic crisis effectively and put in place some solid countermeasures for future ones. We cannot do without business travel. If we could, we would have saved a trillion dollars in 2017, and kept plenty of CO2 from entering the atmosphere. Like it or not, Zoom calls are — for the moment — not substitutes for face-to-face interactions. Second, we need to figure out how to break the geographical compartmentalization of international knowledge transfer. If we want to achieve economic convergence and lift developing countries out of poverty, we need such countries to access what they lack to make the leap to become developed economies: the otherwise immobile tacit knowledge.

You can read more and access to interactive visualizations on the webpage of the paper, and request access to the data for result replication.


* There are countries in which the company doesn’t issue cards, or wasn’t able to grant access to data at the necessary level of granularity due to privacy regulations. Some countries simply are cash societies and thus don’t use cards. Such countries are not represented in our study.

Continue Reading

28 April 2020 ~ 0 Comments

A Worried Look at Economic Convergence

A moral imperative that wealthy communities have — in my opinion — is to ensure economic convergence: to help the poorer economies to have a stronger economic growth so that everyone is lifted out of poverty*. There is a lot of debate on whether economic convergence is actually happening (some say yes, others no) — and, if so, at which scale (global, national, regional?). In my little contribution to the question I show that — if convergence happens — it is not via traditional institutional channels, but via participation in the global social network. Which is terrible news in these days, since we’re experiencing an unprecedented collapse in this web of relations due to the COVID-19 pandemic.

An example of economic divergence: some countries like Singapore are now 6X richer than other countries that had a comparable level of income in the late 1800. Image from EconoTimes.

This message comes from a paper I wrote a while ago with Tim Cheston and Ricardo Hausmann: “Institutions vs. Social Interactions in Driving Economic Convergence: Evidence from Colombia“. I never mentioned it because it is just a working paper, so all conclusions should be taken with a boatload of grains of salt. But it is an interesting perspective on the consequences of these troubling times — plus it foreshadows another post I’m planning for the future, so stay tuned 🙂

The idea is simple: we want to know whether economic convergence happens in Colombia. If it does, we want to show that its driving force is the participation in social networks. In other words, economic growth is a matter of connecting skillful people with people possessing capital. We need to make sure we’re not confusing our “social relationships” explanation with the ability of some states to be better at providing public goods and redistributing wealth from the rich municipalities to the poor ones.

The public institutions hypothesis seems natural: if you have good politicians, they would write good laws which will support their population’s prosperity. Bad politicians would just be inept, or even corrupt. In this hypothesis, poor municipalities in rich (= well managed) regions should grow faster than poor municipalities in poor (= badly managed) regions. Our hypothesis, instead, proposes that poor municipalities with strong social connections to rich municipalities should grow faster than poor municipalities without such connections. For this we need to know two things: in which administrative region a municipality is (easy!) and to which social group of municipalities it belongs.

The latter is tricky, but not if you’re a data hoarder like yours truly. I had already worked with phone call records in Colombia, so you might guess where this is going. I can represent Colombia as a network, where nodes are the municipalities. Municipalities are connected to other municipalities if there is a significant number of residents in the two municipalities that call each other. Once I have this network, I can perform community discovery and find groups of municipalities with tightly knit social relations.

Colombia’s social network at the municipality level. Click to enlarge.

Using data on the municipalities’ GDP (from DANE) and average wage (from PILA), we can now test whether convergence happens — i.e. growth is negatively correlated with starting level, the poorer you are the more you grow. This is false for administrative regions but true for social communities: there is a mildly significant (p < 0.05) negative relationship between a social group’s GDP (and average wage) growth and its initial level. Meaning: economic convergence happens at the social but not at the institutional level. I’d love an even lower p-value, but one can’t do much with such a low number of regions/groups (32 in Colombia).

If social communities are converging, what could be driving the effect? We observe a robust (p < 0.01) positive relationship between the growth of per capita wages in a municipality and the average per capita wage in its social group. Meaning: if you talk to rich municipalities, you grow faster. Even the formality rate converges: if you talk with municipalities with low tax evasion, you start tax-evading less! Such relationships are absent for administrative regions, and survive a number of robustness checks — excluding the capital city Bogotá, excluding particularly small municipalities (in inhabitants, employees, or number of phone calls), using admin region fixed effects, etc. To get a sense of scale: suppose baseline growth is 1%. If you talk to a rich social group you’d grow, instead, by 1.02%. If you you talk to rich municipalities and you are also poor, you grow by 1.09% instead. This might not sound much, but it’s better to have it than not, and it stacks over time, as the picture below shows.

The effect of social relationships on average wage (y axis) over time (x axis). Gray = base growth; blue = growth while talking to rich social communities; red = talking to rich social community *and* being poor.

These results would be great in normal times, because they provide a possible roadmap to fostering economic convergence. One would have to identify places which lack the proper connections in the global knowledge network, and try to plug them in. The problem is that we’re not living in normal times. Lockdowns and quarantines due to the global pandemic have created gigantic obstacles to human mobility almost everywhere in the world. And, as I’ve shown previously, social relationships go hand in hand with mobility. For that reason, physical obstacles are also hampering the tightening of the global social network, one of the main highways of global development.

Don’t get me wrong: those are the correct measures and we should see them through. But we also should be mindful of their possible unintended side effects. Perhaps there are already enough people working on research on better medical devices, and on how to track and forecast outbreaks on the global social network. If this post has a moral, it’s to encourage people to find new ways to make the weaving of such global social network more robust to the black swan events that will follow COVID-19. Because they will happen, and our moral imperative of lifting people out of poverty can’t be the price we pay to survive them.


* This needs to be a structural intervention: simple handouts don’t work and may even make the problem worse.

Continue Reading

08 November 2018 ~ 0 Comments

Using Web Crawling to Map US State Governments

What do governments do? These days, you can answer this question in many ways: the desire for accountability has pushed many governments to publish more and more data online. These efforts are fantastic, but they seem to have a blind spot. The data is flat. By that, I mean that they usually contain tables, budgets, pieces of information that cannot be really connected to each other. As a network scientist, I think inter-actions contain as much information as actions. Where is the dataset that tells me how public agencies interact with each other? Now the answer is simple: we built it!

The US state governments in all their gory tangledness. Click to enjoy it fully.

The question is: how do we map interactions between agencies if they do not publish data about what they do with whom? And, once we’ve done that, how do we know we’re representing the interactions correctly? This is the result of an effort started by Stephen Kosack, which then gathered to accompany me a fantastic team with a diverse set of skills: Evann Smith, Kim Albrecht, Albert-Laszlo Barabasi, and Ricardo Hausmann. It resulted in the paper “Functional structures of US state governments,” recently published (Open Access!) in PNAS.

We realized that there is a place where agencies say what they are doing: their website. And because websites are built around the idea of interconnecting documents, they are a natural place to scout for links. Our fundamental assumption is that, when a school’s website links to its school district, they’re implicitly or explicitly saying: “What that agency says is relevant for what I do, so you should check them out.” In other words, we can and should link them in a network of agencies.

Crawling the web is notoriously complicated. Besides the technical challenges, there’s also the fact that it’s difficult to build a complete index of government websites. We did our best to follow the agency directories published by central state governments, and integrated that with data from Wikipedia and other websites specializing in listing public schools, libraries, city, and county governments.

Obviously, the picture is incomplete and it changes constantly: we did the crawl in 2014 and the web presence of governments ought to look a bit different today. However, we managed to estimate websites’ ages using the Wayback Machine (consider donating!) and we see signs of saturation: the growth of government presence online is significantly slowing down. A hint that the big picture we’re seeing shouldn’t have changed that much. So what’s this big picture? Allow me to introduce you to what we affectionately call “The Cathedral”:

Click to enlarge and cathedralize yourself!

We collapsed each government agency in a two-level classification of activities into government functions. The top level is a generic activity area and the second level is a specialization inside the area. For instance, a top-level function is education. Then “primary/secondary education” and “public libraries” are two possible second level functions inside education. We built this classification by looking at the textual content of each website. Mechanical Turkers validated it. In the cathedral, each node is a function. We connected nodes if agencies, classified under one function, linked themselves to agencies classified under the other. The position of the node is determined by its centrality both horizontally — most central in the middle — and vertically — from top, most central, to bottom. The cathedral confirms our intuition of hierarchicalness: central state and local governments oversee everything and connect to everything.

We could test how much our picture overlaps with actual agency interactions, because the government of Massachusetts mandates some collaborations in its constitution. So we forced Evann to dedicate one year of her life going through the constitution of Massachusetts and noting down when two agencies were mentioned in the same article as in need of interacting. Then it took me a whole five minutes to swoop in to take the credit by testing how much the web network fared compared to the constitutional network (thanks Evann!).

Red is web-constitution overlap, yellow is what the constitution sees and we don’t, blue is what we see and the constitution doesn’t. Click to enlarge.

The answer is “a lot, with a twist.” Agencies that are mandated to collaborate connect to each other on the web much more strongly than we expected — almost eight times as much. The majority of mandated connections are represented in the web. But the overall measure of alignment returned rather low values. How come? Because the internet reveals many more agencies than are mentioned in the constitution. And it reveals many more links too. Although some of that ought to be noise, there are ways to filter it out to end up with a network that is actually more complete than the one you’d get by looking only at the constitution.

One question we wanted to answer with our data was: how can we explain differences in how states decide to implement their functions? We had a few hypotheses. Maybe it’s ideology: red and blue have… different ideas on how to do things — to put it mildly. Or maybe it’s how rich the states are: more $$$ = more complexity. Maybe it’s just their geographical position. It could be a bit of all of that, but there is one thing dominating everything else: economic structure. It’s not the amount of dollars you earn, but how you do it. Hi-tech states look like hi-tech states, agricultural states look like agricultural states, whether they are red, blue, purple, ocher, indigo, rich, or poor.

Each circle here is a state, and we look at three dimensions on how they implement each function: how many agencies they created to serve it, how big they are, how much they talk to each other. Some functions, like education, are consistently implemented the same across states — we can tell because the states’ circles are very clustered –. Others are all over the place, like administrative law. Click to enlarge.

Everything I said can be visualized in all its glory in the official website of the project: http://govmaps.cid.hks.harvard.edu/ (by Kim Albrecht). You have a few interactive visualizations, the paper, and the data. We strongly believe in open data and reproducibility: our data release includes not only the US state government networks, but also a collection of scripts to reproduce the main results of the paper.

I hope this could be a significant step forward in how we understand government actions and effects on society. Nowadays, it feels that there is a gargantuan ideological divide between sides: handing over your state to “the other side” feels like a tragedy. Our results seem to suggest that, actually, it doesn’t make that much of a difference.

Continue Reading

11 September 2018 ~ 0 Comments

The Struggle for Existence in the World Market Ecosystem

The global trade market definitely seems red in tooth and claw. Competition is fierce and some claim it has to be that way: only the fittest should survive because the fittest, at least in theory, are the ones satisfying the customers’ needs the best. In a paper with Viviana Viña-Cervantes and Renaud Lambiotte, we explored this metaphor to see if we can say something about how world trade will evolve. The paper will soon appear in PLoS One and it is titled “The Struggle for Existence in the World Market Ecosystem“. In it we create competition networks and we show that positions in these networks are a predictor of future growth: a strong position in a product means that the country will increase its export share in that product in the medium term.

How do we create competition networks? In our view, each slice of the market is an ecological niche. For instance, the car market in the United States or the computer market in Germany. If you can sell a lot of cars in the US, it means you’re fit to occupy that niche. You’re meeting the needs of that set of customers, and thus you can make a profit there, cutting out other exporters who would oh so much like to make a buck there.

An example of data generating one edge in the competition network: Japan emerges beyond 1% of the car market in the US at the same time that Italy plunges below the 1% mark. With this data, we create an edge from Japan to Italy in the US-car 1960 competition network.

 

Niches are not stable: they change over time. As a consequence of evolution, animals can become fitter to fill their current niche — or a new one. Out of this observation, we create our competition networks. Suppose that you were doing well at selling cars to Americans, like Italy was in the 60s — who doesn’t love a vintage Alfa Romeo?  Then something happens: a mutation, an evolution. Japan’s cars start gaining appeal in North America. All of a sudden, the market share of Italy declines once Japan appears on the scene. In this case, we can create a directed edge going from Japan to Italy, because Japanese firms happened to be successful at the same time that Italian ones lost their, um, edge.* That’s our competition network. We built one per decade: 1960, 1970, 1980, 1990, and 2000.

In the vast majority of cases, when you study a network, the edges have a positive meaning: they imply connections, social relations, friendship. Competition networks tell a fundamentally negative story. The originator of the edge is displacing the receiver of the edge in a market, something definitely not nice. The out-degree tells us something about a country’s fitness: how many competitors it displaced. The in-degree is the other side of the coin: how many times the country’s entrepreneurs were just not good enough. So these two measures should tell us what we want, right? A high out-degree is a sign of strength and growth, a high in-degree a sign of weakness.

The correlation between in- and out-degree is pretty darn high.

Not really. The problem is that big countries produce a lot of stuff and export it everywhere. So they are constantly fighting a lot of battles. Winning some and losing some. The in- and out-degree are highly correlated, and thus they do not give much information. We decided to look at a generalization of in- and out-degree. When displacing a country from a market, quantity is not as important as quality. Displacing a fierce exporter like the US is not the same as tripping up a weaker economy. So we weight the out-degree by counting each displaced country as much as their out-degree. This is a higher-order degree, because it looks at a second hop beyond the displaced. The more this country was a displacer, the more it counts that we were able to displace it. Displacing the displacers is the way to go.

At this point interesting stuff starts emerging. We can use this normalized in- and out-degree to classify countries into three buckets: out-competing (high out-degree), displaced (high in-degree), and transitioning (roughly equivalent in- and out-degree). We do so per decade and per product. Then, we check whether belonging to one cluster has any relationship with how the country will evolve its market share in the decade following the one we used for the classification. If you were a strong out-competitor in the 60s in the car market, your position in the car market in the 70s will get stronger and stronger.

The growth rate the decade after the observation window used for classifying countries. Here: 1 = Displaced, 2 = Transitioning, and 3 = Out-competing countries.

We see these strong relationships for all products and for all decades, with a couple of exceptions. For instance, our method does not work for natural resources. Which is expected: we cannot use this system to predict whether you’re going to find oil in your territory or not. It also does not work in the last decade, the 2000s, because we have very little data for making the prediction: our data runs only until 2013. Thus, it means this method cannot work for short term predictions: it works well when looking at decade-long transitions, not year-long ones. The effect gets a bit weaker if we look at what happens two, three and even four decades after our classification, but it’s still significant.

We also checked the robustness of our results by creating a synthetic trade world. We broke all relationships between countries by generating random trade, maintaining the sparseness — most exporter-importer-product relationships never happen — and the skewed nature of our data — a few high-throughput links constitute most of world trade, and the vast majority of links are low-value ones. In this world with random competition, we see far fewer links in our networks. Using the ratio between in- and out-degree also doesn’t work: as predictor, it returns a much lower result quality.

The average growth rate for out-competing countries when the prediction period is one, two, three or four decades away from the observation one.

So, to wrap up, in the paper we show how to build competition networks, by connecting strong emerging economies to the ones they are out-competing for specific products. Analyzing the higher-order relationships in this networks — i.e. going beyond the simple degree — uncovered a way to estimate the real strength of these emerging countries. A lot of questions remain unanswered. Chief among them: what if we ensured that each edge in the competition networks is truly causal? This will be a quest for another time.


* We’re extremely aware of the fiendish operation we did: these competition networks are absolutely correlational and will never imply causation. However we believe there’s also value in looking at these serendipitous events. If nothing else, at least some econometrician might have had a stroke reading the original sentence.

Continue Reading

28 May 2018 ~ 0 Comments

Mapping the International Aid Community

A few years ago (2013, God I’m old), I was talking to you on how to give a “2.0” flavor to international aid: at CID we created an Aid Explorer to better coordinate the provision of humanitarian work. I’m happy to report that “Aid Explorer” has been adopted — directly or indirectly —  by multiple international organizations, for instance USAID and the European Union. The World Bank’s Independent Evaluation Group contacted me to make an updated version, focused on estimating the World Bank’s position in the global health arena. The result is a paper, “Mapping the international health aid community using web data“, recently published in EPJ Data Science, and the product of a great collaboration with Katsumasa Hamaguchi, Maria Elena Pinglo, and Antonio Giuffrida.

The idea is to collect all the webpages of a hundred international aid organizations, looking for specific keywords and for hyperlinks to the other organizations — differently from the old Aid Explorer in which we relied on the index from Google. The aim is to create different networks of co-occurrences connecting:

  • Aid organizations co-mentioned in the same page;
  • Aid organizations mentioned or linked by another;
  • Issues co-mentioned in the same page;
  • Countries co-mentioned in the same page.

We then analyze these structures to learn something about the community as a whole.

One thing I didn’t expect was that organizations cluster by type. The “type” here is the force behind the organization — private philanthropy, UN system, bilateral (a single country’s aid extension of the foreign ministry), multilateral (international co-operations acting globally), etc. In the picture above (click on the image to enlarge), we encode the agency type in the node color. Organizations are overwhelmingly co-mentioned with organizations of the same type, which is curious because bilaterals often have nothing in common with each other besides the fact they are bilaterals: they work on different issues, with different developed and developing partners.

We can make a similar map connecting issues if they are co-mentioned in a web page. The map is useful as validation because it connects some “synonyms”, for instance “TB” and “Tubercolosis”. However, you can do much more with it. For instance, you can use color to show where an organization is most often cited. Below (click on the image to enlarge) you see the issue map for the World Bank, with the red nodes showing the issues strongly co-mentioned with the World Bank. Basically, the node color is the edge weight in a organization-issue bipartite network, where the organization is the World Bank. To give you an idea, the tiny “Infant Survival” node on the right saw the World Bank mentioned in 9% of the pages in which it was discussed. The World Bank was mentioned in 3.8% of web pages discussing AIDS.

This can lead to interesting discussions. While the World Bank does indeed discuss a lot about some of the red issues above — for instance about “Health Market” and “Health Reform” — its doesn’t say much about “Infant Survival”, relatively speaking at least. It’s intriguing that other organizations mention this particular issue so often in conjunction with the World Bank.

This difference between the global speech about issues and the one specific to another organization allows us to calculate two measures we call “Alignment” and “Impact”. By analyzing how similar the issue co-occurrence network of an organization is with the global one — a simple correlation of edge weights — we can estimate how “Aligned” it is with the global community. On the other hand, an “Impactful” organization is one that, were it to disappear, would dramatically change the global issue network: issues would not be co-mentioned that much.

In the plot above, we have Alignment and Impact on the x and y axis, respectively. The horizontal and vertical lines cutting through the plot above show the median of each measure. The top-right quadrant are organization both impactful and aligned: the organizations that have probably been setting the discourse of the international aid community. No wonder the World Health Organization is there. On the top left we have interesting mavericks, the ones which are not aligned to the community at large, and yet have an impact on it. They are trying to shape the international aid community into something different than what it is now.

A final fun — if a bit loose — analysis regards the potential for an organization to spread a message through the international aid network. What would be the reach of a message if it originated from a specific organization? We can use the Susceptible-Infected model from epidemiology. A message is a “virus” and it is more likely to infect an agency if more than a x% of the agency’s incoming links come from other infected agencies.

This depends on the issue, as shown above. In the figures we see the fraction of “infected” agencies (on the y-axis) given an original “patient zero” organization which starts spreading the message. To the left we see the result of the simulation aggregating all issues. The World Bank reaches saturation faster than UNICEF, and USAID is only heard by a tiny fraction of the network. However, if we only consider web pages talking about “Nurses” (right), then USAID is on par with the top international aid organizations — and UNICEF beats the World Bank. This happens because the discourse on the topic is relatively speaking more concentrated in USAID than average.

As with the Aid Explorer, this is a small step forward improving the provision of international aid. We do not have an interactive website this time, but you can download both the data and the code to create your own maps. Ideally, what we did only for international aid keywords can be extended for all other topics of interest in the humanitarian community: economic development, justice, or disaster relief.

Continue Reading

19 April 2018 ~ 0 Comments

Birds of a Feather Scam Together

In Italy we have a saying:

He who walks with the lame learns how to limp

It means that the company you keep will influence your behavior, more often than not in a negative way — never mind the fact that Judas had impeccable friends. Setting my mind on how to verify this tidbit of ancient wisdom, I decided to investigate how the fraudulent behavior of some businesses might correlate with that of the other companies they work with. Does doing business with a crook say something about you? Will Mexican drug cartels ever recoup the damage to their reputation when it turns out they do business with European banks?

The result is the paper “Birds of a feather scam together: Trustworthiness homophily in a business network” published last month in the Social Networks journal. I wrote this paper with Mauro Barone at the Agenzia delle Entrate, the Italian equivalent of the IRS in the US. The idea is simple. We want to answer the question: is the level of fraud committed by a business correlated with the level of fraud committed by its customer/supplier businesses?

I bet none of these birdbrains filled in their 1040 this year.

To answer the question we analyze a business-to-business (B2B) network. Each node in the network is a company, and it connects to other companies with directed edges. The edge goes from a provider to a customer, it follows the flow of goods and services. These connections are weighted according to the monetary value of the total transaction value during the year. The network contains all the relationships of a few thousands audited Italian businesses, centered on one Italian region: Tuscany.

The peculiarity of the data structure that allow us to speak about tax fraud is that each connection is registered twice. Say we have business A and business B. The “A sold to B” connection should be reported by both A and B: A says how much it sold, and B says how much it purchased. See the example below: the blue edges are A’s declarations, and the orange edges are B’s. The two businesses agree on how much B sold to A (75 Euros), but they disagree on how much A sold to B.

Finding who’s trustworthy and who’s not seems, at first sight, an easy problem. If a business constantly misreports its transactions with its customer/supplier network, then we have good ground to suspect something fishy is going on. However, it’s not that easy. Say we decide a mismatch of 5 Euros is divided into 2.5 blamed on A and 2.5 blamed on B. We might think we’re doing a neutral operation, but we aren’t. What appears to be fair — shifting the blame equally to both actors involved in a disagreement — is already a heavy judgment. Maybe the fault lays completely on B, and A is a blameless victim. We all know how the story ends if we simply wash our hands of the problem like Pontius Pilate: with brigands freed and prophets on the cross.

Paradoxically, if you simply average mismatches, the larger you are, the more you can con. The vast majority of businesses are honest thus, if you have a lot of relationships, they will mostly check out and bring your average match value up. At that point, you can tactically limit your misreports to a few connections and thus evade your taxes. The network also has a broad degree distribution, meaning that most businesses have few connections. This means that your victims will not have enough honest links to make up for the curveball you just threw at them. You can see an example below: the number on the node is their average edge weight, which is the match level of the connection — the total amount of Euros accounted for over the total transaction value (in the A-B example above, we have 75+75+100+95=345 total edge value, but only 340 accounted for given the mismatch of 5, thus the edge weight would be 340/345= 0.9855). Even though the hub has the most fraudulent connections — edges with low match values — its average is higher than that of its victims.

We solve this problem by taking these average mismatches only as a starting condition. At each iteration, we recalculate the average by re-weighting connections according to the trustworthiness equilibrium. The trick is to do it in an asymmetric way. If the previous iteration’s trustworthiness of A was larger than B’s, then in this iteration the mismatch for A is reduced, thus more of the blame will end on B. In practice, if on average we trusted A more, then we’re going to keep doing so in its other cases of mismatch. We don’t let a speck in A’s eye to make us miss the plank in B’s. (Man, my Biblical references today are on point!)

The exact mathematical formulation of this trick is in the paper. In practice, in our example above, the honest businesses below the hub siphon out all its trustworthiness — as they should. They leave it dealing with the businesses at the top, which are corroborating each other honesty with their own high-match links. After a few iterations, the hub ends up being the only business we should not trust — see below (note that I changed the node’s trust value, but I kept the edges constant, as showing the asymmetric correction would make the picture more confusing). In the paper we show that now we break the relationship between the trustworthiness score and the topological characteristics of the nodes: they are now being judged not by their position, but by their actions.

So: is it true that the friends of lame businesses are picking up limps? Yes it is: the difference in trustworthiness levels is a significant indicator whether a connection between two businesses is there or not. In practice, you can use it to predict future links in the B2B network. A strong match on trustworthiness doubles the chances that two businesses are connected to each other when compared to businesses with very different trustworthiness levels.

Making a point about Italian popular wisdom is all good and well, but this paper has a few more things to say. The trustworthiness score as we defined it has some interesting practical applications when it comes to predict which businesses to audit. If a low-trust business is audited, its probability of being fined — meaning something was indeed fishy — is ten percentage points higher than a high-trust business. Since our sample comes from audited businesses, it is not randomly selected: we’re observing businesses that were already suspect. So, if anything, this is an underestimation of the power of our measure: if we’re able to weed out the small crooks from the big ones, we’re probably going to do an even better job distinguishing innocents and culprits.

I think that this paper can do good for honest businesses. To be audited is a long and painful process: if you can skip it because the Agenzia delle Entrate knows how to find the bad guys, you can save time, energy, and peace of mind. And the crooks? Well, the crooks will have to render unto Caesar the things which are Caesar’s.

Continue Reading

30 November 2016 ~ 0 Comments

Exploring the Uncharted Export

Exporting goods is great for countries: it is a way to attract foreign currency. Exports are also fairly easy to analyze, since they are put in big crates and physically shipped through borders, where they are usually triple checked*. However, there is another way to attract foreign currency that escapes this analytical convenience. And it is a huge one. Tourism. When tourists get inside your country, you are effectively exporting something: anything that they buy. Finding out exactly what and how much you’re exporting is tricky. Some things are easy: hotels, vacation resorts, and the like. Does that cover all they buy? Probably not.

Investigating this question is what I decided to do with Ricardo Hausmann and Frank Neffke in our new CID Working Paper “Exploring the Uncharted Export: An Analysis of Tourism-Related Foreign Expenditure with International Spend Data“. The paper analyzes tourism with a new and unique dataset. The MasterCard Center for Inclusive Growth endowed us with a data grant, sharing with us anonymized and aggregated transaction data giving us insights about the spend behavior of foreigners inside two countries, Colombia and the Netherlands.

tourism1

The first thing to clear is the question: does tourism really matter? Tourism might be huge for some countries — Seychelles or Bahamas come to mind** — but does it matter for any other country? Using World Bank estimates — which we’ll see they are probably underestimations — we can draw tourism as the number one export of many countries. Above you see two treemaps (click on them to enlarge) showing the composition of the export basket of Zimbabwe and Spain. The larger the square the more the country makes exporting that product. Tourism would add a square larger than tobacco for Zimbabwe, and twice as big as cars for Spain. Countries make a lot of money out of tourism, so it is crucial to have a more precise way to investigate it.

tourism2

How do we measure tourism? As said before, we’re working with anonymized and aggregated transaction data. In practice, for each postal code of the country of destination we can know how many cards and how much expenditure in total happened in different retail sectors. We focus on cards which were issued outside the country we are analyzing. This way we can be confident we are capturing mostly foreign expenditures. There are many caveats to keep in mind which affect our results: we do not see cash expenditures, we have only a non-random sample from MasterCard cards, and so on. However, when we look at maps showing us the dollar intensity in the various parts of the country (above for Colombia and the Netherlands — click on them to enlarge), we find comforting validation with external data: the top six tourism destinations as reported by Trip Advisor always correspond to areas where we see a lot of activity also in our data.

nld_communities

We also see an additional thing, and it turns out to be related to the advantage of our data over traditional tourism reports. A lot is happening on the border. In fact, the second most popular Colombian city after BogotĂ  is Cucuta. If you never heard of Cucuta it just means that you are not from Colombia, or Venezuela: Cucuta is a city on the northeastern border of the country. It is the place where many Venezuelan cross the border to do shopping, representing a huge influx of cash for Colombia. Until the border got closed, at least (the data is from before this happened, now it’s open again). In the Netherlands, you can cluster municipalities according to the dominant foreign countries observed there — see map above. You will find a Belgian cluster, for instance (in purple). This cluster is dominated by grocery and shopping.

tourism3

While these Belgian shoppers are probably commuters rather than tourists, they are nevertheless bringing foreign currency to local coffers, so that’s great. And it is something not really captured by alternative methodologies. We classify a merchant type as “commuting” if it is predominant in the purple cluster, because it is more popular for “local” Belgian travelers. Everything else is either “tourism” — if it is predominant in the other non-border municipalities –, or “other” if there is no clear dominance anywhere. In the tourism cluster you find things like “Accommodations” and “Travel Agencies and Tour Operators”; in the commuting cluster you have merchants classified under “Automotive Retail” and “Pet Stores”. When you look at the share of expenditures going to the commuting cluster (above in green), you realize how significant this is. One out of four foreign dollars spent in the Netherlands go to non-tourism related activities. The share for Colombia goes up to 30%.

tourism4

A post in this blog would not be complete without a gratuitous network visualization, so here we are. What you see is what we call “Origin Space” for Colombia (above) and the Netherlands (below). Nodes are countries of origin, and they are connected if the tourists from these countries behave similarly in what and where they make their purchases. The color of the node tells you the continent of the country. The size is the presence of tourists in the country of destination, relative to the origin’s GDP. The size and color of the edge is proportional to how similar the two origins are (orange = very similar; blue = similar, but not so much). We can see that Colombia has a lot of large red nodes — from the Americas — and the Netherlands is strong in blue nodes — from Europe.

If you click on the picture and zoom into the Colombia network you will see why this kind of visualization is useful. Colombia is fairly well-placed in the Australian market: the corresponding node is quite large. A thing then jumps to the eye. Australia has a large and very orange connection. To New Zealand. No surprise: Australians and New Zealanders are similar. Yet, the New Zealand node is much smaller than Australia. It shouldn’t be: these are relative expenditures. This means that, for some reason, Colombia is not currently an appealing destination for New Zealanders, even if it should, based on their similarity with Australians. New Zealand should then be a target of further investigation, which might lead to the untapping of a new potential market for Colombian tourism.

And this concludes the reasons why this data is so amazing to study tourism. To wrap up the message, we have first validated the data, showing that it agrees with what we expect being the most important tourism destinations of a country. Then, we unleashed its potential: the ability to detect “non-tourism” foreign cash inflows, and the promising initial development of tools to discover potential missing opportunities.


* The process is not foolproof, thought. I don’t remember where I read it, but it seems that if you sum all declared exports of countries and all declared imports, and subtract the two, you get a quite high positive number. I wonder where all these extra exports are going. Mars?

** When I was told we were doing a tourism project I got high hopes. I’m still waiting for a fully paid work mission to be approved by management. Any day now.

Continue Reading

07 July 2016 ~ 0 Comments

Building Data-Driven Development

A few weeks ago I had to honor to speak at my group’s  “Global Empowerment Meeting” about my research on data science and economic development. I’m linking here the Youtube video of my talk and my transcript for those who want to follow it. The transcript is not 100% accurate given some last minute edits — and the fact that I’m a horrible presenter 🙂 — but it should be good enough. Enjoy!


We think that the big question of this decade is on data. Data is the building blocks of our modern society. We think in development we are not currently using enough of these blocks, we are not exploiting data nearly as much as we should. And we want to fix that.

Many of the fastest growing companies in the world, and definitely the ones that are shaping the progress of humanity, are data-intensive companies. Here at CID we just want to add the entire world to the party.

So how do we do it? To fix the data problem development literature has, we focus on knowing how the global knowledge building looks like. And we inspect three floors: how does knowledge flow between countries? What lessons can we learn inside these countries? What are the policy implications?

To answer these questions, we were helped by two big data players. The quantity and quality of the data they collect represent a revolution in the economic development literature. You heard them speaking at the event: they are MasterCard – through their Center for Inclusive Growth – and Telefonica.

Let’s start with MasterCard, they help us with the first question: how does knowledge flow between countries? Credit card data answer to that. Some of you might have a corporate issued credit card in your wallet right now. And you are here, offering your knowledge and assimilating the knowledge offered by the people sitting at the table with you. The movements of these cards are movements of brains, ideas and knowledge.

When you aggregate this at the global level you can draw the map of international knowledge exchange. When you have a map, you have a way to know where you are and where you want to be. The map doesn’t tell you why you are where you are. That’s why CID builds something better than a map.

We are developing a method to tell why people are traveling. And reasons are different for different countries: equity in foreign establishments like the UK, trade partnerships like Saudi Arabia, foreign greenfield investments like Taiwan.

Using this map, it is easy to envision where you want to go. You can find countries who have a profile similar to yours and copy their best practices. For Kenya, Taiwan seems to be the best choice. You can see that, if investments drive more knowledge into a country, then you should attract investments. And we have preliminary results to suggest whom to attract: the people carrying the knowledge you can use.

The Product Space helps here. If you want to attract knowledge, you want to attract the one you can more easily use. The one connected to what you already know. Nobody likes to build cathedrals in a desert. More than having a cool knowledge building, you want your knowledge to be useful. And used.

There are other things you can do with international travelers flows. Like tourism. Tourism is a great export: for many countries it is the first export. See these big portion of the exports of Zimbabwe or Spain? For them tourism would look like this.

Tourism is hard to pin down. But it is easier with our data partners. We can know when, where and which foreigners spend their money in a country. You cannot paint pictures as accurate as these without the unique dataset MasterCard has.

Let’s go to our second question: what lessons can we learn from knowledge flows inside a country? Telefonica data is helping answering this question for us. Here we focus on a test country: Colombia. We use anonymized call metadata to paint the knowledge map of Colombia, and we discover that the country has its own knowledge departments. You can see them here, where each square is a municipality, connecting to the ones it talks to. These departments correlate only so slightly with the actual political boundaries. But they matter so much more.

In fact, we asked if these boundaries could explain the growth in wages inside the country. And they seem to be able to do it, in surprisingly different ways. If you are a poor municipality in a rich state in Colombia, we see your wage growth penalized. You are on a path of divergence.

However, if you are a poor municipality and you talk to rich ones, we have evidence to show that you are on a path of convergence: you grow faster than you expect to. Our preliminary results seem to suggest that being in a rich knowledge state matters.

So, how do you use this data and knowledge? To do so you have to drill down at the city level. We look not only at communication links, but also at mobility ones. We ask if a city like Bogota is really a city, or different cities in the same metropolitan area. With the data you can draw four different “mobility districts”, with a lot of movements inside them, and not so many across them.

The mobility districts matter, because combining mobility and economic activities we can map the potential of a neighborhood, answering the question: if I live here, how productive can I be? A lot in the green areas, not so much in the red ones.

With this data you can reshape urban mobility. You know where the entrance barriers to productivity are, and you can destroy them. You remodel your city to include in its productive structure people that are currently isolated by commuting time and cost. These people have valuable skills and knowhow, but they are relegated in the informal sector.

So, MasterCard data told us how knowledge flows between countries. Telefonica data showed the lessons we can learn inside a country. We are left with the last question: what are the policy implications?

So far we have mapped the landscape of knowledge, at different levels. But to hike through it you need a lot of equipment. And governments provide part of that equipment. Some in better ways than others.

To discover the policy implications, we unleashed a data collector program on the Web. We wanted to know how the structure of the government in the US looks like. Our program returned us a picture of the hierarchical organization of government functions. We know how each state structures its own version of this hierarchy. And we know how all those connections fit together in the union, state by state. We are discovering that the way a state government is shaped seems to be the result of two main ingredients: where a state is and how its productive structure looks like.

We want to establish that the way a state expresses its government on the Web reflects the way it actually performs its functions. We seem to find a positive answer: for instance having your environmental agencies to talk with each other seems to work well to improve your environmental indicators, as recorded by the EPA. Wiring organization when we see positive feedback and rethinking them when we see a negative one is a direct consequence of this Web investigation.

I hope I was able to communicate to you the enthusiasm CID discovered in the usage of big data. Zooming out to gaze at the big picture, we start to realize how the knowledge building looks like. As the building grows, so does our understanding of the world, development and growth. And here’s the punchline of CID: the building of knowledge grows with data, but the shape it takes is up to what we make of this data. We chose to shape this building with larger doors, so that it can be used to ensure a more inclusive world.


By the way, the other presentations of my session were great, and we had a nice panel after that. You can check out the presentations in the official Center for International Development Youtube channel. I’m embedding the panel’s video below:

Continue Reading

18 November 2015 ~ 0 Comments

Evaluating Prosperity Beyond GDP

When reporting on economics, news outlets very often refer to what happens to the GDP. How is policy X going to affect our GDP? Is the national debt too high compared to GDP? How does my GDP compare to yours? The concept lurking behind those three letters is the Gross Domestic Product, the measure of the gross value added by all domestic producers in a country. In principle, the idea of using GDP to take the pulse of an economy isn’t bad: we count how much we can produce, and this is more or less how well we are doing. In practice, today I am jumping on the huge bandwagon of people who despise GDP for its meaningless, oversimplified and frankly suspicious nature. I will talk about a paper in which my co-authors and I propose to use a different measure to evaluate a country’s prosperity. The title is “Going Beyond GDP to Nowcast Well-Being Using Retail Market Data“, my co-authors are Riccardo Guidotti, Dino Pedreschi and Diego Pennacchioli, and the paper will be presented at the Winter edition of the Network Science Conference.

GDP is gross for several reasons. What Simon Kuznets said resonates strongly with me, as already in the 30s he was talking like a complexity scientist:

The valuable capacity of the human mind to simplify a complex situation in a compact characterization becomes dangerous when not controlled in terms of definitely stated criteria. With quantitative measurements especially, the definiteness of the result suggests, often misleadingly, a precision and simplicity in the outlines of the object measured. Measurements of national income are subject to this type of illusion and resulting abuse, especially since they deal with matters that are the center of conflict of opposing social groups where the effectiveness of an argument is often contingent upon oversimplification.

cdp1

In short, GDP is an oversimplification, and as such it cannot capture something as complex as an economy, or the multifaceted needs of a society. In our paper, we focus on some of its specific aspects. Income inequality skews the richness distribution, so that GDP doesn’t describe how the majority of the population is doing. But more importantly, it is not possible to quantify well-being just with the number of dollars in someone’s pocket: she might have dreams, aspirations and sophisticated needs that bear little to no correlation with the status of her wallet. And even if GDP was a good measure, it’s very hard to calculate: it takes months to estimate it reliably. Nowcasting it would be great.

And so we tried to hack our way out of GDP. The measure we decided to use is the one of customer sophistication, that I presented several times in the past. In practice, the measure is a summary of the connectivity of a node in a bipartite network*. The bipartite network connects customers to the products they buy. The more variegated the set of products a customer buys, the more complex she is. Our idea was to create an aggregated version at the network level, and to see if this version was telling us something insightful. We could make a direct correlation with the national GDP of Italy, because the data we used to calculate it comes from around a half million customers from several Italian regions, which are representative of the country as a whole.

gdp2

The argument we made goes as follows. GDP stinks, but it is not 100% bad, otherwise nobody would use it. Our sophistication is better, because it is connected to the average degree with which a person can satisfy her needs**. Income inequality does not affect it either, at least not in trivial ways as it does it with GDP. Therefore, if sophistication correlates with GDP it is a good measure of well-being: it captures part of GDP and adds something to it. Finally, if the correlation happens with some anticipated temporal shift it is even better, because GDP pundits can just use it as instantaneous nowcasting of GDP.

We were pleased when our expectations met reality. We tested several versions of the measure at several temporal shifts — both anticipating and following the GDP estimate released by the Italian National Statistic Institute (ISTAT). When we applied the statistical correction to control for the multiple hypothesis testing, the only surviving significant and robust estimate was our customer sophistication measure calculated with a temporal shift of -2, i.e. two quarters before the corresponding GDP estimate was released. Before popping our champagne bottles, let me write an open letter to the elephant in the room.

gdp3

As you see from the above chart, there are some wild seasonal fluctuations. This is rather obvious, but controlling for them is not easy. There is a standard approach — the X-13-Arima method — which is more complicated than simply averaging out the fluctuations. It takes into account a parameter tuning procedure including information we simply do not have for our measure, besides requiring observation windows longer than what we have (2007-2014). It is well possible that our result could disappear. It is also possible that the way we calculated our sophistication index makes no sense economically: I am not an economist and I do not pretend for a moment that I can tell them how to do their job.

What we humbly report is a blip on the radar. It is that kind of thing that makes you think “Uh, that’s interesting, I wonder what it means”. I would like someone with a more solid skill set in economics to take a look at this sophistication measure and to do a proper stress-test with it. I’m completely fine with her coming back to tell me I’m a moron. But that’s the risk of doing research and to try out new things. I just think that it would be a waste not to give this promising insight a chance to shine.


* Even if hereafter I talk only about the final measure, it is important to remark that it is by no means a complete substitute of the analysis of the bipartite network. Meaning that I’m not simply advocating to substitute a number (GDP) for another (sophistication), rather to replace GDP with a fully-blown network analysis.

** Note that this is a revealed measure of sophistication as inferred by the products actually bought and postulating that each product satisfies one or a part of a “need”. If you feel that the quality of your life depends on you being able to bathe in the milk of a virgin unicorn, the measure will not take into account the misery of this tacit disappointment. Such are the perils of data mining.

Continue Reading

31 July 2014 ~ 0 Comments

The (Not So) Little Shop of Horrors

For this end of July, I want to report some juicy facts about a work currently under development. Why? Because I think that these facts are interesting. And they are a bit depressing too. So instead of crying I make fun of them, because as the “Panic! At the Disco” would put it: I write sins, not tragedies.

So, a bit of background. Last year I got involved in an NSF project, with my boss Ricardo Hausmann and my good friend/colleague Prof. Stephen Kosack. Our aim is to understand what governments do. One could just pull out some fact-sheets and budgets, but in our opinion those data sources do not tell the whole story. Governments are complex systems and as complex systems we need to understand their emergent properties as collection of interacting parts. Long story short, we decided to collect data by crawling the websites of all public agencies for each US state government. As for why, you’ll have to wait until we publish something: the aim of this post is not to convince you that this is a good idea. It probably isn’t, at least in some sense.

p1

It isn’t a good idea not because that data does not make sense. Au contraire, we already see it is very interesting. No, it is a tragic idea because crawling the Web is hard, and it requires some effort to do it properly. Which wouldn’t be necessarily a problem if there wasn’t an additional hurdle. We are not just crawling the Web. We are crawling government websites. (This is when in a bad horror movie you would hear a thunder nearby).

To make you understand the horror of this proposition is exactly the aim of this post. First, how many government websites are really out there? How to collect them? Of course I was not expecting a single directory for all US states. And I was wrong! Look for yourselves this beauty: http://www.statelocalgov.net/. The “About” page is pure poetry:

State and Local Government on the Net is the onle (sic) frequently updated directory of links to government sponsored and controlled resources on the Internet.

So up-to-date that their footer gets only to 2010 and their news section only includes items from 2004. It also points to:

SLGN Notes, a weblog, [that] was added to the site in June 2004. Here, SLGN’s editors comment on new, redesigned or updated state and local government websites, pointing out interesting or fun features for professional and consumer audiences alike and occasionally cover related news.

Yeah, go ahead and click the link, that is not the only 404 page you’ll see here.

p3

Enough compliments to these guys! Let’s go back to work. I went straight to the 50 different state government’s websites and found in all of them an agency directory. Of course asking that these directories shared the same structure and design is too much. “What are we? Organizations whose aim is to make citizen’s life easier or governments?”. In any case from them I was able to collect the flabbergasting amount of 61,584 URLs. Note that this is six times as much as from statelocalgov.net, and it took me a week. Maybe I should start my own company 🙂

Awesome! So it works! Not so fast. Here we hit the first real wall of government technological incompetence. Out of those 61,584, only 50,999 actually responded to my pings. Please note that I already corrected all the redirects: if the link was outdated but the agency redirected you to the new URL, then that connection is one of the 50,999. Allow me to rephrase it in poetry: in the state government directories there are more than ten thousand links that are pure, utter, hopeless garbage nonsense. More than one out of six links in those directories will land you exactly nowhere.

p2

Oh, but let’s stay positive! Let’s take a look at the ones that actually lead you somewhere:

  • Inconsistent spaghetti-like design? Check. Honorable mention for the good ol’ frameset webdesign of http://colecounty.org/.
  • Making your website an image and use <area> tag for links? Check. That’s some solid ’95 school.
  • Links to websites left to their own devices and purchased by someone else? Check. Passing it through Google Translate, it provides pearls of wisdom like: “To say a word and wipe, There are various wipe up for the wax over it because wipe from”. Maybe I can get a half dozen of haiku from that page. (I have more, if you want it)
  • ??? Check. That’s a Massachusetts town I do not want to visit.
  • Maintenance works due to finish some 500 days ago? Check.
  • Websites mysteriously redirected somewhere else? Check. The link should go to http://www.cityoflaplata.com/, but alas it does not. I’m not even sure what the heck these guys are selling.
  • These aren’t the droids you’re looking for? Check.
  • The good old “I forgot to renew the domain contract”? Check.

Bear in mind that this stuff is part of the 50,999 “good” URLs (these are scare quotes at their finest). At some point I even gave up noting down this stuff. I saw hacked webpages that had been there since years. I saw an agency providing a useful Google Maps of their location, which according to them was the middle of the North Pole. But all those things will be lost in time, like tears in the rain.

Continue Reading