Michele Coscia - Connecting Humanities

Michele Coscia I am a post-doc fellow at the Center for International Development, Harvard University in Cambridge. I mainly work on mining complex networks, and on applying the extracted knowledge to international development and governance. My background is in Digital Humanities, i.e. the connection between the unstructured knowledge and the cold organized computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In my career, I also worked at the Center for Complex Network Research at Northeastern University, with Albert-Laszlo Barabasi. On this website you can browse my papers, algorithms and datasets with the top navigation, or simply skim my blog posts that briefly present my topics and papers below this box.

21 February 2017

Netonets @ NetSci 2017: Call for Contributions!

We are delighted to invite submissions for

Network of Networks
7th Edition – Satellite Symposium at NetSci2017

taking place at the JW Marriott Indianapolis, Indiana, United States,
on June 20th, 2017.

Submission:
We invite you to submit a 300-word abstract including one descriptive figure using our EasyChair submission link:
https://easychair.org/conferences/?conf=netonets2017

Deadline for submission: April 21st, 2017.
Notification of acceptance will be sent out by April 28th, 2017.

Further Information at: http://www.netonets.org/events/netonets2017

Abstract:
For the seventh time, it is our pleasure to bring together pioneering work in the study of networks of networks. Networks of networks are networks in which the nodes may be connected through different relations, are part of interdependent layers and connected by higher order dynamics. They can represent multifaceted social interactions, critical infrastructure and complex relational data structures. In our call, we are looking for a diversity of research contributions revolving around networks of networks of any kind: in social media, in infrastructure, in culture. We are particularly keen on receiving works raising novel issues and provocative queries in the investigation of networks of networks, as well as new contributions tackling these challenges. How do networks of networks change the paradigm of established problems like percolation or community detection? How are we shifting our thoughts to be ready for this evolution? Running parallel to NetSci, the top network science conference, this event provides a valuable opportunity to connect with leading researchers in complex network science.

Confirmed Keynote:

Marta Gonzales – MIT
Gareth Baxter – University of Aveiro
Romualdo Pastor-Satorras – Universitat Politècnica de Catalunya
Paul Hines – University of Vermont (UNCONFIRMED)

The final program will strive for the inclusion of contributions from different research fields, creating an interdisciplinary dialogue about networks of networks.

Best regards,
The Netonets organizers,

Antonio Scala, La Sapienza – antonio.scala.phys@gmail.com
Gregorio D’Agostino, ENEA – gregorio.dagostino@enea.it
Michele Coscia, Harvard University – michele_coscia@hks.harvard.edu
Przemysław Kazienko, Wroclaw University of Technology – przemyslaw.kazienko@pwr.edu.pl

25 January 2017

Network Backboning with Noisy Data

Networks are a fantastic tool for understanding an interconnected world. But, to paraphrase Spider Man, with networks’ expressive power come great headaches. Networks lure us in with their promise of clearly representing complex phenomena. However, once you start working with them, all you get is a tangled mess. This is because, most of the time, there’s noise in the data and/or there are too many connections: you need to weed out the spurious ones. The process of shaving the hairball by keeping only the significant connections — the red ones in the picture below —  is called “network backboning”. The network backbone represents the true relationships better and will play much nicer with other network algorithms. In this post, I describe a backboning method I developed with Frank Neffke, from the paper “Network Backboning with Noisy Data” accepted for publication in the International Conference on Data Engineering (the code implementing the most important backbone algorithms is available here).

bb1

Network backboning is as old as network analysis. The first solution to the problem was to keep edges according to their weight. If you want to connect people who read the same books, pairs who have few books in common are out. Serrano et al. pointed out that edge weight distributions can span many orders of magnitude, as shown in the figure below (left). Even with a small threshold, we are throwing away a lot of edges. This might not seem like a big deal (after all, we’re in the business of making the network sparser), except that the weights are not distributed randomly. The weight of an edge is correlated with the weights of the edges sharing a node with it, as shown by the figure below (right). It is easy to see why: if you have a person who read only one book, all her edges can have at most weight one.

Their weights might be low in comparison with the rest of the network, but they are high for their nodes, given their propensity to connect weakly. Isolating too many nodes because we accidentally removed all their edges is a no-no, so Serrano and coauthors developed the Disparity Filter (DF): a technique to estimate the significance of one node’s connections given its typical edge weight, regardless of what the rest of the network says.
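To make the contrast concrete, here is a minimal sketch comparing a global weight threshold with a DF-style score. It assumes the commonly cited form of the Disparity Filter, in which an edge of weight w attached to a node with total strength s and degree k gets the score (1 - w/s)^(k - 1), and an edge is judged by the better of its two endpoint scores. The reader names, weights and cutoffs are made up for illustration, and this is not the code released with the paper.

```python
from collections import defaultdict

def threshold_backbone(edges, min_weight):
    """Naive backbone: keep only edges whose weight is at least min_weight."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def disparity_alpha(edges):
    """DF-style score per undirected edge (lower means more significant)."""
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    def one_sided(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # a single edge gives the node nothing to compare against
        return (1.0 - w / strength[node]) ** (k - 1)

    # Assumption: an edge is scored by its most favourable endpoint.
    return {(u, v): min(one_sided(u, w), one_sided(v, w))
            for (u, v), w in edges.items()}

# Hypothetical toy data: edge weight = number of books two readers share.
edges = {
    ("ann", "bob"): 40, ("ann", "cat"): 35, ("bob", "cat"): 30,
    ("ann", "dan"): 6,   # small for the network, large for dan
    ("dan", "eve"): 1, ("dan", "fay"): 1, ("dan", "gus"): 1,
}

naive = threshold_backbone(edges, min_weight=10)
alpha = disparity_alpha(edges)
print(("ann", "dan") in naive)   # the global threshold drops this edge
print(alpha[("ann", "dan")])     # yet it stands out from dan's point of view
```

In a toy network this small very few edges pass any strict significance level, so the only point here is that the ann-dan edge, which a global threshold of 10 discards, scores well once it is judged against dan's own typical weights.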

bb2

This sounds great, but DF and other network backboning approaches make imprecise assumptions about the possibility of noise in our estimate of edge weights. In our book example, noise means that a user might have accidentally said that she read a book she didn’t, maybe because the titles were very similar. One thing DF gets wrong is that, when two nodes are not connected in the raw network data, it would say that measurement error is absent. This is likely incorrect, and it screams for a more accurate estimate of noise. I’m going to leave the gory math details in the paper, but the bottom line is that we used Bayes’ rule. The law allows us to answer the question: how surprising is the weight of this edge, given the weights of the two connected nodes? How much does it defy my expectation?

The expectation here can be thought of as an extraction without replacement, much like Bingo (which statisticians — notorious for being terrible at naming things — would call a “hypergeometric” one). Each reader gets to extract a given number of balls (n, the total number of books she read), drawing from a bin in which all balls are the other users. If a user read ten books, then there are ten balls representing her in the bin. This is a good way to have an expectation for zero edge weights (nodes that are not connected), because we can estimate the probability of never extracting a ball with a particular label.
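The Bingo analogy can be sketched with the standard hypergeometric formula. This is only an illustration of the null expectation, not the full NC model from the paper; the function names and the numbers (a bin of 100 balls, 10 of them labelled with one particular user, 10 draws) are assumptions chosen for the example.

```python
from math import comb

def hypergeom_pmf(k, total, labelled, draws):
    """P(exactly k of `draws` balls carry the label), drawing without replacement."""
    if k < 0 or k > min(labelled, draws) or draws - k > total - labelled:
        return 0.0
    return comb(labelled, k) * comb(total - labelled, draws - k) / comb(total, draws)

def expected_weight(total, labelled, draws):
    """Mean of the hypergeometric distribution: the edge weight we expect by chance."""
    return draws * labelled / total

# Hypothetical numbers: the bin holds 100 balls, 10 of which represent user B
# (she read ten books); user A draws 10 balls (she also read ten books).
total, labelled, draws = 100, 10, 10

mean = expected_weight(total, labelled, draws)
p_zero = hypergeom_pmf(0, total, labelled, draws)

print(mean)    # expected number of shared draws under the null model
print(p_zero)  # probability that A and B end up with no edge at all
```

Because the probability of a zero weight is an explicit number rather than an assumed certainty, a missing edge between two very active users can itself be treated as surprising.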

bb4

I highlighted the words one and two, because they’re a helpful key to understanding the practical difference between the approaches. Consider the toy example below. In it, each edge’s thickness is proportional to its weight. Both DF and our Noise Corrected backbone (NC) select the black edges: they’re thick and important. But they have different opinions about the blue and red edges. DF sees that nodes 2 and 3 have mostly weak connections, meaning their thick connection to node 1 stands out. So, DF keeps the blue edges and it drops the red edge. It only ever looks at one node at a time.

bb5

NC takes a different stance. It selects the red edge and drops the blue ones. Why? Because for NC what matters more is the collaboration between the two nodes. Sure, the blue connection is thicker than the red one. But node 1 always has strong connections, and its blue edges are actually particularly weak. On the other hand, node 3 usually has weak connections. Proportionally speaking, the red edge is more important for it, and so it gets saved.

To sum up, NC:

  1. Refines our estimate of noise in the edge weights;
  2. Sees an edge as the collaboration between two nodes rather than an event happening to one of them;
  3. Uses a different model exploiting Bayes’ law to bake these aspects together.

bb6

How does that work for us in practice? Above you see some simulations made with artificial networks, of which we know the actual underlying structure, plus some random noise — edges thrown in that shouldn’t exist. The more noise we add the more difficult it is to recover the original structure. When there is little noise, DF (in blue) is better. NC (in red) starts to shine as we increase the amount of noise, because that’s the scenario we were targeting.

In the paper we also show that NC backbones have stability comparable to DF’s, meaning that extracting the backbone from different time snapshots of the same phenomenon usually does not yield wildly different results. Coverage (the number of nodes that still have at least one edge in the network) is also comparable. Then we look at quality. When we want to predict some future relationship among the nodes, we expect noisy edges to introduce errors in the estimates. Since a backbone aims at throwing them away, it should increase our predictive power. The table below (click it to enlarge) shows that, in different country-country networks, the predictive quality (R2) using an NC backbone is higher than the one we get using the full noisy network. The quality of prediction can get as high as twice the baseline (the table reports the quality ratio: R2 of the backbone over R2 of the full network, for different methods).

bb8

The conclusion is that, when you are confident about the measurement of your network, you should probably extract its backbone using DF. However, in cases of uncertainty, NC is the better option. You can test it yourself!

30 November 2016

Exploring the Uncharted Export

Exporting goods is great for countries: it is a way to attract foreign currency. Exports are also fairly easy to analyze, since they are put in big crates and physically shipped through borders, where they are usually triple checked*. However, there is another way to attract foreign currency that escapes this analytical convenience. And it is a huge one. Tourism. When tourists get inside your country, you are effectively exporting something: anything that they buy. Finding out exactly what and how much you’re exporting is tricky. Some things are easy: hotels, vacation resorts, and the like. Does that cover all they buy? Probably not.

Investigating this question is what I decided to do with Ricardo Hausmann and Frank Neffke in our new CID Working Paper “Exploring the Uncharted Export: An Analysis of Tourism-Related Foreign Expenditure with International Spend Data”. The paper analyzes tourism with a new and unique dataset. The MasterCard Center for Inclusive Growth endowed us with a data grant, sharing with us anonymized and aggregated transaction data that give us insights into the spending behavior of foreigners inside two countries, Colombia and the Netherlands.

tourism1

The first thing to clear up is the question: does tourism really matter? Tourism might be huge for some countries (Seychelles or Bahamas come to mind**), but does it matter for any other country? Using World Bank estimates (which, as we’ll see, are probably underestimates) we can draw tourism as the number one export of many countries. Above you see two treemaps (click on them to enlarge) showing the composition of the export basket of Zimbabwe and Spain. The larger the square, the more the country earns from exporting that product. Tourism would add a square larger than tobacco for Zimbabwe, and twice as big as cars for Spain. Countries make a lot of money out of tourism, so it is crucial to have a more precise way to investigate it.

tourism2

How do we measure tourism? As said before, we’re working with anonymized and aggregated transaction data. In practice, for each postal code of the country of destination we can know how many cards and how much expenditure in total happened in different retail sectors. We focus on cards which were issued outside the country we are analyzing. This way we can be confident we are capturing mostly foreign expenditures. There are many caveats to keep in mind which affect our results: we do not see cash expenditures, we have only a non-random sample from MasterCard cards, and so on. However, when we look at maps showing us the dollar intensity in the various parts of the country (above for Colombia and the Netherlands — click on them to enlarge), we find comforting validation with external data: the top six tourism destinations as reported by Trip Advisor always correspond to areas where we see a lot of activity also in our data.

nld_communities

We also see an additional thing, and it turns out to be related to the advantage of our data over traditional tourism reports. A lot is happening on the border. In fact, the second most popular Colombian city after Bogotá is Cucuta. If you have never heard of Cucuta it just means that you are not from Colombia, or Venezuela: Cucuta is a city on the northeastern border of the country. It is the place where many Venezuelans cross the border to do their shopping, representing a huge influx of cash for Colombia. Until the border got closed, at least (the data is from before this happened, now it’s open again). In the Netherlands, you can cluster municipalities according to the dominant foreign countries observed there (see map above). You will find a Belgian cluster, for instance (in purple). This cluster is dominated by grocery and shopping.

tourism3

While these Belgian shoppers are probably commuters rather than tourists, they are nevertheless bringing foreign currency to local coffers, so that’s great. And it is something not really captured by alternative methodologies. We classify a merchant type as “commuting” if it is predominant in the purple cluster, because it is more popular for “local” Belgian travelers. Everything else is either “tourism” (if it is predominant in the other non-border municipalities) or “other” if there is no clear dominance anywhere. In the tourism cluster you find things like “Accommodations” and “Travel Agencies and Tour Operators”; in the commuting cluster you have merchants classified under “Automotive Retail” and “Pet Stores”. When you look at the share of expenditures going to the commuting cluster (above in green), you realize how significant this is. One out of four foreign dollars spent in the Netherlands goes to non-tourism related activities. The share for Colombia goes up to 30%.
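The labelling rule above can be sketched in a few lines. The paper's actual criterion for "predominant" is not reproduced here: the 1.5x margin, the spending shares and the "Restaurants" category are all hypothetical, picked only to show the three possible outcomes.

```python
def classify_merchant(border_share, elsewhere_share, margin=1.5):
    """Label a merchant type by where it is over-represented.

    border_share:    share of foreign spend in the border cluster going to this type
    elsewhere_share: the same share in the non-border municipalities
    margin:          hypothetical factor defining a "clear dominance"
    """
    if border_share >= margin * elsewhere_share:
        return "commuting"
    if elsewhere_share >= margin * border_share:
        return "tourism"
    return "other"

# Made-up shares: (border cluster, other municipalities).
shares = {
    "Accommodations": (0.05, 0.30),
    "Automotive Retail": (0.20, 0.04),
    "Pet Stores": (0.03, 0.01),
    "Restaurants": (0.15, 0.14),   # hypothetical category with no clear dominance
}

labels = {merchant: classify_merchant(b, e) for merchant, (b, e) in shares.items()}
for merchant, label in labels.items():
    print(merchant, "->", label)
```

Summing expenditure over the merchant types labelled "commuting" and dividing by total foreign expenditure would then give a share like the one quoted above.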

tourism4

A post in this blog would not be complete without a gratuitous network visualization, so here we are. What you see is what we call “Origin Space” for Colombia (above) and the Netherlands (below). Nodes are countries of origin, and they are connected if the tourists from these countries behave similarly in what and where they make their purchases. The color of the node tells you the continent of the country. The size is the presence of tourists in the country of destination, relative to the origin’s GDP. The size and color of the edge is proportional to how similar the two origins are (orange = very similar; blue = similar, but not so much). We can see that Colombia has a lot of large red nodes — from the Americas — and the Netherlands is strong in blue nodes — from Europe.

If you click on the picture and zoom into the Colombia network you will see why this kind of visualization is useful. Colombia is fairly well-placed in the Australian market: the corresponding node is quite large. One thing then jumps out. Australia has a large and very orange connection. To New Zealand. No surprise: Australians and New Zealanders are similar. Yet, the New Zealand node is much smaller than Australia. It shouldn’t be: these are relative expenditures. This means that, for some reason, Colombia is not currently an appealing destination for New Zealanders, even if it should be, based on their similarity with Australians. New Zealand should then be a target of further investigation, which might lead to the untapping of a new potential market for Colombian tourism.
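The Australia and New Zealand reasoning can be turned into a simple screening rule: look for pairs of origins that behave alike but have very different node sizes. The sketch below assumes cosine similarity between spending profiles, which is a stand-in and not necessarily the measure used in the paper; the profiles, presence values and thresholds are invented.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two spending profiles of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def untapped_markets(profiles, presence, min_sim=0.95, min_ratio=3.0):
    """Return (small, big, similarity) for similar origins with very unequal presence."""
    found = []
    names = sorted(profiles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(profiles[a], profiles[b])
            big, small = (a, b) if presence[a] >= presence[b] else (b, a)
            if sim >= min_sim and presence[big] >= min_ratio * presence[small]:
                found.append((small, big, sim))
    return found

# Hypothetical data: share of spend per merchant category, and presence
# in the destination relative to the origin's GDP (the node size).
profiles = {
    "AUS": [0.50, 0.30, 0.20],
    "NZL": [0.48, 0.32, 0.20],
    "USA": [0.10, 0.20, 0.70],
}
presence = {"AUS": 8.0, "NZL": 1.0, "USA": 9.0}

candidates = untapped_markets(profiles, presence)
for small, big, sim in candidates:
    print(f"{small} looks like {big} (similarity {sim:.3f}) but is much less present")
```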

And this concludes the reasons why this data is so amazing for studying tourism. To wrap up the message, we have first validated the data, showing that it agrees with what we expect to be the most important tourism destinations of a country. Then, we unleashed its potential: the ability to detect “non-tourism” foreign cash inflows, and the promising initial development of tools to discover potential missing opportunities.


* The process is not foolproof, though. I don’t remember where I read it, but it seems that if you sum all declared exports of countries and all declared imports, and subtract the two, you get a quite high positive number. I wonder where all these extra exports are going. Mars?

** When I was told we were doing a tourism project I got high hopes. I’m still waiting for a fully paid work mission to be approved by management. Any day now.

22 August 2016

It’s Not All in the Haka: Networks Matter in Rugby Too

If there is one thing that I love more than looking at silly pictures on the Interwebz for work, it is watching rugby for work. I love rugby: in my opinion it is the most beautiful team sport out there. It tingles my network senses: 15 men on the field have to coordinate like a single organism to achieve their goal, which is crossing the goal line with the ball by passing it backwards instead of forward. When Optasports made available some data collected during 18 rugby matches I felt I could not miss the opportunity for some hardcore network nerding on them. The way teams weave their collaboration networks during a match must have some relationship with their performance, and I was going to find out what this relationship might be.


For my quest I teamed up with Luca Pappalardo and Paolo Cintia, two friends of mine who are making an impact on network and big data sports analytics, both in soccer and in cycling. The result was “The Haka Network: Evaluating Rugby Team Performance with Dynamic Graph Analysis“, a paper recently presented at the DyNo workshop in San Francisco. Our questions were:

  1. Is there a relationship between the topology of the network of passes and the success of the team?
  2. Is there a relationship between disruptions made by tackles and territorial gains?
  3. If we want to predict a team’s success, is it better to build networks of passes and disruptions for each action separately or for the entire match?
  4. Can we use these relationships to “predict” the outcome of the match?

rugby1

A pass network is simply a network whose nodes are the players of a team and the directed connections go from the player originating a pass to the player receiving the ball. We consider only completed passes: the ones that did not result in an error or lost possession. In the above picture, those are the green edges and they are always established between players belonging to the same team. In rugby, players are allowed to tackle the current ball carrier of the opposing team. When that happens, we create another directed edge, this time in what we call the “disruption network”. The aim of a tackle is to prevent the opposing team from gaining meters. These are the red edges in the above picture and can only be established between players belonging to opposite teams. The picture you see is the collection of all passes and tackles which happened in the Italy vs New Zealand match in 2012. It is a multilayer network as it contains edges of two different types: passes and tackles.
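As a rough sketch of how the two layers could be assembled from an event log: the event format below (dictionaries with "type", "from", "to" and "completed" keys) is hypothetical and not the Optasports schema, and the direction of tackle edges, from tackler to ball carrier, is an assumption.

```python
from collections import Counter

def build_layers(events):
    """Return (pass_edges, tackle_edges) as weighted directed edge counters."""
    passes, tackles = Counter(), Counter()
    for event in events:
        source, target = event["from"], event["to"]
        if event["type"] == "pass" and event["completed"]:
            # Passes only connect players of the same team.
            if source[0] == target[0]:
                passes[(source, target)] += 1
        elif event["type"] == "tackle":
            # Tackles only connect players of opposite teams.
            if source[0] != target[0]:
                tackles[(source, target)] += 1
    return passes, tackles

# Made-up events; players are (team, shirt number) pairs.
events = [
    {"type": "pass", "from": ("ITA", 9), "to": ("ITA", 10), "completed": True},
    {"type": "pass", "from": ("ITA", 9), "to": ("ITA", 10), "completed": True},
    {"type": "pass", "from": ("ITA", 10), "to": ("ITA", 12), "completed": False},
    {"type": "tackle", "from": ("NZL", 7), "to": ("ITA", 10)},
]

pass_edges, tackle_edges = build_layers(events)
print(dict(pass_edges))
print(dict(tackle_edges))
```

Keeping the two counters separate is what makes the result a two-layer network: the same players appear in both, but the edges mean different things.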

Once we have pass and disruption networks we can calculate a collection of network measures. I’ll give a brief idea here, but if you are looking for more formal definitions you’ll have to search for them in the paper:

  • Connectivity: how many pass connections you have to remove to isolate players;
  • Assortativity: the tendency of players to pass the ball to players with a similar number of connections — in high assortativity central players pass to other central players and marginal players to other marginal players;
  • Components: how many “sinks” there are, in that the ball never goes back to the bulk of the team when it is passed to a player in a sink;
  • Clustering: how many triangles there are, meaning that the team can be decomposed in many different smaller sub-teams of three players.

These are the features we calculated for the pass networks. The disruption case is slightly different. We calculated the same features for the team when removing the tackled player, weighted by the relative number of tackles. If 50% of the tackles hit player number 11, then 50% of the disrupted connectivity is the connectivity value of the pass network when removing player 11. The reason is that the tackled player is temporarily removed from the game, so we need to know how the team performs without him, weighted by the number of times this occurrence happens.
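The weighting described above amounts to a weighted average over tackled players. In the sketch below the feature is passed in as a function; the number of distinct pass links is used as a simple stand-in for connectivity, the pass and tackle counts are invented, and returning the plain feature when there are no tackles is an assumption.

```python
def without_player(passes, player):
    """Pass network with every edge touching `player` removed."""
    return {(u, v): w for (u, v), w in passes.items() if player not in (u, v)}

def disrupted_feature(passes, tackles_per_player, feature):
    """Average of feature(network minus tackled player), weighted by tackle share."""
    total = sum(tackles_per_player.values())
    if total == 0:
        return feature(passes)  # assumption: no tackles means no disruption
    return sum(
        count / total * feature(without_player(passes, player))
        for player, count in tackles_per_player.items()
    )

def link_count(passes):
    """Stand-in feature: number of distinct passer-receiver links."""
    return len(passes)

# Hypothetical pass counts between shirt numbers, and tackles suffered.
passes = {(9, 10): 5, (10, 12): 3, (12, 11): 2, (10, 11): 1}
tackles_per_player = {11: 2, 10: 2}   # half of the tackles hit player 11

value = disrupted_feature(passes, tackles_per_player, link_count)
print(value)  # 0.5 * links without 11 + 0.5 * links without 10
```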

So, it is time to give some answers. Shall we?

1. Is there a relationship between the topology of the network of passes and the success of the team?

rugby2

Yes, there is. We calculate “success” as the number of meters gained, ball in hand, by the team. The objective of rugby is to cross the goal line carrying the ball, so meters made is a pretty good indicator. We control for two things. First, the total number of passes: it simply means the team was able to hold onto the ball longer, so it is trivially expected to result in more meters. Second, the home advantage, which is a huge factor in rugby: Italy won only 12 out of 85 matches in the European “Six Nations” tournament, and 11 of them were in Italy. After these controls, we find that two features have good correlations with meters made: connectivity and components. The more edges are needed to isolate a player, the more meters a team is expected to make (p < .01, R2 = 47%). More sinks in a team are associated with lower gains in meters.

2. Is there a relationship between disruptions made by tackles and territorial gains?

rugby3

Again: yes. In this case it seems that all calculated features matter to predict meters made. The strongest factor is again leftover connectivity. It means that if the connectivity of the pass network increases after the tackled player is removed from it then the team is able to advance more. Simplifying: if you are able to tackle only low connectivity players, then your opponent is able to gain more territory (p < .01, R2 = 48%).

3. Is it better to build networks of passes and disruptions for each action separately or for the entire match?

The answers to the previous two questions were obtained by calculating the features on the global match networks. The global network uses all the data from a match, exactly like the pass and disruption edges depicted in the above figure. In principle, one could calculate these features as the match unfolds: sequence by sequence. In fact, network features at the action level work very well in soccer, as Luca and Paolo already proved. Does that work also in rugby?

Surprisingly, the answer is no. We recalculated the features for each passage of play. A passage of play is the part of a match from when a team gets into possession of the ball until it loses it, scores, or the game flow stops for an infraction. When we calculate features at this level, we find very weak correlations: almost nothing is significant and, when it is, the predictive power is very low. We think that this is because in rugby our definition of sequence is too strict. While soccer is a tactical game — where each sequence counts for itself — rugby is a grand strategy game: sequences build cumulative advantage which pays off after a series of them — or only in the match as a whole.

4. Can we use these relationships to “predict” the outcome of the match?


This is the real queen question of the post, and we do not fully answer it, unfortunately. However, we have a very good reason to think that the answer could be positive. We created a predictor which trains on 17 matches and then, given the global multi-layer network, will pick the winner. You can see the problem of the approach here: we use the network of the match as it happened to “predict” the outcome. However, we did that only because we did not have enough matches for each individual team: we believe we can first predict how pass and disruption networks will shape up in a new match using historic data and then use that to predict the outcome. That will be future work, maybe if some team is intrigued by networks and wants to contact us for a collaboration… (wink wink).

The reason I still like to report on our predictor is that it has a very promising property. Its accuracy was 83%. We compared with a prediction made with official rugby rankings, whose performance is worse: 76% accuracy. We also tested against bookmakers, who are better than us with their 86% accuracy. However, historic data on bets only cover more important matches — only 14 out of 18 — and matches between minor teams are usually less predictable. The fact that we are on par on a more difficult task is remarkable. More importantly, bookies tend to just “choose the best team”. For instance, they always predict a New Zealand win. The Haka, however, is not always enough and our networks caught that. New Zealand lost to England in a big upset on December 1st 2012. The bookmakers didn’t see that coming, but our network approach could have.

07 July 2016

Building Data-Driven Development

A few weeks ago I had the honor to speak at my group’s “Global Empowerment Meeting” about my research on data science and economic development. I’m linking here the Youtube video of my talk and my transcript for those who want to follow it. The transcript is not 100% accurate given some last minute edits, and the fact that I’m a horrible presenter :-), but it should be good enough. Enjoy!


We think that the big question of this decade is about data. Data are the building blocks of our modern society. We think in development we are not currently using enough of these blocks, we are not exploiting data nearly as much as we should. And we want to fix that.

Many of the fastest growing companies in the world, and definitely the ones that are shaping the progress of humanity, are data-intensive companies. Here at CID we just want to add the entire world to the party.

So how do we do it? To fix the data problem development literature has, we focus on knowing what the global knowledge building looks like. And we inspect three floors: how does knowledge flow between countries? What lessons can we learn inside these countries? What are the policy implications?

To answer these questions, we were helped by two big data players. The quantity and quality of the data they collect represent a revolution in the economic development literature. You heard them speaking at the event: they are MasterCard – through their Center for Inclusive Growth – and Telefonica.

Let’s start with MasterCard: they help us with the first question, how does knowledge flow between countries? Credit card data answer that. Some of you might have a corporate issued credit card in your wallet right now. And you are here, offering your knowledge and assimilating the knowledge offered by the people sitting at the table with you. The movements of these cards are movements of brains, ideas and knowledge.

When you aggregate this at the global level you can draw the map of international knowledge exchange. When you have a map, you have a way to know where you are and where you want to be. The map doesn’t tell you why you are where you are. That’s why CID builds something better than a map.

We are developing a method to tell why people are traveling. And reasons are different for different countries: equity in foreign establishments like the UK, trade partnerships like Saudi Arabia, foreign greenfield investments like Taiwan.

Using this map, it is easy to envision where you want to go. You can find countries who have a profile similar to yours and copy their best practices. For Kenya, Taiwan seems to be the best choice. You can see that, if investments drive more knowledge into a country, then you should attract investments. And we have preliminary results to suggest whom to attract: the people carrying the knowledge you can use.

The Product Space helps here. If you want to attract knowledge, you want to attract the one you can more easily use. The one connected to what you already know. Nobody likes to build cathedrals in a desert. More than having a cool knowledge building, you want your knowledge to be useful. And used.

There are other things you can do with international traveler flows. Like tourism. Tourism is a great export: for many countries it is the first export. See these big portions of the exports of Zimbabwe or Spain? For them tourism would look like this.

Tourism is hard to pin down. But it is easier with our data partners. We can know when, where and which foreigners spend their money in a country. You cannot paint pictures as accurate as these without the unique dataset MasterCard has.

Let’s go to our second question: what lessons can we learn from knowledge flows inside a country? Telefonica data is helping us answer this question. Here we focus on a test country: Colombia. We use anonymized call metadata to paint the knowledge map of Colombia, and we discover that the country has its own knowledge departments. You can see them here, where each square is a municipality, connecting to the ones it talks to. These departments correlate only slightly with the actual political boundaries. But they matter so much more.

In fact, we asked if these boundaries could explain the growth in wages inside the country. And they seem to be able to do it, in surprisingly different ways. If you are a poor municipality in a rich state in Colombia, we see your wage growth penalized. You are on a path of divergence.

However, if you are a poor municipality and you talk to rich ones, we have evidence to show that you are on a path of convergence: you grow faster than you would be expected to. Our preliminary results seem to suggest that being in a rich knowledge state matters.

So, how do you use this data and knowledge? To do so you have to drill down at the city level. We look not only at communication links, but also at mobility ones. We ask if a city like Bogota is really a city, or different cities in the same metropolitan area. With the data you can draw four different “mobility districts”, with a lot of movements inside them, and not so many across them.

The mobility districts matter, because combining mobility and economic activities we can map the potential of a neighborhood, answering the question: if I live here, how productive can I be? A lot in the green areas, not so much in the red ones.

With this data you can reshape urban mobility. You know where the entrance barriers to productivity are, and you can destroy them. You remodel your city to include in its productive structure people that are currently isolated by commuting time and cost. These people have valuable skills and knowhow, but they are relegated to the informal sector.

So, MasterCard data told us how knowledge flows between countries. Telefonica data showed the lessons we can learn inside a country. We are left with the last question: what are the policy implications?

So far we have mapped the landscape of knowledge, at different levels. But to hike through it you need a lot of equipment. And governments provide part of that equipment. Some in better ways than others.

To discover the policy implications, we unleashed a data collector program on the Web. We wanted to know what the structure of the government in the US looks like. Our program returned a picture of the hierarchical organization of government functions. We know how each state structures its own version of this hierarchy. And we know how all those connections fit together in the union, state by state. We are discovering that the way a state government is shaped seems to be the result of two main ingredients: where a state is and what its productive structure looks like.

We want to establish that the way a state expresses its government on the Web reflects the way it actually performs its functions. We seem to find a positive answer: for instance, having your environmental agencies talk to each other seems to work well to improve your environmental indicators, as recorded by the EPA. Wiring organizations together when we see positive feedback and rethinking them when we see a negative one is a direct consequence of this Web investigation.

I hope I was able to communicate to you the enthusiasm CID discovered in the usage of big data. Zooming out to gaze at the big picture, we start to realize what the knowledge building looks like. As the building grows, so does our understanding of the world, development and growth. And here’s the punchline of CID: the building of knowledge grows with data, but the shape it takes is up to what we make of this data. We chose to shape this building with larger doors, so that it can be used to ensure a more inclusive world.


By the way, the other presentations of my session were great, and we had a nice panel after that. You can check out the presentations on the official Center for International Development YouTube channel. I’m embedding the panel’s video below:

09 June 2016 ~ 0 Comments

Netsci 2016 Report

Another NetSci edition went by, as interconnected as ever. This year we got to enjoy Northeast Asia, a new scenario for us network scientists, and an appropriate one: many new faces popped up both among speakers and attendees. Seoul was definitely what NetSci needed at this time. I want to spend just a few words about what impressed me the most during this trip — well, second most after what Koreans did with their pizzas: that is unbeatable. Let’s go chronologically, starting with the satellites.

You all know I was co-organizing the one on Networks of networks (you didn’t? Then scroll down a bit and get informed!). I am pleased with how things went: the talks we gathered this year were most excellent. Space constraints don’t allow me to give everyone the attention they deserve, but I want to mention two. First is Yong-Yeol Ahn, who was the star of this year. He gave four talks at the conference (provided I haven’t miscounted), and his plenary one on the analysis of the Linkedin graph was just breathtaking. At Netonets, he talked about the internal belief network each one of us carries in her own brain, and its relationship with how macro societal behaviors arise in social networks. An original take on networks of networks, and one that spurred the idea: how much are the inner workings of one’s belief network affected by the metabolic and the bio-connectome networks of one’s own body? Should we study networks of networks of networks? Second, Nitesh Chawla showed us how higher-order networks unveil real relationships among nodes. The same node can behave as if it were many different ones, depending on which of its connections we are considering.

Besides the most awesome networks of networks satellite, other ones caught my attention. Again, space is my tyrant here, so I get to award just one slot, and I would like to give it to Hyejin Youn. Her satellite was on the evolution of technological networks. She does amazing things tracking how the patent network evolved from the depths of 1800 until now. The idea is to find viable innovation paths, and to predict which fields will have the largest impact in the future.

When it comes to the plenary sessions, I think Yang-Yu Liu stole the spotlight with a flashy presentation about the microcosmos everybody carries in their guts. The analysis of the human microbiome is a very hot topic right now, and it pleases me to know that there is somebody working on a network perspective of it. Besides scientific merits, whoever extensively quotes Minute Earth videos (bonus points for it being the one about poop transplants) has my eternal admiration. I also want to highlight Ginestra Bianconi’s talk. She has an extraordinary talent for bringing to network science the most cutting edge aspects of physics. Her line of research combining quantum gravity and network geometry is a dream come true for a physics nerd like myself. I always wished to see advanced physics concepts translated into network terms, but I never had the capacity to do so: now I just have to sit back and wait for Ginestra’s next paper.

What about contributed talks? The race for the second best is very tight. The very best was clearly mine on the link between mobility and communication patterns, about which I showed a scaling relationship connecting them (paper, post). I will be magnanimous and spare you all the praises I could sing of it. Enough joking around, let’s move on. Juyong Park gave two fantastic talks on networks and music. This was a nice breath of fresh air for digital humanities: this NetSci edition was orphaned of the great satellite chaired by Max Schich. Juyong showed how to navigate through collaboration networks on classical music CDs, and through judge biases in music competitions. By the way, Max dominated, as expected, the lightning talk session, showing some new products coming from his digital humanities landmark published last year in Science. Tomomi Kito was also great: she borrowed the tools of economic complexity and shifted her focus from the macro analysis of countries to the micro analysis of networks of multinational corporations. A final mention goes to Roberta Sinatra. Her talk was about her struggle to make PhD committees recognize that what she is doing is actually physics. It resonates with my personal experience, trying to convince hiring committees that what I’m doing is actually computer science. Maybe we should all give up the struggle and just create a network science department.

And so we get to the last treat of the conference: the Erdos-Renyi prize, awarded to the most excellent network researcher under the age of 40. This year it went to Aaron Clauset, and this pleases me for several reasons. First, because Aaron is awesome, and he deserves it. Second, because he is the first computer scientist to be awarded the prize, and this just gives me hope that our work too is getting recognized by the network gurus. His talk was fantastic on two accounts.

For starters, he presented his brand new Index of Complex Networks. The interface is pretty clunky, especially on my Ubuntu Firefox, but that does not hinder the usefulness of such an instrument. With his collaborators, Aaron collected the most important papers in the network literature, trying to find a link to a publicly available network. If they were successful, that link went in the index, along with some metadata about the network. This is going to be a prime resource for network scientists, both for starting new projects and for the sorely needed task of replicating previous results.

Replication is the core of the second reason I loved Aaron’s talk. Once he collected all these networks, for fun he took a jab at some of the dogmas of network science. The main one everybody knows is: “Power-laws are everywhere”. You can see where this is going: the impertinent Colorado University boy showed that yes, power-laws are very common… among the 5-10% of networks in which it is possible to find them. Not so much “everywhere” any more, huh? This was especially irreverent given that not so long before Stefan Thurner gave a very nice plenary talk featuring a carousel of power laws. I’m not picking sides in the debate: I hardly feel qualified to do so. I just think that questioning dearly held results is always a good thing, to avoid fooling ourselves into believing we’ve reached an objective truth.

Among the non-scientific merits of the conference, I talked with Vinko Zlatic about the Croatian government on the brink of collapse, spread the search for a new network scientist by the Center for International Development, and discovered that Korean pizzas are topped with almonds (you didn’t really think I was going to let slip that pizza reference at the beginning of the post, did you?). And now I made myself sad: I wish there was another NetSci right away, to shove my brain down into another blender of awesomeness.  Oh well, there are going to be plenty of occasions to do so. See you maybe in Dubrovnik, Tel-Aviv or Indianapolis?

20 May 2016 ~ 0 Comments

Program of Netonets 2016 is Out!

As announced in the previous post, the symposium on networks of networks is happening in less than two weeks: May 31st @ 9AM, room Dongkang C of the K-Hotel Seoul, South Korea. Przemek Kazienko, Gregorio D’Agostino and I have a fantastic program and set of speakers to keep you entertained on multilayer, interdependent and multislice networks. Take a look for yourself!

Session I

9:00 – 9:15: Room set up
9:15 – 9:30: Welcome from the organizers
9:30 – 10:15: Invited I: Yong-Yeol Ahn: Dynamics of social network of belief networks
10:15 – 11:00: Invited II: Luca Maria Aiello: The Nature of Social Links

11:00 – 11:30: Coffee Break

Session II

11:30 – 12:15: Invited III: Jianxi Gao: Networks of Networks: From Structure to Dynamics
12:15 – 13:00: Invited IV: Tomasz Kajdanowicz: Fusion methods for classification in multiplex networks

13:00 – 14:30: Lunch Break

Session III

14:30 – 15:15: Invited V: Michael Danziger: Beyond interdependent networks
15:15 – 15:35: Contributed I: Bruno Coutinho: Greedy Leaf Removal on Hypergraphs
15:35 – 15:55: Contributed II: Yong Zhuang: Complex Contagions in Clustered Random Multiplex Networks

15:55 – 16:30: Coffee Break

Session IV

16:30 – 17:15: Invited VI: Nitesh Chawla: From complex interactions to networks: the higher-order network representation

17:15 – 18:00: Round table – Open discussion
18:00 – 18:15: Organizers wrap up

Remember to register for the main NetSci conference if you want to attend.

Incidentally, the end of May is going to be a rather busy period for me. Besides co-organizing Netonets and speaking at the main Netsci conference, I’m also going to present at the Core50 conference in Louvain-la-Neuve, Belgium, on the role of social and mobility networks in shaping the economic growth of a country. Thanks to Jean-Charles Delvenne for inviting me!

I hope to see many of you there!

17 March 2016 ~ 0 Comments

Networks of Networks @ NetSci 2016

EDIT: Deadlines & speakers updated. Submission deadline is on April 27th, notification on April 29th.

 

Dear readers of this blog — yes, both of you –: it’s that time of the year again. As tradition dictates, I’m organizing the Networks of Networks symposium, satellite event of the NetSci conference.

Networks of networks are structures in which the nodes may be connected through different relations. They can represent multifaceted social interaction, critical infrastructure and complex relational data structures. In the symposium, we are looking for a diversity of research contributions revolving around networks of networks of any kind: in social media, in infrastructure, in culture. The call for contributed talks is OPEN, and you can submit your abstract here: https://easychair.org/conferences/?conf=non2016

The deadline for submissions, originally April 15th, 2016 (just a month from now), is now April 27th, 2016. We will notify acceptance by April 29th, 2016 (originally April 22nd, 2016).

Here’s my handy guide to a few of the many reasons to come:

  • Networks of networks are awesome, a hot topic in network science and a lot of super smart people work on them. You wouldn’t pass the opportunity to mingle with them, would you?
  • We have a lineup of outstanding confirmed keynotes this year (truth be told, we have that every year):
  • This year NetSci will take place at the K-Hotel, Seoul, Korea (South, whew…). You really should not miss this occasion to visit such a fascinating place.

The Networks of Networks symposium will be held on May 31st, 2016. The full conference, including all satellites, runs from May 30th to June 3rd. You can find all relevant information for the conference on the official NetSci website. Our symposium has a website too: check it out. In it, you will also find the fundamental information about all the people organizing this event with me: without them none of this would be possible. Here they are:

And also a list of other people, helping with their ideas, time and enthusiasm:

  • Matteo Magnani
  • Ian Dobson
  • Luca Rossi
  • Leonardo Duenas-Osorio
  • Dino Pedreschi
  • Guido Caldarelli
  • Vito Latora

Hope to see many of you in Korea!

16 February 2016 ~ 0 Comments

Data Trips Diary: Bogotá

My last post on this blog was about mobility in Colombia. For that study, I had the opportunity to dunk my hands into a bag filled with interesting data. To do so, I traveled to Bogotá. It is a fascinating place and I decided to dedicate this post to it: what the city looks like under the lens of some simple mobility and economic data analysis. If I repeat the experience somewhere else in the future, I will be more than happy to make this a recurring column of this blog.

The cliché would demand from me a celebration of the chaos in Bogotá. After all, we are talking about one of the five largest capitals in Latin America, the chaos continent par excellence. Yet, your data goggles would tell you a different story. Bogotá is extremely organized. Even to the point of being scary. There is a very strict division of social strata: the city government assigns each block a number from 1 (poorest) to 6 (richest) according to its level of development and the blocks are very clustered and homogeneous:

sisben_strata

In the picture: red=1, blue=2, green=3, purple=4, yellow=5 and orange=6 (grey = not classified). That map doesn’t seem very chaotic to me, rather organized and clustered. One might feel uneasy about it, but that is how things are. The clustering is not only on the social stratum of the block, but also in where people work. If you take a taxi ride, you will find entire blocks filled with the very same economic activities. Not knowing that, during one of my cab rides I thought in Bogotá everybody was a car mechanic… until we got past that block.

The order emerges also when you look at the way the people use the city. My personal experience was of incredulity: I went from the city hall to the house of a co-worker and it felt like moving to a different city. After a turn left, the big crowded highway with improvised selling stands disappeared into a suburb park with no cars and total quiet. In fact, Bogotá looks like four different cities:

bogota_mobilityclusters

Here I represented each city block as a node in a network and I connected two blocks if people commute between them. Then I ran a community discovery algorithm, and plotted the result on the map. Each color represents an area that does not see a lot of inter-commutes with the other areas, at least compared with its own intra-commutes.
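To make the recipe concrete, here is a minimal sketch of it. It is not the analysis behind the map: the post does not say which community discovery algorithm was used, so weighted label propagation stands in for it here, and the block names and commuter counts are invented for illustration.

```python
from collections import defaultdict


def build_commute_graph(flows):
    """Build an undirected weighted graph from (block_a, block_b, commuters) rows."""
    graph = defaultdict(lambda: defaultdict(float))
    for a, b, commuters in flows:
        if a == b:
            continue  # commutes inside a single block carry no link information
        graph[a][b] += commuters
        graph[b][a] += commuters
    return graph


def label_propagation(graph, max_rounds=100):
    """Stand-in community discovery: each block adopts the label that carries
    the largest commuter weight among its neighbors. Blocks are visited in
    sorted order and ties go to the smallest label, so the result is
    deterministic."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            votes = defaultdict(float)
            for neighbor, weight in graph[node].items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def communities(labels):
    """Group blocks by their final label."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical toy city: two tightly knit groups of blocks joined by one weak link.
flows = [
    ("A", "B", 10), ("A", "C", 10), ("B", "C", 10),
    ("D", "E", 10), ("D", "F", 10), ("E", "F", 10),
    ("C", "D", 1),
]
districts = communities(label_propagation(build_commute_graph(flows)))
```

On this toy input the weak C-D link is not enough to merge the two groups, which is the same intuition as the mobility districts: many commutes inside, few across.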

Human mobility is interesting because it gives you an idea of the pulse of a place. Looking at the commute data we discovered that a big city like Bogotá gets even bigger during a working day. Almost half a million people pour into the capital every day to work and use its services, which means that the population of the city increases, in a matter of hours, by more than 5%.

bogota1

It’s unsurprising to see that this does not happen during a typical Sunday. The difference is not only in volume, but also in destination: people go to different places on weekends.

cell_avgdaycommuters_weekdaydifference

Here, the red blocks are visited more during weekdays, the white blocks are visited more on weekends. It seems that there is an axis that is more popular during weekdays: that is where the good jobs are. The white area is predominantly residential.
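A weekday-versus-weekend comparison of this kind can be sketched as follows. This is only an illustration under assumptions: the input layout (one row per block and day), the block names and the counts are all hypothetical, and the actual map may have been computed differently.

```python
from collections import defaultdict
from datetime import date


def weekday_weekend_difference(visits):
    """visits: iterable of (block_id, day, commuters) rows.
    Returns {block_id: average weekday commuters - average weekend commuters}.
    Positive values correspond to the red blocks, negative values to the white ones."""
    totals = defaultdict(lambda: {"weekday": [0.0, 0], "weekend": [0.0, 0]})
    for block, day, commuters in visits:
        kind = "weekend" if day.weekday() >= 5 else "weekday"  # 5 = Saturday, 6 = Sunday
        bucket = totals[block][kind]
        bucket[0] += commuters  # running sum of commuters
        bucket[1] += 1          # number of days observed
    difference = {}
    for block, kinds in totals.items():
        average = {k: (s / n if n else 0.0) for k, (s, n) in kinds.items()}
        difference[block] = average["weekday"] - average["weekend"]
    return difference


# Hypothetical counts for two blocks over one week of February 2016.
visits = [
    ("office_block", date(2016, 2, 15), 900),       # Monday
    ("office_block", date(2016, 2, 16), 1100),      # Tuesday
    ("office_block", date(2016, 2, 20), 200),       # Saturday
    ("residential_block", date(2016, 2, 15), 100),  # Monday
    ("residential_block", date(2016, 2, 20), 300),  # Saturday
    ("residential_block", date(2016, 2, 21), 500),  # Sunday
]
diff = weekday_weekend_difference(visits)
```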

Crossing this commute information with the data on establishments from the chamber of commerce (camara de comercio), we can also know which business types are more visited during weekends, because many commuters are stopping in areas hosting such businesses. There is a lot of shopping going on (comercio al por menor) and of course visits to pubs (Expendio De Bebidas Alcoholicas Para El Consumo Dentro Del Establecimiento). It matches well with my personal experience as, once my data quests were over, my local guide (Andres Gomez) led me to Andres Carne de Res, a bedlam of music, food and lights, absolutely not to be missed if you find yourself in Bogotá. My personal advice is to be careful about your beverage requests: I discovered too late that a mojito there is served in a soup bowl larger than my skull.

Most of what I wrote here (minus the mojito misadventure) is included in a report I put together with my travel companion (Frank Neffke) and another local (Eduardo Lora). You can find it in the working paper collection of the Center for International Development. I sure hope that my data future will bring me to explore other places as interesting as the capital of Colombia.

15 January 2016 ~ 0 Comments

The Limited Power of Telecommunication

As a kid from the 80s*, I remember how revolutionary the cellphone era was. It happened so fast. It seemed that, overnight, you could carry in your pocket a device connecting you to everybody you knew, no matter how far. To me, it changed everything. But did it? Yes, over-apprehensive parents can check their babies at the swipe of a finger, and whoever does not carry their cellphone with them at all times is labeled as a weirdo (I’m guilty of that). But the telecommunication revolution promised something more: the elimination of distance in communication. Did it deliver? This question was the motivation engine for the paper “Evidence That Calls-Based and Mobility Networks Are Isomorphic” which I wrote with my boss Ricardo Hausmann and which recently appeared in PLoS One.

The question is rather daring, so we decided to take it step by step. The simplest thing we came up with was: let’s draw a map of cellphone calls and see if it looks like a geographical map. If it does, we might be onto something. To do so, we obtained data from telecommunication operators in Colombia. They provided us call detail records, where identifiers were encrypted to preserve the anonymity of the people making and receiving the calls. We also aggregated the data to make even the slightest re-identification impossible: every ID was associated with the municipality in which it spent most of its time and so all data was lumped together at the municipality level. At this point, we could draw a map of which municipalities had significant call traffic with one another. This we called the “Call-based” network:

colombia_social
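The aggregation step described above can be sketched roughly like this. It is an illustration under assumptions, not the pipeline used in the paper: "spent most of its time" is approximated by the municipality with the most records for an ID, the record layout is hypothetical, and the filtering that keeps only significant call traffic is left out.

```python
from collections import Counter, defaultdict


def home_municipality(sightings):
    """sightings: iterable of (anonymized_id, municipality) rows.
    Assigns every ID to the municipality where it was seen most often
    (alphabetical tie-break, so the result is deterministic)."""
    counts = defaultdict(Counter)
    for uid, municipality in sightings:
        counts[uid][municipality] += 1
    return {
        uid: min(seen, key=lambda m: (-seen[m], m))
        for uid, seen in counts.items()
    }


def call_network(calls, home):
    """calls: iterable of (caller_id, callee_id) rows.
    Returns a Counter of call volumes between unordered municipality pairs;
    calls inside one municipality, or involving unknown IDs, are dropped."""
    edges = Counter()
    for caller, callee in calls:
        a, b = home.get(caller), home.get(callee)
        if a is None or b is None or a == b:
            continue
        edges[tuple(sorted((a, b)))] += 1
    return edges


# Hypothetical records for three anonymized IDs.
sightings = [
    ("u1", "Bogotá"), ("u1", "Bogotá"), ("u1", "Medellín"),
    ("u2", "Medellín"),
    ("u3", "Cali"),
]
calls = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u1", "u1")]
home = home_municipality(sightings)
edges = call_network(calls, home)
```

Once the data is in this form no individual ID survives: only municipality pairs and their call volumes remain.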

Before jumping to conclusions with this picture, we built a sister network. Since we know where a phone is when it makes a call, we can keep a record of the different municipalities where we spotted it. Again, we joined together all data at the municipality level. This sister network is then a “Mobility” network of Colombia:

colombia_mobility

It seems there’s something here. The two networks appear to be similar: Bogotá seems to be a prominent center and the connections have a geographical component embedded into them. To make this more evident, we drew the networks on a Colombian map. The color of the municipalities is the same as the color of the nodes in the pictures above: nodes with the same color are very related in the network, that is, they form network clusters.

plos1

The call-based network is on the left, the mobility one is on the right. Blocks of the same color on the left are a clear indication of the call connections being influenced by geography. If there were no relation, the map would look like a Harlequin shirt, with colors scattered evenly across the territory. Mobility clusters are also short-range, although the pattern is harder to see because I had to use many more colors: the clusters are smaller. But the two networks are closely related: in fact, the larger call-based clusters contain the smaller mobility ones, as we show in the paper. We can say that there is a strong relationship between calls and mobility.
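One simple way to check a claim like "the larger clusters contain the smaller ones" is to ask, for every mobility cluster, what share of its municipalities falls into a single call-based cluster. The sketch below does just that; it is my own illustration with made-up labels, not the test used in the paper.

```python
from collections import Counter, defaultdict


def containment(call_cluster, mobility_cluster):
    """Both arguments map municipality -> cluster label.
    For each mobility cluster, returns (dominant call-based cluster,
    share of the mobility cluster's municipalities inside it).
    A share of 1.0 means the mobility cluster is fully contained."""
    members = defaultdict(list)
    for municipality, mob_label in mobility_cluster.items():
        if municipality in call_cluster:
            members[mob_label].append(call_cluster[municipality])
    result = {}
    for mob_label, call_labels in members.items():
        counts = Counter(call_labels)
        top = min(counts, key=lambda c: (-counts[c], c))  # deterministic tie-break
        result[mob_label] = (top, counts[top] / len(call_labels))
    return result


# Hypothetical cluster assignments for five municipalities.
call_cluster = {"a": "north", "b": "north", "c": "north", "d": "south", "e": "south"}
mobility_cluster = {"a": "m1", "b": "m1", "c": "m2", "d": "m2", "e": "m3"}
shares = containment(call_cluster, mobility_cluster)
```

Here m1 and m3 sit entirely inside one call-based cluster, while m2 straddles two of them and only reaches a share of 0.5.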

This is nice, because it fits with many works in computer science that actually use social relationships to predict human mobility… and vice versa. On the other hand, it is not nice because the existence of these papers also tells us ours is not a new result. Moreover, my starting point was to hint that the call-based and mobility networks are obeying the same laws, not that they are merely correlated. We need to go a step further.

Our step was to consider the difference that distance makes in the two networks. When looking at mobility, the distance between an origin and a destination is an important cost. In the call-based network, things are a bit trickier. If modern telecommunication really delivered what it promised, distance should be a really low cost, and probably non-linear. To start a social relationship you do not need to be in the same place at any given time, and even if we move to opposite ends of the world, we can still call each other. As a consequence, there shouldn’t be a way to scale the cost of distance in the call-based network to look like the one in the mobility network.

When we attempted to perform such scaling, we discovered it was actually possible. We checked, at any given distance, the ratio between commuters and callers. If two municipalities are at 50km distance, and there are twice as many commuters as callers, we have a dot on coordinates (50, 2). If we take two municipalities at 100km distance, and the commuters are just a third of the number of callers, the data point is at coordinates (100, .33). Once we consider all data points, we can fit our green line, AKA the scaling function from calls to mobility:

plos3
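The two steps above (build the data points, then fit a line through them) can be sketched as follows. The post does not state the functional form of the green line, so the power law fitted here by least squares in log-log space is purely an assumption for illustration, and the function names are my own.

```python
import math


def ratio_points(pairs):
    """pairs: iterable of (distance_km, commuters, callers) per municipality pair.
    Returns the (distance, commuters / callers) dots described in the text."""
    return [
        (distance, commuters / callers)
        for distance, commuters, callers in pairs
        if distance > 0 and callers > 0
    ]


def fit_power_law(points):
    """Least-squares fit of log(ratio) = intercept + slope * log(distance).
    ASSUMPTION: a power law is only a convenient stand-in for the real curve."""
    usable = [(d, r) for d, r in points if d > 0 and r > 0]
    xs = [math.log(d) for d, _ in usable]
    ys = [math.log(r) for _, r in usable]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct distances to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scaling_function(slope, intercept):
    """Turn the fitted line back into ratio = exp(intercept) * distance ** slope."""
    return lambda distance: math.exp(intercept) * distance ** slope


# The two examples from the text: (50 km, ratio 2) and (100 km, ratio 1/3).
points = ratio_points([(50, 200, 100), (100, 100, 300)])

# Hypothetical points lying exactly on ratio = 100 / distance, to check the fit.
slope, intercept = fit_power_law([(10, 10.0), (100, 1.0), (1000, 0.1)])
scale = scaling_function(slope, intercept)
```

With real data the dots will scatter around the curve; what matters is how tightly they do so.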

When we used this adjustment to calculate new call-based clusters using the distance cost “as if” it were the mobility network, we obtained the mobility clusters. We detail in the paper the reasons why this is not as circular as it seems. In practice, our green line is a transformation function that morphs the call-based network into the mobility network. If modern telecommunication really killed distance, that green line shouldn’t exist, or at least it should be so wobbly as to be practically useless.

There are many ways in which you could interpret this result. One that Ricardo and I like focuses on the relationship between face-to-face and electronically mediated meetings. It’s not like the people you call are the ones you really would rather meet but you cannot. It’s more like you call AND you meet, whenever it is possible. Face-to-face and electronically mediated meetings are not really substitutes in this world, they are more like complements. To come back to my opening, I’d say new technologies didn’t eliminate distance from the communication equation. Alleviate, yes. But ultimately, it’s more like an increased bandwidth than a revolution. At least so far.


* Shut up, I’m still in my twenties. Everybody knows 1996 was only 10 years ago.