Michele Coscia - Connecting Humanities

I am a post-doc fellow at the Center for International Development, Harvard University in Cambridge. I mainly work on mining complex networks, and on applying the extracted knowledge to international development and governance. My background is in Digital Humanities, i.e. the connection between the unstructured knowledge and the cold organized computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In my career, I also worked at the Center for Complex Network Research at Northeastern University, with Albert-Laszlo Barabasi. On this website you can browse my papers, algorithms and datasets with the top navigation, or simply skim my blog posts that briefly present my topics and papers below this box.

20 May 2016 ~ 0 Comments

Program of Netonets 2016 is Out!

As announced in the previous post, the symposium on networks of networks is happening in less than two weeks: May 31st @ 9AM, room Dongkang C of the K-Hotel Seoul, South Korea. Przemek Kazienko, Gregorio D’Agostino and I have a fantastic program and set of speakers to keep you entertained on multilayer, interdependent and multislice networks. Take a look for yourself!

Session I

9:00 – 9:15: Room set up
9:15 – 9:30: Welcome from the organizers
9:30 – 10:15: Invited I: Yong-Yeol Ahn: Title TBD
10:15 – 11:00: Invited II: Luca Maria Aiello: The Nature of Social Links

11:00 – 11:30: Coffee Break

Session II

11:30 – 12:15: Invited III: Jianxi Gao: Networks of Networks: From Structure to Dynamics
12:15 – 13:00: Invited IV: Tomasz Kajdanowicz: Fusion methods for classification in multiplex networks

13:00 – 14:30: Lunch Break

Session III

14:30 – 15:15: Invited V: Michael Danziger: Beyond interdependent networks
15:15 – 15:35: Contributed I: Bruno Coutinho: Greedy Leaf Removal on Hypergraphs
15:35 – 15:55: Contributed II: Yong Zhuang: Complex Contagions in Clustered Random Multiplex Networks

15:55 – 16:30: Coffee Break

Session IV

16:30 – 17:15: Invited VI: Nitesh Chawla: From complex interactions to networks: the higher-order network representation

17:15 – 18:00: Round table – Open discussion
18:00 – 18:15: Organizers wrap up

Remember to register for the main NetSci conference if you want to attend.

Incidentally, the end of May is going to be a rather busy period for me. Besides co-organizing Netonets and speaking at the main NetSci conference, I’m also going to present at the Core50 conference in Louvain-la-Neuve, Belgium, on the role of social and mobility networks in shaping the economic growth of a country. Thanks to Jean-Charles Delvenne for inviting me!

I hope to see many of you there!

17 March 2016 ~ 0 Comments

Networks of Networks @ NetSci 2016

EDIT: Deadlines & speakers updated. Submission deadline is on April 27th, notification on April 29th.


Dear readers of this blog — yes, both of you –: it’s that time of the year again. As tradition dictates, I’m organizing the Networks of Networks symposium, satellite event of the NetSci conference.

Networks of networks are structures in which the nodes may be connected through different relations. They can represent multifaceted social interaction, critical infrastructure and complex relational data structures. In the symposium, we are looking for a diversity of research contributions revolving around networks of networks of any kind: in social media, in infrastructure, in culture. The call for contributed talks is OPEN, and you can submit your abstract here: https://easychair.org/conferences/?conf=non2016

The deadline for submissions is April 27th, 2016 (extended from April 15th, 2016), just over a month from now. We will notify acceptance by April 29th, 2016 (originally April 22nd, 2016).

Here’s my handy guide to a few of the many reasons to come:

  • Networks of networks are awesome, a hot topic in network science and a lot of super smart people work on them. You wouldn’t pass the opportunity to mingle with them, would you?
  • We have a lineup of outstanding confirmed keynotes this year (truth be told, we have that every year).
  • This year NetSci will take place at the K-Hotel, Seoul, Korea (South, whew…). You really should not miss this occasion to visit such a fascinating place.

The Networks of Networks symposium will be held on May 31st, 2016. The full conference, including all satellites, runs from May 30th to June 3rd. You can find all relevant information for the conference on the official NetSci website. Our symposium has a website too: check it out. In it, you will also find the fundamental information about all the people organizing this event with me: without them none of this would be possible. Here they are:

And also a list of other people, helping with their ideas, time and enthusiasm:

  • Matteo Magnani
  • Ian Dobson
  • Luca Rossi
  • Leonardo Duenas-Osorio
  • Dino Pedreschi
  • Guido Caldarelli
  • Vito Latora

Hope to see many of you in Korea!

16 February 2016 ~ 0 Comments

Data Trips Diary: Bogotá

My last post on this blog was about mobility in Colombia. For that study, I had the opportunity of dunking my hands into a bag filled with interesting data. To do so, I traveled to Bogotá. It is a fascinating place and I decided to dedicate this post to it: what the city looks like under the lens of some simple mobility and economic data analysis. If I repeat the experience somewhere else in the future, I will be more than happy to make this a recurring column of this blog.

The cliché would demand from me a celebration of the chaos in Bogotá. After all, we are talking about one of the top five largest capitals in Latin America, the chaos continent par excellence. Yet, your data goggles would tell you a different story. Bogotá is extremely organized, even to the point of being scary. There is a very strict division of social strata: the city government assigns each block a number from 1 (poorest) to 6 (richest) according to its level of development, and the blocks are very clustered and homogeneous:


In the picture: red=1, blue=2, green=3, purple=4, yellow=5 and orange=6 (grey = not classified). That map doesn’t seem very chaotic to me, rather organized and clustered. One might feel uneasy about it, but that is how things are. The clustering is not only on the social stratum of the block, but also in where people work. If you take a taxi ride, you will find entire blocks filled with the very same economic activities. Not knowing that, during one of my cab rides I thought in Bogotá everybody was a car mechanic… until we got past that block.

The order emerges also when you look at the way the people use the city. My personal experience was of incredulity: I went from the city hall to the house of a co-worker and it felt like moving to a different city. After a turn left, the big crowded highway with improvised selling stands disappeared into a suburb park with no cars and total quiet. In fact, Bogotá looks like four different cities:


Here I represented each city block as a node in a network and I connected two blocks if people commute between them. Then I ran a community discovery algorithm, and plotted the result on the map. Each color represents an area that does not see a lot of inter-commutes with the other areas, at least compared with its own intra-commutes.
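The post does not say which community discovery algorithm was used. Many of them look for a partition that scores well on modularity, a number that rewards areas whose commutes mostly stay inside the area. As a minimal sketch under that assumption, with made-up block names and commuter counts, this is how such a score can be computed in Python:

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity of an undirected network without self-loops.

    edges: iterable of (block_a, block_b, commuters) tuples.
    community: dict mapping each block to a community label.
    """
    total_weight = 0.0                 # sum of all edge weights
    intra_weight = defaultdict(float)  # weight of edges inside each community
    strength = defaultdict(float)      # summed weighted degree per community

    for a, b, w in edges:
        total_weight += w
        strength[community[a]] += w
        strength[community[b]] += w
        if community[a] == community[b]:
            intra_weight[community[a]] += w

    return sum(
        intra_weight[c] / total_weight - (strength[c] / (2 * total_weight)) ** 2
        for c in strength
    )


# Toy data (hypothetical): two tight groups of blocks joined by one link.
toy_edges = [
    ("a1", "a2", 1), ("a2", "a3", 1), ("a1", "a3", 1),
    ("b1", "b2", 1), ("b2", "b3", 1), ("b1", "b3", 1),
    ("a3", "b1", 1),
]
good_split = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}
bad_split = {"a1": 0, "a2": 1, "a3": 0, "b1": 1, "b2": 0, "b3": 1}

print(modularity(toy_edges, good_split))
print(modularity(toy_edges, bad_split))
```

The split that keeps each tight group together scores higher than the one that mixes them, which is the sense in which each colored area has few inter-commutes relative to its intra-commutes.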

Human mobility is interesting because it gives you an idea of the pulse of a place. Looking at the commute data we discovered that a big city like Bogotá gets even bigger during a working day. Almost half a million people pour inside the capital every day to work and use its services, which means that the population of the city increases, in a matter of hours, by more than 5%.


It’s unsurprising to see that this does not happen during a typical Sunday. The difference is not only in volume, but also in destination: people go to different places on weekends.


Here, the red blocks are visited more during weekdays, the white blocks are visited more on weekends. It seems that there is an axis that is more popular during weekdays: that is where the good jobs are. The white area is predominantly residential.

Crossing this commute information with the data on establishments from the chamber of commerce (camara de comercio), we can also know which business types are more visited during weekends, because many commuters are stopping in areas hosting such businesses. There is a lot of shopping going on (comercio al por menor) and of course visits to pubs (Expendio De Bebidas Alcoholicas Para El Consumo Dentro Del Establecimiento). It matches well with my personal experience as, once my data quests were over, my local guide (Andres Gomez) led me to Andres Carne de Res, a bedlam of music, food and lights, absolutely not to be missed if you find yourself in Bogotá. My personal advice is to be careful about your beverage requests: I discovered too late that a mojito there is served in a soup bowl larger than my skull.

Most of what I wrote here (minus the mojito misadventure) is included in a report I put together with my travel companion (Frank Neffke) and another local (Eduardo Lora). You can find it in the working paper collection of the Center for International Development. I sure hope that my data future will bring me to explore other places as interesting as the capital of Colombia.

15 January 2016 ~ 0 Comments

The Limited Power of Telecommunication

As a kid from the 80s*, I remember how revolutionary the cellphone era was. It happened so fast. It seemed that, overnight, you could carry in your pocket a device connecting you to everybody you knew, no matter how far. To me, it changed everything. But did it? Yes, over-apprehensive parents can check on their babies at the swipe of a finger, and whoever does not carry their cellphone with them at all times is labeled as a weirdo (I’m guilty of that). But the telecommunication revolution promised something more: the elimination of distance in communication. Did it deliver? This question was the motivation engine for the paper “Evidence That Calls-Based and Mobility Networks Are Isomorphic” which I wrote with my boss Ricardo Hausmann and which recently appeared in PLoS One.

The question is rather daring, so we decided to take it step by step. The simplest thing we came up with was: let’s draw a map of cellphone calls and see if it looks like a geographical map. If it does, we might be onto something. To do so, we obtained data from telecommunication operators in Colombia. They provided us with call detail records, where identifiers were encrypted to preserve the anonymity of the people making and receiving the calls. We also aggregated the data to make even the slightest re-identification impossible: every ID was associated with the municipality in which it spent most of its time and so all data was lumped together at the municipality level. At this point, we could draw a map of which municipalities had a significant call traffic with one another. This we called the “Call-based” network:



Before jumping to conclusions with this picture, we built a sister network. Since we just said we knew the location of a phone when making a call, we can keep a record of the different municipalities where we spotted the phone. Again, we joined together all data at the municipality level. This sister network is then a “Mobility” network of Colombia:



It seems there’s something here. The two networks appear to be similar: Bogotá seems to be a prominent center and the connections have a geographical component embedded into them. To make this more evident, we drew the networks on a Colombian map. The color of the municipalities is the same color as the nodes in the pictures above: nodes with the same color are closely related in the network, i.e. they form network clusters.



The call-based network is on the left, the mobility one is on the right. Blocks of the same color on the left are a clear indication of the call connections being influenced by geography. If there were no relation, the map would look like a Harlequin shirt, with colors scattered evenly across the territory. Mobility clusters are also short-range, although the pattern is harder to see because I had to use many more colors: the clusters are smaller. But the two networks are closely related: in fact, the larger call-based clusters contain the smaller mobility ones, as we show in the paper. We can say that there is a strong relationship between calls and mobility.

This is nice, because it fits with many works in computer science that actually use social relationships to predict human mobility… and vice versa. On the other hand, it is not nice because the existence of these papers also tells us ours is not a new result. Moreover, my starting point was to hint that the call-based and mobility networks are obeying the same laws, not that they are merely correlated. We need to go a step further.

Our step was to consider the difference that distance makes in the two networks. When looking at mobility, the distance between an origin and a destination is an important cost. In the call-based networks, things are a bit trickier. If modern telecommunication really delivered what it promised, distance should be a really low cost, and probably non-linear. To start a social relationship you do not need to be in the same place at any given time, and even if we move to opposite ends of the world, we can still call each other. As a consequence, there shouldn’t be a way to scale the cost of distance in the call-based network to look like the one in the mobility network.

When we attempted to perform such scaling, we discovered it was actually possible. We checked, at any given distance, the ratio between commuters and callers. If two municipalities are at 50km distance, and there are twice as many commuters as callers, we have a dot on coordinates (50, 2). If we take two municipalities at 100km distance, and the commuters are just a third of the number of callers, the data point is at coordinates (100, .33). Once we consider all data points, we can fit our green line, AKA the scaling function from calls to mobility:
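To make the two example dots concrete, here is a small Python sketch. The absolute commuter and caller counts are invented, and the power-law shape fitted at the end is only an assumption for illustration: the post does not say which functional form the green line has.

```python
import math


def ratio_points(pairs):
    """Turn (distance_km, commuters, callers) records into (distance, ratio) dots."""
    return [
        (distance, commuters / callers)
        for distance, commuters, callers in pairs
        if callers > 0  # skip pairs without calls to avoid dividing by zero
    ]


def fit_power_law(points):
    """Least-squares fit of ratio = a * distance ** b in log-log space.

    Assumes all distances and ratios are strictly positive.
    """
    xs = [math.log(d) for d, _ in points]
    ys = [math.log(r) for _, r in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return math.exp(mean_y - slope * mean_x), slope


# The two examples from the text, with made-up absolute counts.
pairs = [
    (50, 200, 100),   # twice as many commuters as callers
    (100, 100, 300),  # commuters are a third of the callers
]
points = ratio_points(pairs)
a, b = fit_power_law(points)
print(points)
print(a, b)
```

With real data there would be one dot per pair of municipalities, and the fitted curve would play the role of the scaling function from calls to mobility.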


When we used this adjustment to calculate new call-based clusters using the distance cost “as if” it were the mobility network, we obtained the mobility clusters. We detail in the paper the reasons why this is not as circular as it seems. In practice, our green line is a transformation function that morphs the call-based network into the mobility network. If modern telecommunication really killed distance, that green line shouldn’t exist, or at least it should be so wobbly as to be practically useless.

There are many ways in which you could interpret this result. One that Ricardo and I like focuses on the relationship between face-to-face and electronic mediated meetings. It’s not like the people you call are the ones you really would rather meet but you cannot. It’s more like you call AND you meet, whenever it is possible. Face-to-face and electronic mediated meetings are not really substitutes in this world, they are more like complements. To come back to my opening, I’d say new technologies didn’t eliminate distance from the communication equation. Alleviate, yes. But ultimately, it’s more like an increased bandwidth than a revolution. At least so far.

* Shut up, I’m still in my twenties. Everybody knows 1996 was only 10 years ago.

18 November 2015 ~ 0 Comments

Evaluating Prosperity Beyond GDP

When reporting on economics, news outlets very often refer to what happens to the GDP. How is policy X going to affect our GDP? Is the national debt too high compared to GDP? How does my GDP compare to yours? The concept lurking behind those three letters is the Gross Domestic Product, the measure of the gross value added by all domestic producers in a country. In principle, the idea of using GDP to take the pulse of an economy isn’t bad: we count how much we can produce, and this is more or less how well we are doing. In practice, today I am jumping on the huge bandwagon of people who despise GDP for its meaningless, oversimplified and frankly suspicious nature. I will talk about a paper in which my co-authors and I propose to use a different measure to evaluate a country’s prosperity. The title is “Going Beyond GDP to Nowcast Well-Being Using Retail Market Data“, my co-authors are Riccardo Guidotti, Dino Pedreschi and Diego Pennacchioli, and the paper will be presented at the Winter edition of the Network Science Conference.

GDP is gross for several reasons. What Simon Kuznets said resonates strongly with me, as already in the 30s he was talking like a complexity scientist:

The valuable capacity of the human mind to simplify a complex situation in a compact characterization becomes dangerous when not controlled in terms of definitely stated criteria. With quantitative measurements especially, the definiteness of the result suggests, often misleadingly, a precision and simplicity in the outlines of the object measured. Measurements of national income are subject to this type of illusion and resulting abuse, especially since they deal with matters that are the center of conflict of opposing social groups where the effectiveness of an argument is often contingent upon oversimplification.


In short, GDP is an oversimplification, and as such it cannot capture something as complex as an economy, or the multifaceted needs of a society. In our paper, we focus on some of its specific aspects. Income inequality skews the wealth distribution, so that GDP doesn’t describe how the majority of the population is doing. But more importantly, it is not possible to quantify well-being just with the number of dollars in someone’s pocket: she might have dreams, aspirations and sophisticated needs that bear little to no correlation with the status of her wallet. And even if GDP were a good measure, it’s very hard to calculate: it takes months to estimate it reliably. Nowcasting it would be great.

And so we tried to hack our way out of GDP. The measure we decided to use is the one of customer sophistication, which I have presented several times in the past. In practice, the measure is a summary of the connectivity of a node in a bipartite network*. The bipartite network connects customers to the products they buy. The more variegated the set of products a customer buys, the more sophisticated she is. Our idea was to create an aggregated version at the network level, and to see if this version was telling us something insightful. We could make a direct correlation with the national GDP of Italy, because the data we used to calculate it comes from around half a million customers from several Italian regions, which are representative of the country as a whole.


The argument we made goes as follows. GDP stinks, but it is not 100% bad, otherwise nobody would use it. Our sophistication is better, because it is connected to the average degree with which a person can satisfy her needs**. Income inequality does not affect it either, at least not in trivial ways as it does with GDP. Therefore, if sophistication correlates with GDP it is a good measure of well-being: it captures part of GDP and adds something to it. Finally, if the correlation happens with some anticipated temporal shift it is even better, because GDP pundits can just use it as instantaneous nowcasting of GDP.

We were pleased when our expectations met reality. We tested several versions of the measure at several temporal shifts — both anticipating and following the GDP estimate released by the Italian National Statistic Institute (ISTAT). When we applied the statistical correction to control for the multiple hypothesis testing, the only surviving significant and robust estimate was our customer sophistication measure calculated with a temporal shift of -2, i.e. two quarters before the corresponding GDP estimate was released. Before popping our champagne bottles, let me write an open letter to the elephant in the room.
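The idea of testing a measure at several temporal shifts can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the series are invented, the post does not say which correlation measure was used, and the Bonferroni-style threshold at the end is just one simple way of correcting for multiple hypothesis testing, not necessarily the one applied in the paper.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def lagged_correlation(indicator, target, shift):
    """Correlate indicator[t + shift] with target[t].

    A shift of -2 pairs each target value with the indicator value
    observed two periods earlier.
    """
    pairs = [
        (indicator[t + shift], target[t])
        for t in range(len(target))
        if 0 <= t + shift < len(indicator)
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Hypothetical quarterly series: the target repeats the indicator two quarters later.
indicator = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
target = [0.0, 0.0] + indicator[:-2]

shifts = range(-3, 4)
scores = {s: lagged_correlation(indicator, target, s) for s in shifts}
best_shift = max(scores, key=scores.get)

# Testing seven shifts means seven hypotheses, so the significance
# threshold is divided by the number of tests (Bonferroni-style).
alpha = 0.05
corrected_alpha = alpha / len(shifts)
print(best_shift, corrected_alpha)
```

In this toy setup the best score is found at a shift of -2 by construction; on real data each correlation would also need a p-value to compare against the corrected threshold.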


As you see from the above chart, there are some wild seasonal fluctuations. This is rather obvious, but controlling for them is not easy. There is a standard approach, the X-13ARIMA-SEATS method, which is more complicated than simply averaging out the fluctuations. Its parameter tuning procedure requires information we simply do not have for our measure, and it needs observation windows longer than the one we have (2007-2014). It is quite possible that our result would disappear once seasonality is properly controlled for. It is also possible that the way we calculated our sophistication index makes no sense economically: I am not an economist and I do not pretend for a moment that I can tell them how to do their job.

What we humbly report is a blip on the radar. It is the kind of thing that makes you think “Uh, that’s interesting, I wonder what it means”. I would like someone with a more solid skill set in economics to take a look at this sophistication measure and to do a proper stress-test with it. I’m completely fine with her coming back to tell me I’m a moron. But that’s the risk of doing research and trying out new things. I just think that it would be a waste not to give this promising insight a chance to shine.

* Even if hereafter I talk only about the final measure, it is important to remark that it is by no means a complete substitute for the analysis of the bipartite network. Meaning that I’m not simply advocating to replace one number (GDP) with another (sophistication), rather to replace GDP with a full-blown network analysis.

** Note that this is a revealed measure of sophistication as inferred by the products actually bought and postulating that each product satisfies one or a part of a “need”. If you feel that the quality of your life depends on you being able to bathe in the milk of a virgin unicorn, the measure will not take into account the misery of this tacit disappointment. Such are the perils of data mining.

16 October 2015 ~ 0 Comments

Central Places and Sophistication


Looking at a population map, one may wonder why sometimes you find metropoles in the middle of nowhere — I’m looking at you, Phoenix. Or why cities are distributed the way they are. When in doubt, you should always refer to your favorite geographer. She would probably be very happy to direct your interest to the Central Place Theory (CPT), developed by Walter Christaller in the 30s. The theory simply states that cities provide services to the surrounding areas. As a consequence, the big cities will provide many services and small cities a few, therefore the small cities will gravitate around larger settlements. This smells like complexity science to me and this post is exactly about connecting CPT with my research on retail customer sophistication and mobility. But first I need to convince you that CPT actually needs this treatment.

CPT explains why sometimes you will need a big settlement in the middle of the desert. That is because, for most of history, civilizations relied on horses instead of the interwebz for communication and, with very long stretches of nothing, that system would fall apart. That is why Phoenix has been an obsolete city since 1994 at the very least, and people should just give it up and move on. You now might be tempted to take a look at the Wikipedia page of the Central Place Theory to get some more details. If you do, you might notice a few “simplifications” used by Christaller when developing the theory. And if you don’t, let me spoil it for you. Lo and behold, to make CPT work we need:

  • An infinite flat Earth — easy-peasy-lemon-squeezy compared to what comes next;
  • Perfectly homogeneous distribution of people and resources;
  • Perfectly equidistant cities in a grid much like the one of Civilization 5;
  • The legendary perfect competition and rational market conjured by economists out of thin air;
  • Only one mode of transportation;
  • A completely homogeneous population, all equal in desires and income.

In short, the original CPT works in a world that is no more real than Mordor.


And here is where sophistication comes into play. I teamed up with Diego Pennacchioli and Fosca Giannotti with the objective of discovering the relationship between CPT and our previous research on sophistication. The result is in the paper Product Assortment and Customer Mobility, just published in EPJ Data Science. In the past, we showed that the more sophisticated the needs of a customer, the further the customer is willing to travel to satisfy those. And our sophistication measure worked better than other product characteristics, such as the price and its average selling volume.

Now, to be honest, geographers did not sleep for 80 years, and they already pointed out the problems of CPT. Some of them developed extensions to get rid of many troubling assumptions, others tested the predictions of these models, others just looked at Phoenix in baffled awe. However, without going too in depth (I’m not exactly qualified to do it) these new contributions are either very theoretical in nature, or they haven’t used larger and more detailed data validation. Also, the way central places are defined is unsatisfactory to me. Central places are either just very populous cities, or cities with a high variety of services. For a person like me trained in complexity science, this is just too simple. I need to bring sophistication into the mix.


Focusing on my supermarket data, variety is the number of different products provided. Two supermarkets selling three items each have the same variety. Sophistication requires the products not only to be different, but also to satisfy different needs. Suppose shop #1 sells water, juice and soda, and shop #2 sells water, bread and T-shirts. Even if the shops have the same variety, one is more sophisticated than the other. And indeed the sophistication of a shop better explains its “retention rate”, its ability to preserve its customer base even for customers who live far away from the shop. That is what the above table reports: controlling for distance (which causes a 2.6 percentage point loss of customer base per extra minute of travel), each standard deviation increase in sophistication strengthens the retention rate by 11 percentage points. Variety of products does not matter, the volume of the shop (its sheer size) matters just a bit.
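The difference between variety and sophistication in the two-shop example can be shown with a toy Python snippet. This is not the sophistication measure from the paper: the product-to-need mapping below is invented, and counting distinct needs is only a crude stand-in for the real thing.

```python
# Hypothetical mapping from each product to the need it satisfies.
NEED = {
    "water": "drink",
    "juice": "drink",
    "soda": "drink",
    "bread": "food",
    "T-shirt": "clothing",
}


def variety(products):
    """Number of different products on offer."""
    return len(set(products))


def needs_covered(products, need_of=NEED):
    """Number of different needs the assortment satisfies."""
    return len({need_of[p] for p in products})


shop_1 = ["water", "juice", "soda"]
shop_2 = ["water", "bread", "T-shirt"]

print(variety(shop_1), variety(shop_2))
print(needs_covered(shop_1), needs_covered(shop_2))
```

Both shops have the same variety, but shop #2 covers three needs while shop #1 covers only one, which is the intuition behind calling it more sophisticated.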

In practice, what we found is that CPT holds in our data where big supermarkets play the role of big cities and provide more sophisticated “services”. This is a nice finding for two reasons. First, it confirms the intuition of CPT in a real world scenario, making us a bit wiser about the world in which we live, and maybe helping us avoid mistakes in the future, such as creating a new Phoenix. This is non-trivial: the space in our data is neither infinite nor homogeneous, its market is not perfect and its people are differentiated. Yet, CPT holds, using our sophistication measure as driving factor. Second, it validates our sophistication measure in a theoretical framework, potentially giving it the power to be used more widely than what we have done so far. However, both contributions are rather theoretical. I’m a man of deeds, so I asked myself: are there immediate applications of this finding?


There might be one, with caveats. Remember we are analyzing hundreds of supermarkets in Italy. We know things about these supermarkets. First, we have a shop type, which by accident correlates with sophistication very well. Then, we know if the shop was closed down during the multi-year observation period. We can’t know the reason, thus everything that follows is a speculation to be confirmed, but we can play with this. We can compare the above-mentioned retention rate of closing and non-closing shops. We can also define a catch rate. While “retention” meant how many of your closest customers you can keep, catch means how many of the non-closest customers you can get. The above plots show retention and catch ratios. The higher the number the more the ratio is in favor of the non-closing shop.
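One way to read these two definitions is sketched below in Python. It is a simplification under stated assumptions, not the definition used in the paper: each customer is reduced to a made-up record with the shop closest to home (`closest`) and the shop actually used (`shops_at`).

```python
def retention_and_catch(customers, shop):
    """Return (retention, catch) for one shop.

    retention: share of customers living closest to this shop who shop there.
    catch: share of customers living closest to another shop who still come here.
    """
    nearby = [c for c in customers if c["closest"] == shop]
    distant = [c for c in customers if c["closest"] != shop]
    retained = sum(1 for c in nearby if c["shops_at"] == shop)
    caught = sum(1 for c in distant if c["shops_at"] == shop)
    retention = retained / len(nearby) if nearby else 0.0
    catch = caught / len(distant) if distant else 0.0
    return retention, catch


# Hypothetical customers of two shops, A and B.
customers = [
    {"closest": "A", "shops_at": "A"},
    {"closest": "A", "shops_at": "A"},
    {"closest": "A", "shops_at": "A"},
    {"closest": "A", "shops_at": "B"},
    {"closest": "B", "shops_at": "B"},
    {"closest": "B", "shops_at": "B"},
    {"closest": "B", "shops_at": "A"},
    {"closest": "B", "shops_at": "A"},
]

print(retention_and_catch(customers, "A"))
print(retention_and_catch(customers, "B"))
```

The ratios in the plots would then compare these rates between shops that stayed open and shops that closed down.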

For the retention rate, the average sophistication shops (green) have by far the largest spread between shops that are still open and the ones which got shut down. It means that these medium shops survive if they can keep their nearby customers. For the catch rate, the very sophisticated shops (red) are always on top, regardless of distance. It means that large shops survive if they really can attract customers, even if they are not the closest shop. The small shops (blue) seem to obey neither logic. The application of this finding is now evident: sophistication can enlighten us as to the destiny of different types of shops. If medium shops fail to retain their nearby customers, they’re likely to shut down. If large shops don’t catch a wider range of customers, they will shut down. This result talks about supermarkets, but there are likely connections with settlements too, replacing products with various services. Once we calculate a service sophistication, we could know which centers are aptly placed and which ones are not and should be closed down. I know one for sure even without running regressions: Phoenix.


12 August 2015 ~ 1 Comment

Entropy Applied to Shopping

I don’t know about you guys, but when it comes to groceries I show behaviors that are strongly reminiscent of Rain Man. I go to the supermarket the same day of the week (Saturday) at the same time (9 AM), I want to go through the shelves in the very same order (the good ol’ veggie-cookies-pasta-meat-cat food track), I buy mostly the same things every week. Some supermarkets periodically re-order their shelves, for reasons that are unknown to me. That’s enraging, because it breaks my pattern. The mahātmā said it best:


Amen to that. As a consequence, I signed up immediately when my friends Riccardo Guidotti and Diego Pennacchioli told me about a paper they were writing about studying the regularity of customer behavior. Our question was: what is the relationship between the regularity of a customer’s behavior and her profitability for a shop? The results are published in the paper “Behavioral Entropy and Profitability in Retail“, which will be presented at the International Conference on Data Science and Advanced Analytics, in October. To my extreme satisfaction the answer is that the more regular customers are also the most profitable. I hope that this cry for predictability will reach at least the ears of the supermarket managers where I shop. Ok, so: how did we get to this conclusion?

First, we need to measure regularity in a reasonable way. We propose two ways. First, a customer is regular if she buys mostly the same stuff every time she shops, or at least her baskets can be described with few typical “basket templates”. Second, a customer is regular if she shows up always at the same supermarket, at the same time, on the same day of the week. We didn’t have to reinvent the wheel to figure out a way for evaluating regularity in signals: giants of the past solved this problem for us. We decided to use the tools of information theory, in particular the concept of information entropy. Information entropy tells how much information there is in an event. In general, the more uncertain or random the event is, the more information it will contain.


If a person always buys the same thing, no matter how many times she shops, we can fully describe her purchases with a single piece of information: the thing she buys. Thus, there is little information in her observed shopping events, and she has low entropy. This we call Basket Revealed Entropy. Low basket entropy, high regularity. Same reasoning if she always goes to the same shop, and we call this measure Spatio-Temporal Revealed Entropy. Now the question is: what happens to a customer’s expenditure for different levels of basket and spatio-temporal entropy?
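The entropy intuition is easy to check with Shannon's formula. The Python sketch below treats each shopping trip as a single labeled event, which is a simplification: the paper works with basket templates and with shop, time and day combinations, and the labels here are invented.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Shannon entropy, in bits, of the empirical distribution of events."""
    counts = Counter(events)
    total = len(events)
    return sum(
        (n / total) * math.log2(total / n)  # p * log2(1 / p)
        for n in counts.values()
    )


# A customer who always buys the same basket (hypothetical labels).
regular = ["usual basket"] * 10

# A customer who spreads her trips evenly over four different baskets.
erratic = ["basket A", "basket B", "basket C", "basket D"]

print(shannon_entropy(regular))
print(shannon_entropy(erratic))
```

The perfectly regular customer has zero entropy, while the one who picks among four baskets with equal frequency has an entropy of two bits.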

To wrap our heads around these two concepts we started by classifying customers according to their basket and spatio-temporal entropy. We used the k-Means algorithm, which simply tries to find “clumps” in the data. You can think of customers as ants choosing to sit in a point in space. The coordinates of this point are the basket and spatio-temporal entropy. k-Means will find the parts of this space where there are many ants near each other. In our case, it found five groups:

  1. The average people, with medium basket and spatio-temporal entropy;
  2. The crazy people, with unpredictable behavior (high basket and spatio-temporal entropy);
  3. The movers, with medium basket entropy, but high spatio-temporal entropy (they shop in unpredictable shops at unpredictable times);
  4. The nomads, similar to the movers, with low basket entropy but high spatio-temporal entropy;
  5. The regulars, with low basket and spatio-temporal entropy.
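For the curious, the "find the clumps of ants" idea can be sketched in a few lines of plain Python. This is a bare-bones version of the standard k-Means loop, not the implementation we used: the helper names, the deterministic "farthest first" seeding and the toy customers are assumptions chosen to keep the example small and reproducible.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then keep adding the point farthest from the chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Plain k-Means on 2-D points; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: every customer joins the nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: every centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(centroids[j])  # empty cluster: leave it in place
        if updated == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = updated
    return centroids, labels


# Toy customers: (basket entropy, spatio-temporal entropy), both in [0, 1].
customers = [
    (0.10, 0.15), (0.12, 0.10), (0.08, 0.12),   # regular-looking
    (0.50, 0.55), (0.48, 0.50), (0.55, 0.52),   # average-looking
    (0.90, 0.95), (0.92, 0.88), (0.88, 0.90),   # unpredictable-looking
]

centroids, labels = kmeans(customers, k=3)
for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```

In practice one would rather rely on a tested library implementation with random restarts. The point here is only to show the two alternating steps: assign each ant to the closest clump center, then move each center to the middle of its ants.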


Once you have cubbyholed your customers, you can start doing some simple statistics. For instance: we found out that the class E regulars spend more per capita over the year (4,083 Euros) than the class B crazy ones (2,509 Euros, see the histogram above). The regulars also visit the shop more often: 163 times a year. This is nice, but one wonders: why haven’t the supermarket managers figured it out yet? Well, they may have, but there is also a catch: incurable creatures of habit like me aren’t a common breed. In fact, if we redo the same histograms looking at the group total yearly values of expenditures and baskets, we see that class E is the least profitable, because few people are very regular (only 6.9%):


Without dividing customers into discrete classes, we can look at the direct relationship between behavioral entropy and the yearly expenditure of a customer. This aggregated behavioral entropy measure is simply the product of basket and spatio-temporal entropy. Unsurprisingly, entropy and expenditure are negatively correlated:


Finally, we want to quantify this relationship. We want to have an objective way to tell how much more money the supermarket could make if the customers were more regular. We didn’t get too fancy here, just a linear model where we try to predict the customers’ expenditures from their basket and spatio-temporal entropy. We don’t care very much about causation here; we just want to make the point that basket and spatio-temporal entropy are interesting measures.


The negative sign isn’t a surprise: the more chaotic a customer’s life, the lower her expenditures. What the coefficients tell us is that we expect the least chaotic (0) customer to spend almost four times as much as the most chaotic (1) customer*. You can understand why this was an extremely pleasant finding for me. This week, I’m going to print out the paper and ask to see the supermarket manager. I’ll tell him: “Hey, if you stop moving stuff around and you encourage your customers to be more and more regular, maybe you could increase your revenues”. Only that I won’t do it, because that’d break my Saturday shopping routine. Oh dear.

* The interpretation of coefficients in regressions is a bit tricky, especially when transforming your variables with logs. Here, I just jump straight to the conclusion. See here for the full explanation, if you don’t believe me.
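For readers who want to see why a log transformation turns a coefficient into a "times as much" statement, here is a small Python sketch. It is a deliberate simplification and not the model from the paper: it uses a single predictor (one entropy score between 0 and 1) instead of two, takes the natural log of the expenditure, and runs on made-up numbers. All names in it are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


# Made-up customers: (entropy score in [0, 1], yearly expenditure in Euros).
toy_customers = [(0.0, 4000.0), (0.25, 2900.0), (0.5, 1950.0),
                 (0.75, 1450.0), (1.0, 1000.0)]

entropy = [e for e, _ in toy_customers]
log_spend = [math.log(s) for _, s in toy_customers]

intercept, slope = fit_line(entropy, log_spend)

# With log(spend) = intercept + slope * entropy, moving from entropy 1 to
# entropy 0 multiplies the predicted expenditure by exp(-slope).
ratio = math.exp(-slope)
print("slope:", slope)
print("predicted spend at entropy 0 vs entropy 1:", ratio, "times as much")
```

The key step is the last one: because the model is linear in the logarithm of the expenditure, a difference in entropy becomes a multiplicative factor on the expenditure itself, which is what a sentence like "four times as much" refers to.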

10 July 2015 ~ 0 Comments

Collective Intelligence 2015 Report

As I wrote previously, this year I missed NetSci, the yearly appointment for everybody who is interested in network analysis. The reason is that I was invited to give a talk at the Collective Intelligence conference, which happened almost at the same time. And once I got an invitation from Lada Adamic, I knew I couldn’t say no to her. Look at the things she did and is doing: she is a superstar scientist! So I packed my bags and went to the West Coast.

The first day was immediately a blast. Jeff Howe chaired the first session with some great insights about crowdsourcing. As you know, crowdsourcing is a super hip thing nowadays. It goes like this: individually, each one of us is pretty terrible at solving a hard problem. But if we put together enough terrible people, their errors cancel each other out on average and we get an almost perfect performance. The term “crowdsourcing” itself was basically invented by Jeff (and Mark Robinson) when he was writing for Wired. The speakers in Jeff’s session showed us some cool examples of crowdsourcing research. The one that stuck with me the most was from Ágnes Horvát: she and her co-authors were able to analyze the internal communications of a hedge fund about investments and use the features of this communication (frequency of messages, mood, etc) to predict how the investments would perform. And they got it right much more often than the strategists at the hedge fund itself.
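The "many terrible guessers make one good guess" argument can be checked with a tiny simulation. This is only a sketch under a strong assumption: every person's error is independent and unbiased (here, Gaussian noise around the truth). The function name and all the numbers are made up for illustration; real crowds are rarely this well behaved.

```python
import random
import statistics


def crowd_vs_individual(truth, crowd_size, noise, seed=0):
    """Return (error of the averaged guess, average error of a single guess)."""
    rng = random.Random(seed)
    guesses = [truth + rng.gauss(0, noise) for _ in range(crowd_size)]
    crowd_error = abs(statistics.fmean(guesses) - truth)
    typical_error = statistics.fmean([abs(g - truth) for g in guesses])
    return crowd_error, typical_error


for size in (1, 10, 100, 1000):
    crowd_error, typical_error = crowd_vs_individual(
        truth=100.0, crowd_size=size, noise=20.0
    )
    print(f"crowd of {size:4d}: crowd error {crowd_error:6.2f}, "
          f"typical individual error {typical_error:6.2f}")
```

The averaged guess can never be worse than the typical individual guess, and under the independence assumption it tends to improve as the crowd grows. If everybody shares the same bias, though, averaging does not help: the errors no longer cancel out.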


The second day started with the session with my talk in it. A talk about memes of course! The people I got lined up with were spectacular. Jacob Foster talked about the collective intelligence of science. How do scientists make sense of the incredible amount of research out there? And how is it possible to advance knowledge in such hard times, when there are tens of new studies published every day? Dean Eckles gave an insightful talk about how Facebook users react when their stories get “snoped” (Snopes is a website dedicated to debunking hoaxes). Finally, fellow Italian Walter Quattrociocchi also spoke about hoaxes on Facebook: how they spread, how conspiracy believers interact with skeptics, and so on.

In the next session I attended, I particularly liked two talks. First, Ben Green talked about collective intelligence, and what it actually is. It reminded me of community discovery in networks: scientists dove enthusiastically into it, producing hundreds of papers. However, many didn’t realize that “communities” (and “collective intelligence”) are not so easily defined. Green is trying to fix that. Richard Mann’s talk was also very interesting: in his work with Dirk Helbing he designed incentive strategies for getting the best out of the wisdom of crowds.


The lunch keynote was from a superstar in collective intelligence: Regina Dugan. Just to give you an idea about her, her CV sports a position as program manager at DARPA and she currently is vice president of Engineering, Advanced Technology and Projects at Google. Not bad. She shared her experiences in directing and experiencing the process of doing cutting edge research. Her talk was a textbook example of motivational speaking for scientists and entrepreneurs alike.

Finally, I had the pleasure to attend a couple of talks about prediction markets. These communities are basically a stock market for opinions. Given an event, say the 2016 presidential election, people can put money on their prediction of who is going to be the winner. Websites like SciCast put in place some rules about buying and selling opinion “stocks” and eventually the market price converges on people’s best estimate of every candidate’s odds of winning. Prediction markets are a favorite of Nate Silver, and he talks quite a lot about them in “The Signal and the Noise”.


Unfortunately, my report of the conference ends abruptly here, as I had to miss the last day of the conference. But the experience was well worth the trip, and I am very grateful to Lada Adamic, Scott Page and Deborah Gordon for the invitation. On the downside, this also means that I discovered a shiny event that overlaps with NetSci. Next year, I’ll have to face hard decisions when I allocate my conference time in early June.

29 May 2015 ~ 0 Comments

Networks of Networks – NetSci 2015

The time has finally come! The NetSci conference, the place to be if you are interested in complex networks, is happening next week, from June 1st to June 5th. The venue is in Zaragoza, Spain. You can get all the information you need about the event from the official website. For the third year, I am co-organizing one of its satellite events: the Multiple Network Modeling, Analysis and Mining symposium, this year held jointly with Networks of Networks. The satellite will take place on June 2nd. As I previously said, unfortunately I am not going to be physically present in Spain, and that makes me sad, because we have a phenomenal program this year.

We have four great invited speakers: Giovanni Sansavini, Rui Carvalho, Arunabha Sen and Katharina Zweig. It is a perfect mix between the infrastructure focus of the networks of networks crowd and the multidisciplinary approach of multiple networks. Sansavini works on reliability and risk engineering, while Carvalho focuses on characterizing and modeling networks in energy. Sen and Zweig provide their outstanding experience in the fields of computer networks and graph theory.

Among the contributed talks I am delighted to see that many interesting names from the network analysis crowd decided to send their work to be presented in our event. Among the highlights we have a contribution from the group of Mason Porter, who won last year’s Erdős–Rényi Prize as one of the most outstanding young network scientists. I am also happy to see contributions from the group of Cellai and Gleeson, with whom I share not only an interest in multiplex networks, but also in internet memes. Contributions from groups led by heavyweights like Schweitzer and Havlin are another sign of the attention that this event has captured.

I hope many of you will attend this seminar. You’ll be in good hands: Gregorio D’Agostino, Przemyslaw Kazienko and Antonio Scala will be much better hosts than I can ever be. I am copying here the full program of the event. Enjoy Spain!

NoN’15 Program

Session I

9.00 – 9.30 Speaker Set Up

9.30 – 9.45 Introduction: Welcome from the organizers, presentation of the program

9.45 – 10.15 Keynote I: Giovanni Sansavini. Systemic risk in critical infrastructures

10.15 – 10.35 Contributed I: Davide Cellai and Ginestra Bianconi. Multiplex networks with heterogeneous activities of the nodes

10.35 – 10.55 Contributed II: Mikko Kivela and Mason Porter. Isomorphisms in Multilayer Networks

10.55 – 11.30 Coffee Break

Session II

11.30 – 12.00 Keynote II: Rui Carvalho, Lubos Buzna, Richard Gibbens and Frank Kelly. Congestion control in charging of electric vehicles

12.00 – 12.20 Contributed III: Saray Shai, Dror Y. Kenett, Yoed N. Kenett, Miriam Faust, Simon Dobson and Shlomo Havlin. A critical tipping point in interconnected networks

12.20 – 12.40 Contributed IV: Adam Hackett, Davide Cellai, Sergio Gomez, Alex Arenas and James Gleeson. Bond percolation on multiplex networks

12.40 – 13.00 Contributed V: Marco Santarelli, Mario Beretta, Giorgio D’Urbano, Lorenzo Spina, Renato De Leone and Emilia Marchitto. Soccer and networks: changing the way of playing soccer through GPS, video analysis and social networks

13.00 – 14.30 Lunch

Session III

14.30 – 15.00 Keynote III: Arunabha Sen. Strategic Analysis and Design of Robust and Resilient Interdependent Power and Communication Networks with a New Model of Interdependency

15.00 – 15.20 Invited I: Alfonso Damiano, Univ. di Cagliari – Electric Market – Italy; Antonio Scala, CNR-ICS, IMT, LIMS

15.20 – 15.40 Contributed VI: Rebekka Burkholz, Antonios Garas, Matt V Leduc, Ingo Scholtes and Frank Schweitzer. Cascades on Multiplexes with Threshold Feedback

15.40 – 16.00 Contributed VII: Soumajit Pramanik, Maximilien Danisch, Qinna Wang, Jean-Loup Guillaume and Bivas Mitra. Analyzing the Impact of Mentioning in Twitter

16.00 – 16.30 Coffee Break

Session IV

16.00 – 16.30 Keynote IV: Katharina Zweig. Science-theoretic musings on the analysis of networks (of networks)

16.30 – 16.50 Contributed VIII: Vinko Zladic, Sebastian Krause, Michael Danziger. Avoidable colors percolation

16.50 – 17.10 Contributed IX: Borut Sluban, Jasmina Smailovic, Igor Mozetic and Stefano Battiston. Sentiment Leaning of Influential Communities in Social Networks

17.10 – 17.30 Invited II: one speaker from the CI2C project (confirmed, yet to be defined)

17.30   Planning Netonets Future Activities

02 April 2015 ~ 0 Comments

A Marriage of Networks

My personal quest as ambassador of multiple networks at NetSci (previous episodes here and here) continues this year. And, as every year, there are new exciting things coming along. This year, the usual satellite I organize is marrying another satellite. We are in fact merging our multiple networks with the Networks of Networks crowd. Networks of Networks is a society that has been holding its own satellite at NetSci for quite a while. They are also interested in networks with multiple node and edge types, with more attention to infrastructure-flavored networks: computer networks, power grids, water infrastructure and so on. We are very excited to see what the encounter between multiple networks and networks of networks, directed in particular by Antonio Scala and Gregorio D’Agostino, will generate.

The marriage is a promising one because, when talking about multiple networks and infrastructure, technical knowledge is dispersed among experts of different sectors – system operators from different industries (electric, gas, telecommunication, food chain, water supply, etc) – while researchers from different fields developed a number of different strategies to deal with these complex objects – from computer science to physics, from economics to humanities. Being exposed to these approaches, and comparing one’s own understanding of what the analysis of multiple interdependent networks can do with that of others, is key for the development of a common language to integrate the knowledge from all sectors. Complex Networks can be a common language for the needed federated approaches at both the microscopic and the macroscopic level. This satellite is here exactly to foster the development of such a common language.

The usual practical information you might find useful:

  • The satellite will take place on June 2nd, 2015. It will be held, as usual, jointly with the other NetSci satellites. The location will be Zaragoza, Spain. The information about how to get there is included in the NetSci website.
  • The official website of the satellite is hosted by the Net-o-Nets parent website. The official page is this one. Information about the satellite is pretty bare-bones at the moment, but we’ll flesh it out in the following weeks.
  • We are open to submissions! You can send in your abstract and we’ll consider you for a contributed talk. The submission system goes through EasyChair, and this is the official link. The deadline for submission is April 19th, 2015 and we will notify you on April 29th.

Sadly, I will not be present in person at the event due to conflicting schedules. So I will not be able to write the usual report. I’ll leave you in the best hands possible. Submit something, and stop by in Zaragoza: you’ll find an exciting and stimulating crowd!