Michele Coscia - Connecting Humanities

Michele Coscia I am a post-doc fellow at the Center for International Development, Harvard University in Cambridge. I mainly work on mining complex networks, and on applying the extracted knowledge to international development and governance. My background is in Digital Humanities, i.e. the connection between the unstructured knowledge and the cold organized computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In my career, I also worked at the Center for Complex Network Research at Northeastern University, with Albert-Laszlo Barabasi. On this website you can browse my papers, algorithms and datasets with the top navigation, or simply skim my blog posts that briefly present my topics and papers below this box.

18 November 2015 ~ 0 Comments

Evaluating Prosperity Beyond GDP

When reporting on economics, news outlets very often refer to what happens to the GDP. How is policy X going to affect our GDP? Is the national debt too high compared to GDP? How does my GDP compare to yours? The concept lurking behind those three letters is the Gross Domestic Product, the measure of the gross value added by all domestic producers in a country. In principle, the idea of using GDP to take the pulse of an economy isn’t bad: we count how much we can produce, and this is more or less how well we are doing. In practice, today I am jumping on the huge bandwagon of people who despise GDP for its meaningless, oversimplified and frankly suspicious nature. I will talk about a paper in which my co-authors and I propose to use a different measure to evaluate a country’s prosperity. The title is “Going Beyond GDP to Nowcast Well-Being Using Retail Market Data”, my co-authors are Riccardo Guidotti, Dino Pedreschi and Diego Pennacchioli, and the paper will be presented at the Winter edition of the Network Science Conference.

GDP is gross for several reasons. What Simon Kuznets said resonates strongly with me, as already in the 30s he was talking like a complexity scientist:

The valuable capacity of the human mind to simplify a complex situation in a compact characterization becomes dangerous when not controlled in terms of definitely stated criteria. With quantitative measurements especially, the definiteness of the result suggests, often misleadingly, a precision and simplicity in the outlines of the object measured. Measurements of national income are subject to this type of illusion and resulting abuse, especially since they deal with matters that are the center of conflict of opposing social groups where the effectiveness of an argument is often contingent upon oversimplification.


In short, GDP is an oversimplification, and as such it cannot capture something as complex as an economy, or the multifaceted needs of a society. In our paper, we focus on some of its specific aspects. Income inequality skews the wealth distribution, so that GDP doesn’t describe how the majority of the population is doing. But more importantly, it is not possible to quantify well-being just with the number of dollars in someone’s pocket: she might have dreams, aspirations and sophisticated needs that bear little to no correlation with the status of her wallet. And even if GDP were a good measure, it’s very hard to calculate: it takes months to estimate it reliably. Nowcasting it would be great.

And so we tried to hack our way out of GDP. The measure we decided to use is customer sophistication, which I have presented several times in the past. In practice, the measure is a summary of the connectivity of a node in a bipartite network*. The bipartite network connects customers to the products they buy. The more varied the set of products a customer buys, the more sophisticated she is. Our idea was to create an aggregated version at the network level, and to see if this version was telling us something insightful. We could correlate it directly with the national GDP of Italy, because the data we used to calculate it comes from around half a million customers from several Italian regions, which are representative of the country as a whole.
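To make the bipartite picture concrete, here is a minimal sketch on made-up data. It assumes a plain count of distinct products (the degree of the customer node) as a stand-in for sophistication; the actual measure is not spelled out in this post, so all names and the aggregation by simple mean are illustrative assumptions only.

```python
from collections import defaultdict

# Toy purchase log: (customer, product) pairs. All names are made up.
purchases = [
    ("anna", "water"), ("anna", "bread"), ("anna", "t-shirt"),
    ("bruno", "water"), ("bruno", "water"), ("bruno", "bread"),
    ("carla", "water"),
]

def build_bipartite(pairs):
    """Return customer -> set of distinct products she is connected to."""
    baskets = defaultdict(set)
    for customer, product in pairs:
        baskets[customer].add(product)
    return dict(baskets)

def customer_degree(baskets):
    """Degree of each customer node: the number of distinct products bought."""
    return {customer: len(products) for customer, products in baskets.items()}

def network_average(degrees):
    """Naive network-level aggregate: the mean degree over all customers."""
    return sum(degrees.values()) / len(degrees)

baskets = build_bipartite(purchases)
degrees = customer_degree(baskets)
print(degrees)                   # per-customer connectivity
print(network_average(degrees))  # one number for the whole network
```

Repeated purchases of the same product (bruno buys water twice) do not add connectivity here, because only the existence of a customer-product link matters in this sketch.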


The argument we made goes as follows. GDP stinks, but it is not 100% bad, otherwise nobody would use it. Our sophistication is better, because it is connected to the average degree with which a person can satisfy her needs**. Income inequality does not affect it either, at least not in trivial ways as it does with GDP. Therefore, if sophistication correlates with GDP it is a good measure of well-being: it captures part of GDP and adds something to it. Finally, if the correlation happens with some anticipated temporal shift it is even better, because GDP pundits can just use it as instantaneous nowcasting of GDP.

We were pleased when our expectations met reality. We tested several versions of the measure at several temporal shifts, both anticipating and following the GDP estimate released by the Italian National Institute of Statistics (ISTAT). When we applied the statistical correction to control for multiple hypothesis testing, the only surviving significant and robust estimate was our customer sophistication measure calculated with a temporal shift of -2, i.e. two quarters before the corresponding GDP estimate was released. Before popping our champagne bottles, let me write an open letter to the elephant in the room.
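For readers who want to see what "testing at several temporal shifts" can look like, here is a minimal sketch. It assumes a plain Pearson correlation and a Bonferroni correction; the post does not say which correlation or which correction the paper uses, and the two series are invented so that the indicator leads GDP by exactly two quarters.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_pearson(indicator, gdp, shift):
    """Correlate indicator[t + shift] with gdp[t].

    A negative shift pairs each GDP value with an earlier indicator value,
    i.e. the indicator anticipates GDP.
    """
    pairs = [
        (indicator[t + shift], gdp[t])
        for t in range(len(gdp))
        if 0 <= t + shift < len(indicator)
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold when n_tests hypotheses are tested."""
    return alpha / n_tests

# Invented quarterly series: the indicator shows GDP's moves two quarters early.
gdp = [1, 3, 2, 5, 4, 6, 5, 8]
indicator = [2, 5, 4, 6, 5, 8, 7, 9]

shifts = [-2, -1, 0, 1, 2]
for shift in shifts:
    print(shift, round(lagged_pearson(indicator, gdp, shift), 3))
print("per-test alpha:", bonferroni_alpha(0.05, len(shifts)))
```

With five shifts tested, each individual test has to clear a stricter threshold than the usual 0.05, which is the reason only the strongest relationship survives a correction of this kind.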


As you see from the above chart, there are some wild seasonal fluctuations. This is rather obvious, but controlling for them is not easy. There is a standard approach, the X-13-ARIMA method, which is more complicated than simply averaging out the fluctuations. Its parameter tuning procedure requires information we simply do not have for our measure, and it needs observation windows longer than what we have (2007-2014). It is quite possible that our result could disappear once seasonality is properly controlled for. It is also possible that the way we calculated our sophistication index makes no sense economically: I am not an economist and I do not pretend for a moment that I can tell them how to do their job.

What we humbly report is a blip on the radar. It is that kind of thing that makes you think “Uh, that’s interesting, I wonder what it means”. I would like someone with a more solid skill set in economics to take a look at this sophistication measure and to do a proper stress-test with it. I’m completely fine with her coming back to tell me I’m a moron. But that’s the risk of doing research and trying out new things. I just think that it would be a waste not to give this promising insight a chance to shine.

* Even if hereafter I talk only about the final measure, it is important to remark that it is by no means a complete substitute for the analysis of the bipartite network. Meaning that I’m not simply advocating replacing one number (GDP) with another (sophistication), but rather replacing GDP with a full-blown network analysis.

** Note that this is a revealed measure of sophistication as inferred by the products actually bought and postulating that each product satisfies one or a part of a “need”. If you feel that the quality of your life depends on you being able to bathe in the milk of a virgin unicorn, the measure will not take into account the misery of this tacit disappointment. Such are the perils of data mining.

16 October 2015 ~ 0 Comments

Central Places and Sophistication


Looking at a population map, one may wonder why sometimes you find metropoles in the middle of nowhere — I’m looking at you, Phoenix. Or why cities are distributed the way they are. When in doubt, you should always refer to your favorite geographer. She would probably be very happy to direct your interest to the Central Place Theory (CPT), developed by Walter Christaller in the 30s. The theory simply states that cities provide services to the surrounding areas. As a consequence, the big cities will provide many services and small cities a few, therefore the small cities will gravitate around larger settlements. This smells like complexity science to me and this post is exactly about connecting CPT with my research on retail customer sophistication and mobility. But first I need to convince you that CPT actually needs this treatment.

CPT explains why sometimes you will need a big settlement in the middle of the desert. That is because, for most of history, civilizations relied on horses instead of the interwebz for communication and, with very long stretches of nothing, that system would fall apart. That is why Phoenix has been an obsolete city since 1994 at the very least, and people should just give it up and move on. You now might be tempted to take a look at the Wikipedia page of the Central Place Theory to get some more details. If you do, you might notice a few “simplifications” used by Christaller when developing the theory. And if you don’t, let me spoil it for you. Lo and behold, to make CPT work we need:

  • An infinite flat Earth — easy-peasy-lemon-squeezy compared to what comes next;
  • Perfectly homogeneous distribution of people and resources;
  • Perfectly equidistant cities in a grid much like the one of Civilization 5;
  • The legendary perfect competition and rational market conjured by economists out of thin air;
  • Only one mode of transportation;
  • A completely homogeneous population, all equal in desires and income.

In short, the original CPT works in a world that is no more real than Mordor.


And here is where sophistication comes into play. I teamed up with Diego Pennacchioli and Fosca Giannotti with the objective of discovering the relationship between CPT and our previous research on sophistication. The result is in the paper Product Assortment and Customer Mobility, just published in EPJ Data Science. In the past, we showed that the more sophisticated the needs of a customer, the further the customer is willing to travel to satisfy them. And our sophistication measure worked better than other product characteristics, such as the price and its average selling volume.

Now, to be honest, geographers did not sleep for 80 years, and they already pointed out the problems of CPT. Some of them developed extensions to get rid of many troubling assumptions, others tested the predictions of these models, others just looked at Phoenix in baffled awe. However, without going too in depth (I’m not exactly qualified to do it) these new contributions are either very theoretical in nature, or they haven’t used larger and more detailed data validation. Also, the way central places are defined is unsatisfactory to me. Central places are either just very populous cities, or cities with a high variety of services. For a person like me trained in complexity science, this is just too simple. I need to bring sophistication into the mix.


Focusing on my supermarket data, variety is the number of different products provided. Two supermarkets selling three items each have the same variety. Sophistication requires the products not only to be different, but also to satisfy different needs. Suppose shop #1 sells water, juice and soda, and shop #2 sells water, bread and T-shirts. Even if the shops have the same variety, shop #2 is more sophisticated than shop #1. And indeed the sophistication of a shop is the better predictor of its “retention rate”, its ability to preserve its customer base even for customers who live far away from the shop. That is what the above table reports: controlling for distance (which causes a 2.6 percentage point loss of customer base per extra minute of travel), each standard deviation increase in sophistication strengthens the retention rate by 11 percentage points. Variety of products does not matter, while the volume of the shop (its sheer size) matters just a bit.
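The water-juice-soda example can be written down in a few lines. This is only a toy illustration: the mapping from products to needs is invented here, and counting distinct needs is a crude stand-in for the actual sophistication measure.

```python
# Hypothetical mapping from each product to the "need" it satisfies.
NEED_OF = {
    "water": "drinks",
    "juice": "drinks",
    "soda": "drinks",
    "bread": "food",
    "t-shirt": "clothing",
}

def variety(shop):
    """Number of different products on the shelves."""
    return len(set(shop))

def needs_covered(shop):
    """Number of different needs those products satisfy (toy sophistication)."""
    return len({NEED_OF[product] for product in shop})

shop_1 = ["water", "juice", "soda"]
shop_2 = ["water", "bread", "t-shirt"]

print(variety(shop_1), variety(shop_2))              # same variety
print(needs_covered(shop_1), needs_covered(shop_2))  # different sophistication
```

Both shops stock three products, but the first one covers a single need while the second covers three.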

In practice, what we found is that CPT holds in our data where big supermarkets play the role of big cities and provide more sophisticated “services”. This is a nice finding for two reasons. First, it confirms the intuition of CPT in a real world scenario, making us a bit wiser about the world in which we live, and maybe helping us avoid mistakes in the future, such as creating a new Phoenix. This is non-trivial: the space in our data is neither infinite nor homogeneous, the market is not perfect, and the people are differentiated. Yet CPT holds, with our sophistication measure as the driving factor. Second, it validates our sophistication measure in a theoretical framework, potentially giving it the power to be used more widely than what we have done so far. However, both contributions are rather theoretical. I’m a man of deeds, so I asked myself: are there immediate applications of this finding?


There might be one, with caveats. Remember we are analyzing hundreds of supermarkets in Italy. We know things about these supermarkets. First, we have a shop type, which by accident correlates with sophistication very well. Then, we know if the shop was closed down during the multi-year observation period. We can’t know the reason, thus everything that follows is a speculation to be confirmed, but we can play with this. We can compare the above mentioned retention rate of closing and non-closing shops. We can also define a catch rate. While “retention” meant how many of your closest customers you can keep, catch means how many of the non-closest customers you can get. The above plots show retention and catch ratios. The higher the number the more the ratio is in favor of the non-closing shop.
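Here is a rough sketch of the two rates as described in words above. The data layout (one record per customer, with her closest shop and the set of shops she visits) is an assumption made for illustration, and the paper's actual definitions, which also involve distance, may well differ.

```python
# Made-up customers: which shop is closest to home, and which shops they visit.
customers = [
    {"closest": "A", "visited": {"A"}},
    {"closest": "A", "visited": {"B"}},
    {"closest": "A", "visited": {"A", "B"}},
    {"closest": "B", "visited": {"B"}},
    {"closest": "B", "visited": {"A"}},
    {"closest": "B", "visited": {"B"}},
    {"closest": "B", "visited": {"B"}},
]

def retention_rate(shop, customers):
    """Share of the shop's closest customers who actually shop there."""
    nearby = [c for c in customers if c["closest"] == shop]
    kept = [c for c in nearby if shop in c["visited"]]
    return len(kept) / len(nearby)

def catch_rate(shop, customers):
    """Share of customers living closer to another shop who still shop here."""
    others = [c for c in customers if c["closest"] != shop]
    caught = [c for c in others if shop in c["visited"]]
    return len(caught) / len(others)

for shop in ("A", "B"):
    print(shop, retention_rate(shop, customers), catch_rate(shop, customers))
```

Dividing the rate of the shops that stayed open by the rate of the shops that closed would then give a ratio of the kind the plots report, where higher values favor the non-closing shops.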

For the retention rate, the average sophistication shops (green) have by far the largest spread between shops that are still open and the ones which got shut down. It means that these medium shops survive if they can keep their nearby customers. For the catch rate, the very sophisticated shops (red) are always on top, regardless of distance. It means that large shops survive if they really can attract customers, even if they are not the closest shop. The small shops (blue) seem to obey neither logic. The application of this finding is now evident: sophistication can enlighten us as to the destiny of different types of shops. If medium shops fail to retain their nearby customers, they’re likely to shut down. If large shops don’t catch a wider range of customers, they will shut down. This result talks about supermarkets, but there are likely connections with settlements too, replacing products with various services. Once we calculate a service sophistication, we could know which centers are aptly placed and which ones are not and should be closed down. I know one for sure even without running regressions: Phoenix.


12 August 2015 ~ 1 Comment

Entropy Applied to Shopping

I don’t know about you guys, but when it comes to groceries I show behaviors that are strongly reminiscent of Rain Man. I go to the supermarket the same day of the week (Saturday) at the same time (9 AM), I want to go through the shelves in the very same order (the good ol’ veggie-cookies-pasta-meat-cat food track), I buy mostly the same things every week. Some supermarkets periodically re-order their shelves, for reasons that are unknown to me. That’s enraging, because it breaks my pattern. The mahātmā said it best:


Amen to that. As a consequence, I signed up immediately when my friends Riccardo Guidotti and Diego Pennacchioli told me about a paper they were writing about studying the regularity of customer behavior. Our question was: what is the relationship between the regularity of a customer’s behavior and her profitability for a shop? The results are published in the paper “Behavioral Entropy and Profitability in Retail”, which will be presented at the International Conference on Data Science and Advanced Analytics, in October. To my extreme satisfaction the answer is that the more regular customers are also the most profitable. I hope that this cry for predictability will reach at least the ears of the supermarket managers where I shop. Ok, so: how did we get to this conclusion?

First, we need to measure regularity in a reasonable way. We propose two ways. First, a customer is regular if she buys mostly the same stuff every time she shops, or at least her baskets can be described with few typical “basket templates”. Second, a customer is regular if she shows up always at the same supermarket, at the same time, on the same day of the week. We didn’t have to reinvent the wheel to figure out a way for evaluating regularity in signals: giants of the past solved this problem for us. We decided to use the tools of information theory, in particular the concept of information entropy. Information entropy tells how much information there is in an event. In general, the more uncertain or random the event is, the more information it will contain.


If a person always buys the same thing, no matter how many times she shops, we can fully describe her purchases with a single piece of information: the thing she buys. Thus, there is little information in her observed shopping events, and she has low entropy. This we call Basket Revealed Entropy. Low basket entropy, high regularity. The same reasoning applies if she always goes to the same shop at the same time, and we call this measure Spatio-Temporal Revealed Entropy. Now the question is: what happens to a customer’s expenditure for different levels of basket and spatio-temporal entropy?
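As a minimal sketch of the idea, here is the textbook Shannon entropy applied to a customer's sequence of shopping events. It assumes each shopping trip has already been reduced to a label (a basket template, or a shop-and-time slot); how the paper builds those labels is not covered in this post, and the labels below are made up.

```python
from collections import Counter
from math import log2

def shannon_entropy(events):
    """Shannon entropy, in bits, of the empirical distribution of events."""
    counts = Counter(events)
    total = len(events)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Hypothetical basket templates observed over four shopping trips.
regular_customer = ["veggie-pasta"] * 4
erratic_customer = ["veggie-pasta", "party", "bbq", "cat-food"]

print(shannon_entropy(regular_customer))  # always the same basket
print(shannon_entropy(erratic_customer))  # a different basket every time
```

The regular customer has zero entropy, because her next basket is never a surprise, while the erratic one reaches the maximum possible for four equally likely templates (two bits).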

To wrap our heads around these two concepts we started by classifying customers according to their basket and spatio-temporal entropy. We used the k-Means algorithm, which simply tries to find “clumps” in the data. You can think of customers as ants choosing to sit in a point in space. The coordinates of this point are the basket and spatio-temporal entropy. k-Means will find the parts of this space where there are many ants nearby each other. In our case, it found five groups:

  1. The average people, with medium basket and spatio-temporal entropy;
  2. The crazy people, with unpredictable behavior (high basket and spatio-temporal entropy);
  3. The movers, with medium basket entropy, but high spatio-temporal entropy (they shop in unpredictable shops at unpredictable times);
  4. The nomads, similar to the movers, with low basket entropy but high spatio-temporal entropy;
  5. The regulars, with low basket and spatio-temporal entropy.
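To give a feeling for how k-Means finds these clumps, here is a bare-bones sketch on invented points. It uses two clusters instead of five, and fixed starting centroids instead of the random initialization a real run would use, purely to keep the example short and repeatable.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm on 2D points with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Invented customers as (basket entropy, spatio-temporal entropy) points.
regulars = [(0.10, 0.10), (0.15, 0.05), (0.05, 0.12)]
crazies = [(0.90, 0.85), (0.95, 0.90), (0.85, 0.95)]

centroids, clusters = kmeans(regulars + crazies, [(0.0, 0.0), (1.0, 1.0)])
print(centroids)
print(clusters)
```

Each "ant" ends up grouped with the ants sitting near it: the low-entropy points form one cluster and the high-entropy points the other.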


Once you have cubbyholed your customers, you can start doing some simple statistics. For instance: we found out that the regulars (class E in the histogram, group 5 in the list above) spend more per capita over the year (4,083 Euros) than the crazy ones (class B, group 2 in the list; 2,509 Euros, see the histogram above). The regulars also visit the shop more often: 163 times a year. This is nice, but one wonders: why haven’t the supermarket managers figured it out yet? Well, they may have, but there is also a catch: incurable creatures of habit like me aren’t a common breed. In fact, if we redo the same histograms looking at the group total yearly values of expenditures and baskets, we see that class E is the least profitable, because few people are very regular (only 6.9%):


Without dividing customers into discrete classes, we can look at the direct relationship between behavioral entropy and the yearly expenditure of a customer. This aggregated behavioral entropy measure is simply the product of basket and spatio-temporal entropy. Unsurprisingly, entropy and expenditure are negatively correlated:


Finally, we want to quantify this relationship. We want to have an objective way to tell how much more money the supermarket could make if the customers were more regular. We didn’t get too fancy here, just a linear model where we try to predict the customers’ expenditures from their basket and spatio-temporal entropy. We don’t care very much about causation here, we just want to make the point that basket and spatio-temporal entropy are interesting measures.


The negative sign isn’t a surprise: the more chaotic a customer’s life, the lower her expenditures. What the coefficients tell us is that we expect the least chaotic (0) customer to spend almost four times as much as the most chaotic (1) customer*. You can understand why this was an extremely pleasant finding for me. This week, I’m going to print out the paper and ask to see the supermarket manager. I’ll tell him: “Hey, if you stop moving stuff around and you encourage your customers to be more and more regular, maybe you could increase your revenues”. Only that I won’t do it, because that’d break my Saturday shopping routine. Oh dear.

* The interpretation of coefficients in regressions is a bit tricky, especially when transforming your variables with logs. Here, I just jump straight to the conclusion. See here for the full explanation, if you don’t believe me.
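For the curious, this is roughly how such a reading works, under assumptions that are not taken from the paper: the sketch supposes that only the expenditure is log-transformed, that both entropies range from 0 to 1, and it plugs in invented coefficients chosen just to land near "almost four times".

```python
from math import exp

# Hypothetical coefficients of a model of the form
#   log(expenditure) = INTERCEPT + B_BASKET * basket + B_SPATIOTEMPORAL * st
# The real values are in the paper's regression table, not here.
INTERCEPT = 8.0
B_BASKET = -0.8
B_SPATIOTEMPORAL = -0.5

def predicted_expenditure(basket_entropy, st_entropy):
    """Expenditure predicted by the log-linear model above."""
    log_spend = (
        INTERCEPT
        + B_BASKET * basket_entropy
        + B_SPATIOTEMPORAL * st_entropy
    )
    return exp(log_spend)

# Least chaotic customer (0, 0) versus most chaotic customer (1, 1).
ratio = predicted_expenditure(0, 0) / predicted_expenditure(1, 1)
print(ratio)
```

Because the outcome is in logs, the intercept cancels out and the ratio depends only on the coefficients: it equals exp(-(B_BASKET + B_SPATIOTEMPORAL)), so negative coefficients translate into a multiplicative, not additive, gap in spending.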

10 July 2015 ~ 0 Comments

Collective Intelligence 2015 Report

As I wrote previously, this year I missed NetSci, the yearly appointment for everybody who is interested in network analysis. The reason is that I was invited to give a talk at the Collective Intelligence conference, which happened almost at the same time. And once I got an invitation from Lada Adamic, I knew I couldn’t say no to her. Look at the things she did and is doing: she is a superstar scientist! So I packed my bags and went to the West Coast.

The first day was immediately a blast. Jeff Howe chaired the first session with some great insights about crowdsourcing. As you know, crowdsourcing is a super hip thing nowadays. It goes like this: individually, each one of us is pretty terrible at solving a hard problem. But if we put together enough terrible people, their errors average out and we get an almost perfect performance. The term crowdsourcing itself was basically coined by Jeff (and Mark Robinson) when he was writing for Wired. The speakers in Jeff’s session showed us some cool examples of crowdsourcing research. The one that stuck with me the most was from Ágnes Horvát: she and her co-authors were able to analyze the internal communications of a hedge fund about investments and use the features of this communication (frequency of messages, mood, etc) to predict how the investments would perform. And they got it right much more often than the strategists at the hedge fund itself.
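The averaging argument is easy to simulate. This toy sketch assumes each person's guess is the true value plus independent, unbiased Gaussian noise, which is a strong assumption: real crowds can share biases that no amount of averaging removes.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

TRUTH = 100.0    # the quantity everybody is trying to guess
NOISE = 30.0     # each individual is pretty terrible on their own

guesses = [TRUTH + random.gauss(0, NOISE) for _ in range(1000)]

# Error of the crowd's averaged guess.
crowd_guess = sum(guesses) / len(guesses)
crowd_error = abs(crowd_guess - TRUTH)

# Average error of a single member of the crowd.
mean_individual_error = sum(abs(g - TRUTH) for g in guesses) / len(guesses)

print("crowd error:", crowd_error)
print("typical individual error:", mean_individual_error)
```

The error of the averaged guess can never exceed the average individual error, and with independent noise it shrinks as the crowd grows.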


The second day started with the session with my talk in it. A talk about memes of course! The people I got lined up with were spectacular. Jacob Foster talked about the collective intelligence of science. How do scientists make sense of the incredible amount of research out there? And how is it possible to advance knowledge in such hard times, when there are tens of new studies published every day? Dean Eckles gave an insightful talk about how Facebook users react when their stories get “snoped” (Snopes is a website dedicated to debunking hoaxes). Finally, fellow Italian Walter Quattrociocchi also spoke about hoaxes on Facebook: how they spread, how conspiracy believers interact with skeptics, and so on.

In the next session I attended, I particularly liked two talks. First, Ben Green talked about collective intelligence, and what it actually is. It reminded me of community discovery in networks: scientists dove enthusiastically into it, producing hundreds of papers. However, many didn’t realize that “communities” (and “collective intelligence”) are not so easily defined. Green is trying to fix that. Richard Mann‘s talk was also very interesting: in his work with Dirk Helbing he designed incentive strategies for getting the best out of the wisdom of crowds.


The lunch keynote was from a superstar in collective intelligence: Regina Dugan. Just to give you an idea about her, her CV sports a position as program manager at DARPA and she currently is vice president of Engineering, Advanced Technology and Projects at Google. Not bad. She shared her experiences in directing and experiencing the process of doing cutting edge research. Her talk was a textbook example of motivational speaking for scientists and entrepreneurs alike.

Finally, I had the pleasure to attend a couple of talks about prediction markets. These communities are basically a stock market for opinions. Given an event, say the 2016 presidential election, people can put money on their prediction of who is going to be the winner. Websites like SciCast put in place some rules about buying and selling opinion “stocks” and eventually the market price converges on people’s best estimate of every candidate’s odds of winning. Prediction markets are a favorite of Nate Silver, and he talks quite a lot about them in “The Signal and the Noise”.


Unfortunately, my report of the conference ends abruptly here, as I had to miss the last day of the conference. But the experience was well worth the trip, and I am very grateful to Lada Adamic, Scott Page and Deborah Gordon for the invitation. Unfortunately, this also means that I discovered a shiny event that overlaps with NetSci. Next year, I’ll have to face hard decisions when I allocate my conference time in early June.

29 May 2015 ~ 0 Comments

Networks of Networks – NetSci 2015

The time has finally come! The NetSci conference, the place to be if you are interested in complex networks, is happening next week, from June 1st to June 5th. The venue is in Zaragoza, Spain. You can get all the information you need about the event from the official website. For the third year, I am co-organizing one of its satellite events: the Multiple Network Modeling, Analysis and Mining symposium, this year held jointly with Networks of Networks. The satellite will take place on June 2nd. As I previously said, unfortunately I am not going to be physically present in Spain, and that makes me sad, because we have a phenomenal program this year.

We have four great invited speakers: Giovanni Sansavini, Rui Carvalho, Arunabha Sen and Katharina Zweig. It is a perfect mix between the infrastructure focus of the networks of networks crowd and the multidisciplinary approach of multiple networks. Sansavini works on reliability and risk engineering, while Carvalho focuses on characterizing and modeling networks in energy. Sen and Zweig provide their outstanding experience in the fields of computer networks and graph theory.

Among the contributed talks I am delighted to see that many interesting names from the network analysis crowd decided to send their work to be presented in our event. Among the highlights we have a contribution from the group of Mason Porter, who won last year’s Erdős–Rényi Prize as one of the most outstanding young network scientists. I am also happy to see contributions from the group of Cellai and Gleeson, with whom I share not only an interest in multiplex networks, but also in internet memes. Contributions from groups led by heavyweights like Schweitzer and Havlin are another sign of the attention that this event has captured.

I hope many of you will attend this seminar. You’ll be in good hands: Gregorio D’Agostino, Przemyslaw Kazienko and Antonio Scala will be much better hosts than I can ever be. I am copying here the full program of the event. Enjoy Spain!

NoN’15 Program

Session I

9.00 – 9.30 Speaker Set Up

9.30 – 9.45 Introduction: Welcome from the organizers, presentation of the program

9.45 – 10.15 Keynote I: Giovanni Sansavini. Systemic risk in critical infrastructures

10.15 – 10.35 Contributed I: Davide Cellai and Ginestra Bianconi. Multiplex networks with heterogeneous activities of the nodes

10.35 – 10.55 Contributed II: Mikko Kivela and Mason Porter. Isomorphisms in Multilayer Networks

10.55 – 11.30 Coffee Break

Session II

11.30 – 12.00 Keynote II: Rui Carvalho, Lubos Buzna, Richard Gibbens and Frank Kelly. Congestion control in charging of electric vehicles

12.00 – 12.20 Contributed III: Saray Shai, Dror Y. Kenett, Yoed N. Kenett, Miriam Faust, Simon Dobson and Shlomo Havlin. A critical tipping point in interconnected networks

12.20 – 12.40 Contributed IV: Adam Hackett, Davide Cellai, Sergio Gomez, Alex Arenas and James Gleeson. Bond percolation on multiplex networks

12.40 – 13.00 Contributed V: Marco Santarelli, Mario Beretta, Giorgio D’Urbano, Lorenzo Spina, Renato De Leone and Emilia Marchitto. Soccer and networks: changing the way of playing soccer through GPS, video analysis and social networks

13.00 – 14.30 Lunch

Session III

14.30 – 15.00 Keynote III: Arunabha Sen. Strategic Analysis and Design of Robust and Resilient Interdependent Power and Communication Networks with a New Model of Interdependency

15.00 – 15.20 Invited I: Alfonso Damiano, Univ. di Cagliari – Electric Market – Italy; Antonio Scala, CNR-ICS, IMT, LIMS

15.20 – 15.40 Contributed VI: Rebekka Burkholz, Antonios Garas, Matt V Leduc, Ingo Scholtes and Frank Schweitzer. Cascades on Multiplexes with Threshold Feedback

15.40 – 16.00 Contributed VII: Soumajit Pramanik, Maximilien Danisch, Qinna Wang, Jean-Loup Guillaume and Bivas Mitra. Analyzing the Impact of Mentioning in Twitter

16.00 – 16.30 Coffee Break

Session IV

16.00 – 16.30 Keynote IV: Katharina Zweig. Science-theoretic musings on the analysis of networks (of networks)

16.30 – 16.50 Contributed VIII: Vinko Zladic, Sebastian Krause, Michael Danziger. Avoidable colors percolation

16.50 – 17.10 Contributed IX: Borut Sluban, Jasmina Smailovic, Igor Mozetic and Stefano Battiston. Sentiment Leaning of Influential Communities in Social Networks

17.10 – 17.30 Invited II: one speaker from the CI2C project (confirmed, yet to be defined)

17.30   Planning Netonets Future Activities

02 April 2015 ~ 0 Comments

A Marriage of Networks

My personal quest as ambassador of multiple networks at NetSci (previous episodes here and here) is continuing also this year. And, as every year, there are new exciting things coming along. This year, the usual satellite I organize is marrying another satellite. We are in fact merging our multiple networks with the Networks of Networks crowd. Networks of Networks is a society that has been holding its own satellite at NetSci for quite a while. They are also interested in networks with multiple node and edge types, with more attention to infrastructure-flavored networks: computer networks, power grids, water infrastructure and so on. We are very excited to see what the encounter between multiple networks and networks of networks, directed in particular by Antonio Scala and Gregorio D’Agostino, will generate.

The marriage is a promising one because, when talking about multiple networks and infrastructure, technical knowledge is dispersed among experts of different sectors – system operators from different industries (electric, gas, telecommunication, food chain, water supply, etc) – while researchers from different fields developed a number of different strategies to deal with these complex objects – from computer science to physics, from economics to humanities. To be exposed to these approaches and to confront one’s understanding of the potentialities of the analysis of multiple interdependent networks is key for the development of a common language to integrate the knowledge from all sectors. Complex Networks can be a common language for the needed federated approaches at both microscopic and macroscopic level. This satellite is here exactly to foster the development of such common language.

The usual practical information you might find useful:

  • The satellite will take place on June 2nd, 2015. It will be held, as usual, jointly with the other NetSci satellites. The location will be Zaragoza, Spain. The information about how to get there is included in the NetSci website.
  • The official website of the satellite is hosted by the Net-o-Nets parent website. The official page is this one. Information about the satellite is pretty bare-bones at the moment, but we’ll flesh it out in the following weeks.
  • We are open to submissions! You can send in your abstract and we’ll consider you for a contributed talk. The submission system goes through EasyChair, and this is the official link. The deadline for submission is April 19th, 2015 and we will notify you on April 29th.

Sadly, I will not be present in person at the event due to conflicting schedules. So I will not be able to write the usual report. I’ll leave you in the best hands possible. Submit something, and stop by in Zaragoza: you’ll find an exciting and stimulating crowd!

26 February 2015 ~ 0 Comments

Yet Another Proof that the Italian Government doesn’t Understand Research

[Warning: this post is not about my research, but it is a pure rant of whiny frustration. Readers are warned.]

As you might have guessed by my clumsy English, I’m not exactly a native speaker. In fact, I’m from Italy. Italian researchers seem to have quite a knack for propelling themselves out of their home country, more than your average researcher from a first-world country. This is because basic research is mostly publicly funded – in Italy the private sector has no patience for something that takes a lot of resources and time, and has a high failure chance – but historically the Italian government never understood that research, eventually, always pays off. For this reason, public research funding has been cut whenever there was the possibility, being judged as an expenditure without returns. Academic careers have been made harder, contracts for researchers shorter. It was with great hope, then, that last year we saw a change in trend. A new program, called “Scientific Independence of young Researchers” (SIR) was launched. And it looked great.


The total budget of SIR was almost 50 million euros. SIR welcomed projects from all disciplines. Each project could be awarded up to a million euros. The two-phase review process had been modeled on the best practices of the European community. Every Italian researcher currently working abroad would have been salivating just by looking at it. And just like 5000 others, I did too. Before March 13th 2014, the project submission deadline, I submitted my application. That’s when this train wreck started to happen, in a slow-motion that would fit the cheesiest Michael Bay action flick.

As you might have noticed, today is February 26th, 2015. In about two weeks, we’ll celebrate one year since the submission deadline. As of today, SIR hasn’t gotten back to me about my project. This alone is a piece of information that makes me and, presumably, all other applicants fume with rage. The reasons why this utter disaster is developing in front of us are even more absurd. I didn’t research them carefully, as they are only tangentially related to my point, but I’ll loosely translate from this article.


The sensual promise of a “European quality” peer review was based on the idea of asking the European Research Council (ERC) to help with the nomination of the review committees. SIR claims to have reached out to the ERC on February 25th, 2014, one month after they published their call for projects. On July 16th, it became clear that the ERC had declined to help, because this sort of thing was not part of their mission or budget (you can read the ERC reply to a concerned researcher who wrote to them in July 2014; interestingly enough, they claim to have warned SIR well before the call was published). Coincidentally, this is also the time at which SIR claims that the ERC got back to them. SIR appointed its own review committees on October 8th and started working on October 20th, as published on the SIR website*. So long to the “European quality” peer review process.

To sum up: SIR claims to be able to use ERC reviewers, while this is not true. They are so adamant about this claim that they published the call and checked it only a month later, even though the ERC had already told them that it was not possible. Next, SIR ignores the problem and goes into oblivious lethargy for 4 months. The sleepy not-so-beauty wakes up and it takes another 3 months to nominate the committees. The committees “work” for 4 months with no completed reviews in sight. And we are talking about the first phase of review. God only knows how the second phase will go.


Even when it tries, the Italian government does not understand how research works. I’ll spell it out for them. A researcher (hell, any sort of job-seeker) never sends out only one application. It wouldn’t be rational, as any single shot has very low chances of success. If an organization takes a year to reply, everybody who could go somewhere else did. Marie Curie grant applications were due on September 11th and final jury/committee decisions were sent out on February 4th. Less than 5 months and we got a final answer. Everybody who submitted both SIR and Marie Curie, and got a Marie Curie grant, will accept the latter. Everybody who submitted both SIR and anything else, and got into anything else, will accept that anything rather than the SIR offer, assuming for the sake of the argument that SIR will ever make one.

This tragicomedy will end, in the best-case scenario, with Italy handing over almost 50 million euros to C-rated projects, because if a project was any good it would have been accepted somewhere else before. In the worst-case scenario, this will be a contest in which everybody loses. Even worse, researchers for whom 2015 was the last year in which they fit the requirements to apply are effectively booted out of the contest.


I don’t want to make this a political issue. This post isn’t pro- or anti- government. My point is that, in Italy, your only hope to do primary research is to do so with public money. And if you, Italian government, are the only hope, at least be straightforward. We have a saying in Italy: si stava meglio quando si stava peggio (“We were better off when we were worse off”), which is pretty stupid, but here it applies. At least with the previous governments we researchers knew we were personae non gratae and we left. Now, instead, we are just part of a big commedia dell’arte. We are told we are important, we are welcome to come back home, but once we get in through the door they throw us out of the window. No wonder Italy is lagging behind in innovation.

* One of the things that make you feel like you’re being treated like a fool is that all news items on the SIR website regarding the review process overwrite the previous news item. So every time SIR promises that reviews will be done by day X, and day X comes, they just remove any trace of the promise and make a new promise (proof: screenshot on February 17th and on February 23rd). It doesn’t help either that the “News Archive” page doesn’t actually contain a news archive…

22 January 2015 ~ 0 Comments

Surprising Facts About Shortest Paths

Maybe it’s the new year, maybe it’s the fact that I haven’t published anything new recently, but today I wanted to take a look at my publication history. This, for a scientist, is something not unlike a time machine, bringing her back to an earlier age. What was I thinking in 2009? What sparked my interest and what were the tools further refined to get to the point where I am now? It’s usually a humbling (not to say embarrassing) operation, as past papers always look so awful – mine, at least. But small interesting bits can be found, like the one I retrieved today, about shortest paths in communication networks.

A shortest path in a network is the most efficient way to go from one node to another. You start from your origin and you choose to follow an edge to another node. Then you choose again an edge and so on until you get to your destination. When your choices are perfect and you used the minimum possible number of edges to follow, that’s a shortest path (it’s A shortest path and not THE shortest path because there might be alternative paths of the same length). Now, in creating this path, you obviously visited some nodes in between, unless your origin and destination are directly connected. It turns out that some nodes are crossed by a lot of shortest paths: it’s a characteristic of real world networks. This is interesting, so scientists decided to create a measure called betweenness centrality. For each node, betweenness centrality is the share of all possible shortest paths in the network that pass through it.
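For the curious, here is a minimal sketch of how betweenness can be computed on a small unweighted network. It uses Brandes' algorithm, which is one standard way to do it and not necessarily what the paper used. The toy networks and all function and variable names are made up for illustration. The function returns raw counts of node pairs; dividing by the total number of pairs gives the "share" mentioned above.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.
    adj maps each node to the set of its neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was visited twice, once per direction.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy rail network: three small stations and one hub.
rail = {"hub": {"A", "B", "C"}, "A": {"hub"}, "B": {"hub"}, "C": {"hub"}}
print(betweenness(rail))  # the hub scores 3.0, the other stations 0.0
```

When two nodes are joined by several shortest paths of the same length, each intermediate node gets a fraction of the credit, which is why the scores are not always whole numbers.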

Intuitively, these nodes are important. Think about a rail network, where the nodes are the train stations. High betweenness stations see a lot of trains passing through them. They are big and important to make connections run faster: if they didn’t exist every train would have to make detours and would take longer to bring you home. A good engineer would then craft rail networks in such a way as to have these hubs and make her passengers happy. However, it turns out that this intuitive rule is not universally applicable. For example some communication networks aren’t willing to let this happen. Michele Berlingerio, Fosca Giannotti and I stumbled upon this interesting result while working on a paper titled Mining the Temporal Dimension of the Information Propagation.


We built two communication networks. One is corporate-based: it’s the web of emails exchanged across the Enron employee ecosystem. The email record has been publicly released for the investigation into the company’s financial meltdown. An employee is connected to all the employees she emailed. The second is more horizontal in nature, with no work hierarchies. We took users from different email newsgroups and connected them if they sent a message to the same thread. It’s the nerdy version of commenting on the same status update on Facebook. Unlike most communication network papers, we didn’t stop there. Every edge still carries some temporal information, namely the moment in which the email was sent. Above you have an extract of the network for a particular subject, where we have the email timestamp next to each edge.

Here’s where the magic happens. With some data mining wizardry, we are able to tell the characteristic reaction times of different nodes in the network. We can divide these nodes into classes: high degree nodes, nodes inside a smaller community where everybody replies to everybody else and, yes, nodes with high betweenness centrality, our train station hubs. For every measure (characteristic), nodes are divided into five classes. Let’s consider betweenness. Class 1 contains all nodes which have betweenness 0, i.e. those through which no shortest path passes. From class 2 to 5 we have nodes of increasing betweenness. So, nodes in class 3 have a medium-low betweenness centrality and nodes in class 5 are the most central nodes in the network. At this point, we can plot the average reaction times for nodes belonging to different classes in the two networks.
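The class split can be sketched in a few lines. This is only an illustration, under the assumption that classes 2 to 5 are obtained by sorting the nodes with non-zero betweenness and cutting them into four groups of (nearly) equal size. The function names and the idea of storing reaction times in a dictionary are mine, not the paper's.

```python
from collections import defaultdict

def betweenness_classes(scores):
    """Map each node to a class from 1 to 5.
    Class 1: betweenness exactly 0. Classes 2-5: the remaining nodes,
    sorted by betweenness and cut into four groups of nearly equal size."""
    classes = {v: 1 for v, b in scores.items() if b == 0}
    ranked = sorted((v for v, b in scores.items() if b > 0), key=scores.get)
    for i, v in enumerate(ranked):
        classes[v] = 2 + (4 * i) // len(ranked)
    return classes

def mean_by_class(classes, reaction_time):
    """Average a per-node quantity (e.g. reaction time) within each class."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, c in classes.items():
        totals[c] += reaction_time[v]
        counts[c] += 1
    return {c: totals[c] / counts[c] for c in totals}
```

Feeding `mean_by_class` the classes and a dictionary of per-node reaction times gives the one number per class that ends up on the vertical axis of a plot like the ones discussed next.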


The first thing that jumps to the eye is that Enron’s communications (on the left) are much more dependent on the node’s characteristics (whether the characteristic is degree or betweenness doesn’t seem to matter) than the Newsgroups’ ones, given the higher spread. But the interesting bit, at least for me, comes when you only look at betweenness centrality – the dashed line with crosses. Nodes with low (class 2) and medium-low (class 3) betweenness centrality have low reaction times, while more central nodes have significantly higher reaction times. Note that the classes have the same number of nodes in them, so we are not looking at statistical aberrations*. This does not happen in Newsgroups, due to the different nature of the communication in there: corporate in Enron versus topic-driven in Newsgroups.

The result carries some counterintuitive implications. In a corporate communication network the shortest path is not the fastest. In other words, don’t let your train pass through the central hub for a shortcut, ’cause it’s going to stay there for a long long time. It looks like people’s brains are less elastic than our train stations. You can’t add more platforms and personnel to make more things pass through them: if your communication network has large hubs, they are going to work slower. Surprisingly, this does not hold for the degree (solid line): it doesn’t seem to matter how many people you interact with, only that you are the person through which many shortest paths pass.

I can see myself trying to bring this line of research back from the dead. This premature paper needs quite some sanity checks (understatement alert), but it can go a long way. It can be part of the manual on how to build an efficient communication culture in your organization. Don’t overload people. Don’t create over-important nodes in the network, because you can’t allow all your communications to pass through them. Do keep in mind that your team is not a clockwork, it’s a brain-work. And brains don’t work like clocks.

* That’s also the reason to ditch class 1: it contains outliers and it is not comparable in size to the other classes.


18 December 2014 ~ 0 Comments

The Supermarket is an Ecosystem

There are few things that you would consider less interesting than doing groceries at the supermarket. For some it’s a chore, others probably like it. But for sure you don’t see much of a meaning behind it. It’s not that you sense around you a grave atmosphere, the kind of mysterious background radiance you perceive when you feel part of Something Bigger. Just buy the bloody noodles already. Well, to some extent you are wrong.

Of course the reality is less mystical than what I tried to lead you to believe in this opening paragraph. But it turns out that customers of a supermarket chain behave as if they were playing a specific role. These roles are the focus of the paper I recently authored with Diego Pennacchioli, Salvatore Rinzivillo, Dino Pedreschi and Fosca Giannotti. It has been published in the journal EPJ Data Science, and you can read it for free.

So what are these roles? The title of the paper is very telling: the retail market is a complex system. So the first thing to clear up is what the heck a complex system is. This is not so easily explained – otherwise it wouldn’t be complex, duh. The precise physics definition of complex systems might be too sophisticated. For this post, it will be sufficient to use the following one: a complex system is a collection of interacting parts whose behavior cannot be expressed as a sum of the behaviors of its parts. A good example of complexity is Earth’s ecosystem: there are so many interacting animals and phenomena that having a perfect description of it by just listing all interactions is simply impossible.


And a supermarket is basically the same. In the paper we propose several proofs of it, but the one that goes best with the chosen example involves the esoteric word “nestedness”. When studying different ecosystems, some smart dudes decided to record their observations in matrix form. For each different island (ecosystem) they recorded if a particular species was present or not. When they looked at the resulting matrix they noticed a particular pattern. The islands with few species had only the species that were found in all islands, and at the same time the rarest species were present exclusively in those islands which were hosting all the observed species. If you reordered the islands by species count and the species by island count, the matrix had a particular triangular shape. They called matrices like that “nested”.
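The reordering trick is easy to try on a toy example. The sketch below is only an illustration: the three-island matrix is invented, and the strict subset check at the end is a deliberately simple test for a perfectly nested matrix, not one of the nestedness measures used in the ecology literature or in our paper.

```python
def sort_nested(matrix):
    """Reorder a 0/1 presence matrix (rows = islands, columns = species)
    so that the richest islands and the most widespread species come first."""
    row_order = sorted(range(len(matrix)), key=lambda r: -sum(matrix[r]))
    n_cols = len(matrix[0])
    col_order = sorted(range(n_cols), key=lambda c: -sum(row[c] for row in matrix))
    return [[matrix[r][c] for c in col_order] for r in row_order]

def is_perfectly_nested(sorted_matrix):
    """True if every row's species are a subset of those in the row above."""
    return all(
        all(lo <= hi for lo, hi in zip(lower, upper))
        for upper, lower in zip(sorted_matrix, sorted_matrix[1:])
    )

# Invented example: three islands, three species.
islands = [
    [1, 0, 0],  # poor island: only the species found everywhere
    [1, 1, 1],  # rich island: hosts everything, including the rarest species
    [1, 0, 1],
]
for row in sort_nested(islands):
    print(row)
# [1, 1, 1]
# [1, 1, 0]
# [1, 0, 0]
```

The ones pile up in the top-left corner, which is the triangular shape described above.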

We did the same thing with customers and products. There are customers who buy only a handful of products: milk, water, bread. And those products are the products that everybody buys. Then there are those customers who, over a year, buy basically everything you can see in a supermarket. And they are the only ones buying the least sold products. The customers X products matrix ends up looking exactly like an ecosystem nested matrix (you probably already saw it over a year ago on this blog – in fact, this work builds on the one I wrote about back then, but the matrix picture is much prettier, thanks to Diego Pennacchioli):


Since we have too many products and customers, this is a compressed view and the color tells you how many observations we have per pixel. One observation is simply a pairing of a customer and a product, indicating that the customer bought that product in significant quantities over a year. Ok, where does this bring us? First, as parts of a complex system, customers are not so easily classifiable. Marketing is all about finding uniformly behaving groups of people. The consequence of being complex parts is that this task is hopeless. You cannot really put people into bins. People are part of a continuous space, as shown in the picture, and every cut-off you propose is necessarily arbitrary.

The solution to this problem is represented by that black line you see on the matrix. That line is trying to divide the matrix in two parts: a part where we mostly have ones, and a part where we mostly have zeroes. The line does not match reality perfectly. It is a hyperbola that we told to fit itself as snugly to the data as possible. Once estimated, the function of the black line enables a neat application: to predict the next product a customer is interested in buying.

Remember that the matrix has its columns and rows sorted. The first customer is the one who bought the most products, the second bought slightly fewer products and so on with increasing ranks. Same thing with products: the highest ranked (1st) is sold to most customers, the lowest ranked is sold to just one customer. This means that if you have the black line formula and the rank of a customer, you can calculate the rank of a corresponding product. Given that the black line divides the ones from the zeros, this product is a zero that can most easily become a one or, in other words, the supermarket’s best bet of what product the customer is most likely to want to buy next. You do not need customer segmentation any more: since the matrix is and will always be nested you just have to fill it following the nested pattern, and the black line is your roadmap.
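Here is a rough sketch of that idea, with heavy caveats. The actual formula of the black line is in the paper and is not reproduced here: `toy_line` below is a made-up hyperbola with an arbitrary constant, and picking the not-yet-bought product whose rank is closest to the line is my own simplification of "the zero that can most easily become a one".

```python
def next_product(bought, customer_rank, n_products, boundary):
    """Suggest the rank of the next product for one customer.

    bought: set of product ranks (1 = best seller) the customer already buys.
    boundary: callable giving, for a customer rank, the product rank where
              the dividing line sits (hypothetical stand-in for the fit).
    Returns the unbought product rank closest to the line, or None."""
    target = boundary(customer_rank)
    unbought = [p for p in range(1, n_products + 1) if p not in bought]
    if not unbought:
        return None
    # Ties are broken in favour of the better-selling product.
    return min(unbought, key=lambda p: (abs(p - target), p))

def toy_line(customer_rank):
    """Made-up hyperbola; the constant 12 is arbitrary, not a fitted value."""
    return 12 / customer_rank

# Customer ranked 3rd who already buys the three best sellers.
print(next_product({1, 2, 3}, customer_rank=3, n_products=6, boundary=toy_line))  # 4
```

Note that nothing here groups customers into segments: the suggestion depends only on the customer's rank and on the shape of the line.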


We can use the ranks of the products for a description of customers’ needs. The highest ranked products are bought by everyone, so they are satisfying basic needs. We decided to depict this concept borrowing Maslow’s pyramid of needs. The one reported above is interesting, although it applies only to the supermarket area our data is coming from. In any case it is interesting how some things that are at the base of Maslow’s pyramid are at the top of ours, for example having a baby. You could argue that many people do not buy those products in a supermarket, but we address these concerns in the paper.

So next time you are pondering whether or not to buy that box of six donuts, remember: you are part of a gigantic system and the little weight you might gain is insignificant compared to the beautiful role you are playing. So go for it, eat the hell out of those bad boys.

13 November 2014 ~ 3 Comments

Average is Boring

You fire up a thesaurus online and you look for synonyms of the word “interesting”. You can find words like “unusual”, “exotic”, “striking”. These are all antonyms of “average”. Average is the grey uniform shirt of the post office employee calling out the number of the next person in the queue, or the government-approved video that teaches you how to properly wash your hands. Of course “average is boring”. Why should we be interested in the average? I am. Because if we understand the average we understand how to avoid it. We can rekindle our interest in lost subjects, each in its own unique way. Even washing your hands. We can live in the tail of the distribution, instead of on top of the bell.


My quest for destroying the Average is a follow-up of my earlier paper on memes. Its subtitle is “How similarity kills a meme’s success” and it has been published in Scientific Reports. We are after the confirmation that the successful memes are unique, weird, unexpected. They escape from the blob of your average meme like a spring snake in a can. The starting point of every mission is to know your enemy. It hides itself in internet image memes, those images you can find everywhere on the Web with a usually funny text on top of them, just like this one.

I lined up a collection of these memes, downloaded from Memegenerator.net, and I started examining them, like a full-metal-jacket drill instructor. I demanded that they reveal everything about each other. I started with their name, the string of text associated with them, like “Socially Awkward Penguin” or “Bad Luck Brian”. I noted these strings down and compared their similarity, just like Google does when it suggests “Did you mean…?”. This was already enough to know who is related to whom (I’m looking at you, band of penguins).
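To give a flavor of what comparing names means, here is a tiny sketch. It uses the ratio from Python's standard `difflib` module as a stand-in similarity score, which is an assumption made for illustration and not necessarily the string measure used in the paper. “Socially Awesome Penguin” is just an example name chosen for the comparison.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity between two meme names, from 0 (nothing shared) to 1 (identical).
    Case is ignored so that capitalization does not affect the score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Socially Awkward Penguin", "Socially Awesome Penguin", "Bad Luck Brian"]
for other in names[1:]:
    print(other, round(name_similarity(names[0], other), 2))
```

The two penguins score far higher with each other than either does with Bad Luck Brian, which is all it takes to spot the family resemblance.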

Then it was time to examine what they look like. All of them gave me their best template picture and I ran it through the electronic eye of SURF, an amazing computer vision algorithm able to detect image features. Again, I patiently noted down who looked like whom. Finally, I asked them to tell me everything about their history. I collected anything that was ever said on Memegenerator.net, meaning all the texts that the users wrote when creating an instance belonging to each meme. For example, the creation of this picture:


results in associating “If guns don’t … toast toast toast?” with the Philosoraptor meme. I condensed all this text into a given number of topics and exposed which of the memes are talking about the same things. At this point, I had all I needed to know about who is average and who could spark our interest. It’s an even more nerdy version of Hot or Not. So I created a network of memes, connecting two memes if they are similar to each other. I enlarged and highlighted in orange the memes that are widely used and popular. I won’t keep you on your toes any longer: here is the result.


I knew it! The big, orange nodes are the cool guys. And they avoid mingling in the center of the neighborhood. They stay on the periphery, they want to be special, and they are. This conclusion is supported by all kinds of robustness checks, but I’m not going to report them because it’s hard enough for me to keep you awake while you have to read through all this boring stuff. “Ok”, you now think, “You proved what we already knew. Good job. What was this for?”.

This result is not as expected as you might think. Let it settle down in your brain for a second: I am saying that given your name, your image template and your topic I can tell you if you are likely to be successful or not. Plenty of smart people have a proof in their hand saying that a meme’s content isn’t necessary to explain why some memes are successful and some are less memorable than your average Congress hearing. They have plenty of good reasons to say that. In fact, you will never hear me reciting guru-like advice to reach success like “be different”. That’s just bollocks.

Instead of selling the popularity snake oil, I am describing what the path to success looks like. The works I cited do not do that. Some describe how the system works. It’s a bit like telling you that, given how the feudal system worked in the Middle Ages, some people had to be emperors. It doesn’t say so much about what characteristics the emperors had. Others tell you how good an emperor already on the throne could be. But not so much about how he got to sit on that fancy chair wearing that silly hat. By looking at the content in a different way, and by posing different questions, I started writing emperors’ biographies and I noticed that they all have something in common. At the very least, I am the court jester.

We are not enemies and we are not contradicting each other. We are examining the same, big and complex ecosystem of silly-pictures-on-the-internet with different spectacles. We all want to see if we can describe human cultural production as a concrete thing following understandable laws. If you want to send a rocket to the moon, you need to know how and why a ball falls back to the ground when you throw it up. Tedious, yes, but fundamental. Now, if you excuse me, I have a lot of balls to throw.