Michele Coscia - Connecting Humanities

Michele Coscia I am an associate prof at IT University of Copenhagen. I mainly work on algorithms for the analysis of complex networks, and on applying the extracted knowledge to a variety of problems. My background is in Digital Humanities, i.e. the connection between the unstructured knowledge and the coldness of computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In the past, I visited Barabasi's CCNR at Northeastern University, and worked for 6 years at CID, Harvard University.

27 October 2021 ~ 1 Comment

Pearson Correlations for Networks

We all know that correlation doesn’t imply causation:

And yet, we calculate correlations all the time. Because knowing when two things correlate is still pretty darn useful. Even if there is no causation link at all. For instance, it’d be great to know whether reading makes you love reading more. Part of the answer could start by correlating the number of books you read with the number of books you want to read.

The very important questions the Pearson correlation coefficient allows you to ask: will consuming cheese bring upon you the doom of dying by suffocating in your bedsheets? source: https://www.tylervigen.com/spurious-correlations

As a network scientist, you might think that you could calculate correlations of variables attached to the nodes of your network. Unfortunately, you cannot do this, because normal correlation measures assume that nodes do not influence each other — the measures are basically assuming the network doesn’t exist. Well, you couldn’t, until I decided to make a correlation coefficient that works on networks. I describe it in the paper “Pearson Correlations on Complex Networks,” which appeared in the Journal of Complex Networks earlier this week.

The formula you normally use to calculate the correlation between two variables is the Pearson correlation coefficient. What I realized is that this formula is the special case of a more general formula that can be applied to networks.

In Pearson, you compare two vectors, which are just two sequences of numbers. One could be the all the numbers of books that the people in our sample have read, and the other one is all of their ages. In the example, you might expect that older people had more time to read more books. To do so, you check each entry in the two vectors in order: every time you consider a new person, if their age is higher than the average person’s, then also the number of books they read should be higher.

If you are in a network, each entry of these vectors is the value of a node. In our book-reading case, you might have a social network: for each person you know who their friends are. Now you shouldn’t look at each person in isolation, because the numbers of books and the ages of people also correlate in different parts of the network — this is known as homophily. Some older people might be pressured into reading more books by their book-addicted older friends. Thus, leaving out the network might cause us to miss something: that a person’s age tells us not just about the number of books they have read, but it also allows us to predict the number of books their friends have read.

This is the type of networks you are forced to work with when you use the Pearson correlation. That’s just silly, isn’t it?

To put it simply, the classical Pearson correlation coefficient assumes that there is a very special network behind the data: a network in which each node is isolated and only connects to itself — see the image above. When we slightly modify the math behind its formula, it can take into account how close two nodes are in the network — for instance, by calculating their shortest path length.

You can interpret the results from this network correlation coefficient the same way you do with the Pearson one. The maximum value of +1 means that there is a perfect positive relation: for every extra year of age you read a certain amount of new books. The minimum of -1 means that there is a perfect negative relationship: a weird world where the oldest people have not read much. The midpoint of 0 means that the two variables have no relation at all.

Is the network correlation coefficient useful? Two answers. First: how dare you, asking me if the stuff I do has any practical application. The nerve of some people. Second: Yes! To begin with, in the paper I build a bunch of artificial cases in which I show how the Pearson coefficient would miss correlations that are actually present in a network. But you’re not here for synthetic data: you’re a data science connoisseur, you want the real deal, actual real world data. Above you can see a line chart, showing the vanilla Pearson (in blue) and the network-flavored (in red) correlations for a social network of book readers as they evolve over time.

The data comes from Anobii, a social network for bibliophiles. The plot is a correlation between number of books read and number of books in the wishlist of a user. These two variables are positively correlated: the more you have read, the more you want to read. However, the Pearson correlation coefficient greatly underestimates the true correlation, at 0.25, while the network correlation is beyond 0.6. This is because bookworms like each other and connect with each other, thus the number of books you have read also correlates with the wishlist size of your friends.

This other beauty of a plot, instead, shows the correlation between the age of a user and the number of tags they used to tag books. What is interesting here is that, for Pearson, there practically isn’t a correlation: the coefficient is almost zero and not statistically significant. Instead, when we consider the network, there is a strong and significant negative correlation at around -0.11. Older users are less inclined to tag the books they read — it’s just a fad kids do these days –, and they are even less inclined if their older friends do not tag either. If you were to hypothesize a link between age and tag activity and all you had was lousy Pearson, you’d miss this relationship. Luckily, you know good ol’ Michele.

If this makes you want to mess around with network correlations, you can do it because all the code I wrote for the paper is open and free to use. Don’t forget to like and subscrib… I mean, cite my paper if you are fed up with the Pearson correlation coefficient and you find it useful to estimate network correlations properly.

08 June 2021 ~ 0 Comments

Program for Networks in Economics Satellite @ Networks21 Conference

This year, I’ll be organizing a satellite event at the Networks21 conference, the major 2021 event bringing together all network scientists — merging the Sunbelt and the NetSci crowds for the first time! The satellite will be about network applications on research about economic development and innovation. The most excellent Morgan Frank and Lingfei Wu are the other engines behind this project. I briefly introduced it in this earlier post.

I’m glad to report that now we have a final roster of participants. We received several abstracts for the contributed talks and we managed to squeeze eight of them in our program. What follows is the current schedule — minor changes might happen due to author constraints and I will try to keep this post up to date. Note that the satellite will happen Wednesday, June 30, 2021, and all times reported are US East coast time.

  • 8:30AM Invited I: Lü Linyuan
  • 9:10AM Invited II: R. Maria del Rio-Chanona
  • 9:50AM Contributed I: Process-driven network analysis of a mobile money system in Asia. Carolina Mattsson and Frank Takes.
  • 10:10AM Contributed II: Discovering industries in networks of words. Juan Mateos-Garcia, Bishop Alex and Richardson George.
  • 10:30AM Break
  • 10:50AM Invited III: Yong-Yeol “YY” Ahn
  • 11:30AM Contributed III: From code to market: Network of developers and correlated returns of cryptocurrencies. Lorenzo Lucchini, Laura Maria Alessandretti, Bruno Lepri, Angela Gallo and Andrea Baronchelli.
  • 11:50AM Contributed IV: What is a Labor Market? Classifying Workers and Jobs Using Network Theory. James Fogel and Bernardo Modenesi.
  • 12:10PM Invited IV: Hyejin Youn
  • 12:50PM Lunch Break
  • 1:30PM Invited V: Esteban Moro
  • 2:10PM Invited VI: Marta Gonzalez
  • 2:50PM Contributed V: How to Govern Facebook. Seth Benzell and Avinash Collis.
  • 3:10PM Contributed VI: The network limits of infectious disease control via occupation-based targeting. Demetris Avraam, Nick Obradovich, Niccolò Pescetelli, Manuel Cebrian and Alex Rutherford.
  • 3:30PM Break
  • 3:50PM Invited VII: Daniel Rock
  • 4:30PM Contributed VII: Measuring Fraudulent Transactions On Complex Economic Networks Using Optimality Gap. Danilo Bernardineli and Wenjia Hu.
  • 4:50PM Contributed VIII: Local connections drive global structure for technological innovation. Dion O’Neale, Demival Vasques Filho and Shaun Hendy.
  • 5:10PM Invited VIII: Jiang Zhang

If you want to attend, you need to register to the Networks21 conference (deadline on June 20th) and then you will receive a Zoom link to the event.

I don’t know about you, but I’m very excited to see all of this! The official page of the satellite has more information for you.

15 April 2021 ~ 0 Comments

A Bayesian Framework for Online Social Network Sampling

Analyzing an online social network is hard because there’s no “download” button to get the data. In the best case scenario, you have to query an API (Application Program Interface) system. You have to tell the API which user you want to see the connections of, and you can’t just ask for all users at the same time. With the current API rate limits, downloading the Twitter graph would take seven centuries, as the picture below explains.

The gray circle represents the totality of Twitter’s users. The red one is what you’d get with a continuous crawl of one year.

I don’t know about you, but I don’t have seven centuries to publish my papers. I want to become a famous scientist now. If you’re like me, you’re forced to extract only a sample. In this post, I’ll describe a network sampling algorithm that works with social media APIs to maximize the amount of information you get from your sample. The method has been published in the paper “Noise Corrected Sampling of Online Social Networks“, which recently appeared in the journal Transactions on Knowledge Discovery from Data.

My Noise Corrected sampling (NC) is a topological method. This means that you start from one (or more) user IDs that are known, and then you gather new user IDs from their friends, then you move on to the friends’ friends and so on, following the connections in the graph you’re sampling. This is the usual approach, because there is no way to know the user IDs beforehand — unless you have inside knowledge about how the social media platform works.

The underlying common question behind a topological sampler is: how do we pick the best user to explore next? Different samplers answer this question differently. A Random Walk sampler says: “Screw it! Thinking is too hard! I’m gonna get a random node from the friends of the currently explored node! Then, with the time I saved, I’m gonna get some ice cream.” The Metropolis-Hastings Random Walk — the nerd’s answer to the jock’s Random Walk — wants to preserve the degree distribution, so it penalizes high-degree nodes — since they’re more likely to be oversampled. Alternatively, we could use classical graph exploration techniques such as Depth-First and Breadth-First Search — and, again, you can enhance them so that you can preserve some properties of the network, for instance its clustering coefficient.

A random walk on a graph. It looks like a drunk going around. I don’t drink and sample, and neither should you.

My NC departs from all these approaches in a fundamental way. The thing they all have in common is that they don’t use information from the sample they have currently gathered. They only look at the potential next users and make a decision on whom to explore based on their characteristics alone. For instance Metropolis-Hastings Random Walk checks the user’s popularity in number of connections. NC, instead, looks at the entire collected sample and asks a straightforward question: “which of these next users is most likely to give me the most information about the whole structure, given what I explored so far?”

It does so with a Bayesian framework. The gory math is in the paper, but the underlying philosophy is that you can calculate a z-score for each edge of the network. It tells you how surprising it is that this edge is there. In probability theory, “surprise” is a good thing: it is directly related to how much information something gives you. Probability theorists are a bunch of jolly folks, always happy to discover jacks-in-a-box. If you always pick the most surprising edge in the network to explore the next user, you’re maximizing the amount of information you’re going to get back. The picture below unravels what this entails on a simple example.

In this example, the green nodes are explored, the blue nodes are known — because they’re friends of an explored node — but not explored yet. The red nodes are unknown.

The first figure (a) is the start of the process, where we have a network and we pick a random starting point, node 7. I use the edge thickness to show its z-score: we always follow the widest edge. Initially, you only pick a random edge, because all edges have the same z-score in a star. However, once I pick node 3, all z-scores change as we discover new edges (b). In this case, first we prioritize the star around node 7 (c), then we start following the path from node 3 until node 9 (d-e), then we explore the star around node 9 until there are no more nodes nor edges to discover in this toy network (f). Note how edges exhaust their surprise as you learn more about the neighborhoods of the nodes they connect.

Our toy example is an unweighted network, where all edges have the same weight, but this method thrives if you have information about how strong a connection is. The edge weight is additional information that can be taken into account to measure its surprise value. Under the hood, NC estimates the expected weight of an edge given the average edge weights of the two nodes it connects. In an unweighted network this is equivalent to the degree, because all edges have the same weight. But in a weighted network you actually have meaningful differences between nodes that tend to have strong/weak connections.

So, the million dollar question: how does NC fare when tested against the alternatives I presented before? The first worry you might have is that it takes too much time to calculate the z-scores. After all, to do so you need to look at the entire sampled network so far, while the other methods only look at a local neighborhood. NC is indeed slower than them—it has to be. But it doesn’t matter. The figure above shows how much time it takes to calculate the z-scores for more and more edges. Those runtimes (less than a tenth of a second) are puny when compared to the multi-second wait times due to the APIs’ rate limits — or even to simple network latency. After all, NC is a simplification of the Noise-Corrected Backbone algorithm I developed with Frank Neffke, and we showed in the original paper that it was pretty darn fast.

What about the quality of the samples? This is a question with a million answers, because it depends on what you want to do with the sample, besides what are the characteristics of the original graph and of the API system you’re wrestling against, and how much time you have to gather your sample. For instance, in the figure below we see that NC is the best method when you want to estimate the assortativity value of a disassortative attribute (think gender in a dating network), but it is a boring middle-of-the-pack method when you want to reconstruct the original communities of a network.

Left: the Mean Absolute Error between the sample’s and the original graph’s assortativity values (lower is better). Right: the Normalized Mutual Information between the sample’s and the original graph’s communities (higher is better). NC in red.

NC ranks between the 3rd and 4th position on average across a wide range of applications, network topologies, API systems, and budget levels. This is good because I test eight alternatives, thus the expectation is that NC will practically always be better than average. Among all eight methods tested, in fact, no other method has a better rank average. The second best method is Depth-First Search, which ranks 4th three out of four times, against NC’s one out of two.

I have shared some code allowing to replicate my results. If you speak Python better than you speak academiquese, you could use that code to infer how to implement an NC sampling strategy next time you need to download Twitter.

15 March 2021 ~ 0 Comments

Networks in Economics Satellite @ Networks21 Conference

This year’s NetSci conference will be special. For the first time it will be held jointly with the other major network event of the year: Sunbelt, or the main network conference for social sciences. I could not miss an opportunity like this, and so I decided to organize a satellite event with the excellent Morgan Frank and Lingfei Wu. The topic of the satellite will be network applications on research about economic development and innovation.

We’re looking for contributors to send an abstract about their work in the area. If you’re unsure about what area that is, think about my research on the Product Space, or on the impact of business travel on economic growth, or economic convergence in Colombia, etc. Specifically, if you are interested in issues like:

  • Mapping the relationship of complex economic activities at the global, regional, and local level;
  • Tracking flows of knowhow in all its forms;
  • Estimating the relatedness of tasks and skills to estimate knockoff effects and productivity gains of automation;
  • Investigating the dynamics of research and innovation via analysis of patents, inventions, and science;
  • Uncovering scaling laws and other growth trends able to describe the systemic increase in complexity of activities due to agglomeration;

and you study them using networks and the tools of the science of complex systems, then you really should send us your abstract. The submission link is: https://easychair.org/my/conference?conf=cnei21. You should send a one-page one-figure abstract before May 5th, 2021.

We have a fantastic lineup of invited speakers you’ll mingle with:

The event will be held online on Zoom. The exact date is still to be determined, but it will be between June 21st and July 3rd. So stay tuned for updates! You should bookmark the official website of the satellite, to get fresh news about it: https://mrfrank8176.github.io/Complex-Networks-in-Economics-and-Innovation/

I hope to see your abstract in my inbox and then you presenting at the satellite!

11 February 2021 ~ 2 Comments

PhD Call: Combat Disinformation in Complex Social Systems

The IT University of Copenhagen has put out a call searching for people interested in starting a PhD in computer science in Fall 2021. One of the projects you could work on is my project on disinformation and social networks. You should apply if you think you: might be interested in pursuing a PhD related to misinformation and social media; like the idea of living in the happiest country in the world; and are ok with having a clueless supervisor like me. Link to apply.

The easiest way to figure out what sort of things you’d be doing is by reading my post on the topic of the PhD project. To sum it up briefly: in the first paper of this project, Luca Rossi and I made an agent-based model, simulating the way social media handle misinformation via flagging. We show that the system is counterproductive: its mechanics end up penalizing mostly popular and neutral news sources. Luca will be your co-supervisor, so you’d be working with a real social scientist who has an actual clue about what’s going on.

This project is born out of the idea to make that paper into an actual research program. There are a few paths that are all worth exploring and require a person giving them their full time attention. A few ideas:

  • Expand the agent-based model to make it more realistic and to enable it to answer more interesting questions (what if users and sources change their polarity? What if one side is less tolerant than the other? What if tolerance changes over time? What if…?);
  • Find real-world data and use it to validate, motivate, or find ideas on how to expand the model and its results;
  • Design intervention strategies for social media platforms that are less counterproductive than the ones currently deployed.

The position is fully funded by ITU, so it’s a pretty safe path for 3 to 4 years. The group where I work, NERDS, is full of amazing professionals doing extremely cool research, publishing in Nature, collaborating with UNICEF, and the like. We’re fun, really.

Link to apply, where you should specify that you want to work on the “Modelling Complex Social Systems to Handle Disinformation” project. I hope you’ll give it a thought!

09 February 2021 ~ 0 Comments

The Atlas for the Aspiring Network Scientist v1.1

Last month I put out in public my Atlas for the Aspiring Network Scientist. The reaction to it was very pleasant and people contacted me with a number of corrections, opinions, and comments. I just uploaded to arXiV a version 1.1 of it, doing my best to address whatever could be addressed quickly. The PDF on the official website is also updated and, in fact, that link will always direct you to the most up-to-date version.

Corrections involve mostly some notation, a few references, and the like. One important thing I want to point out is my rephrasing/retraction of some humorous parts. I still stand by my decision of using humor, but not when it comes at the expense of the feeling of inclusiveness in the community. Science is a social process and everyone should feel welcome to it. Using language that opposes that aim is a net loss for society. One example is in the chapter about tools, where some ruvid humor didn’t paint the correct picture: these open source tools are fantastic gifts to the community and should be unequivocally celebrated. All remaining jokes are about the self-deprecation I feel every day from my inability to measure up to the awesome fellows behind these libraries/software.

One thing that was flagged to me but I couldn’t touch was references. There are just too many for me to check them all. I’m asking your help: if you find some issue with references (missing information, or things like editors as authors, etc), please write me flagging the specific reference with the issue: mcos@itu.dk.

I’m also glad to announce that you can buy a physical copy of the book, in case you need it handy for whatever reason. This is only v1, though, so all corrections mentioned above are not included. When v2 will come out, I’ll also make that available for physical purchase. The book was printed via IngramSpark, thus there’s a good chance you can find it for sale & shipping almost everywhere. For instance, it is available on Amazon or, if you’re in Denmark (where I live), on Saxo. You could even buy it on friggin’ Walmart.

The final object’s quality is… eh. Some of it is by design: I wanted this to be as accessible as possible. You’ll hardly find another 650+ color pages book in US Letter format for less than $40. Compromises needed to be made. However, most of the things making it a clearly amateurish product are my fault. Take a look at the left margin in the back cover:

Eww… Also, since I had to upload the cover separately, I didn’t remember to include a blank page. So the left pages are on the right and vice versa. Which makes page numbers practically invisible in the middle:

That said, if you ignore everything that makes this book ugly, it’s actually pretty nice:

Also, apologies to your backs, but this thing is hyuuuge. It’s as tall as a half-kilo pack of bucatini and twice as thick (packs of bucatini are a standard unit of measure in Italy):

Finally, I would love to give a shout out to everyone I interacted with after the book came out. Everyone was super nice and/or super helpful, most were both. I discovered many things I wasn’t aware of. One of them is NETfrix, a network science podcast by super cool fellow Asaf Shapira. The podcast has transcripts in English available here.

That’s it for now! Hopefully new research posts will follow soon.

06 January 2021 ~ 1 Comment

The Atlas for the Aspiring Network Scientist

In the past two years, I’ve been working on a textbook for the Network Analysis class I teach at ITU. I’m glad to say that the book is now of passable enough quality to be considered in version 1.0 and so I’m putting it out for anyone to read for free. It appeared on arXiV yesterday. It is available for download on its official website, which contains the solutions to the exercises in the book. Ladies and gentlemen, I present you The Atlas for the Aspiring Network Scientist.

As you might know, there are dozens of awesome network science books. I cannot link them all here, but they are cited in my atlas’ introduction. So why do we need a new one? To explain why the atlas is special, the best way is to talk about the defects of the book, rather than its strengths.

The first distinctive characteristic is that it aims at being broad, not deep. As the title suggests, I wanted this to be an atlas. An atlas is a pointer to the things you need to know, rather than a deep explanation of those things. In the book, I never get tired of pointing out the resources you need to actually understand the nitty-gritty details. When you stumble on a chapter on something you’re familiar with, you’ll probably have the feeling that you know so much more than me — which is true. However, that’s the price to pay if I want to include topics from the Hitting Time Matrix to the Kronecker graph model, from network measurement error to graph embedding techniques. No book I know includes all of these concepts.

The second issue derives from the first: this is a profoundly personal journey of eleven years through network science. No one can, in such a short time, master all the topics I include. Thus there’s an uneven balance: some methods are explained in detail because they’re part of my everyday work. And others are far from my area of expertise. Rather than hiding such a defect, the book wears it on its sleeve. I prefer to include everything I can even if I’m not an expert on it, because the first priority is to let people know that something exists. If I were to wait until I was an expert R programmer before advising you to use iGraph, the book would not exist. If I were to leave out iGraph because I’m not good at it, it would make the book weaker — and give the impression of dishonesty, like the classic Pythonist who ignores R because “it’s the opposing team”.

Finally, the book reads more like a post on this blog than an academic textbook. I use a colorful style and plenty of humor. This is partially as a result of the second point, since the humor is mostly self-deprecating about my limits — for instance, the stabs I take at R are intended as light-hearted jest. In general, I want to avoid being excessively dry and have the readers fall asleep at page 20. This is a risky move, because humor is subjective and heavily culture-dependent. People have been and will be put off by this. If you think I cross the line somewhere in the book, feel free to point that out and ask me to consider your concerns. If, instead, you think that humor in general has no place in academia, then I disagree, but there are plenty alternatives, so you can safely ignore my book.

Given all of the above, it is no surprise that the atlas is imperfect and many things need to be fixed. Trust me that the first draft was significantly worse in all respects. The credit for catching my mistakes goes to my peer reviewers. Every one of their comments was awesome, and every one of the remaining mistakes are only my fault for being unable to address the issues properly. Chief among the reviewers was Aaron Clauset, who read (almost) the entire thing. The others* still donated their time and expertise for free, some of them only asked me to highlight worthwhile charities such as TechWomen and Evidence Action in return.

Given all the errors that remain, consider this a v1.0 of a continuous effort. There are many things to improve: language, concepts, references, figures. Please contact me with any comments. The PDF on the website will reflect changes as soon as is humanely possible. Before I put v1.1 on arXiV, I’ll wait to have a critical mass of changes — I expect to have it maybe for mid to late February.

I also plan to have interactive figures on the website in the future. Version 1.0 was all financed using my research money and time. For the future, I will need some support to do this in my free time. If you feel like encouraging this effort, you can consider becoming a patron on Patreon. A print-on-demand version will be available soon (link will follow), so you could also consider ordering a physical copy — I’ll make 70 juicy cents of profit for every unit sold, because I’m a seasoned capitalist who really knows how to get his money’s worth for two years of labor.

I poured my heart in this. I really hope you’ll enjoy it.


* Special thanks go to Andres Gomez-Lievano. The other peer reviewers are, in alphabetical order: Alexey Medvedev, Andrea Tagarelli, Charlie Brummitt, Ciro Cattuto, Clara Vandeweerdt, Fred Morstatter, Giulio Rossetti, Gourab Ghoshal, Isabel Meirelles, Laura Alessandretti, Luca Rossi, Mariano Beguerisse, Marta Sales-Pardo, Matte Hartog, Petter Holme, Renaud Lambiotte, Roberta Sinatra, Yong-Yeol Ahn, and Yu-Ru Lin.

09 December 2020 ~ 0 Comments

Speed-Check your Diseases on a Social Network

Back in March I wrote a blog post — and a paper — showing a technique to estimate the distance covered by a propagation event on a network between two moments in time. A propagation event could be the failure of a power grid, a word-of-mouth campaign on social media, or — more topically these days — a disease infecting people in a social network. The limitation of that paper was in taking only a single perspective. However, this problem could be solved in at least a dozen different ways. To give justice to such complexity, I recently co-authored the paper “The Node Vector Distance Problem in Complex Networks” with Andres Gomez, James McNerney, and Frank Neffke. The paper was published this month in the ACM Computing Surveys journal and it’s the main focus of this blog post.

Estimating the spreading speed of something in our normal geographical space is important, but relatively trivial. However, networks are complex spaces. You cannot estimate the speed of COVID by looking at the geographical areas it has covered, because what really connects places is not our physical space, but a complex network of relations among regions. In other words, the places closest to China are not necessarily countries like Mongolia or Nepal — both of which share a border with China — but Iran and Italy, because of the many direct flights connecting them.

My paper in March found a way to transform our notion of Euclidean distance — a straight line in physical space — to networks. It basically defined what a “straight line” means when all you have is nodes and edges. In the figure above, I connect countries if they have a significant number of direct flights between their airports. Darker nodes represent countries that were hit earlier, and nodes get progressively lighter the later they were first hit. My generalized Euclidean measure allows you to calculate a number describing how fast this process went. This means you could compare it with other pandemics, or you could use it to estimate the moment when a still-developing pandemic will cover a given fraction of the world.

Was mine the only way to translate the concept of “straight line”? No. For starters, it uses an indirect metaphor to define “straightness”. In my generalized Euclidean, every node is a “dimension” of a multidimensional space and, when COVID infects it, it means that the virus had traveled a certain amount of distance in that dimension. If you’re staring dumbfounded at the previous sentence, yeah, that’s pretty much what I expected. A more intuitive way of defining the distance covered in a network would be simply to count how many edges the disease crossed via the calculation of shortest paths.

However, it’s still not that easy: how far is each newly infected node from the set of previous infected nodes? And how do we combine all those path lengths into one new measure? In the paper, we explain various ways to do so. One option is to apply linking strategies from hierarchical clustering, as I show in the figure above. The distance between the group of red points from the blue points can be the distance of their closest pair — green line, called single linkage –; the distance of their centers of mass — orange line, average linkage –; or the distance between their farthest pair — purple line, complete linkage. Another option is to simulate an agent optimizing the movement of “packets” from the nodes in the origin to the nodes in the destination — the popular Earth Mover’s Distance measure.

And that still doesn’t cover the space of possibilities! Even in our simple geographical world, we can have different perspectives on what “distance” means. For instance, a popular distance measure is cosine distance. In it, it doesn’t matter how far two points are in the space: if they are on the same line from your point of view, you consider them close together. Now, “distance” becomes the angle by which you have to turn to go from looking straight at one point, to looking at the other. In the figure below, the Euclidean (straight line) distance between two red points is in blue. The cosine distance is the thick green line, showing in the point pair below that it could be quite small even for points that Euclid would say are very far away in the plane. Many measures adopt this distance philosophy for networks, for instance the Mean Markov Chain and the Graph Fourier Transform approaches.

That is more or less where we stop in the paper. I think that there are plenty more network distance measures to be discovered, but the work still needs to be done.

As a companion for the paper, I have developed an open source Python library implementing the majority of network distance measures that we discuss. You can use it to calculate network distances in your data, or to better understand how these measures work. Hopefully, you’ll make my job harder by discovering new measures and forcing me to publish an updated paper on the topic.

16 October 2020 ~ 0 Comments

Ruling your Network

When you’re studying complex systems, one of the most important questions you might have is: how will this system evolve in the future? If you’re modeling your system as a network — as I like to do in my spare time — this boils down to predicting the arrival of new nodes and links. This is the realm of link prediction. In this post, I’ll describe one advancement in the field that I developed with fellow NERD Michael Szell in the paper “Multiplex Graph Association Rules for Link Prediction“, to appear next year at the ICWSM conference.

A graph evolving: the green nodes and non-gray links are added over time.

There are many ways to predict new links in a network, but most of these methods have a disadvantage: they can only give you a score for potential future connections between two nodes that are already in the network when you observe it. In other words, they cannot predict new incoming nodes. But with a technique called “graph association rules”, used by the GERM algorithm published in 2009, we can predict new nodes. How is that possible?

In simple terms, a “graph association rule” is a rule that tells you: every time you see in your network a pattern A, it will turn into pattern B, with a certain degree of confidence. The rule is extracted by counting how many times patterns A and B appear. For instance, in the image below, if pattern A (the triangle) appears 9 times and pattern B (the triangle with a dangling node) appears 6 times, the confidence of the rule is 2/3. 66% of the time, a triangle has attracted a dangling node. Note that pattern B must include pattern A, otherwise it’s difficult to hypothesize that A evolved into B.

GERM has a problem of its own, which Michael and I set out to solve: it can predict incoming nodes and links, but it cannot distinguish between different link types. In other words, every predicted link is the same to GERM. However, many real world networks have link types: nodes can connect in different ways. For instance, on social media, you connect to the people you know in different ways: via Facebook, Twitter, Linkedin, etc.

You’d model such system with a multiplex network, which allows for link types. If you have a multiplex network, you need multiplex graph association rules for link prediction. Which is exactly the title of our paper! What a crazy coincidence!

In the paper we re-purpose Moss, a graph pattern miner that can extract multiplex patterns, to build such rules. We created a pre- and post-processor of Moss that can construct the rules based on the patterns it finds. Now we can give colors to the links that are featured in our rules, as the figure below shows. This is a generalization of the signed link predictor I already wrote about a long time ago (the second ever post on this blog. I feel old).

Doing so isn’t painless though. We made sacrifices. For instance, our rule extractor doesn’t really understand the passage of time. It knows that the input network is in the past and spits out the rules to predict its evolution, but it doesn’t know how long a rule will take to complete. Unlike GERM, which can tell you that a rule will take n timesteps to complete.

This downside is minor though. Our link predictor performs well, as witnessed by the ROC curves below (our method in red). The comparisons are other multiplex link predictors. Not only are they worse at predicting links, but they have the added disadvantage of being unable to predict the arrival of new nodes. They also have issues with memory consumption, because they generate a score for each pair of nodes that is not connected in the training data — which, for sparse networks, is a lot of scores. Our predictor, instead, only gives scores to the links that are valid consequences of the rules that we found, usually way fewer than all unconnected node pairs.

If you want to play with our link predictor, you can do so by downloading the code I made public for the replication of the paper’s results. The code is very academic — meaning: badly written, unreasonably fragile, and horribly inefficient. I have in the works an extension with more efficient and robust code, and a generalization from multiplex to fully multilayer networks. Stay tuned!

07 September 2020 ~ 0 Comments

Media, Special Issues, & Personal Growth

A quick and unusual post for this blog to keep you up to date with what happened in my summer.

First, I’m happy to report that my paper on the effect of business travels on economic growth — which I describe here — is generating a fair amount of buzz. You can read the excellent summary written by Ricardo on Project Syndicate, or my own “Behind the paper” version. If podcasts are your thing, we got you covered. And I’ll present it at the Copenhagen Fintech on September 16th.

Second, I’m teaming up with the most excellent prof. Morgan Frank from the University of Pittsburgh to edit a special issue for the journal Frontiers in Big Data Networks. We called the special issue “Complex Networks and Economics” and we intend it to be a safe haven for all of you advancing our understanding of the complex systems that compose our global, regional, and local economy. You can read more at the official page of the journal (linked above), or in this post I wrote for my NERDS research group. Consider submitting!

And, finally, one last bit of shameless self-celebration. Something new happened on the header of this blog:

Yai me! I’m a real associate grownup now! (Actually, effective on October 1st. So I still have time to mess this up)