22 August 2019 ~ 0 Comments

## Deriving Networks isn’t as Easy as it Looks

Networks are cool because they’re a relatively simple model that allows you to understand complex systems. The problem is that they’re too cool: sometimes they make you want to do network analysis on something that isn’t really a network. For instance, consider Netflix. Here you have people watching movies. You want to know which movies are similar to each other so that you can suggest them to similar users. On the wings of Maslow’s Lawwhen you’re holding a hammer everything starts looking like a nail –, the network scientist would want to build a movie-movie network.

The problem is that there are many different ways to make a movie-movie network from Netflix data. Each of these different ways will alter the shape of your network in dramatic ways, which will affect the results you’re going to get once you use it for your aims. With Luca Rossi, I started exploring this space. This resulted in the paper “The Impact of Projection and Backboning on Network Topologies“, which I will present next week at the ASONAM conference.

In the paper we take some real-world data and we apply all possible combinations of network building techniques on it. We systematically explore the key topological properties of the resulting networks, and see that they dramatically change depending on which strategy you picked. Meaning that you’re going to get completely different results from the same analysis later on.

Good network analysis is like good art: if you gaze long at the hairball, the hairball will gaze back at you. Image property of the Albright-Knox Art Gallery, Buffalo, NY.

The first thing we need to understand is that, to get to the movie-movie network, we need to perform two major steps. Each movie is a vector, containing information about each user. It could simply be a one if the user watched the movie, or a zero if they didn’t. Thus first we need to apply a similarity measure quantifying how similar two movies are to each other (what I call “projection”). Then we’ll realize that all we got is a hairball. Every movie has a non-zero similarity with any other movie. After all, there are millions of users, but just a handful of movies, so the probability that any two movies were watched by at least one user is pretty high. So you need to filter out your movie-movie similarity, otherwise your resulting network will be too dense.

Comparing two vectors is the oldest profession in the world, assuming your world is completely made up of linear algebra — mine, sadly, is. Thus you can pick and choose dozens of similarity metrics — Euclidean, cosine, I have a soft spot for the Mahalanobis distance myself. However, you’d be better served by the measures that were developed with complex networks in mind. You see, the binary movie-user vectors will have typical broad degree distributions: some movies are very popular — everybody watches them –, some people like me are pathological movie buffs and will watch everything — my watchlist has ~6,500 entries. Thus for this paper we focus on a few of those “bipartite projection” techniques: hyperbolic, resource allocation (ProbS), and my beloved YCN method.

Since I already jumped on the XKCD wagon, I see no harm in continuing down this path…

Then, to filter out connections, you have to have an idea of what’s a “strong, significant” connection and what isn’t. If you’re naive and just think that you should only keep connections with higher weights (what I call “naive thresholding”), boy do I have news for you. Also in this case, we’re going to consider a couple of ways to filter out noisy connections: the disparity filter and my noise corrected backbone.

Ok, the stage is set. If you were paying attention, you’ll figure out what’s coming next: a total mess.

Above (click to enlarge) you see the filtering techniques — top to bottom: naive threshold, disparity filter, noise corrected –, for each projection — line color –, on different network properties. From left to right we see: how many nodes survive the filter step, how much clustered the network is, and how well separated its communities are. The threshold levels (x-axis) attempt to preserve a comparable number of edges for each technique combination.

Yeah, it doesn’t look good. Look at the middle column: there are some versions of this network with perfect clustering, meaning that every common neighbor of a movie is connected to every other; while networks have a transitivity of zero; with almost every possible other values in between. The same holds for modularity, which can span from ~0.2 to practically 1. So there’s no way of saying whether these are properties of the system or just properties of the cleaning procedures you used. Keep in mind that the original data is the same. We could conclude anything by stirring our pile of linear algebra. Want to argue that the movie space doesn’t cluster? Project with YCN and filter with noise corrected method. Want to find strong communities instead? No biggie: project with resource allocation and do a simple threshold of the result.

Famous network scientist and two-time “world’s best mustache” winner Nietzsche once said: “He who fights with hairballs should look to it that he himself does not become a hairball.”

I wish I had a wise message to wrap up this blog post. Something about how to choose the best projection-filtering pair best fitting a specific analysis — one that you cannot tune to obtain the results you want. However, that will have to wait for further research. For now, I just want you to grow suspicious about specific results you see out there from networks that really aren’t network. If your nodes aren’t really connecting directly — like physical connections would do, for instance between neurons –, pretending they do so might lead you down a catastrophic over-confident path.

25 January 2017 ~ 0 Comments

## Network Backboning with Noisy Data

Networks are a fantastic tool for understanding an interconnected world. But, to paraphrase Spider Man, with networks’ expressive power come great headaches. Networks lure us in with their promise of clearly representing complex phenomena. However, once you start working with them, all you get is a tangled mess. This is because, most of the time, there’s noise in the data and/or there are too many connections: you need to weed out the spurious ones. The process of shaving the hairball by keeping only the significant connections — the red ones in the picture below —  is called “network backboning”. The network backbone represents the true relationships better and will play much nicer with other network algorithms. In this post, I describe a backboning method I developed with Frank Neffke, from the paper “Network Backboning with Noisy Data” accepted for publication in the International Conference on Data Engineering (the code implementing the most important backbone algorithms is available here).

Network backboning is as old as network analysis. The first solution to the problem was to keep edges according to their weight. If you want to connect people who read the same books, pairs who have few books in common are out. Serrano et al. pointed out that edge weight distributions can span many orders of magnitude — as shown in the figure below (left). Even with a small threshold, we are throwing away a lot of edges. This might not seem like a big deal — after all we’re in the business of making the network sparser — except that the weights are not distributed randomly. The weight of an edge is correlated with the weights of the edges sharing a node with it — as shown by the figure below (right). It is easy to see why: if you have a person who read only one book, all its edges can have at most weight one.

Their weights might be low in comparison with the rest of the network, but they are high for their nodes, given their propensity to connect weakly. Isolating too many nodes because we accidentally removed all their edges is a no-no, so Serrano and coauthors developed the Disparity Filter (DF): a technique to estimate the significance of one node’s connections given its typical edge weight, regardless of what the rest of the network says.

This sounds great, but DF and other network backboning approaches make imprecise assumptions about the possibility of noise in our estimate of edge weights. In our book example, noise means that a user might have accidentally said that she read a book she didn’t, maybe because the titles were very similar. One thing DF gets wrong is that, when two nodes are not connected in the raw network data, it would say that measurement error is absent. This is likely incorrect, and it screams for a more accurate estimate of noise. I’m going to leave the gory math details in the paper, but the bottom line is that we used Bayes’ rule. The law allows us to answer the question: how surprising is the weight of this edge, given the weights of the two connected nodes? How much does it defy my expectation?

The expectation here can be thought of as an extraction without replacement, much like Bingo (which statisticians — notorious for being terrible at naming things — would call a “hypergeometric” one). Each reader gets to extract a given number of balls (n, the total number of books she read), drawing from a bin in which all balls are the other users. If a user read ten books, then there are ten balls representing her in the bin. This is a good way to have an expectation for zero edge weights (nodes that are not connected), because we can estimate the probability of never extracting a ball with a particular label.

I highlighted the words one and two, because they’re a helpful key to understand the practical difference between the approaches. Consider the toy example below. In it, each edge’s thickness is proportional to its weight. Both DF and our Noise Corrected backbone (NC) select the black edges: they’re thick and important. But they have different opinions about the blue and red edges. DF sees that nodes 2 and 3 have mostly weak connections, meaning their thick connection to node 1 stands out. So, DF keeps the blue edges and it drops the red edge. It only ever looks at one node at a time.

NC takes a different stance. It selects the red edge and drops the blue ones. Why? Because for NC what matters more is the collaboration between the two nodes. Sure, the blue connection is thicker than the red one. But node 1 always has strong connections, and its blue edges are actually particularly weak. On the other hand, node 3 usually has weak connections. Proportionally speaking, the red edge is more important for it, and so it gets saved.

To sum up, NC:

1. Refines our estimate of noise in the edge weights;
2. Sees an edge as the collaboration between two nodes rather that an event happening to one of them;
3. Uses a different model exploiting Bayes’ law to bake these aspects together.

How does that work for us in practice? Above you see some simulations made with artificial networks, of which we know the actual underlying structure, plus some random noise — edges thrown in that shouldn’t exist. The more noise we add the more difficult it is to recover the original structure. When there is little noise, DF (in blue) is better. NC (in red) starts to shine as we increase the amount of noise, because that’s the scenario we were targeting.

In the paper we also show that NC backbones have a comparable stability with DF, meaning that extracting the backbone from different time snapshots of the same phenomenon usually does not yield wildly different results. Coverage — the number of nodes that still have at least one edge in the network — is also comparable. Then we look at quality. When we want to predict some future relationship among the nodes, we expect noisy edges to introduce errors in the estimates. Since a backbone aims at throwing them away, it should increase our predictive power. The table below (click it to enlarge) shows that, in different country-country networks, the predictive quality (R2) using an NC backbone is higher than the one we get using the full noisy network. The quality of prediction can get as high as twice the baseline (the table reports the quality ratio: R2 of the backbone over R2 of the full network, for different methods).

The conclusion is that, when you are confident about the measurement of your network, you should probably extract its backbone using DF. However, in cases of uncertainty, NC is the better option. You can test it yourself!

28 August 2014 ~ 0 Comments

## The Curious World of Network Mapping

Complex networks can come in different flavors. As you know if you follow this blog, my signature dish is multilayer/multidimensional networks: networks with multiple edge types. One of the most popular types is bipartite networks. In bipartite networks, you have two types of nodes. For example, you can connect users of Netflix to the movies they like. As you can see from this example, in bipartite networks we allow only edges going from one type of nodes to the other. Users connect to movies, but not to other users, and movies can’t like other movies (movies are notoriously mean to each other).

Many things (arguably almost everything) can be represented as a bipartite network. An occupation can be connected to the skills and/or tasks it requires, an aid organization can be connected to the countries and/or the topics into which it is interested, a politician is connected to the bills she sponsored. Any object has attributes. And so it can be represented as an object-attribute bipartite network. However, most of the times you just want to know how similar two nodes of the same type are. For example, given a movie you like, you want to know a similar movie you might like too. This is called link prediction and there are two ways to do this. You could focus on predicting a new user-movie connection, or focus instead on projecting the bipartite network to discover the previously unknown movie-movie connections. The latter is the path I chose, and the result is called “Network Map”.

It is clearly the wrong choice, as the real money lies in tackling the former challenge. But if I wanted to get rich I wouldn’t have chosen a life in academia anyway. The network map, in fact, has several advantages over just predicting the bipartite connections. By creating a network map you can have a systemic view of the similarities between entities. The Product Space, the Diseasome, my work on international aid. These are all examples of network maps, where we go from a bipartite network to a unipartite network that is much easier to understand for humans and to analyze for computers.

Creating a network map, meaning going from a user-movie bipartite network to a movie-movie unipartite network, is conceptually easy. After all, we are basically dealing with objects with attributes. You just calculate a similarity between these attributes and you are done. There are many similarities you can use: Jaccard, Pearson, Cosine, Euclidean distances… the possibilities are endless. So, are we good? Not quite. In a paper that was recently accepted in PLoS One, Muhammed Yildirim and I showed that real world networks have properties that make the general application of any of these measures quite troublesome.

For example, bipartite networks have power-law degree distributions. That means that a handful of attributes are very popular. It also means that most objects have very few attributes. You put the two together and, with almost 100% probability, the many objects with few attributes will have the most popular attributes. This causes a great deal of problems. Most statistical techniques aren’t ready for this scenario. Thus they tend to clutter the network map, because they think that everything is similar to everything else. The resulting network maps are quite useless, made of poorly connected dense areas and lacking properties of real world networks, such as power-law degree distributions and short average path length, as shown in these plots:

Of course sometimes some measure gets it right. But if you look closely at the pictures above, the only method that consistently give the shortest paths (above, when the peak is on the left we are good) and the broadest degree distributions (below, the rightmost line at the end in the lower-right part of the plot is the best one) is the red line of “BPR”. BPR stands for “Bipartite Projection via Random-walks” and it happens to be the methodology that Muhammed and I invented. BPR is cool not only because its network maps are pretty. It is also achieving higher scores when using the network maps to predict the similarity between objects using ground truth, meaning that it gives the results we expect when we actually already know the answers, that are made artificially invisible to test the methodology. Here we have the ROC plots, where the highest line is the winner:

So what makes BPR so special? It all comes down to the way you discount the popular attributes. BPR does it in a “network intelligent” way. We unleash countless random walkers on the bipartite network. A random walker is just a process that starts from a random object of the network and then it jumps from it to one of its attributes. The target attribute is chosen at random. And then the walker jumps back to an object possessing that attribute, again choosing it at random. And then we go on. At some point, we start from scratch with a new random walk. We note down how many times two objects end up in the same random walk and that’s our similarity measure. Why does it work? Because when the walker jumps back from a very popular attribute, it could essentially go to any object of the network. This simple fact makes the contribution of the very popular attributes quite low.

BPR is just the latest proof that random walks are one of the most powerful tools in network analysis. They solve node ranking, community discovery, link prediction and now also network mapping. Sometimes I think that all of network science is founded on just one algorithm, and that’s random walks. As a final note, I point out that you can create your own network maps using BPR. I put the code online (the page still bears the old algorithm’s name, YCN). That’s because I am a generous coder.