16 October 2020 ~ 0 Comments

When you’re studying complex systems, one of the most important questions you might have is: how will this system evolve in the future? If you’re modeling your system as a network — as I like to do in my spare time — this boils down to predicting the arrival of new nodes and links. This is the realm of link prediction. In this post, I’ll describe one advancement in the field that I developed with fellow NERD Michael Szell in the paper “Multiplex Graph Association Rules for Link Prediction“, to appear next year at the ICWSM conference.

There are many ways to predict new links in a network, but most of these methods have a disadvantage: they can only give you a score for potential future connections between two nodes that are already in the network when you observe it. In other words, they cannot predict new incoming nodes. But with a technique called “graph association rules”, used by the GERM algorithm published in 2009, we can predict new nodes. How is that possible?

In simple terms, a “graph association rule” is a rule that tells you: every time you see in your network a pattern A, it will turn into pattern B, with a certain degree of confidence. The rule is extracted by counting how many times patterns A and B appear. For instance, in the image below, if pattern A (the triangle) appears 9 times and pattern B (the triangle with a dangling node) appears 6 times, the confidence of the rule is 2/3. 66% of the time, a triangle has attracted a dangling node. Note that pattern B must include pattern A, otherwise it’s difficult to hypothesize that A evolved into B.

GERM has a problem of its own, which Michael and I set out to solve: it can predict incoming nodes and links, but it cannot distinguish between different link types. In other words, every predicted link is the same to GERM. However, many real world networks have link types: nodes can connect in different ways. For instance, on social media, you connect to the people you know in different ways: via Facebook, Twitter, Linkedin, etc.

You’d model such system with a multiplex network, which allows for link types. If you have a multiplex network, you need multiplex graph association rules for link prediction. Which is exactly the title of our paper! What a crazy coincidence!

In the paper we re-purpose Moss, a graph pattern miner that can extract multiplex patterns, to build such rules. We created a pre- and post-processor of Moss that can construct the rules based on the patterns it finds. Now we can give colors to the links that are featured in our rules, as the figure below shows. This is a generalization of the signed link predictor I already wrote about a long time ago (the second ever post on this blog. I feel old).

Doing so isn’t painless though. We made sacrifices. For instance, our rule extractor doesn’t really understand the passage of time. It knows that the input network is in the past and spits out the rules to predict its evolution, but it doesn’t know how long a rule will take to complete. Unlike GERM, which can tell you that a rule will take n timesteps to complete.

This downside is minor though. Our link predictor performs well, as witnessed by the ROC curves below (our method in red). The comparisons are other multiplex link predictors. Not only are they worse at predicting links, but they have the added disadvantage of being unable to predict the arrival of new nodes. They also have issues with memory consumption, because they generate a score for each pair of nodes that is not connected in the training data — which, for sparse networks, is a lot of scores. Our predictor, instead, only gives scores to the links that are valid consequences of the rules that we found, usually way fewer than all unconnected node pairs.

If you want to play with our link predictor, you can do so by downloading the code I made public for the replication of the paper’s results. The code is very academic — meaning: badly written, unreasonably fragile, and horribly inefficient. I have in the works an extension with more efficient and robust code, and a generalization from multiplex to fully multilayer networks. Stay tuned!

22 January 2015 ~ 0 Comments

## Surprising Facts About Shortest Paths

Maybe it’s the new year, maybe it’s the fact that I haven’t published anything new recently, but today I wanted to take a look at my publication history. This, for a scientist, is something not unlike a time machine, bringing her back to an earlier age. What was I thinking in 2009? What sparked my interest and what were the tools further refined to get to the point where I am now? It’s usually a humbling (not to say embarrassing) operation, as past papers always look so awful – mine, at least. But small interesting bits can be found, like the one I retrieved today, about shortest paths in communication networks.

A shortest path in a network is the most efficient way to go from one node to another. You start from your origin and you choose to follow an edge to another node. Then you choose again an edge and so on until you get to your destination. When your choices are perfect and you used the minimum possible number of edges to follow, that’s a shortest path (it’s A shortest path and not THE shortest path because there might be alternative paths of the same length). Now, in creating this path, you obviously visited some nodes in between, unless your origin and destination are directly connected. Turns out that there are some nodes that are crossed by a lot of shortest paths, it’s a characteristic of real world networks. This is interesting, so scientists decided to create a measure called betweenness centrality. For each node, betweenness centrality is the share of all possible shortest paths in the network that pass through them.

Intuitively, these nodes are important. Think about a rail network, where the nodes are the train stations. High betweenness stations see a lot of trains passing through them. They are big and important to make connections run faster: if they didn’t exist every train would have to make detours and would take longer to bring you home. A good engineer would then craft rail networks in such a way to have these hubs and make her passengers happy. However, it turns out that this intuitive rule is not universally applicable. For example some communication networks aren’t willing to let this happen. Michele Berlingerio, Fosca Giannotti and I stumbled upon this interesting result while working on a paper titled Mining the Temporal Dimension of the Information Propagation.

We built two communication networks. One is corporate-based: it’s the web of emails exchanged across the Enron employee ecosystem. The email record has been publicly released for the investigation about the company’s financial meltdown. An employee is connected to all the employees she emailed. The second is more horizontal in nature, with no work hierarchies. We took users from different email newsgroups and connected them if they sent a message to the same thread. It’s the nerdy version of commenting on the same status update on Facebook. Differently from most communication network papers, we didn’t stop there. Every edge still carries some temporal information, namely the moment in which the email was sent. Above you have an extract of the network for a particular subject, where we have the email timestamp next to each edge.

Here’s where the magic happens. With some data mining wizardry, we are able to tell the characteristic reaction times of different nodes in the network. We can divide these nodes in classes: high degree nodes, nodes inside a smaller community where everybody replies to everybody else and, yes, nodes with high betweenness centrality, our train station hubs. For every measure (characteristic), nodes are divided in five classes. Let’s consider betweenness. Class 1 contains all nodes which have betweenness 0, i.e. those through which no shortest path passes. From class 2 to 5 we have nodes of increasing betweenness. So, nodes in class 3 have a medium-low betweenness centrality and nodes in class 5 are the most central nodes in the network. At this point, we can plot the average reaction times for nodes belonging to different classes in the two networks. (Click on the plots to enlarge them)

The first thing that jumps to the eye is that Enron’s communications (on the left) are much more dependent on the node’s characteristics (whether the characteristic is degree or betweenness it doesn’t seem to matter) than Newsgroup’s ones, given the higher spread. But the interesting bit, at least for me, comes when you only look at betweenness centrality – the dashed line with crosses. Nodes with low (class 2) and medium-low (class 3) betweenness centrality have low reaction times, while more central nodes have significantly higher reaction times. Note that the classes have the same number of nodes in them, so we are not looking at statistical aberrations*. This does not happen in Newsgroups, due to the different nature of the communication in there: corporate in Enron versus topic-driven in Newsgroup.

The result carries some counter intuitive implications. In a corporate communication network the shortest path is not the fastest. In other words, don’t let your train pass through the central hub for a shortcut, ’cause it’s going to stay there for a long long time. It looks like people’s brains are less elastic than our train stations. You can’t add more platforms and personnel to make more things passing through them: if your communication network has large hubs, they are going to work slower. Surprisingly, this does not hold for the degree (solid line): it doesn’t seem to matter with how many people you interact, only that you are the person through which many shortest paths pass.

I can see myself trying to bring this line of research back from the dead. This premature paper needs quite some sanity checks (understatement alert), but it can go a long way. It can be part of the manual on how to build an efficient communication culture in your organization. Don’t overload people. Don’t create over-important nodes in the network, because you can’t allow all your communications to pass through them. Do keep in mind that your team is not a clockwork, it’s a brain-work. And brains don’t work like clocks.

* That’s also the reason to ditch class 1: it contains outliers and it is not comparable in size to the other classes.

28 February 2013 ~ 0 Comments

## Networks and Eras

The real world has many important characteristics. One I heard being quite salient is the fact that time passes. Any picture of the world has to evolve to reflect change, otherwise it is doomed to be representative only of a narrow moment in time. This is quite a problem in computer science, because when we want to analyze something we need to spend a lot of time in gathering data and, usually, the analysis can be done only once we have everything we need. It’s a bit like in physics, when the problems are solved in the vacuum and in the absence of friction. Of course, many people work to develop dynamics models, trying to handle the changes in the data.

Take link prediction, for example. Link prediction is the branch of network science whose aim is to predict which connections are more likely to appear in the near future, given the current status of a network. There are many approaches to this problem: one simply states that the probability that two nodes will connect is proportional to their current degree (because it’s being observed that high degree nodes attracts more edges, it’s called “preferential attachment“), another looks at the history of the new edges which came into existence and tries to redact some evolution rules (see the paper, not much different from my work on signed networks).

What’s the problem in this? The problem lies in the fact that any link came into existence in a specific moment, in which the network shape was different from any other moment. Let’s consider the preferential attachment, with an example. The preferential attachment tells you that the position in the market of Google not only is not in danger: it will become stronger and stronger, because its high visibility attracts everybody who needs the services it is providing. However, Google was not born with the web, but several years after. So in the moment in which Google was born, the preferential attachment would have told you that Google had no chance to beat Yahoo. And now it’s easy to laugh at this idea.

So, what happened? The idea that I investigated with my colleagues at the KDDLab in Italy is extremely simple: just like Earth’s geological times, also complex networks (and complex systems in general) evolve discontinuously, with eras in which some evolution rules apply and some others, valid in other eras, don’t. The original paper is quite old (from 2010), but we recently published an update journal version of it (see the Intelligent Data Analysis Journal), that’s why I’m writing about it.

In our paper, we describe how to build a framework to understand what are the eras in the evolution of a network. Basically, everything boils down to have many snapshots of the network in different moments of time and a similarity measure that tells you how similar are two consecutive snapshots. Then, by checking the values of this similarity function, one can understand if the last trends she is seeing are providing reliable information to make predictions or not. In our world, then, we understand that when Google enters in the web anything can happen, because we are in a new era and we do not use outdated information that do not apply anymore to the new scenario. In our world, also, we are aware that nobody is doomed to success, regardless how good its current position is. A nice and humbling perspective, if I may say.

I suggest reading the paper to understand how nicely our era detection system fits with the data. The geekier readers will find a nice history of programming languages (we applied the era discovery system to the network of co-authorship in computer science), normal people will probably find more amusement in our history of movies (from networks of collaboration extracted from the Internet Movie Database).

So, next time you’ll see somebody trying to make predictions using complex network analysis, check if she is considering data history using an equivalent of our framework. If she does, thumbs up. If she doesn’t, trust her just like you would trust a meteorologist trying to forecast tomorrow’s weather by crunching data from yesterday down to the Mesozoic.

15 September 2012 ~ 0 Comments

## On Social Balance and Link Classification

Imagine being subscribed to a service where you can read other users’ opinions about what interests you. Imagine that, for your convenience, the system allows you to tag other users as “reliable” or “unreliable”, such that the system will not bother you by signaling new opinions from users that you regard as unreliable, while highlighting the reliable ones. Imagine noticing a series of opinions, by a user you haven’t classified yet, regarding stuff you really care about. Imagine also being extremely lazy and unwilling to make up your own mind if her opinions are really worth your time or not. How is it possible to classify the user as reliable or unreliable for you without reading?

What I just described in the previous paragraph is a problem affecting only very lazy people. Computer Science is the science developed for lazy people, so you can bet that this problem definition has been tackled extensively. In fact, it has been. This problem is known as “Edge Sign Classification”, the art of classifying positive/negative relations in social media.

Computer Science is not only a science for lazy people, is also a science carried out by lazy people. For this reason, the edge sign classification problem has been mainly tackled by borrowing the social balance theory for complex networks, an approach that here I’m criticizing (while providing an alternative, I’m not a monster!). Let’s dive for a second into social balance theory, with the warning that we will stay on the surface without oxygen tanks, to get to the minimum depth that is sufficient for our purposes.

Formally, social balance theory states that in social environments there exist structures that are balanced and structures that are unbalanced. The former should be over-expressed (i.e. we are more likely to find them) while the latter should be rarer than expected. If these sentences are unclear to you, allow me to reformulate them using ancient popular wisdom. Social balance theory states mainly two sentences: “The friend of my friend is usually my friend too” and “The enemy of my enemy is usually my friend”, so social relations following these rules are more common than ones that do not follow them.

Let’s make two visual examples among the many possible. These structures in a social network are balanced:

(on the left we have the “friend of my friend” situation, while on the right we have the “enemy of my enemy” rule). On the other hand, these structures are unbalanced:

(the latter on the right is actually a bit more complicated situation, but our dive in social balance is coming to an abrupt end, therefore we have no space to deal with it).

These structures are indeed found, as expected, to be over- or under-expressed in real world social networks (see the work of Szell et al). So the state of the art of link sign classification simply takes an edge, it controls for all the triangles surrounding it and returns the expected edge (here’s the paper – to be honest there are other papers with different proposed techniques, but this one is the most successful).

Now, the critique. This approach in my opinion has two major flaws. First, it is too deterministic. By applying social balance theory we are implying that all social networks obey to the very same mechanics even if they appear in different contexts. My intuition is that this assumption is rather limiting and untrue. Second, it limits itself to simple structures, i.e. triads of three nodes and edges. It does so because triangles are easy to understand for humans, because they are described by the very intuitive sentences presented before. But a computer does not need this oversimplification: a computer is there to do the work we find too tedious to do (remember the introduction: computer scientists are lazy animals).

In the last months, I worked on this problem with a very smart undergraduate student for his master thesis (this is the resulting paper, published this year at SocialCom). Our idea was (1) to use graph mining techniques to count not only triangles but any kind of structures in signed complex networks. Then, (2) we generated graph association rules using the frequencies of the structures extracted and (3) we used the rules with highest confidence and support to classify the edge sign. Allow me to decode in human terms what I just described technically.

1) Graph mining algorithms are algorithms that, given a set of small graphs, count in how many of the graphs a particular substructure is present. The first problem is that these algorithms are not really well defined if we have a single graph for which we want to count how many times the structure is present*. Unfortunately, what we have is indeed not a collection of small graphs, but a single large graph, like this:

So, the first step is to split this structure in many small graphs. Our idea was to extract the ego network for each node in the network, i.e. the collection of all the neighbors of this node, and the edges among them. The logical explanation is that we count how many users “see” the given structure around them in the network:

(these are only 4 ego networks out of a total of 20 from the above example).

2) Now we can count how many times, for example, a triangle with three green edges is “seen” by the nodes of the network. Now we need to construct the “graph association rule”. Suppose that we “see” the following graph 20 times:

and the following graph, which is the same graph but with one additional negative edge, 16 times:

In this case we can create a rule stating: whenever we see the former graph, then we know with 80% confidence (16 / 20 = 0.8) that this graph is actually part of the latter one. The former graph is called premise and the latter is the consequence. We create rules in such a way that the consequence is an expansion of just one edge of the premise.

3) Whenever we want to know if we can trust the new guy (i.e. we have a mysterious edge without the sign), we can check around what premises match.  The consequences of these premises give us a hunch at the sign of our mysterious edge and the consequence with highest confidence is the one we will use to guess the sign.

In the paper you can find the details of the performance of this approach. Long story short: we are not always better than the approach based on social balance theory (the “triangle theory”, as you will remember). We are better in general, but worse when there are already a lot of friends with an expressed opinion about the new guy**. However, this is something that I give away very happily, as usually they don’t know, because social networks are sparse and there is a high chance that the new person does not share any friends with you. So, here’s the catch. If you want to answer the initial question the first thing you have to do is to calculate how many friends of yours have an opinion. If few do (and that happens most of the time) you use our method, otherwise you just rely on them and you trust the social balance theory.

With our approach we overcome the problems of determinism and oversimplification I stated before, as we generate different sets of rules for different networks and we can generate rules with an arbitrary number of nodes. Well, we could, as for now this is only a proof of principle. We are working very hard to provide some usable code, so you can test yourself the long blabbering present in this post.

* Note for experts in graph mining: I know that there are algorithms defined on the single-graph setting, but I am explicitly choosing an alternative, more intuitive way.

** Another boring formal explanation. The results are provided in function of a measure called “embeddedness” of an edge (E(u,v) for the edge connecting nodes u and v). E(u,v) is a measure returning the number of common neighbors of the two nodes connected by the mysterious edge. For all the edges with minimum embeddedness larger than zero (E(u,v) > 0, i.e. all the edges in the network) we outperform the social balance, with accuracy close to 90%. However, for larger minimum embeddedness (like E(u,v) > 25), social balance wins. This happens because there is a lot of information around the edge. It is important to note that the number of edges with E(u,v) > 25 is usually way lower than half of the network (from 4% to 25%).