Michele Coscia

08 November 2018 ~ 0 Comments

Using Web Crawling to Map US State Governments

What do governments do? These days, you can answer this question in many ways: the desire for accountability has pushed many governments to publish more and more data online. These efforts are fantastic, but they seem to have a blind spot. The data is flat. By that, I mean that they usually contain tables, budgets, pieces of information that cannot be really connected to each other. As a network scientist, I think inter-actions contain as much information as actions. Where is the dataset that tells me how public agencies interact with each other? Now the answer is simple: we built it!

The US state governments in all their gory tangledness. Click to enjoy it fully.

The question is: how do we map interactions between agencies if they do not publish data about what they do with whom? And, once we’ve done that, how do we know we’re representing the interactions correctly? This is the result of an effort started by Stephen Kosack, which then gathered to accompany me a fantastic team with a diverse set of skills: Evann Smith, Kim Albrecht, Albert-Laszlo Barabasi, and Ricardo Hausmann. It resulted in the paper “Functional structures of US state governments,” recently published (Open Access!) in PNAS.

We realized that there is a place where agencies say what they are doing: their website. And because websites are built around the idea of interconnecting documents, they are a natural place to scout for links. Our fundamental assumption is that, when a school’s website links to its school district, they’re implicitly or explicitly saying: “What that agency says is relevant for what I do, so you should check them out.” In other words, we can and should link them in a network of agencies.

Crawling the web is notoriously complicated. Besides the technical challenges, there’s also the fact that it’s difficult to build a complete index of government websites. We did our best to follow the agency directories published by central state governments, and integrated that with data from Wikipedia and other websites specializing in listing public schools, libraries, city, and county governments.

Obviously, the picture is incomplete and it changes constantly: we did the crawl in 2014 and the web presence of governments ought to look a bit different today. However, we managed to estimate websites’ ages using the Wayback Machine (consider donating!) and we see signs of saturation: the growth of government presence online is significantly slowing down. A hint that the big picture we’re seeing shouldn’t have changed that much. So what’s this big picture? Allow me to introduce you to what we affectionately call “The Cathedral”:

Click to enlarge and cathedralize yourself!

We collapsed each government agency in a two-level classification of activities into government functions. The top level is a generic activity area and the second level is a specialization inside the area. For instance, a top-level function is education. Then “primary/secondary education” and “public libraries” are two possible second level functions inside education. We built this classification by looking at the textual content of each website. Mechanical Turkers validated it. In the cathedral, each node is a function. We connected nodes if agencies, classified under one function, linked themselves to agencies classified under the other. The position of the node is determined by its centrality both horizontally — most central in the middle — and vertically — from top, most central, to bottom. The cathedral confirms our intuition of hierarchicalness: central state and local governments oversee everything and connect to everything.

We could test how much our picture overlaps with actual agency interactions, because the government of Massachusetts mandates some collaborations in its constitution. So we forced Evann to dedicate one year of her life going through the constitution of Massachusetts and noting down when two agencies were mentioned in the same article as in need of interacting. Then it took me a whole five minutes to swoop in to take the credit by testing how much the web network fared compared to the constitutional network (thanks Evann!).

Red is web-constitution overlap, yellow is what the constitution sees and we don’t, blue is what we see and the constitution doesn’t. Click to enlarge.

The answer is “a lot, with a twist.” Agencies that are mandated to collaborate connect to each other on the web much more strongly than we expected — almost eight times as much. The majority of mandated connections are represented in the web. But the overall measure of alignment returned rather low values. How come? Because the internet reveals many more agencies than are mentioned in the constitution. And it reveals many more links too. Although some of that ought to be noise, there are ways to filter it out to end up with a network that is actually more complete than the one you’d get by looking only at the constitution.

One question we wanted to answer with our data was: how can we explain differences in how states decide to implement their functions? We had a few hypotheses. Maybe it’s ideology: red and blue have… different ideas on how to do things — to put it mildly. Or maybe it’s how rich the states are: more $$$ = more complexity. Maybe it’s just their geographical position. It could be a bit of all of that, but there is one thing dominating everything else: economic structure. It’s not the amount of dollars you earn, but how you do it. Hi-tech states look like hi-tech states, agricultural states look like agricultural states, whether they are red, blue, purple, ocher, indigo, rich, or poor.

Each circle here is a state, and we look at three dimensions on how they implement each function: how many agencies they created to serve it, how big they are, how much they talk to each other. Some functions, like education, are consistently implemented the same across states — we can tell because the states’ circles are very clustered –. Others are all over the place, like administrative law. Click to enlarge.

Everything I said can be visualized in all its glory in the official website of the project: http://govmaps.cid.hks.harvard.edu/ (by Kim Albrecht). You have a few interactive visualizations, the paper, and the data. We strongly believe in open data and reproducibility: our data release includes not only the US state government networks, but also a collection of scripts to reproduce the main results of the paper.

I hope this could be a significant step forward in how we understand government actions and effects on society. Nowadays, it feels that there is a gargantuan ideological divide between sides: handing over your state to “the other side” feels like a tragedy. Our results seem to suggest that, actually, it doesn’t make that much of a difference.

Tags: crawling, government, hierarchies, ideology, state, united states, web crawling

27 February 2018 ~ 0 Comments

Network Hierarchies and the Michelangelo Principle

Link Analysis

The Michelangelo Principle — almost certainly apocryphal — states that:

If you want to transform a marble block into David, just chip away the stone that doesn’t look like David.

This seems to be a great idea to solve literally every problem in the world: if you want to fix evil, just remove from the world everything that is evil. But I don’t want to solve every problem in the world. I want to publish papers on network analysis. And so I apply it to network analysis. In this case, I use it to answer the question whether a directed network has a hierarchical structure or not. The result has been published in the paper “Using arborescences to estimate hierarchicalness in directed complex networks“, which appeared last month in PLoS One.

To determine whether a network is hierarchical is useful in a number of applications. Maybe you want to know how resilient your organization is to the removal of key elements in the team. Or you want to prove that a system you mapped does indeed have a head, instead of being a messy blob. Or you want to know whether it is wise to attempt a takedown on the structure. In all these scenarios, you really desire a number between 0 and 1 that tells you how hierarchical the network is. This is the aim of the paper.

Following the Michelangelo Principle, I decide to take the directed network and chip away from it everything that does not look like a perfect hierarchy. Whatever I’m left with is, by definition, a perfect hierarchy. If I’m left with a significant portion of the original network, then it was a pretty darn good hierarchy to begin with. If I’m left with nothing, well, then it wasn’t a hierarchy at all. Easy. Let’s translate this into an algorithm. To do so, we need to answer a couple of questions:

What’s a “perfect hierarchy”?
How do we do the chipping?
How do we quantify the amount we chipped?

The first question is the one where we have most wiggle room. People might agree or disagree with the definition of perfect hierarchy that I came up with. Which is: a perfect hierarchy is a system where everybody answers to a single boss, except the boss of them all, who answers to no one. I like this definition because it solves a few problems.

Consider the picture above. In the leftmost example, if we assume nodes 1 and 2 give contradictory orders, node 5 doesn’t really know what to do, and the idea of a hierarchy breaks down. In the example in the middle, we don’t even know who’s the boss of them all: is it node 0 or node 3? The rightmost example leaves us no doubt about who’s boss, and there’s no tension. For those curious, network scientists call that particular topology an “arborescence“, and that’s the reason this exotic word is in the paper title. Since this is a well defined concept, we know exactly what to remove from an arbitrary directed network to make it into an arborescence.

Time to chip! Arbitrary directed networks contain strongly connected components: they have paths that can lead you back to your origin if you follow the edge directions. An arborescence is a directed acyclic graph, meaning that it cannot have such components. So our first step is to collapse them (highlighted in yellow above) into a single node. Think of strongly connected components as headless teams where all the collaborators are at the same level. They are a node in the hierarchy. We don’t care how a node organizes itself internally. As long as it answers to a boss and gives direction to its underlings, it can do it in whichever way it wants.

Second chipping step: in an arborescence, all nodes have at most one incoming connection, and only one node has zero. So we need to remove all offending remaining edges (highlighted in orange above). Once we complete both steps, we end up with an arborescence, and we’re done. (There are edge cases in which you’ll end up with multiple weakly connected components. That’s fine. If you follow the procedure, each of these components is an arborescence. Technically speaking, this is an “arborescence forest”, and it is an acceptable output)

We can now answer the final question: quantifying how much we chipped. I decide to focus on the number of edges removed. Above, you see that the original graph (left) had twenty edges, and that (right) nine edges survived. So the “hierarchicalness” of the original graph is 9 / 20 = .45.

Now the question is: why would someone use this method to estimate a network’s degree of hierarchicalness and not one of the many already proposed in the literature? The other methods all have small downsides. I build some toy examples where I can show that arborescence is the way to go. For instance, you can’t find a more perfect hierarchy than a balanced tree (leftmost example above). However, Global Reach Centrality would fail to give it a perfect score — since it thinks only a star graph is a perfect hierarchy. Agony and Flow Hierarchy aren’t fooled in this case, but give perfect scores in many other scenarios: a wheel graph with a single flipped edge (example in the middle), or a case where there are more bosses than underlings (rightmost example). Those who have been in a team with more bosses than workers know that the arrangement could be described in many ways, but “perfect” ain’t one of them.

Arborescence is also able to better distinguish between purely random graphs and graphs with a hierarchy — such as a preferential attachment with edges going from the older to the newer nodes (details in the paper). Before you start despairing, it’s not that I’m saying we’ve been doing hierarchy detection wrong for all these years. In most real world scenarios, these measures agree. But, when they disagree, arborescence is the one that most often sides with the domain experts, who have well informed opinions whether the system represented by the network should be a hierarchy or not.

To conclude, this method has several advantages over the competitors. It’s intuitive: it doesn’t give perfect ratings to imperfect hierarchies and vice versa. It’s graphic: you can easily picture what’s going on in it, as I did in this blog post. It’s conservative: it doesn’t make the outlandish claim that “everyone before me was a fool”. It’s rich: it gives you not only a score, but also a picture of the hierarchy itself. So… Give it a try! The code is freely available, and it plays nicely with networkx.

Tags: complex networks, directed networks, hierarchies, link analysis, michelangelo

Connecting Humanities

Using Web Crawling to Map US State Governments

Network Hierarchies and the Michelangelo Principle

Twitter

People I find interesting

Categories

Recent Posts

Archives