For this end of July, I want to report some juicy facts about a work currently under development. Why? Because I think that these facts are interesting. And they are a bit depressing too. So instead of crying I make fun of them, because as the “Panic! At the Disco” would put it: I write sins, not tragedies.
So, a bit of background. Last year I got involved in an NSF project, with my boss Ricardo Hausmann and my good friend/colleague Prof. Stephen Kosack. Our aim is to understand what governments do. One could just pull out some fact-sheets and budgets, but in our opinion those data sources do not tell the whole story. Governments are complex systems and as complex systems we need to understand their emergent properties as collection of interacting parts. Long story short, we decided to collect data by crawling the websites of all public agencies for each US state government. As for why, you’ll have to wait until we publish something: the aim of this post is not to convince you that this is a good idea. It probably isn’t, at least in some sense.
It isn’t a good idea not because that data does not make sense. Au contraire, we already see it is very interesting. No, it is a tragic idea because crawling the Web is hard, and it requires some effort to do it properly. Which wouldn’t be necessarily a problem if there wasn’t an additional hurdle. We are not just crawling the Web. We are crawling government websites. (This is when in a bad horror movie you would hear a thunder nearby).
To make you understand the horror of this proposition is exactly the aim of this post. First, how many government websites are really out there? How to collect them? Of course I was not expecting a single directory for all US states. And I was wrong! Look for yourselves this beauty: http://www.statelocalgov.net/. The “About” page is pure poetry:
State and Local Government on the Net is the onle (sic) frequently updated directory of links to government sponsored and controlled resources on the Internet.
So up-to-date that their footer gets only to 2010 and their news section only includes items from 2004. It also points to:
SLGN Notes, a weblog, [that] was added to the site in June 2004. Here, SLGN’s editors comment on new, redesigned or updated state and local government websites, pointing out interesting or fun features for professional and consumer audiences alike and occasionally cover related news.
Yeah, go ahead and click the link, that is not the only 404 page you’ll see here.
Enough compliments to these guys! Let’s go back to work. I went straight to the 50 different state government’s websites and found in all of them an agency directory. Of course asking that these directories shared the same structure and design is too much. “What are we? Organizations whose aim is to make citizen’s life easier or governments?”. In any case from them I was able to collect the flabbergasting amount of 61,584 URLs. Note that this is six times as much as from statelocalgov.net, and it took me a week. Maybe I should start my own company
Awesome! So it works! Not so fast. Here we hit the first real wall of government technological incompetence. Out of those 61,584, only 50,999 actually responded to my pings. Please note that I already corrected all the redirects: if the link was outdated but the agency redirected you to the new URL, then that connection is one of the 50,999. Allow me to rephrase it in poetry: in the state government directories there are more than ten thousand links that are pure, utter, hopeless garbage nonsense. More than one out of six links in those directories will land you exactly nowhere.
Oh, but let’s stay positive! Let’s take a look at the ones that actually lead you somewhere:
- Inconsistent spaghetti-like design? Check. Honorable mention for the good ol’ frameset webdesign of http://colecounty.org/.
- Making your website an image and use <area> tag for links? Check. That’s some solid ’95 school.
- Links to websites left to their own devices and purchased by someone else? Check. Passing it through Google Translate, it provides pearls of wisdom like: “To say a word and wipe, There are various wipe up for the wax over it because wipe from”. Maybe I can get a half dozen of haiku from that page. (I have more, if you want it)
- ??? Check. That’s a Massachusetts town I do not want to visit.
- Maintenance works due to finish some 500 days ago? Check.
- Websites mysteriously redirected somewhere else? Check. The link should go to http://www.cityoflaplata.com/, but alas it does not. I’m not even sure what the heck these guys are selling.
- These aren’t the droids you’re looking for? Check.
- The good old “I forgot to renew the domain contract”? Check.
Bear in mind that this stuff is part of the 50,999 “good” URLs (these are scare quotes at their finest). At some point I even gave up noting down this stuff. I saw hacked webpages that had been there since years. I saw an agency providing a useful Google Maps of their location, which according to them was the middle of the North Pole. But all those things will be lost in time, like tears in the rain.