08 June 2022 ~ 0 Comments

Give Me Some Space!, or: the Slow Descent into Unicode Madness

My wife is the literal embodiment of this comic strip:

Laptop Issues
Source: https://xkcd.com/2083/

She has to overcome the weirdest coding issues to get through her research. The other day her issue was: “I have a bunch of text and I want to surround every unicode emoticon with spaces” — the idea being to treat each emoji like a full word.

Sounds easy enough until you get to the word “unicode” and you start hearing helicopters in the background, a Doors song comes on, and you flash memories of going upstream in a jungle river on a barge.

My problem is that, instead, I am the embodiment of the victim in this comic strip:

Nerd Sniping
Source: https://xkcd.com/356/

When confronted with something seemingly simple but with an interesting level of complexity behind it, I just have to drop everything I’m doing for my actual job for an hour, until I solve it.

At first glance, it seems easy enough: find a dictionary of emoticons and use it as the basis of a regular expression. The emoji package can take care of the first part. A naive solution could be the following:

import re
from emoji import unicode_codes

allemojis = "".join(unicode_codes.UNICODE_EMOJI['en'])
searcher = re.compile(u'([%s])' % allemojis)
spacemoji = {key: searcher.sub(r' \1 ', texts[key]) for key in texts}

This assumes that “texts” is a dictionary with a collection of texts we’re interested in. The “searcher” wraps a bunch of characters in between square brackets, which means that any one of those characters will be matched. The “.sub” method will then replace whatever matched (“\1“) with its content surrounded by spaces.

Easy enough. Let’s test it with some example strings:

  1. Wonderful
  2. Everything works as it should and is awesome
  3. Say what now?
  4. I wonder if they still hire at McDonald’s

A passing knowledge of unicode, or a quick Google search about the mysterious \u200d code popping out of nowhere in example #3, leads to a relatively quick diagnosis. Emoticons are not single characters: a single glyph can combine multiple unicode code points that modify its appearance. In my case, the baseline turban emoticon is a man with the default yellow skin tone. To obtain a white lady with a turban you need to combine the turban emoticon with a light skin tone modifier and, via that \u200d zero-width joiner, the woman symbol.
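
A quick way to convince yourself of this in a Python shell (the code points below are the same ones spelled out as bytes further down: person wearing turban, a skin tone modifier, the zero-width joiner and the female sign):

# One glyph on screen, four code points under the hood
glyph = "\U0001F473\U0001F3FB\u200d\u2640"
print([hex(ord(c)) for c in glyph])  # ['0x1f473', '0x1f3fb', '0x200d', '0x2640']
print(glyph.encode("utf-8"))         # b'\xf0\x9f\x91\xb3\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x80'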

Same goes for numbers: some emoticons contain raw digit characters, and thus plain digits in your text will match even when they are not “inside” an emoticon.
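
The digit case is easy to check the same way: the “keycap one” emoticon, for instance, is literally the ASCII digit followed by two combining characters.

print([hex(ord(c)) for c in "1\ufe0f\u20e3"])  # ['0x31', '0xfe0f', '0x20e3'] -- it starts with a plain "1"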

So here’s a step-by-step cure for our unicode woes:

  1. Don’t work with string or unicode string objects. Always work with bytestrings by calling “.encode("utf-8")“. This way you can see what the machine actually sees. It’s not ““. it’s “\xf0\x9f\x91\xb3\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x80” (easy, right?).
  2. Don’t use square brackets for your regular expression, because it will only match one character. Emoticons aren’t characters, they are words. Use the pipe, which allows for matching groups of characters.
  3. Store your emoticons in a list and sort it by descending length. The regular expression stops at the first alternative that matches, and “👳🏻‍♀️” is longer than “👳”, because the first one is a modified version of the second. The simple turban emoticon is actually “\xf0\x9f\x91\xb3” (note how these are the first four bytes of the previous emoticon). Sorting longest-first makes sure the regular expression does not match the “man with a turban” hiding inside the “white woman with a turban“.
  4. Escape your regular expression’s special characters. Some emoticons contain the raw character “*“, which will make your compiler scream offensive things at you (see the sketch right after this list for a lazier way to handle it).
  5. Remember to encode your input text and then to decode your output, so that you obtain back a string and not a byte sequence — assuming you’re one of those basic people who want to be able to read the outputs of their pile of code.
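
As a side note on step 4: instead of escaping the asterisk by hand, you can let re.escape do the dirty work on every emoticon. This is just a sketch of that variant, using the same emoji package dictionary as above; it is not what I ended up doing in the final code below.

import re
from emoji import unicode_codes

# Same emoticon list as before, but each entry goes through re.escape,
# which neutralizes "*" and any other regex metacharacter hiding in there
emojis = [x.encode("utf-8") for x in unicode_codes.UNICODE_EMOJI['en']]
emojis = sorted(emojis, key=len, reverse=True)
searcher = re.compile(b'(%s)' % b'|'.join(re.escape(e) for e in emojis))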

If we put all of this together, here’s the result of my hour of swearing at the screen in Italian 🤌:

import re
from emoji import unicode_codes

allemojis = [x.encode("utf-8") for x in unicode_codes.UNICODE_EMOJI['en']]
allemojis = sorted(allemojis, key=len, reverse=True)  # longest emoticons first (step 3)
allemojis = b"|".join(allemojis)                      # pipe, not square brackets (step 2)

# Escape the raw "*" hiding in some emoticons (step 4), then wrap everything in a capturing group
searcher = re.compile(b'(%s)' % allemojis.replace(b'*', b'\\*'))
spacemoji = {key: searcher.sub(rb' \1 ', texts[key].encode("utf-8")).decode("utf-8") for key in texts}

Which restores some of my sanity.
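
To give a rough idea of the effect, here is a made-up one-liner run through the final “searcher” (assuming the pizza emoticon is in the emoji package’s dictionary, which it is in every version I have seen):

searcher.sub(rb' \1 ', "Nice 🍕!".encode("utf-8")).decode("utf-8")
# 'Nice  🍕 !'

The doubled space before the emoticon is the price of the naive " \1 " replacement; a final pass collapsing repeated whitespace fixes that, if it bothers you.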

I’m sure better programmers than me can find better solutions (in less time) and that there are infinite edge cases I’m not considering here, but this was good enough for me. At least I can now go back and do the stuff my employer is actually paying me to do, and I don’t feel I’m in a Francis Ford Coppola masterpiece any more.


31 July 2014 ~ 0 Comments

The (Not So) Little Shop of Horrors

For this end of July, I want to report some juicy facts about a project currently under development. Why? Because I think that these facts are interesting. And they are a bit depressing too. So instead of crying I make fun of them, because as Panic! At the Disco would put it: I write sins, not tragedies.

So, a bit of background. Last year I got involved in an NSF project, with my boss Ricardo Hausmann and my good friend and colleague Prof. Stephen Kosack. Our aim is to understand what governments do. One could just pull out some fact sheets and budgets, but in our opinion those data sources do not tell the whole story. Governments are complex systems, and as such we need to understand their emergent properties as collections of interacting parts. Long story short, we decided to collect data by crawling the websites of all public agencies for each US state government. As for why, you’ll have to wait until we publish something: the aim of this post is not to convince you that this is a good idea. It probably isn’t, at least in some sense.


It isn’t a good idea not because that data does not make sense. Au contraire, we already see it is very interesting. No, it is a tragic idea because crawling the Web is hard, and it requires some effort to do it properly. Which wouldn’t be necessarily a problem if there wasn’t an additional hurdle. We are not just crawling the Web. We are crawling government websites. (This is when in a bad horror movie you would hear a thunder nearby).

Making you understand the horror of this proposition is exactly the aim of this post. First, how many government websites are really out there? How do we collect them? Of course I was not expecting a single directory for all US states. And I was wrong! Look at this beauty for yourselves: http://www.statelocalgov.net/. The “About” page is pure poetry:

State and Local Government on the Net is the onle (sic) frequently updated directory of links to government sponsored and controlled resources on the Internet.

So up to date that their footer only goes up to 2010 and their news section only includes items from 2004. It also points to:

SLGN Notes, a weblog, [that] was added to the site in June 2004. Here, SLGN’s editors comment on new, redesigned or updated state and local government websites, pointing out interesting or fun features for professional and consumer audiences alike and occasionally cover related news.

Yeah, go ahead and click the link, that is not the only 404 page you’ll see here.


Enough compliments to these guys! Let’s go back to work. I went straight to the 50 state governments’ websites and found an agency directory in each of them. Of course, asking that these directories share the same structure and design is too much. “What are we? Organizations whose aim is to make citizens’ lives easier, or governments?” In any case, from them I was able to collect the flabbergasting amount of 61,584 URLs. Note that this is six times as many as statelocalgov.net has, and it took me a week. Maybe I should start my own company 🙂

Awesome! So it works! Not so fast. Here we hit the first real wall of government technological incompetence. Out of those 61,584, only 50,999 actually responded to my pings. Please note that I already corrected all the redirects: if the link was outdated but the agency redirected you to the new URL, then that connection is one of the 50,999. Allow me to rephrase it in poetry: in the state government directories there are more than ten thousand links that are pure, utter, hopeless garbage nonsense. More than one out of six links in those directories will land you exactly nowhere.
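
For the record, the link check itself is nothing fancy. Here is a minimal sketch of the idea (not the actual crawler; the "urls" list, the timeout value and the use of the requests library are all my assumptions):

import requests

def check_links(urls, timeout=10):
    # Return the subset of urls that answer with something other than an error
    alive = []
    for url in urls:
        try:
            # allow_redirects=True: an outdated link that forwards to the new
            # agency URL still counts as a good connection, as described above
            response = requests.get(url, timeout=timeout, allow_redirects=True)
            if response.status_code < 400:
                alive.append(url)
        except requests.RequestException:
            pass  # dead domain, DNS failure, timeout: one of the ten-thousand-plus garbage links
    return alive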


Oh, but let’s stay positive! Let’s take a look at the ones that actually lead you somewhere:

  • Inconsistent spaghetti-like design? Check. Honorable mention for the good ol’ frameset web design of http://colecounty.org/.
  • Making your website one big image and using <area> tags for the links? Check. That’s some solid ’95 school.
  • Links to websites left to their own devices and purchased by someone else? Check. Passed through Google Translate, that page provides pearls of wisdom like: “To say a word and wipe, There are various wipe up for the wax over it because wipe from”. Maybe I can get a half dozen haiku out of it. (I have more, if you want them.)
  • ??? Check. That’s a Massachusetts town I do not want to visit.
  • Maintenance works due to finish some 500 days ago? Check.
  • Websites mysteriously redirected somewhere else? Check. The link should go to http://www.cityoflaplata.com/, but alas it does not. I’m not even sure what the heck these guys are selling.
  • These aren’t the droids you’re looking for? Check.
  • The good old “I forgot to renew the domain contract”? Check.

Bear in mind that this stuff is part of the 50,999 “good” URLs (and those are scare quotes at their finest). At some point I even gave up noting this stuff down. I saw hacked webpages that had been sitting there for years. I saw an agency providing a useful Google Map of their location, which according to them was in the middle of the North Pole. But all those things will be lost in time, like tears in the rain.
