Oct 18th 2011
AS A child, Babbage struggled to master the well-formed curlicues and prim horizontal strokes of cursive handwriting. He never quite got the hang of it, only to be rescued by the digital age. Now, though, researchers led by Jeff Yan, of Newcastle University, have found that loops and crosses may prove critical online, too. In a paper co-authored with two colleagues he shows how these calligraphic fripperies can unlock the visual puzzle dubbed the Completely Automated Public Turing test to tell Computers and Humans Apart, but better known as CAPTCHAs.
The term CAPTCHA was coined in 2000 by Luis von Ahn and his fellow academics at Carnegie Mellon University (CMU). The idea was to stop spammers (and later criminals) creating accounts through which they could join forums and send e-mail, by imposing a hurdle that would be tough for computers to scale but easy for human beings. (This newspaper discussed the squiggles and potential future replacements in depth in 2009.)
Dr Yan’s group looked at a popular CAPTCHA technique known as “crowding characters together” (CCT) in which letters simply overlap. CCTs were considered a hard computer science problem, and no algorithm had yet been capable of disentangling the twists and skews of layered text, whereas the human visual cortex performs the task swiftly. The team’s method can pick out the telltale holes in letters like “a” or “p”, the vertical dashes in “t” and “f” or dots in “i” or “j”. It also captures letters like “s” with three horizontal segments on top of each other (and distinguishes these from “e” or “a”, which have a similar property, by dismissing characters where lines intersect). Their assorted techniques recognise anywhere between half and nearly all letters and numbers, depending on the particular CAPTCHA algorithm in use.
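To give a flavour of what spotting “telltale holes” might involve, here is a minimal sketch in Python. It is not the researchers’ code, which the article does not describe in detail. It assumes a glyph has already been isolated and reduced to a grid of ink (“#”) and background (“.”) cells, and it counts the background regions that ink fully encloses. The function name count_holes and the toy glyphs are invented for illustration.

```python
from collections import deque


def count_holes(glyph):
    """Count background regions fully enclosed by ink ('#') in a text glyph.

    Illustrative assumption: the glyph is a list of strings, '#' marks ink
    and any other character marks background.
    """
    cols = max(len(row) for row in glyph)
    # Pad with a one-cell background border so that all background outside
    # the letter forms a single connected region.
    grid = [[False] * (cols + 2)]
    for row in glyph:
        grid.append([False] + [ch == "#" for ch in row.ljust(cols)] + [False])
    grid.append([False] * (cols + 2))
    height, width = len(grid), cols + 2
    seen = [[False] * width for _ in range(height)]

    def flood(start_r, start_c):
        # Breadth-first fill over 4-connected background cells.
        queue = deque([(start_r, start_c)])
        seen[start_r][start_c] = True
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < height and 0 <= nc < width
                        and not grid[nr][nc] and not seen[nr][nc]):
                    seen[nr][nc] = True
                    queue.append((nr, nc))

    flood(0, 0)  # mark everything reachable from outside the letter
    holes = 0
    for r in range(height):
        for c in range(width):
            if not grid[r][c] and not seen[r][c]:
                holes += 1  # background the outside cannot reach: a hole
                flood(r, c)
    return holes


LETTER_P = ["####", "#..#", "####", "#...", "#..."]
LETTER_T = ["#####", "..#..", "..#.."]
LETTER_B = ["###.", "#..#", "###.", "#..#", "###."]

print(count_holes(LETTER_P), count_holes(LETTER_T), count_holes(LETTER_B))
```

On these toy shapes the count separates a “p” (one hole) from a “t” (none) and a “B” (two). A real attack would first have to extract such features from overlapping, distorted text, which is the hard part.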
The researchers tested their algorithm by feeding it samples from Google’s CAPTCHA trove. They also looked at the more elaborate ReCAPTCHA, which Google bought in 2009 together with a spin-off set up by CAPTCHA’s inventors at CMU, and which has since been widely adopted on the internet. The results suggest that the method can crack nearly half of all CAPTCHAs and one-third of ReCAPTCHAs. Even if those numbers exaggerate the system’s efficacy tenfold, though, it would still represent a significant blow to the CAPTCHA model.
Dr Yan does, however, offer some solace. He suggests that adorning letters with false loops and crosses mimicking those in actual letters ought to stump his algorithm and others like it while still being relatively straightforward for human beings to interpret. Systems might also make less use of words containing the vulnerable characters.
In fact, some websites have already begun adapting. Readers might have noticed that both of Google’s CAPTCHA systems have suddenly become more difficult to parse. Your correspondent thought he might have had a small stroke after an hour testing ReCAPTCHAs on a site he runs. Google has declined to confirm that the change was prompted by Dr Yan’s research, but it did admit that it has tweaked its CAPTCHAs several times since the research was conducted. (The paper was ready as early as May but Dr Yan and his colleagues feared that releasing it before developers had time to come up with countermeasures could prove disruptive.)
In 2009 CMU’s Dr von Ahn told Babbage that computer vision might catch up with CAPTCHAs in as little as five years, making it impossible to produce text that only human brains could tease apart correctly. Dr Yan declines to make similar predictions. But for all his system’s cleverness, he thinks CAPTCHAs will continue to baffle digital eyes for a while yet.