Understanding reCAPTCHA

One of the things I added to this blog when I moved from my own software to WordPress was the red and yellow box in the comments section, which defends this blog against comment spam by asking commenters to decipher a couple of words. Such challenge-response systems are called CAPTCHAs (a tortured and unmellifluous acronym of “completely automated public Turing test to tell computers and humans apart”). What really caught my imagination about the CAPTCHA I’m using, called reCAPTCHA, is that it uses words from books scanned by the Internet Archive/Open Content Alliance. Thus, at the same time commenters solve the word problems, they are effectively serving as human OCR machines.

To date, about two million words have been deciphered using reCAPTCHA (see the article in Technology Review lauding reCAPTCHA’s mastermind, Luis von Ahn), which is a great start, but by my calculation (at roughly 100,000 words per average book) that is only the equivalent of about 20 books. Of course, it’s really much more than that, because the words in reCAPTCHA are the hardest ones to decipher by machine and are sprinkled among thousands of books.

Indeed, that is the true genius of reCAPTCHA: it “tells computers and humans apart” by first using OCR software to find words computers can’t decipher, then feeding those words to humans, who can decipher them (proving themselves human). Therefore a spammer running OCR software (as many of them do to decipher lesser CAPTCHAs) will have great difficulty cracking it. If you would like an in-depth lesson about how reCAPTCHA (and CAPTCHAs in general) work, take a listen to Steve Gibson’s podcast on the subject.

The brilliance of reCAPTCHA and its simultaneous assistance to the digital commons lead one to ponder: What other aspects of digitization, cataloging, and research could be aided by giving a large, distributed group of humans the bits that computers have great difficulty with?

And imagine the power of this system if all 60 million CAPTCHAs answered daily were reCAPTCHAs instead. Why not convert your blog or login system to reCAPTCHA today?


Jordan says:

Your post brings to mind translation projects that proceed in a sort of Wiki format…individual phrases or sentences that can be translated by an individual user, on a voluntary basis. These large sorts of translation projects, especially of languages like Latin, seem to require the human touch, but breaking up the work into much smaller “bits” spreads out the labor.

Maybe these aren’t as useful or consistent as the work of a single translator or team, but an interesting option, nonetheless.

I started using reCAPTCHA on a site to stop form spam and it seems to be working well.

Unfortunately I don’t think it will work better than other CAPTCHA systems just because it includes a word a machine has already had trouble with.

My understanding is that reCAPTCHA gives the user two words, one it has already deciphered (which serves as the true Turing test) and one that it hasn’t deciphered. The system then uses the input from the commenter to “learn” what the un-deciphered word is.

So I don’t think reCAPTCHA is any more impervious to OCR hacks than other CAPTCHA systems.
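
The two-word scheme described in the comment above can be sketched roughly as follows. This is a toy model only, not reCAPTCHA's actual code: the class name, the `votes_needed` agreement threshold, and the word IDs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class TwoWordChallenge:
    """Toy model: one known control word plus one word OCR could not read."""

    def __init__(self, votes_needed=3):
        # Hypothetical threshold: how many matching guesses settle a word.
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown word id -> guesses so far
        self.resolved = {}                 # unknown word id -> accepted text

    def submit(self, control_answer, control_truth, unknown_id, unknown_guess):
        # Only the control word decides whether the user passes the test.
        if control_answer.strip().lower() != control_truth.lower():
            return False

        # A user who passed is trusted enough to vote on the unknown word.
        guess = unknown_guess.strip().lower()
        if guess and unknown_id not in self.resolved:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.votes_needed:
                self.resolved[unknown_id] = guess
        return True
```

In this sketch a program that cannot read the control word never gets to vote, so the unknown word is only ever "learned" from people who have already passed the test.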

[…] the New York Times’s ethicist, Randy Cohen (no relation to yours truly). I have been a major proponent of reCAPTCHA, the red and yellow box at the bottom of my blog posts that uses words from books […]
