Understanding reCAPTCHA

One of the things I added to this blog when I moved from my own software to WordPress was the red and yellow box in the comments section, which defends this blog against comment spam by asking commenters to decipher a couple of words. Such challenge-response systems are called CAPTCHAs (a tortured and unmellifluous acronym of “completely automated public Turing test to tell computers and humans apart”). What really caught my imagination about the CAPTCHA I’m using, called reCAPTCHA, is that it uses words from books scanned by the Internet Archive/Open Content Alliance. Thus, at the same time commenters solve the word problems, they are effectively serving as human OCR machines.

To date, about two million words have been deciphered using reCAPTCHA (see the article in Technology Review lauding reCAPTCHA’s mastermind, Luis von Ahn), which is a great start, but by my calculation (at 100,000 words per average book) only the equivalent of about 20 books. Of course, it’s really much more than that, because the words in reCAPTCHA are the hardest ones to decipher by machine and are sprinkled among thousands of books.
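For anyone who wants to check the back-of-the-envelope figure, here is the arithmetic as a tiny Python sketch. The function and constant names are my own, and the 100,000-words-per-book average is the rough assumption stated above, not a measured value.

```python
# Rough estimate: how many books' worth of words has reCAPTCHA deciphered?
WORDS_DECIPHERED = 2_000_000  # "about two million words", the figure quoted in the post
WORDS_PER_BOOK = 100_000      # assumed length of an average book, as in the post


def book_equivalents(words: int, words_per_book: int = WORDS_PER_BOOK) -> float:
    """Return how many average-length books the given number of words would fill."""
    return words / words_per_book


print(book_equivalents(WORDS_DECIPHERED))  # 20.0
```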

Indeed, that is the true genius of reCAPTCHA: it “tells computers and humans apart” by first using OCR software to find words computers can’t decipher, then feeding those words to humans, who can decipher them (proving themselves human). Therefore a spammer running OCR software (as many of them do to decipher lesser CAPTCHAs) will have great difficulty cracking it. If you would like an in-depth lesson about how reCAPTCHA (and CAPTCHAs in general) works, take a listen to Steve Gibson’s podcast on the subject.
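To make the idea concrete, here is a toy Python model of how such a system might check answers. It is only a sketch, assuming the commonly described two-word design: each challenge pairs a control word whose answer is already known with an unknown word that OCR failed on. The class name, method names, and the agreement threshold are all hypothetical and are not taken from reCAPTCHA itself.

```python
from collections import Counter, defaultdict
from typing import Optional

AGREEMENT_THRESHOLD = 3  # hypothetical: matching human answers needed to accept a reading


class ChallengeChecker:
    """Toy model of a two-word challenge: one known control word, one unknown word."""

    def __init__(self) -> None:
        # Maps an unknown word's id to a tally of the transcriptions humans typed.
        self.votes = defaultdict(Counter)

    def submit(self, control_answer: str, typed_control: str,
               unknown_id: str, typed_unknown: str) -> bool:
        """Return True if the user passes; if so, record their guess for the unknown word."""
        if typed_control.strip().lower() != control_answer.lower():
            return False  # failed the control word: reject and discard the guess
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def resolved(self, unknown_id: str) -> Optional[str]:
        """Return the agreed transcription once enough humans concur, else None."""
        tally = self.votes[unknown_id]
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= AGREEMENT_THRESHOLD else None


checker = ChallengeChecker()
for _ in range(3):
    checker.submit("morning", "morning", "scan-17", "upon")
print(checker.resolved("scan-17"))  # upon
```

In this sketch only the control word decides whether a user passes; the unknown word is accepted as read once several passing users agree on it.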

The brilliance of reCAPTCHA and its simultaneous assistance to the digital commons lead one to ponder: What other aspects of digitization, cataloging, and research could be aided by giving a large, distributed group of humans the bits that computers have great difficulty with?

And imagine the power of this system if all 60 million CAPTCHAs answered daily were reCAPTCHAs instead. Why not convert your blog or login system to reCAPTCHA today?

3 thoughts on “Understanding reCAPTCHA”

  1. Your post brings to mind translation projects that proceed in a sort of Wiki format…individual phrases or sentences that can be translated by an individual user, on a voluntary basis. These large sorts of translation projects, especially of languages like Latin, seem to require the human touch, but breaking up the work into much smaller “bits” spreads out the labor.

    Maybe these aren’t as useful or consistent as the work of a single translator or team, but an interesting option, nonetheless.

  2. I started using reCAPTCHA on a site to stop form spam and it seems to be working well.

    Unfortunately I don’t think it will work better than other CAPTCHA systems just because it includes a word a machine has already had trouble with.

    My understanding is that reCAPTCHA gives the user two words, one it has already deciphered (which serves as the true Turing test) and one that it hasn’t deciphered. The system then uses the input from the commenter to “learn” what the un-deciphered word is.

    So I don’t think reCAPTCHA is any more impervious to OCR hacks than other CAPTCHA systems.

  3. Pingback: Dan Cohen’s Digital Humanities Blog » Blog Archive » A reCAPTCHA Dilemma?
