reCaptcha: Stopping Spam While Transcribing Books

CAPTCHAs, those pesky challenge-response tests that many web sites use to determine whether you are human or a spambot, are an annoyance to many users. According to a report in Science (subscription required), users now solve about 100 million CAPTCHAs a day. ReCAPTCHA, a project based at Carnegie Mellon University, has found an ingenious way to harness all this work and, according to the findings published in Science this week, CAPTCHAs could be used to transcribe printed texts at the rate of 160 books a day.

The current implementation of reCAPTCHA is being used by over 40,000 web sites. The basic idea behind reCAPTCHA is that optical character recognition (OCR), even though it is constantly improving, is still unable to cope with texts where the print has faded or a page is slightly damaged. While humans can transcribe a text with about 99% accuracy, OCR software often doesn’t get beyond 80% when dealing with a slightly damaged text.

reCAPTCHA combines traditional OCR with an approach similar to Amazon’s Mechanical Turk. Every text is analyzed by two different OCR programs, and whenever those two programs disagree on a word, it is marked as ‘suspicious.’ Those suspicious words are then fed into reCAPTCHA, which creates a CAPTCHA with both the suspicious word and a known control word. Once a certain number of users have solved the suspicious word with the same result, it becomes a control word itself.
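The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the names `find_suspicious`, `SuspiciousWord`, and `check_challenge` are hypothetical, the OCR outputs are assumed to be already aligned word by word, and the threshold of three matching answers is a made-up stand-in for the "certain number of users" the article mentions.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical value: the article only says "a certain number of users".
AGREEMENT_THRESHOLD = 3


def find_suspicious(ocr_a, ocr_b):
    """Return the positions where two word-aligned OCR outputs disagree."""
    return [i for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


@dataclass
class SuspiciousWord:
    """Collects human answers for one word the OCR programs disagreed on."""
    votes: Counter = field(default_factory=Counter)

    def record(self, answer):
        # Normalize so "Quick " and "quick" count as the same answer.
        self.votes[answer.strip().lower()] += 1

    def resolved(self, threshold=AGREEMENT_THRESHOLD):
        """Return the agreed transcription, or None if there is no consensus yet."""
        if not self.votes:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count >= threshold else None


def check_challenge(control_answer, control_truth, unknown_answer, word):
    """Pass the user only if the control word is right; only then count the vote."""
    if control_answer.strip().lower() != control_truth.lower():
        return False
    word.record(unknown_answer)
    return True
```

In this sketch, a user who gets the control word wrong is rejected and their guess for the suspicious word is discarded, so only answers from users who pass the known half of the challenge contribute to the consensus. Once `resolved()` returns a value, the word could be promoted to a control word, as the article describes.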

Overall, reCAPTCHA achieves an accuracy of 99.1%, which is on par with the accuracy achieved by having two humans type the text and then verify the results.

While it is mostly a proof of concept right now, reCAPTCHA’s developers calculate that the system can be used to transcribe the equivalent of 160 books a day.

The most fascinating aspect of this idea is that it turns mental energy, which would otherwise be wasted, into something useful. Other projects like fold.it, which turns protein folding into a game, or Google’s Image Labeler take a similar approach, but the user has to actively decide to play a game. reCAPTCHA, on the other hand, turns a chore into a useful project.
