Is It Time For a Web Crawling Code of Conduct?

Earlier this week, The Wall Street Journal posted an article entitled “‘Scrapers’ Dig Deep for Data on Web”. While the article highlights some important issues surrounding the murky and potentially shady business of Web crawling, it fails to provide a comprehensive story on the uses of Web crawling. In other words, by focusing on one or two companies with spotty business practices, it casts the entire practice of data collection from the Web as something to be feared.

Guest author Shion Deysarkar (@shiondev) is responsible for overall business development at 80legs. In a previous life, he founded and ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.

Why Web Crawling Is Good


There have certainly been cases where Web crawling has gone too far. The PatientsLikeMe.com case highlighted in the article is a good example. However, I would argue that there are far more cases where Web crawling and data collection from the Web have generated real value – not only for companies, but for individuals as well.

For instance, aggregate data from the Web helps companies learn what people think about their products. Companies that listen better can meet the needs of their customers better. Another interesting use case is discovering and analyzing potential ad channels. Ad networks crawl millions of Web pages to find content relevant to their ad inventory. Crawling also allows companies like Infochimps and Factual to build better, more structured data sets covering anything from property data to sports data. Rather than being scattered around the Web, this data is now centralized for easy consumption and analysis.

A Web Crawling Code of Conduct

Unfortunately, and somewhat understandably, it’s easier to focus on the murky underbelly of Web crawling. People gravitate more to stories about organizations doing the wrong thing than stories about companies just running their businesses the right way. 80legs and other companies involved in legitimate Web data collection need to make sure we are not grouped in with the other organizations.

I think a great first step toward this is establishing a “Web Crawling Code of Conduct”. The rules and laws surrounding Web crawling have been hazy at best and show no signs of being clarified. This is not surprising, considering that law tends to play catch-up with technology. However, after some experience in this industry, I feel that the following two rules embody the minimum necessary guidelines for proper Web crawling:

1. Only publicly available sources may be crawled. This means bots cannot log into websites, unless explicitly allowed by the website.

2. Do not overwhelm a website with crawling requests. Crawling requests should not significantly increase the amount of bandwidth needed by the server.
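The second rule is straightforward to enforce in practice. As a rough illustration (not any particular crawler's implementation), here is a minimal sketch of a per-host politeness scheduler: before each request, the crawler asks how long it must wait so that requests to the same host are spaced at least a fixed delay apart, while requests to different hosts proceed immediately. The class name and delay value are hypothetical choices for the example.

```python
import time
from urllib.parse import urlparse

class PoliteScheduler:
    """Enforces a minimum delay between requests to the same host,
    so crawling does not noticeably add to a server's load (rule 2)."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # host -> timestamp of the last request

    def wait_time(self, url, now=None):
        """Return how many seconds to wait before fetching this URL."""
        if now is None:
            now = time.monotonic()
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0  # first request to this host: no wait
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now=None):
        """Record that a request to this URL's host was just made."""
        if now is None:
            now = time.monotonic()
        self.last_request[urlparse(url).netloc] = now
```

A crawler worker would call `wait_time()`, sleep for that long, issue the request, then call `record()`. Real crawlers often go further – adjusting the delay to the server's response times, for example – but even this simple spacing keeps a bot from hammering any single site.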

Some readers may feel I’ve left out certain aspects that should be included in proper Web crawling, such as following robots.txt and other practices. While I recognize the value that those practices have, my personal opinion is that Web data sources and Web data collectors should work together to maximize the value of Web data, and that some common practices hamper that unnecessarily. Further discussion is welcome and eagerly anticipated.

While we wait for proper regulations that distinguish socially aware crawling services from more dubious companies, perhaps we should move toward creating a formal, independent board. Such a board could certify, officially or unofficially, the crawling companies that adhere to a code like this one and operate legitimate services.
