Earlier this week, The Wall Street Journal posted an article entitled “‘Scrapers’ Dig Deep for Data on Web”. While the article highlights some important issues surrounding the murky and potentially shady business of Web crawling, it fails to provide a comprehensive story on the uses of Web crawling. In other words, by focusing on one or two companies with spotty business practices, it casts the entire practice of data collection from the Web as something to be feared.
Guest author Shion Deysarkar (@shiondev) is responsible for overall business development at 80legs. In a previous life, he founded and ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.
Why Web Crawling Is Good
There have certainly been cases where Web crawling has gone too far. The PatientsLikeMe.com case highlighted in the article is a great example. However, I would argue that there are far more cases where Web crawling and data collection from the Web have generated real value, not only for companies but for individuals as well.
For instance, aggregate data from the Web helps companies learn what people think about their products. Companies that listen better can better meet the needs of their customers. Another interesting use case is discovering and analyzing potential ad channels: ad networks crawl millions of Web pages to find content relevant to their ad inventory. Crawling also allows companies like Infochimps and Factual to build better, more structured data sets covering everything from property data to sports data. Rather than being scattered around the Web, that data is now centralized for easy consumption and analysis.
A Web Crawling Code of Conduct
Unfortunately, and somewhat understandably, it’s easier to focus on the murky underbelly of Web crawling. People gravitate to stories about organizations doing the wrong thing more than to stories about companies simply running their businesses the right way. 80legs and other companies involved in legitimate Web data collection need to make sure we are not lumped in with those bad actors.
I think a great first step toward this is establishing a “Web Crawling Code of Conduct”. The rules and laws surrounding Web crawling have been hazy at best and show no signs of being clarified. This is not surprising, considering that law tends to play catch-up with technology. However, after some experience in this industry, I feel that the following two rules embody the minimum necessary guidelines for proper Web crawling:
1. Only publicly available sources may be crawled. This means bots cannot log into websites unless the website explicitly allows it.
2. Do not overwhelm a website with crawling requests. Crawling should not meaningfully increase the load or bandwidth a server has to handle. (A brief sketch of a crawler honoring both rules follows this list.)
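To make the two rules concrete, here is a minimal sketch of a fetch routine that follows them, using only Python’s standard library. The delay value, function name, and user-agent string are illustrative assumptions, not settings used by 80legs or any other crawling service.

```python
import time
import urllib.request
from urllib.parse import urlparse

POLITE_DELAY = 2.0   # assumed seconds between requests to the same host
_last_hit = {}       # host -> timestamp of the most recent request to it

def fetch_public_page(url, user_agent="ExampleCrawler/0.1 (+http://example.com/bot)"):
    """Fetch a publicly available page: no cookies, no login, rate-limited per host."""
    host = urlparse(url).netloc

    # Rule 2: pause so a single server is never flooded with requests.
    elapsed = time.time() - _last_hit.get(host, 0.0)
    if elapsed < POLITE_DELAY:
        time.sleep(POLITE_DELAY - elapsed)

    # Rule 1: a plain GET with no credentials or session state, so only
    # content the site already exposes publicly is retrieved.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read()

    _last_hit[host] = time.time()
    return body
```

A real crawler would also handle errors, redirects, and per-host request queues, but the two rules themselves amount to little more than this.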
Some readers may feel I’ve left out practices that should be part of proper Web crawling, such as honoring robots.txt. While I recognize the value those practices have, my personal opinion is that Web data sources and Web data collectors should work together to maximize the value of Web data, and that some common practices hamper that unnecessarily. Further discussion is welcome and eagerly anticipated.
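For crawlers that do choose to honor robots.txt, the check itself is cheap. The sketch below uses Python’s standard library, with a placeholder site and user-agent string standing in for real ones.

```python
import urllib.robotparser

# Placeholder site and user agent; substitute the crawler's own values.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

# True if this user agent is allowed to fetch the given URL.
allowed = parser.can_fetch("ExampleCrawler", "http://example.com/some/page")
```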
While we wait for proper regulation to help distinguish socially aware crawling services acting with best practices in mind from more dubious companies with other interests, perhaps we should move toward creating a formal, independent board. That board could certify, whether officially or unofficially, the crawling companies that adhere to such a code and operate legitimate services.
Photo by homyox