The popular question-and-answer startup Quora only allows the largest search engines to index its site. As Gabe Rivera of Techmeme pointed out yesterday, its robots.txt file explicitly grants Google, Bing, Blekko and other big players access, but excludes everyone else. If large sites had imposed these restrictions back when Google was starting out, it might never have succeeded and we’d still be stuck with AltaVista. As more publishers move to this whitelist approach, are they stifling innovation?
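To make the whitelist approach concrete, here is a minimal sketch of how a well-behaved crawler would interpret such a policy, using Python's standard urllib.robotparser module. The policy text, the example.com URL and the can_crawl helper are illustrative assumptions, not a copy of Quora's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style policy (an assumption for illustration,
# not any real site's file): named crawlers may fetch everything,
# every other user agent is shut out.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(user_agent, url, robots_txt=WHITELIST_ROBOTS_TXT):
    """Return True if the policy text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-question"  # placeholder URL
    for agent in ("Googlebot", "bingbot", "DuckDuckBot"):
        print(agent, can_crawl(agent, page))
```

Under this policy the two named crawlers are allowed and any other user agent, such as a newcomer's bot, is refused. The check runs inside the crawler itself, so the file only restrains bots that choose to honor it.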
Gabriel Weinberg has been struggling to persuade Facebook to add his DuckDuckGo search engine to its list of approved crawlers, with no luck. Concerned about mining of its public profiles, Facebook started requiring search engines last year to sign a legal agreement covering the usage of its data. Unfortunately, it seems the process has turned into a barrier for fledgling search companies like Gabriel’s.
Although he is happy to enter into that contract, he hasn’t heard back after several months. While he’s still able to show Facebook pages thanks to API partners like Bing, this leaves him unable to run his own algorithms to rank and display the results optimally. He’s frustrated by the trend towards whitelisting, pointing out that malicious or underhand scrapers ignore the policy file: “Bad bots don’t respect it anyway”. In his view it’s a big drag on innovation too – “really you’re just hurting startups that may use your data in cool ways”.
Both Quora and Facebook offer APIs to access their data, so why do startups need to crawl their sites? After all, web page scraping is often associated with unsavory scammers and copyright infringers. The real loss is that APIs only allow you to ask the questions that the interface designers have anticipated. For example, Gabriel was hoping to build directories listing the Facebook pages for local businesses by location and type, together with snippets of information about them, just as he does for other categories of sites on the web. There’s no way to gather that information through the Facebook API, so without crawling access he’s unable to implement that feature.
As traditional search companies struggle to pull relevant results from an increasing deluge of low-quality content, we need innovative startups to pioneer new approaches. Without the openness that made it possible for Google to grow, the next big thing in search may never happen.
Photo by David Goehring