Google will stop at nothing in its quest to index the world’s information. Last year it ate through 100 exabytes of data, but there’s still a lot that it can’t get access to. Known as the deep web (or hidden web, or invisible web, etc.), it is estimated that the majority of online data is hidden safely from Google’s prying eyes — private intranets, unlinked pages, some non-textual content, and until today dynamic content returned via form input was all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.
“For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made,” explained Jayant Madhavan and Alon Halevy in a blog post. “If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
Google, which says that the crawling of dynamic form results doesn’t affect the “crawling, ranking, or selection of other web pages in any significant way,” also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won’t be crawled.
It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never — and should never — get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.
It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That’s mildly disappointing as we were looking forward to befriending Googlebot on MySpace…