Home Google: “We’re Not Doing a Good Job with Structured Data”

Google: “We’re Not Doing a Good Job with Structured Data”

During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google’s Alon Halevyadmitted that the search giant has “not been doing a good job” presenting the structured data found on the web to its users. By “structured data,” Halevy was referring to the databases of the “deep web” – those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means.

Google’s Deep Web Search

Halevy, who heads the “Deep Web” search initiative at Google, described the “Shallow Web” as containing about 5 million web pages while the “Deep Web” is estimated to be 500 times the size. This hidden web is currently being indexed in part by Google’s automated systems that submit queries to various databases, retrieving the content found for indexing. In addition to that aspect of the Deep Web – dubbed “vertical searching” – Halevy also referenced two other types of Deep Web Search: semantic search and product search.

Google wants to also be able to retrieve the data found in structured tables on the web, said Halevy, citing a table on a page listing the U.S. presidents as an example. There are 14 billion such tables on the web, and, after filtering, about 154 million of them are interesting enough to be worth indexing.

Can Google Dig into the Deep Web?

The question that remains is whether or not Google’s current search engine technology is going to be adept at doing all the different types of Deep Web indexing or if they will need to come up with something new. As of now, Google uses the Big Table database and MapReduce framework for everything search related, notes Alex Esterkin, Chief Architect at Infobright, Inc., a company delivering open source data warehousing solutions. During the talk, Halevy listed a number of analytical database application challenges that Google is currently dealing with: schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, and data level synonyms discovery. These challenges are addressed by Infobright’s technology, said Esterkin, but “Google will have to solve these problems the hard way.”

Also mentioned during the speech was how Google plans to organize “aspects” of search queries. The company wants to be able to separate exploratory queries (e.g., “Vietnam travel”) from ones where a user is in search of a particular fact (“Vietnam population”). The former query should deliver information about visa requirements, weather and tour packages, etc. In a way, this is like what the search service offered by Kosmix is doing. But Google wants to go further, said Halevy. “Kosmix will give you an ‘aspect,’ but it’s attached to an information source. In our case, all the aspects might be just Web search results, but we’d organize them differently.”

Yahoo Working on Similar Structured Data Retrieval

The challenges facing Google today are also being addressed by their nearest competitor in search, Yahoo. In December, Yahoo announced that they were taking their SearchMonkey technology in-house to automate the extraction of structured information from large classes of web sites. The results of that in-house extraction technique will allow Yahoo to augment their Yahoo Search results with key information returned alongside the URLs.

In this aspect of web search, it’s clear that no single company has yet to dominate. However, even if a non-Google company surges ahead, it may not be enough to get people to switch engines. Today, “Google” has become synonymous with web search, just like “Kleenex” is a tissue, “Band-Aid” is an adhesive bandage, and “Xerox” is a way to make photocopies. Once that psychological mark has been made into our collective psyches and the habit formed, people tend to stick with what they know, regardless of who does it better. That’s something that’s a bit troublesome – if better search technology for indexing the Deep Web comes into existence outside of Google, the world may not end up using it until such point Google either duplicates or acquires the invention.

Still, it’s far too soon to write Google off yet. They clearly have a lead when it comes to search and that came from hard work, incredibly smart people, and innovative technical achievements. No doubt they can figure out this Deep Web thing, too. (We hope).

About ReadWrite’s Editorial Process

The ReadWrite Editorial policy involves closely monitoring the gambling and blockchain industries for major developments, new product and brand launches, game releases and other newsworthy events. Editors assign relevant stories to in-house staff writers with expertise in each particular topic area. Before publication, articles go through a rigorous round of editing for accuracy, clarity, and to ensure adherence to ReadWrite's style guidelines.

Get the biggest iGaming headlines of the day delivered to your inbox

    By signing up, you agree to our Terms and Privacy Policy. Unsubscribe anytime.

    Gambling News

    Explore the latest in online gambling with our curated updates. We cut through the noise to deliver concise, relevant insights, keeping you informed about the ever-changing world of iGaming and its most important trends.

    In-Depth Strategy Guides

    Elevate your game with tailored strategies for sports betting, table games, slots, and poker. Learn how to maximize bonuses, refine your tactics, and boost your chances to beat the house.

    Unbiased Expert Reviews

    Honest and transparent reviews of sportsbooks, casinos and poker rooms crafted through industry expertise and in-depth analysis. Delve into intricacies, get the best bonus deals, and stay ahead with our trustworthy guides.