Editor’s note: we offer our long-term sponsors the opportunity to write ‘Sponsor Posts’ and tell their story. These posts are clearly marked as written by sponsors, but we also want them to be useful and interesting to our readers. We hope you like the posts and we encourage you to support our sponsors by trying out their products.
We at Hakia are proud to announce our upcoming commercial ontology, perhaps the world’s first. What is a commercial ontology? If you’re asking this question you have just touched on an important distinction: fantasy versus reality. In the context of the Web, a commercial ontology is a realistic version of an ontology, as we explain below.
Realities of the Web
Hakia has introduced two innovations in building its commercial ontology (CO). The first is the development of concepts and lexicons under strict guidelines that reflect the realities of Web operations. What are these realities? Most search queries on the Web reflect a single dimension of intent, almost exclusively relevant to commercial topics. “Commercial topics” here must be taken in the broadest sense possible. For example, if you were looking for “the benefits of foot massage” or “the director of the movie Last Emperor,” your queries would fall into a commercial pattern. One particular distinction of commercial-pattern queries is that they come in short packages, either including a name (onomasticon) or referring to something sold, bought, watched, heard, etc.
In contrast, many (if not all) ontologies that have been built to date (or claimed to exist) are focused on the use of language in the general sense, but not in the sense of commercial patterns on the Web. Therefore, their usefulness when tackling Web search queries is greatly compromised, sometimes to the point of absolute failure. If such an ontology could disambiguate a dozen different senses of the word “kill,” it would be sad news if the last 100,000 queries in the search logs did not include a single occurrence of the word “kill.” Like drowning in two-inch-deep water, such ontologies do not use their disambiguation capacities for nearly 80% of queries because the queries include nothing but onomasticons or are too short (under-articulated).
The Sequence Approach
The second innovation in the CO is the use of word sequences instead of single words. A single word, like “kill,” is the most ambiguous state of information and is hardly used in human communication without a strong implied context. As a result, building natural-language processing (NLP) systems by taking individual words as units of computation is an invitation to disaster.
In contrast, word sequences (two or more words) are inherently safe and highly descriptive. Take “road kill,” for example. This sequence describes the corpse of an animal killed on the road by a passing vehicle. If a language processing system takes the sequence of words as a unit of computation, 99% of the ambiguity problem vanishes. There is no need to process the words “kill” and “road” separately, trace their senses, and locate convergence to identify the meaning of “road kill” if you can just take the sequence “road kill” itself as your unit of computation for mapping. This is depicted below:
Note the number of traces required in a conventional ontology approach compared to the sequence approach. The sequence approach requires a lot of data storage space (which is dirt cheap), whereas the conventional ontology approach requires a lot of CPU time for a simple mapping task (which is expensive). But the bad news does not stop there. The trace routes in a conventional ontology require manual work (impossible to automate), whereas a sequence-based ontology can be easily built via automation.
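As a rough illustration of the trade-off described above, here is a minimal Python sketch. The sense lists, the convergence table, and the function names are all hypothetical toy data invented for this example; they are not Hakia's lexicon or implementation. The word-by-word function traces every combination of senses, while the sequence function performs a single stored lookup.

```python
from itertools import product

# Hypothetical toy sense inventory for two words (illustration only).
WORD_SENSES = {
    "road": [
        "paved route for vehicles",
        "path toward a goal",
        "touring away from home",
    ],
    "kill": [
        "cause to die",
        "animal that was killed",
        "terminate a process",
        "a small stream",
    ],
}

# Hypothetical table of sense pairs that converge on a combined meaning.
CONVERGENT = {
    ("paved route for vehicles", "animal that was killed"):
        "animal killed on a road by a vehicle",
}

# Hypothetical sequence table: the whole phrase is the unit of lookup.
SEQUENCES = {
    "road kill": "animal killed on a road by a vehicle",
}


def word_by_word(phrase):
    """Trace every combination of word senses; return (meaning, traces)."""
    sense_lists = [WORD_SENSES[word] for word in phrase.split()]
    traces = 0
    meaning = None
    for combo in product(*sense_lists):
        traces += 1
        if meaning is None and combo in CONVERGENT:
            meaning = CONVERGENT[combo]
    return meaning, traces


def by_sequence(phrase):
    """Look the phrase up as a single unit; always one trace."""
    return SEQUENCES.get(phrase), 1


print(word_by_word("road kill"))
print(by_sequence("road kill"))
```

With this toy data, the word-by-word route examines 3 x 4 = 12 sense combinations to reach the same meaning that the sequence table returns in one lookup, at the cost of storing the phrase explicitly.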
Perhaps not everyone will understand the second point above. Nevertheless, the scalability and performance of the end product will speak for themselves when Hakia puts the testing platform online.
Usage of the Commercial Ontology
The immediate use of the CO is for search queries and document characterizations that are not tied to any advertising in conventional systems. This unrecognized domain of queries and characterizations means lost revenue. Hakia’s CO is designed to fill this gap. For example, if the search query or page characterization is “beat generation,” the CO can map it to “literature” on the fly. As a result, systems using the CO will have a much deeper understanding of the incoming terms, and thus will be able to recognize the underlying intent beyond the face value of the words. The same capability can be used in a number of places other than advertising with the same effect.
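The kind of on-the-fly mapping described above might look like the following minimal Python sketch. Only the “beat generation” to “literature” pair comes from this post; the other table entries, the table name, and the longest-match strategy are assumptions made for illustration, not a description of Hakia's system.

```python
# Hypothetical sequence-to-topic table. Only the "beat generation" entry
# is taken from the article; the other entries are invented examples.
SEQUENCE_TOPICS = {
    "beat generation": "literature",
    "foot massage": "health",
    "last emperor": "movies",
}


def map_query(query, table=SEQUENCE_TOPICS):
    """Return (sequence, topic) for the longest known word sequence
    of two or more words in the query, or None if nothing matches."""
    words = query.lower().split()
    n = len(words)
    # Try the longest candidate sequences first; never fall back to
    # single words, since those are treated as too ambiguous.
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in table:
                return candidate, table[candidate]
    return None


print(map_query("beat generation"))
print(map_query("the benefits of foot massage"))
print(map_query("kill"))
```

A single-word query such as “kill” returns None here by design, which mirrors the argument that isolated words carry too little context to map safely.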
Stay tuned for the release of the first version of Hakia’s commercial ontology.