By Dr. Riza C. Berkan, Founder & CEO, hakia.com
Editor’s Note: This is a guest post by the CEO of Hakia, Dr Riza C. Berkan. I want to stress that this post is NOT an advertorial – in fact I made it a condition of publication that the post should focus on the theory of semantic search and it should mention Hakia’s competition, both of which Dr Berkan has done. I should also mention that while Hakia ads sometimes appear on this site, they are managed separately by FM Publishing. In any case the reason for this post by Dr Berkan is purely to explore the topic of semantic search and try to get a conversation going. Semantic search is seen as one of the next-generation search methods that may challenge Google, so the idea with this post is to understand it better – and perhaps debate its future in the comments.
How satisfied search engine users are today is an ongoing debate. However, there is wide consensus, from a scientific viewpoint, on the competency of the current search engines: they are halfway to the target and there is huge room for improvement. Semantic search is now under the magnifying glass, and the question is “can semantic search be an antidote for poor relevancy?”
Let’s start with “what is semantic search?” Academically speaking, semantic search ought to be a system that understands both the user’s query and the Web text using cognitive algorithms similar to those of the human brain, then brings results that are dead on target (right context) at first glance (without requiring the user to open the Web page for further investigation). There are several ideas on how to build such a system.
But before looking into these variations, let’s clarify one thing. A system cannot be called “semantic” if it does not encapsulate the knowledge of languages. Given this basic requirement, we have to exclude all those fancy algorithms that rely on collecting statistics of links, symbols, words, clicking behaviors, and so forth. Statistics is a tool, not a model of a solution. To go the distance, we need a deterministic model of a language processing solution. We need algorithms that match the meaning of concepts (rather than mere words) and emulate “understanding.”
For example, the query “what is Palladium useful for?” may bring search results related to the London Palladium Theatre by statistical methods (a popular subject) as opposed to the actual meaning of the query which is not very popular. A semantic algorithm can easily identify that “useful for” implies the element Palladium.
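To make the Palladium example concrete, here is a minimal sketch in Python. It is only a toy: the hand-written `SENSES` table, its cue phrases, and the `disambiguate` function are all assumptions made for illustration, and a simple cue lookup like this is far short of the language knowledge a real semantic engine would need. It is not a description of how hakia or any other engine works. The point is only that the choice of sense is driven by the wording of the query, not by how popular each sense is.

```python
# Hypothetical, hand-built sense inventory (an assumption for illustration).
# Each sense of a term lists a few cue phrases that point to that sense.
SENSES = {
    "palladium": {
        "chemical element": {"useful for", "used in", "catalyst", "alloy", "metal"},
        "London Palladium Theatre": {"tickets", "show", "london", "theatre", "seating"},
    }
}


def disambiguate(query: str, term: str) -> list[str]:
    """Return the senses of `term` whose cue phrases occur in `query`.

    If no cue phrase matches, every known sense is returned, since the
    query alone does not say which one is meant.
    """
    text = query.lower()
    senses = SENSES[term]
    matched = [
        name for name, cues in senses.items()
        if any(cue in text for cue in cues)
    ]
    return matched or list(senses)


if __name__ == "__main__":
    print(disambiguate("what is Palladium useful for?", "palladium"))
```

With this table, the query “what is Palladium useful for?” selects the chemical element because of the phrase “useful for”, regardless of how many pages or clicks the theatre attracts.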
The two basic views of a semantic search are identified by the location of the semantic resources to be implanted. The first view is to embed the semantic resources in the Web pages themselves. It is called the “Semantic Web”. Why not compose Web pages in a structure that is semantics friendly? The second approach is to locate the semantic resources in search engines which deploy algorithms that use them. This is called “Semantic Search Engine” and works on any text.
The “Semantic Web” approach has been around for a long time now. Unfortunately, it is based on the unrealistic assumption that every Web author will abide by the complex rules of semantics – not to mention the education it requires – and place content in the correct buckets of mysteriously unified standards. Another form of this approach may be to design Web factories that crank out refined Web pages when fed ordinary Web pages. Of course, if there is more than one factory, you have the standards issue again. In this day and age of fast content production, the Semantic Web seems to be more idealism than realism.
The option of “Semantic Search Engine” has yet to be tested. My company hakia, along with others like Powerset, Cognition Search, and Lexxe, is taking steps in this new direction. There are challenges with this approach as well. First and foremost, the knowledge of languages must be built in a structure that allows a scalable and speedy search process. Building such resources is an expensive, tedious, and time-consuming endeavor. Then, all the Web pages must be analyzed using this system to prepare for a retrieval platform, which is another time-consuming process. But when all of this is done properly, users will start to experience something totally new. Let me emphasize the word “properly” here, which is an entirely new discussion point.
One of the first impacts of semantic search engines will be on the handling of long-tail queries. Without relying on statistics, semantic algorithms can analyze long-tail queries on the fly and bring back search results in the accurate context. With such a capability, we are talking about finding answers to longer than usual, complex, and unpopular queries.
Let’s make no mistake about it. The long tail is the bottom part of the iceberg under the water. Philosophically, the number of long-tail queries is infinite, whereas the tip-of-the-iceberg queries can fit on one large hard drive. Popularity algorithms fail at long-tail queries (by definition) because there is never enough statistical sampling.
Many people do not realize that long-tail queries are partly personal queries (uniquely unpopular and complex, reflecting individual personalities). Thus, the idea of “personalized search” actually requires semantic capabilities, without the need for tracking the user’s behavior, unless it is being tracked for psychological profiling.
By a similar argument, queries against dynamic content are also long-tail queries, because the value of dynamic content, like news, decays so fast that there is no time to collect statistics. By the time the link referrals are made, or click statistics are collected, the content is no longer in demand. Therefore, a semantic approach is very effective in handling dynamic content and can unleash its full power the second the content is born.
Semantic search is definitely an antidote for poor relevancy; but only time will tell how well this can be done.
I will close this with a few commonly asked questions.
Q. How can a semantic search engine recognize a popular Web page compared to an unpopular one for a given query term(s)?
A. A semantic search engine recognizes the correct context for a given query term(s). Once the context is correct, popularity becomes irrelevant, and credibility must be questioned. Detecting the credibility of a Web page is a relatively easy task. As a result, if you have the correct context from a credible source, the job is done. You can test this logic for any query today. The popularity method is a crude approximation that substitutes for these capabilities.
Q. If the user types “madonna”, how would a semantic search engine understand the intent of the user? (i.e., is it the artist, or the religious figure?)
A. A semantic search engine is not a psychic. Thus, attempting to guess the intent is futile for an under-represented query. But the solution is easy: just give the user search results for all possible senses of the word. Even better, categorize them neatly. This is within the design envelope of semantic search engines.
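As a rough illustration of “categorize them neatly”, the sketch below groups search results by word sense in Python. It assumes that each result has already been labeled with a sense by some upstream semantic analysis, which is the hard part and is not shown here. The function name `group_by_sense`, the sample titles, and the sense labels are all made up for the example.

```python
from collections import defaultdict


def group_by_sense(results):
    """Group (title, sense) pairs into a dict mapping sense -> list of titles.

    The sense labels are assumed to come from an earlier semantic analysis
    step; this function only arranges the results for display.
    """
    grouped = defaultdict(list)
    for title, sense in results:
        grouped[sense].append(title)
    return dict(grouped)


# Made-up results for the ambiguous query "madonna".
sample_results = [
    ("Madonna discography", "artist"),
    ("Madonna and Child in Renaissance painting", "religious figure"),
    ("Madonna tour dates", "artist"),
]

if __name__ == "__main__":
    for sense, titles in group_by_sense(sample_results).items():
        print(sense)
        for title in titles:
            print("  " + title)
```

The engine does not have to guess which Madonna the user meant. It presents every sense it found, each under its own heading.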
Q. How can a semantic search engine be manipulated by spam pages?
A. If done properly, a semantic search engine cannot be manipulated by text. Because it specializes in detecting the right context, spammers will have to put the right context for the right query, which is no longer spam by definition. Abuses related to images and video are possible, but these kinds of abuses are common today and can be detected in different ways.
Q. Will semantic search take over today’s search engines?
A. In the long run, it most likely will. Again, this depends on how well it is done. Once the long-tail searches start to show the difference, it will probably have a domino effect. If people are satisfied in the complex query domain, they are more likely to switch for simple queries as well. Let’s remember that there is no cost to switch.
Q. There were previous failed attempts at natural language search engines. Why would this work now?
A. Natural Language is a broad term that includes all sorts of things. Previous attempts have failed mostly because they were not done properly, and the methods used were not based on proper semantic principles. Some of them were merely statistical methods very similar to the conventional search engines. Others were behavior-tracking AI applications. And some relied on human labor to keep up with question answering. There are so many ways of doing it improperly, and only one way of doing it right.
Q. How would the advertising systems be affected by semantic search?
A. The impact will be very big, perhaps more than the search itself. A semantic advertising system, which can detect the right context most of the time in a consistent manner, means a huge jump in ROI.
Q. What is the single most serious problem facing semantic search today?
A. Misconceptions and hype. Business continuity must rely on honest declarations of what is to be expected in accordance with the pace of development. Semantic search is a difficult technological endeavor; it takes time and patience. Investments with short-term agendas will hurt this newly emerging technology sector.