Editor's note: we offer our long-term sponsors the opportunity to write 'Sponsor Posts' and tell their story. These posts are clearly marked as written by sponsors, but we also want them to be useful and interesting to our readers. We hope you like the posts and we encourage you to support our sponsors by trying out their products.
Many of the recent real-time search engines are based on Twitter. They use the URLs enclosed in tweets to discover and rank new and popular pages. In this post, we'll take a look at the quantitative structure of the underlying foundation, to determine the feasibility and limits of this approach. We'll also look at how to overcome these limitations by using the implicit Web.
You may recently have seen the interesting visualization of Twitter statistics. It shows that, as with other social services, only a small fraction of users actively contribute.
But it also shows another fact: that those people who contribute publish an even smaller fraction of the information they know.
Both of these factors account for the huge difference in efficiency between implicit and explicit voting. Explicit voting, as the name implies, requires users to actively express interest in a page; for example, by tweeting a link. Implicit voting requires no deliberate action on the part of the user; a simple visit to a Web page would count as a vote.
A Quick Calculation
According to Nielsen, the number of visited Web pages per person per month is 1,591.
Twitter's 44.5 million users visit 1.6 million Web pages per minute and explicitly vote for only 10,000 per minute. That is to say, implicit voting and discovery generate 160 times more attention data than explicit voting.
This means that 280,000 implicit votes could provide as much information as 44.5 million explicit votes. Put another way, as many Web pages are implicitly discovered during one day as there are Web pages explicitly discovered during half a year.
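The arithmetic behind these figures can be reproduced with a short sketch. The inputs are the numbers quoted above; the 30-day month and all function names are assumptions made for illustration. Because the text rounds to 1.6 million and 160, the unrounded results differ slightly.

```python
# Figures quoted in the text.
PAGES_PER_PERSON_PER_MONTH = 1_591   # Nielsen figure
TWITTER_USERS = 44_500_000
EXPLICIT_VOTES_PER_MINUTE = 10_000

# Assumption: a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60


def implicit_votes_per_minute(users=TWITTER_USERS,
                              pages=PAGES_PER_PERSON_PER_MONTH):
    """Page visits per minute, each counted as one implicit vote."""
    return users * pages / MINUTES_PER_MONTH


def implicit_to_explicit_ratio():
    """How many implicit votes occur for every explicit vote."""
    return implicit_votes_per_minute() / EXPLICIT_VOTES_PER_MINUTE


def equivalent_implicit_users():
    """Users whose implicit votes match all explicit votes on Twitter."""
    return TWITTER_USERS / implicit_to_explicit_ratio()


if __name__ == "__main__":
    print(f"implicit votes per minute: {implicit_votes_per_minute():,.0f}")
    print(f"implicit/explicit ratio:   {implicit_to_explicit_ratio():.0f}")
    print(f"equivalent implicit users: {equivalent_implicit_users():,.0f}")
```

The ratio also explains the day versus half-year comparison: one day of implicit votes corresponds to roughly 160 days of explicit votes, which is a little over five months.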
This dramatically shows the limits of Web search based solely on explicit votes and mentions, limits that could be overcome by using the implicit Web.
Beyond the Mainstream
This becomes even more important if we look beyond mainstream topics and the English language. Then it becomes simply impossible to achieve the critical mass of explicit votes needed to have statistically significant attention-based ranking or popularity-based discovery.
Time and Votes Are Precious
Time is also a crucial factor, especially with real-time search. We want to be able to discover new pages as soon as possible. And we want to assess almost instantly how popular those new pages are. If we fail to reliably rank a page quickly, it will get buried in the noise. But speed and vote count pull against each other: a page accumulates votes only over time, so the sooner we try to rank it, the fewer votes we have to go on.
Again, a much higher frequency of implicit votes would help.
Relevance vs. Equality
We could also improve on explicit votes. But we should not treat them as being equal, because they are not. We trust some voters more than others, and our interests overlap with some more than others, for the very same reason that we follow some people and not others. Weighting votes accordingly helps us get more value and meaning out of that very first vote.
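As a minimal sketch of what unequal votes could look like, the following assumes each voter carries a trust weight between 0 and 1 and that a page's score is the sum of the weights of its voters. The function name, the weights, and the default for unknown voters are all hypothetical; the post does not prescribe a particular weighting scheme.

```python
def weighted_scores(votes, trust, default_trust=0.1):
    """Sum trust weights per URL.

    votes: iterable of (user, url) pairs, one per explicit vote.
    trust: dict mapping user -> weight; unknown users get default_trust.
    """
    scores = {}
    for user, url in votes:
        scores[url] = scores.get(url, 0.0) + trust.get(user, default_trust)
    return scores


if __name__ == "__main__":
    # Hypothetical data: one vote from a trusted user, two from others.
    votes = [
        ("alice", "http://example.com/a"),
        ("bob", "http://example.com/b"),
        ("carol", "http://example.com/b"),
    ]
    trust = {"alice": 1.0, "bob": 0.2}
    for url, score in weighted_scores(votes, trust).items():
        print(url, round(score, 2))
```

In this example a single vote from a highly trusted user outranks two votes from less trusted ones, so a page can be ranked meaningfully from its very first vote.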
A Holistic Approach
Discovering topical, fresh, and novel information has always been an important aspect of search. But the perception of what "recent" is has changed dramatically with the popularity of services such as Twitter, and it has led to the emergence of real-time search engines.
Real-time search shouldn't be a silo, but rather should be part of a unified and distributed approach to Web search.
The era of purely document-centered search is over. The equally important roles of user and conversation, both as targets of search and as contributors to discovery and ranking, should be reflected in the infrastructure.
A Distributed Infrastructure
As long as both the sources and the recipients of information are distributed, the natural design of search is distributed, too. P2P offers an efficient alternative to the ubiquitous concentration and centralization of search we find today.
A peer-to-peer client allows every visited Web page to be implicitly discovered and ranked according to attention received. This is important, because the majority of pages in a real-time search are in the long tail. They appear once or not at all in the Twitter stream and can't be discovered or ranked through explicit votes.
With real-time search, the amount of indexed data is limited, because only recent documents (those that have gained a lot of attention and a high reputation) are accounted for in the index. This allows for a centralized infrastructure at a moderate cost. But as soon as search moves beyond the short head of real-time search and aims to fully index the long tail of the entire Web, then a distributed peer-to-peer architecture provides a huge cost advantage.