Is there a way to defeat spam? Late last week, the Yahoo Mail team shared news from an independent study showing that Yahoo Mail users receive significantly fewer spam messages in their inboxes than users of competing services.
We caught up with Vish Ramarao, anti-spam guru at Yahoo, to learn how the company was able to achieve these results and whether it is possible to outsmart spammers using more capable filters.
The Study
Here are the statistics supplied by the Yahoo team.
“The Fraunhofer Institute, an independent research firm, found that Yahoo! Mail users saw the least amount of spam out of the five providers tested, with nearly 40% less spam than Hotmail and 55% less spam than Gmail – meaning Gmail users in the study saw more than twice as much spam as Yahoo! Mail users.”
Yahoo also notes that its spam filtering blocks 99% of the spam sent to its 300 million account holders, which adds up to more than 120 billion blocked spam messages per month.
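As a quick sanity check on the figures above, the percentages can be converted into ratios. This is only a back-of-the-envelope sketch: the function name is ours, the inputs are the rounded numbers quoted by Yahoo, and the per-account figure assumes the blocked messages are spread evenly across all 300 million accounts, which the study does not claim.

```python
def relative_volume(percent_less: float) -> float:
    """Return how many times more spam the other provider's users saw,
    given that Yahoo Mail users saw `percent_less` percent less spam."""
    return 1.0 / (1.0 - percent_less / 100.0)


# "55% less spam than Gmail" means Yahoo users saw 45% of Gmail's volume.
gmail_ratio = relative_volume(55)

# "Nearly 40% less spam than Hotmail" is treated as exactly 40% here.
hotmail_ratio = relative_volume(40)

# 120 billion blocked messages per month across 300 million accounts,
# assuming (for illustration only) an even spread across accounts.
blocked_per_account = 120e9 / 300e6

print(f"Gmail users saw about {gmail_ratio:.2f}x the spam of Yahoo users")
print(f"Hotmail users saw about {hotmail_ratio:.2f}x the spam of Yahoo users")
print(f"Roughly {blocked_per_account:.0f} blocked messages per account per month")
```

The Gmail ratio comes out above 2, which matches the study's statement that Gmail users saw more than twice as much spam.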
Spam is Polymorphic – Algorithms Need A Grid To Keep Up
Ramarao described the approach Yahoo has implemented, which analyzes both historical and present data to find spam patterns.
What we learned is that spam delivery is growing more complex. Spammers are increasingly turning to “reputation bots” that help fight negative reports from users. The spammers have organized their systems to break the filtering routines, blacklists, and reputation mechanics that have been employed to date.
Yahoo's answer was to build a better knowledge base, in this case a broader and more readily available information set. Using the MapReduce functionality of Hadoop, the company can run ad hoc queries across a broader grid of email header data and find patterns that the filtering process could not previously detect.
The Yahoo Mail team recently shared more details of how they use Hadoop and other companion big data technologies to fight the ever-changing stream of spam.
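To make the idea concrete, here is a minimal sketch of a MapReduce-style query over email headers, written in plain Python rather than against Hadoop itself. It is not Yahoo's pipeline: the header fields (`sender_ip`, `subject`), the sample records, and the threshold are all assumptions chosen for illustration. The map step emits one count per (sender, normalized subject) pair, the shuffle step groups the counts by key, and the reduce step sums them so that high-volume repeats stand out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical header records; field names and values are illustrative only.
HEADERS = [
    {"sender_ip": "203.0.113.5", "subject": "Cheap meds"},
    {"sender_ip": "203.0.113.5", "subject": "cheap MEDS "},
    {"sender_ip": "203.0.113.5", "subject": "Cheap meds"},
    {"sender_ip": "198.51.100.7", "subject": "Lunch on Friday?"},
]

Key = Tuple[str, str]


def map_header(header: Dict[str, str]) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((sender_ip, normalized subject), 1) for one message."""
    subject = " ".join(header["subject"].lower().split())
    yield (header["sender_ip"], subject), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Shuffle step: group emitted values by key, as a framework would."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(headers: Iterable[Dict[str, str]]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for header in headers for pair in map_header(header))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


def suspicious(counts: Dict[Key, int], threshold: int = 3) -> List[Key]:
    """Return keys seen at least `threshold` times (an arbitrary cutoff)."""
    return sorted(key for key, n in counts.items() if n >= threshold)


counts = run_job(HEADERS)
print(suspicious(counts))
```

On a real grid the map and reduce functions would run in parallel across many machines, which is what makes this kind of ad hoc query practical at the scale of billions of messages.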
The good news is that this approach is providing a new generation of data intelligence tools that can be tuned for real-time algorithms to find patterns previously undetected in the spam arms race.
Running MapReduce over the raw data prepares it for the real challenge, which is finding patterns in spam. The Yahoo team also suggested that this type of approach may be useful in other security data models, where analyzing high volumes of data (e.g. logs) may have been impossible in the past but can now be optimized for real-time analysis.
What do you think? Will techniques like MapReduce unleash other good things in our information-saturated world?