Online dating services promise to help you find your “perfect match.” Complete your profile, answer a series of questions, choose a few filters as you search, and you’re given a list of singles in your area who are most likely to be a suitable date.
But providing these search results is no easy task. It isn’t simply a matter of identifying the right people based on a single user’s dating criteria. The people whose profiles are returned in the results should also, in turn, “like” the person who’s searching. In other words, the matching has to occur at both ends.
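To make the two-sided requirement concrete, here is a minimal sketch in Python. It is not OKCupid’s code: the `Profile` fields and the `accepts` and `mutual_matches` helpers are hypothetical names chosen for illustration, and real preferences would cover far more than age and gender.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    # Hypothetical, simplified profile: who the person is and whom they want.
    name: str
    age: int
    gender: str
    wants_gender: str
    min_age: int
    max_age: int


def accepts(seeker: Profile, candidate: Profile) -> bool:
    """True if the candidate satisfies the seeker's stated preferences."""
    return (
        candidate.gender == seeker.wants_gender
        and seeker.min_age <= candidate.age <= seeker.max_age
    )


def mutual_matches(seeker: Profile, candidates: List[Profile]) -> List[Profile]:
    """Keep only candidates where the preference check passes in both directions."""
    return [
        c
        for c in candidates
        if c is not seeker and accepts(seeker, c) and accepts(c, seeker)
    ]
```

A one-sided search would stop at `accepts(seeker, c)`; the second check, `accepts(c, seeker)`, is what separates this kind of query from an ordinary filtered search.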
Today at the opening keynote of OSCON’s new Data sub-conference, OKCupid’s CTO Tom Quisel spoke about how the online dating service has built its architecture in order to handle these queries. As Quisel notes, the types of searches that OKCupid users conduct are different from those done via other search engines. After all, “Web pages don’t have personal preferences.”
OKCupid is well known for its data analysis and for releasing trends and insights that it’s gleaned from user profiles. With over 7 million active users, there is indeed a lot of data to be had. On average, says Quisel, active users have answered about 3000 questions; they’ve hidden the profiles of several thousand users they aren’t interested in; they’ve voted for about 4000 profiles. All that data is in addition to users’ personal demographic data and preferences, as well as their site usage information (how often they log in, how often they respond to messages and so on).
And all that data makes a simple search for a list of potential matches quite complex. In fact, says Quisel, it can take 13 billion seeks in order to load one page of results.
OKCupid’s Technical Architecture
The challenge for OKCupid, then, has been to build a system that is scalable, fast and reliable, but also low cost. In his talk today, Quisel detailed the distributed architecture that OKCupid utilizes. Interestingly, OKCupid has made the decision to utilize C++, as Quisel argues that it’s three times faster than Java, uses a quarter of the memory and requires fewer support staff. OKCupid also primarily uses MySQL.
Users’ data is split across workers, says Quisel, and OKCupid uses a quadtree structure to partition that data by location. As one of the most important preferences for would-be daters is location, that’s the first filter utilized. Then, within each quadtree leaf node, the vector of users is sorted by last login, so that only recent visitors and active users are returned in search results.
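For readers unfamiliar with the structure, the sketch below shows one way a quadtree with login-sorted leaves could work. It is an illustration in Python under stated assumptions, not OKCupid’s C++ implementation: the `User` and `QuadTree` names, the `capacity` and `max_depth` parameters, and the use of longitude and latitude as plain x/y coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    name: str
    x: float          # assumed: longitude
    y: float          # assumed: latitude
    last_login: int   # assumed: Unix timestamp


class QuadTree:
    """Point quadtree whose leaves keep users sorted by most recent login."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.users: List[User] = []                     # used only by leaves
        self.children: Optional[List["QuadTree"]] = None

    def _child_for(self, user: User) -> "QuadTree":
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        index = (1 if user.x >= mx else 0) + (2 if user.y >= my else 0)
        return self.children[index]

    def _split(self) -> None:
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(self.x0, self.y0, mx, my, *args),
            QuadTree(mx, self.y0, self.x1, my, *args),
            QuadTree(self.x0, my, mx, self.y1, *args),
            QuadTree(mx, my, self.x1, self.y1, *args),
        ]
        users, self.users = self.users, []
        for u in users:
            self._child_for(u).insert(u)

    def insert(self, user: User) -> None:
        if self.children is not None:
            self._child_for(user).insert(user)
            return
        self.users.append(user)
        # Keep the leaf's vector ordered newest login first.
        self.users.sort(key=lambda u: u.last_login, reverse=True)
        # The depth cap stops endless splitting when many users share a point.
        if len(self.users) > self.capacity and self.depth < self.max_depth:
            self._split()

    def query(self, qx0, qy0, qx1, qy1, active_since: int) -> List[User]:
        """Users inside the box who logged in at or after active_since."""
        if qx1 < self.x0 or qx0 > self.x1 or qy1 < self.y0 or qy0 > self.y1:
            return []  # the box does not touch this node's region
        if self.children is not None:
            found: List[User] = []
            for child in self.children:
                found.extend(child.query(qx0, qy0, qx1, qy1, active_since))
            return found
        found = []
        for u in self.users:
            if u.last_login < active_since:
                break  # sorted leaf: everyone after this is older still
            if qx0 <= u.x <= qx1 and qy0 <= u.y <= qy1:
                found.append(u)
        return found
```

The location filter prunes whole regions of the tree first, and because each leaf is ordered by last login, a query can stop scanning a leaf as soon as it reaches a user who is too inactive. Results from different leaves are not globally ordered in this sketch; a real system would merge or rank them afterward.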
Quisel also says that OKCupid utilizes SSDs – and consumer-grade SSDs, notably – but the company has done extensive SSD benchmarking to make the process efficient and reliable. In order to avoid problems with reads and writes, Quisel says that OKCupid has taken SSDs off “the most critical paths.”
More from OSCON…
Not able to make it to OSCON this year? ReadWriteWeb will be reporting from the conference all week.