Data At Work profiles data scientists working at the cutting edge of big data.
Konstantin Gredeskoul knows what you want.
He’s the CTO of Wanelo, one of the online-shopping startups that are pioneering the Visual Web. It’s a wicked mixture of Pinterest, Tumblr, Twitter and Instagram with a simple twist: everything you see on Wanelo is for sale online. Its more than 10 million users follow each other, as well as stores and topics. The result is a highly relevant personalized feed of stuff you want, need, or love (hence the company’s name).
Those desires flood past Gredeskoul’s fingertips as the tall man with a pink mohawk sifts through Wanelo's data from its bean-bag-strewn office. Users have saved more than 9.5 million products into over 35 million collections to organize their shopping online. Wanelo has more than 200,000 online stores indexed, from big brands to small shops and even Etsy sellers. Wanelo’s infrastructure now must handle about 200,000 requests per minute, a hundredfold increase since June 2012, when it relaunched its website.
So what do you want to know? That’s the hard question.
Gredeskoul knew from day one that Wanelo first needed to collect all the data it possibly could about user activities in order to properly analyze the interactions between various groups of users, stores, and products.
“We knew we would want to ask a lot of questions in the future, including some we didn’t even know we would want to ask,” Gredeskoul said. "The only way to handle that is to save everything we can."
Save It For Later
First, a bit about Wanelo’s infrastructure. Most of its users are on mobile devices, with 80% of its interactions coming via iOS or Android apps. Beneath the mobile apps, the company built most of its application stack in Ruby, using the Rails framework. For its data store, Gredeskoul went with his beloved PostgreSQL, an open-source database. However, for analytics, Wanelo knew it would need a separate data storage and retrieval mechanism that would more easily pull out the most important user actions—registering, saving an item, and sharing an item—in what Gredeskoul calls an “append only” historical event log.
Most general-purpose databases are designed for frequent input and output, allowing database entries to be modified. But that approach doesn't make sense when recording historical events like a log of users’ actions, since they don’t change after they’ve occurred.
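To make the idea concrete, here is a minimal sketch of an append-only event log in Python. It is an illustration only, not Wanelo's code: the file name, the JSON-lines layout, and the field names (ts, user_id, action, item_id) are all assumptions. The point is simply that each event is written once, at the end of the file, and never updated afterward.

```python
import json
import os
import tempfile
import time


def append_event(log_path, user_id, action, item_id=None):
    """Append one event as a single JSON line. Existing lines are never rewritten."""
    # Field names here are hypothetical, chosen only for illustration.
    event = {"ts": time.time(), "user_id": user_id, "action": action, "item_id": item_id}
    # Mode "a" only ever adds to the end of the file.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(log_path):
    """Read the whole log back, one event per line, in the order it was written."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Record the three kinds of action the article mentions.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_event(log_path, 1, "register")
append_event(log_path, 1, "save", item_id=42)
append_event(log_path, 2, "share", item_id=42)
events = read_events(log_path)
```

Because nothing is ever modified in place, a log like this can be split by day or by server and analyzed later without worrying that earlier entries have changed.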
Initially, the Wanelo team tried to record this append-only data in a PostgreSQL table. But after a month, the table had become unwieldy.
“We figured out pretty quickly that inserting 10 million records per day into PostgreSQL was not the best way to handle this,” says Gredeskoul. Instead, the team turned to rsyslog, a workhorse open-source logging tool, sending the data into an ordinary text file. Using rsyslog they were able to tame the distributed data collection problem.
Then they had a new problem: how to analyze it.
Hadoop, There It Isn't
Many startups would have tackled this problem by dumping the data on Amazon's servers and eventually setting up a Hadoop cluster to parse it. The problem there, for Gredeskoul, was the inherent tradeoff. Moving data from storage to a cluster of servers for analysis would mean long, long waits (sometimes 12 hours or more) to get data sets processed. Because Wanelo was collecting 1.5 gigabytes of new user logs per day and that number was only going higher, Gredeskoul did not relish a reality where big data really meant slow data.
Alternatively, Wanelo could have set up its own clutch of dedicated hardware to run Hadoop and related software like Hive and Flume. That would have let Wanelo keep the data in its own big storage systems and keep a cluster running continuously to process jobs quickly. That route, however, was cost-prohibitive in terms of both servers and labor. Hadoop and its progeny require a fair bit of care and feeding to stay happy.
Gredeskoul instead elected to try something a bit different: a product from Joyent Cloud called Manta. (Full disclosure: I used to work at Joyent, but left two years ago. I have no financial interest in Joyent and hold no shares.) Manta is an object store like Amazon’s Simple Storage Service, but it also allows users to run compute operations directly on top of the objects. Translation? You can have your cake and eat it too. The compute could be on demand. The storage could be adjacent and in the cloud, with all the goodness of cloud distribution and scalability. In particular, Gredeskoul liked that running jobs on Manta was a very nice replacement for basic Hadoop processes.
How so? Much of what Hadoop does is not actually complicated computing but more about handling very large compute jobs and breaking them up into smaller pieces. Because Manta was a distributed object store, Wanelo could use it to efficiently run queries—for example, on how users behave on tablets versus smartphones. Manta gave the Wanelo team tools to express their queries in terms of simple Unix tools, such as grep or awk, and aggregate the results into a single cohesive report.
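The pattern described above, filtering each stored object on its own and then combining the partial results, can be sketched in a few lines of Python. This is not Manta's API and not Wanelo's actual job. The tab-separated log layout (timestamp, user ID, device, action), the sample lines, and the function names are all made up for illustration. The map step plays the role of grep and awk on a single object, and the reduce step merges the per-object counts into one report.

```python
from collections import Counter

# Hypothetical log chunks standing in for separately stored objects (say, one per day).
# Assumed layout per line: timestamp <TAB> user_id <TAB> device <TAB> action
chunks = [
    "2013-09-01T10:00:00\t17\ttablet\tsave\n"
    "2013-09-01T10:00:05\t23\tsmartphone\tsave\n"
    "2013-09-01T10:01:12\t23\tsmartphone\tshare\n",
    "2013-09-02T08:30:00\t31\tsmartphone\tsave\n"
    "2013-09-02T08:31:40\t17\ttablet\tsave\n"
    "2013-09-02T09:02:11\t45\tsmartphone\tsave\n"
    "2013-09-02T09:15:00\t52\ttablet\tregister\n",
]


def map_chunk(text, action):
    """Map phase: keep lines for one action (like grep) and count by device (like awk)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 4 and fields[3] == action:
            counts[fields[2]] += 1
    return counts


def reduce_counts(partials):
    """Reduce phase: add the per-chunk counts together into a single report."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# How many "save" actions came from each device type, across all chunks?
report = reduce_counts(map_chunk(chunk, "save") for chunk in chunks)
```

Each call to map_chunk touches only one chunk, so in a system that runs compute next to storage, those calls could run in parallel wherever the objects live, and only the small per-chunk counts would need to travel to the reduce step.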
For now, Wanelo and Gredeskoul are very happy to have a big-data strategy and tech stack that is streamlined and cheap.
“We just love the fact that we can run all this cost efficiently but also using an existing skillset—Unix commands," says Gredeskoul. "Everyone on my team knows most of what you need to run queries on Manta."
By keeping it simple, Gredeskoul actually thinks he’s laid the groundwork for more complex work.
“For us, the data questions fall into two main buckets,” Gredeskoul said. "One is around the products. The other is around the users. The challenges around products are based on having to rely on the product data from posting by Wanelo users. For example, identifying similar or even identical products across multiple retailers is not an easy problem. The other end of the spectrum is the user behavior and clustering. What features are they using? What do our users want but do not have, and what do they not even know that they want? We are exploring these and many other questions that we hope to answer with the help of data science, statistics, machine learning, multivariate regression. Those are the main areas where I see us moving forward.”
By building just the infrastructure Wanelo needs, Gredeskoul is getting closer and closer to figuring out what his users love.