Hunch has really interesting problems. They collect a lot of data from a lot of users, and once someone creates a profile they need to quickly deliver useful recommendations across a wide range of topics. This means running a sophisticated analysis on a massive data set, all to a strict deadline. Nobody else is doing anything this ambitious with recommendation engines, so I sat down with their co-founder and CTO Matt Gattis to find out how they pulled it off.

The first thing he brought up was hardware costs, casually mentioning that they'd looked into getting a server with one terabyte of RAM from Dell! That immediately piqued my interest, because the Google-popularized trend has been towards throwing an army of cheap commodity servers at big data problems, rather than scaling vertically with a single monstrously powerful machine. It turns out their whole approach is based around parallelism within a single box, and they had some interesting reasons for making that choice.

They'd evaluated more conventional technologies like Hadoop, but the key requirement they couldn't achieve in their tests was low latency. They're running on a graph with over 30 billion edges, with multiple iterations to spread nodes' influence to distant neighbors and achieve a steady state, a bit like PageRank. This has to be extremely responsive to new users inputting their information, so they have to re-run the calculations frequently, and none of the systems they looked at could deliver the results at a speed that was acceptable.

They determined that the key bottleneck was network bandwidth, which led them towards housing all of their data processing within a single machine. It's much faster to share information across an internal system bus than to send it across even a fast network, so with their need for frequent communication between the parallel tasks, a monster server made sense. As it happens they decided against the $100,000 one terabyte server, and went for one with a still-impressive 256 GB of RAM, 48 cores and SSD drives.

The other part of the puzzle was the software they needed to actually implement the processing. They looked at a series of open-source graph databases, but ran into problems with all of them when they tried scaling up to 30 billion edge networks. Continuing their contrarian approach, they wrote their own engine from the ground up in C, internally codenamed TasteGraph. The system caches the entire graph in memory, with rolling processes re-running the graph calculations repeatedly, and the end-results cached on multiple external machines. They have even recoded some of their inner loops in assembler, since they spend a lot of their cycles running calculations on large matrices and even the specialized linear algebra libraries they use don't deliver the performance they need.

Even with their software and hardware architecture in place, there were still obstacles to overcome. Their monster server uses CentOS Linux, but very few people are running memory-intensive applications on machines with so much RAM, so they ran into performance problems. For example, by default the kernel will start paging out to disk once the memory is about 60% full, which left them with only about 150 GB of RAM available before swapping kicked in and performance cratered. There's not much documentation available around these parameters, so the team ended up scouring the kernel source to understand how it worked before they could produce a set hand-tuned for TasteGraph's needs.

When Matt first told me about his design decisions, I have to admit I was surprised that he was apparently swimming against the tide by working within a single uber-machine rather than using an army of dumb boxes, but as he explained their requirements it all started to make sense. With more and more companies facing similar latency issues, I wonder if the pendulum is swinging back towards parallelism across a system bus rather than a network?