Blekko has a refreshingly different interface to search, and a generous data-sharing philosophy, but what I didn’t realize was how innovative its underlying technology is. Last week I sat down with CEO Rich Skrenta and CTO Greg Lindahl, and they took me on a fascinating tour of the system they’ve built.
Starting with the hardware side, they have around 800 servers in their data center, each with 64 GB of RAM and eight SATA drives, giving each machine about eight terabytes of local storage. The first thing that caught my attention was their explanation of why they avoided RAID.
In their tests it cut performance in half: raw access to the eight disks delivered about 800 MB/s in total, but with a RAID controller in the pipeline that dropped to 300-350 MB/s. Even setting the speed penalty aside, RAID is a maintenance burden. As soon as one of the eight disks fails, it pages an engineer to drive out to the co-lo and swap the drive before any data is lost. With over 6,000 drives in their cluster, whoever’s on call wouldn’t get much sleep!
An Entirely Decentralized Architecture
So how do they deal with dying drives? It’s designed into their software architecture, along with lots of other nifty tricks they’ve picked up over the years. Skrenta explained that he’d been involved in a lot of previous systems that had separate modules for crawling, analysis and delivering results to users. That separation meant passing data between the modules was a messy, bug-prone process, and as time went by engineers would end up spending most of their time firefighting rather than adding improvements. As Lindahl put it, quoting Leslie Lamport, “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
What they needed was a system where the crawling, analysis and delivery of results could use a single data store and set of programming primitives, in a way that was both simple to write and to debug. This led them to the radical step of building an entirely decentralized architecture, with no masters, slaves or indeed any servers with special roles.
Even BigTable has the notion of a root tablet that acts a bit like a DNS server, holding the locations of the servers to talk to about a given key, but Blekko’s system relies on a completely distributed ‘swarm’ approach. Every server advertises the keys of the buckets it’s currently holding, and in turn each server listens and remembers the locations of the several thousand buckets holding the actual data. There are three copies of each bucket held on different machines, and if a drive or server fails, other machines will notice the problem through the broadcast information and start the healing process by replicating the affected buckets to other computers. This is a much lower maintenance system compared to swapping out drives when RAID starts complaining!
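To make the swarm idea concrete, here’s a minimal Python sketch of the advertise-and-heal cycle. All of the class and method names are my invention, not Blekko’s code: each node “broadcasts” the bucket IDs it holds, every node can rebuild the same global bucket map from those broadcasts, and any bucket that falls below three replicas gets copied to a fresh machine.

```python
import random
from collections import defaultdict

REPLICAS = 3  # the article says three copies of each bucket

class SwarmNode:
    """One node in a masterless swarm (hypothetical sketch)."""
    def __init__(self, name):
        self.name = name
        self.buckets = set()   # bucket IDs this node advertises

class Swarm:
    def __init__(self):
        self.nodes = {}

    def add_node(self, name):
        self.nodes[name] = SwarmNode(name)

    def bucket_map(self):
        # Rebuild the bucket -> holders map from every node's broadcast;
        # any node can compute this, so no coordinator is needed.
        holders = defaultdict(set)
        for node in self.nodes.values():
            for b in node.buckets:
                holders[b].add(node.name)
        return holders

    def place(self, bucket_id):
        # Initial placement: pick REPLICAS distinct nodes.
        for name in random.sample(sorted(self.nodes), REPLICAS):
            self.nodes[name].buckets.add(bucket_id)

    def fail(self, name):
        # A dead server simply stops broadcasting.
        del self.nodes[name]

    def heal(self):
        # Any bucket advertised by fewer than REPLICAS nodes gets
        # re-replicated onto a node that doesn't already hold it.
        for bucket, holders in self.bucket_map().items():
            while len(holders) < REPLICAS:
                candidates = [n for n in self.nodes if n not in holders]
                target = random.choice(candidates)
                self.nodes[target].buckets.add(bucket)
                holders.add(target)
```

Failing a node and calling `heal()` restores three holders for every affected bucket, with no special server involved, which is the property that replaces RAID here.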
Values for multiple keys are held in each bucket, with a hash function mapping a key to its destination. The data store supports the familiar set of key/value primitives, along with some more sophisticated variations: for example, a programmer can update a value only if it has never been set before, or issue a write that is simply dropped on error, so that non-critical updates like server log messages don’t trigger an expensive recovery pass if something goes wrong.
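The primitives described above might look something like this minimal Python sketch (the method names and the bucket count are my assumptions, not Blekko’s actual API):

```python
import hashlib

def bucket_for(key, n_buckets=4096):
    # A hash function maps each key to the bucket that holds it.
    # (4096 is an illustrative bucket count, not Blekko's.)
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_buckets

class RelaxedStore:
    """Hypothetical sketch of the store's extra primitives."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

    def set_if_absent(self, key, value):
        # Only write if the key has never been set, so two crawlers
        # racing to record "first seen" don't clobber each other.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def set_best_effort(self, key, value):
        # Swallow write errors: a lost log line isn't worth a
        # recovery pass.
        try:
            self._data[key] = value
        except Exception:
            pass
```

The interesting design point is that error handling is chosen per write, not per store: critical data gets the strict path, log chatter gets the cheap one.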
A “Naked Date”
Lindahl described the store’s guarantees as “relaxed eventual consistency,” and explained that they expect their developers to write applications with its characteristics in mind. It will let you shoot yourself in the foot with race conditions, and it demands more thought at the application level, but it’s worth it for the power and performance you get from the store. They’re crawling over 200 million pages a day, with 3 billion in total. The refresh frequency ranges from minutes for popular news site front pages, up to 14 days for the least-visited sites.
So, that sounds like geeky fun, but how does it help the end user of the system? To answer that, Lindahl muttered something about a “naked date,” which definitely caught my attention! It turns out this wasn’t a proposition, he was talking about the “/date” tag you can enter into Blekko’s search box:
This shows a selection of Blekko’s top search results as they’re crawled by the system. You can sit there refreshing the page, and as Blekko crawls the Web new pages will appear. The crawler is feeding sites into the data store, they’re being ranked on the fly and they’re showing up in the interface, all within a couple of seconds. Even Google’s Caffeine doesn’t offer that sort of responsiveness.
In technical terms, they’ve implemented MapReduce, but instead of a monolithic Reduce stage, they have primitives that allow simple operations like incrementing or merging data structures to be applied incrementally to build up results over time, with the intermediate results available for reading continuously. You can see this in action by refreshing the small SEO link that’s visible for every result, which shows all of the data that goes into their ranking calculations as it updates, from inbound links to content statistics and a complete site map.
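The incremental-reduce idea can be sketched in a few lines of Python. This is my own illustration of the pattern, not Blekko’s implementation: each mapped record is folded into the running result the moment it arrives, partial results from other machines are merged the same way, and the intermediate value is readable at any time, with no barrier before a final reduce.

```python
from collections import Counter

class IncrementalReducer:
    """Sketch of reduce-as-you-go: increment and merge primitives
    replace a monolithic Reduce stage."""
    def __init__(self):
        self.totals = Counter()

    def emit(self, key, count=1):
        # "increment" primitive: fold one observation in immediately.
        self.totals[key] += count

    def merge(self, partial):
        # "merge" primitive: fold in a partial result from another
        # node; Counter.update adds counts rather than replacing them.
        self.totals.update(partial)

    def read(self, key):
        # Intermediate results are continuously readable, which is
        # what lets freshly crawled pages show up within seconds.
        return self.totals[key]
```

Counting inbound links per site works naturally this way: every crawled page emits increments for the sites it links to, and the SEO page can read the running totals at any moment.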
What’s the secret of their programming success? Perl! Even though they’re the only NoSQL solution written in the language, they’re extremely happy with their choice, largely because of the rich and stable set of modules available on CPAN, with over 200 of them on every machine. Each server runs CentOS, and because there are no special roles they can all be configured identically. They’re not using any virtualization, so to keep machine creation simple Greg has rolled his own configuration system to install everything they need.
I asked them if there’d been any real performance surprises they could share, and they came up with a couple of corkers. First, they’d discovered that writing to a disk had a dramatic effect on seek times. Even in the normal case it can take 50ms for a disk head to move to a new position to access data, but while the drive was writing, that same operation could be delayed by up to 500ms. A few half-second delays like that quickly add up and can ruin a user’s experience with a site, so they have write daemons sitting on each machine that buffer data and flush it to disk in bursts according to a schedule, and other machines know to avoid reading from that server at those times.
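Here’s a toy Python sketch of that scheduling idea, with the clock simulated and all names, window sizes and cycle lengths invented by me: writes accumulate in memory and only hit disk inside a known window of each cycle, and readers elsewhere consult the same schedule to stay out of the way.

```python
class BurstWriter:
    """Sketch of scheduled burst writes (simulated clock).

    Writes are buffered and flushed only inside a fixed window of
    each cycle; readers on other machines check the shared schedule
    so reads never contend with write-induced seeks.
    """
    def __init__(self, cycle=60, window=(50, 60)):
        self.cycle = cycle        # seconds per schedule cycle
        self.window = window      # flush only in [50, 60) of a cycle
        self.buffer = []          # pending writes, held in memory
        self.flushed = []         # stands in for data on disk

    def in_window(self, now):
        return self.window[0] <= now % self.cycle < self.window[1]

    def write(self, record):
        self.buffer.append(record)   # never touches "disk" directly

    def tick(self, now):
        if self.in_window(now) and self.buffer:
            # One sequential burst instead of scattered small writes.
            self.flushed.extend(self.buffer)
            self.buffer.clear()

    def safe_to_read(self, now):
        # Other machines avoid this server during its write window.
        return not self.in_window(now)
```

The trade-off is a bounded delay before buffered data reaches disk, in exchange for reads that never land on a seek-starved drive.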
The real shocker was the strftime() C function’s bad behavior. While tracking down an intermittent performance problem, they discovered that it would sometimes access up to 50 files from disk, and the unexpected seeks shoved a stick in the spokes of any application that relied on fast response times. It turns out the function loads information from locale files to help with its formatting job, and even worse, it periodically rechecks those files to see if they’ve changed. This may not sound like much, but for a programmer it’s as unexpected as discovering your grandmother moonlighting as a nightclub bouncer.
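One common defense against a time formatter that may touch the filesystem is to simply call it less. This is my own sketch of that caching pattern, not Blekko’s fix: Python’s time.strftime wraps the same C routine, and since log timestamps only change once a second, the formatted string can be reused for every line within that second.

```python
import time

# strftime may consult locale files on disk, so call it at most once
# per distinct second and reuse the result for every log line.
_cache = {"sec": None, "text": ""}

def cached_timestamp(now=None):
    """Return the current time formatted once per second."""
    sec = int(time.time() if now is None else now)
    if sec != _cache["sec"]:
        _cache["sec"] = sec
        _cache["text"] = time.strftime(
            "%Y-%m-%d %H:%M:%S", time.gmtime(sec))
    return _cache["text"]
```

A busy server logging thousands of lines a second goes from thousands of strftime calls to one, so even a pathological formatter can’t generate a storm of seeks.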
Blekko has built a very sexy system for processing massive data sets in a very dynamic way, and talking to Skrenta and Lindahl left me excited to see what they’ll be able to build on it next. The flexibility of their platform should let them keep producing innovative features nobody else can match. They also wanted me to let you know that if this sort of stuff is your cup of tea, they’re hiring!