Location-based network Foursquare was offline yesterday for about 11 hours, which it admits was “unacceptably long.” The support team has just updated the company blog with an apology and an explanation, aptly titled “So, That Was a Bummer.”
And indeed it was, not only for those of us who regularly check in via Foursquare, but also for the third-party tools that use the Foursquare API.
What We Have Here is a Failure to Shard
The explanation of yesterday’s outage is fairly technical: Foursquare uses MongoDB to store its data, and one of the features of this database is that it scales horizontally via sharding. This means that the data is spread across multiple “shards.” Yesterday, apparently, one of these shards started performing poorly. When the team introduced a new shard, the move did not correct the overloading problem; instead, the entire system crashed.
Foursquare admits it’s not quite sure why this happened, and the team spent the remainder of yesterday bringing the site back online – all, it boasts, without any data loss.
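For readers who want a feel for what “spreading data across shards” means, here is a toy sketch in Python. It is not MongoDB's implementation and says nothing about how Foursquare's cluster is actually configured; the split points, the integer keys, and the function names `shard_for` and `load_per_shard` are all made up for illustration. It assumes a simple range-based scheme in which each shard owns a contiguous range of keys.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical split points: keys below 1000 live on shard 0,
# keys 1000-1999 on shard 1, and keys 2000 and up on shard 2.
SPLIT_POINTS = [1000, 2000]


def shard_for(key, split_points=SPLIT_POINTS):
    """Return the index of the shard whose key range contains `key`."""
    return bisect_right(split_points, key)


def load_per_shard(keys, split_points=SPLIT_POINTS):
    """Count how many of the given keys land on each shard."""
    counts = Counter(shard_for(k, split_points) for k in keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]


# A skewed workload: almost every key falls into the first range.
keys = list(range(900)) + [1500, 2500]
print(load_per_shard(keys))                      # [900, 1, 1]

# Adding a shard by splitting the busy range evens out the counts on paper.
print(load_per_shard(keys, [450, 1000, 2000]))   # [450, 450, 1, 1]
```

The second call shows why adding a shard looks like a natural fix for an overloaded one. What the sketch leaves out is the work of actually moving existing data onto the new shard while the system is under load.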
Foursquare says that it will be working closely with MongoDB on the database side of things, and that it’s also exploring “artful degradation” so that only some functions, rather than the whole site, will be impacted by future crashes.
It also says that it’s going to work to better communicate these sorts of issues to users and third-party developers – an improvement over the error message we saw yesterday, which said only that Foursquare was “upgrading its servers.” There is now a new Foursquare status blog, status.foursquare.com, for example.
And certainly if Foursquare wants to be the platform for location, addressing both the tech and the communication will be crucial.