As explained in this blog post, Foursquare needed a way for its business staff to run reports based on its data without slowing down production servers and without learning technologies such as Scala and MongoDB. The company decided to make its data available to business staff through a Hadoop cluster hosted by Amazon Web Services. Foursquare’s data miners could then query it using Hive, which provides a SQL-like query language for Hadoop.
As a proof of concept, the company has produced a report on the rudest cities in the world, based on the number of tips that contain profanity. That's pretty cool (apart from the assumption that profanity use = rudeness), but it makes me realize just how under-utilized geolocation APIs are.
Here are the results of Foursquare’s profanity-mining:
And here’s how Foursquare’s data analysis system works:
Some more practical applications, from a business standpoint, for data mining staff might include determining:
Which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?
Of course, this sort of check-in data is solely in the hands of Foursquare’s internal users. But it makes me wonder whether you could pull together information like this through the Foursquare API if you build your own data warehouse for analysis.
I wonder what services like Fourwhere (which we covered here) could learn by caching all the data retrieved from various location APIs and running sentiment analysis on it. What could MisoTrendy (coverage) tell us about a venue based on long-term trend patterns? Is there something in Foursquare’s terms of service that prevents people from doing this? I guess we’re back to that old question: what would you do with the massive data sets produced by persistent location tracking?
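To make the idea concrete, here is a minimal sketch of what a tiny "warehouse" of cached tips might look like, using SQLite and a flagged-word count as a crude stand-in for sentiment analysis (in the spirit of the profanity report). Everything in it is an assumption for illustration: the table layout, the word list, the sample tips and the function names are hypothetical, and it does not call the Foursquare API or reflect how Foursquare built its own report.

```python
import re
import sqlite3

# Hypothetical word list; a real one would be far longer.
FLAGGED = {"damn", "hell", "crap"}
WORD_RE = re.compile(r"[a-z']+")


def is_flagged(text):
    """Return True if any word in the tip is on the flagged list."""
    return any(word in FLAGGED for word in WORD_RE.findall(text.lower()))


def cache_tips(conn, tips):
    """Store (tip_id, city, text) tuples, skipping tips already cached."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tips "
        "(tip_id TEXT PRIMARY KEY, city TEXT, text TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO tips VALUES (?, ?, ?)", tips)
    conn.commit()


def rank_cities(conn):
    """Return [(city, flagged_share)] sorted from highest share to lowest."""
    totals, flagged = {}, {}
    for city, text in conn.execute("SELECT city, text FROM tips"):
        totals[city] = totals.get(city, 0) + 1
        if is_flagged(text):
            flagged[city] = flagged.get(city, 0) + 1
    shares = [(city, flagged.get(city, 0) / n) for city, n in totals.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Made-up sample data standing in for tips pulled from a location API.
sample_tips = [
    ("t1", "Manchester", "Damn good coffee"),
    ("t2", "Manchester", "Quiet in the mornings"),
    ("t3", "Pittsburgh", "Try the pierogi"),
    ("t4", "Pittsburgh", "Great view"),
    ("t1", "Manchester", "Damn good coffee"),  # duplicate, ignored by the cache
]

conn = sqlite3.connect(":memory:")
cache_tips(conn, sample_tips)
ranking = rank_cities(conn)
print(ranking)
```

Ranking by share of flagged tips rather than raw count keeps cities with more tips from automatically looking ruder; whether that matches Foursquare's own method is not something the post tells us.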
Update: MisoTrendy’s Andrew Ferenci explains the limitations:
1. You would not be able to pull and process historical data like 4SQ did from their production databases and log files (only real-time data; it is hard for a small web app to run queries that generate 1bn records)
2. If you use something like Google App Engine you have lots of limitations on DB and backend processing (only 80-90K hits before you have to start paying)
3. Most third-party applications would only be able to pull real-time data from the 4SQ API, so no backend processing.
However, if you decided you wanted to create an application to pull similar data starting today, you would definitely be able to, but not with the same historical breadth.
Technically, it's all feasible with some limitations. MisoTrendy was built using Google App Engine with a Python backend. There are limitations for the DB and backend processing because you cannot use Ruby on Rails with this setup.
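The "starting today" approach could be sketched roughly as follows: record each real-time reading as you receive it, and let the history build up on your side. This is only an illustration under stated assumptions. The `SnapshotLog` class, the `here_now` count and the venue ID are hypothetical names, the readings are made up, and nothing here reflects how MisoTrendy actually stores its data.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class SnapshotLog:
    """Accumulates real-time readings so trends can be computed later."""

    def __init__(self):
        # venue_id -> list of (timestamp, here_now) readings
        self._rows = defaultdict(list)

    def record(self, venue_id, timestamp, here_now):
        """Save one real-time reading for a venue."""
        self._rows[venue_id].append((timestamp, here_now))

    def daily_average(self, venue_id):
        """Return {date: average here_now} for every day with readings."""
        by_day = defaultdict(list)
        for timestamp, count in self._rows.get(venue_id, []):
            by_day[timestamp.date()].append(count)
        return {day: sum(c) / len(c) for day, c in sorted(by_day.items())}


# Made-up readings standing in for periodic polls of a location API.
log = SnapshotLog()
start = datetime(2010, 1, 1, 12, 0)
log.record("venue-a", start, 4)
log.record("venue-a", start + timedelta(hours=6), 8)
log.record("venue-a", start + timedelta(days=1), 10)

averages = log.daily_average("venue-a")
for day, value in averages.items():
    print(day, value)
```

The history only reaches back to the first recorded poll, which is exactly the limitation described above: the breadth grows with time rather than being available up front.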
This feels like it could be a first step towards accomplishing what was described in the opening lines of the Headmap Manifesto:
there are notes in boxes that are empty
every room has an accessible history
every place has emotional attachments you can open and save
you can search for sadness in new york