Hilary Mason Wants To Get You Started With Big Data

I spent part of this week with Hilary Mason, one of the smartest people that I know in Big Data. She works as the Chief Scientist for Bit.ly and has a wealth of skills at her fingertips that bridge computer science and mathematics. Plus, she is used to facing largely male audiences and just being the smartest person in the room. She was speaking at The Strange Loop conference in St. Louis this week, which should definitely be on your radar for next year if you are interested in this topic or want to broaden your programming skills.

Editor’s note: This story is part of a series we call Redux, where we’re re-publishing some of our best posts of 2011. As we look back at the year – and ahead to what next year holds – we think these are the stories that deserve a second glance. It’s not just a best-of list, it’s also a collection of posts that examine the fundamental issues that continue to shape the Web. We hope you enjoy reading them again and we look forward to bringing you more Web products and trends analysis in 2012. Happy holidays from Team ReadWriteWeb!

In a series of workshops, Mason outlined the tools you need to get started with manipulating Big Data and understanding the basics of machine learning, something she does every day as she sifts through each one of those shortened URLs that we all create furiously. (You can read about her latest revelation here, which we wrote about earlier in the month.) You know that when she says, “this is a hard problem,” she is really saying “this is a problem that I haven’t yet figured out the best answer to.” For each problem, her credo is Obtain, Scrub, Explore, Model, and Interpret. I’ll review each of these steps.

The first step is setting up a proper environment. For Mason that is a Linux machine with a variety of tools on it, which you can find on her GitHub page linked above. She is a Python programmer, and her toolkit reflects that: Python along with the JSONView Chrome extension, NLTK, NumPy, Pycluster, hcluster, and matplotlib. You can use most of these tools on other OSs too.
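If you want a quick way to see which of those Python libraries are already on your machine, here is a minimal sketch using only the standard library. The import names in the list are my assumption about how each package is imported (they can differ from the name you install), and the JSONView browser extension is left out because it is not a Python package.

```python
import importlib.util

# Assumed import names for the libraries mentioned above; adjust if yours differ.
PACKAGES = ["nltk", "numpy", "Pycluster", "hcluster", "matplotlib"]


def find_missing(packages):
    """Return the names in `packages` that Python cannot locate for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(PACKAGES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are available.")
```

This only checks that each package can be found. It does not verify versions or that the package actually works.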

Second, you need to obtain a few test data sets that you can start to manipulate. Even if you aren’t drinking out of the Bit.ly data fire hose, there are ways to get access to lots of great data around the Internet. Mason mentioned a few places, including:

The New York Times. Each and every article posted on the main NYT Web site going back dozens of years has oodles of metadata and tags galore. Just view the source on any news article to see how the human editors have classified it. You’ll need to start off by registering for an API key here and then select “Article Search API.”
Mason has put together several dozen different bundles of research-quality data sets at this link here
Pete Skomoroch’s collection of sample datasets
And there is always the epic reference, Data Source Handbook by Pete Warden

Some of these datasets have gotten almost urban legend status, such as the email collection that was part of all the Enron court cases: this is useful for testing anti-spam programs, even though it is several years old. Because of how people use Bit.ly, Mason said that they usually see malware and other bad stuff several hours before anyone else has picked it up around the Internet.

Third, you need to start thinking about how to make your data sets smaller. “Big Data usually refers to a data set that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet,” says Mason. This is the “scrub” section. The smaller the dataset, the easier it is to manipulate and analyze.
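One common way to shrink a data set that won't fit in memory is to keep a uniform random sample of it as you read through it once. The sketch below uses reservoir sampling, which is my illustration of the idea and not something Mason described; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Keep a uniform random sample of k items from an iterable of unknown length.

    Only k records are held in memory at any time, so `records` can be a
    file object or any other stream that is too large to load at once.
    """
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = record
    return sample


if __name__ == "__main__":
    # A stand-in stream; in practice this could be an open file, read line by line.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Because the function accepts any iterable, you can pass it an open file and get back a manageable handful of lines to explore.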

Now comes the fun part: exploring your data. You want to ask questions and look for patterns. If you are using the NYTimes data, for example, you could look at the words most frequently used to describe political candidates, or at whether particular words are used more often in the technology section than in the sports section.
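To make the section-by-section word question concrete, here is a minimal sketch that counts words per section. The sample records, field names, and stopword list are all made up for illustration; real records would come from whatever data set you obtained, and their structure will differ.

```python
import re
from collections import Counter

# Hypothetical sample records, not real NYT data or its actual field names.
ARTICLES = [
    {"section": "technology", "text": "The startup shipped new software and the software team celebrated."},
    {"section": "technology", "text": "New software tools help analysts explore data."},
    {"section": "sports", "text": "The team won the game after a late goal."},
    {"section": "sports", "text": "Fans celebrated as the team lifted the trophy."},
]

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "a", "as", "after", "new"}


def word_counts_by_section(articles):
    """Return a dict mapping each section name to a Counter of its words."""
    counts = {}
    for article in articles:
        words = re.findall(r"[a-z']+", article["text"].lower())
        section_counter = counts.setdefault(article["section"], Counter())
        section_counter.update(w for w in words if w not in STOPWORDS)
    return counts


if __name__ == "__main__":
    for section, counter in word_counts_by_section(ARTICLES).items():
        print(section, counter.most_common(3))
```

Comparing the counters for two sections is a crude but quick first look at which words are distinctive to each.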

Next comes the mathematical modeling portion of the program. If you don’t have a lot of depth in probability or statistics, you are going to need some help here. Mason rolls off the math as if she were speaking fluently in a foreign language. And even as a former undergraduate math major, I can understand your frustration when you see something with the Greek sigma sign (hint: that means a sum of things). She let slip the words “Fourier transform,” which sent me to the Wikipedia definition before I could try to remember what it was. But the essence here is to write code that answers the questions you are asking in your explorations.

Finally comes interpretation. You want to put the answers that you obtained from your modeling into the context of why they are important. You may need to do some visualizations or reporting so that your results can be understood. Or you may choose to omit this part if your answers are sufficient for your particular purposes. Mason mentioned the spam-fighting tactics that she uses to cleanse the shortened URLs of evildoers. “As long as our routines seemed to be working, that was good enough for us and so we just stopped,” she said. Part of this process is understanding when you have the best answer that you are going to get to your questions, “knowing when you have won,” as she says.

If you are looking to learn more, a good place to start is to sign up for one of Stanford’s free classes on AI, Machine Learning, and databases here. The classes start October 10 and continue until mid-December. They are free, and you submit homework assignments and do everything that you would normally do in a typical CS class, including getting help from teaching assistants.

Mason is lucky to have such a large playground to operate in, and we are lucky when we can sit at her feet and try to understand the totality of her experience. I hope I have given you a taste of the world of Big Data here and some of the best ways to get started on your own analysis. And if you need more motivation, she tells me that just about everyone is hiring their own “data person” these days, so if you get good enough at it, you will probably have your pick of employers for years to come.