Open Source Is Data Science’s Missing Ingredient

Here’s the dilemma: your company has lots of data and little clue what to do with it. So you figure you should hire a data scientist but, as it turns out, they’re in short supply. Good ones, anyway.

What do you do?

As an increasing number of companies are figuring out, you grow them. But not just anyone will be able to make the leap. It turns out that the best data scientists tend to be very comfortable with open source.

Buying A Big Data Clue

Over a year ago I picked apart Gartner’s Big Data surveys, finding that while nearly every company purports to be running Big Data projects, the reality is far murkier.

Dig into the data and it becomes evident that as much as we may wish we were masters of the Big Data universe, we’re actually neophytes who are trying to “determine how to get value from big data,” struggling to “define [a Big Data] strategy” and to hire “skills and capabilities needed” to do so.

Good luck with that!

Years into the Big Data movement, companies are desperately, and often futilely, trying to hire a Big Data clue. Hence, even though Big Data skills top the list of LinkedIn’s hottest job skills of 2014, they also top the list of skills enterprises want but can’t find (which is why the relatively few data scientists in existence get 100-plus recruiter emails each day).

Discovering Your Inner Data Scientist

Which is why companies are increasingly looking for ways to train data scientists, and why a booming industry is developing around that training.

Some will wait on a swelling population of students being trained for a data-rich future. Others are encouraging employees to get trained through Codecademy, Coursera or other options that Eileen McNulty showcases.

However these employees (future or current) get trained, it’s helpful that companies are now looking within. Even if there weren’t a dearth of data scientists, it still makes sense to look within, as Gartner’s Svetlana Sicular has called out:

[C]ompanies should look within. Organizations already have people who know their own data better than mystical data scientists…The internal people already gained experience and ability to model, research and analyze. Learning Hadoop is easier than learning the company’s business. 

She’s right, but not just anyone can do this. 

Open Sourcing Your Data

I’ve written before about why data scientists get paid so much. As Mitchell Sanders posits, data science is hard because it depends on a blend of domain knowledge, statistical and mathematical prowess, and programming skills. 

It’s hard to find all those skills in one person, which is why data scientists get paid a lot. Supply and demand.

It may be even harder when we unpack that last attribute – programming skills. Implied but not stated in this attribute is the reality that data scientists need to be comfortable with a particular kind of programming: open source development.

As Gartner analyst Alexander Linden writes:

A lot of innovative data scientists really favor open source components (especially Python and R) in their advanced analytics stack. I hear this a lot, even from the most advanced of our clients… One department head, leading a dozen data scientists at one of the top retailers, gave me the following rationale: “I would be paying about $5 million just in annual maintenance, if I stuck with vendor xxx … imagine how many gifted data scientists I can buy for that money (?) … and by the way I did hire them and they all use a combination of R and Python”.

Most of the essential Big Data technology today is open source, whether Python and R or Hadoop, Spark, MongoDB, HBase and Cassandra. While you don’t have to develop Spark in order to use it, those who know how to swim in open source currents will do far better with such technologies than those who only know how to install and run whatever SAS, Microsoft or some other vendor offers them.

In short, the best data scientists are those who can not only thoughtfully ask questions of data, but also manipulate their data analysis tools to better craft the question. That’s the essence of open source, and your next data science trainee or hire will be much stronger if she groks open source.

Image courtesy of Shutterstock.
