Large enterprises are putting a lot of money and effort into making sure they have the latest and greatest in Hadoop and other big data infrastructure tools, but it turns out their IT teams are far from prepared to actually use those tools once they are in place.
That’s one observation from Jeremy Howard, president and chief scientist of Kaggle, which uses crowdsourcing techniques to provide statistical and data analytic services for clients.
“A lot of companies don’t know how to find data scientists, and don’t understand data science,” Howard explained. “These enterprise companies can’t implement a proper data analytical solution because they have no data talent.”
Part of the problem is an overall lack of big data skills in the United States. In May 2011, the McKinsey Global Institute laid out the numbers: “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Howard sees the problem reflected in his company’s clientele. Initially, Kaggle worked with smaller, highly capitalized startups, but it now finds itself working with larger enterprise companies.
Startups Do Big Data Better Than Enterprises
The startups, it turns out, are much better equipped to handle big data than the enterprises.
“The startups are usually much closer to the data they’re analyzing,” Howard explained. “They know their stuff, and that knowledge is more centralized within a smaller organization.”
Enterprises, in contrast, are much broader, and close familiarity with the data is spread across many more people, he said.
Reprising the Problems of Data Warehousing
It’s a problem Howard has seen before, when data warehousing became popular 15 or so years ago. Companies would spend hundreds of millions of dollars on data warehousing and, once they were done, would be stymied on what to do next.
Today, the cycle is repeating itself, as IT decision makers jump into the big data ecosystem for a variety of reasons without thinking through the end results of their decisions. “People don’t want to think they’re the last ones getting in on these technologies,” Howard said.
Is Crowdsourcing a Solution?
Kaggle, for its part, is working to bridge this data knowledge gap. Howard sees signs that the crowdsourcing technique Kaggle uses is encouraging detailed feedback for participants' data problems and raising the overall data analytics talent level.
The company got its start by hosting crowdsourcing competitions for organizations with the goal of producing data algorithms. The firm recently launched Kaggle Prospect, a new type of competition that asks a broader question for companies. Howard described it as: “Here’s our data. What do you think we should do with it?” Howard confided that Kaggle’s next project is an algorithm hosting platform to expand its offerings down the data analytics chain.
Open source development has often been credited with building big data technologies; open source-like methods, including crowdsourcing, may play a key role in how data is analyzed, too.