IBM VP Anjul Bhambhri on the Era of the Data Scientist

Just a few short years ago, the problem of database size scaling to colossal capacities that exceeded the scope of entire network storage units seemed insurmountable. Today, it’s practically under control, with a wealth of open source technology emerging not from database engineers but rather from Internet architects. Hadoop has transformed the very nature of transformation, becoming one of the most readily adopted technologies in the history of the data center.

But is it mature? And will businesses have access to the right people with the skill sets necessary to master this new aspect of information management? After having spent five years as a senior engineer at Sybase, another six years as a development director at Informix, and over three years managing DB2 development for IBM, Anjul Bhambhri is arguably one of the most skilled plain data architects in the business. In September 2010, IBM promoted her to the new post of Vice President for Big Data and Streams. In an interview with ReadWriteWeb, we asked Bhambhri whether the big data tools developed in so short a time are mature enough to be used by IT workers everywhere, or whether they will truly require a scientist to master.

Anjul Bhambhri, VP for Big Data, IBM: You’re absolutely right that what we’ve seen in just this last year around big data, what people want to do with it, the possibilities, the use cases that we’ve heard from customers, have all been completely mind-blowing. It’s reaching this level of… I wouldn’t say maturity, but obviously everybody’s at a different point in this whole curve of big data. There are clearly people who are still trying to understand, what is big data versus small data? But definitely with social media… it’s clear that people do understand that there’s a lot of unstructured data, and they cannot just stick with deterministic data that has been residing in warehouses, maybe in IBMs or Oracles. People do recognize that unstructured data needs to be brought into their information management platforms. Otherwise, they’re not getting the complete view of the different data points that they should be looking at to make decisions.

From that standpoint, what has been happening in the open source community has been fabulous. The Apache Hadoop project has really gone to that next level of being able to pre-process a lot of this noisy, unstructured data in a cost-effective manner. It made it possible for people to start using commodity hardware. This is a very important step for sure, that you can now pre-process and analyze this data – what we call the discovery phase, where you can observe things about the data. [But] they cannot just take the open source and be able to observe and discover data. So there is a certain set of consumability or tooling that is needed for data scientists to be able to observe and discover, what is the data telling them? That tooling also has not reached a level of maturity, so that these people’s jobs become easier.

Next, of course, there is more analysis and correlation of this data that has to be done across all the data in or outside the enterprise. There’s the aspect of integrating, bringing it in, and not looking at data in a silo. I think the integration of data is going to take a different meaning, and more advances have to be done there. This integration has to be very seamless, so that when questions are asked, all the right pieces of information have to be pulled from the five, six, ten, however many repositories there are in the enterprise, in the right context.

Scott Fulton, ReadWriteWeb: I know you’ve mentioned online recently that the data scientist is the Job of the Future. There might be some enterprises that might be scared by the concept – that they have to hire a scientist, somebody at that level, just to be able to make sense of what the analytics are telling them about big data. If the tools mature to the next level, as you suggest they will, shouldn’t understanding how the data works not necessarily be the task of scientists, but maybe more what we would call “skilled artisans?”

Anjul Bhambhri: In terms of observing and discovering and analyzing, yes, a lot of us have come to some agreement that you need a role like a data scientist. Now, people are going to be emerging [from college] with pretty much the same kinds of education that they were getting in the last three to five years. So it’s not like the whole of education is going to change here.

It’s going to be very critical that the tooling we build actually helps people with the educational background that they might have in mathematics or statistics, computer science, modeling, analytics. If they’ve taken those kinds of courses, then they have a good foundation. But the tools themselves have to help them become good data scientists. And it’ll happen over a period of time. They at least have to have that mindset that there are new sources of data that they have to look at. They have to understand as data scientists that they will probably have to cause some shifts inside the organization, maybe across the culture of the organization. That attitude will need to be there. [But] without the tools, it will be very difficult even for a smart data scientist to do this on their own, because the volumes here are really large.

It’s not that this analysis or observation has to be done once and it’s over. It’s an ongoing thing – it has to be done every day. The demands on these scientists are going to be huge, and my point of view is that, even if we get the best and the brightest, they will need the right set of tools to examine this data.

Scott Fulton: So you perceive a future where, just like a graphic artist must adopt Photoshop as a skill, and a mathematician might have to adopt Mathematica or Wolfram Alpha as a skill – where the tool itself becomes a skill – there will be a field for big data analysis, where whatever tool emerges from that field becomes the skill.

Anjul Bhambhri: Yes, absolutely. The tools will have to help build those skills, because the skills cannot be built alone by getting more people educated in mathematics, for example. Vendors like us… will leverage the people who are actually coming out with these degrees to help build the right tools, because this is not just about the visualization. A lot of the magic we want, which hasn’t happened yet, has to happen behind the scenes. We don’t want to expose this to the domain experts, because they have to be able to focus on asking the questions, on exploring, on doing the what-if kind of analysis, extracting the right kinds of information from these silos or from these disparate data sources. They can’t get bogged down by the algorithms behind the scenes.

Scott Fulton, ReadWriteWeb: The greatest number of big data use cases in the last six to twelve months for the enterprise is in the marketing department. They’re gleaning the textual value coming out of the streams and streams of text generated by social media, and trying to find out, “Are they saying something positive about our product? Are they having a positive experience with our product? Is this a like or a dislike? Are they talking about our presidential candidate the way we want them to?” And a lot of the people asking these questions have degrees in business or marketing, but not mathematics or computer science.

Anjul Bhambhri, VP for Big Data, IBM: Exactly. I would state that our goal is for those people to become the data scientists. Because if we are talking about observing and discovering, they are the subject matter experts. They will know. They see a pattern, they will identify it.

Just as you said, these may be marketing officers. They are not just going to be the CEOs of the company. They could be the Chief Marketing Officers. I would claim, if we can make the CMO a data scientist, then we have achieved what we needed to, and that’s when the potential of big data will actually be realized to get the business outcome that we think can come out of tapping into what’s hidden in this big data.

Scott Fulton: But doesn’t that require a substantial rethinking of the educational platform that we’re presenting to people who aspire to be CMOs now? I’m thinking of college entrance exams and processes where your future is plotted on a graph, and where they may say, “You’re not particularly skilled in such things as biology, science, mathematics. But you know what you could be good at? You’d be good at marketing!” It’s like we take people away from math and put them into marketing, and we push that as the future. You would have these two worlds be fused, and you see a future in doing so. That’s going to take a lot of work, won’t it?

Anjul Bhambhri: Yes, because we are seeing they have to have that bent of mind. We’re not taking them away from their core competency. But they have to know that now there are additional pieces of information that can be made available to them, that they can use in their decision making processes. That awareness has to be there. If they settle for the additional reports that they’ve been getting, they’re not leveraging some of the new data sources that are emerging. They have to recognize a shift in how their customers are expressing themselves, away from relying only on written surveys or surveys on the phone – “We’re going to give you this two-day vacation” – which most people actually don’t want to take.

At the same time, the individual may spend hours on something like social media, and they can go on for hours expressing what their point of view or sentiment is on certain products. This marketing thing is one example. If they don’t tap into that, either the business is going to suffer, or they’ll personally suffer, but somebody’s going to suffer. And the consumer will suffer too, because nobody’s listening to them.