The business of information technology has made verbs of many nouns, not the least of which is "siloing." On the one hand, workers in an enterprise tend to operate against their own interest when they continue to do their business from disparate silos. On the other, corporations that actively try to strike down their silo walls often find themselves dealing with information chaos.

Many vendors have characterized the emerging field of Big Data as a revolution, a collapse of the Berlin Wall-like structures that collect businesspeople into separate enclaves. You might be surprised that IBM isn't one of them. Okay, so silos are bad, says IBM's vice president for big data, Anjul Bhambhri, in part 2 of our ReadWriteWeb interview. But you can't expect database re-architecture to provide you with freedom, and in some cases, there are good reasons why enterprises are departmentalized in the first place.

Scott Fulton, ReadWriteWeb: As you probably know on a deeper level than I, the reason for database siloing dates way, way back to the 1970s and '80s when computing products were purchased on a department-by-department basis. In the mainframe era - which IBM helped the world inaugurate (so it's your fault) - computing products were purchased, deployed, configured, programmed by the people in finance, in budgeting, in human resources, in insurance management, in payroll. And these were all disparate systems. The archival data that has amassed from this era is based on these ancient foundations.

I was on a webcast the other day listening to a fellow make the case for why we should remove silos from big organizations, and develop ways of merging big data into "usable meshes," he called them. It was a good point and it lasted for 60 minutes. And the first question he got from somebody texting in was, "Simple question: Why?" And the presenter said, "What do you mean, why?" And he said, "Okay, don't you know that these silos exist for a reason? Businesses like ours [I think he was in banking] have departments, and these departments have controls and policies that prevent information from being visible to people in other departments of the business." And he asked, "Why would you make me spend millions of dollars rearchitecting my data to become all one base, and then spend millions more dollars implementing policies and protections to re-institute the controls that I already have?" And the presenter was baffled; he never expected that question, and he never really answered.

Anjul Bhambhri, VP for Big Data, IBM: It doesn't matter what we do; you can't just get all this data into one place. Data is going to be where it is, in an enterprise. There may be department-level decisions that were made, department-level applications that are running on top of it. Nobody's going to like [some guy coming in saying] "Let me bring this all together." It's too much of an investment that has been made over the years, and it's completely unreasonable for anybody... In hindsight, we can always say this is the way things should have been architected. But the reality is that this is how things have been architected, and you run into this in almost all the enterprises.

My response and suggestion - and we've actually done it with clients - has been that you leave the data where it is. You're not going to start moving that around. You're not going to break those applications. You're not going to just rewrite those applications... just to solve this problem. And really, data federation and information integration is the way to go. Data is going to reside where it is. IBM has done a very good job in terms of our federation technology and our information integration capability, where we are able to federate the queries, we can pull the right set of information from the right repositories wherever it lies. Then we can obviously do joins across these things so that we can do lookups of information in maybe the warehouse, and we can correlate it with information that may be coming from a totally different application. And all of this is done while preserving the privacy and security, the accessibility, the role-based policies that may have been implemented.

We can't ask people to change all that. We can't have departments just start changing it. If there's some data that they don't want another department to see, then that has to be respected... Also, you don't really want it to change, right? People have built those repositories and those applications because they were the best choices at the time for that class of applications. They may have bought solutions from vendors like SAP, or they could be ERP or CRM systems... They can't all be thrown away.

But if a company used a CRM application, for example, to really understand aspects about the customer, we're saying you don't stop using that application, but you may need to augment the information that you can get from CRM with what maybe social media offers around the customer, so you can really get a 360 view of the customer. Don't abandon what you've got, but integrate. The level of tooling [around these services] has to be able, in that single dashboard, to pool the information from these CRM applications, [as well as] from these new data sources that may be Facebook or Twitter or text messages.

I really think federation and integration is the way to go here, and not dictate that data be moved or be in a single repository. Heterogeneity is a reality, and we have to accept it and provide the technology that actually takes advantage of that heterogeneity, and respect the decisions that the customers have made.
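To make the federation idea concrete, here is a minimal Python sketch, not IBM's technology or API. It assumes two hypothetical repositories (a warehouse and a CRM system), each of which stays where it is and enforces its own role-based column visibility. The `Source` class, the `federated_lookup` function, the role names, and all sample records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical repository that stays in place and enforces its own policy."""
    name: str
    rows: list[dict]
    visible_to: dict[str, set[str]]  # column name -> roles allowed to read it

    def query(self, role: str, customer_id: int) -> list[dict]:
        # Select the customer's rows, then keep only columns this role may see.
        allowed = {col for col, roles in self.visible_to.items() if role in roles}
        return [
            {k: v for k, v in row.items() if k in allowed}
            for row in self.rows
            if row["customer_id"] == customer_id
        ]


def federated_lookup(sources: list[Source], role: str, customer_id: int) -> dict:
    """Ask each source in turn and correlate the answers on customer_id.

    No data is copied into a central store; each source applies its own
    policy before anything leaves it. Assumes one row per customer per source.
    """
    merged = {"customer_id": customer_id}
    for src in sources:
        for row in src.query(role, customer_id):
            for key, value in row.items():
                if key != "customer_id":
                    merged[f"{src.name}.{key}"] = value
    return merged


# Hypothetical sample data, for illustration only.
warehouse = Source(
    name="warehouse",
    rows=[{"customer_id": 1, "lifetime_value": 1200, "segment": "retail"}],
    visible_to={
        "customer_id": {"analyst", "support"},
        "lifetime_value": {"analyst"},  # finance data hidden from support
        "segment": {"analyst", "support"},
    },
)
crm = Source(
    name="crm",
    rows=[{"customer_id": 1, "name": "Ada", "last_contact": "2012-01-15"}],
    visible_to={
        "customer_id": {"analyst", "support"},
        "name": {"analyst", "support"},
        "last_contact": {"analyst", "support"},
    },
)

support_view = federated_lookup([warehouse, crm], role="support", customer_id=1)
analyst_view = federated_lookup([warehouse, crm], role="analyst", customer_id=1)
```

In this sketch the support role gets a joined record without `warehouse.lifetime_value`, while the analyst role sees it. The departmental control survives the join because it is applied inside each source rather than re-implemented in a central store.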

Scott Fulton: You see the emerging tools that we talked about earlier, that we will need the data scientists to effectively learn how to use, will be tools that won't change the underlying foundation of data as we currently have it, but simply add a layer of federation on top of that.

Anjul Bhambhri: What is happening behind the scenes, we really just want the data scientists to focus on that. Their expertise is needed with other data sources that are important to the organization. Given their subject matter or domain expertise, they are the best ones to recommend where else information needs to be gleaned from. Then of course, the IT group has to make sure those can be dictated, plotted, in the data platform. They can't say, "Okay, we have two applications running on the mainframe and all these silos, but we can't bring in more data sources." They obviously have to facilitate that.

But the tools have to be so easy that they can say, "If I want to know this about customer X," information can be coming from the warehouse, from the CRM application, from the transactional system with the latest set of transactions that the customer has had in the last day or month. And if there's a way to say, okay, what is the last interaction that we had with this customer? Maybe the customer called in, maybe he went into our Web site and did some online stuff. It could just be random pieces of unrelated information about our customer, or it could be aggregated around the customer. But they should be able to visualize these things in the tool. Because you can imagine that random text about this customer also has to be presented properly - maybe things have to be highlighted, annotated so that the data scientist doesn't miss out on some important aspect.

That's really the direction we are moving in, and for any vendor to really help our customers to embrace what's happening in this Internet era, and really understand aspects of the business they are in, I think it's pretty critical that this happens.
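As a rough illustration of the tooling described in this answer, the Python sketch below finds a customer's most recent interaction across channels and marks up free text so that important terms stand out. It is an assumption-laden toy, not a description of any IBM product: the event records, the channel names, the keyword list, and the `[[...]]` highlight markers are all hypothetical.

```python
import re
from datetime import datetime


def last_interaction(events: list[dict]) -> dict:
    """Return the most recent event, whichever channel it came from."""
    return max(events, key=lambda e: e["when"])


def highlight(text: str, keywords: list[str]) -> str:
    """Wrap each keyword occurrence in [[...]] so a reader does not miss it."""
    if not keywords:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", text)


# Hypothetical interactions gathered from separate systems for one customer.
events = [
    {"channel": "call_center", "when": datetime(2012, 1, 10, 9, 30),
     "note": "Asked about a late fee."},
    {"channel": "web", "when": datetime(2012, 1, 14, 18, 5),
     "note": "Viewed the page on how to cancel an account."},
    {"channel": "text", "when": datetime(2012, 1, 12, 12, 0),
     "note": "Confirmed appointment."},
]

latest = last_interaction(events)
annotated = highlight(latest["note"], ["cancel", "late fee"])
```

Here `latest` is the web visit, and `annotated` carries the marker around "cancel", which is the kind of cue a dashboard could surface for a data scientist.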

Coming in Part 3: The shard and the cloud