Haven't we seen this before: IBM making a big acquisition of a “V” company in the big data space? Indeed, last April 13, IBM purchased analytics software maker Varicent for an undisclosed amount. And then this morning, IBM announced its acquisition of enterprise search facilitator Vivisimo.

IBM usually doesn’t acquire applications software makers unless it has something very specific in mind for them. IBM is clearly piecing together a “big data” platform - a comprehensive package for storing, accessing and analyzing unstructured data.

That IBM wants to be perceived as the master of data stores is no surprise. What’s interesting here is that the company is taking an almost Oracle-like approach to the subject, buying up as many as 30 separate big data-related entities in recent months (Vivisimo being #30) and wedging them together in hopes that they form a picture that looks something like a platform.

The man in charge of stitching together the collected “big data” shards into a cohesive whole is Arvind Krishna, the general manager for IBM’s information management division. Big data, Krishna informs us, is big. But that’s really just characteristic number one of three. The second most important characteristic is its variety of formats, and the third is its velocity. (Remember that word; it’ll come up again in just a bit.) These nebulous qualities along with the quickly mutating nature of big data are keys to what Krishna perceives as its most important phenomenon: what he calls its perishability. Understanding Krishna’s attitude toward this phenomenon provides the best insight into IBM’s platform-by-acquisition strategy.

“If you don’t react to it in a few minutes, it might not be worth it to react at all,” remarked Krishna in an interview earlier today with ReadWriteWeb (using words that tugged at my very heartstrings). “The way we look at this, every single company does a great job of dealing with structured information. Examples of this are like payroll or sales data - what product did I sell at which store at which geography at what time? I can slice it and dice it and see areas of weakness and strength. But if I look at my interactions with my clients, not just my transactions, what they’re saying about me could be in social media, what they’re writing to me could be in email, what they’re saying to my call center reps... Do I really have a full view across all of that? And the answer today would be, not really, right?”

Vivisimo began in the late 1990s as a very public effort at creating a next-generation Web portal, applying sophisticated semantics to refine the power of queries - just before Google’s rise to prominence. As it became more difficult for Vivisimo or any company to succeed on a broad technological scale by simply making better technology (and its query system, which I helped test, was indeed that), it found itself scaling back its customer base to more reasonable, conquerable verticals, such as the customer service field.

This played directly into some of today’s headlines, which speculated that IBM was purchasing a customer service software provider. That’s not how Arvind Krishna described it. Indeed, there’s a piece of Vivisimo’s original, broader plan that plays into IBM’s bigger, broader plan today, having to do with the creation of ontologies - structures of subjects that are related to a broader topic.

“When I want to index a document, simple indexing would count all the words in the document and say, ‘These words appear so many times,’ and then relevance becomes a matter of whether words appeared more than once. Next, you could go to grammar structure and semantics, so you can do ontologies. Suddenly, the way that Vivisimo does document capture and classification allows them to leverage these ontologies and develop their semantics. That is what is really interesting about their indexing technology.”
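
To make the “simple indexing” Krishna describes concrete, here is a minimal Python sketch of a word-count index, where relevance is nothing more than how often the query words appear. The function names and the sample sentence are illustrative assumptions, not anything taken from Vivisimo or IBM.

```python
import re
from collections import Counter


def index_document(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def relevance(index, query):
    """Score a document as the total count of the query words it contains."""
    return sum(index[word] for word in query.lower().split())


# Hypothetical sample document, invented for illustration.
doc = ("The call center logged the complaint. "
       "The customer wrote again about the complaint.")
index = index_document(doc)

print(index["complaint"])                        # occurrences of one word
print(relevance(index, "customer complaint"))    # naive relevance score
```

A counter like this knows nothing about grammar or how subjects relate to each other, which is the gap ontologies are meant to fill.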

Simple indexing becomes inefficient “at scale” - meaning that as document sizes scale up linearly, time to index and time to read the index scale up exponentially. That fact alone renders traditional database methodologies pointless for the waves of incoming text and media that broadband Internet has enabled. Despite having been marketed largely to the customer service sector, Vivisimo has in its arsenal a potentially very potent weapon: a kind of ontology development tool whose name just happens to be (I told you this word would come in handy) Velocity.

“Vivisimo Velocity can find, extract and deliver content regardless of format or where it resides,” reads a recent Vivisimo white paper. “Administrators have access to a library of pre-built Information Optimization connectors and a universal connect framework to enable indexing from common data repositories (e.g., file shares, databases, e-mail systems, content management and collaboration systems, customer relationship management applications, e-mail archives and other archival solutions). In addition, content from third-party feeds, subscriptions and other search engines can be added via federation.”

Federation is another very important word in IBM’s emerging big data strategy, which Krishna characterized as being more flexible, like a platform, as opposed to HP’s Autonomy, which he portrays as more specific, like an application. “[Calling it] indexing is almost a disservice to what this really does,” said Krishna. “It’s the ability to do rich indexing across a set of federated data sources at a scale that has not been done before and, in addition, preserving the security semantics of the underlying data sources.”

What does Krishna mean by this last part? How Velocity responds to its user depends on its understanding of who that user is - more specifically, whether she’s entitled to the information she’s searching for. Vivisimo has a rule definition system that tailors responses to known restrictions about the user - a later generation of the system originally intended for the Web portal, which could apply responses to known characteristics of the user.
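
One way to picture a federated query that respects each source's access rules is the Python sketch below. It is only an illustration under stated assumptions: the `Record`, `Source` and `federated_search` names, the group-based entitlement model and the sample data are all hypothetical, and none of it reflects Velocity's actual rule definition system.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    allowed_groups: frozenset  # groups entitled to see this record (assumed model)


@dataclass
class Source:
    name: str
    records: list = field(default_factory=list)

    def search(self, term, user_groups):
        """Search in place; return only records the user is entitled to see."""
        term = term.lower()
        return [
            r.text for r in self.records
            if term in r.text.lower() and r.allowed_groups & user_groups
        ]


def federated_search(sources, term, user_groups):
    """Send the query to every source and merge the hits as (source, text) pairs."""
    hits = []
    for source in sources:
        for text in source.search(term, user_groups):
            hits.append((source.name, text))
    return hits


# Hypothetical repositories, invented for illustration.
email = Source("email", [
    Record("Refund request from customer", frozenset({"support"})),
    Record("Refund policy legal memo", frozenset({"legal"})),
])
crm = Source("crm", [
    Record("Customer asked about refund status", frozenset({"support", "sales"})),
])

support_hits = federated_search([email, crm], "refund", {"support"})
legal_hits = federated_search([email, crm], "refund", {"legal"})
```

The same query returns different answers for a support rep and a lawyer, and each repository applies its own restrictions before anything leaves it.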

“Federation implies that I don’t need to actually copy the data; it can stay where it is,” added Krishna. “Security semantics means that, when someone tries to do a query, I can preserve the semantics so only those who are authorized to look at [the response] in the first place can get an answer that contains that information. And scale means, I’m not confined to a single machine or a single cluster.

“We are clearly acquiring technology which goes into a general platform that people can use to build custom applications and create an ecosystem of ISVs in addition to direct clients,” the general manager said. “We are also interested in bringing analytics to the big data space, in the form of a complete solution as opposed to having to write a custom answer for a client.”