Hadoop is designed to store big data cheaply on a distributed file system across commodity servers. How you get that data there is your problem. And it’s a surprisingly critical issue because Hadoop isn’t a replacement for existing infrastructure, but rather a tool to augment data management and storage capabilities. Data, therefore, will be continually going in and out.
Beyond Basic Tools
Basic tools exist, of course: Since Hadoop came into being, simple shell commands like hadoop fs -copyFromLocal have offered a straightforward, if slow, way to get data into Hadoop. And there's Apache Sqoop, which is built expressly for moving data between a relational database management system (RDBMS) and Hadoop.
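To give a sense of how bare-bones that first option is, here is a minimal sketch of the same copy operation done through Hadoop's Java FileSystem API (the namenode address and file paths are placeholders, not anything from a real deployment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the namenode address here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Push a local export file into HDFS -- no filtering, no transformation,
        // just a raw byte-for-byte copy.
        fs.copyFromLocalFile(new Path("/tmp/orders.csv"),
                             new Path("/data/raw/orders.csv"));

        fs.close();
    }
}

That is the whole story: bytes go in, and anything smarter, like selecting columns or filtering rows, is left to you.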
But Sqoop has limitations of its own. It works, but it relies on low-level MapReduce jobs under the hood, which adds complexity and (since MapReduce runs in batches) time to every import and export. You could, of course, take the time and dump your data into Hadoop just once, but that assumes Hadoop will completely replace your existing data storage infrastructure.
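That MapReduce dependency is visible even when Sqoop is driven from code. A rough sketch, assuming Sqoop 1's programmatic entry point Sqoop.runTool (the connection string, credentials, table, and target directory are all placeholders):

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // These arguments mirror the sqoop CLI flags; every value is a placeholder.
        String[] sqoopArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost.example.com/sales",
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",
            "--num-mappers", "4"   // each mapper becomes a MapReduce task
        };

        // Sqoop.runTool launches the import, which executes as a batch MapReduce job
        // on the cluster -- hence the latency on every import and export.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}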
This is the near-forgotten side of big data: properly placing Hadoop within existing infrastructure so data is stored cheaply, but still quickly accessible for analysis. It is here that data integration tools must play a role as the bridge between existing data stores, analytics and business intelligence tools on one side, and Hadoop on the other.
Pervasive Software is a recent entrant to the Hadoop space, but not to the field of data integration: The Pervasive Data Integrator is no stranger to those who move in data circles. Earlier this month, the Austin-based company announced a Hadoop edition of the product that lets users roll data from more than 200 sources into the Hadoop Distributed File System (HDFS) or HBase, the Bigtable-style NoSQL database that runs atop Hadoop.
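Targeting HBase rather than flat HDFS files means writing rows through HBase's own client API. A minimal, hand-rolled sketch of such a write, using the stock HBase client rather than Pervasive's product (the table name, column family, and values are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLoadExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("orders"))) {

            // One Put per row key; the column family "d" and columns are placeholders.
            Put put = new Put(Bytes.toBytes("order-10001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("customer"), Bytes.toBytes("ACME"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("total"), Bytes.toBytes("199.95"));

            table.put(put);
        }
    }
}

Multiply that by 200 source systems and it becomes clear why an integration product that handles the plumbing is attractive.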
A Visual Approach
Unlike Sqoop, Pervasive uses a visual approach to integrating data.
“It’s a mapping problem,” said Pervasive CTO Mike Hoskins, recounting how, even during development, one of Pervasive’s developers performed an off-the-cuff integration of 50,000 rows from an Oracle database into Hadoop in seconds… and that included the time it took to visually map the Oracle tables to Hadoop.
“He just mapped the tables, set the filters and constraints, set the target and clicked go,” Hoskins said.
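Pervasive's designer does those steps visually, but translated into plain code the same workflow, select the source rows, apply a filter, and stream the results to a target in HDFS, looks roughly like the hand-written sketch below. The JDBC URL, credentials, query, and HDFS path are all placeholders, and this is an illustration of the idea, not Pervasive's implementation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OracleToHdfsSketch {
    public static void main(String[] args) throws Exception {
        // Source: an Oracle table, with the "filter and constraints" pushed into the query.
        // Requires the Oracle JDBC driver on the classpath.
        try (Connection db = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost.example.com:1521:ORCL", "etl_user", "secret");
             Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery(
                 "SELECT order_id, customer, total FROM orders WHERE total > 100")) {

            // Target: a file in HDFS.
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.create(new Path("/data/orders/filtered.csv"))) {
                while (rows.next()) {
                    // The "mapping" is simply choosing which source columns land in which fields.
                    String line = rows.getLong("order_id") + ","
                                + rows.getString("customer") + ","
                                + rows.getBigDecimal("total") + "\n";
                    out.write(line.getBytes("UTF-8"));
                }
            }
        }
    }
}

The selling point of the visual tools is that nobody has to write, test, and maintain that plumbing by hand for every source.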
Hoskins has a vested interest in talking up Pervasive, of course, but his company’s software is part of a growing class of data integration software geared to work with Hadoop and its ecosystem of big data tools. Among these are Talend’s Open Studio and Enterprise Data Integration products, as well as Pentaho’s Kettle.
Data integration tools like these will make transitioning to Hadoop a lot easier up front, and will simplify pulling data back out for further analysis with tools outside Hadoop. And they will be necessary if Big Data is to fulfill its promise of making it easier to understand the meanings and patterns hidden in complex information.