Big Data Boosts Storage Needs - And Opportunities

Storage has always been important to the enterprise – but the rise of big data applications puts unprecedented pressure on storage strategies and technologies. It’s also delivering unprecedented benefits to the companies that figure out how to do it right.

So how big is big data? Approximately 2.5 quintillion bytes of data are created every day, 90% of it unstructured, according to IBM’s estimates. Given that data can be in the form of customer sales interactions, corporate logistics information, or communications with partners and suppliers, companies are faced with tough choices. Data centers full of standard 2TB hard drives were not designed to handle big data.

What is needed is a combination of robust storage hardware and software that allow for quick access to relevant information.

Some Need Storage More Than Others

While big data and the storage needed to hold and analyze giant data sets are common among mid- to large-scale enterprises, some need big data storage solutions more than others. Earth science, engineering modeling, media and entertainment, and rapidly growing online services all contribute to the massive amounts of data being generated. The U.S. Library of Congress, for example, had 235TB of storage in April 2011. For this information to be analyzed, it must be stored properly for instant access.

“Whether it is storage systems architectures or storage devices enabling big data applications, the growth of content is increasing the amount of large data sets that enterprises must work with,” wrote Tom Coughlin, president of Coughlin Associates, a storage analyst and consultancy, in a recent blog post. “These big data applications require managing, protecting and analyzing large and complex data content.”

Analysts with McKinsey and Co. estimate nearly all sectors in the U.S. economy had an average of at least 200TB of stored data per company with more than 1,000 employees. That’s twice the size of U.S. retailer Wal-Mart’s data warehouse in 1999. Many sectors had more than a petabyte in mean stored data per company. European companies have also amassed a massive storage capacity (almost 11 exabytes). That’s 70% of the computer storage space created in the U.S. (more than 16 exabytes) in 2010.

But storing this information for data analysis can prove pricey, prompting enterprises to look for innovative ways to consolidate data sets and reconfigure connections between big data applications.

Overcoming Cost Constraints

While data warehouses cost tens or hundreds of millions of dollars to start with, the cost of storage can increase astronomically from there when used for big data projects.

A supported Hadoop distribution costs about $4,000 per node annually, on average. A Hadoop cluster requires between 125 and 250 nodes and costs about $1 million, according to John Bantleman, CEO of big-data database developer RainStor. And companies like Yahoo have 200PB data sets spread across 50,000 network nodes!
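To put those figures side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration built on the numbers quoted above. The function name is made up for the example, the decimal conversion of 1PB to 1,000TB is an assumption, and the source does not say whether the $1 million figure covers anything beyond the per-node support cost.

```python
# Figures quoted in the article; everything else here is illustrative.
COST_PER_NODE_USD = 4_000        # annual cost of a supported distribution, per node
CLUSTER_NODE_RANGE = (125, 250)  # quoted range of nodes in a cluster


def annual_support_cost(nodes: int, cost_per_node: int = COST_PER_NODE_USD) -> int:
    """Annual support cost for a cluster of the given size (hypothetical helper)."""
    return nodes * cost_per_node


low_cost, high_cost = (annual_support_cost(n) for n in CLUSTER_NODE_RANGE)
print(f"Annual support cost: ${low_cost:,} to ${high_cost:,}")

# Average data per node for the Yahoo example.
YAHOO_DATA_PB = 200
YAHOO_NODES = 50_000
TB_PER_PB = 1_000  # assumption: decimal units

tb_per_node = YAHOO_DATA_PB * TB_PER_PB / YAHOO_NODES
print(f"Average data per node: {tb_per_node} TB")
```

On these numbers the per-node fee alone comes to between $500,000 and $1 million a year, so the quoted $1 million lines up with the upper end of the node range, and the Yahoo example works out to an average of 4TB per node.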

“We know one thing is proven: The benefits of leveraging Big Data will outweigh IT investment,” Bantleman wrote in a blog. “Cost by how much is the question.”

Bantleman suggests there are two key areas that will continue decreasing the cost of big data storage:

Re-using existing SQL query language and existing business intelligence tools against data within Hadoop.
Compressing data at its most basic level, which not only reduces storage requirements, but also drives down the number of nodes and simplifies the infrastructure.
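The link between compression and node count in the second point can be illustrated with a rough sketch. It uses Python's standard zlib module purely as a stand-in, since the article does not say which compression technique Bantleman has in mind. The sample records, the 1,000TB data set, and the 4TB-per-node capacity are all hypothetical, and replication overhead is ignored.

```python
import math
import zlib


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Ratio of original size to zlib-compressed size."""
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)


def nodes_needed(data_tb: float, capacity_tb_per_node: float) -> int:
    """Nodes required to hold the data, ignoring replication (hypothetical helper)."""
    return math.ceil(data_tb / capacity_tb_per_node)


# Hypothetical, highly repetitive log-style records. Real data compresses far less.
sample = b"2013-01-15,store=0042,sku=998877,qty=1,status=OK\n" * 10_000
ratio = compression_ratio(sample)

RAW_DATA_TB = 1_000.0       # hypothetical data set size
CAPACITY_TB_PER_NODE = 4.0  # hypothetical usable capacity per node

nodes_before = nodes_needed(RAW_DATA_TB, CAPACITY_TB_PER_NODE)
nodes_after = nodes_needed(RAW_DATA_TB / ratio, CAPACITY_TB_PER_NODE)

print(f"Compression ratio on sample: {ratio:.1f}x")
print(f"Nodes before: {nodes_before}, nodes after: {nodes_after}")
```

The ratio measured on such a repetitive sample is far higher than production data would yield, so the useful part is the relationship rather than the number: whatever ratio is achieved divides the storage footprint, and the node count falls with it.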

Another factor affecting cost and complexity centers on where these storage arrays are physically located. New technologies are bringing some storage and storage functionality back much closer to the server and moving some further away in cloud storage. Increasingly, storage functions will be distributed inside and outside of the data center, in internal and external clouds.

More importantly, storage will be a key enabler of new business process and business intelligence applications that will be able to digest and present orders of magnitude more data than current applications, says Wikibon.com CTO David Floyer.

Storage-as-a-Service Meets Big-Data-as-a-Service

The economics of data movement are tipping the scales towards distributed compute services. Processing the data where it is sitting will be the model for the next generation of platforms. The infrastructure for this is falling into place.

As big data moves away from traditional closed data collection and analysis, companies are increasingly considering the benefits of cloud-based services. Online applications and services are now new sources of expanding data, which brings new challenges for fast access to and fast use of information. Big data therefore results in big storage and big business opportunities.

While storage housed on-premises gives some systems dedicated to big data analytics the advantage of tighter control, the more logical extension would be the expansion of online storage services for big data analytics. The concept of Big-Data-as-a-Service (BDaaS) is expected to debut in the Asia Pacific region in 2013, according to analysts with research firm IDC.

“We have seen cloud services, hosted data centers, service providers, and system integrators all expanding their XaaS offerings,” Craig Stires, research director for big data and analytics at IDC Asia/Pacific, predicted in his 2013 outlook. “The implementation and execution of a provisioned BDaaS solution will leverage platform, networking, storage, and compute services. IDC expects to see a breakthrough BDaaS offering in 2013, which will leverage all of these assets, as well as solve the challenge of how customers will on-board their data.”

Whether in the cloud or in the data center, companies will look for ways to effectively cut costs without having to reduce the amount of information they work with. IT departments will be faced with the challenge of how to integrate these new sources of data within existing well-structured data management systems.

Organizations have invested considerable time in agreeing on what data is to be included in traditional analytical data storage, how it is to be defined, who owns it, and who has permission to use it. The inclusion of new sources of data, streaming in at high speeds, with potentially large issues around data quality, will be a massive challenge. Finding the most efficient way to store this data will give competitive advantages to organizations that do it right.