Guest author Seth Payne is a senior product manager at Skytap.
Big Data, just like Cloud Computing, has become a popular phrase to describe technology and practices that have been in use for many years. What has changed is that ever-increasing storage capacity and falling storage costs, along with vast improvements in data analysis, have made Big Data available to a variety of new firms and industries.
Scientific researchers, financial analysts and pharmaceutical firms have long used incredibly large datasets to answer incredibly complex questions. Large datasets, especially when analyzed in tandem with other information, can reveal patterns and relationships that would otherwise remain hidden.
Extracting Simplicity From The Complex
As a product manager within the Global Market Data group at NYSE Technologies, I was consistently impressed with how customers and partners analyzed the vast sets of market trade, quote and order-book data produced each day.
On the sell side, clients analyzed data spanning many years in an attempt to find patterns and relationships that could help fund portfolio managers build long-term investment strategies. On the buy side, clients mined more-recent data regarding the trade/quote activities of disparate assets. University and college clients sought data spanning decades. Regardless of the specific use case, clients required technology to process and analyze substantial and unwieldy amounts of data.
A range of technologies is employed to meet the needs of these use cases. For historical analysis, high-powered data warehouses such as those offered by 1010data, ParAccel, EMC and others are incredible tools. Unlike transactional databases, which are designed for simple storage and retrieval, data warehouses are optimized for analysis. Complex event processors such as those from One Market Data, KDB and Sybase give high-frequency and other algorithmic traders the ability to analyze market activity across a wide array of financial instruments and markets at any given microsecond throughout the trading day.
These technologies are now being deployed within new industries. Business intelligence tools such as those offered by Tableau and MicroStrategy can now deal with very large and complex datasets. To a lesser extent, even Microsoft Excel has been retooled to handle Big Data with newly architected pivot tables and support for billions of rows of data within a single spreadsheet.
But Big Data is useful only if analysts ask the right questions and have at least a general idea of the relationships and patterns Big Data analysis may illuminate.
(See also Blinded By Big Data: It's The Models, Stupid.)
Do You Need Big Data?
Is Big Data right for your company? The first question any firm must ask is whether it will benefit from Big Data analysis. Begin by understanding the datasets available to you. Analysis of 20 years of stock closing prices, for example, would not likely require the power of Big Data systems. Given the relatively small size of this dataset, analysis can, and probably should, be performed using SQL or even just Excel.
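To make the sizing point concrete, here is a rough back-of-envelope estimate in Python for the closing-price example. The figures of 252 trading days per year and 16 bytes per row (an 8-byte date plus an 8-byte price) are assumptions chosen for illustration, and the function name is hypothetical.

```python
from typing import Tuple

# Assumptions (illustrative only): about 252 trading days per year,
# and 16 bytes per row (8-byte date + 8-byte closing price).
TRADING_DAYS_PER_YEAR = 252
BYTES_PER_ROW = 16


def closing_price_footprint(years: int, symbols: int = 1) -> Tuple[int, int]:
    """Return (row_count, size_in_bytes) for a daily closing-price history."""
    rows = years * TRADING_DAYS_PER_YEAR * symbols
    return rows, rows * BYTES_PER_ROW


rows, size_bytes = closing_price_footprint(years=20)
print(f"{rows:,} rows, about {size_bytes / 1024:.0f} KiB")
# prints "5,040 rows, about 79 KiB"
```

Under these assumptions, a single stock's 20-year history is a few thousand rows, which is comfortably within reach of a spreadsheet or a small SQL table.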
But large sets of unsorted and unordered data — such as financial transactions, production output records and weather data — do require Big Data analysis to bring order to the chaos and shed light on relationships, trends and patterns made visible only by structured and systematic analysis.
To start, formulate a relatively simple hypothesis and use Big Data analysis to test it. The results of this analysis should reveal information that will lead to further, more complex questions.
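As a minimal sketch of what a "relatively simple hypothesis" might look like in practice, the Python below checks whether heavier trading volume goes along with larger price moves by computing a Pearson correlation. The hypothesis, the function name and the six sample values are all made up for illustration; a real test would run over the full dataset in a warehouse or cluster rather than over a handful of numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample values, for illustration only.
daily_volume = [1.2, 0.8, 2.5, 1.9, 0.6, 3.1]    # millions of shares
abs_price_move = [0.4, 0.3, 1.1, 0.7, 0.2, 1.4]  # percent

r = pearson(daily_volume, abs_price_move)
print(f"volume vs. absolute price move: r = {r:.2f}")
```

A strong correlation here would not settle anything on its own, but it is the kind of result that suggests the next, more complex question, such as whether the relationship holds across asset classes or only on certain days.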
(See also The Rising Costs Of Misunderstanding Big Data.)
Big Data In The Cloud
It is no surprise that the rise of Big Data has coincided with the rapid adoption of Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) technologies. PaaS lets firms scale their capacity on demand and reduce costs while IaaS allows the rapid deployment of additional computing nodes. Together, they allow additional compute and storage capacity to be added almost instantaneously.
For example, a large hedge fund in New York used a cluster of computing nodes and storage to analyze the day’s trade/quote activity across all U.S. equity markets. The size of the datasets used in the analysis (typically 10GB to 12GB compressed) was growing steadily, allowing the market data manager to plan his capacity needs accurately. On occasion, however, trade/quote volumes exploded, creating far larger datasets. On these occasions, the market data manager could deploy additional virtual machine (VM) nodes in the cluster, ensuring that even unusually large datasets did not significantly delay analysis, without having to permanently add expensive computing resources.
The flexibility of cloud computing allows resources to be deployed as needed. As a result, firms avoid the tremendous expense of buying hardware capacity they'll need only occasionally.
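The capacity planning in the hedge-fund example can be sketched as a simple calculation: how many nodes are needed to finish the day's analysis within a fixed window? The Python below is a rough illustration only. The 12GB figure comes from the example above, while the 40GB spike, the per-node throughput, the two-hour deadline and the function name are all hypothetical, and the sketch assumes the work splits evenly across nodes.

```python
import math


def nodes_needed(dataset_gb: float, gb_per_node_hour: float,
                 deadline_hours: float) -> int:
    """Smallest node count that finishes the job within the deadline.

    Assumes ideal linear scaling: each node processes an equal share.
    """
    if dataset_gb < 0 or gb_per_node_hour <= 0 or deadline_hours <= 0:
        raise ValueError("size must be >= 0; throughput and deadline must be > 0")
    capacity_per_node = gb_per_node_hour * deadline_hours
    return max(1, math.ceil(dataset_gb / capacity_per_node))


# Hypothetical throughput and deadline, for illustration only.
THROUGHPUT_GB_PER_NODE_HOUR = 1.5
DEADLINE_HOURS = 2

typical = nodes_needed(12, THROUGHPUT_GB_PER_NODE_HOUR, DEADLINE_HOURS)
spike = nodes_needed(40, THROUGHPUT_GB_PER_NODE_HOUR, DEADLINE_HOURS)
print(f"typical day: {typical} nodes, spike day: {spike} nodes, "
      f"temporary extra: {spike - typical}")
# prints "typical day: 4 nodes, spike day: 14 nodes, temporary extra: 10"
```

In this sketch the ten extra nodes exist only on the spike day, which is the cost advantage the cloud model offers over buying hardware sized for the worst case.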
Big Data Isn't Always Cloud-Appropriate
While the cloud grants tremendous flexibility and reduces overall operating costs, it is not appropriate for all Big Data use cases.
For example, firms analyzing low-latency real-time data, such as aggregated Twitter feeds, may need to find other approaches. The cloud does not currently offer the performance necessary to process real-time data without introducing latency that would make the results too “stale” (by a millisecond or two) to be useful. Within a few years, virtualization technology should accommodate these ultra-low-latency use cases, but we're not there yet.
Cloud computing has given businesses flexible, affordable access to vast amounts of computing resources on demand, bringing Big Data analysis to the masses. As the technology continues to advance, the question for many businesses is how they can benefit from Big Data and how to use cloud computing to make it happen.