Grid-computing startup Aster Data Systems will officially launch today, three years after it was founded. Aster, which began in the Ph.D. program at Stanford, is a provider of “massively parallel processing databases” for organizations that have mammoth quantities of data that need to be stored and analyzed quickly. The Redwood City, California-based company is backed by Sequoia Capital, Cambrian Ventures, and First Round Capital.
Aster’s nCluster software allows companies with large amounts of data to store it on commodity hardware and scale with one click, adding new servers as the data set grows. The company’s first major client is MySpace, which generates hundreds of terabytes of traffic data from its 110 million monthly unique users. Mining that data to understand how customers use and interact with the site requires some pretty robust architecture.
Aster’s solution for MySpace uses a 100-node cluster of off-the-shelf commodity servers that can capture and load 100% of the data and run complex queries quickly. “MySpace needed to analyze complete datasets – not just samples or summaries. Sampling would completely miss infrequently occurring but highly profitable patterns,” according to Aster, which says that nCluster has allowed MySpace to work with all of its terabytes of data and avoid the need to sample.
nCluster works by splitting up the cloud into smaller bits that each have a specific task. “Loader” nodes load data from external sources (and export to them), while “worker” boxes keep data stored on local disks. A “queen” layer directs the entire operation, intelligently routing queries to the proper node. The “loader” tier can scale independently as needed, says Aster. “This enables query load-balancing to eliminate hot-spots and increase performance, returning results in seconds or minutes versus hours or ‘did not finish,’” writes the company in a case study.
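To make the loader/worker/queen split a bit more concrete, here is a toy Python sketch of the general idea. It is not Aster's code: the class names, the hash-based placement of rows, and the sample events are all assumptions made for illustration, since the company has not said how nCluster actually partitions data or routes queries.

```python
import zlib
from collections import defaultdict


class Worker:
    """Hypothetical worker node: holds its share of rows (a list stands in for local disk)."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def store(self, row):
        self.rows.append(row)

    def count_by(self, field):
        # Partial aggregate over only the rows this worker holds.
        counts = defaultdict(int)
        for row in self.rows:
            counts[row[field]] += 1
        return counts


class Queen:
    """Hypothetical coordinator: decides which worker owns a key and merges partial results."""

    def __init__(self, workers):
        self.workers = workers

    def worker_for(self, key):
        # Assumed placement rule: a stable hash of the key, modulo the worker count.
        return self.workers[zlib.crc32(key.encode("utf-8")) % len(self.workers)]

    def count_by(self, field):
        # Fan the query out to every worker, then combine the partial counts.
        total = defaultdict(int)
        for worker in self.workers:
            for value, n in worker.count_by(field).items():
                total[value] += n
        return dict(total)


class Loader:
    """Hypothetical loader node: pulls rows from an external source and hands them to workers."""

    def __init__(self, queen):
        self.queen = queen

    def load(self, rows, key_field):
        for row in rows:
            self.queen.worker_for(row[key_field]).store(row)


# Made-up traffic events, purely for illustration.
events = [
    {"user": "alice", "page": "home"},
    {"user": "bob", "page": "music"},
    {"user": "alice", "page": "music"},
    {"user": "carol", "page": "home"},
    {"user": "dave", "page": "home"},
]

workers = [Worker(f"worker-{i}") for i in range(4)]
queen = Queen(workers)
Loader(queen).load(events, key_field="user")

page_counts = queen.count_by("page")
print(page_counts)
```

In this sketch every row is counted, not a sample of them, and each worker only scans its own slice, which is the rough shape of the "analyze complete datasets" pitch. A real system would also have to handle node failures, rebalancing when servers are added, and far more complex queries.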
The software reminds me of 3Tera’s AppLogic (our coverage), which is a grid computing operating system that makes it easier for companies to deploy their own compute cloud on commodity hardware. nCluster is essentially the same idea, but with an eye specifically toward managing and querying massive databases.