The explosion of data driven by sensors, data mining and social media and other Web-based interactions means that more and more companies will need to find ways of dealing with massive data sets – even companies that haven’t typically been data driven before. But new business analytics applications may require more processing power than your organization has ever needed before, requiring you to find ways to handle data as efficiently as possible. Infrastructure-as-a-service providers and inexpensive data warehousing appliances with in-memory analytics will provide options for many organizations. But some may find distributed computing a better fit for their organization’s big data needs.
Scientists and academics have been taking advantage of distributed computing for years, but it’s an approach that can benefit information workers in other areas. Here are some methods of running applications in distributed environments, including some newer approaches.
Beowulf Clustering
Made popular by NASA in 1993, cluster computing uses commodity hardware to pool resources for virtual applications. Pooling multiple systems allows you to take advantage of parallel processing. Parallel processing puts multiple processors to work on a problem simultaneously – multiple slower processors working in parallel are generally more efficient than a single fast processor working alone. Supercomputers use multiple processors for parallel processing, but a cluster of low-end machines working in unison can become the equivalent of a large supercomputer.
Beowulf clustering is a popular architecture for cluster computing. The open source Parallel Virtual Machine (PVM) software package and Message Passing Interface (MPI) implementations, such as OpenMPI or MPICH, are common software for building these clusters.
The advantage of using this approach is that you can run an application on a large number of inexpensive pieces of hardware, including systems with completely different hardware. However, some of the newer methods we’ll discuss next may be preferable.
Server Aggregation
Although a Beowulf cluster attempts to mimic the behavior of a single machine with multiple processors, each part of the cluster still has its own operating system and software stack installed. Instead of having a virtual application that runs across several machines with each running an operating system, the server aggregation approach runs a single instance of an OS across all the servers in a cluster. Therefore, all the physical resources go to the virtual machine.
This can be a real cost saver. The cost of servers is non-linear, so it’s generally cheaper to buy several two-socket systems than a single multi-socket system. And since you’ll have multiple servers, they can be used for other purposes when you don’t need them for massive number crunching jobs.
Server aggregation is a relatively new approach to virtualization. As far as we know the only company to offer server aggregation solutions is ScaleMP. However, we expect to see this approach take off over the next few years.
Server aggregation makes sense if you want to use your own hardware. But what if you want to use a public infrastructure-as-a-service?
Virtual Cluster Appliances
A virtual appliance is typically a VM designed to do a specific function with minimal configuration. Virtual cluster appliances are VMs designed for cluster computing right of the out of the box. You can learn more about the approach here.
One of the advantages here is that these VMs can be deployed to a cloud service like Amazon EC2. Instead of hosting several physical servers running virtualization or aggregation software, you can have many virtual servers running in parallel in the cloud. You still get the advantage of massively parallel computing, but without the hassle of running physical infrastructure.
The Nimbus Project is an open source toolkit for creating virtual infrastructure for cluster computing. It was used by the STAR project build a 100 VM cluster on EC2. Another source for virtual cluster appliances is Grid Appliance, which offers both a general purpose cluster appliance and one built specifically for Apache Hadoop.
Photo by hutch