Providing adequate software and tools for researchers has always been important to organizations, but it often comes at a high cost. In an era of constantly evolving technology and rapidly dwindling budgets, my IT team has had to work with a large pool of researchers to provide cost-effective solutions that meet the ever-growing demand for innovation and computing power.
I am an Information Technologist for the Department of Statistics and Probability at Michigan State University. The Department is home to award-winning faculty with a wide variety of expertise in fundamental and interdisciplinary research, and to over 100 graduate students from all over the world. Keeping the faculty and students ahead in their research is a constantly evolving challenge for my team and me.
Evolution of Statistical Software
For many years, most statistical analysis in our department was done in MATLAB, S-Plus, SPSS, or SAS. Even with a higher-education discount, most of this software required yearly renewal fees that quickly devoured our IT budget. Things started to change when the R language, first developed in 1993, began to gain traction in the statistics community in the early 2000s. R is an open source programming language and software environment for statistical computing and data analysis. Several years ago we began the transition to R at Michigan State; today it is used for the majority of the research in the department, as well as being a central focus of our statistics curriculum. By switching to the free, open source version of R, our department has been able to cut thousands of dollars in software costs each year and focus more on fueling and expanding research.
Erik Segur is an Information Technologist for the Department of Statistics and Probability at Michigan State University.
Lesson #1: The Shortcomings of Open Source
As more people began to use R and the analyses became increasingly complex, researchers ran into a major problem: time. Individual processing jobs were taking several months to complete. Calculations often need to be run several times to ensure accuracy, and waiting three months for a single run to finish was simply not feasible. R took this long because the iterations were computed serially, one right after another, using only one processor core at a time.
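At the time, the practical workaround in open source R was to rewrite such loops with explicit parallel constructs. The following is only a rough sketch of that idea, using the parallel package that now ships with R; run_sim() is a hypothetical stand-in for one iteration of a long-running simulation, not code from our researchers:

    # run_sim() is a hypothetical placeholder for one iteration of a long simulation
    library(parallel)

    run_sim <- function(seed) {
      set.seed(seed)
      mean(replicate(1e4, median(rnorm(1e3))))   # stand-in workload
    }

    # Serial: iterations run one after another on a single core
    serial_results <- lapply(1:100, run_sim)

    # Parallel: the same iterations spread across the available cores
    # (mclapply forks on Linux/macOS; Windows users need parLapply with a cluster)
    parallel_results <- mclapply(1:100, run_sim, mc.cores = detectCores())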
Until the spring of 2010, R was a 32-bit application and could address only a limited amount of memory, roughly 3 GB at most. When dealing with large datasets, researchers quickly ran out of memory and discovered they needed a way to work with large data efficiently.
Bo Cowgill from Google once said, "The best thing about R is that it was developed by statisticians. The worst thing about R ... is that it was developed by statisticians." Even though R was, and still is, constantly evolving, the department needed a solution that could keep up with hardware technology and compute calculations in an efficient, scalable manner.
Lesson #2: Find Commercial Enhancements for Open Source
Our search for a more effective version of R ultimately brought us to Revolution R Enterprise from Revolution Analytics, which provides commercial support and software built on open source R. It takes advantage of multiple processor cores through optimized assembly code and efficient multi-threaded algorithms that can use all of a machine's cores simultaneously. Although this addressed many of the shortcomings of open source R, professors were only using Revolution R on their desktops. The next question was how we could combine the power of our servers to dramatically decrease our computation times.
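Much of that speedup comes from swapping R's default linear algebra routines for multi-threaded ones, so matrix-heavy code benefits without any changes to the script itself. As a rough illustration of the kind of operation that gains the most (actual timings depend entirely on which math library R is linked against):

    # A BLAS-bound operation: with R's reference math library this runs on one core;
    # linked against a multi-threaded library, the same call spreads across all cores.
    n <- 4000
    x <- matrix(rnorm(n * n), nrow = n)

    system.time(crossprod(x))   # computes X'X; the heavy lifting happens in the BLAS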
Lesson #3: Expanding to Infinity and Beyond
Open source R is also memory-bound: all of the data, such as matrices and lists, must be stored in memory. Issues quickly arose when data sets grew to several gigabytes and were too big to fit into memory, which called for parallel external memory algorithms and data structures to handle the data. These challenges were tackled by Revolution Analytics as they developed R for a High Performance Computing (HPC) environment.
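The general idea behind an external memory algorithm is simple, even if real implementations are not: stream the data in chunks that do fit in RAM and carry only running summaries forward. The sketch below shows just that general idea in plain R; it is not Revolution Analytics' implementation, and the file name and column are hypothetical:

    # External-memory idea in plain R (not Revolution Analytics' implementation):
    # compute a column mean over a CSV too large to load at once by streaming it
    # in fixed-size chunks and keeping only running totals in memory.
    chunk_mean <- function(path, col = 1, chunk_rows = 1e5) {
      con <- file(path, open = "r")
      on.exit(close(con))
      readLines(con, n = 1)                      # discard the header row
      total <- 0; count <- 0
      repeat {
        chunk <- tryCatch(
          read.csv(con, header = FALSE, nrows = chunk_rows),
          error = function(e) NULL               # no lines left: end of file
        )
        if (is.null(chunk)) break
        total <- total + sum(chunk[[col]])
        count <- count + nrow(chunk)
      }
      total / count
    }

    # Hypothetical usage: chunk_mean("huge_data.csv", col = 2)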
In 2010, Revolution Analytics made Revolution R Enterprise free for academic users and shifted the focus of their enterprise software to big data, large-scale multiprocessor computing, and multi-core functionality. Revolution Analytics was going to tackle everything the department needed. The evolution was complete: open source R had gone from an inefficient, single-core program to an HPC environment.
Once the department could schedule R jobs in an HPC environment, demand increased drastically. The HPC cluster now schedules more than four times the number of jobs it handled in previous semesters, growing from roughly 200 jobs a year ago to over 800 this past semester. Jobs that took more than three months to complete in open source R now finish in a few days with Revolution R, and computational jobs are routinely run multiple times, with significantly higher levels of accuracy than ever before.
Conclusion
There are often great pieces of software created through open source, but they generally lack key features needed for an enterprise environment. Combined with commercial backing and expertise, these projects can be further developed and expanded to meet the needs of large-scale enterprise environments. IT departments can provide enhanced solutions to their users that adapt to the expanding world of cloud and High Performance Computing environments–all while minimizing the impact on a shrinking budget.