Over the past few years, machine learning has quickly become the "secret sauce" of large-scale web sites. Machine learning systems have historically been hand-crafted by the small armies of computer science and mathematics Ph.D.s employed at places like Google. With the growing popularity of machine learning and other statistical techniques, the demand for so-called "data scientists" (software developers and analysts with the skill to apply statistical techniques to large data sets) has exploded since 2010.
As a result, these rarefied skills have become extremely difficult to find and expensive to retain, driving up the cost of machine learning systems and making it difficult for enterprises and smaller web firms to apply the technology. The data scientist talent shortage is also an opportunity, however, and a new breed of software platform is rising to meet this need. Building upon the low-level big data infrastructure now available, these new platforms seek to democratize machine learning and advanced analytics, making their benefits available to enterprises and firms that either can't afford or can't find enough Ph.D.s and data scientists. The first of this coming wave of machine learning-powered platforms are launching at this week's O'Reilly Strata conference. Here are three companies leading the way.
Skytree
Skytree Server is a software product aimed at allowing users to very quickly deploy highly accurate and very fast machine learning systems. The idea behind Skytree Server is to disrupt the typical development cycle of modeling machine learning systems in a high-level tool like R or Matlab, and then coding them up using Python or C for deployment in order to achieve an acceptable level of performance.
This is not to diminish the Skytree value proposition in the least. By analogy, if the skills required to use machine learning are akin to knowing how to drive, and the skills required to build a production machine learning system are akin to knowing how to build a car, the Amazons and eBays of the world have built their machine learning 'vehicles' from the tires up, while what Skytree does is allow you to drive a Ferrari (a) without knowing how to build one and (b) on a Kia budget. Skytree Server is priced on a subscription basis, starting at $2,999 per year for up to 4 cores. It is also available as a Free Edition, which can process up to 100,000 rows of data.
BigML
BigML was founded a year ago with the vision of creating "ML for the rest of us." With that in mind, they've created a cloud-based offering targeted at business users that dramatically lowers the barriers to performing machine learning analysis. BigML users typically begin an analysis by uploading a data set in text format. The service offers a wizard-based approach to formatting and cleaning up data, backed by some sophisticated pattern matching, aimed at making sure the system can tolerate real-world (read, "messy") data. One or more columns in the data can be denoted as prediction targets, which the tool will use to train a predictive model.
Once the model has been generated, additional data can be fed into the system and the model will be used to make predictions about the prediction targets. BigML currently only supports decision tree models for machine learning. While this may be a limitation for true aficionados, the company argues that the decision tree technique is powerful because it can handle a wide variety of data types, is particularly intuitive and lends itself to visual representation, is a great place to start if you don't already know what kind of analysis to apply, and is easy to scale.
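To make the workflow above concrete, here is a minimal sketch of the general decision tree technique in plain Python: pick a target column, train a tree on the remaining columns, then feed in new rows for prediction. This is not BigML's implementation or API. The function names, the toy "churned" data set, and the choice of Gini impurity as the split criterion are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def matches(value, test):
    """Numeric features split on <=, everything else on equality."""
    if isinstance(test, (int, float)):
        return value <= test
    return value == test

def best_split(rows, target):
    """Return (impurity, feature, test) for the lowest-impurity split, or None."""
    best = None
    for feature in (k for k in rows[0] if k != target):
        for test in {row[feature] for row in rows}:
            left = [r[target] for r in rows if matches(r[feature], test)]
            right = [r[target] for r in rows if not matches(r[feature], test)]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, test)
    return best

def train(rows, target, max_depth=3):
    """Build a tree: leaves are labels, inner nodes are (feature, test, left, right)."""
    labels = [r[target] for r in rows]
    split = None
    if max_depth > 0 and len(set(labels)) > 1:
        split = best_split(rows, target)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, feature, test = split
    left = [r for r in rows if matches(r[feature], test)]
    right = [r for r in rows if not matches(r[feature], test)]
    return (feature, test,
            train(left, target, max_depth - 1),
            train(right, target, max_depth - 1))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        feature, test, left, right = tree
        tree = left if matches(row[feature], test) else right
    return tree

# Hypothetical toy data set; "churned" is the prediction target.
rows = [
    {"plan": "free", "monthly_visits": 2, "churned": "yes"},
    {"plan": "free", "monthly_visits": 3, "churned": "yes"},
    {"plan": "paid", "monthly_visits": 20, "churned": "no"},
    {"plan": "paid", "monthly_visits": 25, "churned": "no"},
    {"plan": "free", "monthly_visits": 30, "churned": "no"},
    {"plan": "paid", "monthly_visits": 1, "churned": "yes"},
]
model = train(rows, "churned")
print(model)  # ('monthly_visits', 3, 'yes', 'no')
print(predict(model, {"plan": "free", "monthly_visits": 4}))  # no
```

The learned tree is a single readable rule (three or fewer monthly visits predicts churn), and the same code handles a text column and a numeric column without any special preparation. Those are the two properties the company's argument leans on.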
Continuing our driving analogy, BigML offers an easy-to-drive family sedan that appears at your driveway when you need it, takes you where you want to go, and presents helpful guidance on how you got there, ensuring that you're never lost. BigML is priced using a credit-based system at $0.05 per credit, with the number of credits required based on the size of your data, the size of your model, and the number of predictions you need to make.
Precog
Precog, still in stealth at the time of this writing, aims to offer a developer-focused platform for "data-driven, insightful, intelligent applications." Of the three companies profiled here, Precog seems to want most clearly to be a PaaS for machine learning, and takes a very interesting approach with its platform. Precog envisions a usage model in which users "Capture" data by explicitly (via a REST API) or implicitly (via an adapter) sending it to the Precog service, "Enrich" data by mashing it up with public and partner-provided datasets, "Analyze" the data using a variety of machine learning techniques, and "Act" on it in their own applications or by pushing it to third party systems.
At the heart of Precog is a scalable, custom-built analytics database coupled with a high-level analytics API that allows users to perform a variety of analyses by name (e.g. "optimize", "cluster", or "predict"), without getting bogged down by the details of which algorithm is best. In this way, Precog is probably most analogous to the kit cars you could order from the back of Popular Mechanics in the 80s, offering those willing to get their hands a bit dirty a way to create a unique and customized high-performance vehicle without needing to engineer the engine and frame from scratch. Oh, and you can rent it, a la SaaS. Precog will become available this week to select alpha users, with a private beta expected shortly thereafter. The product is offered by ReportGrid, a year-old company initially focused on providing sexy embeddable analytics reports for SaaS companies.
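The "analysis by name" idea can be sketched in a few lines of Python. Nothing here reflects Precog's actual API, which was not public at the time of writing. The `analyze` function, the `ANALYSES` registry, and the two deliberately simple algorithms behind the names are hypothetical stand-ins that show how a caller can ask for "predict" or "cluster" without choosing an algorithm.

```python
from statistics import mean

def _predict_next(values):
    """Fit a least-squares line to (index, value) pairs and extrapolate one step.

    Needs at least two values.
    """
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (len(values) - x_bar)

def _cluster(values):
    """Split the sorted values into two groups at the widest gap.

    Needs at least two values.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]

# Hypothetical registry: the platform, not the caller, decides what runs.
ANALYSES = {"predict": _predict_next, "cluster": _cluster}

def analyze(name, values):
    """Run the analysis registered under `name` on a list of numbers."""
    if name not in ANALYSES:
        raise ValueError(f"unknown analysis: {name!r}")
    return ANALYSES[name](values)

print(analyze("predict", [10, 12, 14, 16]))   # 18.0
print(analyze("cluster", [1, 2, 3, 50, 52]))  # [[1, 2, 3], [50, 52]]
```

The design point is the indirection: because callers only name the analysis, the platform is free to swap in a better algorithm behind the same name later.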
The three companies profiled here represent very distinct approaches to making 'machine learning as a service' a reality, and I expect we'll see many more such offerings in the coming months.