LexisNexis announced today that it will open-source its High Performance Computing Cluster (HPCC) technology, as well as offer an enterprise version with commercial support. The company is positioning HPCC Systems, developed internally by its Risk Solutions unit, as an alternative to Apache Hadoop. A virtual machine for testing purposes will be available soon, and code will be available in a few weeks.
The Risk Solutions unit, less well known than LexisNexis' legal and media units, was founded 10 years ago. It provides identity verification services to government agencies and private organizations such as banks and insurance companies. According to Armando Escalante, CTO of Risk Solutions, the company started developing HPCC 10 years ago when it found that existing solutions weren't capable of munging large data sets and returning results fast enough.
Since then, Risk Solutions has used HPCC to analyze and find links in large data sets. It has also provided its solutions to intelligence organizations and scientific research laboratories. HPCwire wrote about the technology in 2009:
"LexisNexis specializes in data -- lots of data -- about you, me, and just about every other person in the US that has any kind of digital fingerprint. These data come from thousands of databases about all kinds of transactions and public records that are kept by companies and agencies around the US. But just having the data isn't very useful; LexisNexis has to be able to access it on behalf of their customers to help them make complex decisions about what businesses to start or stop, what 500,000 people to send a packet of coupons to, or which John Smith living in California to get a search warrant for."
LexisNexis claims HPCC can scale to "thousands of nodes handling petabytes of data and supporting millions of transactions per minute."
Escalante said he and his team have been watching the development of Hadoop closely for the past few years and felt the time was right to make the technology available to customers outside the Risk Solutions base. Only the core technology is being released; LexisNexis' own data-linking techniques and data sources are not.
Like Hadoop, HPCC runs on clusters of commodity servers. It has three main components:
- Thor Data Refinery Cluster: the data extraction, transformation and loading system.
- Roxie Rapid Data Delivery Cluster: a delivery system for querying and data warehousing. Escalante believes this is a key competitive advantage over Hadoop.
- ECL (Enterprise Control Language): a declarative programming language developed in C++ for working with HPCC. Escalante says it's SQL-like, but "not too SQL-like."
HPCC will be available in two versions: a free open source Community Edition and a commercial Enterprise Edition. The Enterprise Edition will include support, training and some additional tools.
Escalante says the HPCC team has been working with Amazon Web Services to make sure the product works well on AWS servers.