LinkedIn has been making heavy use of Apache Hadoop and Pig for its People You May Know and skills features (among others), and has built up a sizable set of User Defined Functions (UDFs) for Pig in the process.
On January 10th, LinkedIn’s Matthew Hayes announced the release of DataFu on the LinkedIn engineering blog. DataFu is a collection of UDFs that LinkedIn has developed for data mining and statistics, and it is available on GitHub under the Apache 2.0 license.
The DataFu library has been tested against Pig 0.9. The library provides a number of functions for running PageRank, performing operations on Pig data bags, filtering input data and more.
Hayes’ post walks through an example scenario that uses DataFu to compute quantiles from a fake data set, so interested developers can jump in and try the library out immediately. The project also includes unit tests for each UDF.
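If you want a feel for what a quantile computation over a fake data set looks like before you set up Pig, here is a minimal Python sketch. It is not DataFu's code or API: DataFu ships as Pig UDFs, and the `quantiles` function, the nearest-rank rule and the made-up data below are all assumptions chosen for illustration.

```python
import random


def quantiles(values, probs):
    """Return one value per requested quantile, using a nearest-rank rule.

    Hypothetical helper for illustration only; not part of DataFu.
    """
    ordered = sorted(values)  # quantiles are defined over sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    picked = []
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each quantile must be between 0 and 1")
        # Map p onto an index in [0, n - 1], rounding half up.
        index = min(n - 1, int(p * (n - 1) + 0.5))
        picked.append(ordered[index])
    return picked


if __name__ == "__main__":
    # Fake data set: 1,000 made-up measurements (assumed for the demo).
    rng = random.Random(42)
    fake_values = [rng.gauss(50.0, 10.0) for _ in range(1000)]
    # Minimum, quartiles and maximum of the fake data.
    print(quantiles(fake_values, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

Asking for the quantiles 0.0, 0.25, 0.5, 0.75 and 1.0 returns the minimum, the three quartiles and the maximum. In Pig the same idea would be applied to a sorted bag of values rather than a Python list.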
It’s impressive to see just how much work is coming out of the Hadoop community these days. Any projects that you’re keeping an eye on?