Searching Hadoop Data Just Got A Lot Easier

The story of Hadoop is about two things: storing data and getting actionable information out of that data. One way to mine Hadoop for information has been enterprise search, which enables near-Google-like searching of large datasets.

Cloudera is betting big on enterprise search as a data-gathering tool with its new Cloudera Search beta release, which integrates search functionality right into Hadoop. Until now, enterprise search for Hadoop has typically relied on add-on tools such as the open-source Apache Solr and Apache Lucene software, or commercial versions like LucidWorks Search.

Search Is Easy, Deployment Is Hard

Enterprise search is one of those concepts that is so simple, it's easy to underestimate its value in the world of big data and data warehousing.

People "get" enterprise search a lot more easily than they get data-digging tools like MapReduce, because from the user perspective, it's just search: you type in some search terms in an only-slightly-more-complicated-than-Google format, and your results are shown. That's pretty much how people perceive the way Google and Bing find things on the Internet.

Of course, actually executing enterprise search isn't simple. Since data stored within Hadoop is typically unstructured, each record could be thought of as a single document. Think of a letter, for instance: you know there is an address for the recipient in the letter, a date and a salutation, among other elements. Structured data has all of these elements broken out into separate fields, but in unstructured data, there's no such parsing. Humans, of course, can look at unstructured data (and documents) and pick such elements out, but software needs help.
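
To make the letter example concrete, here is a minimal Python sketch of the kind of help software needs to pick elements out of free text. The sample letter, the regular expressions and the field names are all made up for illustration; this is not how Cloudera Search or Solr parses documents, and real-world letters would need far more forgiving rules.

```python
import re

# Hypothetical sample letter; the text and field names are illustrative only.
LETTER = """Jane Doe
42 Elm Street
Springfield, IL 62704

March 5, 2013

Dear Ms. Doe,

Thank you for your recent order.
"""

# A date written as "Month D, YYYY".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)
# A salutation line such as "Dear Ms. Doe,".
SALUTATION_RE = re.compile(r"^Dear (.+?),\s*$", re.MULTILINE)
# A US-style "City, ST 12345" address line.
CITY_LINE_RE = re.compile(r"^(.+), ([A-Z]{2}) (\d{5})$", re.MULTILINE)


def extract_fields(text):
    """Pull a few field-like elements out of free text; None where not found."""
    date = DATE_RE.search(text)
    salutation = SALUTATION_RE.search(text)
    city_line = CITY_LINE_RE.search(text)
    return {
        "date": date.group(0) if date else None,
        "recipient": salutation.group(1) if salutation else None,
        "city": city_line.group(1) if city_line else None,
        "state": city_line.group(2) if city_line else None,
    }


if __name__ == "__main__":
    print(extract_fields(LETTER))
```

A structured record would arrive with these fields already separated; with unstructured text, every field has to be recovered by rules like these, and any rule that fails to match simply leaves its field empty.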

Enterprise search gets its help from facets. Facets enable users of enterprise search to treat data pieces within unstructured data as they would fields within a relational database. Facets are basically inverted indexes that let users find specific pieces of information within unstructured data, such as an address.
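
The idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Solr or Lucene implementation: the three documents, the "state" field and the function names are all hypothetical. An inverted index maps each term to the documents that contain it, and a facet count then tallies the values of a chosen field across the matching documents.

```python
from collections import Counter, defaultdict

# Hypothetical documents: free text plus one field ("state") to facet on.
DOCS = {
    1: {"text": "Order shipped to Springfield", "state": "IL"},
    2: {"text": "Refund issued for late order", "state": "IL"},
    3: {"text": "Order delayed in Portland", "state": "OR"},
}


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc["text"].lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def facet_counts(docs, doc_ids, field):
    """Count how many matching documents carry each value of the field."""
    return Counter(docs[doc_id][field] for doc_id in doc_ids)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    hits = search(index, "order")
    print(sorted(hits))
    print(facet_counts(DOCS, hits, "state"))
```

Searching for "order" matches all three documents, and faceting on the state field splits those hits into two from IL and one from OR, which is the same kind of drill-down a user would otherwise get by grouping on a column in a relational database.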

This is why enterprise search is ideal for examining large sets of unstructured data. Of course, the more structured the data, the better: enterprise search does particularly well with data from weblogs, which are structured uniformly enough to enable deeper data mining.

Cloudera Moves To Unification

Because Cloudera Search is directly integrated within Cloudera's own commercial version of Hadoop, much of the configuration will already be handled for admins, smoothing out the deployment headaches.

"It's all about getting the entire thing to feel like one system. Enterprise search will all be handled within the same framework," explained Doug Cutting, Chief Architect of Cloudera. This means that functions like authentication will be unified within that framework.

For business-line users, the capability to reach in and pull out information from a data set without having to create a SQL query or a MapReduce job is a big shortcut.

Enterprise search isn't the be-all and end-all method for getting rich information from data sets, but it has enough power to make fast and broad searches of that data a much simpler matter.
