Interacting with Big Data is daunting enough that, for most people, a search engine query is about as far as they are willing to go. But for those willing to get their hands dirty, Microsoft is quietly working towards fully integrating public data sources into Excel, eventually baking that capability into a future version.
This month, Microsoft shipped a "preview version" of Data Explorer, a tool to integrate all sorts of data sources within Excel. Microsoft's vision is "self-service business intelligence," a fancy name for you and me accumulating data and performing our own analysis on it.
Over time, according to Herain Oberoi, a director in Microsoft's business intelligence division, the goal will be to fully integrate Data Explorer into Excel. A year ago, Data Explorer was a lab project. "When things go from a lab to a preview, it's a sign that it has legs," he said.
It's a sign, Oberoi added, that Microsoft intends to ship the product as a long-term offering, "and in this case it would be Excel."
So why is this important?
In some cases, the questions we have require data - a lot of data. "How likely is it that I will find a job in Austin, as opposed to San Francisco?" is a question that boils down to, at its most basic, a comparison of two numbers: the unemployment rate in each city. We've also been trained by search engines not to even hope for additional data that might make our answer even more valuable: if I'm a nurse, for example, I might like to know how many hospitals, hospices and clinics are in each town, the total number of beds, and even data for each city such as housing prices and the cost of living. You might even wonder where in each city a nurse, with a typical salary, could find the most house for the money.
Some of these answers are available. Cities, states and the federal government compile statistics on unemployment, for example, and a U.S. Department of Labor page presents wage and employment data for nurses. Real-estate sites compile their own databases, but can also tap into public records and data sources.
That's where Data Explorer comes in. Within Excel 2013, downloading the Data Explorer tool allows users to tap into relational, structured and semi-structured data from OData, Hadoop and Azure Marketplace, among other sources. These sources are terrific for corporate data analysis, but perhaps a bit out of reach for consumers.
But it also allows Excel to pull data directly from the Web, including public Web pages like Wikipedia - you can even pull data from Facebook. (Microsoft provides a simple, easy-to-follow tutorial on its Web site on how to add a Wikipedia page covering the Euro soccer championship, and extract data from it.) One federally maintained site that compiles all sorts of statistics is data.gov, which was specifically designed to give the public access to high-quality, machine-readable datasets. Excel 2013 can handle millions of rows of data, using the new xVelocity in-memory engine.
Even better, if the maintainer of the data source updates the data, then the spreadsheet can be updated with a single click. Excel 2013 also contains nifty features like Flash Fill, which automatically formats the data if it notices a pattern within the entries. Location data can be plotted against maps, supplied by Bing Maps, of course.
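For readers curious what pulling a table out of a Web page involves under the hood, here is a minimal Python sketch of the general idea. It is not how Data Explorer itself works, which the article does not describe. The sample HTML, the `TableExtractor` class and the `extract_table` helper are hypothetical names invented for illustration, and the table values are placeholders rather than real statistics.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page. The values are placeholders,
# not real statistics.
SAMPLE_HTML = """
<html><body>
<p>Some prose that is not part of the table.</p>
<table>
  <tr><th>Team</th><th>Goals</th></tr>
  <tr><td>Team A</td><td>12</td></tr>
  <tr><td>Team B</td><td>8</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table(html):
    """Return the table cells in `html` as a list of rows (lists of strings)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]
for record in records:
    print(dict(zip(header, record)))
```

Refreshing the data, in this sketch, would simply mean downloading the page again and re-running `extract_table` on the new HTML.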
At this point, Oberoi said, Microsoft feels pretty comfortable with identifying and facilitating the collection of data from public data sources, as well as "shaping" it, where text needs to be changed to numerical notations, columns need to be merged, and so on. It's with the third goal, taking that shaped data, visualizing it and sharing it out, that Microsoft needs to continue its work. When that's done, he said, Data Explorer should be fully integrated into Excel.
One of the issues that Microsoft is facing, however, is the continued improvement of natural-language search, which aims to simply answer those questions. A few years ago, Google said that it would integrate and compare public data, part of a response to the launch of Wolfram Alpha at the time. And Wolfram's not quite there yet. Asking it to compare the unemployment rates of Austin and San Francisco is within its grasp. Asking it a more nuanced question, such as the scenario above, is harder, because the answer relies on at least three factors: the availability of data, its ability to parse the query via natural language, and the ability to construct a meaningful solution. (Somewhat surprisingly, Bing presented a more comprehensive picture of the economies of both regions - not because of any inherent advantage in the search engine, but because the ongoing Silicon Valley-Austin employment spat justified the creation of a Web site comparing the two.)
Generally, the term "database" is enough to scare off the average Joe. What Data Explorer could be, in a polished, final form, is a tool to allow Excel users to begin constructing their own advanced queries when a search engine can't do the job.
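To make the "shaping" step concrete, the following Python sketch shows the two operations mentioned above: changing text to numbers and merging columns. It illustrates the concept only and is not Data Explorer's own mechanism. The column names (`city`, `state`, `value`), the helper names and the sample rows are all assumptions made up for the example, and the values are placeholders.

```python
def to_number(text):
    """Turn strings such as '1,200', '6.5%' or ' 42 ' into a float.

    Returns None when the text is not numeric (for example 'n/a').
    """
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def shape(rows):
    """Merge the city and state columns and convert the value column to numbers."""
    shaped = []
    for row in rows:
        shaped.append({
            "location": f"{row['city']}, {row['state']}",  # two columns merged into one
            "value": to_number(row["value"]),              # text changed to a number
        })
    return shaped


# Placeholder rows as they might arrive from an import, with everything as text.
raw_rows = [
    {"city": "City A", "state": "ST", "value": "6.5%"},
    {"city": "City B", "state": "ST", "value": "1,200"},
    {"city": "City C", "state": "ST", "value": "n/a"},
]

shaped = shape(raw_rows)
for row in shaped:
    print(row)
```

Returning `None` for non-numeric text, rather than raising an error, is one possible design choice: it keeps the row in the table so the gap stays visible to whoever analyzes the data next.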
Lead image courtesy of Shutterstock.