Blogcosm is a new company aiming to build a directory of the blogosphere. From the mundane to the esoteric, the company wants to provide users with a rich data set about any particular blog of interest or the vertical market it is in.
I met founder Scott Lawton, an old-time geek from Massachusetts, last night at the first annual Blog World Expo in Las Vegas. Blogcosm built a blog directory of all the speakers at Blog World Expo and the blogs they write for, as a case study. Lawton is a data quality algorithm expert who says his involvement in the web 2.0 scene predates Dave Winer’s creation of Radio Weblogs. He started writing scripting utilities professionally for the Mac in 1993. He is nerdy and charming, if you like nerdy innovative types.
The Blog World Expo in Vegas leaves no doubt that blogging is an emerging powerhouse of an industry. Led by professional trade-show organizer Rick Calvert, the event is now expected to have 2,000 attendees or more. Two hundred tickets were sold yesterday alone. WordPress founder Matt Mullenweg keynoted this morning, and TechCrunch’s Michael Arrington will speak tomorrow. I spoke twice yesterday and the energy here is high.
The Data
If you check out the Blogcosm page on the speakers at the expo you’ll see that the company so far is pulling in data from Technorati, Alexa and a handful of other sources. The self-funded project is muscling its way through indexing the blogosphere manually. It aims to go well beyond tech blogs and wants to pull data from a long list of available APIs – from Compete to Del.icio.us. The goal is to offer a useful entry about any blog you look up and information about categories of blogs that no one is capturing today.
On the far end of the complexity spectrum, Lawton says he is experimenting with an algorithm that estimates the monetization of any given blog. By looking at the ad technology employed, the probable CPM for a vertical and the estimated traffic of a blog, he says, he hopes to automatically provide a rough estimate of how much money any blog is making.
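To make the idea concrete, here is a back-of-the-envelope sketch of how those three inputs could combine. This is not Lawton’s algorithm, which he did not describe in detail; the field names, the fill-rate factor and the sample numbers are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlogProfile:
    """Hypothetical inputs for a rough ad-revenue estimate."""
    monthly_pageviews: int   # estimated traffic for the blog
    vertical_cpm: float      # assumed CPM in dollars for the blog's vertical
    ad_slots_per_page: int   # ad units found on a typical page
    fill_rate: float = 1.0   # assumed fraction of impressions actually sold


def estimate_monthly_revenue(blog: BlogProfile) -> float:
    """Return a rough monthly ad revenue estimate in dollars."""
    # Each pageview yields one impression per ad slot, scaled by fill rate.
    impressions = blog.monthly_pageviews * blog.ad_slots_per_page * blog.fill_rate
    # CPM is the price per thousand impressions.
    return impressions / 1000 * blog.vertical_cpm


# Illustrative numbers only, not data about any real blog.
example = BlogProfile(monthly_pageviews=200_000, vertical_cpm=5.0,
                      ad_slots_per_page=2, fill_rate=0.5)
print(f"${estimate_monthly_revenue(example):,.2f} per month")
```

A real estimate would need to account for sponsorships, affiliate links and different rates per ad network, which is presumably where the hard part of the work lies.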
Lawton told me that for now he can answer simple questions, but that those provide valuable information as well. There are no parenting blogs in the Technorati 100, he told me, for example. That’s interesting information. The ability to draw from a standardized taxonomy to discern who the leaders are in any blogging vertical is something that no one automates. As a data quality technical guy, Lawton says the software on the back end of his four-person team should enable information parsing that Technorati, Techmeme and other sites just can’t perform. That software, though, will ultimately be assisted by intelligent human intervention.
“Data quality issues in the Technorati 100 are appalling, even today,” Lawton told me. “I think the world can afford to have someone look at a list like that before they publish it. Before there is a Blogcosm 100 it will face human judgment. Is a site a blog? Is it on the list for reasons that are correct or because of errors in the algorithm? What is it about? We think there are business models around answering those questions through a combination of automation and human editorial review.”
The site is ugly and bare-bones today. The potential, though, is significant. As you can imagine, Lawton watches computer scientist Gabe Rivera’s blog tracking service Techmeme closely as well. “Techmeme is just a pale shadow of what it could be,” he told me. “For Gabe’s sake I hope it’s him that builds what Techmeme could be. If it’s not him, it could be us.”
As a person who makes his living engaging with sites like Techmeme and Technorati, I am excited to see Blogcosm build its business around offering a high quality, structured dataset concerning the blogosphere.