<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
        <channel>
        <title>Pete Warden - ReadWrite</title>
        <link>http://readwrite.com</link>
        <description />
        <language>en</language>
        <copyright>Copyright 2012 SAY Media, Inc.</copyright>
        <managingEditor>readwriteweb@gmail.com</managingEditor>
        <docs>http://blogs.law.harvard.edu/tech/rss</docs> 
        <lastBuildDate>Tue, 08 Mar 2011 05:30:00 -0800</lastBuildDate>
        <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://rww.superfeedr.com/" />

                    <item>
                <title><![CDATA[Salesforce-for-Marketing Startup Raises $32 Million]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/hubspotlogo.jpg" style="" />
			</span>
<a href="http://hubspot.com/">Hubspot</a> has just announced a $32 million Series D round of funding for its marketing-as-a-service platform. Investors include some very big names like Google, Salesforce and Sequoia, which shows how much interest there is in its service aimed at small businesses. It's also a big boost for the Boston startup scene, as Hubspot is now one of the fastest growing SaaS companies in history by revenue, second only to Salesforce according to co-founder Dharmesh Shah.</p>

<p>Along with main competitors <a href="http://eloqua.com/">Eloqua</a> and <a href="http://www.marketo.com/">Marketo</a>, Hubspot helps small businesses move away from traditional cold-calling and display advertising and into the new world of social media, search engines and blogging. On my recent visit to Boston, Shah explained to me that the initial idea came about when he noticed how much traffic he was able to drive through his <a href="http://onstartups.com/">OnStartups blog</a>, when many of the businesses he was helping were struggling to get a fraction of the exposure. </p>
<p>To solve that problem, he set out to build a service that makes it simple for small business owners to use search ads, blogging and Twitter. The challenge has been creating tools that actually help small companies with little time or experience of the new technologies. </p>

<p>That's meant creating simple, actionable reports, as well as educating its users in effective online marketing through initiatives like <a href="http://blog.hubspot.com/marketing-podcast/tabid/74768/Default.aspx">Hubspot TV</a>. With 4,000 paying customers, its approach seems to be paying dividends.</p>

<p>Perhaps unsurprisingly considering that six out of its eight executives went to MIT, the bulk of the investment will be going into research and development to support "an ambitious product strategy that calls for building a complete and fully-integrated marketing platform". It does seem like it is a strong contender to become the equivalent of Salesforce for the marketing world, which will be a boon for small firms struggling to adapt to the online world.</p>

<p><em>Disclosure: HubSpot is also a sponsor of a forthcoming series on data science, which will appear on ReadWriteWeb later this month.</em></p>
                    ]]></description>
                <link>http://readwrite.com/2011/03/08/salesforce-for-marketing-start</link>
                <guid>http://readwrite.com/2011/03/08/salesforce-for-marketing-start</guid>
                <category>News</category>
                <pubDate>Tue, 08 Mar 2011 05:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Helping Consumers with Data from Twenty Million Credit Cards]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/03/01/creditcard.jpg" style="" />
			</span>
<a href="http://www.bundle.com/">Bundle.com</a> is a personal finance website with a mission to "help US consumers make smarter decisions with their money". What really makes it stand out is the company's unique access to detailed, anonymized transaction histories from 20 million Citibank credit cards. </p>

<p>This allows them to build consumer tools in the same vein as Mint, but with a deep foundation of information to compare to right from the first user. Last week I sat down with CEO Jaidev Shergill, CTO Phil Kim and data scientist Alex Hasha to learn more about what they're doing with such a powerful data set.</p>
<p>The first question they wanted to address was the obvious one: how do they ensure privacy and security when dealing with such sensitive information? Everything is held in a secure data center, and no direct personally identifiable information is included in the histories; everything is anonymized. The team also takes further steps, like identifying and removing healthcare-related payments. I asked them, though, whether it still makes people a bit uncomfortable. Their response was that their whole business is based around helping consumers, and that their investor Citi only shares the data under very strict conditions because it believes Bundle's work will make customers' lives better. </p>

<p>CTO Phil Kim laid out their philosophy:</p>

<blockquote>Bundle takes great pains to protect the privacy of users. First, we hold ourselves to strict, bank-level information security standards, which means that sensitive data is held in a secure data center and access is heavily restricted, and that the Bundle application is heavily scrutinized for vulnerabilities on a regular basis, to prevent accidental or malicious leakage of user data. Second, much of the data analysis and synthesis work we do relies on data that has been sampled, modeled, flattened, or otherwise transformed -- we rarely work with raw transaction data, and we never work with data that has a direct linkage back to a named customer. Last but not least, Bundle is very much focused on building tools to help consumers -- remember that this data is not new... large companies use data just like this to market products and make business decisions -- Bundle is simply trying to share this data with consumers.</blockquote>

<p>Alex Hasha described how he'd worked in the finance industry as a quant, working in a team of over two hundred PhDs to analyze financial instruments. The attraction of Bundle.com for him was the chance to work on something that offered direct benefits to ordinary users, a refreshing change from the abstract world of high finance. </p>

<p>Users upload information from their own bank and credit card accounts onto the site, and in return they get back a score card showing how their spending compares to people like themselves. For example, you might discover that you're spending a lot more on groceries than other people in your neighborhood, and you'd be better off switching to a cheaper supermarket.</p>

<p>The key to all of their work is the value that they're able to extract from aggregate information, things like how much people in a particular zip code spend on particular categories such as eating out, groceries and transportation. Because this is the result of blending and averaging large numbers of different accounts, it helps reduce the risk that any sensitive information will leak out.</p>
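<p>As a rough illustration of that kind of aggregation, the sketch below averages spending per zip code and category. The records, field layout and function name are invented for the example; nothing here reflects Bundle's actual data or code.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, made-up records: (zip_code, category, monthly_spend_in_dollars).
transactions = [
    ("94117", "groceries", 420.0),
    ("94117", "groceries", 380.0),
    ("94117", "eating out", 150.0),
    ("94110", "groceries", 300.0),
]


def average_spend_by_zip(records):
    """Average spend for each (zip code, category) pair."""
    buckets = defaultdict(list)
    for zip_code, category, amount in records:
        buckets[(zip_code, category)].append(amount)
    # Only the averages leave this function, not the individual amounts.
    return {key: mean(amounts) for key, amounts in buckets.items()}


averages = average_spend_by_zip(transactions)
print(averages[("94117", "groceries")])  # 400.0
```

<p>A single user's grocery bill could then be compared against the average for the same zip code, which is the sort of score-card comparison described earlier.</p>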

<p>What's really impressive about the data they possess is its broad coverage. Almost every merchant in the U.S. will be represented, and it has the potential to offer the deep customer analytics that website publishers are used to. I could imagine it being used by restaurant owners to spot when they're losing previously loyal customers, for example. Bundle.com won't speculate on where they will take their product in the future, but did want to emphasize how everything was driven by their mission to help consumers.</p>

<p>I spent a bit of time talking with them about the technical challenges of their work, too. Credit card systems are often 30 or 40 years old, and so the data they get back is often very messy. You know how you look at your statements and try to decipher what "MCDON 94117" could be? That's one of their biggest obstacles: the names of the merchants are often incomplete and unclear, so they have a whole system devoted to making sense of this unstructured data. "MCDON", "MCD" and "MCDONALDS" are all likely to refer to the restaurant, which allows them to categorize any such transactions as food purchases. </p>
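<p>A minimal sketch of that idea, assuming a simple hand-built alias table, might look like the following. The table, the category labels and the <code>categorize</code> function are hypothetical; Bundle's real system is not public and is certainly more sophisticated (and, as noted below, largely written in Perl rather than Python).</p>

```python
import re

# Hypothetical alias table: the entries and categories are illustrative only.
MERCHANT_ALIASES = {
    "MCDONALDS": ("McDonald's", "food"),
    "MCDON": ("McDonald's", "food"),
    "MCD": ("McDonald's", "food"),
}


def categorize(raw_descriptor):
    """Return (merchant, category) for a raw statement descriptor, or None."""
    # Keep only the leading letters; drop trailing store numbers or zip codes.
    match = re.match(r"[A-Za-z]+", raw_descriptor.strip())
    if match is None:
        return None
    return MERCHANT_ALIASES.get(match.group(0).upper())


print(categorize("MCDON 94117"))
print(categorize("Unknown Shop 123"))
```

<p>Descriptors that are not in the table come back as <code>None</code>, so they can be set aside for manual review instead of being guessed at.</p>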

<p>A large amount of their code is written in Perl, since they're big fans of CPAN's rich repository of libraries, and runs in-memory, so it's not a classic big data problem. They also rely on R for some of their analysis, thanks to its rich toolkit of statistical functions.</p>

<p>The data mining of billions of credit card transactions is bound to raise a lot of questions, but it was clear to me that Bundle.com is serious in its mission to help consumers. Its product certainly seems to offer a lot more value to the wider world than anything that Wall Street's quants have produced.</p>
                    ]]></description>
                <link>http://readwrite.com/2011/03/02/bundlecom-mines-data-from-twen</link>
                <guid>http://readwrite.com/2011/03/02/bundlecom-mines-data-from-twen</guid>
                <category>Big data</category>
                <pubDate>Wed, 02 Mar 2011 04:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Crawl Bank Accounts with the Ghost of Wesabe]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/25/safehandle.jpg" style="" />
			</span>
The personal finance startup <a href="http://wesabe.com/">Wesabe</a> may be dead, but its code lives on. Former team member Brian Donovan <a href="https://www-stage3.wesabe.com/groups/227-open-source-wesabe/discussions/5471-automatic-uploader#comment_39993">recently</a> open sourced <a href="https://github.com/wesabe/ssu">the framework used to connect with bank websites</a> and download statements in a machine-readable form. This might not sound impressive, but with thousands of banks just in the U.S., all with different website setups, entire companies like <a href="http://yodlee.com/">Yodlee</a> have been built around solving this problem.</p>

<p>By open sourcing the code, Wesabe makes it possible for hobbyists, researchers and starving startup founders to build new and innovative personal finance tools. The code itself is pretty bare bones; Brian admits he'd hoped to spruce it up before release but his new job didn't leave much time for a labor of love. What's crucial though is that it's a battle-tested system with broad coverage, and has a simple system for adding support for new institutions.</p>
<p>This makes it a strong potential competitor to Yodlee, if it can gather enough support from a community of developers to stay on top of the constantly changing bank websites. The forum posts that first asked for the code to be open sourced, and that now ask for tips on running it, show that there are enthusiasts interested in keeping it alive. This is a hopeful sign for innovation, but Yodlee may not be so happy. The loss of revenue from Mint after it was acquired by Intuit must have been painful for the company, and the emergence of an open-source alternative will be another headache.</p>

<p>So, what are the possibilities for end users? The simplest thing you can do with the code is set it up on your own machine and pull down all your own financial information automatically. Personal data lovers can create custom instrumentation for their own spending, saving and income patterns, building dashboards showing the measurements they care about. Getting a bit fancier, you could run something in the background on a private server. Want to send yourself an SMS when you approach your overdraft limit, or when there's an unusually large transaction? Having this "Automatic Uploader" code makes it easy to build your own system to handle those requirements.</p>

<p>Hopefully this will also inspire a new generation of startups to build personal finance tools. As founder Marc Hedlund says in <a href="http://blog.precipice.org/why-wesabe-lost-to-mint">his insight-packed post-mortem on Wesabe</a>, in the financial world "the help consumers have is absolutely abysmal", so there are worlds of opportunity to create better solutions.</p>

<p><em>Photo by <a href="http://www.flickr.com/photos/eklektikos/278691547/">Todd Ehlers</a></em></p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/25/crawl-your-bank-account-with-w</link>
                <guid>http://readwrite.com/2011/02/25/crawl-your-bank-account-with-w</guid>
                <category>APIs</category>
                <pubDate>Fri, 25 Feb 2011 07:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[What Data-Mining Apple, Google and Microsoft's PR Reveals]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/24/applewordlesmall.jpg" style="" />
			</span>
What topics are the big three software giants focused on? Their press releases show what areas of their business they want the media to cover, so I thought analyzing them in bulk might reveal some of their priorities.</p>

<p>I started off by downloading every press release that Apple, Google and Microsoft have released in 2011, and then <a href="http://wordlin.gs/">built word frequency clouds</a> based on the text. My data-mining didn't uncover any secret messages hidden in the releases, but the visualizations do give a flavor of what's on their minds.</p>
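<p>The counting step behind a word cloud can be sketched in a few lines. This is only an illustration with an invented sample sentence and a deliberately tiny stop-word list; it is not the code behind the wordlin.gs visualizations.</p>

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample text, standing in for a batch of press releases.
sample = "The Mac App Store is open. The App Store brings apps to the Mac."
print(word_frequencies(sample, top_n=3))
```

<p>A cloud renderer would then scale each word's font size by its count.</p>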
<p><a href="http://wordlin.gs/view/5c3407b9a9cd9f64"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/24/applewordle.jpg" style="" />
			</span>
</a><br />
My old employer Apple has <a href="http://www.apple.com/pr/library/">a characteristically minimalist set of press releases</a>, with just a handful since the start of the year. With the recent big news of the release of a Verizon iPhone, it's no surprise to see mobile terms high on the chart, but the Mac brand is still at the center of Apple's public story. </p>

<p>What I wasn't expecting was the emphasis on the new desktop App Store, with multiple stories pushing the service, so it's obviously a big priority for the company. It's also interesting to see "customers" show up prominently, reflecting the company's consumer focus. <br />
<br><br />
<a href="http://wordlin.gs/view/6ab22af327584916"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/24/googlewordle.jpg" style="" />
			</span>
</a><br />
Google doesn't do press releases, so I analyzed <a href="http://googleblog.blogspot.com/">its official blog</a> instead. There's a lot more material than Apple's, over 360,000 words just since Jan. 1! Search is the most popular word, but it's clear that YouTube is a bigger part of its public face than I'd expected, with video making a strong appearance, too. Mobile and Android aren't as prominent as I'd have guessed, with Chrome only making a middling showing as well. I wonder if it says something about the company culture that "data" shows up more often than "information"? <br />
<br><br />
<a href="http://wordlin.gs/view/649925f051862e27"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/24/microsoftwordle.jpg" style="" />
			</span>
</a><br />
What's clear from Microsoft's cloud is that it's now undeniably an enterprise company, as its <a href="http://www.microsoft.com/en-us/dynamics/default.aspx">Dynamics CRM product</a> is mentioned more often than Windows, and "business" beats out "customers." </p>

<p>I wasn't expecting to see how much of an emphasis it's putting on the healthcare industry, too, with multiple announcements of deals and technologies for both insurers and providers. "Information" beats out "data" for Microsoft, and "technology" and "experience" are prominent, which seems to fit with the company's culture. There's very little coverage of Web technology, with Bing and Azure only receiving a handful of mentions each. Mobile does a bit better, but it's clear management's mind is focused on the money-making opportunities of selling to large organizations, not fighting it out in the consumer trenches.</p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/24/what-data-mining-apple-google-and-microsofts-pr-reveals</link>
                <guid>http://readwrite.com/2011/02/24/what-data-mining-apple-google-and-microsofts-pr-reveals</guid>
                <category>Analysis</category>
                <pubDate>Thu, 24 Feb 2011 06:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[How to Find Your Most Important Fans]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/11/vipsmall.jpg" style="" />
			</span>
Word of mouth is an incredibly powerful marketing tool, but how do you work out which customers are most important in spreading your message? Services like <a href="http://www.peerindex.net/">PeerIndex</a> or <a href="http://klout.com/">Klout</a> help you find experts and influencers in particular communities, but can't measure what people have actually done for your business. The new <a href="http://vipli.st">Vipli.st</a> service from <a href="http://awe.sm/">Awe.sm</a> aims to fill this gap by uncovering the fans who drive the most sharing.</p>

<p>Launched at the <a href="http://strataconf.com/strata2011/public/cfp/148">Strata Startup Showcase</a> last week, the site visualizes how <a href="http://plancast.com">Plancast</a> events are shared across social networks like Twitter and Facebook. It draws a tree with the person who created a plan at the top, linked below to everyone who added themselves as attendees after clicking on that person's shared link, and so on downwards through the entire history of the conversation around the event. Here's what it looks like for <a href="http://www.vipli.st/?url=http%3A%2F%2Fplancast.com%2Fp%2F3r1e">a SXSW Lean Startup plan</a>:</p>
<p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/11/vipshot.jpg" style="" />
			</span>
<br />
The number next to each name shows how many attendees each person helped to sign up. <a href="http://twitter.com/#!/ericries">Eric Ries</a> is responsible for bringing in 10 attendees, which is no surprise since he's the best-known evangelist for the Lean Startup movement. How about <a href="http://twitter.com/#!/DMelissaG">Melissa Grody</a> of <a href="http://500startups.com/">500Startups</a> though? Despite having a pretty low-key Twitter account with just over a hundred followers, she's indirectly responsible for four signups, thanks to <a href="http://twitter.com/#!/vlaskovits">Patrick Vlaskovits</a> picking up her tweet. Broad influence measuring services would never flag her role, but Vipli.st makes it possible to spot and recognize fans like her who are key to spreading the word.</p>

<p>The service was created by the Awe.sm team, and uses <a href="https://github.com/awesm/awesm-dev-tools/wiki/">the same API that's available to third-party developers</a> to gather the data it needs. To create the family tree of which attendees were driven by which fans, Plancast uses Awe.sm to create a new URL for each attendee that signs up, including a unique parameter that marks which user is sending out the plan. That parameter is also stored in Plancast's database, so when another user clicks on that special URL, it's possible to tell which person sent them to the site.</p>
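<p>To make the attribution idea concrete, here is a small sketch that takes a log of who referred whom and counts every signup downstream of each person. The log format, names and function are assumptions made for illustration; this is not Awe.sm's API or Plancast's schema.</p>

```python
from collections import defaultdict

# Hypothetical signup log: (attendee, referrer). The names are placeholders,
# and a referrer of None marks the person who created the plan.
signups = [
    ("alice", None),
    ("bob", "alice"),
    ("carol", "bob"),
    ("dave", "bob"),
]


def attributed_signups(log):
    """Count every signup downstream of each person in the sharing tree."""
    children = defaultdict(list)
    for attendee, referrer in log:
        if referrer is not None:
            children[referrer].append(attendee)

    def count(person):
        # Direct signups plus everyone those signups brought in.
        return sum(1 + count(child) for child in children.get(person, []))

    return {attendee: count(attendee) for attendee, _ in log}


print(attributed_signups(signups))
```

<p>In this made-up log, "alice" is credited with three signups even though only one person clicked her link directly, which is the kind of indirect influence the tree is meant to surface.</p>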

<p>Awe.sm's co-founder Jonathan Strauss thinks that this sort of performance-based measurement is going to be a crucial tool for anyone marketing using social tools: </p>

<blockquote>If all you want to do is reach people, direct marketing through email is a great channel. What's different about social tools like Twitter and Facebook are the retweet and like buttons, since users are far more likely to click them than they are to forward an email. The real value of social networks is in the sharing.</blockquote>

<p>Plancast's <a href="http://ursusrex.com/">Mark Hendrickson</a> explained why this was so important to their business. </p>

<blockquote>The whole idea behind our site is to help people hear about events through their friends. Vipli.st is the first time we've been able to visualize how that's happening in any kind of detail.</blockquote>

<p>To explain how this could be useful to other businesses, Strauss pointed to one of Awe.sm's customers, the music store creator <a href="http://www.topspinmedia.com/">TopSpin</a> (which markets online for artists like Eminem, Brian Eno and the Beastie Boys). Bands would love to uncover their most important fans, the ones who do the most to spread the word about their albums and concerts. Right now they can spot the big individual spenders, but not the penniless student who can't afford the deluxe $250 box set, but who persuades all her friends to buy the new album. She's the one they should really be inviting to their velvet-rope launch events, since she's doing far more to make them a success.</p>

<p>Strauss thinks broader measures of influence are still useful for brand-building, but that laser-focused performance metrics will become increasingly important to social marketers. "To understand how your social campaign is working, you need to understand how your message is being passed on down the chain." He and his team built the Vipli.st service to prove how easy it was to gather the data with Awe.sm and turn it into an actionable story. It's based completely on Awe.sm's public API, and Strauss is keen to work with any external developers who would like to do something similar with their own site.</p>

<p>I'm fascinated by the stories that this sort of analysis of public conversations will be able to tell us. By uncovering the hidden influencers within communities, hopefully we'll be able to reward some of the unfairly neglected true fans too.</p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/14/discover-your-businesss-vips-w</link>
                <guid>http://readwrite.com/2011/02/14/discover-your-businesss-vips-w</guid>
                <category>APIs</category>
                <pubDate>Mon, 14 Feb 2011 05:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[The Robots are Watching Us]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/10/robotface.jpg" style="" />
			</span>
Have you ever felt like your household appliances are watching your every move and conspiring amongst each other? No? Oh well, I guess that's just me. It's exactly what European researchers are hoping to enable though, by building a data sharing service called <a href="http://www.roboearth.org/">RoboEarth</a> that automated devices can use to share information between themselves.</p>

<p>To understand why this is useful, imagine a robot arriving at a location that it's never visited before. If another machine had explored there earlier, the map it had built up would be available on this "robot Internet." The same system could be used to pass around all sorts of information, from traffic patterns to help robots plan better routes, to the training information about how to best complete tasks.</p>
<p>The practical applications of this are definitely very exciting, but to a mischievous mind like mine, so are some of the unintended possibilities. If cleaning robots share maps of the locations they work in, wouldn't criminals be interested in banks' floor plans? How about the routines of driverless armored cars? Training a machine to perform around our own homes will involve revealing a lot of our private patterns of behavior. Are we always out of the house on Sunday mornings?</p>

<p>Even data that is aggregated together can be very revealing. One researcher told me in confidence of a pattern she had noticed in crime data released by a major city, showing that in one area there were never any arrests for drug crimes on a Thursday. She's kept that under her hat, but that would be very useful intelligence for drug dealers. The more of this sort of information is made semi-publicly available, the more likely it is to have unintended consequences like these.</p>

<p>I don't want to be alarmist about this; it sounds like a great project, and to keep things in perspective, the volume of information being gathered on us <a href="http://www.readwriteweb.com/archives/meet_the_firehose_seven_thousand_times_bigger_than.php">just from our cell phones</a> dwarfs the planned robotic data-sharing. As the Internet of Things gathers steam, though, we are going to have a whole new world of security and privacy challenges to think about. So, keep a careful eye on your Roomba...</p>

<p><i>Photo by <a href="http://www.flickr.com/photos/procsilas/71477071/">Procsilas Moscas</a></i></p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/10/the-robots-are-watching-us</link>
                <guid>http://readwrite.com/2011/02/10/the-robots-are-watching-us</guid>
                <category>Tools</category>
                <pubDate>Thu, 10 Feb 2011 10:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Be a Neighborhood Hero (and Earn Some Cash) by Sharing Your Driveway]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/08/parkcirca.jpg" style="" />
			</span>
Have you ever been stuck circling the block waiting for a parking space to open up? The new <a href="http://www.parkcirca.com/">ParkCirca</a> space-sharing service might make that a thing of the past. Co-founder and CEO Chadwick Meyer told me how he was fruitlessly hunting for a space when he noticed how many private driveways had no cars in them. Why not let the driveway owners make some money from them, and save stress (and gas) for the drivers at the same time?</p>

<p>That's exactly what ParkCirca sets out to do. Driveway owners register when their space will be free and how much they want to charge. Drivers can then use an iPhone application to find available spots near their destination, and book them for the time they need. A typical charge might be $2 an hour, in which case an owner with a space available for just eight hours every weekday could make up to $320 a month, without losing a place to park in the evenings or on weekends.</p>
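<p>The $320 figure can be checked with a quick calculation, assuming a five-day working week and a four-week month (the second assumption is not stated in the example, but it is the one that reproduces the number):</p>

```python
HOURLY_RATE = 2          # dollars per hour, from the example above
HOURS_PER_WEEKDAY = 8    # hours the space is free each weekday
WEEKDAYS_PER_WEEK = 5
WEEKS_PER_MONTH = 4      # assumption: a four-week month

monthly_income = (
    HOURLY_RATE * HOURS_PER_WEEKDAY * WEEKDAYS_PER_WEEK * WEEKS_PER_MONTH
)
print(monthly_income)  # 320
```

<p>That is an upper bound, since it assumes the space is booked for every available hour.</p>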
<p>As a self-funded startup, Meyer and his team have just launched the service in San Francisco by walking around neighborhoods like the Haight, Cole Valley and Inner Sunset, handing out flyers and talking to people. He says the reaction has been very positive. "There's traditionally been a lot of informal sharing between immediate neighbors. This gives people a tool to organize that, and extend the circle of trust a bit further too." He also finds it remarkable how much has changed in the last decade of social technology, since his service relies on "communication between strangers," requiring coordination that would have been almost impossible until recently.</p>

<p>It's still early days, but with several hundred users after just a week, there does seem to be interest. It also seems a natural complement to a car-sharing service like ZipCar, making your choice of parking spot at your destination as flexible as your choice of vehicle.</p>

<p>Meyer pointed out that there are a lot of secondary benefits to the service too. It gives people a chance to help out other locals, to "be a neighborhood hero," reduces the gas wasted circling the block, and removes the hassle of having to move your vehicle every two hours because of street parking regulations. There's an active trade in garage and private space rentals here in San Francisco, but ParkCirca gives you the chance to park in multiple spots, rather than being anchored to a particular location.</p>

<p>There are no guarantees that their model will work, but as someone who has recently moved to the city I really hope it does take off. Meyer is currently looking into raising angel funding to support his mission of "making urban life better for everyone," and I wish him luck.</p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/09/be-a-neighborhood-hero-with-pa</link>
                <guid>http://readwrite.com/2011/02/09/be-a-neighborhood-hero-with-pa</guid>
                <category>Services</category>
                <pubDate>Wed, 09 Feb 2011 06:45:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Twitter Sets a Price For Tweets]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/06/hundreddollar.jpg" style="" />
			</span>
Last week at <a href="http://strataconf.com/strata2011">Strata</a>, <a href="http://gnip.com/">Gnip</a> released a new set of features for its social-stream processing platform. Called <a href="http://blog.gnip.com/twitter-firehose-filtering-with-power-track/">Power Track</a>, the new layer allows customers to set up complex search queries and receive a stream of all the Twitter messages that match the criteria. Unlike existing ways of filtering the firehose, there are no limits on how many keywords you can track or how many results you can receive. However, the part of the offering with the most long-term significance is the pricing. </p>

<p>On top of the standard $2,000 a month to rent a Gnip collector, founder <a href="http://twitter.com/jvaleski">Jud Valeski</a> told me it will cost 10 cents for every thousand Twitter messages delivered. Though the split of the revenue between the two companies wasn't disclosed, he told me Twitter intends to standardize this price for any similar offerings in the future from other sellers of their data. This sounds like a big step in Twitter's journey to find a sustainable business model.</p>
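<p>To get a feel for what those two prices add up to, here is a back-of-the-envelope calculation. It assumes the bill is simply the collector fee plus a linear per-tweet charge, with no volume discounts or rounding to whole thousands, and the tweet volume is an invented example; the real billing terms may differ.</p>

```python
# Prices quoted above, held in cents to avoid floating-point rounding.
COLLECTOR_FEE_CENTS = 200_000      # $2,000 a month for a Gnip collector
CENTS_PER_THOUSAND_TWEETS = 10     # 10 cents per 1,000 tweets delivered


def monthly_bill_dollars(tweets_delivered):
    """Estimated monthly bill in dollars under a simple linear pricing model."""
    usage_cents = tweets_delivered * CENTS_PER_THOUSAND_TWEETS / 1000
    return (COLLECTOR_FEE_CENTS + usage_cents) / 100


# Hypothetical volume: 10 million tweets delivered in a month.
print(monthly_bill_dollars(10_000_000))  # 3000.0
```

<p>At that rate every additional million tweets adds $100 to the bill.</p>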
<p>Valeski told me that there were already 24 customers using the private beta version, some with monthly bills "in six figures," so this is obviously an interesting revenue source. With tens of millions of tweets being delivered every day, there are obviously some happy users too. I talked to Greg Greenstreet of <a href="http://collectiveintellect.com/">CollectiveIntellect</a> about why he was using the service and he told me:</p>

<blockquote>
The reason Power Track is so essential for us is that for clients that want *every* Tweet for a keyword, it supplies a comprehensive solution for us, rather than trying to work around the traditional Twitter search APIs that have restrictions on volume and content. We use Gnip for many other forms of data collection that power our semantic analytics engine, and they have been a solid provider for us for many years
</blockquote>

<p>Though it seems unlikely that marketing-data revenue will be enough on its own to sustain the business, it's significant that Twitter has been able to set a value for every message on the service. At the very least, it gives the company an income that increases as usage grows, providing a solid foundation as it tunes broader-based revenue models like advertising or sponsored trends.</p>

<p><i>Photo by <a href="http://www.flickr.com/photos/gi/388322867/">Gisella Giardino</a></i></p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/08/twitter-sets-a-price-for-tweet</link>
                <guid>http://readwrite.com/2011/02/08/twitter-sets-a-price-for-tweet</guid>
                <category>APIs</category>
                <pubDate>Tue, 08 Feb 2011 04:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Using Public Data to Fight a War]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/05/cazoodle2.jpg" style="" />
			</span>
How does a technology built for apartment-hunting end up being evaluated by the U.S. Army for use in Afghanistan? <a href="http://cazoodle.com">Cazoodle</a> is using public data sources like <a href="http://flickr.com/">Flickr</a> and <a href="http://openstreetmap.org/">OpenStreetMap</a> to build detailed guidebooks for American soldiers. Last week at <a href="http://strataconf.com/strata2011">Strata</a> I sat down with company CTO Govind Kabra to find out how they do it.</p>

<p>Cazoodle's project for the Army is to build a detailed database of information about places in Afghanistan, using only public sources on the Web. The goal is to describe the towns and cities in detail, covering everything from names, locations and populations to lists and coordinates for schools, mosques, banks and hotels. <br />
</p>
<p>The military already collects this sort of information, but using traditional offline sources through groups like the <a href="https://www1.nga.mil/Pages/Default.aspx">National Geospatial-Intelligence Agency</a>. It's a slow and dangerous process to send personnel door to door for research within war-torn countries, and though the agency's budget is classified, presumably very expensive. The hope is that by using online, crowdsourced data from sites like Wikipedia and Flickr, it will be possible to gather rich information without putting lives at risk, all at a fraction of the cost.</p>

<h2>Origins</h2>
Cazoodle was started four years ago at the University of Illinois at Urbana-Champaign. Kabra and his co-founders were graduate students, so naturally the top of their priority list was finding a cheap apartment. 

<p>As they trawled through Craigslist, following links to other sites, consulting maps and looking up details, they realized that what they really needed was an automated way of pulling the information they cared about from all these disparate sources, and putting it into a single spreadsheet they could use to make their decisions easier. They formed the company to build this system, and created <a href="http://www.cazoodle.com/apartment-search.php">an apartment search engine based on the technology</a>.</p>

<p>The founders knew there were lots of other problems that would also benefit from the same underlying technology, so they branched out into <a href="http://www.cazoodle.com/shopping-search.php">shopping</a> and <a href="http://vacation.cazoodle.com/">vacations</a>, and also started building custom search engines for enterprise customers. </p>

<p>That was when they spotted a <a href="https://www.fbo.gov/index?s=opportunity&mode=form&id=42545f1cf87af61b648c1c85a8c56303&tab=core&_cview=0">Small Business Innovation Research grant opportunity</a> from the U.S. Department of Defense. The task was to curate public information on the Web related to Afghanistan into a single database that Army personnel could use to guide their operations. Their technology already took a soup of unstructured Web pages related to locations and converted it into a spreadsheet of data, cleanly split into labeled columns, so it seemed like a natural fit for this problem.</p>

<h2>Technology</h2>
To understand how it works, imagine trying to create a list of mosques in a small town in Afghanistan. There's no handy Yellow Pages you can refer to, and the maps don't have that much detail. However, if you go to Wikipedia you can pull out basic information about a town like <a href="http://en.wikipedia.org/wiki/Pul-i-Alam">Pul-i-Alam</a>, and then look through <a href="http://downloads.cloudmade.com/asia/afghanistan">the OpenStreetMap data for Afghanistan</a> to spot locations that are tagged as religious buildings, e.g.:

<blockquote><code>
&lt;node id="282153330" lat="34.5154772" lon="69.1804459"&gt;<br/>
&lt;tag k="amenity" v="place_of_worship"/&gt;<br/>
&lt;tag k="name" v="Puli Khishti Mosque"/&gt;<br/>
&lt;tag k="religion" v="muslim"/&gt;<br/>
&lt;/node&gt;<br/>
</code></blockquote>
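<p>As a rough illustration of how little code it takes to pull those entries out, here's a minimal Python sketch using only the standard library. It is my own example, not Cazoodle's code; the function name is made up, and the sample simply wraps the node above in a root element.</p>

```python
import xml.etree.ElementTree as ET

# The node from the example above, wrapped in an <osm> root element
OSM_SAMPLE = """<osm>
  <node id="282153330" lat="34.5154772" lon="69.1804459">
    <tag k="amenity" v="place_of_worship"/>
    <tag k="name" v="Puli Khishti Mosque"/>
    <tag k="religion" v="muslim"/>
  </node>
</osm>"""


def places_of_worship(osm_xml):
    """Return (name, lat, lon) for every node tagged as a place of worship."""
    results = []
    for node in ET.fromstring(osm_xml).iter("node"):
        # Collapse the node's <tag k=... v=...> children into a dictionary
        tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
        if tags.get("amenity") == "place_of_worship":
            results.append(
                (tags.get("name"), float(node.get("lat")), float(node.get("lon")))
            )
    return results


print(places_of_worship(OSM_SAMPLE))
```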

<p><a href="http://www.flickr.com/photos/tags/mosque/map?&fLat=35.4338&fLon=67.489&zl=10"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/05/cazoodle1.jpg" style="" />
			</span>
</a>That reveals the explicit information that people have entered, but what's particularly impressive about Cazoodle's work is that it also merges in implicit information from sources like Flickr. For example, running <a href="http://www.flickr.com/photos/tags/mosque/map?&fLat=35.4338&fLon=67.489&zl=10">a search on the photo service</a> shows hundreds of photos taken within Afghanistan mentioning "mosque" in their descriptions. The coordinates can be pulled out of the geotagged photos, and used as an input to the list of mosques for the town they were taken in. </p>

<p>Without realizing it, photographers are helping to build up a crowd-sourced map of everywhere they shoot. This isn't completely unprecedented; during World War Two the BBC appealed for holiday photos of the beaches of Normandy for an exhibition. In fact, the 9 million snaps received were used to research landing sites for the coming invasion.</p>

<h2>Results</h2>
The end result of the gathering process is a console that Army personnel can use to pull up information on towns across the country, giving a detailed breakdown of all the data that's been gathered on the location:

<p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/02/04/cazoodlesmall.jpg" style="" />
			</span>
</p>

<p>What I find most fascinating about this project is that it's the first practical application for 'linked data'. In contrast to <a href="http://linkeddata.org/">the more academic approach</a>, Cazoodle has attacked the problem in a much messier but more pragmatic way. A good example of this is how the system uses "fuzzy matching" to link data from different sources together. </p>

<p>That means if OpenStreetMap shows a mosque in a particular location, and a Flickr photo with coordinates a hundred yards away mentions a mosque in the description, then it's reasonable to assume they represent the same place, even though there's a small probability that's incorrect.</p>
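<p>Here's one way that kind of fuzzy matching could be sketched in Python. This is my own illustration rather than Cazoodle's algorithm: it uses the standard haversine formula for distance, a 100-meter threshold I picked to approximate "a hundred yards," and made-up photo records.</p>

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters, using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def match_photos(place, photos, max_distance_m=100, keyword="mosque"):
    """Return the photos that mention the keyword and sit close to the place."""
    return [
        photo for photo in photos
        if keyword in photo["description"].lower()
        and max_distance_m >= distance_m(place["lat"], place["lon"], photo["lat"], photo["lon"])
    ]


# Coordinates of the OpenStreetMap node shown earlier; the photos are made up
mosque = {"name": "Puli Khishti Mosque", "lat": 34.5154772, "lon": 69.1804459}
photos = [
    {"description": "Mosque at sunset", "lat": 34.5159772, "lon": 69.1804459},
    {"description": "A mosque far to the north", "lat": 35.4338, "lon": 67.489},
]
print(match_photos(mosque, photos))
```

<p>A fixed radius is the crudest possible rule. A real system would presumably weigh other evidence too, such as how many photos cluster around the same spot.</p>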

<p>This means data may not be quite as vetted as a more traditionally sourced gazetteer, but it has much broader coverage and is far more dynamic. In many ways, it's like the tradeoff between Yahoo's original hand-edited directory and Google's chaotic but all-encompassing search index. By lowering the barriers to data entry, and in many cases using public information people don't even realize they're revealing, Cazoodle is able to create an effective guide.</p>

<p>The project is still being evaluated by the Army right now, and hasn't been used in the field, but it's not hard to imagine this approach becoming far more common as public data sources grow and multiply. It also illustrates the conflicts we'll face more and more frequently as this public data is used for completely unintended purposes. How will local photographers and OpenStreetMap editors feel if their work is reused by the U.S. Army? <a href="http://christopheralbon.com/">Christopher Albon</a>, a researcher into <a href="http://conflicthealth.com/">public health in warzones</a>, also has some cautionary words on the limits of what can be done remotely:</p>

<blockquote>
While an impressive start, Cazoodle's approach is missing the data that really matters. A map of a physical space only takes you so far. An Afghan village is no more a collection of mosques and houses than Silicon Valley is a collection of coffee shops and office space. What matters is a location's social, political, and economic structures; its human terrain. Who is related to whom? Who owns the fertile fields by the river or the rocky fields on the slopes? Who is healthy and who is sick? Cazoodle can not provide this type of information, leaving American soldiers to gather it the old fashioned way: talking to people door to door, face to face.
</blockquote>

<p>Photo by <a href="http://www.flickr.com/photos/x-ray_delta_one/4769639013/">James Vaughn</a></p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/07/fighting-a-war-with-a-search-e</link>
                <guid>http://readwrite.com/2011/02/07/fighting-a-war-with-a-search-e</guid>
                <category>Big data</category>
                <pubDate>Mon, 07 Feb 2011 08:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[A Free Visual Programming Language for Big Data]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/01/30/greenplumicon.jpg" style="" />
			</span>
 Until the last few years, large scale data processing was something only big companies could afford to do. As Hadoop has emerged, it has put the power of Google's MapReduce approach into the hands of mere mortals. The biggest challenge is that it still requires a fair amount of technical knowledge to set up and use. Initiatives like Hive and Pig aim at making Hadoop more accessible to traditional database users, but they're still pretty daunting.</p>

<p>That's what makes <a href="http://community.greenplum.com">today's release</a> of a new free edition of <a href="http://www.emc.com/campaign/global/greenplumdca/index.htm">EMC's Greenplum big data processing system</a> so interesting. It draws on ideas from the MapReduce revolution, but its ancestry is definitely in the traditional enterprise database world. This means it's designed to be used by analysts and statisticians familiar with high-level approaches to data processing, rather than requiring in-depth programming knowledge. So what does that mean in practice?<br />
</p>
<p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/01/30/alpineminer.png" style="" />
			</span>
</p>

<p>Visual programming can be a very effective way of working with data flow pipelines, as <a href="http://developer.apple.com/graphicsimaging/quartz/quartzcomposer.html">Apple's Quartz Composer</a> demonstrates in the imaging world. EMC has an environment called <a href="http://www.alpine-solution.com/index.html">Alpine Miner</a> that lets you build up your processing as a graph of operations connected by data pipes. This offers statisticians a playground to rapidly experiment and prototype new approaches. Thanks to the underlying database technology they can then run the results on massive data sets. This approach will never replace scripting for hardcore programmers, but the discoverability and intuitive layout of the processing pipeline will make it popular amongst a wider audience.</p>

<p>Complementing Alpine Miner is <a href="http://madlib.net/">the MADlib open-source framework</a>. Describing itself as emerging from "discussions between database engine developers, data scientists, IT architects and academics who were interested in new approaches to scalable, sophisticated in-database analytics," it's essentially a library of SQL code to perform common statistical and machine-learning tasks. </p>
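<p>To give a flavor of what "in-database analytics" means, here's a small sketch that fits a straight line using nothing but SQL aggregates. It is not MADlib's code or API; it runs against SQLite through Python's built-in sqlite3 module, and the table, columns and data are invented for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
# Made-up data lying exactly on the line y = 2x + 1
conn.executemany("INSERT INTO points VALUES (?, ?)", [(1, 3), (2, 5), (3, 7), (4, 9)])

# Ordinary least squares for y = slope * x + intercept, computed by the
# database itself so that the raw rows never leave it
slope, mean_x, mean_y = conn.execute("""
    SELECT
        (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) /
        (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)),
        AVG(x),
        AVG(y)
    FROM points
""").fetchone()
intercept = mean_y - slope * mean_x

print(slope, intercept)
```

<p>The appeal of this style is that only a handful of summary numbers come back to the client, however many rows the table holds.</p>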

<p>The beauty of combining this with Alpine Miner is that it turns techniques like <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayes classification</a>, <a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means clustering</a> and multilinear regression into tools you can drag and drop to build your processing pipeline. </p>

<p>Traditionally it's been a development-intensive job to implement those algorithms on large data sets, but now they're within the reach of analysts without requiring engineering resources. Even better, because it's open-source, users of other database systems are able to take advantage of <a href="https://github.com/madlib/madlib-contrib">the code</a>, though then they won't benefit from Greenplum's underlying processing engine.</p>

<p>This release from EMC is only free for non-production use, and the majority of the product is not open-source, so it's definitely not an immediate threat to Hadoop adoption. It is a sign that the traditional enterprise world is starting to pay attention to the wider world though, and demonstrates some of the areas where free solutions are lacking, especially in terms of their ease-of-use. </p>

<p>The engine is an extremely powerful tool for large-scale machine learning, as <a href="http://radar.oreilly.com/2011/01/faster-machine-learning.html">this example from O'Reilly's Roger Magoulas demonstrates</a>. Will it open up these sorts of enterprise tools to a whole new set of academic and startup users?</p>
                    ]]></description>
                <link>http://readwrite.com/2011/02/01/a-free-visual-programming-language-for-big-data</link>
                <guid>http://readwrite.com/2011/02/01/a-free-visual-programming-language-for-big-data</guid>
                <category>Big data</category>
                <pubDate>Tue, 01 Feb 2011 01:45:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Qwerly Hopes to Power Rebel Alliance Against Facebook]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/01/28/xwing.jpg" style="" />
			</span>
The <a href="http://qwerly.com/">Qwerly</a> API lets developers easily link together users' various social network accounts. For example, given Tim O'Reilly's Twitter username, it can <a href="http://qwerly.com/twitter/timoreilly">reveal</a> his public profiles at other services like Facebook, Flickr and Plancast. Why is this interesting? Bridging the barriers between different social networks weakens the lock-in effect that makes it tough to opt out of popular services.</p>
<p>If you decide you don't want to participate on Facebook, right now that means losing touch with all of your friends still using it. With Qwerly, a service could let you interact with your entire social network in one place, even if some people are most active on Twitter and others on Facebook. </p>

<p>It's a bit like phone number portability. In the bad old days, if you changed phone company you were given an entirely new number, with all of the hassle of telling your friends and colleagues and changing business cards and stationery. By making a connection between you and your friends' accounts on different networks, Qwerly hopes to make switching to a new service painless.</p>

<p>I spoke to Qwerly's founder <a href="http://twitter.com/maxniederhofer">Max Niederhofer</a> about his plans for the service. He said its mission was to be "at the center of the Rebel Alliance against Facebook - we want to power the federated social web". He continued:</p>

<blockquote> "The motivation to build Qwerly was really the question 'what do we need to build a decentralized social web platform?' and what we came up with was 'first, we need to find out how profiles are connected', i.e. consolidating identities across profiles. We looked at what had happened there in terms of open protocols, like <a href="http://code.google.com/p/webfinger/">webfinger</a>, and figured things weren't moving fast enough."</blockquote>

<p>Originally he was planning on building his own Friendfeed-like service, but one that would instantly show your friends' updates rather than relying on you to laboriously enter all your account details before you'd see any benefits. As he looked at what it would take to build the system, he realized the hardest step would be gathering and linking accounts across social networks, and by sharing the results as an API he'd create a platform for other startups to build their own services on. Services like the <a href="http://hvr.me/">HoverMe</a> social browsing plugin and the <a href="http://duckduckgo.com/">DuckDuckGo</a> search engine are already taking advantage of the interface to enrich the results they offer.</p>
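<p>Linking accounts is a classic "connected components" problem, and a union-find structure is one simple way to sketch it in Python. This is purely my own illustration: I don't know how Qwerly stores its data, the class and method names are made up, and the usernames on services other than Twitter are placeholders.</p>

```python
class ProfileLinker:
    """Groups (service, username) pairs that belong to the same person."""

    def __init__(self):
        self._parent = {}

    def _find(self, profile):
        # Walk up to the representative profile, shortening the path as we go
        self._parent.setdefault(profile, profile)
        while self._parent[profile] != profile:
            self._parent[profile] = self._parent[self._parent[profile]]
            profile = self._parent[profile]
        return profile

    def link(self, a, b):
        """Record evidence that profiles a and b are the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def identity(self, profile):
        """Return every known profile linked to the given one."""
        root = self._find(profile)
        return sorted(p for p in list(self._parent) if self._find(p) == root)


linker = ProfileLinker()
# Hypothetical links; the non-Twitter usernames are placeholders
linker.link(("twitter", "timoreilly"), ("flickr", "placeholder_flickr_name"))
linker.link(("flickr", "placeholder_flickr_name"), ("plancast", "placeholder_plancast_name"))
print(linker.identity(("plancast", "placeholder_plancast_name")))
```

<p>The useful property is that links are transitive: once Twitter is tied to Flickr and Flickr to Plancast, a lookup from any one of them returns all three.</p>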

<p><a href="http://duckduckgo.com/?q=marshallk"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/01/28/ddgscreenshot.png" style="" />
			</span>
</a><br />
Any service that deals with people's personal data raises concerns about privacy, so I asked Niederhofer how they were different from services like Rapleaf that have <a href="http://online.wsj.com/article/SB10001424052702304410504575560243259416072.html">attracted intense criticism</a>. </p>

<p>His response was that "the difference between what Rapleaf was accused of and what we're doing is that Rapleaf was commingling social data and cookie-based data. While social media data isn't yet construed as personally identifiable information, it definitely serves to identify a person. So if you mix cookies and e.g a Facebook ID, you are effectively de-anonymizing web traffic." With the focus on information gathered only from public Web profiles, with <a href="http://qwerly.com/about_us/privacy">no use of cookies</a> or other data sources, he sees the service as just aggregating freely available information in a novel way.</p>

<p>Though it's still early days for the platform, I'm hopeful it can add social context to many different applications, for example transforming the humble phone address book into something much richer. </p>

<p>This is an area that Union Square Ventures' Fred Wilson <a href="http://www.avc.com/a_vc/2011/01/is-the-mobile-phone-our-social-net.html">has been discussing a lot recently</a>. Wilson's colleague Albert Wenger has complained to Niederhofer about his phone: "I open up the address book, it looks like my old Palm - and I mean Pilot, not Pre!" With rich information about all your contacts' social profiles, it's easy to imagine something like <a href="http://gist.com/">Gist</a> tightly integrated with your address book.</p>

<p>What do you think? Is this service going to open up a new world of innovative applications based on federated social information? Are you more concerned about how much personal information we're making publicly available on our profiles?</p>

<p><em>X Wing photo by <a href="http://www.flickr.com/photos/pmiaki/5351186496/">Psiaki</a></em></p>
                    ]]></description>
                <link>http://readwrite.com/2011/01/31/qwerly-hopes-to-power-rebel-alliance-against-facebook</link>
                <guid>http://readwrite.com/2011/01/31/qwerly-hopes-to-power-rebel-alliance-against-facebook</guid>
                <category>APIs</category>
                <pubDate>Mon, 31 Jan 2011 06:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Quora Blocks Startup Search Engines]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/2011/01/27/handtalk.png" style="" />
			</span>
The popular startup question and answer service <a href="http://quora.com/">Quora</a> only allows the largest search engines to index its site. As Gabe Rivera of Techmeme <a href="http://twitter.com/gaberivera/status/30355233607520256">pointed out yesterday</a>, its <a href="http://quora.com/robots.txt">robots.txt</a> file explicitly grants Google, Bing, Blekko and other big players access, but excludes everyone else. If large sites had these restrictions back when Google was starting, it might never have succeeded and we'd still be stuck with AltaVista. As more publishers move to this whitelist approach, are they stifling innovation?</p>

<p><a href="http://twitter.com/yegg">Gabriel Weinberg</a> has been struggling to persuade Facebook to add his <a href="http://duckduckgo.com/">DuckDuckGo</a> search engine to their list of approved crawlers, with no luck. Concerned about mining of their public profiles, last year Facebook started requiring search engines to <a href="http://news.ycombinator.com/item?id=1440154">sign a legal agreement</a> covering the usage of their data. Unfortunately it seems like the process has turned into a barrier for fledgling search companies like Gabriel's. </p>
<p>Despite being happy to enter into <a href="http://www.facebook.com/apps/site_scraping_tos.php">that contract</a>, he hasn't heard back after several months. While he's still able to show Facebook pages thanks to API partners like Bing, this leaves him unable to run his own algorithms to optimally rank and display the results. He's frustrated by the trend towards whitelisting, pointing out that malicious or underhanded scrapers ignore the policy file: "Bad bots don't respect it anyway." In his view it's a big drag on innovation too - "really you're just hurting startups that may use your data in cool ways".</p>
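<p>To see how a whitelist plays out for a well-behaved crawler, here's a small sketch using Python's standard urllib.robotparser module. The robots.txt below is a simplified file I wrote for illustration, not Quora's or Facebook's actual one, and "StartupBot" and the example.com URL are made up.</p>

```python
from urllib.robotparser import RobotFileParser

# A simplified, hypothetical whitelist-style robots.txt
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(WHITELIST_ROBOTS_TXT.splitlines())

url = "http://example.com/some-question"
for agent in ("Googlebot", "bingbot", "StartupBot"):
    # A polite crawler checks this before fetching; a bad bot simply doesn't
    print(agent, parser.can_fetch(agent, url))
```

<p>The named giants get through and the unknown startup is turned away, which is the effect Weinberg is complaining about: the file only restrains crawlers that bother to check it.</p>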

<p>Both <a href="http://www.quora.com/Edmond-Lau/Quora-Extension-API">Quora</a> and Facebook offer APIs to access their data, so why do startups need to crawl their sites? After all, web page scraping is often associated with unsavory scammers and copyright infringers. The real loss is that APIs only allow you to ask the questions that the interface designers have anticipated. For example, Gabriel was hoping to build directories listing the Facebook pages for local businesses by location and type, together with snippets of information about them, just as he does for <a href="http://duckduckgo.com/c/Internet_search_engines">other categories</a> of sites on the web. There's no way to gather that information through the Facebook API, so without crawling access he's unable to implement that feature.</p>

<p>As traditional search companies struggle to pull relevant results from an increasing deluge of low-quality content, we need innovative startups to pioneer new approaches. Without the openness that made it possible for Google to grow, the next big thing in search may never happen.</p>

<p><em><small>Photo by <a href="http://www.flickr.com/photos/carbonnyc/4461823997/">David Goehring</a></small></em></p>
                    ]]></description>
                <link>http://readwrite.com/2011/01/27/quora-blocks-startup-search-en</link>
                <guid>http://readwrite.com/2011/01/27/quora-blocks-startup-search-en</guid>
                <category>Analysis</category>
                <pubDate>Thu, 27 Jan 2011 07:30:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Robots Battle Over Wine]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/redwinesplash-1.jpg" style="" />
			</span>
The collision of the wine websites <a href="http://www.cellartracker.com/">CellarTracker</a> and <a href="http://www.snooth.com/">Snooth</a> raises some interesting questions over data ownership. Snooth was <a href="http://www.vintank.com/2011/01/is-snooth-scraping-data-from-cellartracker/">accused of copying information from CellarTracker's user reviews</a>, using an automated robot script crawling the site. While most commenters were outraged, it's not clear that there's any legal case against Snooth, even if it had crawled the data. As it turns out, <a href="http://www.bojago.com/2011/01/25/an-apology-to-cellartracker/">the problem came from an outdated input feed</a>, rather than its crawler, but the case highlights how many problems will arise as data flows and mixes on the Web.
</p>
<p>
The controversy erupted when an independent wine blog at <a href="http://www.vintank.com/blog/">vintank</a> noticed a spooky correlation between the user tags that appeared on the Snooth site and those present in reviews for the same wine on CellarTracker. For example, these screenshots show the same tags appearing on the two sites, even unusual ones like "mocha" or "presence":
</p>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/winerobots0.png" style="" />
			</span>
<br />
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/winerobots1.png" style="" />
			</span>

<p>
How could a competing site possibly get access to this information? <a href="http://www.cellartracker.com/robots.txt">CellarTracker's robots.txt file</a> (which lays out its policies for automated access to the site) is pretty open, and no log-in is required to see the reviews. </p>
<p>This means that Google and other search engines can analyze those pages and send traffic to the site. They may need to do some fairly sophisticated processing to understand what the page is all about, with the results placed in a database, but since CellarTracker benefits from the extra traffic it's unlikely to complain about its data being copied. In fact, there's a fair <a href="http://www.robotstxt.org/faq/legal.html">body of case law</a> that suggests by opening up your robots.txt, you're giving permission for copies to be made and shown to users.
</p>
<p>
That system works well as long as search engines are the only ones doing the scraping. What's changed over the last few years is that you can now crawl millions of Web pages for just a few dollars. Many startups have <a href="http://www.80legs.com/">sprung</a> <a href="http://bixolabs.com/">up</a> that make this sort of operation easy even if you don't have the expertise in-house. These small-scale operations often don't drive traffic to the source of the information though, so the implicit bargain that publishers have with search engines doesn't hold.
</p>
<p>
One thing site owners can do (and that CellarTracker did for Snooth's crawler) is prohibit individual user agents in the robots.txt. The trouble with this blacklisting is that you can only block them after you find out about them, and that may be after they've already grabbed the data. A better alternative from the publisher's point of view is to whitelist only the search engines you actually care about, as <a href="http://news.ycombinator.com/item?id=1440154">Facebook did</a> in response to my own crawling. This is a bad thing for the open Web though, since new startups are denied access to the sites, and the dominance of the existing search engine players is cemented.
</p>
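<p>The weakness of blacklisting is easy to demonstrate with Python's standard urllib.robotparser module. The file and the crawler names below are invented for the example; they aren't CellarTracker's actual rules or Snooth's actual user agent.</p>

```python
from urllib.robotparser import RobotFileParser

# A hypothetical blacklist-style robots.txt: one named crawler is banned,
# everyone else is welcome
BLACKLIST_ROBOTS_TXT = """\
User-agent: WineScraper
Disallow: /

User-agent: *
Allow: /
"""

blacklist = RobotFileParser()
blacklist.parse(BLACKLIST_ROBOTS_TXT.splitlines())

page = "http://example.com/wine/12345"
print(blacklist.can_fetch("WineScraper", page))  # the crawler you know about
print(blacklist.can_fetch("BrandNewBot", page))  # one you haven't heard of yet
```

<p>The banned name is refused, but any crawler the publisher hasn't met yet still gets in, so the rule only ever applies after the fact.</p>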
<p>
What's clear from this case is that the rules of the Web have departed from people's intuitive sense of right and wrong. As Snooth's CEO points out, the user tags at the center of the controversy were "<em>not reviews or information that is, or could be, copyright-protected</em>."</p>
<p>The copyright law around data requires some highly-paid lawyers to fully explain, but the gist is that plain information without any creative contribution from the author is tough to protect. He also highlights the problem of dealing with information from over 50 million Web pages, plus many other data feeds. Once the data is entered into a system like that, there's rarely any mechanism to keep track of its source, or apply any special restrictions on its use. </p>
<p>
Put all these issues together, and you have a world of proliferating scrapers, pouring unprotected data into systems that mix and match the streams promiscuously, producing an end result that may compete against the sources themselves. It's a place where Snooth could copy those tags, with only user outrage to hold them back. What do you think the rules should be? 
</p>

<p><em><small>Photo by <a href="http://www.flickr.com/photos/cicciofarmaco/4989751112/">ciccioetneo</a></small></em></p>
                    ]]></description>
                <link>http://readwrite.com/2011/01/26/robots-battle-over-wine</link>
                <guid>http://readwrite.com/2011/01/26/robots-battle-over-wine</guid>
                <category>News</category>
                <pubDate>Wed, 26 Jan 2011 09:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Wolfram Alpha's API is Free, But is it Open?]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/wolframdevlogo_150x150.png" style="" />
			</span>

<a href="http://www.wolframalpha.com/">Wolfram Alpha</a> has assembled an impressive collection of information on everything from <a href="http://www.wolframalpha.com/examples/Chemistry.html">chemistry</a> to <a href="http://www.wolframalpha.com/examples/MoneyAndFinance.html">high finance</a>, but until recently external developers could only access it by paying <a href="http://arstechnica.com/web/news/2009/10/developers-get-access-to-wolfram-alpha-if-they-pay.ars">between two and six cents per query</a>. Today the company <a href="http://www.marketwire.com/press-release/WolframAlpha-Releases-API-Version-20-1383529.htm">announced</a> a big change to its pricing plans which gives non-commercial users 2,000 free calls a month, as well as adding new features like the asynchronous delivery of slower results. With few external applications appearing to use the old interface, can these changes open it up to a wider audience of developers?
</p>


<p>
The API itself is very similar to the <a href="http://www.wolframalpha.com/">Wolfram Alpha Web interface</a>. Developers pass in a query string, and then get back XML results that reflect exactly what you'd see in the browser for the same search. This makes it ideal for formatting and displaying to users, since you get back plain text descriptions and images visualizing the information. This is exactly how most of Wolfram's flagship customers have been using it. For example <a href="http://blog.wolframalpha.com/2009/11/11/microsoft%E2%80%99s-bing-introducing-one-of-wolframalpha%E2%80%99s-first-commercial-api-customers/">Bing</a> displays information from Alpha alongside its own search results, and <a href="http://www.touchpress.com/">Touch Press</a> uses it to supplement its interactive books.
</p>
<p>
This is great if you want to show the information immediately to users, but what if you want to understand and process the data as part of your application? You might want to run your own analysis on a company's share price, but you'll have a tough time converting the plain text results into numbers you can feed into an algorithm, and though the Mathematica versions are structured, they're not in a simple format to read in. This may not be accidental - the <a href="http://products.wolframalpha.com/api/termsofuse.html">terms of service</a> make it clear that you can't "access, cache, store, retain, or in any way compile any copies or portion of any Wolfram|Alpha content." <span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/wolfram0.png" style="" />
			</span>
Wolfram has built up a large and valuable collection of data, and the company doesn't want to make it too accessible for fear that it may be copied. There is a sign of hope though in the mention of an upcoming data API, which sounds like it might offer a more programmer-friendly version of the results. 
</p>
<p>
The easiest way to try it for yourself is through their <a href="http://products.wolframalpha.com/api/explorer.html">API Explorer</a> page. If you enter a query, you'll see the XML results appear, along with the URL you'd call from your application to run the same search. The results are split up into sections that Wolfram describe as "Pods." Each one of these corresponds to a different nugget of information related to the terms you entered, and matches the way results are shown in the normal Web interface. 
</p>
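<p>
To give a feel for what working with Pods looks like, here's a minimal Python sketch. The endpoint and the <em>input</em> and <em>appid</em> parameter names are what I understand the version 2.0 API to use, so check them against the reference guide; the sample XML is hand-written for illustration and is not real output, and the function names are my own.
</p>

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed v2 query endpoint; verify against the official reference guide.
API_ENDPOINT = "http://api.wolframalpha.com/v2/query"


def build_query_url(query, app_id):
    """Build the URL an application would call for a given query string."""
    params = urllib.parse.urlencode({"input": query, "appid": app_id})
    return API_ENDPOINT + "?" + params


# Hand-written sample, for illustration only; not real API output.
SAMPLE_XML = """
<queryresult success="true">
  <pod title="Input interpretation">
    <subpod><plaintext>example input</plaintext></subpod>
  </pod>
  <pod title="Result">
    <subpod><plaintext>example result</plaintext></subpod>
  </pod>
</queryresult>
"""


def extract_pods(xml_text):
    """Map each pod title to the list of plain-text strings it contains."""
    root = ET.fromstring(xml_text.strip())
    pods = {}
    for pod in root.iter("pod"):
        pods[pod.get("title")] = [
            el.text for el in pod.iter("plaintext") if el.text
        ]
    return pods


pods = extract_pods(SAMPLE_XML)
```

<p>
Note that what comes back is still display text: turning a pod's plain-text string into numbers is left entirely to your own code.
</p>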
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/wolfram1.png" style="" />
			</span>

<p>
There's <a href="http://products.wolframalpha.com/docs/WolframAlpha-API-Reference.pdf">a complete reference guide available as a PDF</a>, detailing the options you can specify to narrow down your query, as well as the meaning of some of the results sections.
</p>
<p>
Stephen Wolfram and his team have created an astonishingly powerful collection of information. As <a href="http://blog.wolframalpha.com/2011/01/20/knowledge-based-computing-and-version-20-of-the-wolframalpha-api/">he puts it on the Wolfram blog</a>, the dream is to make this "computable knowledge" available to immediately enhance any program that's connected to the service. Today's announcement is a big step forward to opening it up to far more developers, but it will need much more computer-readable results before it will really fulfill that promise. Do you agree, or am I misunderstanding the power of the API as it is right now? Are there existing applications beyond the handful that Wolfram highlight? Let us know in the comments.
</p>
                    ]]></description>
                <link>http://readwrite.com/2011/01/21/wolfram-alphas-api-is-free-but-is-it-open</link>
                <guid>http://readwrite.com/2011/01/21/wolfram-alphas-api-is-free-but-is-it-open</guid>
                <category>APIs</category>
                <pubDate>Fri, 21 Jan 2011 08:40:43 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[Secrets of BackType's Data Engineers]]></title>
                <description><![CDATA[
                                        <p>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/backtypelogo.jpg" style="" />
			</span>
How do three guys with only seed funding process a hundred million messages a day? I sat down with the <a href="http://backtype.com/">BackType</a> team to discover how they built a service relied upon by companies like bit.ly, Hunch and The New York Times. 
</p>
<p>
BackType captures online conversations, everything from tweets to blog comments to checkins and Facebook interactions. Its business is aimed at helping marketers and others understand those conversations by measuring them in a lot of ways, which means processing a massive amount of data.  
</p>
<p>
To give you an idea of the scale of its task, it has about 25 terabytes of compressed binary data on its servers, holding over 100 billion individual records. Its API serves 400 requests per second on average, and it has 60 EC2 servers around at all times, scaling up to 150 for peak loads. 
</p>
<div style="width:104; float:right; margin:30"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/backtype_christopherg.jpg" style="" />
			</span>

<i>Christopher Golda</i>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/backtype_mikem.jpg" style="" />
			</span>

<i>Michael Montano</i>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/backtype_nathanm.jpg" style="" />
			</span>

<i>Nathan Marz</i></div>
<p>It has pulled this off with only seed funding and just three employees: <a href="http://twitter.com/golda">Christopher Golda</a>, <a href="http://twitter.com/michaelmontano">Michael Montano</a> and <a href="http://twitter.com/nathanmarz">Nathan Marz</a>. They're all engineers, so there aren't even any sysadmins to take some of the load.
</p>
<p> 
Coping with that volume of data with limited resources has forced them to be extremely creative. They've invented their own language, <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">Cascalog</a>, to make analysis easy, and their own database, ElephantDB, to simplify delivering the results of their analysis to users. They've even written a system to update traditional batch processing of massive data sets with new information in near real-time.
</p>
<p> 
The backbone of BackType's pipeline is Amazon Web Services, using S3 for storage and EC2 for servers. It leverages technologies such as Clojure, Python, <a href="http://hadoop.apache.org/">Hadoop</a>, <a href="http://cassandra.apache.org/">Cassandra</a> and <a href="http://thrift.apache.org/">Thrift</a> to process this data in batch and real-time. 
</p>
<p>
The start of the pipeline is a group of machines that ingest data from the Twitter firehose, Facebook API and millions of sites and other social media services. The first interesting feature of the architecture is that it actually has two different pipelines, one the traditional batch layer that takes hours to produce results, and a "speed layer" that reflects new changes immediately.
</p>
<p>
Captured data is fed into the batch layer through processes on each machine called collectors. These append new data to a local file, which is then copied over to S3 periodically. This raw data is then put through a process they call shredding, which organizes it in two different ways. First, data units are stored with others of the same type. For example the content of tweets or blog comments would be stored together and separate from the names of their authors. Second, the same data is sliced by time, so everything within a single day will be stored together.
</p>
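<p>
The two-way organization is easy to picture in code. This is only a sketch of the idea in Python, not BackType's implementation: the record layout (a field type, a Unix timestamp and a value) and the function name are assumptions of mine.
</p>

```python
from collections import defaultdict
from datetime import datetime, timezone


def shred(records):
    """Group (field_type, unix_timestamp, value) tuples by type and UTC day.

    The tuple layout is hypothetical; it stands in for whatever data units
    the real pipeline stores.
    """
    buckets = defaultdict(list)
    for field_type, timestamp, value in records:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
        buckets[(field_type, day)].append(value)
    return dict(buckets)


shredded = shred([
    ("content", 1294790400, "first tweet"),   # 2011-01-12 00:00:00 UTC
    ("author", 1294790400, "alice"),
    ("content", 1294794000, "second tweet"),  # one hour later, same day
    ("content", 1294790399, "late tweet"),    # one second before midnight
])
```

<p>
A job that only cares about one day's message content can then read a single bucket and skip everything else.
</p>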
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/backtypediagram.png" style="" />
			</span>

<p>
Why do they do this? The organization of the data enables them to run more efficient queries only against the relevant data. When they have a job that requires analyzing Twitter retweets for example, they can just pull out the content, sender and time for each message, and ignore all other metadata. This process is made a lot easier thanks to their use of Thrift for the data storage. Everything in their system is described by a graph-like Thrift schema, which controls the folder hierarchy the data is stored into, and automagically creates the Java/Python/etc code for serialization. 
</p>
<p>
Cascalog is one of their secret weapons, a <a href="http://clojure.org/">Clojure</a>-based query language for Hadoop that makes it simple for them to analyze their data in new ways. Inspired by the venerable <a href="http://en.wikipedia.org/wiki/Datalog">Datalog</a>, and built on top of <a href="http://www.cascading.org/">Cascading</a>, it allows you to write queries in Clojure and define even complex operations in simple code. Unlike alternatives like <a href="http://research.yahoo.com/project/90">Pig</a> or <a href="http://wiki.apache.org/hadoop/Hive">Hive</a>, it's written within a general-purpose language, so there's no need for separate user-defined functions, but it's still a highly-structured way of defining queries.
</p>
<p>
 Its power has enabled them to quickly add features like domain-level statistics and per-user influence scores with just a couple of screens of code. It's spread beyond BackType and has an active user community including companies like eHarmony, PBworks and Metamarkets.
</p>
<p>
The final part of the batch processing puzzle is how to get the results of your analysis to the final user. They experimented with writing out the data to a Cassandra cluster, but ran into performance issues. What they ended up creating instead was a system they call ElephantDB. It takes all the data from a batch job, splits it up into shards, each of which is written out to disk as BerkeleyDB-format files. After that they fire up an ElephantDB cluster to serve the shards. Unlike many traditional databases, it's read-only, so to update data served from the batch layer you create a new set of shards.
</p>
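<p>
Here's a rough Python sketch of the shard-and-serve pattern described above. It is not ElephantDB's code: in-memory dictionaries stand in for the BerkeleyDB-format files, the MD5-based shard choice is my own assumption, and all the names are hypothetical.
</p>

```python
import hashlib
from types import MappingProxyType


def shard_for(key, num_shards):
    """Pick a shard with a stable hash so writers and readers always agree."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(batch_output, num_shards):
    """Split one batch job's results into read-only shards."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in batch_output.items():
        shards[shard_for(key, num_shards)][key] = value
    # Read-only views: updating means building a whole new set of shards.
    return [MappingProxyType(shard) for shard in shards]


def lookup(shards, key):
    """Serve a read by going straight to the one shard that can hold the key."""
    return shards[shard_for(key, len(shards))].get(key)


shards = build_shards({"example.com": 42, "example.org": 7}, num_shards=4)
```

<p>
Because nothing is ever modified in place, serving a fresh batch is just a matter of swapping in a newly built list of shards.
</p>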
<p>
So that's how the heavy processing is done, but what about instant updates? The speed layer exists to compensate for the high latency of the batch layer. It is completely transient and because the batch layer is constantly running it only needs to worry about new data. The speed layer can often make aggressive trade-offs for performance because the batch layer will later extract deep insights and run tougher computations. It takes the data that came in after the last batch processing job and applies fast running algorithms.
</p>
<p>
Because the Hadoop processing is run once or twice a day, the fast layer only has to keep track of a few hours of data to produce its results. The smaller volume makes it easy to use database technologies like MySQL, Tokyo Tyrant and Cassandra in the speed layer. Crawlers put new data on <a href="http://gearman.org/">Gearman</a> queues and workers process and write to a database. When the API is called, a thin layer of code queries both the speed layer database and the batch ElephantDB system, and merges the information from both to produce the final output that's shown to the outside world.
</p>
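<p>
That thin merging layer can be sketched in a few lines of Python. I'm assuming the metric is a simple count, so merging is just addition; other metrics would need their own merge functions, and the dictionaries here are stand-ins for the real ElephantDB and speed-layer databases.
</p>

```python
def merged_count(key, batch_view, speed_view):
    """Combine the batch result with whatever arrived since the last batch run.

    Both views are plain dicts in this sketch; a missing key counts as zero.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical numbers: the batch layer saw 1,200 mentions up to its last run,
# and the speed layer has counted 15 more since then.
batch_view = {"example.com": 1200}
speed_view = {"example.com": 15, "new.example": 3}

total = merged_count("example.com", batch_view, speed_view)
```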
<p>
BackType isn't the only startup to split its processing using this combination of speed and batch layers; Hunch does something similar for its user recommendations. The trouble is that nobody has found an approach that is as elegant or generally applicable as MapReduce for real-time processing of continuous streams of data. 
</p>
<p>
<div class="pullquote">Instead of the firefighting and housekeeping burden I'd expect from such a complex system, they seem to spend most of their time focused on applications that solve customer problems.</div>Yahoo's <a href="http://labs.yahoo.com/event/99">S4 "Distributed Stream Computing Platform"</a> is an interesting start, but Marz explained that they weren't able to build on top of it because it didn't offer any reliability guarantees, thanks to its use of UDP for communication. The lack of unit tests also made it daunting, since it would be tough to spot if any modifications they needed to make had introduced subtle bugs. 
</p>
<p>
Instead, Marz and Montano have been working on a new framework based on their own experiences. The technology managing the streaming processing and guaranteeing reliability of messages is called Storm, and though it can run a variety of languages, they've designed one especially for it called Thunderlog, based on Cascalog. 
</p>
<p>
Though they are not yet ready for release, Storm and Thunderlog are being actively developed and will soon replace their more hand-coded speed layer. The system will incorporate many of the tips they picked up building their first system. For instance, to avoid concurrency issues without paying a performance penalty, you can group events by key so that possibly conflicting changes happen on the same machine in a serial fashion.
</p>
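<p>
The group-by-key tip is worth a small illustration. This Python sketch is not Storm's API; it just shows the routing idea, with a CRC32 hash and function names that are my own assumptions.
</p>

```python
import zlib
from collections import defaultdict


def worker_for(key, num_workers):
    """Send every event for a given key to the same worker, every time."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def route(events, num_workers):
    """Split a stream of (key, payload) events into per-worker queues.

    Each worker drains its queue serially, so two updates to the same key
    can never race each other, and no locks are needed.
    """
    queues = defaultdict(list)
    for key, payload in events:
        queues[worker_for(key, num_workers)].append((key, payload))
    return queues


events = [("user:1", "+1"), ("user:2", "+1"), ("user:1", "+5"), ("user:1", "-2")]
queues = route(events, num_workers=3)
```

<p>
Arrival order is preserved within each queue, so all three updates to <em>user:1</em> are applied one after another on a single machine.
</p>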
<p>
At the end of the tour of <a href="http://tech.backtype.com/">their technology</a>, I was left very impressed by how much they have accomplished with so few engineers. Instead of the firefighting and housekeeping burden I'd expect from such a complex system, they seem to spend most of their time focused on applications that solve customer problems. 
</p>
<p>
The secret is their ability to automate the routine tasks with tools like Cascalog, ElephantDB and Thunderlog. Writing those allows them to spend their limited time on writing new applications that offer direct value to their users, without having to wrestle with screenfuls of boilerplate code first. They are on the lookout for new team members, and say they've only stayed so small because they are so committed to only hiring the very best. If you're interested in working on the cutting edge of big data processing, drop them an email at <a href="mailto:jobs@backtype.com">jobs@backtype.com</a>.
</p>
                    ]]></description>
                <link>http://readwrite.com/2011/01/12/secrets-of-backtypes-data-engineers</link>
                <guid>http://readwrite.com/2011/01/12/secrets-of-backtypes-data-engineers</guid>
                <category>Profiles</category>
                <pubDate>Wed, 12 Jan 2011 06:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[The Secret Life of Robots]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/redrobot-1.jpg" style="" />
			</span>
Despite companies like Google making tens of billions of dollars from Web crawling, the rules governing so-called robots indexing the Web are surprisingly vague. As somebody who <a href="http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html">ran afoul of Facebook</a> with <a href="http://www.readwriteweb.com/archives/facebook_user_data_analysis.php">my own crawler</a>, I've taken a keen interest in other sites' attitudes to external access. There are some interesting stories buried in the robots.txt files that define their policies, so let me take you on a tour.
</p>
<h2>Wikipedia</h2>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/robotstxt.png" style="" />
			</span>

<p>
This is the only big site which makes all of its data freely available as a bulk download, without requiring a crawler or API to access, so unsurprisingly its <a href="http://en.wikipedia.org/robots.txt">robots.txt</a> is a bit unusual. It's chock-full of comments and is even <a href="http://en.wikipedia.org/w/index.php?title=MediaWiki:Robots.txt&action=edit">editable</a> on the main site. There are lots of user agents that are disallowed, usually with descriptive comments about why they've been banned. Particularly interesting is the commentary on <a href="http://www.webreaper.net/">WebReaper</a>, banned because it "<em>downloads gazillions of pages with no public benefit</em>." This reveals the implicit deal between site owners and crawlers: the publishers put up with automated access as long as they can see the value that's returned. In Wikipedia's case that's about the benefit for the general public, but for most sites it's about their self-interest. Google sends them traffic, and other crawlers are tolerated in the hope they'll do the same. 
</p>
<h2>Facebook</h2>
<p>
After experimenting with <a href="http://www.facebook.com/apps/site_scraping_tos_terms.php">auxiliary terms of service</a> outside of their <a href="http://www.facebook.com/robots.txt">robots.txt</a>, the social network eventually <a href="http://news.ycombinator.com/item?id=1440154">settled on a whitelist policy</a>. This means they disallow access to everyone, except a select group of search engines who've agreed to their terms. Interestingly, there are some smaller players like <a href="http://www.seznam.cz/">Seznam</a> included, showing how keen Facebook are on competing in overseas markets. <div class="pullquote">...the catch-all policy for all other crawlers has a comment that its target is "every bot that might possibly read and respect this file." This highlights the reliance of publishers on crawlers' good manners, and the need to spot and block the IP addresses of any that are being badly behaved.</div>There's also a password-protected <a href="http://www.facebook.com/sitemap.php">site map</a> that presumably makes it easy for those search engines to find and index everyone's public profiles.
</p>
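<p>
If you want to see how a whitelist policy looks to a crawler, Python's standard <em>urllib.robotparser</em> module will evaluate one for you. The robots.txt below is a made-up example in the whitelist style, not Facebook's actual file, and the "MyCrawler" agent name is hypothetical.
</p>

```python
from urllib.robotparser import RobotFileParser

# Made-up whitelist-style policy for illustration; not any real site's file.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(WHITELIST_ROBOTS_TXT.splitlines())

# A named search engine gets in; everybody else hits the catch-all rule.
googlebot_allowed = parser.can_fetch("Googlebot", "http://example.com/profile")
stranger_allowed = parser.can_fetch("MyCrawler", "http://example.com/profile")
```

<p>
Nothing enforces the answer, of course; a crawler has to choose to ask the question and respect the result.
</p>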
<h2>Google</h2>
<p>
Appropriately for a company founded on Web crawling, <a href="http://www.google.com/robots.txt">Google's robots.txt</a> is very open. The main restrictions are on service entry points like search pages or analytics. This led me to discover some tools I didn't know about before, like its <a href="http://www.google.com/unclesam">Uncle Sam U.S. government search engine</a>, as well as some mysterious entries like the <a href="http://www.google.com/compressiontest/">compressiontest folder</a> whose function <a href="http://www.google.com/support/forum/p/Web%20Search/thread?tid=553fd3a641b1ce49&hl=en">nobody's quite certain of</a>.
</p>
<p>
Probably the biggest difference between Google and Facebook is in their treatment of users' public profiles. Whereas Facebook is very picky about who it gives access to, Google not only explicitly calls out the profiles folder as accessible to all robots, it even provides a site map listing all users' IDs to make it easier to grab their information. Last year I released <a href="https://github.com/petewarden/buzzprofilecrawl">an open-source project</a> demonstrating how to access these profiles, and it looks like there's now around 9 million available, thanks to Buzz and other efforts to persuade users to open up their information to the world.
</p>
<h2>eBay</h2>
<p>
I was surprised to see that <a href="http://www.ebay.com/robots.txt">eBay's robots.txt</a> takes a similar approach to the one abandoned by Facebook, where they have a file that encourages crawling of the whole site, but includes a comment attempting to impose legal restrictions on what can be done with the information that's gathered. This is a problem for the open Web because if it's accepted as valid, it would require a lawyer to read and understand every single site's terms before anyone could write a general-purpose crawler like Google's. I believe the real answer is <a href="http://33bits.org/2010/12/05/web-crawlers-privacy-reboot-robots-txt/">making some of these common restrictions machine-readable in an extended robots.txt standard</a>, so that startups can continue to innovate without the risk of legal action from unhappy publishers.
</p>
<h2>Amazon</h2>
<p>
Unusually, <a href="http://www.amazon.com/robots.txt">Amazon's robots.txt</a> imposes more restrictions on Google's crawler than on other spiders; it's the only one I've found that does. It singles out Google to prevent it accessing product reviews, which makes me wonder if there was some dispute over the search engine's re-use of the information on its own sites. It certainly seems like strategic information that adds a lot of value to Amazon, so I can understand why it might not want to share it with a rival. Otherwise Amazon displays a remarkably open policy towards Web crawlers. It obviously believes it gets a lot of value from other people indexing the site, because its whole product catalog seems to be available for download, including prices, ratings and related product information. Its site maps even list categories like brands and authors to make it simple to access all its products.
</p>
<h2>Twitter</h2>
<p>
Like Wikipedia, Twitter's data is more easily available through other channels like the API, and its <a href="http://twitter.com/robots.txt">robots.txt</a> reflects this focus. There's also a plaintive note of despair at the bad behavior of Web crawlers that's reminiscent of Wikipedia's complaints. Apparently Google's crawler doesn't respect the crawl-delay directive, and the catch-all policy for all other crawlers has a comment that its target is "<em>every bot that might possibly read and respect this file</em>." This highlights the reliance of publishers on crawlers' good manners, and the need to spot and block the IP addresses of any that are being badly behaved.
</p>
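<p>
For crawler authors who do want to mind their manners, here's a small Python sketch of honoring a crawl-delay directive using the standard <em>urllib.robotparser</em> module. The robots.txt is a made-up sample rather than Twitter's real file, and the <em>polite_schedule</em> helper and "MyCrawler" agent name are hypothetical.
</p>

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /search
"""


def polite_schedule(urls, agent, parser, default_delay=1.0):
    """Return (seconds_from_start, url) pairs for the URLs we may fetch.

    Disallowed URLs are dropped, and allowed ones are spaced out by the
    site's crawl-delay (or default_delay if the site doesn't set one).
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay
    schedule = []
    offset = 0.0
    for url in urls:
        if not parser.can_fetch(agent, url):
            continue
        schedule.append((offset, url))
        offset += delay
    return schedule


parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())
plan = polite_schedule(
    ["http://example.com/a", "http://example.com/search/cats", "http://example.com/b"],
    "MyCrawler",
    parser,
)
```

<p>
A real crawler would sleep between requests rather than just computing offsets, but the decision logic is the same.
</p>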
<p>
Are there any other secrets buried in robots.txt? I'm sure there are a lot of surprises out there, so if you've discovered a site with funky policies, let me know in the comments.
</p>
<p><em><small>Photo by <a href="http://www.sxc.hu/photo/447069">Splenetic</a></small></em></p>

                    ]]></description>
                <link>http://readwrite.com/2011/01/11/the-secret-life-of-robots</link>
                <guid>http://readwrite.com/2011/01/11/the-secret-life-of-robots</guid>
                <category>Tips</category>
                <pubDate>Tue, 11 Jan 2011 06:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[How to Hire Coders]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/techcofoundersign.jpg" style="" />
			</span>
One of the most common questions I get asked is "How can I find technical employees?" The market for good programmers is extremely tight, and traditional techniques like job-board or craigslist postings won't produce results. Even the best network of contacts probably won't uncover candidates willing to reach out to you. So, what can you do?</p>
<p><strong>Start early:</strong> Accept that it's going to take longer than you'd like to find the right person, and so plan ahead now for hires you'll be making in six months or a year. If you do find the right person too early, make them an offer and take the hit of paying their salary for a few months before you really need them. The hiring process will be time-consuming, so schedule accordingly; treat it as a first-class task that will take up a significant chunk of every week.
</p>
<p><strong>
Stalk them:</strong> The best candidates aren't looking for a new job; they're being pampered and praised at their current company. The first thing you'll need to do is find out who these people are and connect with them. One of my favorite hacks for this is joining technical meetup groups in your area. Even if you're a business guy, you'll probably be able to nod and smile your way through most presentations. You'll get to see who's enthusiastic and can communicate well, and the social side is a great way to talk to engineers you'd never be in contact with otherwise. If you supply beer and pizza, you'll be very popular. </p>
<p>Go to <a href="http://meetup.com/">Meetup.com</a> and put in a few relevant keywords (e.g. Machine Learning, Big Data) to discover nearby get-togethers. Another avenue is lurking on open-source projects' mailing lists, which won't find local coders but otherwise has a lot of the advantages of attending meetups, since you can see how people communicate and work with others.
</p>
<p><strong>
Understand them:</strong> In order to pry them away from their current job, you need a strong lure, and you need to understand what their motivations are to craft something tempting. The best programmers often aren't driven by money, so figure out if they're after more responsibility, independence, the chance to work with cutting-edge technology or recognition from their peers. Sit down with them over coffee, join them on a hike, spend time with them however you can. Once you know what makes them tick, you can build an offer that's hard for them to refuse.
</p>
<p><strong>
Pimp yourself:</strong> People are a lot more likely to want to join companies they've heard of, startups that are recognized for doing interesting and challenging work, so talk about the great technology you're building every chance you get. Encourage your existing programmers to blog and talk to journalists like me, and try to reach communities like <a href="http://news.ycombinator.com/">Hacker News</a> and <a href="http://reddit.com">Reddit</a> with your stories. Have your engineers give talks at conferences and contribute back to open-source projects.</p>
<p> It may feel painful as that cuts into the time they spend on development, but the added visibility in the development community will be a powerful recruiting tool.
</p>
<p><strong>
Qualify them:</strong> Often you'll find a programmer who loves the idea of joining an early-stage company, but when it comes to making the plunge and leaving the security of a steady job, they get cold feet. Get a feel for how serious they are by paying them for part-time consulting during the courtship. If they are reluctant, or can't fit it into their schedule, that's a sign they might not be willing to follow through.
</p>
<p><strong>
Look at alternatives:</strong> This is probably starting to sound like a lot of work, so think hard about what you're trying to achieve by hiring a new employee. Do you really need somebody with 10 years' experience, or would you be better off finding a clever intern with fire in her belly and something to prove? </p>
<p>Bosses almost always underrate their existing employees' potential, because they've seen their failures first hand, and new hires come in with no history. Would you actually be better off spending your time training your current coders to take on more challenging work? Can you find a consulting firm to help them out on particular areas? How can you make sure that none of your current team leaves?
</p>

                    ]]></description>
                <link>http://readwrite.com/2010/12/28/how-to-hire-coders</link>
                <guid>http://readwrite.com/2010/12/28/how-to-hire-coders</guid>
                <category>How-To</category>
                <pubDate>Tue, 28 Dec 2010 03:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[How to Semantically Analyze Web Pages With Delicious]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/images/delicious.png" style="" />
			</span>
There are <a href="http://www.readwriteweb.com/archives/rip_delicious_you_were_so_beautiful_to_me.php">many reasons to love delicious</a> and hope that it survives its current rocky patch, but as a programmer there's one thing I've found it essential for. I often write applications that need to process and organize thousands or millions of Web pages.</p>

<p>To do that, I need to know something about their meaning, what topics they're associated with, if they're blogs, political, technical, commercial, and what other categories they fall into. One way is to run an API like <a href="http://developer.zemanta.com/">Zemanta</a> or <a href="http://www.opencalais.com/">OpenCalais</a> on the pages' text, and hope to use significant terms to pick categories. This is an extremely intensive process on large collections, and even the best semantic analysis is nowhere near as good as a human summary. What if you could get millions of people to categorize the pages for you, for free?
</p>

<p>That's exactly what delicious's <a href="http://www.delicious.com/help/json">urlinfo API</a> gives you. It returns the top 10 tags for any URL, together with a count of how many times each tag has been used.
</p>
<p>
<a href="http://web.mailana.com/labs/delicious_tags/test_page.php?url=http%3A%2F%2Fpetewarden.typepad.com%2F"><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/delicious0.png" style="" />
			</span>
</a>
</p>
<p>
You can then use these tags to enrich your own application with a spooky level of intelligence about websites. You could restrict searches to particular categories or industries for example, in a similar way to <a href="http://blekko.com/">Blekko's</a> slashtags, or organize referrer analytics by what kind of site the links are coming from. For most sites the top 10 tags are both very informative and highly accurate, so you can get some very effective results.
</p>
<p>
Using the API is simplicity itself. You don't even need a key and it supports JSONP callbacks, allowing you to access it even within completely browser-based applications. To demonstrate how to use it I've put up <a href="http://github.com/petewarden/delicious_tags">some PHP sample code on GitHub</a>, but the short version is you call <strong>http://feeds.delicious.com/v2/json/urlinfo/data?hash=</strong> with the MD5 hash of the URL appended, and you get back a JSON string containing the tags. If you want to see how accurate it is, <a href="http://web.mailana.com/labs/delicious_tags/test_page.php?url=http%3A%2F%2Fpetewarden.typepad.com%2F">here's a live version of the code you can play with</a>.
</p>
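<p>
For anyone who'd rather see it outside PHP, here's a minimal Python sketch of the same call. Only the endpoint and the MD5 step come from the description above; the shape of the JSON response (a list holding one object with a <em>top_tags</em> map) is my assumption, and the sample response is invented for illustration.
</p>

```python
import hashlib
import json

URLINFO_ENDPOINT = "http://feeds.delicious.com/v2/json/urlinfo/data?hash="


def urlinfo_request_url(page_url):
    """Append the MD5 hex digest of the page's URL to the urlinfo endpoint."""
    return URLINFO_ENDPOINT + hashlib.md5(page_url.encode("utf-8")).hexdigest()


def top_tags(response_text):
    """Return (tag, count) pairs, most-used first.

    Assumes the response is a JSON list whose first object has a "top_tags"
    mapping of tag name to count; adjust if the real payload differs.
    """
    entries = json.loads(response_text)
    if not entries:
        return []
    tags = entries[0].get("top_tags", {})
    return sorted(tags.items(), key=lambda item: item[1], reverse=True)


# Invented response, for illustration only.
SAMPLE_RESPONSE = '[{"url": "http://example.com/", "top_tags": {"blog": 12, "data": 30}}]'

request_url = urlinfo_request_url("http://example.com/")
tags = top_tags(SAMPLE_RESPONSE)
```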
                    ]]></description>
                <link>http://readwrite.com/2010/12/20/how-to-semantically-analyze-we</link>
                <guid>http://readwrite.com/2010/12/20/how-to-semantically-analyze-we</guid>
                <category>How-To</category>
                <pubDate>Mon, 20 Dec 2010 02:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[The Secrets Behind Blekko's Search Technology]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/blekkologo.jpg" style="" />
			</span>
<a href="http://www.blekko.com/">Blekko</a> has a refreshingly different interface to search, and a generous data-sharing philosophy, but what I didn't realize was how innovative its underlying technology is. Last week I sat down with CEO <a href="http://www.skrenta.com/">Rich Skrenta</a> and CTO <a href="http://www.pbm.com/~lindahl/">Greg Lindahl</a>, and they took me on a fascinating tour of the system they've built.</p>
<p>Starting with the hardware side, they have around 800 servers in their data center, each with 64 GB RAM and eight SATA drives giving each one about eight terabytes of local storage. The first thing that caught my attention about this setup was when they explained why they avoided RAID. </p>
<div class="super-pullquote"><b>See also:</b><br><a href="http://www.readwriteweb.com/archives/how_to_use_blekko_to_rock_at_your_job.php">How to Use Blekko to Rock at Your Job</a><br><a href="http://www.readwriteweb.com/archives/top_10_rss_and_syndication_technologies_of_2010.php">Top 10 RSS and Syndication Technologies of 2010</a></div><p>In their tests it cut performance in half, from 800 MB a second total across the eight disks with raw access to only 300-350 MB/s with a RAID controller in the pipeline. Even if it didn't impact speed, it's a nag. As soon as one of the eight disks fails, it will page an engineer to drive out to the co-lo and swap it out before there's any data lost. With over 6,000 drives in their cluster, whoever's on-call wouldn't get much sleep!</p>
<h2>An Entirely Decentralized Architecture</h2>
<p>So how do they deal with dying drives? It's designed into their software architecture, along with lots of other nifty tricks they've picked up over the years. Skrenta explained that he'd been involved in a lot of previous systems that had separate modules for crawling, analysis and delivering results to users. This separation of tasks meant that passing data between the modules was a messy, bug-prone process and as time went by engineers would end up spending most of their time firefighting problems as they came up, rather than adding improvements. As Lindahl put it, quoting <a href="http://en.wikipedia.org/wiki/Leslie_Lamport">Leslie Lamport</a>, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
</p><div class="pullquote">The real shocker was the <em><strong>strftime()</strong></em> C function's bad behavior. They were tracking down an intermittent performance problem and discovered that it would sometimes access up to 50 files from disk, shoving a stick in the spokes of any application that relied on fast response times thanks to the unexpected disk seeks this causes.</div>
<p>
What they needed was a system where the crawling, analysis and delivery of results could use a single data store and set of programming primitives, in a way that was both simple to write and to debug. This led them to the radical step of building an entirely decentralized architecture, with no masters, slaves or indeed any servers with special roles.</p>

<p>Even BigTable has the notion of a root tablet that acts a bit like a DNS server, holding the locations of the servers to talk to about a given key, but Blekko's system relies on a completely distributed 'swarm' approach. Every server advertises the keys of the buckets it's currently holding, and in turn each server listens and remembers the locations of the several thousand buckets holding the actual data. There are three copies of each bucket held on different machines, and if a drive or server fails, other machines will notice the problem through the broadcast information and start the healing process by replicating the affected buckets to other computers. This is a much lower maintenance system compared to swapping out drives when RAID starts complaining!
</p>
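<p>To make the swarm idea concrete, here's a minimal Python sketch of my own. It is not Blekko's code (theirs is in Perl), and the <code>Swarm</code> class, its method names and the least-loaded placement rule are all assumptions for illustration. It models servers advertising their buckets, a location table rebuilt from those advertisements, and healing after a failure by copying under-replicated buckets back up to three copies.</p>
<pre>
```python
from collections import defaultdict

REPLICAS = 3  # the article says three copies of each bucket are kept


class Swarm:
    """Toy model: every server advertises its buckets and all peers listen."""

    def __init__(self, servers):
        # server name -> set of bucket ids it currently advertises
        self.holdings = {name: set() for name in servers}

    def locations(self):
        """Rebuild the bucket -> servers map from the current advertisements."""
        table = defaultdict(set)
        for server, buckets in self.holdings.items():
            for bucket in buckets:
                table[bucket].add(server)
        return table

    def place(self, bucket):
        """Store a new bucket on the REPLICAS least-loaded servers (assumed rule)."""
        ranked = sorted(self.holdings, key=lambda s: (len(self.holdings[s]), s))
        for server in ranked[:REPLICAS]:
            self.holdings[server].add(bucket)

    def fail(self, server):
        """A dead server simply stops advertising."""
        del self.holdings[server]

    def heal(self):
        """Copy any under-replicated bucket to servers that lack it."""
        for bucket, holders in self.locations().items():
            missing = max(REPLICAS - len(holders), 0)
            spares = sorted(
                (s for s in self.holdings if s not in holders),
                key=lambda s: (len(self.holdings[s]), s),
            )
            for server in spares[:missing]:
                self.holdings[server].add(bucket)


swarm = Swarm(["s1", "s2", "s3", "s4", "s5"])
for bucket_id in range(10):
    swarm.place(bucket_id)
swarm.fail("s1")   # no pager goes off; the peers just notice the silence
swarm.heal()
replica_counts = {b: len(h) for b, h in swarm.locations().items()}
```
</pre>
<p>After the failure and the healing pass, every bucket is back to three copies without anyone touching the hardware, which is the property that lets them skip RAID.</p>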
<p>
Values for multiple keys are held in each bucket, with a hash function used to map each key to its destination. The data store supports the familiar set of key/value primitives, along with some more sophisticated variations. These let programmers do things like only update a value if it has never been set before, or issue sets that can be dropped if a write error occurs, so that non-critical updates like server log messages don't take up valuable time in recovery.</p>
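<p>Here's a rough Python sketch of how primitives like these might look. It's my own illustration rather than Blekko's API: the bucket count, the SHA-1 hash, and the names <code>set_if_absent</code> and <code>set_best_effort</code> are all assumptions.</p>
<pre>
```python
import hashlib

NUM_BUCKETS = 4096  # hypothetical; the article only says "several thousand"


def bucket_for(key):
    """Map a key to a bucket id with a stable hash (SHA-1 is an assumption)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


class BucketStore:
    def __init__(self):
        self.buckets = {}  # bucket id -> {key: value}

    def get(self, key, default=None):
        return self.buckets.get(bucket_for(key), {}).get(key, default)

    def set(self, key, value):
        self.buckets.setdefault(bucket_for(key), {})[key] = value

    def set_if_absent(self, key, value):
        """Write only if the key has never been set; report whether we wrote."""
        bucket = self.buckets.setdefault(bucket_for(key), {})
        if key in bucket:
            return False
        bucket[key] = value
        return True

    def set_best_effort(self, key, value, writer=None):
        """Non-critical write: swallow a write error instead of recovering."""
        try:
            (writer or self.set)(key, value)
            return True
        except OSError:
            return False


store = BucketStore()
first = store.set_if_absent("url:example.com", "crawled")
second = store.set_if_absent("url:example.com", "overwritten")
```
</pre>
<p>The second call leaves the original value alone, and a log-style write that hits a disk error through <code>set_best_effort</code> is simply dropped.</p>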
<h2>A "Naked Date"</h2>
<p>Lindahl described the store's guarantees as "relaxed eventual consistency," and explained that they expect their developers to write their applications with its characteristics in mind. It will let you shoot yourself in the foot with race conditions and requires more thought at the application level, but it's worth it for the power and performance you get from the store. They're crawling over 200 million pages a day, with 3 billion in total. The refresh frequency ranges from minutes for popular news site front pages, up to 14 days for the least-visited sites.
</p>
<p>
So, that sounds like geeky fun, but how does it help the end user of the system? To answer that, Lindahl muttered something about a "naked date," which definitely caught my attention! It turns out this wasn't a proposition; he was talking about the "/date" tag you can enter into Blekko's search box:
</p>
<p>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/blekko0.png" style="" />
			</span>

<small><i><a href="http://blekko.com/ws/+/date">http://blekko.com/ws/+/date</a></i></small>
</p>
<p>
This shows a selection of Blekko's top search results as they're crawled by the system. You can sit there refreshing the page, and as Blekko crawls the Web new pages will appear. The crawler is feeding sites into the data store, they're being ranked on the fly and they're showing up in the interface, all within a couple of seconds. Even Google's <a href="http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html">Caffeine</a> doesn't offer that sort of responsiveness.</p>

<p>In technical terms, they've implemented MapReduce, but instead of a monolithic Reduce stage, they have primitives that allow simple operations like incrementing or merging data structures to be applied incrementally to build up results over time, with the intermediate results available for reading continuously. You can see this in action by refreshing the small <a href="http://blekko.com/ws/http:%2F%2Fwww.retroist.com%2F2010%2F12%2F04%2Fsaturday-supercade-battlestar-galactica%2F+/sitepages">SEO link</a> that's visible for every result, which shows all of the data that goes into their ranking calculations as it updates, from inbound links to content statistics and a complete site map.
</p>
<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/blekko1.png" style="" />
			</span>
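<p>The incremental-reduce idea can be sketched in a few lines of Python. This is my own toy version, not their implementation: the <code>IncrementalTable</code> class, the key prefixes and the page format are assumptions. The map step emits small operations, and each one is folded into a running result that can be read at any moment.</p>
<pre>
```python
from collections import Counter


class IncrementalTable:
    """Running aggregates that stay readable between updates."""

    def __init__(self):
        self.counts = Counter()  # key -> running total
        self.sets = {}           # key -> merged set of values

    def increment(self, key, amount=1):
        self.counts[key] += amount

    def merge(self, key, values):
        self.sets.setdefault(key, set()).update(values)


def map_page(page):
    """Emit (operation, key, payload) tuples for one crawled page."""
    for target in page["links"]:
        yield ("increment", "inlinks:" + target, 1)
        yield ("merge", "linkers:" + target, {page["url"]})


table = IncrementalTable()
crawl = [
    {"url": "a.com", "links": ["c.com"]},
    {"url": "b.com", "links": ["c.com", "a.com"]},
]
snapshots = []
for page in crawl:
    for op, key, payload in map_page(page):
        getattr(table, op)(key, payload)
    # Intermediate results are readable after every page, not only at the end.
    snapshots.append(table.counts["inlinks:c.com"])
```
</pre>
<p>There's no separate Reduce pass here: the inbound-link count for a site is usable as soon as the first page linking to it has been crawled, and it just keeps improving.</p>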

<p>
What's the secret of their programming success? Perl! Even though they're the only NoSQL solution written in the language, they're extremely happy with their choice, largely because of the rich and stable set of modules available on CPAN, with over 200 of them on every machine. Each server is running CentOS, and because there are no special roles they can each be configured identically. They're not using any virtualization, so to make machine creation simple Lindahl has rolled his own configuration system to install everything they need.
</p>
<p>
I asked them if there'd been any real performance surprises they could share, and they came up with a couple of corkers. First, they'd discovered that writing to a disk had a dramatic effect on seek times. Even in the normal case it can take 50ms for a disk head to move to a new position to access data, but when the drive was writing information, that same operation could be delayed by up to 500ms. A few half-second delays like that quickly add up and can ruin a user's experience with a site, so they have write daemons sitting on each machine that write in bursts according to a schedule, and other machines know to avoid reading from the server at those times. 
</p>
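<p>They didn't go into how the schedule itself works, so treat the following Python sketch as one possible shape for it rather than a description of their daemons. The cycle length, the window length and the staggering rule are all made-up values. The point is that if every machine can compute every other machine's write window, a reader can steer around a replica that is busy flushing.</p>
<pre>
```python
CYCLE_SECONDS = 60   # hypothetical length of the repeating schedule
WINDOW_SECONDS = 5   # hypothetical length of each write burst


def write_window_start(server_index):
    """Stagger windows so neighboring servers flush at different times."""
    return (server_index * WINDOW_SECONDS) % CYCLE_SECONDS


def is_writing(server_index, now):
    """True while the server's write daemon is flushing to disk."""
    offset = (now - write_window_start(server_index)) % CYCLE_SECONDS
    return WINDOW_SECONDS > offset


def pick_replica(replica_indexes, now):
    """Prefer a replica that is not in the middle of its write burst."""
    for index in replica_indexes:
        if not is_writing(index, now):
            return index
    return replica_indexes[0]  # all busy: accept a slow read rather than fail


# Server 0 writes during seconds 0-4 of each minute, server 1 during 5-9.
early_choice = pick_replica([0, 1, 2], now=2)
later_choice = pick_replica([0, 1, 2], now=7)
```
</pre>
<p>With three copies of every bucket, there's nearly always a replica outside its write window, so reads rarely have to wait out a half-second seek.</p>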
<p>
The real shocker was the <em><strong>strftime()</strong></em> C function's bad behavior. They were tracking down an intermittent performance problem and discovered that it would sometimes access up to 50 files from disk, shoving a stick in the spokes of any application that relied on fast response times thanks to the unexpected disk seeks this causes. It turns out that the function will load information from locale files to help with its formatting job, and even worse it will periodically recheck the files to see if they've changed. This may not sound like much, but for a programmer it's as unexpected as discovering your grandmother moonlighting as a nightclub bouncer.
</p>
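<p>They didn't tell me how they worked around it, but one common way to take a formatting call like this off the hot path is to cache its output. The Python sketch below is purely my own assumption, not Blekko's fix: it calls <code>time.strftime()</code> (which in CPython wraps the C library function) at most once per second and reuses the string for every other request in that second.</p>
<pre>
```python
import time


class CachedFormatter:
    """Format UTC timestamps, calling strftime at most once per second."""

    def __init__(self, fmt="%Y-%m-%d %H:%M:%S"):
        self.fmt = fmt
        self.last_second = None
        self.last_text = None
        self.calls = 0  # number of real strftime calls, kept for illustration

    def format(self, timestamp):
        second = int(timestamp)
        if second != self.last_second:
            # Only here do we pay for the underlying strftime call.
            self.last_text = time.strftime(self.fmt, time.gmtime(second))
            self.last_second = second
            self.calls += 1
        return self.last_text


formatter = CachedFormatter()
# Thirty requests spread over three seconds of wall-clock time.
stamps = [formatter.format(1291975200 + i / 10) for i in range(30)]
```
</pre>
<p>Thirty requests trigger only three real formatting calls, one per distinct second, so any locale file access happens far less often.</p>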
<p>
Blekko has built a very sexy system for processing massive data sets in a very dynamic way, and talking to Skrenta and Lindahl left me excited to see what they'll be able to build on it next. The flexibility of their platform should let them keep producing innovative features nobody else can match. They also wanted me to let you know that if this sort of stuff is your cup of tea, <a href="http://blekko.com/ws/+/blekkojobs">they're hiring</a>!
</p>

                    ]]></description>
                <link>http://readwrite.com/2010/12/10/the-secrets-behind-blekkos-search-technology</link>
                <guid>http://readwrite.com/2010/12/10/the-secrets-behind-blekkos-search-technology</guid>
                <category>Interviews</category>
                <pubDate>Fri, 10 Dec 2010 02:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
                    <item>
                <title><![CDATA[How Hunch Built a Data-Crunching Monster]]></title>
                <description><![CDATA[
                                        <p><span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/hunch0.png" style="" />
			</span>
<a href="http://www.hunch.com/">Hunch</a> has really interesting problems. They collect a lot of data from a lot of users, and once someone creates a profile they need to quickly deliver useful recommendations across a wide range of topics. This means running a sophisticated analysis on a massive data set, all to a strict deadline. Nobody else is doing anything this ambitious with recommendation engines, so I sat down with their co-founder and CTO <a href="http://mattgattis.com/">Matt Gattis</a> to find out how they pulled it off.  
</p>
<p>
The first thing he brought up was hardware costs, casually mentioning that they'd looked into getting <a href="http://www.dell.com/us/en/enterprise/servers/poweredge-r910/pd.aspx?refid=poweredge-r910&s=biz&cs=555">a server with one terabyte of RAM from Dell</a>! That immediately piqued my interest, because the Google-popularized trend has been towards throwing an army of cheap commodity servers at big data problems, rather than scaling vertically with a single monstrously powerful machine. It turns out their whole approach is based around parallelism within a single box, and they had some interesting reasons for making that choice. 
</p>
<p>
They'd evaluated more conventional technologies like <a href="http://hadoop.apache.org/">Hadoop</a>, but the key requirement they couldn't achieve in their tests was low latency. They're running on a graph with over 30 billion edges, with multiple iterations to spread nodes' influence to distant neighbors and achieve a steady state, a bit like PageRank. This has to be extremely responsive to new users inputting their information, so they have to re-run the calculations frequently, and none of the systems they looked at could deliver the results at a speed that was acceptable.
</p>
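<p>For readers who haven't seen this kind of calculation, here's a small Python sketch of a PageRank-style iteration. It is not TasteGraph's algorithm (that's written in C, and Matt didn't share the details); the damping factor, the tolerance and the handling of nodes with no outgoing edges are textbook assumptions. Each pass pushes every node's score along its edges, and the loop repeats until the scores stop changing.</p>
<pre>
```python
def propagate(edges, damping=0.85, tolerance=1e-9, max_iterations=200):
    """Power iteration: spread each node's score along its outgoing edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iterations):
        base = (1.0 - damping) / len(nodes)
        updated = {node: base for node in nodes}
        for source in nodes:
            # A node with no outgoing edges spreads its score evenly.
            targets = out_links[source] or nodes
            share = damping * score[source] / len(targets)
            for target in targets:
                updated[target] += share
        change = sum(abs(updated[node] - score[node]) for node in nodes)
        score = updated
        if tolerance > change:
            break  # steady state reached
    return score


scores = propagate([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
```
</pre>
<p>Every pass touches every edge, and many passes are needed before the scores settle. At 30 billion edges that means an enormous amount of data shuffled between parallel workers on each pass, which is why the cost of moving it matters so much.</p>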
<p>
They determined that the key bottleneck was network bandwidth, which led them towards housing all of their data processing within a single machine. It's much faster to share information across an internal system bus than to send it across even a fast network, so with their need for frequent communication between the parallel tasks, a monster server made sense. As it happens they decided against the $100,000 one terabyte server, and went for one with a still-impressive 256 GB of RAM, 48 cores and SSD drives.<span class="embedded-Media-image img-caption-c">
				<img src="http://readwrite.com/files/files/files/hack/hunch1.jpg" style="" />
			</span>
</p>
<p>
The other part of the puzzle was the software they needed to actually implement the processing. They looked at a series of open-source graph databases, but ran into problems with all of them when they tried scaling up to networks with 30 billion edges. Continuing their contrarian approach, they wrote their own engine from the ground up in C, internally codenamed TasteGraph. The system caches the entire graph in memory, with rolling processes re-running the graph calculations repeatedly, and the end results cached on multiple external machines. They have even recoded some of their inner loops in assembler, since they spend a lot of their cycles running calculations on large matrices and even the specialized linear algebra libraries they use don't deliver the performance they need.
</p>
<p>
Even with their software and hardware architecture in place, there were still obstacles to overcome. Their monster server uses CentOS Linux, but very few people are running memory-intensive applications on machines with so much RAM, so they ran into performance problems. For example, by default the kernel will start paging out to disk once the memory is about 60% full, which left them with only about 150 GB of their 256 GB of RAM available before swapping kicked in and performance cratered. There's not much documentation available around these kernel parameters, so the team ended up scouring the kernel source to understand how they worked before they could produce a set of values hand-tuned for TasteGraph's needs.
</p>
<p>
When Matt first told me about his design decisions, I have to admit I was surprised that he was apparently swimming against the tide by working within a single uber-machine rather than using an army of dumb boxes, but as he explained their requirements it all started to make sense. With more and more companies facing similar latency issues, I wonder if the pendulum is swinging back towards parallelism across a system bus rather than a network?
</p>
                    ]]></description>
                <link>http://readwrite.com/2010/12/08/how-hunch-built-a-data-crunchi</link>
                <guid>http://readwrite.com/2010/12/08/how-hunch-built-a-data-crunchi</guid>
                <category>Interviews</category>
                <pubDate>Wed, 08 Dec 2010 02:00:00 -0800</pubDate>
                <author>Pete Warden</author>
            </item>
            </channel>
</rss>

