Technical Q&A With FAROO Founder

About 18 months ago, we wrote about an obscure search startup from Germany called FAROO. We believed that its radical alternative, using peer-to-peer (P2P) technology, had a shot at being a truly disruptive force. Since then, it has made some progress, raised some money and started getting out into the market. (Disclosure: FAROO is currently a ReadWriteWeb sponsor.)

FAROO is wisely underplaying P2P in its marketing, preferring more fashionable terms such as “real-time search” and “social discovery.” But the P2P technology drives it.

So, we decided to invite someone who understands P2P at a technical level to interview Wolf Garbe, FAROO’s founder. Our tech expert, Kiril Pertsev, of Agily Networks, has already written about P2P for us in the past.

Kiril: Why .NET? Did you already have development resources or did you make this choice because you consider it a better option for networked desktop applications? Would you make this choice again? And if you’re not satisfied with .NET, what would your platform of choice be, given all of your experience over the past few years?

Wolf: I come from Delphi (Object Pascal). So, the choice of C#/.NET was a dedicated decision for a new platform, not driven by legacy. When I started to work on the first prototype in 2004, Delphi moved towards .NET. I preferred to go with the original, especially because the development of C# was led by Anders Hejlsberg, the designer of Borland’s Turbo Pascal (which Delphi derived from).

Of course, I also looked into Java, which I found quite similar, both from the language perspective (C# vs. Java) and the JIT Runtime environment (Java Virtual Machine vs. .NET Runtime). The decision for C# was based on the dominating desktop market share of Windows and the assumption that embedding the .NET framework into the OS would ensure fast penetration of .NET. This only partially came true, partly due to the limited success of Vista, which was the first Windows version with .NET pre-installed.

Kiril: Doesn’t this choice hinder your ability to move to Mac and Linux platforms?

Wolf: We were betting on Mono for platform compatibility. Unfortunately, Mac OS X still has no Mono application launcher, other than starting with the terminal, which is not feasible for a mass market. With the increasing importance of the Mac OS X platform, I expect this to change. Silverlight today is already natively available for Mac.

For the ultimate platform independence, we are also continually observing the diverse RIA developments (AJAX, AIR, Silverlight, Mozilla Prism, HTML 5 persistent web storage, Mozilla’s DOM storage, Google Gears and Flash persistent storage), which could one day allow us to remove the download and installation step for P2P. But so far, no solution meets all of the requirements: out-of-browser capability, permanent background operation, auto-start option, tray icon support, cross-domain connection support, persistent storage, accepting an incoming connection and receiving data and NAT traversal.

Kiril: If you become dissatisfied with .NET, what would be your next platform of choice?

Wolf: Although not everything went as expected, I still believe that .NET is a very powerful platform, and C# as a language is evolving at a much faster and broader pace than Java.

Today, we have a good .NET penetration rate in the US and Europe. With Windows 7, I expect that to increase in Asia as well.

Kiril: I see that you’re using a pretty simple P2P communication technology instead of sophisticated Hamachi-like NAT traversal using UDP hole punching.

Wolf: I suppose you are referring to the transport layer, which is HTTP over TCP/IP. The real P2P overlay protocol on top of that is not that simple anymore.

Because our distributed search engine architecture breaks with almost all legacy paradigms, we thought it a good idea to base it, wherever possible, on proven and widely used standards. There are several reasons for this:

It reduces complexity and development time.
It improves compatibility (there is probably no protocol more widely used than HTTP over Port 80).
It’s unlikely that this connectivity will break anytime soon by changes in protocols, OS, drivers or hardware.
Behaving like a standard browser from the protocol view makes the application less vulnerable to filtering, blocking or traffic shaping and ensures that it even works in most corporate environments.

NAT traversal is the most critical issue for every P2P application. It’s really a shame that although the Internet is built on a distributed foundation, end-to-end connectivity between users in a decentralized way is completely broken. We are using several NAT traversal techniques: manual port forwarding, automatic port forwarding via UPnP, and Teredo. Teredo is an IPv6 tunneling technology, standardized in RFC 4380.

Teredo is part of Windows XP, Vista, and Windows 7; with Miredo, there is also an open-source implementation available for Linux and Mac OS X. Microsoft reports that with Teredo, the chance of a connection between two peers increases from 15% to 84%. Our observations are somewhere between 60% and 70%.

Teredo is quite sophisticated technology and is a more universal approach. It provides connectivity at the OS level, in contrast to having several applications in use, where each uses its own proprietary traversal technology.
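To make the Teredo mechanism a little more concrete: under RFC 4380, a Teredo IPv6 address (prefix 2001::/32) embeds the IPv4 address of the Teredo server plus the obfuscated public address and UDP port of the client's NAT mapping. The sketch below is an illustration only, not FAROO code. It decodes the example address from the RFC with Python's standard `ipaddress` module; the helper name `decode_teredo` is hypothetical.

```python
import ipaddress

def decode_teredo(addr: str):
    """Return (server, client, port) embedded in a Teredo address, or None."""
    ip = ipaddress.IPv6Address(addr)
    pair = ip.teredo  # (server IPv4, client IPv4), or None outside 2001::/32
    if pair is None:
        return None
    server, client = pair
    # Bits 32..47 (counted from the low end) hold the mapped UDP port,
    # stored XORed with 0xFFFF.
    port = ((int(ip) >> 32) & 0xFFFF) ^ 0xFFFF
    return str(server), str(client), port

# Example address taken from RFC 4380.
print(decode_teredo("2001:0:4136:e378:8000:63bf:3fff:fdd2"))
```

Because the mapped address and port are carried inside the IPv6 address itself, any application on the host can reach a peer behind a NAT without implementing its own traversal logic.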

Kiril: Could you please elaborate on your choice of network technology, now that you have a substantial number of users and are collecting usage statistics? Do you know how many active and passive peers you have at any given time? What is the ratio?

Wolf: We have solid insight into the state of our P2P network. We know the number of active and passive peers on any given day (using the log from our update server). The active peer ratio is between 60 and 70%.

We are also currently working on an improved distributed intraday statistic. (The distributed statistic currently built into the P2P client is no longer valid at the increased network size. For scalability, every peer has only a limited view of the whole network, which requires more advanced methods for calculating the actual network size.)
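One common way to estimate network size from a limited view is to measure how densely peer IDs are packed around the observer. The sketch below illustrates that general idea. It is not FAROO's method, which the interview does not describe. It assumes peer IDs are spread uniformly over a circular 32-bit ID space; `ID_SPACE` and `estimate_network_size` are hypothetical names.

```python
import random

ID_SPACE = 2 ** 32  # assumed size of the circular peer-ID space

def estimate_network_size(own_id, known_ids, k=16):
    """Estimate total peers from the clockwise distance to the k-th known neighbor."""
    distances = sorted((pid - own_id) % ID_SPACE for pid in known_ids if pid != own_id)
    if not distances:
        return 1  # the peer knows only itself
    k = min(k, len(distances))
    # With uniform IDs, k neighbors within distance d imply a density of k / d.
    return round(k * ID_SPACE / distances[k - 1])

# Simulated network: 10,000 peers, but the observer knows only its 16 nearest.
rng = random.Random(1)
all_ids = rng.sample(range(ID_SPACE), 10_000)
own = all_ids[0]
view = sorted(all_ids[1:], key=lambda pid: (pid - own) % ID_SPACE)[:16]
print(estimate_network_size(own, view))
```

A single peer's estimate is noisy (the error shrinks roughly with the square root of `k`), so a practical system would likely average estimates from many peers.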

Kiril: Your search index essentially is a distributed storage system with DHT addressing, right?

Wolf: Yes.
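For readers unfamiliar with DHT addressing, here is a minimal sketch of the general idea. It is not FAROO's actual protocol. A key (here, a search term) is hashed into the same ID space as the peers, and the entry is stored on the peers whose IDs follow that hash on a ring. The 32-bit ID space, the SHA-1 hash, the replica count of 3 and all function names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

def key_id(text, bits=32):
    """Hash a key into a `bits`-wide ID space (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)

def successors(key, peer_ids, replicas=3):
    """Return the `replicas` peers whose IDs follow `key` on the ring."""
    ring = sorted(peer_ids)
    start = bisect_left(ring, key)
    count = min(replicas, len(ring))
    # The modulo wraps around the end of the ring back to the smallest ID.
    return [ring[(start + i) % len(ring)] for i in range(count)]

def responsible_peers(term, peer_ids, replicas=3):
    """Peers that would store the index entry for `term`."""
    return successors(key_id(term), peer_ids, replicas)

peers = list(range(0, 2 ** 32, 2 ** 32 // 64))  # 64 evenly spaced peer IDs
print(responsible_peers("tiger", peers))
```

Any peer can compute the same result locally, so a lookup needs no central directory: the querying peer hashes the term and contacts the responsible peers directly.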

Kiril: Have you thought about other uses of such technology, beyond search: back-up, private distributed storage, file-sharing, etc. (like Wuala)?

Wolf: In a publication from 2001 (in German), in which I also outlined the idea of a peer-to-peer search engine, this was part of an integrated solution with a P2P Web server, P2P file-sharing and a P2P anonymizer. Due to various legal copyright issues, we are currently not looking into file-sharing.

But from a technological standpoint, a distributed storage system is quite universal, from storing a search engine index to attention data, Web pages, instant messages, social network profiles, micro-blogging messages, back-ups and files.

Kiril: Could you please share your vision of the mythical “P2P operating system,” now that we already have P2P networking, P2P processing, P2P storage and P2P applications (like search)?

Wolf: P2P and distributed architectures are a universal principle that the whole Internet is built upon. Unfortunately, distributed technologies like Mail, IRC, Usenet and even independent Web servers are being increasingly replaced by centralized solutions (the cloud, Google Wave, etc.). Despite the obvious short-term convenience, this leads to long-term monopolies and dependencies and makes the Internet infrastructure more vulnerable in terms of reliability and political influence.

I believe that a solid, standardized P2P stack integrated in the operating system can fix the broken end-to-end connectivity between users, enabling the use of an endless amount of latent storage, memory, processor cycles and bandwidth.
Distributed storage is certainly a core component, as is distributed processing to make more sense of all of the data.
On top of this, there should be a distributed programming framework, which enables the development of distributed applications and the distribution and aggregations of tasks in a standardized manner (e.g. the distributed version of MapReduce/Hadoop is part of this).
A distributed attention data repository, shared by all applications, but under full user control.
There should be resource management that puts the user in full control of the amount of resources she or he would like to dedicate to a particular distributed project—possibly combined with a ratio system and/or virtual currency to maintain a healthy usage to contribution ratio.
Distributed identity management and authentication, authorization and access control.
This could replace most of the centralized cloud solutions by delivering the same convenience and scalability in a decentralized way.

BOINC (the universal distributed processing platform where seti@home runs today) goes partially in that direction.

Only partially, because the peers contribute by taking tasks from a centralized server and returning results to that server. The system is not fully distributed, nor are the results intended to be used by the peers themselves.

Kiril: Do you encounter scalability issues? Do you have any single point of failure resources in your P2P network? How reliable is it—meaning, what percentage of the network could you lose without seriously degrading search quality and performance?

Wolf: We have scaled the P2P network in a controlled way. While we have made some scalability-related adjustments to our P2P protocol, the core algorithms proved that there are no inherent scalability limits. Due to our fully distributed architecture, we have no single point of failure.

We have twenty-fold redundancy of each item, which replicates automatically if peers leave the network. Only if all 20 copies of the item are removed at the same time would this piece of information be lost. This leads to a mean information lifetime of 120 years under realistic churn (i.e. the peers randomly joining and abandoning the network temporarily or permanently). This is more than sufficient for search, where 50% of the information changes during the year (and is therefore refreshed anyway at a much higher rate).
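The interview does not say how the lifetime figure was derived. As a rough illustration of why redundancy helps so much, consider a toy model with made-up parameters: lost replicas are restored once per repair window, and each replica holder independently leaves within a window with probability `p_leave`. An item is lost only if all replicas vanish in the same window, which happens with probability `p_leave ** replicas`. The function name and the example values (a one-hour window, a 50% departure probability) are assumptions, not FAROO's numbers.

```python
HOURS_PER_YEAR = 24 * 365.25

def mean_lifetime_years(replicas, p_leave, window_hours):
    """Expected time until all replicas are lost within a single repair window."""
    p_loss = p_leave ** replicas      # all copies vanish in the same window
    expected_windows = 1.0 / p_loss   # mean of a geometric distribution
    return expected_windows * window_hours / HOURS_PER_YEAR

for copies in (5, 10, 20):
    years = mean_lifetime_years(copies, p_leave=0.5, window_hours=1.0)
    print(f"{copies:2d} replicas: {years:.4g} years")
```

In this model each additional replica multiplies the expected lifetime by `1 / p_leave`, so lifetime grows exponentially with the replica count while storage cost grows only linearly.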

Kiril: Do you think that “mobile P2P” is feasible? What would you say about implementing P2P search (or any other application) on, say, the iPhone? Are mobile terminals ready for P2P? Are cell networks ready? Do they have enough CPU power, etc?

Wolf: Today, we distinguish between mobile connectivity and landlines. But I believe this separation will fade away. Device performance, bandwidth and flat-rate pricing structures will become close. While today, processor cycles, memory and bandwidth in mobile phones are too precious for wide use of P2P applications, this will change. Even “walling off” tendencies and restrictive App Store policies will be liberated by regulation or user demand.

But much more interesting than bringing file-sharing to the iPhone will be P2P applications that use mobility, possibly combined with GPS, distributed camera/augmented reality and RFID. This will bring P2P technology into completely new application fields. Think of distributed traffic control (peers could be users with iPhones in cars or the cars themselves) or applications to lead crowds of people at large public events or in disaster zones, as well as gaming, distributed weather and earthquake prognosis.

Bluetooth could even make this independent of cell networks. Global communication between peers would be asynchronous through moving people. Also, cell network and Bluetooth mashups would be possible.

In the near future, we will provide Web access to our P2P Web search for mobile users. They will just be passive users of the resources contributed by active PC users.

Kiril: What is your vision of the P2P road map? Apparently, the first “killer P2P application” was file-sharing, Kazaa, then BitTorrent. Given that the next one is search, what would be the next after that?

Wolf: As I mentioned, instead of another isolated P2P application, I would like to see P2P built into the OS and Internet stack in a standardized manner, so that an application can benefit from P2P without any specific effort, in the same easy and natural way that applications today use the Internet (HTTP, AJAX and JSON).

Then, P2P technology would become ubiquitous and part of almost every application. Every application that uses cloud services today could benefit from such P2P technology. An example would be a distributed platform for micro-blogging services and social networks, heralding the end of walled gardens.

But my personal vision is to combine P2P with the next thing after search. Twenty years ago, I wrote a small expert system on my C64 (today, a C64 emulator is on the iPhone!), using Predicate Logic and an Eliza-style natural-language interface. So, you could tell the system, “All cats have claws. All tigers are cats.” And then you could ask the system, “Do tigers have claws?” And it would answer, “Yes!” You could retrieve information and relationships that were not explicitly stored (or that anyone was even aware existed).
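The cats-and-tigers example can be reproduced in a few lines. The following is a modern sketch of that kind of inference, not a reconstruction of the original C64 program. It understands only the two sentence patterns quoted above ("All X are Y" and "All X have Y") and one question pattern; the class name `TinyReasoner` is hypothetical.

```python
import re
from collections import defaultdict

class TinyReasoner:
    def __init__(self):
        self.is_a = defaultdict(set)  # "tigers" -> {"cats"}
        self.has = defaultdict(set)   # "cats"   -> {"claws"}

    def tell(self, sentence):
        """Store a rule of the form 'All X are Y.' or 'All X have Y.'"""
        m = re.fullmatch(r"All (\w+) (are|have) (\w+)\.?", sentence.strip())
        if not m:
            raise ValueError(f"cannot parse: {sentence!r}")
        subject, verb, obj = m.groups()
        table = self.is_a if verb == "are" else self.has
        table[subject.lower()].add(obj.lower())

    def _kinds(self, kind):
        """The kind itself plus everything it is transitively a member of."""
        seen, stack = set(), [kind]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.is_a.get(current, ()))
        return seen

    def ask(self, question):
        """Answer 'Do X have Y?' by following the is-a chain upward."""
        m = re.fullmatch(r"Do (\w+) have (\w+)\??", question.strip())
        if not m:
            raise ValueError(f"cannot parse: {question!r}")
        subject, prop = (g.lower() for g in m.groups())
        return any(prop in self.has.get(k, ()) for k in self._kinds(subject))

r = TinyReasoner()
r.tell("All cats have claws.")
r.tell("All tigers are cats.")
print("Yes!" if r.ask("Do tigers have claws?") else "I don't know.")  # prints "Yes!"
```

The answer is never stored explicitly. It follows from chaining the two rules, which is the point of the anecdote.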

At that time I had to enter every bit of information manually. Today, almost all information on earth is accessible on the Internet, together with comments and conversation streams.

Predicate logic would be supported by fuzzy logic, statistical machine learning and more. Today, known translations are used to translate untranslated text. But this could be much more universal: using known connections in one field to explain unknown correlations in another. Such a system could autonomously formulate queries, combine facts, fill in the missing link in a theory to prove or falsify it.

So, I think the next step after search will be reasoning; and in combination with P2P technology and distributed processing, this may bring us a kind of global brain. A brain that not only stores and retrieves information but that is capable of cognition and conclusion at a giant scale. It will discover hidden correlations and unknown facts and will answer questions with answers that cannot be found in any document.

You can see, this is much more HAL than the next Google.

Kiril: What technical feature of your technology are you most proud of? 100+ patents must mean something.

Wolf: Most crucial has been to ensure both a quick response time and complete results for queries with multiple terms and phrases (only 15% of searches are single keywords) in a completely distributed P2P architecture. For queries with multiple keywords, we eliminated the need for the intersection of huge posting lists across different peers.
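To see why this matters, consider the conventional baseline in a term-partitioned index, where each keyword's posting list (the IDs of documents containing it) lives on a different peer. A multi-keyword AND query must then ship those lists across the network and intersect them. The sketch below shows only that baseline; the interview does not describe how FAROO avoids it. The function names and the tiny index are hypothetical.

```python
def intersect_postings(lists):
    """Intersect posting lists, starting from the shortest to keep results small."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
    return sorted(result)

def and_query(terms, term_index):
    """Return (matching doc IDs, number of postings that had to be transferred)."""
    lists = [term_index.get(term, []) for term in terms]
    transferred = sum(len(postings) for postings in lists)
    return intersect_postings(lists), transferred

# Each term's list is imagined to be stored on a different peer.
index = {
    "p2p": [1, 2, 3, 5, 8],
    "search": [2, 3, 4, 8, 9],
    "engine": [3, 8, 13],
}
print(and_query(["p2p", "search", "engine"], index))  # ([3, 8], 13)
```

Here 13 postings cross the network to produce 2 results. For common words on a Web-scale index the lists hold millions of entries, which is why avoiding the cross-peer intersection is central to response time.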

While we had to invent a lot of things—just because they hadn’t been done before in a way that was required for distributed search—they are not all patented (so, we don’t own 100+ patents). Prior to funding, this would have been impossible financially.

Kiril: How is your real-time search related to the P2P search? Does it also run on a distributed network? If so, then how do peers communicate results to the front page of the FAROO website?

Wolf: Currently, we use a hybrid architecture. While we are building up our P2P network and use it for general Web search, in parallel we use a central index for the real-time data. The focus on the most recent and popular Web pages keeps the costs moderate. Attention data collected by FAROO peers serves also for real-time discovery and ranking (in addition to analyzing the Twitter stream).

But we believe in a holistic approach. Our real-time search will evolve into an integral part of Web search and be fully based on our P2P architecture. There will still be a gateway/proxy server that enables Web access to our P2P network for those users not able or ready to install a P2P client (e.g. for mobile).