The Art Behind Building Recommendation Engines with Big Data

We live in a “long tail” world. That means that mass-market products are no longer able to satisfy consumers who ask for tailor-made solutions. This trend has emerged from online retail, specifically Jeff Bezos’ idea of having a million different shops for a million different consumers. In essence, it all boils down to building the perfect recommendation engine. There are various methods to create such a tool, which we will discuss shortly, but all of them have a common denominator: big data.

“Hey, you can have any color you want — as long as it’s black.” –Henry Ford

The days of Henry Ford and “any color you want as long as it is black” are long gone. Other things that are going out of fashion fast are website filters and asking questions to narrow down options. Modern customers hope to get exactly what they dream about presented to them as soon as they open a website. On the home page slider, if possible.

This trend is also evident in the home entertainment sector. Just imagine if you had to sort through thousands of movies before finding one you like. Not the most successful business model, is it? A movie recommendation engine, like the one powering Netflix and other similar streaming services, can instead take hints from what you have previously selected and suggest what you might want to watch next.

How are “they” making a selection for you?

There are three ways to build a great recommendation engine, each with a different approach to solving the same problem.

Clustering recommendation engines.

To understand how clustering recommendation engines work, think about product packages or product layouts on store shelves. Clustering looks at what products are used for and recommends items that are complementary. For example, a clustering engine would show you toothpaste if you have already added a toothbrush to your cart.

These engines don’t consider your customers’ particular preferences or what other users have bought, so you could build an engine like this even without the help of big data, just using logic and common sense in addition to simple out-of-the-box tools. The only problem with building a recommendation engine in this manner is that it becomes almost unmanageable if you have hundreds or thousands of products.

With many thousands of products, clustering them by hand would take too much effort, so an algorithm comes in handy. Big data can help by making the necessary associations automatically.
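As a rough illustration of making those associations automatically, here is a minimal sketch that groups products by shared function tags and recommends the items with the most tags in common. The catalog, the tag names, and the helper functions are all hypothetical; a real system would derive such groupings from far richer product data.

```python
from collections import Counter, defaultdict

# Hypothetical catalog: each product is described by tags for what it is used for.
CATALOG = {
    "toothbrush": {"oral-care", "daily-hygiene"},
    "toothpaste": {"oral-care", "daily-hygiene"},
    "dental floss": {"oral-care"},
    "shampoo": {"hair-care", "daily-hygiene"},
    "conditioner": {"hair-care"},
}


def build_tag_index(catalog):
    """Map every function tag to the set of products that carry it."""
    index = defaultdict(set)
    for product, tags in catalog.items():
        for tag in tags:
            index[tag].add(product)
    return index


def complementary_items(catalog, index, product, top_n=3):
    """Rank other products by how many function tags they share with `product`."""
    shared = Counter()
    for tag in catalog[product]:
        for other in index[tag]:
            if other != product:
                shared[other] += 1
    # Most shared tags first; ties are broken alphabetically for stable output.
    ranked = sorted(shared.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top_n]]


tag_index = build_tag_index(CATALOG)
print(complementary_items(CATALOG, tag_index, "toothbrush"))
```

Note that nothing here depends on who the shopper is: the suggestion for a toothbrush is the same for every customer, which is exactly the limitation described above.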

Content-based recommendation engines.

The next way to get an effective recommendation engine is to start with what customers already like. Coming back to the Netflix example, if a viewer has already watched two Lord of the Rings movies, they will most likely be interested in the third as well.

Here, big data is more useful, as the algorithm gathers numerous data points and computes how closely related they are. For example, it looks at the movie genre, the actors, the director, the soundtrack and even filming locations. Next, it scans the database to find items that are similar to what it has discovered. This type of recommendation engine takes into consideration a customer’s personal history of interacting with the service and makes truly contextual suggestions.
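To make the idea concrete, here is a minimal content-based sketch. It builds a taste profile from the features of movies a viewer has watched and ranks unseen titles by cosine similarity to that profile. The small catalog, the choice of features, and the function names are illustrative assumptions; cosine similarity is one common choice of measure, not necessarily what any particular service uses.

```python
import math
from collections import Counter

# Illustrative catalog: each movie is described by a few features (genre, director, lead actor).
MOVIES = {
    "The Fellowship of the Ring": ["fantasy", "adventure", "Peter Jackson", "Elijah Wood"],
    "The Two Towers": ["fantasy", "adventure", "Peter Jackson", "Elijah Wood"],
    "The Return of the King": ["fantasy", "adventure", "Peter Jackson", "Elijah Wood"],
    "The Hobbit: An Unexpected Journey": ["fantasy", "adventure", "Peter Jackson", "Martin Freeman"],
    "Notting Hill": ["romance", "comedy", "Roger Michell", "Hugh Grant"],
}


def cosine(a, b):
    """Cosine similarity between two feature-count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(watched, movies, top_n=2):
    """Rank unseen movies by similarity to the combined features of watched ones."""
    profile = Counter()
    for title in watched:
        profile.update(movies[title])
    scores = {
        title: cosine(profile, Counter(features))
        for title, features in movies.items()
        if title not in watched
    }
    return sorted(scores, key=lambda title: (-scores[title], title))[:top_n]


watched = ["The Fellowship of the Ring", "The Two Towers"]
print(recommend(watched, MOVIES))
```

In this toy setup every feature counts equally. Deciding how much weight the genre should get relative to the lead actor is a separate design question.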

Collaborative recommendation engines.

What if you have just installed Netflix or are on the Amazon page for the first time? Your recommendation feed is not empty. In fact, you might see some good ideas right from the beginning. These are offered based on the preferences of existing users.

Once you start using the platform and the algorithm learns more about you, the recommendations will get better because you are automatically assigned to a cluster with similar customers.

The advantage of collaborative engines is that they can be used for predictions based on customers’ real-life preferences. The downside is that they work on the assumption that if similar users liked similar things in the past, they will continue to do so in the future, which does not always hold.
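A minimal user-based collaborative filtering sketch might look like the following. It measures how similar two users are from the items both have rated, then scores unseen items by the ratings of similar users. The ratings table, user names, and scoring rule (sum of similarity times rating) are assumptions made for illustration.

```python
import math

# Hypothetical ratings on a 1-5 scale.
RATINGS = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob": {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "carol": {"Movie A": 1, "Movie C": 5, "Movie E": 4},
}


def similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in common)
    norm_a = math.sqrt(sum(ratings_a[item] ** 2 for item in common))
    norm_b = math.sqrt(sum(ratings_b[item] ** 2 for item in common))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=2):
    """Score items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


print(recommend("alice", RATINGS))
```

Here alice's tastes are much closer to bob's than to carol's, so the movie bob liked ranks above the one carol liked, even though both received the same rating.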

Steps to build a recommendation engine.

Before you can select any of the previously discussed methods, you need data to feed the algorithm. Since any big data endeavor is subject to ‘garbage in, garbage out,’ you also need to make sure that the data you have is high-quality and genuine.

The first step is to collect the right data. The challenge here is that the most useful information is implicit, coming from a user’s behavior. Although collecting data in online logs is straightforward, filtering out just the right information is almost an art. The difficulty is assigning the proper importance to each item. For example, in the case of a movie recommendation engine, is the genre or the main actor more critical? Depending on the viewer, the answers might be very different.
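One simple way to turn implicit behavior into usable signals is to give each type of logged event a weight and add the weights up per user and item. The event names and weights below are purely hypothetical; in practice they would need tuning, which is where the "art" comes in.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals of interest get larger values.
EVENT_WEIGHTS = {
    "viewed_details": 1.0,
    "added_to_watchlist": 3.0,
    "watched_to_end": 5.0,
}


def implicit_scores(events, weights=EVENT_WEIGHTS):
    """Sum event weights per (user, item) pair; unknown event types count as zero."""
    scores = defaultdict(float)
    for user, item, event_type in events:
        scores[(user, item)] += weights.get(event_type, 0.0)
    return dict(scores)


# Hypothetical log entries: (user, item, event type).
events = [
    ("u1", "Movie A", "viewed_details"),
    ("u1", "Movie A", "watched_to_end"),
    ("u1", "Movie B", "viewed_details"),
    ("u2", "Movie B", "added_to_watchlist"),
]
print(implicit_scores(events))
```

The resulting scores can stand in for explicit ratings when feeding a content-based or collaborative engine.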

Next, you need to make sure you store the data in a way that lets you access it fast and lets the algorithm learn from it continuously. A NoSQL database offers the necessary flexibility and scalability for such projects, which tend to grow quickly. This kind of storage works by spreading the data across many distributed servers in the cloud.

The advantage of a NoSQL database is that it can store any kind of data, including unstructured data such as comments, reviews, and opinions. Most of the time, these are far more valuable than numeric ratings, since they give you insight into more subtle preferences.
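To show what such a flexible record could look like, here is a sketch of a single user document that mixes numeric ratings, free-text reviews, and behavioral events. The field names are hypothetical and no specific database is assumed; the example only serializes the document to JSON, the format many document stores accept.

```python
import json

# Hypothetical user document: structured and unstructured data side by side.
user_document = {
    "user_id": "u1",
    "ratings": [{"title": "Movie A", "stars": 4}],
    "reviews": [
        {"title": "Movie A", "text": "Loved the soundtrack, but the ending dragged."}
    ],
    "watch_events": [{"title": "Movie B", "event": "watched_to_end"}],
}

# Serialize for storage, then read it back to confirm nothing is lost.
serialized = json.dumps(user_document, ensure_ascii=False)
restored = json.loads(serialized)
print(restored["reviews"][0]["text"])
```

Because no fixed schema is required, a new kind of signal can be added as another field without migrating existing records.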

To create a great recommendation engine, the most critical step is to analyze the data and identify patterns. Some of the best-performing systems work in real time or near real time, refreshing every few seconds. The slowest but still usable option is batch analysis, which is mostly used in retail for looking at daily or weekly sales.
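The difference between the two modes can be sketched with a toy popularity counter: the batch version recomputes everything from the full log on a schedule, while the streaming version updates its state as each event arrives. The class and function names and the sample log are assumptions for illustration only.

```python
from collections import Counter


def batch_popularity(event_log):
    """Recompute view counts from the complete log, for example once a day."""
    return Counter(item for _, item in event_log)


class StreamingPopularity:
    """Keep running view counts that are updated as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def update(self, user, item):
        self.counts[item] += 1

    def top(self, n=3):
        return [item for item, _ in self.counts.most_common(n)]


# Hypothetical log of (user, item) view events.
event_log = [("u1", "Movie A"), ("u2", "Movie B"), ("u3", "Movie A")]

stream = StreamingPopularity()
for user, item in event_log:
    stream.update(user, item)  # the result is usable right after every event

print(stream.top(1), batch_popularity(event_log).most_common(1))
```

Both approaches arrive at the same counts; the streaming one simply has an up-to-date answer between batch runs, at the cost of keeping state in memory.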

Future trends.

We can expect recommendation engines to become so good that they eliminate the need for search entirely. The downside of this approach is that each customer will live in a comfortable bubble, enjoying the same kind of content with little chance of discovering something beyond it. We are already witnessing this phenomenon in our social media feeds, to our personal and collective detriment, and we see our own bubbles forming in the automatically generated playlists on YouTube.