AI Does Not End With Datasets

I remember my undergraduate professor in AI saying, “AI is what they call machine intelligence they don’t yet fully understand. Once they understand it, it is no longer considered AI. Once robotics was well understood, it was no longer AI and became its own branch. Once computer vision was understood, it became independent. Speech recognition and natural language processing followed that route too.” Perhaps the only field that is well understood and yet so central to AI that separating it would render AI meaningless is machine learning, the science of using complex, math-heavy algorithms to interpolate latent functions. It is because of this special status that we sometimes end up with slogans such as “AI and Machine Learning”, which is a bit like saying “Math and Calculus”.

Machine learning (which includes the more famous branch called “deep learning”) is certainly crucial. ML algorithms are common to many AI areas, and they are the esoteric crux that helps the machine guess desired outcomes from obscure inputs. Load a data set into a model and voilà – you get predictions. ML makes that happen. The media gets the message. If you read most popular articles these days, you may well believe that AI will magically solve everything, everywhere. The overarching recipe is banal to a fault – collect a data set, find the ML algorithm that can interpolate the problem’s complexity, train a model, and collect cash. Simple.

And yet, as any real AI practitioner knows, ML, while vital, is not the heart of the matter. A seminal NIPS paper by Google ML researchers explained in depth that machine learning is only a minuscule part of what makes an AI application. The bulk of the work lies in optimizing pipelines, collecting clean data, and extracting features that are palatable to the ML model and maintainable in a dynamic environment. This is particularly pronounced in natural-language understanding (NLU), where, in order to extract features palatable to the classifier model, one needs to handle misspellings, stemming, and stopwords, disambiguate entity references, possibly look at context, accept that people often use made-up words, be ready for a slowly changing vocabulary and topic distribution, and deal with a myriad of other things.
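To give a feel for how much happens before any classifier sees the text, here is a minimal sketch of such a feature-extraction step. It is only an illustration under stated assumptions: the vocabulary, stopword list, suffix list, and function names are hypothetical toy stand-ins, and a real system would use far richer resources for each stage.

```python
import re
from difflib import get_close_matches

# Hypothetical toy resources, assumed for illustration only.
VOCABULARY = {"the", "weather", "is", "changing", "quickly", "today"}
STOPWORDS = {"the", "is", "a", "an", "of"}
SUFFIXES = ("ing", "ly", "s")


def correct_spelling(token):
    """Map a token to its closest known word, or keep it if nothing is close."""
    if token in VOCABULARY:
        return token
    matches = get_close_matches(token, sorted(VOCABULARY), n=1, cutoff=0.8)
    return matches[0] if matches else token  # made-up words pass through


def stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_features(text):
    """Lowercase, tokenize, fix spelling, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [correct_spelling(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stem(t) for t in tokens]


print(extract_features("The wether is changing quickly"))
```

Even this toy version has four separate stages, each with its own failure modes, and it still ignores entity disambiguation, context, and vocabulary drift.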

One might ask, why not skip all that and load the task into a powerful deep learning box? Surely we can trade the complexity of modeling the data for more time spent in the training stage? Well, good luck. Have you tried predicting weather from tree rings? They are correlated… Your machine should be able to find the path from one to the other. The problem is, you may be long resting peacefully underground by the time that happens. Even some of the most powerful supercomputers, which predict weather from far more impactful signals, still make inaccurate predictions. There is a reason – exponential complexity in computation is no joke. Many of the input features have different degrees of impact on the end result, and most are not even independent of one another – a property diametrically opposed to the independence assumption prevalent in the design of ML algorithms.

This is where domain expertise becomes invaluable. Simply put, a human expert can prune a lot of unnecessary computation by offering shortcuts to the machine. This is done by modeling inference paths using the knowledge human experts have accumulated in a particular domain over many years. Continuing with NLU, a good example is enriching the data with information from linguistics, such as parts of speech, sentence structure (i.e. parse trees), orthography, etc. To understand the benefit, consider how a complex project can be managed effectively. The first thing you do is break it up and establish intermediate milestones. These are smaller in scope, easier to define, and, therefore, easier to reach. Achieving the bigger whole is then reduced to reaching each intermediate milestone in turn. The same holds for NLU – establishing intermediate steps helps redefine the problem in terms of tying the intermediates together.
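As a rough illustration of such an intermediate milestone, the sketch below enriches tokens with parts of speech before anything is handed to a model. The lexicon, the tag names, and the single disambiguation rule are hypothetical and deliberately tiny; the point is only that one piece of linguistic knowledge (a word following a determiner is most likely a noun) resolves an ambiguity the model would otherwise have to learn from data.

```python
# Hypothetical toy lexicon: each word maps to its possible parts of speech.
POS_LEXICON = {
    "i": {"PRON"},
    "a": {"DET"},
    "the": {"DET"},
    "flight": {"NOUN"},
    "read": {"VERB"},
    "book": {"NOUN", "VERB"},  # ambiguous without context
}


def tag(tokens):
    """Tag each token, using one contextual rule to resolve ambiguity."""
    tags = []
    for token in tokens:
        candidates = POS_LEXICON.get(token, {"UNK"})
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
        elif tags and tags[-1] == "DET" and "NOUN" in candidates:
            tags.append("NOUN")  # expert rule: determiner is followed by a noun
        elif "VERB" in candidates:
            tags.append("VERB")  # assumed fallback for noun/verb ambiguity
        else:
            tags.append(sorted(candidates)[0])
    return tags


def enrich(tokens):
    """Pair every token with its tag; these pairs become model features."""
    return list(zip(tokens, tag(tokens)))


print(enrich(["book", "a", "flight"]))
print(enrich(["i", "read", "the", "book"]))
```

The same word, "book", comes out as a verb in the first sentence and a noun in the second, so a downstream classifier receives two distinct features instead of one ambiguous string.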

But there is more to modeling than taking shortcuts. Proponents of training on data sets overlook the plethora of domains where it is very hard even to define how to compile a data set for training. That means one will have a hard time explaining to an annotator (the person who labels the data set with expected results) the logic of how to come up with an expected result for each data sample. Sometimes, it is the ambiguity of labels that complicates things. Other times, it is the complexity of analyzing the input data – it may be outright impossible to provide the required sensory data to the human. In the physical world, certain measurements may be dangerous to the annotator (e.g. if your inputs are gases). Each of these circumstances renders the whole process of collecting data unviable from the start. Your choice of ML model will not matter if you cannot produce the input data!

It may be worth re-examining all those fields that “branched off” from AI. A common theme among them is the incredible amount of domain modeling and knowledge. For instance, robotics relies on the physics of motion, mechanics, materials, electrical engineering, optics, and other more fundamental sciences. While the end result may be feeding images into a CV unit, the bulk of the “magic” actually happens before then. In other words, it is not ML at all that makes for a “magical” AI application, but a concoction of axioms, theorems, measurements, tuning, and the like that describe the domain about which the system is making predictions. ML is just icing on the cake. Instead of relying on the machine to correlate inputs with outputs, applications in these fields put domain knowledge first, building out their technology bottom-up – from basic rules to complex systems, possibly bridging some steps with ML. Their overall composition is always driven by the domain logic.

The benefits of doing so are aplenty. First, you no longer have to rely heavily on manual data collection, which, as we discussed, is rife with constraints and errors. This allows for more comprehensive coverage of your domain. Just think of what you’d rather have: a rule for multiplying two numbers, or an infinite table listing the products of different pairs of reachable numbers? Second, you are able to explain the inference to the end user. Instead of highlighting how your back-propagation went on the 7th hidden layer, you can explain that it was a certain domain feature with a real English name that affected the results the most. Third, it allows for a cleaner assembly of the product and the ability to replace components with more optimal implementations. Try doing that with an ML pipeline! (This is where it’s worth reading the aforementioned NIPS paper again.)
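To make the explainability point concrete, here is a minimal sketch of a scorer built on named domain features. The feature names, weights, and the linear scoring scheme are all assumptions made up for this illustration, not a prescribed design; what matters is that each contribution to the result carries a readable name.

```python
# Hypothetical weights for named domain features (illustrative values only).
WEIGHTS = {
    "contains_negation": -2.0,
    "positive_word_count": 0.5,
    "exclamation_count": 0.1,
}


def contributions(features):
    """Return each named feature's contribution to the final score."""
    return {name: WEIGHTS[name] * value for name, value in features.items()}


def score(features):
    """Overall score is the sum of the per-feature contributions."""
    return sum(contributions(features).values())


def explain(features):
    """Name the feature that moved the score the most, in either direction."""
    parts = contributions(features)
    top = max(parts, key=lambda name: abs(parts[name]))
    return f"'{top}' affected the result the most ({parts[top]:+.1f})"


sample = {"contains_negation": 1, "positive_word_count": 2, "exclamation_count": 3}
print(round(score(sample), 2))
print(explain(sample))
```

Because every feature is a separate, named component, swapping in a better negation detector changes one input without touching the rest, which is the kind of replaceability mentioned above.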

So what, you may ask? By now you may agree that domain modeling is crucial for effective implementation. You decide to hire domain experts and carry on. Is there more to it? Yes! Because domain modeling is so central to AI applications, it can also serve as a compass for finding novel, untapped applications of AI! To put it differently, to find new opportunities, seek out a field where data is hard to collect while the general domain environment is well understood and merely lacks automation. It is in those spaces that one can close small gaps between two clusters of domain knowledge with a simple ML bridge and suddenly get a far more impressive result. And, unlike the “we’ll correlate everything with everything” crowd, you will have the advantage of complete domain coverage, far better descriptive power, and, ultimately, a more robust product.