Building a Predictive Analytics Model From Scratch

There’s a lot of talk right now about the potential value AI can bring to businesses, and the logistics business – because of its complexity and how much e-commerce depends on it – is no exception.

Imagine your e-commerce business needs to ship an order from San Francisco to Seattle and you’ve promised 2-day delivery. It’s 3:34 pm, and USPS, UPS, FedEx, and OnTrac all have different cutoff times at their sortation facilities. It’s going to take your warehouse between 15 and 45 minutes to pick and pack the order, and there’s a 62% chance of a thunderstorm over San Francisco tonight. Do you ship it by air (express) or by ground?

If you ship by air, you lose all of your profit margin. If you ship by ground, your margin is great, but the order may arrive late and you risk losing the customer. The only way to make this decision in real time, thousands of times per day for your growing business, is to predict the future. There are far too many variables for a human to take into account – you need AI. You need a predictive model. And if you don’t have one and your competitors do, you will cede ground to them and lose your competitive advantage.
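To make the trade-off concrete, here is a minimal sketch of how a predicted probability of late delivery could feed the air-versus-ground decision. Everything in it is an assumption for illustration: the ShippingOption class, the expected_profit and choose_service helpers, the dollar amounts, and the probabilities are all hypothetical, and a real system would get p_late from a trained model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShippingOption:
    name: str
    shipping_cost: float  # carrier charge in dollars (hypothetical)
    p_late: float         # predicted probability of a late delivery (hypothetical)


def expected_profit(option, order_margin, late_penalty):
    """Margin before shipping, minus shipping cost, minus the expected cost of lateness."""
    return order_margin - option.shipping_cost - option.p_late * late_penalty


def choose_service(options, order_margin, late_penalty):
    """Return the option with the highest expected profit."""
    return max(options, key=lambda o: expected_profit(o, order_margin, late_penalty))


# Illustrative numbers only: a $20 margin before shipping, and a late delivery
# assumed to cost $60 in lost future business.
air = ShippingOption("air", shipping_cost=20.0, p_late=0.02)
ground_risky = ShippingOption("ground", shipping_cost=8.0, p_late=0.30)
ground_safe = ShippingOption("ground", shipping_cost=8.0, p_late=0.10)

stormy_choice = choose_service([air, ground_risky], order_margin=20.0, late_penalty=60.0)
clear_choice = choose_service([air, ground_safe], order_margin=20.0, late_penalty=60.0)
print(stormy_choice.name, clear_choice.name)
```

With these made-up numbers the choice flips from air to ground once the predicted chance of a late ground delivery drops from 30% to 10%, which is why the quality of that prediction matters so much.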

Start With the Data

This is the promise of AI and Machine Learning (ML) – collect a mountain of data, feed it into a predictive model, and profit! Unfortunately, it’s not quite that simple. Even the best neural networks have difficulty extracting accurate predictions for very complex real-world questions.

In 2016 DeepMind’s AlphaGo, a neural network system trained partly by playing against itself, beat an 18-time world champion at Go, a game arguably more complex than chess. Training a neural network to play games (e.g. chess or Go) isn’t easy, but it differs from the real world in that you have perfect, accurate data at all times. You know the positions and possibilities for every piece on the board, and you know instantly when they change. This is rarely the case for the difficult business questions you want answered in order to gain a competitive advantage or reduce costs.

Your data is likely coming from multiple sources of varying quality, it’s not guaranteed to reach you in real time, and there’s far too much of it – more noise than signal. Before you start dumping all of your data into TensorFlow or Google Cloud AutoML Tables, you need to deeply understand your domain and hire a data scientist.

Statistical processing has been around for decades, and only a trained data scientist is going to be able to work through the petabytes of data you’ve collected and clean it up so that your predictions will be accurate. A lot of the excitement around AI and ML comes from the idea that we’ll get better models with much less work – no more tedious feature extraction or selecting variables! But that’s just not the case… yet. Almost none of your raw data is going to be optimally suited for a predictive model – it will need to be massaged into the right format for each specific application.

It’s common for people new to the field to get excited by how easy modern AI and ML tools are to use, but the devil is in the details. Even the simplest model will give you a prediction, but its accuracy may be so poor that you can’t extract business value from it. The difference between a naive model and a sophisticated one developed by a data scientist shows up in the accuracy of its predictions and the confidence you can place in them.
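One way to see that gap is to score every model against a naive baseline on data it hasn’t seen. The sketch below does this with a handful of toy records invented purely for illustration (they are not real shipment data), and the names naive_model and informed_model are hypothetical. It compares a model that always predicts the overall average transit time with one that at least knows which service was used.

```python
from statistics import mean

# Toy records of (service, transit_days), invented for illustration only.
train = [("air", 1), ("air", 1), ("air", 2), ("air", 2),
         ("ground", 3), ("ground", 3), ("ground", 4), ("ground", 4)]
held_out = [("air", 1), ("air", 2), ("ground", 3), ("ground", 4)]


def mean_absolute_error(predict, records):
    """Average absolute gap, in days, between predicted and actual transit time."""
    return mean([abs(predict(service) - days) for service, days in records])


# Naive model: ignore every input and always predict the overall average.
overall_mean = mean([days for _, days in train])


def naive_model(service):
    return overall_mean


# Slightly more informed model: predict the average for the service used.
days_by_service = {}
for service, days in train:
    days_by_service.setdefault(service, []).append(days)
service_means = {service: mean(days) for service, days in days_by_service.items()}


def informed_model(service):
    return service_means[service]


naive_mae = mean_absolute_error(naive_model, held_out)
informed_mae = mean_absolute_error(informed_model, held_out)
print(f"naive MAE: {naive_mae} days, informed MAE: {informed_mae} days")
```

Both models return an answer for every shipment, so only the error on held-out records tells you which one is worth acting on.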

Our Experience

At EasyPost we try to predict when shipments will arrive at their destinations. Even with tens of billions of data points about past shipments, this is extremely difficult to do. When we began trying to make these predictions with our tracking data alone, the results were abysmal. However, when we began pairing data scientists with shipping experts, we were able to make huge strides in speed and accuracy.

One example of where human intelligence can assist the AI: our experts understand how much cutoff times at sortation facilities matter in the logistics industry. By adding data from domain experts – in this case the cutoff times at each facility type in the carrier networks – we were able to vastly improve our results. Adding domain-specific, relevant data to our scientists’ toolkit lets us create a more intelligent model than AI alone could.
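As a rough sketch of what adding cutoff times can look like in practice, the snippet below turns an order time, a facility cutoff, and the 15 to 45 minute pick-and-pack window from the opening scenario into features a model could use. This is not EasyPost’s actual pipeline: the cutoff_features function, the feature names, the 4:00 pm cutoff, and the calendar date are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def cutoff_features(order_time, cutoff_time, pack_min=15, pack_max=45):
    """Derive cutoff-related features from a pick-and-pack window given in minutes."""
    earliest_ready = order_time + timedelta(minutes=pack_min)
    latest_ready = order_time + timedelta(minutes=pack_max)
    slack_best = (cutoff_time - earliest_ready).total_seconds() / 60
    slack_worst = (cutoff_time - latest_ready).total_seconds() / 60
    return {
        "slack_best_case_min": slack_best,    # minutes to spare if packing is fast
        "slack_worst_case_min": slack_worst,  # negative means the cutoff is missed
        "can_make_cutoff": slack_best >= 0,
        "certain_to_make_cutoff": slack_worst >= 0,
    }


# The order from the opening scenario arrives at 3:34 pm. The date and the
# 4:00 pm cutoff are hypothetical values chosen for this example.
order_time = datetime(2024, 1, 15, 15, 34)
cutoff_time = datetime(2024, 1, 15, 16, 0)
features = cutoff_features(order_time, cutoff_time)
print(features)
```

Raw tracking timestamps alone don’t expose this kind of signal. It only appears once someone who knows the carrier networks supplies the cutoff time.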

Conclusion

In our experience, a complicated question like the one posed earlier about shipping times contains too many variables for today’s best neural networks to learn and solve on their own. Luckily, they don’t have to, but you’ll need data scientists working with domain experts to properly weigh the significance of factors like air humidity levels over the Bay Area!

The future for predictive models is bright, but don’t ignore the past! Statistical processing and data science are the key to framing and simplifying complex business questions so that state-of-the-art AI and ML can grapple with them.