Managing problem structure and noise in machine learning
Warren Powell, co-founder, Optimal Dynamics, Professor, Princeton University
In the 1970s, “AI” meant a rule: “If you are having red meat, drink red wine” or “If the patient has these characteristics, then prescribe this medication.” The rules had to be specified by a human, and they quickly became very cumbersome (imagine writing a rule for a patient with more than four or five characteristics).
Today, “AI” generally refers to machine learning, which means fitting a statistical model to data to estimate something. We might be forecasting demand from history (prediction), or recognizing handwriting or voice (classification), or estimating how the market responds to price (inference). A human has to specify the structure of the model, which will include parameters that have to be tuned with a training dataset.
For example, we might write demand as a function of price using D(price) = a - b × price, where “a” and “b” are two tunable parameters. If the model is this simple, we might be able to estimate “a” and “b” with 10 or 20 observations of price and observed demand. Below is a graph with a set of training data points (the small circles, which are combinations of price and observed demand from history). We show two lines that fit the data poorly, and one that is the best fit. Notice that even the best fit does not pass through every point.
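To make “best fit” concrete, here is a minimal sketch that estimates “a” and “b” by ordinary least squares, which is one common way to define the best line (the article does not say which criterion was used for the graph). The price and demand numbers are made up for illustration, and the function names are my own.

```python
def fit_linear_demand(prices, demands):
    """Least-squares estimates of a and b in D(price) = a - b * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_d = sum(demands) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    sxy = sum((p - mean_p) * (d - mean_d) for p, d in zip(prices, demands))
    slope = sxy / sxx              # slope of demand with respect to price
    a = mean_d - slope * mean_p    # intercept
    b = -slope                     # sign flipped to match D = a - b * price
    return a, b


def sum_squared_error(a, b, prices, demands):
    """Total squared gap between observed demand and the line a - b * price."""
    return sum((d - (a - b * p)) ** 2 for p, d in zip(prices, demands))


# Hypothetical history: ten observations of price and observed demand.
prices = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
demands = [83, 74, 72, 61, 58, 47, 46, 33, 31, 22]

a, b = fit_linear_demand(prices, demands)
best = sum_squared_error(a, b, prices, demands)
poor_1 = sum_squared_error(100.0, 2.0, prices, demands)  # too flat
poor_2 = sum_squared_error(120.0, 4.0, prices, demands)  # too steep

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"squared error: best = {best:.1f}, poor lines = {poor_1:.1f}, {poor_2:.1f}")
```

The best line still leaves some error, because the data are noisy and a straight line cannot pass through every point.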
A simple model with two parameters will not be able to capture very complex relationships. For this reason, neural networks have attracted a lot of attention. A simple neural network is depicted below, where input data (say, the history of prices, possibly along with information about the customer and product) enters the network on the left and “flows” through the network until it comes out as a single number on the right. Each link has a tunable parameter that is multiplied by the value at the node on its left, which then gives the value that goes into the node on its right. A small neural network might have 1,000 parameters; large neural networks (known as “deep neural networks”) can have hundreds of millions.
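As a rough sketch of how values flow through such a network, the code below builds a small fully connected network with one tunable weight per link and counts the parameters. The layer sizes, the random weights, and the tanh squashing at the hidden nodes are assumptions for illustration (real networks usually also add a bias term at each node, omitted here to match the description above).

```python
import math
import random


def make_network(layer_sizes, seed=0):
    """One weight per link between consecutive layers, randomly initialized."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(network, inputs):
    """Push input values left to right through the network; return one number."""
    values = list(inputs)
    for index, layer in enumerate(network):
        # Each right-hand node sums (link weight * value at the left-hand node).
        sums = [sum(w * v for w, v in zip(weights, values)) for weights in layer]
        is_last = index == len(network) - 1
        # Assumed nonlinearity: tanh at hidden nodes, none at the output node.
        values = sums if is_last else [math.tanh(s) for s in sums]
    return values[0]


def count_parameters(network):
    """Number of tunable weights, one per link."""
    return sum(len(weights) for layer in network for weights in layer)


# Three inputs, two hidden layers of 30 nodes, one output.
network = make_network([3, 30, 30, 1])
n_params = count_parameters(network)  # 3*30 + 30*30 + 30*1 links
output = forward(network, [0.5, -1.2, 2.0])
print(n_params, output)
```

Even this toy network has roughly a thousand parameters to tune, compared with two for the straight line.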
Models with many parameters can fit almost anything. This sounds good, right? The catch is that we are often fitting relationships (such as demand as a function of price) that are noisy. I can set the same price day after day and still see wide variations in demand.
We refer to these large models as “high dimensional” because they have many parameters. What happens when we apply these models to our noisy demand and price data? We might get something like the figure below, where our fitted model seems to go through every point! That is good, right?
No!!!
This is called overfitting. We can see right away that it makes no sense because it creates regions where a higher price produces higher demand (!!), followed by steep drops. This is simply not how demand is going to change with price.
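The effect is easy to reproduce. The sketch below uses a polynomial with as many parameters as data points as a stand-in for a large neural network (this is a substitution on my part, chosen because it needs no libraries). The data are made up: demand trends down with price, plus noise.

```python
def interpolate(prices, demands, x):
    """Evaluate the degree-(n-1) polynomial through all n points (Lagrange form)."""
    total = 0.0
    for i, (p_i, d_i) in enumerate(zip(prices, demands)):
        term = d_i
        for j, p_j in enumerate(prices):
            if j != i:
                term *= (x - p_j) / (p_i - p_j)
        total += term
    return total


# Hypothetical noisy observations of price and demand.
prices = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
demands = [83, 70, 76, 58, 63, 45, 50, 30, 36, 20]

# The ten-parameter curve reproduces every training point.
fitted = [interpolate(prices, demands, p) for p in prices]

# Scan a fine grid of prices (10.0, 10.5, ..., 28.0) and count the steps
# where raising the price raises the predicted demand.
grid = [10 + 0.5 * k for k in range(37)]
curve = [interpolate(prices, demands, x) for x in grid]
rising = sum(1 for lower, higher in zip(curve, curve[1:]) if higher > lower)

print(f"steps where a higher price predicts higher demand: {rising} of {len(grid) - 1}")
```

Because the curve is forced through the noise, it has to rise wherever a noisy observation happens to be higher than its neighbor at a lower price.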
Notice that we do not have this problem when we propose the structure D(price) = a - b × price, since as long as b is positive, a larger price produces a smaller demand. That seems more reasonable, except that if the price is large enough, demand goes negative. We can live with this as long as we restrict price to a certain range (I have seen this used in pricing for hotels), but another approach is to pick a structure where this would not happen. This is easy to do… I won’t go into the details, other than to point out that it requires a human using their knowledge of the problem to pick a model that makes sense.
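For readers who want one concrete possibility, a structure such as D(price) = a × exp(-b × price) stays positive at any price and still falls as price rises when b is positive. This is only an illustration of the idea, not necessarily the structure the author has in mind. The sketch below fits it by running a straight-line fit on the logarithm of demand, which assumes every observed demand is strictly positive and minimizes error in log space rather than in raw demand. The data are made up.

```python
import math


def fit_exponential_demand(prices, demands):
    """Estimate a and b in D(price) = a * exp(-b * price) via log-linear least squares."""
    logs = [math.log(d) for d in demands]  # requires strictly positive demands
    n = len(prices)
    mean_p = sum(prices) / n
    mean_l = sum(logs) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    sxy = sum((p - mean_p) * (l - mean_l) for p, l in zip(prices, logs))
    slope = sxy / sxx
    a = math.exp(mean_l - slope * mean_p)
    b = -slope
    return a, b


def predict(a, b, price):
    """Predicted demand; positive for any price because exp() is positive."""
    return a * math.exp(-b * price)


# Hypothetical history of price and observed demand.
prices = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
demands = [83, 74, 72, 61, 58, 47, 46, 33, 31, 22]

a, b = fit_exponential_demand(prices, demands)
print(f"a = {a:.2f}, b = {b:.4f}")
print(f"predicted demand at price 100: {predict(a, b, 100):.4f}")
```

With two parameters this model is as cheap to estimate as the straight line, but the choice of shape encodes what we know: demand falls with price and never drops below zero.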
Choosing appropriate models is sometimes known as “structured machine learning” (or more broadly, “structured AI”). Using a large neural network would be “unstructured learning” since the neural network can fit almost anything, and does not impose any structure on the problem.
Freight transportation and logistics is full of structured problems in estimation and prediction. It also often features data that is changing over time (think about how things have changed with the COVID pandemic), which means if we want a large dataset, we cannot just go farther back in time. The world changes!
The takeaway is that if you are trying to estimate something (or predict, or infer, or classify), ask if you know something about the structure of the problem. If you do, then taking advantage of it will produce a model that will provide more reasonable results, with less data.
Optimal Dynamics is a New York City-based startup that has raised $4.4M to date in order to automate and optimize the logistics industry through the use of High-Dimensional Artificial Intelligence. To find out more please visit: www.optimaldynamics.com