What Is AI? From machine learning to decision automation

Optimal Dynamics · Published in The Startup · Jul 22, 2020

Warren Powell, co-founder, Optimal Dynamics, Professor, Princeton University

There are a lot of articles appearing about “What is AI” (along with “What is machine learning” and “What is reinforcement learning”) that discuss these terms in vague language. Particularly frustrating (at least to me) is the number of authors who equate “AI” with neural networks specifically, or machine learning in general.

We are going to make the distinction between learning and making decisions. We learn from observed data, which is why machine learning is associated with “big data.” Decisions, on the other hand, are designed using models. You cannot learn how to dispatch a fleet of trucks by observing the decisions of dispatchers.

Computer scientists interpret “AI” broadly as capturing any behavior that makes computers behave like humans. In this article, I am going to focus on the tools that are most widely used in business that are described as “AI.” These include

1) Rule-based logic — This was the original form of “artificial intelligence” in the 1960s and ’70s, but it is still used today. “If furniture has four legs and a seat, it is a chair” or “if the credit score is over 600, grant the loan” are simple examples. In a health context, the attributes of a patient are more complicated (“if patient is male, and if a smoker, and if over 50, and if … and if … then order treatment X”). If the input data (for the “if” statements) has more than three or four dimensions, the rule quickly becomes intractable (this is the well-known “curse of dimensionality”). This is the reason that rule-based AI failed, but simple business rules remain widely used in virtually all systems.

2) Making estimates using data from the environment — This is the domain of machine learning, also known as statistics, statistical learning, “predictive analytics” and, more broadly, data sciences. It helps to divide machine learning into two broad classes:

a. Supervised learning — This is the domain of “big data,” where a large training dataset consisting of input-response pairs (such as faces with names; the machine learning community refers to the “response” as a label) is used to train a machine learning model.

b. Unsupervised learning — Here we cluster input data (such as attributes of patients or customers) but without access to labels/responses.

3) Making decisions that interact with the environment — Decisions involve a controllable quantity, and a metric that evaluates the performance of the decision. These arise in both static and sequential settings. Algorithmic strategies for making decisions are quite rich, but for this discussion it is useful to identify the following classes of methods:

a. Rule-based logic — We can use rules to make decisions, such as “if a patient has these symptoms, then apply this treatment.”

b. Deterministic optimization — Powerful solvers for high-dimensional problems known as linear programs (where quantities can be continuous) emerged in the 1990s, followed by breakthroughs for integer programs (where quantities have to be discrete) around 2000. These are widely used in static planning problems such as airline scheduling (a small sketch of a linear program appears below). Oddly, these tools are not traditionally associated with “AI” (I am not sure why).

c. Reinforcement learning — These methods emerged in the 1990s in the context of describing how a mouse finds its way out of a maze, and were ultimately applied to the Chinese game of Go. They have been applied primarily to “single entity” problems such as controlling robots, playing games, or guiding a physician, as opposed to more complex resource allocation problems.

d. Stochastic optimization — This is an umbrella term for a wide range of modeling and algorithmic strategies for solving decision problems in the presence of uncertainty. The problem domain is so rich it has been studied by over 15 distinct research communities, spanning problems from optimal stopping to high-dimensional resource allocation problems. Recently, we have pulled these together into a single, unified framework that draws on the tools of deterministic optimization, machine learning and simulation (see the jungle of stochastic optimization).

Tools for making decisions are sometimes classified under “prescriptive analytics.”
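To make item 3b concrete, here is a minimal sketch of a linear program solved with SciPy’s linprog; the lanes, profits, and capacity numbers are invented for illustration, not drawn from any real fleet.

```python
# A tiny linear program: allocate a fleet across two lanes to maximize profit.
# All numbers are illustrative.
from scipy.optimize import linprog

profit = [-120.0, -90.0]  # profit per load on lanes A and B (negated: linprog minimizes)

A_ub = [[1.0, 1.0],       # loads_A + loads_B <= 50 available trucks
        [1.0, 0.0]]       # loads_A <= 30 committed loads
b_ub = [50.0, 30.0]

res = linprog(profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("Optimal loads (A, B):", res.x, "profit:", -res.fun)
```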

There are other dimensions of AI such as knowledge representation and cognition that are important to computer scientists trying to make computers behave like humans. The dimensions I have listed above cover the practical tools being used in business.

This is a good stopping point for people who just want a brief overview of the styles of artificial intelligence. In the remainder of the article, we provide more detail behind these ideas under the headings:

  1. Making estimates using data from the environment — This discussion will describe the three main classes of machine learning methods.
  2. Making decisions that interact with the environment — We provide an overview of the very rich area of making decisions over time, generally in the presence of uncertainty.
  3. What is reinforcement learning — “Reinforcement learning” is a term introduced by computer scientists that started as a particular algorithmic strategy, but has evolved to the general study of adaptive algorithms for sequential decision problems.
  4. The intersection of “learning” and “making decisions” — We can learn decisions by watching a decision-maker (such as a physician) which requires a training dataset, or we can make decisions to optimize some metric (cost, profits), where we need a model of the problem (but without a training dataset).
  5. What is stochastic optimization — This section describes the disciplines that have evolved for making decisions in the presence of uncertainty. Some of these disciplines are familiar to the AI community, but others are not.

1. Making estimates using data from the environment

Estimation comes in three flavors:

a) Classification — e.g. identifying the name of a person from an image, words being spoken, or the identity of a song from a recording. Classification involves discrete/categorical outputs, with deterministic (right vs. wrong) responses.

b) Inference — For example, estimating the deterioration of a piece of machinery, or inferring the number of people infected by a disease. Inferences typically imply some error in our estimation of an unknown truth, where we wish to minimize some measure of the errors. Observations of the truth are typically noisy.

c) Prediction — These are estimates about the future, such as the amount of snowfall or the movement of a stock price. These problems always involve working with some level of noise.

In all three cases, we have observable (input) data x (such as an image, the characteristics of a patient, or a history of stock prices) from which we wish to make inferences about something that is not known (or observable) to us right now (the identity of the image, diagnosing a disease, or future stock prices).

While these are different settings, they all require the same machinery of statistical modeling (all forms, not just neural nets). Central to making inferences is that you have some model f(x|w) where “x” is a set of known inputs (for example, prices), and w is a set of weights that we have to tune to make the model fit the data as well as possible.

For example, imagine that x is a price, and f(x|w) is the predicted sales when we use price= x. Now assume that we think the relationship between price and sales is given by

Sales = f(x|w) = w0 - w1 x

Note that this assumed relationship ensures that as the price x is increased, sales will decrease. This behavior makes sense, but is not guaranteed by all models.

This is an example of a parametric function. We determine the parameters w0 and w1 by assuming that we are given a dataset of prices x^1, x^2, …, x^n and the corresponding sales y^1, y^2, …, y^n. We then design algorithms to find the parameters w that minimize the sum of the squared errors between the predicted sales f(x^1|w), …, f(x^n|w) and the observed sales y^1, …, y^n.
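To make this concrete, here is a minimal sketch (not the author’s code) of fitting w0 and w1 by least squares with NumPy; the price and sales observations are invented for illustration.

```python
import numpy as np

# Invented price observations x^1,...,x^n and sales observations y^1,...,y^n
prices = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
sales = np.array([52.0, 47.0, 41.0, 38.0, 33.0])

# Fit Sales = w0 - w1 * price by minimizing the sum of squared errors.
# The design matrix [1, -price] makes the fitted coefficients exactly (w0, w1).
A = np.column_stack([np.ones_like(prices), -prices])
(w0, w1), *_ = np.linalg.lstsq(A, sales, rcond=None)

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
print("Predicted sales at price 10:", w0 - w1 * 10)
```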

We may not feel that sales decrease linearly with price. We could assume other relationships, but today there is considerable interest in neural networks, which are so flexible that we do not have to assume any particular form. Instead, we can let the data determine the relationship. Sounds perfect, except that neural networks are so flexible that they can end up fitting the noise in the data. We discussed this issue in an earlier blog entry.

Statistical models can be classified under three headings:

a) Lookup tables — Here “x” is discrete (e.g. the origin-destination of a load, the type of truck) or discretized (e.g. the price of freight, or the penalty applied to late deliveries). A lookup table requires a different estimate (or classification) for each value of “x.” “x” may have multiple dimensions (origin, destination, product-type, age), in which case the number of possible values of “x” can grow dramatically — this is the classic curse of dimensionality.

b) Parametric models — A simple linear parametric model might write demand as a function of price using D(p|w) = w0 - w1 p + w2 p². More complex functions might be nonlinear in the parameters, such as D(p|w) = w0 e^{-w1 p}. Neural networks are a type of nonlinear model, but typically with a lot of parameters, which means they need a lot of data.

c) Nonparametric/locally parametric — This is a class of advanced models that includes deep neural networks, but could also include models that use simple approximations (perhaps constants or linear models) that only apply to specific regions. These models require more data than parametric models (and sometimes a lot of data).
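Here is a comparable sketch for a nonlinear parametric model such as D(p|w) = w0 e^{-w1 p}, fit with SciPy’s curve_fit; the data points are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear parametric demand model D(p|w) = w0 * exp(-w1 * p)
def demand(p, w0, w1):
    return w0 * np.exp(-w1 * p)

# Invented (price, demand) observations
p_obs = np.array([5.0, 7.0, 9.0, 11.0, 13.0])
d_obs = np.array([61.0, 50.0, 41.0, 33.0, 27.0])

# curve_fit tunes (w0, w1) to minimize squared error between model and observations
(w0, w1), _ = curve_fit(demand, p_obs, d_obs, p0=[80.0, 0.1])
print(f"w0 = {w0:.1f}, w1 = {w1:.3f}")
```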

2. Making decisions that interact with the environment

In the context of this discussion, we identify two broad classes of decisions:

o Decisions that change the environment. These decisions might change the physical device (such as moving from one location to another, acquiring/using inventory, taking a drug, or investing in stocks) or set parameters such as the price of a product. Choosing website designs or restaurant menus falls in this category.

o Decisions to exchange information. It helps to identify two classes of “information decisions”:

1. Decisions to acquire information — We might run a laboratory or field experiment, take a medical image, or simulate an operational problem in the computer to collect information that changes what we know about the environment. This information is then used to make better decisions that directly change the environment.

2. Decisions to communicate/disseminate information. We might send a signal, email a coupon or broadcast an ad to inform other decision makers.

Some decisions may change the physical environment while also producing information, such as storing more inventory to learn how large the demand might be, or charging a lower price to see how the market responds.

It helps to have the notion of the “state” of our system which contains all the information needed to make a decision. The state, which we denote S_t, can be a physical state R_t (the location of a vehicle, how much inventory we have on hand), other information I_t (weather, forecasts, prices), and possibly beliefs B_t about quantities we cannot observe directly, such as the inventory in a different region, or how the market responds to price. In a nutshell, a state variable is everything you need to know at time t (and only what you need to know). Surprisingly, there is an astonishing level of confusion about state variables in the academic literature.

For example, in our demand model D(p|w) = w0 - w1 p, we may not know (w0, w1), but we might have estimates along with information about the accuracy of the estimates (such as a mean and variance). We might also have beliefs about the deterioration of a piece of machinery, or the presence of disease in a population.
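As a minimal sketch, a state variable with physical, informational, and belief components might be represented as follows; the fields and numbers are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class State:
    """S_t: everything we need to know (and only what we need) to decide at time t."""
    inventory: Dict[str, int]       # physical state R_t: inventory at each location
    price: float                    # other information I_t
    demand_forecast: float          # other information I_t
    belief_w0: Tuple[float, float]  # belief B_t: (mean, variance) of w0
    belief_w1: Tuple[float, float]  # belief B_t: (mean, variance) of w1

s_t = State(inventory={"NYC": 120, "CHI": 80}, price=9.5,
            demand_forecast=40.0, belief_w0=(50.0, 10.0), belief_w1=(2.0, 0.5))
```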

We evaluate decisions using a cost or contribution function C(S_t,x) that, given a “state” S_t (which contains all the information we need to make a decision) and a decision x (that you control), returns a metric of how well we have done. If we have a deterministic problem, we might be looking for an optimal decision x*.

For sequential problems (and especially sequential problems where there is some form of uncertainty), we are looking for a decision function, or policy, which we like to write as x_t = X^pi(S_t): a rule that takes the information in the state S_t and returns a decision x_t. We assume we have a family of policies, where “pi” identifies one member of that family. Policies come in many forms, although below we divide these into four broad classes. A simple example for inventory problems is to order new inventory when the inventory level drops below a level (call it “q”), in which case we order enough to bring the inventory up to a higher level (call it “Q”). The set of policies in this class is all the different values of (q,Q) (a sketch of this policy appears below).
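Here is a minimal sketch of that (q,Q) inventory policy being applied as each new state arrives; the demand process and the numbers are invented for illustration.

```python
import random

def order_up_to_policy(inventory, q=20, Q=60):
    """X^pi(S_t): if inventory is below q, order up to Q; otherwise order nothing."""
    return Q - inventory if inventory < q else 0

# Decision, new information, decision, new information, ...
random.seed(0)
inventory = 35
for t in range(5):
    order = order_up_to_policy(inventory)           # decision x_t = X^pi(S_t)
    demand = random.randint(5, 25)                  # new information arriving after the decision
    inventory = max(0, inventory + order - demand)  # transition to the next state
    print(f"t={t}: order={order}, demand={demand}, inventory={inventory}")
```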

Rather than finding optimal decisions, we aspire to find optimal policies, although in practice this is quite rare. The challenge of finding policies is an essential feature of problems where decisions have to be made over time. Just remember — a policy is a method for making a decision… any method (including what you may be using now).

There are many communities that address the challenges of making decisions under uncertainty. We refer to these communities as the “jungle of stochastic optimization,” using names such as dynamic programming (or Markov decision processes), optimal control (or stochastic control), simulation-optimization, stochastic programming, decision analysis, and stochastic search, along with communities that use names such as multi-armed bandit problems and active learning. However, in recent years one name has caught the attention of the broader public, and that is “reinforcement learning,” which emerged in the 1980s from the work of Rich Sutton and Andy Barto in computer science.

3. What is reinforcement learning?

Recently, the study of sequential decisions has been grouped under a growing umbrella of research called “reinforcement learning.” In the 1990s, reinforcement learning emerged as a method for solving (approximately) sequential decision problems using the framework of “Markov decision processes.” The idea proceeded by estimating the value Q(s,a) of being in a state “s” and taking an action “a.” We can write the value of being in state s as V(s) = max_a Q(s,a). Computing Q(s,a) (or V(s)) exactly is rarely possible. There is a wide range of algorithms for approximating Q(s,a) or V(s) that have been studied under the banner of reinforcement learning or approximate dynamic programming.
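For readers who want to see the core idea, here is a minimal sketch of the one-step Q-learning update on a tiny made-up chain problem; the environment, step size, exploration rate, and rewards are all invented for illustration.

```python
import random
from collections import defaultdict

# Tabular Q-learning on a tiny chain: states 0..4, actions -1/+1, reward 1 for reaching state 4.
Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.95, 0.5  # eps is high to speed exploration on this toy problem
actions = [-1, 1]

random.seed(1)
for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy action choice
        a = random.choice(actions) if random.random() < eps else max(actions, key=lambda b: Q[(s, b)])
        s_next = min(max(s + a, 0), 4)
        r = 1.0 if s_next == 4 else 0.0
        # one-step update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print({s: round(max(Q[(s, b)] for b in actions), 2) for s in range(5)})  # approximate V(s)
```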

Over time, researchers found that Q-learning did not always work (in fact, it often did not work). The second edition of Sutton and Barto’s classic book Reinforcement Learning (which appeared in 2018) now includes parameterized policies, upper confidence bounding, Q-learning and a class of lookahead policies known as Monte Carlo tree search. Below, we put these into a more general framework for sequential decision problems.

4. The intersection of “learning” and “making decisions.”

Note that you can use machine learning to make decisions if you are trying to mimic what a human would do. In this case, the observations “y” would be actual decisions made in the past, and the “cost” function minimizes the differences between the statistical model f(x|w) and the observed decisions. Our statistical “model” f(x|w) is now a policy (a small sketch of this idea appears below).
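A minimal sketch of learning a policy from a training dataset of observed decisions; the data and the linear form of the fitted policy are assumptions made purely for illustration.

```python
import numpy as np

# Observed states (e.g. inventory level) and the order quantities a human dispatcher chose
x_obs = np.array([10.0, 25.0, 40.0, 55.0, 70.0])   # state feature
y_obs = np.array([50.0, 35.0, 20.0, 5.0, 0.0])     # observed decisions

# Fit a linear "policy" f(x|w) = w0 + w1 * x by least squares to mimic the human
A = np.column_stack([np.ones_like(x_obs), x_obs])
w, *_ = np.linalg.lstsq(A, y_obs, rcond=None)

def learned_policy(inventory):
    return max(0.0, w[0] + w[1] * inventory)

print("Suggested order at inventory 30:", learned_policy(30.0))
```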

This approach for creating a policy needs a training dataset of inputs x and observed decisions. When we approach designing policies as an optimization problem, we do not need the training dataset. Instead, we need the performance metric C(S,x) (costs, profits, utility, …) along with a model of the physical problem. For example, if we have an inventory problem, our model would describe the evolution of inventories given our inventory ordering decisions x.

The distinction between choosing decisions using the machinery of machine learning, versus designing policies to optimize our performance metric, is often overlooked, but it is critical. The two strategies are fundamentally different.

5. What is stochastic optimization?

Stochastic optimization describes any problem where we intermingle making decisions and then making observations that affect the performance of the decision. These problems come in many flavors, but can all be described as a sequence

Decision, new information, decision, information, decision, …

Before each decision is made, we compile everything we know (but only what we need to know) in the state variable S_t.

Decisions come in many forms:

· Binary: Buy/sell, stop/continue

· Discrete: The best of a set of choices (drugs, paths)

· Continuous variables (prices, dosages)

· Vectors, such as how much inventory to store in different warehouses, or how to assign a fleet of drivers to different loads.

We want to make decisions, but we do this by designing rules that we call “policies,” so our search is for the best “policy” for making decisions. Since there is uncertainty, we need to find the policy that works best on average over the different uncertainties. Typically we do this by using a sample of different possible outcomes (see the sketch below).
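To illustrate searching over policies rather than decisions, here is a minimal sketch that tunes the (q,Q) inventory policy from earlier by averaging simulated performance over a sample of demand outcomes; every number here is invented for illustration.

```python
import random

def simulate(q, Q, horizon=50, seed=0, price=5.0, order_cost=3.0, holding=0.1):
    """Profit of the (q,Q) policy over one simulated sample path of demands."""
    random.seed(seed)
    inventory, profit = 40, 0.0
    for _ in range(horizon):
        order = Q - inventory if inventory < q else 0      # the policy
        demand = random.randint(5, 25)                     # new information
        sold = min(inventory + order, demand)
        profit += price * sold - order_cost * order - holding * inventory
        inventory = max(0, inventory + order - demand)
    return profit

# Search over the policy class (all values of q and Q), averaging over 20 sample outcomes
best = max(
    ((q, Q) for q in range(5, 40, 5) for Q in range(40, 100, 10)),
    key=lambda p: sum(simulate(p[0], p[1], seed=s) for s in range(20)) / 20,
)
print("Best (q, Q) found:", best)
```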

Deterministic optimization tools (such as linear programming) focus on making decisions. Stochastic optimization requires designing policies (rules/methods for making decisions). This is a major shift in thinking, one that is not well understood across all the communities in the jungle of stochastic optimization.

When we are solving machine learning problems, we search over a family of statistical models. When we are searching for the best policy, we have to search over different classes of policies, which are much broader than what we consider for machine learning. We hold the discussion of designing policies for another time.

Optimal Dynamics is a New York City-based startup that has raised $4.4M to date in order to automate and optimize the logistics industry through the use of High-Dimensional Artificial Intelligence. To find out more please visit: www.optimaldynamics.com
