Ask Professor Powell… your AI questions

Optimal Dynamics
Aug 11, 2020

I will be using this blog to summarize common questions about AI along with my answers. Feel free to pose your own questions here. Questions of general interest may be replicated here (possibly edited) along with my answer (also possibly edited).

Added Sept 4, 2020:

Q: Value function approximations — Hi, I have a question on value function approximation. I defined and coded my problem. I start each episode with a random policy and adjust the weights of the state features using the gradient Monte Carlo algorithm and TD(0). For about the first 100 episodes the delta was decreasing, but then it started to increase. This might be a sign that the predictors are not working well. What can I do? Shouldn’t the delta be expected to converge rather than diverge across episodes?

These algorithms are heavily dependent on the characteristics of your problem and the specifics of your algorithm. From your reference to “weights of state features” I assume you are using a linear model for the value function V(s) (or for the Q-factors Q(s,a)). There is a lot that can go wrong here; it is very easy for a linear model to be a poor approximation. I am not exactly sure what you mean by delta (is this the increment in TD(0)?), but my guess is that you were getting improvements for your first 100 iterations, after which the deviations between the observed qhats and the estimates qbar of a state-action pair started to increase, indicating that your algorithm is diverging. This has happened to me, and when it does, it can easily mean that your approximating function is just not very good.
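To make the discussion concrete, here is a minimal sketch of TD(0) with a linear value function approximation, which is what I am assuming you have implemented. The environment interface (reset/step), the feature map phi, and the step size are hypothetical placeholders, not details from your problem.

```python
import numpy as np

def td0_linear(env, phi, num_features, policy, n_episodes=200, alpha=0.05, gamma=0.95):
    """TD(0) with a linear value function V(s) ~ theta . phi(s).

    env    -- hypothetical environment with reset() and step(a) -> (s_next, reward, done)
    phi    -- feature map returning a vector of length num_features for a state
    policy -- function mapping a state to an action (e.g., a random policy)
    """
    theta = np.zeros(num_features)
    avg_deltas = []                               # average absolute TD error per episode
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        ep_deltas = []
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_s = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v_s      # the TD error ("delta")
            theta += alpha * delta * phi(s)       # stochastic gradient step on the weights
            ep_deltas.append(abs(delta))
            s = s_next
        avg_deltas.append(float(np.mean(ep_deltas)))
    return theta, avg_deltas
```

If the average delta starts growing after some number of episodes, the first things I would try are a declining stepsize (for example alpha_n = a/(a+n)) and a richer (or structurally better) set of features.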

Now, I have seen this behavior even when using a lookup table for a problem with discrete states and actions. A lookup table does not introduce the types of structural errors that a linear VFA might, but you still have errors in the estimates of the value of a state-action pair (just as you have errors in the estimates of the weights in your linear model).

So, now I have to better understand your training policy. You said that you start each “episode” with a random policy. I am not quite sure what you mean by an episode. Is your problem stationary/infinite horizon, or finite horizon? Are you training purely with a random policy, or are you using your VFAs as the basis of your policy? There are problems with both of these (!!!). If you are training based on a random policy, this can be very suboptimal, which means you are effectively learning the value of a bad policy. But if you are using the VFAs as the basis of your policy, this can also be bad! (Having fun yet?)

VFA-based policies can work extremely well, but I have only had good success when I can exploit problem structure. If you go to my online book on sequential decision analytics, look for the chapter on blood management. I show how to estimate VFAs that are concave in the number of units of blood. Concavity is a powerful property, and it dramatically accelerates the learning.
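Here is a minimal sketch of the idea of exploiting concavity, assuming the VFA is piecewise linear in the number of units of a resource. The smoothing step and the pool-adjacent-violators projection shown here are a simplified illustration of the approach, not the exact algorithm from the book.

```python
import numpy as np

def project_nonincreasing(slopes):
    """Pool-adjacent-violators: project marginal values onto nonincreasing sequences,
    which keeps the piecewise-linear value function concave in the resource level."""
    blocks = []                                   # each block is [value, weight]
    for v in slopes:
        blocks.append([float(v), 1.0])
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for v, w in blocks:
        out.extend([v] * int(w))
    return np.array(out)

def update_slope(slopes, r, vhat, alpha=0.1):
    """Smooth a sampled marginal value vhat into the slope for the r-th unit,
    then restore concavity with the projection above."""
    slopes = slopes.copy()
    slopes[r] = (1 - alpha) * slopes[r] + alpha * vhat
    return project_nonincreasing(slopes)
```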

For another example, look at a recent paper of mine using VFAs to optimize fleets of electric vehicles. We used a lookup table, but initially this worked terribly; we experienced exactly the behavior you describe, where results initially improved and then got worse. We fixed it when we exploited structure within the VFA, and only then were the results great. Note that I do not think a linear approximation of the VFA would ever have worked.

It would be very helpful to understand your actual application. VFAs are one of the four classes of policies, which are introduced in different tutorial articles at jungle.princeton.edu (for example, see the video at the top). You might want to start with a different policy that is easier to use. For example, suppose you are using a VFA-based policy to optimize paths over a stochastic network; a deterministic lookahead (as Google Maps uses) would be a good starting point. Training with a “good” (but suboptimal) policy might be much better than using either a random policy or a VFA-based policy with poor value function approximations.

I encourage you to take a look at the online book at http://tinyurl.com/sequentialdecisionanalytics. I illustrate all four classes of policies in chapters 2–6, then describe the four classes in chapter 8.

Q: What is AI?

So many of the questions I get can be traced back to a misunderstanding of what is meant by AI. Everyone wants to jump on the AI bandwagon. In a nutshell, AI refers to any technology that makes a computer behave “intelligently,” which should mean something more than just adding up numbers. I provide an in-depth discussion of AI in my blog “What is AI” here.

Q: Can AI optimize my company?

This is like saying: Can tools build a house? Obviously, tools can build a house, but you need many tools to perform different functions. Similarly, the performance of a company depends on a variety of steps that involve information and decisions. Some of the steps simply involve moving information from one place to another. Intelligence arises when we have to act on information to make decisions, where making a decision may involve judgments such as estimating the market response to a price, or how much inventory to order. Each step will require different AI tools, which can generally be organized into three broad classes:

  • Rules — Rules have to be specified by a human, and are generally limited to relatively simple settings (e.g. if eating red meat, order red wine).
  • Machine learning — This is the world of using data (sometimes big datasets) to fit mathematical models for classifying (is this email a customer order?), inferring (what is the probability a customer will accept the price?), and predicting (what will the market price be in three days?). Machine learning turns information we know (such as the history of prices) into an estimate of something we don’t know (at least not now), such as the weather tomorrow. Machine learning needs a dataset to train the model (and this might be a big dataset). Neural networks are just one example of a machine learning tool. Machine learning is sometimes referred to (misleadingly) as “predictive analytics,” a term that tends to focus on prediction, ignoring classification and inference.
  • Decision optimization — Decisions represent something we control (inventory, prices, machine schedules, hiring decisions, what experiment to run next). Analytic tools for making decisions do not require a training dataset. Instead, they require performance metrics (cost, profit, service) and a model of the system.

If you want to improve the performance of your company, you have to break down all the steps that involve processing data to identify the places where you need to estimate something (this generally uses machine learning) or to make a decision (which involves some form of optimization).
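As a toy illustration of this split between estimation and decision optimization, here is a small sketch with made-up numbers: a machine learning step estimates how demand responds to price, and an optimization step then chooses the price.

```python
import numpy as np

# Step 1 (machine learning): estimate something we do not know.
# Fit a simple linear model of demand versus price from (made-up) historical data.
prices  = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
demands = np.array([120.0, 105.0, 95.0, 82.0, 70.0])
slope, intercept = np.polyfit(prices, demands, 1)      # demand ~ intercept + slope * price

# Step 2 (decision optimization): choose something we control to maximize a metric.
# Search over candidate prices for the one that maximizes expected revenue.
candidates = np.linspace(8.0, 12.0, 41)
revenue = candidates * (intercept + slope * candidates)
best_price = candidates[np.argmax(revenue)]
print(f"estimated demand model: {intercept:.1f} {slope:+.1f} * price")
print(f"revenue-maximizing price: {best_price:.2f}")
```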

Q: What is the difference between machine learning and reinforcement learning?

The machine learning community considers “reinforcement learning” part of machine learning, but it is important to understand the three major classes of methods that fall under “learning”:

  • Supervised learning — This is where you have a training dataset that is used to fit a model. You might have observations of demand and price, or histories of prices where past prices are used to predict future prices. Other examples could include a dataset of medical images, each one labeled by whether a radiologist feels that it exhibits cancer or not.
  • Unsupervised learning — This is used when you have, for example, a dataset of patients with their personal characteristics (age, weight, gender, …) and medical histories, which you use to cluster the patients into groups.
  • Reinforcement learning — This is a term introduced in the 1990s to describe a class of algorithms (called Q-learning) that are used to make decisions, such as finding the shortest path through a maze, or solving games such as chess, Go and video games. RL is used for decision problems. It does not use a training dataset, but it does require a performance metric and a model of how the system evolves as decisions are made. Q-factors capture the value of being in a state s and then taking an action a, represented by Q(s,a). These values are learned using algorithms that test many decisions over time (hence the association with Q-learning).
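For readers who want to see what Q-learning actually looks like, here is a minimal tabular sketch. The environment interface (reset, step, actions) and all parameter values are hypothetical placeholders.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a), the value of being in state s and taking action a.

    No training dataset is needed; env is a hypothetical simulator exposing reset(),
    step(a) -> (s_next, reward, done), and actions(s) -> list of feasible actions.
    """
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:                 # explore
                a = random.choice(acts)
            else:                                         # exploit current estimates
                a = max(acts, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # Q-learning update
            s = s_next
    return Q
```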

Since the 1990s, reinforcement learning has evolved from its initial association with Q-learning to a number of other algorithmic strategies. These all fall within my universal framework that identifies four classes of policies (methods for making decisions), of which Q-learning falls in one of the four classes (click here for more information).

Q: Why do so many AI projects fail?

The first reason why an AI system may fail is that the technology simply is not working very well. It is important to distinguish between AI as machine learning (estimating something) and AI as a decision-making tool.

  • AI as a machine learning tool — The medical community has long assumed that machine learning (typically using neural networks) would replace radiologists interpreting MRIs, but radiologists continue to outperform computers. We still seem to be a long way from driverless cars. A major problem with the neural networks that are so popular today is that they often have hundreds of thousands to many millions of parameters that need to be estimated. These systems require truly massive datasets, and yet may still not produce the right behavior (click here for additional discussion of this topic).
  • AI as a decision tool — An inventory system might place orders based on current inventory without considering forecasts of demands and inbound orders. Or we may use a forecast but fail to properly account for the potential errors in the forecast. Decisions represent the highest form of AI, since good decisions have to stand on the shoulders of good information, which might also include good forecasts (or other estimates) from machine learning. A decision tool has to use an accurate model of the physical system being controlled.

It is very important to understand how the output of an AI system (whether it be an estimate/forecast, or a decision) will be used within a company. Problems that can arise in the implementation of either type of AI system include:

  • A poor understanding of the data required by the AI system — This tends to arise particularly with decision systems. A truck dispatcher may know that a driver needs to get home, something he learned in a phone call. This is a common instance of a human knowing something that the computer does not.
  • A poor understanding of how information is used by people to make decisions. It can be very hard to understand how a person makes a decision, which complicates the process of providing people with better information.

It is important to understand whether an AI system is being used to help a human, or if it is being allowed to run autonomously.

Q: How can I design a framework (state spaces, action spaces, rewards) for a single agent deep RL applied to any healthcare data?

First, “deep RL” is something that would arise when you are trying to solve the problem. My first recommendation: model first, then solve.

Next, do not talk about state *spaces* and action *spaces*. You want to talk about state *variables* and action (or decision) *variables*. From a modeling perspective, describing the state or action space provides no information about the problem.

There are five dimensions to any sequential decision problem:

  • State variables — The state S_t at time t is all the information you need to make decisions, compute performance metrics, and to model the transition from time t to t+1. State variables come in three flavors: the physical state (inventories, location of a device), other information (for example, information about the weather — basically anything that does not seem to be a physical state), and belief state (which is information that describes probabilistic knowledge about quantities or parameters that you do not know perfectly).
  • Decision variables x_t — This is what you control. Decisions are made by a policy (a method or function for making a decision using the information in S_t) that has to obey any constraints at time t. I like to write the policy as X^\pi(S_t), where “\pi” carries information about the structure of the policy and any parameters that have to be tuned.
  • Exogenous information W_{t+1} — This is the information that is revealed or observed after you make a decision.
  • The transition function — This describes how the state S_t evolves to S_{t+1} after you make a decision, and then observe the exogenous information. I write the transition using S_{t+1} = S^M(S_t, x_t, W_{t+1}), where S^M(…) is the state transition model, or transition function. Sometimes we can describe the transition function using a set of equations, and sometimes these are not known, and all we do is observe S_{t+1}.
  • The objective function — This includes the performance metrics (how well you perform at time t) and how well your policy works over time, on average. Let C(S_t,x_t) be the “contribution function” (which we can maximize or minimize); it might be a weighted combination of different metrics.

I would start by describing the evolution of your system in terms of the state S_t at time t, after which you make a decision x_t (given the information in the state S_t), and you then observe new information that we call W_{t+1}.
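To show how the five elements fit together, here is a minimal skeleton for a toy inventory problem; the order-up-to policy, Poisson demand, prices and costs are all illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# State variable S_t: here just the inventory on hand (a physical state).
def initial_state():
    return {"inventory": 5}

# Decision variable x_t, made by a policy X^pi(S_t); here a simple order-up-to rule
# with one tunable parameter theta (the order-up-to level).
def policy(S, theta=10):
    return max(0, theta - S["inventory"])

# Exogenous information W_{t+1}: demand observed after the decision is made.
def sample_W():
    return {"demand": int(rng.poisson(3))}

# Transition function S^M(S_t, x_t, W_{t+1}).
def transition(S, x, W):
    return {"inventory": max(0, S["inventory"] + x - W["demand"])}

# Contribution (here it also depends on W_{t+1}, a common variation of C(S_t, x_t)).
def contribution(S, x, W, price=4.0, cost=2.0):
    sales = min(S["inventory"] + x, W["demand"])
    return price * sales - cost * x

# Objective: average total contribution of the policy over simulated sample paths.
def evaluate(theta, T=50, n_paths=200):
    total = 0.0
    for _ in range(n_paths):
        S = initial_state()
        for _ in range(T):
            x = policy(S, theta)
            W = sample_W()
            total += contribution(S, x, W)
            S = transition(S, x, W)
    return total / n_paths

print(evaluate(theta=10))
```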

Q: How do I build optimal sequential decision policies for the healthcare system, where the policies might be treatments for patients?

Now the next step is to design the policy. There are two broad classes of policies, each of which can be divided into two subclasses. These are:

  1. The policy search class — These are parameterized functions with parameters that have to be tuned to work well over time. These come in two flavors:
  • Policy function approximations (or PFAs) — These are simple rules or functions such as buy low, sell high, or if the patient has specific attributes, then apply this treatment. PFAs are the only one of the four classes that does not involve an embedded optimization problem.
  • Cost function approximations (CFAs) — Here we are optimizing a simple function that we have parameterized so that it works well over time. In health, we might want to choose the medication that we think will produce the greatest reduction in blood sugar, but we are uncertain, so we add an “uncertainty bonus” given by the standard deviation of our estimate of the reduction (which is uncertain) times a tunable parameter (see the sketch after this list).

  2. The lookahead class — Here we make a decision that combines how we think we will perform right now with how well we think we will perform in the future. These policies are generally more complicated than those in the policy search class. Again, these fall into two subclasses:

  • Policies based on value function approximations (VFAs) — If we are in a state S_t and make a decision x_t, we land in a downstream state (called the post-decision state) that we call S^x_t. Now assume that we have been able to capture the value of being in this state. We typically have to approximate this value, so we call it a value function approximation. VFA-based policies include Q-learning, which has attracted attention under the umbrella of “reinforcement learning.” Many years ago, RL consisted purely of Q-learning, but the community found that Q-learning often did not work. RL has become an umbrella that includes methods with names such as “policy gradient methods” (which apply to PFAs), “upper confidence bounding” (which is a form of CFA), “Q-learning” (a form of VFA), and Monte Carlo tree search (a form of DLA).
  • Policies based on projecting how well we will do in the future over some horizon after making a decision (DLAs) — The simplest example of a DLA is Google Maps, which plans the path all the way to the destination. Google Maps uses a “best estimate” of travel times, and does not model the uncertainties in travel times in the future. This does not always work.
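As promised above, here is a minimal sketch of a CFA-style policy with an uncertainty bonus. The estimates and the tunable parameter theta are made-up placeholders.

```python
import numpy as np

def cfa_choose_treatment(mu, sigma, theta):
    """CFA policy: pick the treatment maximizing (estimated reduction) + theta * (uncertainty).

    mu    -- current point estimates of blood-sugar reduction for each candidate medication
    sigma -- standard deviations of those estimates (how uncertain we are)
    theta -- tunable parameter controlling how strongly we favor uncertain options
    """
    scores = np.asarray(mu) + theta * np.asarray(sigma)
    return int(np.argmax(scores))

# Toy usage with made-up estimates for three medications.
mu    = [1.2, 1.5, 1.4]
sigma = [0.6, 0.1, 0.5]
print(cfa_choose_treatment(mu, sigma, theta=1.0))   # an uncertain but promising option may win
```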

All four of these classes are important. Often, you can identify which is best based on the nature of your application. Decisions in health are quite rich, ranging from which treatment to apply to the management of doctors and nurses (resource allocation problems). Medical decisions often involve uncertainties, so the state variable includes belief states, and decisions often involve collecting information (by running a test) to improve the belief.

Beyond this, I would need a specific health decision to provide further guidance. You want to look at the nature of the decision to guide the design of the policy.

Q: Can I create my own environment for a single agent deep q network related to the healthcare system? If yes then how should I proceed?

Again, I need a specific health decision to help design the policy. You mention “q network,” so I think you are trying to do Q-learning. Again, Q-learning may be useful, but this really depends on the application. My experience is that the more sophisticated policies in the lookahead class tend to be more popular with the research community than with healthcare professionals, who prefer the simplicity of policies in the PFA/CFA classes. These are simpler and more transparent, which can be very important in health.

Please see my online book Sequential Decision Analytics and Modeling at http://tinyurl.com/sequentialdecisionanalytics. If you like, pick a health problem and try adding a chapter to the online book http://tinyurl.com/sequentialdecisionanalyticspub.

Q: Your first works on the DVA, starting with SLAP, used mainly approximations, since solving realistic instances exactly would have been impossible 30 years ago. With the evolution of mathematical programming solvers and algorithms (decompositions, interior point methods, among others) over the last 20 years, does it make sense for you to invest in this line of research for treating the DVA?

Wow … this is going way back. For the uninitiated, DVA = “dynamic vehicle allocation” and was an early name that we used for a particular form of fleet management under uncertainty. This problem was fairly simple because we ignored drivers.

The methods you are referring to are all for deterministic problems, not stochastic ones. In addition, our models today explicitly model drivers, who are described with 15-dimensional attribute vectors (which means each driver is unique). We need to decide whether to accept a load five days in the future, or which driver to assign to a load, even though we do not know what opportunities he might face after he finishes the load (which could be tomorrow or the next day).
