Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-20-122

Counterfactual MDPs: Planning Beyond Direct Control

Rui Silva

Ph.D. Thesis

August 2020

CMU-CS-20-122.pdf


Keywords: Counterfactual Markov Decision Processes, Planning Under Uncertainty

Planning under uncertainty using Markov decision processes (MDPs) requires a model of the environment specifying the probabilistic effects of the actions the agent is able to execute. The agent's optimal policy is then computed from that model. As such, when planning, the agent assumes the environment can only change as a direct result of its actions. In this thesis we lift that assumption and allow the agent to reason over the counterfactual "What if the world were different?" Effectively, we allow the agent to reason over other possible configurations of the world, where more rewarding optimal policies may exist, and over the cost of shifting the original environment to these modified worlds. Our goal is to endow the agent with the ability to plan over the possibility of actually operating in such configurations of the world, if beneficial. We introduce Counterfactual MDPs, a new class of MDPs that allows the agent to reason and plan over this counterfactual. Solving a Counterfactual MDP consists in maximizing the expected value/cost trade-off over possible changes to the world.
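To make the trade-off concrete, one natural way to write the objective (in illustrative notation; the abstract itself fixes none) is as a joint optimization over policies and modified transition models:

    \max_{\pi,\, P'} \; V^{\pi}_{P'}(s_0) \;-\; \lambda\, C(P, P')

where P is the original transition model, P' ranges over valid transition models, V^{\pi}_{P'}(s_0) is the expected discounted return of policy \pi under P', C measures the cost of shifting the world from P to P', and \lambda weighs value against cost.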

In the context of MDPs, the dynamics of the world are described by transition probabilities. Our approach is thus to formulate the problem as a joint optimization over the transition probabilities and the optimal policy of the MDP. We analyze the complexity of the resulting problem and formally prove it is NP-hard. We then derive two gradient-based approaches for solving it, culminating in the contribution of an iterative gradient-based algorithm, P-ITERATION, for solving Counterfactual MDPs. Additionally, we discuss methods for scaling the algorithm up to larger problems.
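As an illustration only (the thesis's actual P-ITERATION updates may differ; all names below are hypothetical), a projected-gradient sketch of this joint optimization in Python might look as follows:

    # Hypothetical sketch of an iterative gradient-based solver in the
    # spirit of P-ITERATION; the exact updates are those of the thesis.
    import numpy as np

    def optimal_value(P, R, gamma, iters=1000, tol=1e-8):
        """Optimal value function of the MDP (P, R, gamma) by value
        iteration. P has shape (S, A, S); R has shape (S, A)."""
        V = np.zeros(P.shape[0])
        for _ in range(iters):
            V_new = (R + gamma * P @ V).max(axis=1)  # Bellman backup
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V

    def project_to_simplex(p):
        """Euclidean projection onto the probability simplex."""
        u = np.sort(p)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / (np.arange(p.size) + 1) > 0)[0][-1]
        return np.maximum(p + (1.0 - css[rho]) / (rho + 1), 0.0)

    def objective(P, P0, R, gamma, lam, s0=0):
        """Value/cost trade-off: value at s0 minus a quadratic change cost."""
        return optimal_value(P, R, gamma)[s0] - lam * np.sum((P - P0) ** 2)

    def p_iteration_sketch(P0, R, gamma, lam, lr=0.05, steps=100, eps=1e-5):
        """Projected gradient ascent over the transition probabilities,
        with a finite-difference gradient for simplicity."""
        P = P0.copy()
        for _ in range(steps):
            base = objective(P, P0, R, gamma, lam)
            grad = np.zeros_like(P)
            for idx in np.ndindex(P.shape):     # perturb each entry
                Pp = P.copy()
                Pp[idx] += eps
                grad[idx] = (objective(Pp, P0, R, gamma, lam) - base) / eps
            P = P + lr * grad
            for s in range(P.shape[0]):         # restore valid rows
                for a in range(P.shape[1]):
                    P[s, a] = project_to_simplex(P[s, a])
        return P

The finite-difference gradient here only conveys the overall alternating structure (evaluate the optimal policy's value, ascend on the transition probabilities, project back to valid distributions); the thesis derives the gradient analytically.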

We demonstrate the applicability of Counterfactual MDPs and the performance of the proposed algorithms in multiple scenarios. In particular, we show significant performance improvements that arise from allowing the agent to reason and plan over other possible worlds and their corresponding optimal policies.

In the process we realize, however, that Counterfactual MDPs implicitly assume that the specific world configuration the agent envisioned will necessarily materialize. In many real-life scenarios, though, there is underlying uncertainty in the outcome of applying changes to the world. We extend the Counterfactual MDP model to let the agent reason over this uncertainty, and dub the resulting model Stochastic Outcomes Counterfactual MDPs. This new model assumes the uncertainty associated with changes to the world follows a probability distribution whose parameters the agent can reason over and control, resulting in a new optimization problem. We show the gradient of this new objective can be written as an expectation, and thus propose a sampling-based method for computing it. This allows us to extend P-ITERATION to this new class of problems with stochastic outcomes.
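One standard way to realize such a sampling-based gradient (the thesis's exact derivation may differ; p_\theta and J below are placeholder notation for the outcome distribution and the value/cost objective) is the log-derivative trick, which turns the gradient of the expectation into an expectation that Monte Carlo samples can estimate:

    \nabla_\theta \, \mathbb{E}_{P' \sim p_\theta}\!\left[ J(P') \right]
      = \mathbb{E}_{P' \sim p_\theta}\!\left[ J(P') \, \nabla_\theta \log p_\theta(P') \right]
      \approx \frac{1}{N} \sum_{i=1}^{N} J(P'_i) \, \nabla_\theta \log p_\theta(P'_i),
      \qquad P'_i \sim p_\theta .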

Finally, we demonstrate the applicability of the new model in multiple scenarios with uncertainty in the outcome of changes to the world. We show that, by reasoning over this uncertainty, the agent is able to find more valuable world configurations.

128 pages

Thesis Committee:
Manuela Veloso (Co-Chair, CMU)
Francisco S. Melo (Co-Chair, Instituto Superior Técnico)
Ariel Procaccia
Reid Simmons
Daniel Borrajo (Universidad Carlos III de Madrid)
Pedro Lima (Instituto Superior Técnico)

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

