Computer Science Department
School of Computer Science, Carnegie Mellon University



CMU-CS-20-136

Striving for Safe and Efficient Deep Reinforcement Learning

Harshit Sushil Sikchi

M.S. Thesis

December 2020

CMU-CS-20-136.pdf


Keywords: Machine Learning, Robotics, Reinforcement Learning, Safe Reinforcement Learning, Optimization, Model-based Reinforcement Learning, Model-free Reinforcement Learning, Planning, Trajectory Optimization, Inverse Reinforcement Learning, Imitation Learning

Reinforcement Learning has seen tremendous progress in the past few years, solving games like Dota and StarCraft, but little attention has been given to the safety of deployed agents. In this thesis, keeping safety in mind, we make progress along different dimensions of Reinforcement Learning: Planning, Inverse RL, and Safe Model-Free RL.

Towards the goal of safe and efficient Reinforcement Learning, we propose:

  1) A hybrid model-based model-free RL method, "Learning Off-Policy with Online Planning (LOOP)," which effectively combines online planning using learned dynamics models with a terminal value function for long-horizon reasoning. This method is favorable for ensuring safe exploration within the planning horizon, and we demonstrate that it achieves performance competitive with state-of-the-art model-based methods. (A schematic sketch of the lookahead objective appears after this list.)

  2) An Inverse Reinforcement Learning method, f-IRL, that allows specifying preferences using only state marginals or observations. We derive an analytical gradient for matching a general f-divergence between the agent's and the expert's state marginals. f-IRL achieves more stable convergence than adversarial imitation approaches that rely on min-max optimization, and we show that it outperforms state-of-the-art IRL baselines in sample efficiency. Moreover, we show that the recovered reward function can be used in downstream tasks, and we empirically demonstrate its utility on hard-to-explore tasks and for behavior transfer across changes in dynamics. (The f-divergence quantity being matched is illustrated after this list.)

  3) A model-free Safe Reinforcement Learning method, Lyapunov Barrier Policy Optimization (LBPO), that uses a Lyapunov-based barrier function to restrict the policy update to a safe set at each training iteration. Our method also allows the user to control the agent's conservativeness with respect to the constraints in the environment. LBPO significantly outperforms state-of-the-art baselines in the number of constraint violations during training while remaining competitive in performance. (A generic barrier-penalty sketch follows this list.)
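
As a rough illustration of the planning objective in (1), the sketch below scores a candidate action sequence by rolling a learned dynamics model forward for H steps and bootstrapping with a terminal value function, then plans by random shooting. The function names, signatures, and the toy model at the end are illustrative assumptions, not the thesis implementation, which the abstract describes only at a high level.

```python
import numpy as np

def lookahead_return(s0, actions, dynamics_model, value_fn, gamma=0.99):
    """Score an action sequence: H discounted model-predicted rewards plus a
    discounted terminal value beyond the planning horizon.
    `dynamics_model(s, a) -> (next_state, reward)` and `value_fn(s) -> float`
    are assumed learned components (illustrative signatures)."""
    s, ret, discount = s0, 0.0, 1.0
    for a in actions:                       # H model steps
        s, r = dynamics_model(s, a)
        ret += discount * r
        discount *= gamma
    return ret + discount * value_fn(s)     # bootstrap with the value function

def plan(s0, dynamics_model, value_fn, act_dim, horizon=5,
         n_candidates=256, gamma=0.99, rng=None):
    """Return the first action of the best sampled sequence (random shooting;
    a CEM planner would iteratively refine the sampling distribution)."""
    rng = rng or np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, act_dim))
    scores = [lookahead_return(s0, seq, dynamics_model, value_fn, gamma)
              for seq in candidates]
    return candidates[int(np.argmax(scores))][0]

# Toy usage with made-up linear dynamics and a zero terminal value.
toy_model = lambda s, a: (s + 0.1 * a, -float(np.sum((s + 0.1 * a) ** 2)))
first_action = plan(np.ones(3), toy_model, value_fn=lambda s: 0.0, act_dim=3)
```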
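
As a small concrete illustration of the quantity being matched in (2), the snippet below evaluates a few members of the f-divergence family between two discrete state-visitation distributions; the distributions are made up for illustration, and the analytical gradient derived in the thesis is not reproduced here.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_s q(s) * f(p(s) / q(s)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

# Made-up state-visitation marginals over four states (expert vs. agent).
rho_expert = np.array([0.40, 0.30, 0.20, 0.10])
rho_agent  = np.array([0.25, 0.25, 0.25, 0.25])

forward_kl = f_divergence(rho_expert, rho_agent, lambda u: u * np.log(u))
reverse_kl = f_divergence(rho_expert, rho_agent, lambda u: -np.log(u))
js = f_divergence(
    rho_expert, rho_agent,
    lambda u: 0.5 * u * np.log(2 * u / (1 + u)) + 0.5 * np.log(2 / (1 + u)),
)
print(forward_kl, reverse_kl, js)  # each is zero iff the two marginals coincide
```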
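
For (3), the sketch below shows the generic idea of penalizing a policy update with a log-barrier on a per-update safety budget, so the penalized objective blows up as the update approaches the constraint boundary and a coefficient trades return against conservativeness. The barrier shape, the way the budget would be derived from Lyapunov conditions, and all numbers here are illustrative assumptions rather than the thesis's exact formulation.

```python
import math

def barrier_penalty(cost_advantage, budget, beta=0.5):
    """Log-barrier that grows without bound as the expected cost advantage of
    a candidate update approaches the safety budget (`budget` is a user-supplied
    slack here; in the thesis the budget comes from Lyapunov conditions)."""
    slack = budget - cost_advantage
    if slack <= 0.0:
        return math.inf                       # outside the safe set: reject
    return -beta * math.log(slack / budget)   # zero when cost_advantage == 0

def penalized_objective(reward_advantage, cost_advantage, budget, beta=0.5):
    """Surrogate to maximize each iteration: improve return while the barrier
    keeps the expected cost advantage strictly inside the budget."""
    return reward_advantage - barrier_penalty(cost_advantage, budget, beta)

# Illustrative candidate updates: (expected reward advantage, expected cost advantage).
candidates = [(1.0, 0.09), (0.8, 0.04), (0.3, -0.02)]
budget = 0.1
best = max(candidates, key=lambda c: penalized_objective(*c, budget))
print(best)  # with beta=0.5 the barrier steers the choice away from the near-violating first update
```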

101 pages

Thesis Committee:
David Held (Chair)
Jeff Schneider

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

