    Reinforcement Learning for Real-World Robotics

    Ideas from the literature on RL for real-world robot control


    Sim2Real: If we wish to train our policies in a simulated environment and then deploy them

    Reward specification: When playing chess or StarCraft II the goal is very clear and easily quantified: win or lose. However, how do you quantify the level of success in something like washing dishes? Or folding laundry?

    Safety: Our robot policies must be safe (during both training on a robot and deployment to the real world), in order to ensure the integrity of the robot itself and the safety of people and property in the environment.

    I will mostly address the first two categories, briefly touch on the third one and avoid the last.

    Sample Efficiency

    Traditionally, the go-to approach for achieving extremely low sample complexity in RL has been model-based RL. In model-based RL, instead of learning a policy based only on rewards obtained by interacting with the environment, the agent tries to learn a model of the environment and uses it to plan and improve its policy, thus dramatically reducing the number of interactions it needs with the environment. However, this approach often results in lower asymptotic performance compared to model-free algorithms, and sometimes suffers from catastrophic failures due to errors in the learned model.

    Model-based RL has shown great success in problems where the dynamics can be sufficiently represented by a simple model, such as a linear one, but has not had great success in more complex environments where non-linear models (such as deep neural networks) are required. In 2018 researchers from Berkeley published a paper on the subject, in which they identified a suspected cause of the instability issues and suggested a solution.

    In their paper, they claim that during learning and planning, the policy tends to exploit those regions of the state-space on which the learned model performs poorly, which veers the agent “off-course” from areas in which it can reliably plan ahead using the learned model to those in which it is completely blind, rendering it ineffective.

    Their solution is relatively simple: learn several different models of the environment, and during planning sample uniformly from these models, effectively regularizing the learning process. This enabled applying model-based RL to more complex tasks than previously possible, achieving asymptotic performance comparable to model-free algorithms in orders of magnitude fewer attempts.
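
    To make the idea concrete, here is a minimal sketch of an ensemble of dynamics models with uniform sampling during imagined rollouts. It is not the paper's implementation: the linear least-squares models, the bootstrap resampling, and all function names are assumptions chosen to keep the example short (the paper itself targets neural network models).

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action, 1] @ weights."""
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])
    weights, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)
    return weights

def fit_ensemble(states, actions, next_states, num_models, rng):
    """Fit each model on a different bootstrap resample (an assumption of this sketch)."""
    ensemble = []
    for _ in range(num_models):
        idx = rng.integers(0, len(states), size=len(states))
        ensemble.append(fit_linear_model(states[idx], actions[idx], next_states[idx]))
    return ensemble

def imagined_rollout(ensemble, policy, start_state, horizon, rng):
    """Roll out the policy, picking one model uniformly at random at every step."""
    state = start_state
    trajectory = [state]
    for _ in range(horizon):
        action = policy(state)
        weights = ensemble[rng.integers(len(ensemble))]
        state = np.concatenate([state, action, [1.0]]) @ weights
        trajectory.append(state)
    return np.stack(trajectory)
```

    Because the policy never knows which model will answer the next step, it cannot rely on a quirk of any single model, which is the regularizing effect described above.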

    Interestingly, a recent model-free algorithm has demonstrated excellent sample complexity, so much so that it was possible to train a policy on a real robot in a relatively short amount of time. In 2019 researchers from Berkeley and Google Brain published a paper in which they describe an off-policy actor-critic algorithm called Soft Actor Critic (SAC). They demonstrated that this algorithm performs very well with low sample complexity on several traditional RL control benchmarks, and then proceeded to train a robot to walk in only 4 hours of environment interaction.

    The robot was trained to walk on a flat surface, and then successfully tested on surfaces with different obstacles, showing the robustness of the learned policy.
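
    For readers unfamiliar with SAC, its distinguishing feature is that it maximizes reward plus the entropy of the policy. A rough sketch of the one-step target used to train the critic is shown below; the function name and the default values of gamma and alpha are illustrative assumptions, not values taken from the paper.

```python
def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Entropy-regularized one-step target for the critic (illustrative sketch)."""
    # Use the smaller of the two target critics to curb overestimation.
    next_q = min(next_q1, next_q2)
    # Subtracting alpha * log pi(a'|s') rewards the policy for staying stochastic.
    soft_value = next_q - alpha * next_log_prob
    # No bootstrapping past the end of an episode.
    return reward + gamma * (1.0 - float(done)) * soft_value
```

    The temperature alpha trades off reward against exploration: a larger value pushes the policy toward more random behavior.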

    Another nice paper took a different path to sample efficiency: it does away with the notion that the policy must be learned from scratch, and instead uses existing, imperfect control algorithms as a “skeleton” on which to grow the policy. Because existing control algorithms are already reasonably well-behaved for the problem, the policy only has to learn to fine-tune the control system rather than start from random behavior, as in most RL algorithms. They use this approach to train a robot for a real block assembly task involving complex dynamics.
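
    The idea can be sketched as a fixed controller whose output is corrected by a learned term. The proportional controller, the clipping limit, and the function names below are hypothetical stand-ins, not the controller or interface used in the paper.

```python
def proportional_controller(position, target, gain=1.0):
    """Hand-designed baseline: push each coordinate toward the target."""
    return [gain * (t - p) for p, t in zip(position, target)]

def residual_action(position, target, residual_policy, limit=1.0):
    """Final command = baseline controller output + learned correction, clipped."""
    base = proportional_controller(position, target)
    correction = residual_policy(position, target)  # this is the part RL trains
    return [max(-limit, min(limit, b + c)) for b, c in zip(base, correction)]

# With an untrained residual that outputs zeros, the robot simply
# follows the baseline controller instead of acting randomly.
zero_residual = lambda position, target: [0.0 for _ in position]
command = residual_action([0.0, 0.0], [0.5, -0.5], zero_residual)
```

    Starting from a zero correction means early training episodes already look sensible, which is where the savings in real-robot interaction come from.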

    In the past few years, many research papers have demonstrated that their learned RL policies can operate on a physical robot platform, but these are often limited to narrow tasks and usually require extensive manual adjustments to work properly. Especially for robots that use visual sensing, it is extremely difficult to realistically simulate the images a robot is likely to encounter, and this creates the famous Sim2Real gap.

    In problems where the dynamics can be sufficiently captured by a simulation, RL can actually work pretty well, as can be seen in this paper, in which a policy was trained in simulation to learn recovery maneuvers for a quadrupedal robot. The policy transferred to the real robot very well, with an impressive 97% success rate. However, in this problem the state was relatively low dimensional, unlike visual representations.

    To tackle the problem of coping with visual inputs, a nice paper from 2017 by OpenAI and Berkeley offers a very neat solution: randomize the visual inputs provided by the environment, and train the policy to be very robust to these changes, in the hope that the real world would look like just another variation of the simulation.


    This worked very well, and they were able to use an object detector trained solely in simulation in a real robot grasping system.
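
    As a rough illustration of the randomization idea, the sketch below perturbs a single rendered image many times. Note the simplification: the perturbations here are applied in image space (tint, brightness, pixel noise), and the ranges and function names are assumptions made for this example, whereas a full setup would randomize the rendered scene itself.

```python
import numpy as np

def randomize_image(image, rng):
    """Return a randomly perturbed copy of an RGB image with values in [0, 1]."""
    tint = rng.uniform(0.5, 1.5, size=3)        # per-channel color/lighting scale
    brightness = rng.uniform(-0.2, 0.2)         # global brightness shift
    noise = rng.normal(0.0, 0.05, image.shape)  # per-pixel sensor noise
    return np.clip(image * tint + brightness + noise, 0.0, 1.0)

def randomized_batch(image, batch_size, seed=0):
    """Build a training batch of differently randomized views of one image."""
    rng = np.random.default_rng(seed)
    return np.stack([randomize_image(image, rng) for _ in range(batch_size)])
```

    A detector or policy trained on such batches sees the same scene under many appearances, so it has to rely on what stays constant across them.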

    A very cool continuation of that work was published in a 2019 paper. In this paper the researchers used a similar randomized simulation to help train the policy to be robust, but simultaneously trained a conditional GAN (cGAN) to transform the randomized images back to a canonical form of the original simulation. At test time, the cGAN transformed the real images to the canonical image form that the policy was familiar with, thus helping to reduce the Sim2Real gap. Using this method, they trained an agent in simulation and used it on a robot with a 70% success rate. With some fine-tuning on the real robot, they were able to achieve 91% and even 99% success rates.


    Reward Specification

    Say you want your robot to learn to place books in a bookshelf, and that you have an algorithm with very low sample complexity. How does one go about designing a reward function for this? In a very cool 2019 paper, researchers from Berkeley tackled exactly this problem. Instead of specifying a reward function, they provide the algorithm with several images of a goal state (arranged bookshelves), and allow it to query the user (very few times) about whether the current state is a goal state. Using the previously mentioned Soft Actor Critic algorithm together with several other algorithms, they trained their policy on a real robot in a few hours. They trained different policies for different tasks, such as putting books in the shelf and draping a cloth over a box.
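
    One way to picture this is a classifier that scores how goal-like a state is, with that score serving as the reward. The sketch below is an assumption-heavy toy: it uses logistic regression on small feature vectors rather than a network on images, and the rule for choosing which state to show the user (the most goal-like one) is an assumption of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_goal_classifier(features, labels, steps=500, lr=0.5):
    """Logistic regression by gradient descent; labels are 1 for goal states."""
    x = np.hstack([features, np.ones((len(features), 1))])  # append bias term
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        w -= lr * x.T @ (sigmoid(x @ w) - labels) / len(x)
    return w

def goal_reward(w, feature):
    """Reward = the classifier's probability that this state is a goal state."""
    return float(sigmoid(np.append(feature, 1.0) @ w))

def pick_query(w, visited_features):
    """Index of the visited state to show the user (assumed: the most goal-like)."""
    return int(np.argmax([goal_reward(w, f) for f in visited_features]))

# Toy data: three goal examples and three non-goal examples.
features = np.array([[2.0, 2.0], [2.5, 1.5], [1.5, 2.5],
                     [-2.0, -2.0], [-2.5, -1.5], [-1.5, -2.5]])
labels = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
w = train_goal_classifier(features, labels)
```

    Each answer from the user becomes a new labeled example, so the classifier, and therefore the reward, improves as training proceeds.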



    The challenge of applying RL to real-world robotics problems is still far from being declared solved, but much progress is being made and hopefully we will continue to see further breakthroughs in this exciting field.

    Disclaimer: the views expressed in this article are those of the author and do not reflect those of IBM.

