6.6 Actor-Critic Methods (*)

Actor-critic methods are TD methods that have a separate memory structure to explicitly represent the policy independent of the value function. The policy structure is known as the actor, because it is used to select actions, and the estimated value function is known as the critic, because it criticizes the actions made by the actor. Learning is always on-policy: the critic must learn about and critique whatever policy is currently being followed by the actor. The critique takes the form of a TD error. This scalar signal is the sole output of the critic and drives all learning in both actor and critic, as suggested by Figure 6.15.
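As a rough illustration of this signal flow (a sketch in Python, not part of the text; the names Actor, Critic, and env and their methods are assumptions made only for exposition), one time step might look like:

    # Minimal sketch of the signal flow in Figure 6.15. Actor, Critic, and env
    # are placeholder names assumed for illustration only.
    def actor_critic_step(actor, critic, env, s):
        a = actor.select_action(s)              # actor: the policy structure selects the action
        s_next, r = env.step(a)                 # environment returns reward and next state
        delta = critic.td_error(s, r, s_next)   # critic: scalar TD-error critique of the action
        critic.update(s, delta)                 # the same scalar drives learning in the critic ...
        actor.update(s, a, delta)               # ... and in the actor
        return s_next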

Figure 6.15: The Actor-Critic Architecture.

Actor-critic methods are the natural extension of the idea of reinforcement-comparison methods (Chapter 2) to TD learning and to the full reinforcement-learning problem. Typically, the critic is a state-value function. After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),

where V is the current value function implemented by the critic. This TD error can be used to evaluate the action just selected, the action a_t taken in state s_t. If the TD error is positive, it suggests that the tendency to select a_t should be strengthened for the future, whereas if the TD error is negative, it suggests that the tendency should be weakened. Suppose actions are generated by the Gibbs softmax method:
\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},

where the p(s,a) are the values at time t of the modifiable policy parameters of the actor, indicating the tendency to select (the preference for) each action a when in each state s. Then the strengthening or weakening described above can be implemented simply by incrementing or decrementing p(s_t, a_t), for example, by

p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t,

where β is another positive step-size parameter.
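To make these two updates concrete, here is a small tabular sketch in Python (an illustration only; the array names V and p, the step sizes alpha and beta, and the env interface with reset() and step() returning the next state, reward, and a termination flag are all assumptions, not from the text):

    import numpy as np

    def softmax_policy(p, s):
        # Gibbs softmax: pi_t(s, .) = exp(p(s, .)) / sum_b exp(p(s, b))
        prefs = p[s] - p[s].max()            # subtract the max for numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def actor_critic_episode(env, V, p, alpha=0.1, beta=0.1, gamma=1.0):
        # One-step actor-critic: tabular critic V, Gibbs softmax actor with preferences p.
        s = env.reset()
        done = False
        while not done:
            pi = softmax_policy(p, s)
            a = np.random.choice(len(pi), p=pi)       # sample a_t from pi_t(s_t, .)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]                     # TD error delta_t
            V[s] += alpha * delta                     # critic update (TD(0))
            p[s, a] += beta * delta                   # actor update: strengthen or weaken a_t
            s = s_next

The arrays would be initialized beforehand, for example as V = np.zeros(n_states) and p = np.zeros((n_states, n_actions)), where n_states and n_actions are whatever sizes the problem at hand requires.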

This is just one example of an actor-critic method. Other variations select the actions in different ways, or use eligibility traces like those described in the next chapter. Another common dimension of variation, just as in reinforcement-comparison methods, is to include additional factors varying the amount of credit assigned to the action taken, a_t. For example, one of the most common such factors is inversely related to the probability of selecting a_t, resulting in the update rule:

p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t \left( 1 - \pi_t(s_t, a_t) \right).

These issues were explored early on, primarily for the immediate-reward case (Williams, 1992; Sutton, 1984), and have not been brought fully up to date.
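In the tabular sketch given earlier, this variant would change only the actor's update line; again as an assumed illustration, not a prescribed implementation:

    def actor_update_with_credit_factor(p, pi, s, a, delta, beta=0.1):
        # p(s_t, a_t) <- p(s_t, a_t) + beta * delta_t * (1 - pi_t(s_t, a_t))
        p[s, a] += beta * delta * (1.0 - pi[a])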

Many of the earliest reinforcement learning systems that used TD methods were actor-critic methods (Witten, 1977; Barto, Sutton and Anderson, 1983). Since then, more attention has been devoted to methods that learn action-value functions and determine a policy exclusively from the estimated values (such as Sarsa and Q-learning). This divergence may be just a historical accident. For example, one could imagine intermediate architectures in which an action-value function is learned and yet an independent policy is still maintained. In any event, actor-critic methods are likely to remain of current interest because of two significant apparent advantages:

* They require minimal computation in order to select actions. Consider a case where there are an infinite number of possible actions, for example, a continuous-valued action. Any method learning just action values must search through this infinite set in order to pick an action. If the policy is explicitly stored, then this extensive computation may not be needed for each action selection.
* They can learn an explicitly stochastic policy; that is, they can learn the optimal probabilities of selecting various actions. This ability turns out to be useful in competitive and non-Markov cases.

In addition, the separate actor in actor-critic methods makes them more appealing to many people as psychological and biological models. In some cases it may also make it easier to impose domain-specific constraints on the set of allowed policies.


