1 Introduction
Hierarchical reinforcement learning (HRL) is a general framework for addressing large-scale reinforcement learning problems. It exploits the task (or action) structure of a problem by considering policies over temporally extended actions that typically involve a reduced subset of the state components. For example, the MAXQ approach [Dietterich 2000] decomposes a Markov decision process (MDP) and its value function into a hierarchy of smaller MDPs such that the value function of the target MDP corresponds to an additive combination of the value functions of the smaller MDPs. Another example is the options approach, in which different tasks can be learned simultaneously in an online fashion [Sutton and Precup 1998]. HRL methods have also been used to explain human and animal behavior [Botvinick, Niv, and Barto 2009].
Independently, a class of stochastic optimal control problems was introduced for which the actions and cost function are restricted in ways that make the Bellman equation linear and thus more efficiently solvable [Todorov 2006; Kappen 2005]. This class of problems is known in the discrete setting as linearly-solvable MDPs (LMDPs), in the continuous setting as path-integral control, or, more generally, as Kullback-Leibler (KL) control [Kappen, Gómez, and Opper 2012]. Optimal control computation for this class of problems is equivalent to a KL-divergence minimization.
The original LMDP formulation considers a single action that changes the stochastic laws of the environment. An alternative interpretation, which we adopt in this work, is to consider a stochastic policy over deterministic actions. LMDPs have many interesting properties. For example, optimal control laws for LMDPs can be linearly combined to derive composite optimal control laws efficiently [Todorov 2009a]. Also, the power iteration method used to solve LMDPs is equivalent to the popular belief propagation algorithm used for probabilistic inference in dynamic graphical models [Kappen, Gómez, and Opper 2012]. The optimal value function of LMDPs can be learned efficiently using an off-policy learning algorithm, Z-learning, that operates directly in the state space instead of in the product space of states and actions [Todorov 2009b].
In the continuous setting, the KL divergence reduces to the familiar quadratic energy cost widely used in robotic applications. Examples of such applications include robot navigation [Kinjo, Uchibe, and Doya 2013] and motor skill reinforcement learning [Theodorou, Buchli, and Schaal 2010]. This class of problems is also relevant in other disciplines, such as cognitive science and decision-making theory [Friston et al. 2013; Ortega and Braun 2013]. However, in general, applying LMDPs to real-world problems is challenging, mainly due to the curse of dimensionality [Abbasi-Yadkori et al. 2015; Matsubara, Gómez, and Kappen 2014; Todorov 2009c].
In this paper, we propose to combine the HRL and LMDP frameworks and formulate a reinforcement learning problem as a hierarchy of LMDPs. Surprisingly, although LMDPs were introduced a decade ago, no unifying framework that combines both methodologies has been proposed yet. The benefits of this combination are twofold. On one hand, HRL problems expressed in this way enjoy the same properties as LMDPs. For example, one can use Z-learning as an efficient alternative to state-of-the-art HRL methods. Another example is task compositionality, by which a composite task can be learned at no extra cost given the optimal solutions of its component tasks. This is useful in tasks that have several terminal states, as we show later. On the other hand, LMDPs can also benefit from the HRL framework, for example, by addressing the curse of dimensionality in an alternative way to the previously mentioned approaches, or through the simultaneous intra-task learning enabled by HRL.
2 Preliminaries
In this section we introduce preliminaries and notation. We first define MDPs and Semi-MDPs, then explain the idea behind MAXQ decomposition, and finally describe linearly-solvable MDPs.
2.1 MDPs and Semi-MDPs
An MDP consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition probability distribution $P(s'\mid s,a)$ satisfying $\sum_{s'} P(s'\mid s,a)=1$ for each state-action pair $(s,a)$, and an expected reward function $r(s,a)$. The aim is to learn an optimal policy $\pi^*:\mathcal{S}\to\mathcal{A}$, i.e. a mapping from states to actions that maximizes expected future reward. MDPs are usually solved by defining a value function $v:\mathcal{S}\to\mathbb{R}$ that estimates the expected future reward in each state. In the undiscounted case, the optimal value function is obtained by solving the Bellman optimality equation:
$$v^*(s) = \max_{a\in\mathcal{A}} \Big[ r(s,a) + \sum_{s'} P(s'\mid s,a)\, v^*(s') \Big].$$
To bound the optimal value function, we only consider first-exit problems, which define a set of terminal states $\mathcal{T}\subseteq\mathcal{S}$. A function $r_f:\mathcal{T}\to\mathbb{R}$ defines the final reward of each terminal state. As an alternative to the value function $v$, one can define an action-value function $Q(s,a)$ that estimates the expected future reward for each state-action pair $(s,a)$.
The Bellman optimality equation can be solved globally using algorithms such as value iteration and policy iteration. However, for large state spaces this is not feasible. Alternative algorithms instead make local updates to the value function online. Arguably the most popular online algorithm for MDPs is Q-learning [Watkins 1989]. Given a transition from state $s$ to state $s'$ when taking action $a$ and receiving reward $r$, Q-learning makes the following update to the estimate $\hat{Q}$ of the optimal action-value function:
$$\hat{Q}(s,a) \leftarrow (1-\alpha)\,\hat{Q}(s,a) + \alpha \Big[ r + \max_{a'} \hat{Q}(s',a') \Big],$$
where $\alpha$ is a learning rate.
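To make the update concrete, here is a minimal tabular sketch in Python. The array layout, the `terminal` predicate and the first-exit handling are assumptions for illustration; only the update formula itself comes from the text above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, terminal):
    """One tabular Q-learning update for an undiscounted first-exit problem.

    Q is an (n_states, n_actions) array; terminal(s) tells whether s is terminal.
    """
    # Bootstrap from the next state unless it is terminal (first-exit setting).
    target = r if terminal(s_next) else r + np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```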
A Semi-MDP generalizes an MDP by including actions that take more than one time step to complete. In this case, the Bellman optimality equation becomes
$$v^*(s) = \max_{a\in\mathcal{A}} \sum_{\tau} \sum_{s'} P(s',\tau\mid s,a)\, \big[ r(s,a,\tau) + v^*(s') \big],$$
where $\tau$ is the duration of action $a$, $r(s,a,\tau)$ is the expected reward when $a$ applied in $s$ lasts for $\tau$ steps, and $P(s',\tau\mid s,a)$ is the probability of transitioning to $s'$ in $\tau$ steps. By defining $\bar{r}(s,a)=\sum_{\tau}\sum_{s'} P(s',\tau\mid s,a)\, r(s,a,\tau)$ and $\bar{P}(s'\mid s,a)=\sum_{\tau} P(s',\tau\mid s,a)$, we get
$$v^*(s) = \max_{a\in\mathcal{A}} \Big[ \bar{r}(s,a) + \sum_{s'} \bar{P}(s'\mid s,a)\, v^*(s') \Big],$$
i.e. a Semi-MDP can be treated and solved as an MDP.
2.2 MAXQ Decomposition
MAXQ decomposition [Dietterich 2000] decomposes an MDP $\mathcal{M}$ into a finite set of tasks $\{M_0,\ldots,M_n\}$ with root task $M_0$, i.e. solving $M_0$ is equivalent to solving $\mathcal{M}$. Each task $M_i$, $0\le i\le n$, consists of a termination set $T_i$, an action set $A_i$ and a pseudo-reward $\tilde{R}_i$.
A task $M_i$ is primitive if it has no subtasks, i.e. if $A_i=\emptyset$. A primitive task corresponds to an action $a$ of the original MDP $\mathcal{M}$, defined such that it is always applicable, terminates after one time step and has pseudo-reward zero everywhere. A non-primitive task $M_i$ can only be applied in non-terminal states (i.e. states not in $T_i$). Terminating in a state $s\in T_i$ produces pseudo-reward $\tilde{R}_i(s)$. $M_i$ corresponds to a Semi-MDP with action set $A_i$, i.e. its actions are other tasks.
MAXQ defines a task graph with the tasks in $\{M_0,\ldots,M_n\}$ as nodes. There is an edge from node $M_i$ to node $M_j$ if and only if $M_j\in A_i$, i.e. if $M_j$ is an action of task $M_i$. To avoid infinite recursion, the task graph has to be acyclic. Figure 1 shows a simplified task graph of the Taxi domain, commonly used to illustrate MAXQ decomposition.
The aim of MAXQ decomposition is to learn a hierarchical policy, i.e. a separate policy $\pi_i$ for each individual task $M_i$, $0\le i\le n$. Each task $M_i$ defines its own value function $v_i$ that, for each state $s$, estimates the expected cumulative reward until $M_i$ terminates. The reward associated with applying action $M_j\in A_i$ in state $s$ of task $M_i$ equals the value of $M_j$ in $s$, i.e. $v_j(s)$. Hence the Bellman optimality equation for $M_i$ decomposes as
$$v_i(s) = \max_{M_j\in A_i} \Big[ v_j(s) + \sum_{s'} P_j(s'\mid s)\, v_i(s') \Big],$$
where $P_j(s'\mid s)$ is the probability of transitioning from $s$ to $s'$ when applying the (possibly composite) action $M_j$. If $M_j$ is a primitive task corresponding to an action $a$ of the original MDP $\mathcal{M}$, its value is the expected immediate reward, i.e. $v_j(s)=r(s,a)$. The pseudo-reward $\tilde{R}_i$ is only used for learning the policy of $M_i$ and does not contribute to the value function $v_i$.
Dietterich (2000) proposed an online algorithm for MAXQ decomposition called MAXQ-Q learning. The algorithm maintains two value function estimates for each task $M_i$: an estimate of the value function $v_i$ defined above, and an estimate of the expected cumulative reward that includes the pseudo-reward $\tilde{R}_i$. The estimate that includes the pseudo-reward defines the policy for $M_i$, while the estimate of $v_i$ is passed as reward to parent tasks of $M_i$. MAXQ-Q learning achieves recursive optimality, i.e. each policy $\pi_i$ is locally optimal with respect to $M_i$. Dietterich (1999) also showed how to use state abstraction to simplify learning in MAXQ decomposition.
2.3 Linearly-Solvable MDPs
Linearly-solvable MDPs (LMDPs) were first introduced by Todorov (2006; 2009b). The original formulation has no explicit actions, and control consists in changing a predefined uncontrolled probability distribution over next states. An alternative interpretation is to view the resulting probability distribution as a stochastic policy over deterministic actions. Todorov's idea was to transform the discrete optimization problem over actions into a continuous optimization problem over transition probabilities, which is convex and analytically tractable.
Formally, an LMDP consists of a set of states $\mathcal{S}$, an uncontrolled transition probability distribution $p(s'\mid s)$ satisfying $\sum_{s'} p(s'\mid s)=1$ for each state $s$, and an expected reward function $r(s)$. Given a state $s$ and any distribution $a(\cdot\mid s)$ over next states, we define the set of next states under $a$ as $\{s' : a(s'\mid s)>0\}$. For first-exit problems, LMDPs also have a subset of terminal states $\mathcal{T}\subseteq\mathcal{S}$.
The control in LMDPs is a probability distribution $u(\cdot\mid s)$ over next states; for a given next state $s'$, $u(s'\mid s)$ can be non-zero only if $p(s'\mid s)$ is non-zero. The reward for applying control $u$ in state $s$ is
$$r(s,u) = r(s) - \lambda\, \mathrm{KL}\big(u(\cdot\mid s)\,\|\,p(\cdot\mid s)\big),$$
where $r(s)$ is the (non-positive) reward associated with state $s$ and $\mathrm{KL}\big(u(\cdot\mid s)\,\|\,p(\cdot\mid s)\big)$ is the Kullback-Leibler divergence between $u(\cdot\mid s)$ and $p(\cdot\mid s)$, penalizing controls that are significantly different from $p$. Typically, $p$ is a random walk and $\lambda>0$ acts as a temperature parameter. Large values of $\lambda$ (high temperature) lead to solutions that are more stochastic, since deviating from the random dynamics is penalized more. Conversely, very small values of $\lambda$ (low temperature) result in near-deterministic policies, since the state-dependent term dominates the immediate cost. LMDPs, in a sense, replace deterministic policies defined over stochastic actions with stochastic policies defined over deterministic actions. In what follows, unless otherwise stated, the next state $s'$ is always drawn from the distribution $u(\cdot\mid s)$. We can now define the Bellman optimality equation:
$$v(s) = \max_{u\in\mathcal{U}(s)} \Big[ r(s) - \lambda\, \mathrm{KL}\big(u(\cdot\mid s)\,\|\,p(\cdot\mid s)\big) + \sum_{s'} u(s'\mid s)\, v(s') \Big].$$
For a given state $s$, the set $\mathcal{U}(s)$ consists of control distributions $u(\cdot\mid s)$ that satisfy $\sum_{s'} u(s'\mid s)=1$ and $u(s'\mid s)=0$ whenever $p(s'\mid s)=0$. To bound the values in the absence of a discount factor, terminal states are absorbing, i.e. $p(t\mid t)=1$ for each $t\in\mathcal{T}$.
Introducing the exponentiated value function (or desirability) $z(s)=\exp(v(s)/\lambda)$ we obtain
$$v(s) = r(s) - \lambda \min_{u\in\mathcal{U}(s)} \sum_{s'} u(s'\mid s)\, \log\frac{u(s'\mid s)}{p(s'\mid s)\, z(s')}.$$
To obtain a KL divergence on the right-hand side, introduce the normalization term $G[z](s)=\sum_{s'} p(s'\mid s)\, z(s')$ and insert it into the Bellman equation:
$$v(s) = r(s) + \lambda \log G[z](s) - \lambda \min_{u\in\mathcal{U}(s)} \mathrm{KL}\Big( u(\cdot\mid s)\,\Big\|\, \frac{p(\cdot\mid s)\, z(\cdot)}{G[z](s)} \Big).$$
The KL term achieves its minimum of zero when the two distributions are equal, i.e. the optimal policy is
$$u^*(s'\mid s) = \frac{p(s'\mid s)\, z(s')}{G[z](s)}.$$
Exponentiating the Bellman equation gives
$$z(s) = e^{r(s)/\lambda}\, G[z](s).$$
We can write this equation in matrix form as
$$z = Q P z, \qquad (1)$$
where $Q$ is a diagonal matrix with the terms $e^{r(s)/\lambda}$ on the diagonal and $P$ is the transition probability matrix derived from the distribution $p$. Unlike the Bellman optimality equation, this is a system of linear equations.
Since Equation (1) is linear, we can solve the corresponding eigenvector problem using, for example, the power iteration method. As an alternative, Todorov (2006; 2009b) proposed an online learning algorithm for LMDPs called Z-learning. Similar to Q-learning for MDPs, the idea of Z-learning is to follow a trajectory, record transitions and perform incremental updates to the value function.
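As an illustration of the power-iteration solution of Equation (1), the following sketch iterates $z \leftarrow QPz$ for a first-exit problem while keeping the desirability of terminal states fixed at $\exp(r_f/\lambda)$. The argument names, array layout and stopping criterion are assumptions; the recovery of $v$ and $u^*$ follows the formulas above.

```python
import numpy as np

def solve_lmdp(P, r, r_f, terminal, lam, max_iter=10000, tol=1e-10):
    """Power iteration for z = Q P z in a first-exit LMDP.

    P: (n, n) passive dynamics, r: (n,) state rewards, r_f: (n,) final rewards,
    terminal: (n,) boolean mask, lam: temperature lambda.
    """
    z = np.ones(len(r))
    z[terminal] = np.exp(r_f[terminal] / lam)  # desirability of terminal states
    q = np.exp(r / lam)                        # diagonal of Q
    for _ in range(max_iter):
        z_new = q * (P @ z)                    # z <- Q P z
        z_new[terminal] = z[terminal]          # keep terminal values fixed
        if np.max(np.abs(z_new - z)) < tol:
            z = z_new
            break
        z = z_new
    v = lam * np.log(z)                        # value function v = lambda * log z
    u = P * z[None, :]                         # optimal policy u*(s'|s) proportional to p(s'|s) z(s')
    return z, v, u / u.sum(axis=1, keepdims=True)
```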
Since LMDPs have no explicit actions, each recorded transition consists of a state $s$, a next state $s'$ and a reward $r$ observed during the transition. Z-learning maintains an estimate $\hat{z}$ of the optimal desirability function, and this estimate is updated after each transition as
$$\hat{z}(s) \leftarrow (1-\alpha)\,\hat{z}(s) + \alpha\, e^{r/\lambda}\, \hat{z}(s'), \qquad (2)$$
where $\alpha$ is a learning rate.
Naive Z-learning samples transitions from the passive dynamics $p$, which essentially amounts to a random walk and leads to slow learning. A better alternative is to use importance sampling to guide exploration, sampling transitions from a more informed distribution. A natural choice is the estimated optimal policy $\hat{u}$ derived from $\hat{z}$, resulting in the following corrected update rule [Todorov 2006]:
$$\hat{z}(s) \leftarrow (1-\alpha)\,\hat{z}(s) + \alpha\, e^{r/\lambda}\, \hat{z}(s')\, \frac{p(s'\mid s)}{\hat{u}(s'\mid s)}. \qquad (3)$$
Note that the importance weight $p(s'\mid s)/\hat{u}(s'\mid s)$ requires access to the passive dynamics $p$.
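The two update rules can be written as a single function. The sketch below is one way to do it: `p_passive` and `u_sample` are the probabilities $p(s'\mid s)$ and $\hat u(s'\mid s)$ of the observed transition, and passing `None` for both gives the naive update of Equation (2). The interface is an assumption for illustration.

```python
import numpy as np

def z_update(z, s, s_next, r, alpha, lam, p_passive=None, u_sample=None):
    """One Z-learning update: Equation (3) if importance weights are given, else Equation (2)."""
    target = np.exp(r / lam) * z[s_next]
    if p_passive is not None and u_sample is not None:
        target *= p_passive / u_sample  # importance weight p(s'|s) / u_hat(s'|s)
    z[s] = (1 - alpha) * z[s] + alpha * target
    return z
```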
2.4 LMDPs With Transition-Dependent Rewards
In the original formulation of LMDPs, the reward is state-dependent. To develop a hierarchical framework based on LMDPs, we have to account for the fact that each task may accumulate a different amount of reward; hence the reward is transition-dependent, depending not only on the current state but also on the next state. In this section we extend LMDPs to transition-dependent rewards, i.e. the expected reward function $r(s,s')$ is defined over pairs of states. The reward of applying control $u$ in state $s$ is then
$$r(s,u) = \sum_{s'} u(s'\mid s)\, r(s,s') - \lambda\, \mathrm{KL}\big(u(\cdot\mid s)\,\|\,p(\cdot\mid s)\big).$$
The Bellman equation becomes
$$v(s) = \max_{u\in\mathcal{U}(s)} \Big[ \sum_{s'} u(s'\mid s)\, r(s,s') - \lambda\, \mathrm{KL}\big(u(\cdot\mid s)\,\|\,p(\cdot\mid s)\big) + \sum_{s'} u(s'\mid s)\, v(s') \Big].$$
Letting $z(s)=\exp(v(s)/\lambda)$ and $g(s,s')=p(s'\mid s)\, e^{r(s,s')/\lambda}$ yields
$$v(s) = -\lambda \min_{u\in\mathcal{U}(s)} \sum_{s'} u(s'\mid s)\, \log\frac{u(s'\mid s)}{g(s,s')\, z(s')}.$$
Normalizing as $G[z](s)=\sum_{s'} g(s,s')\, z(s')$ yields $v(s)=\lambda\log G[z](s)$ and results in the policy
$$u^*(s'\mid s) = \frac{g(s,s')\, z(s')}{G[z](s)}.$$
Exponentiating the Bellman equation gives $z(s)=G[z](s)$, which can be written in matrix form as
$$z = G z, \qquad (4)$$
where each entry of $G$ equals $g(s,s')=p(s'\mid s)\, e^{r(s,s')/\lambda}$.
To solve Equation (4) we can either apply the power iteration method or Z-learning. It is straightforward to extend Z-learning to LMDPs with transition-dependent rewards. Each transition is still a triplet $(s,s',r)$; the only difference is that the reward $r$ now depends on the next state $s'$ as well as the current state $s$. Comparing the expected update direction to the target value $G[\hat{z}](s)$, we see that the update rule in Equation (2) still causes $\hat{z}(s)$ to tend towards the optimal value when transitions are sampled from the uncontrolled distribution $p$. The update with importance sampling in Equation (3) can also be directly applied to LMDPs with transition-dependent rewards.
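Under the same assumptions as the earlier power-iteration sketch, Equation (4) only changes the matrix that is iterated: each entry now combines the passive dynamics with the exponentiated transition reward. In the sketch below, `R` is a matrix of rewards $r(s,s')$; its name and shape are illustrative assumptions.

```python
import numpy as np

def solve_lmdp_transition_rewards(P, R, r_f, terminal, lam, max_iter=10000, tol=1e-10):
    """Power iteration for z = G z, with G[s, s'] = p(s'|s) * exp(r(s, s') / lam)."""
    G = P * np.exp(R / lam)
    z = np.ones(P.shape[0])
    z[terminal] = np.exp(r_f[terminal] / lam)
    for _ in range(max_iter):
        z_new = G @ z                      # z <- G z
        z_new[terminal] = z[terminal]      # terminal desirabilities stay fixed
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new
    return z
```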
3 Hierarchical LMDPs
In this section we formalize a framework for hierarchical LMDPs based on MAXQ decomposition. We assume that there exists an underlying LMDP and a set of tasks, with a designated root task. Each task consists of a termination set, a set of subtasks and a pseudo-reward. Each terminal state of a task is absorbing and produces the corresponding pseudo-reward. For clarity of presentation, we first assume that each task is deterministic, i.e. from each state the task terminates in a unique terminal state. We later show how to extend hierarchical LMDPs to non-deterministic tasks.
In MAXQ decomposition, since the actions of the original MDP are included as primitive tasks, the action set of each task contains a subset of the actions of the original MDP. The analogue for hierarchical LMDPs is that each task contains a subset of the allowed transitions of the original LMDP, i.e. transitions with non-zero probability according to the passive dynamics. Intuitively, the optimal control of each task can be viewed as a stochastic policy that selects between deterministic subtasks and deterministic next states.
We associate each task with an LMDP with transition-dependent rewards. A task is primitive if and only if it has no subtasks. For each state, consider the subset of next states of the original LMDP that are also present in the task, and the set of subtasks of the task that are applicable in that state, i.e. for which the state is not terminal. Clearly, the latter set is empty if the task is primitive.
For a given state, the passive dynamics and immediate reward of the task LMDP are defined in terms of transitions due to next states of the original LMDP and transitions due to subtasks:
(5)  
(6) 
Just as in MAXQ decomposition, the reward associated with applying a subtask in a state equals the value function of that subtask in the state. The transition associated with each subtask has uniform probability, while transitions inherited from the original LMDP have probabilities proportional to those of the original passive dynamics and produce the same reward as in the original LMDP.
For each task, the value function estimates the expected cumulative reward until the task terminates and defines the immediate reward for higher-level tasks. We can write the Bellman optimality equation for the task as
(7) 
The task graph for hierarchical LMDPs is defined as in MAXQ decomposition, and it also has to be acyclic. Figure 2 shows the LMDP task graph of the Taxi domain. Compared to Figure 1, primitive actions no longer appear as tasks, and new sink nodes, e.g. the Navigate tasks, correspond to primitive tasks of the hierarchical LMDP.
The above definitions implicitly make the following assumptions, which differ from the MAXQ formulation and are required for hierarchical LMDPs:
- First, we assume that the terminal states of subtasks are mutually exclusive and do not overlap with next states inherited from the original LMDP. The reason is that an LMDP is not allowed to have more than one transition between two states with different rewards: if this happens, the optimal policy is not identifiable, since one would have to collapse both transitions into one and determine what the resulting immediate reward should be, an ill-defined problem.
- Second, each LMDP task needs the equivalent of a no-op action, i.e. a non-zero self-transition probability, so that the corresponding Markov chain is aperiodic, which is needed for convergence of the power iteration method.
- Finally, unlike MAXQ decomposition, the value function in Equation (7) includes KL terms due to differences between the control and the uncontrolled dynamics. The reward also includes subtask-dependent KL terms. Consequently, the value function of the root task includes KL terms from all other tasks. Although this introduces an approximation, we can control the relative importance of the KL terms by adjusting the value of the temperature $\lambda$.
3.1 Task Compositionality for Non-Deterministic Tasks
In this subsection, we extend the definition of hierarchical LMDPs to non-deterministic tasks. As before, we associate with each task an LMDP, with the important difference that the task (and its subtasks) can have more than one terminal state. Primitive subtasks (or transitions) are handled as before, and we omit them for clarity. We thus have to define passive dynamics and immediate rewards for non-primitive subtasks that can have more than one terminal state.
Consider the terminal states of a given subtask. We define the counterparts of Equations (5) and (6) for multiple terminal states as
(8)  
(9) 
where the transition probability of the subtask from the current state to each of its terminal states, as well as its value function, can be expressed using compositionality of optimal control laws for LMDPs [Todorov 2009a], as described below. Note that this transition probability is different from the subtask's immediate transition probabilities, and that the total transition probability assigned to the subtask is the same as before, but it is now distributed among the possible terminal states.
For each terminal state of a task, we define a separate component task. The new tasks are identical to the original and have the same terminal states, differing only in their pseudo-rewards. For each component task, the pseudo-reward of its associated (goal) terminal state is zero, and the remaining terminal states share the same (negative) pseudo-reward.
Consider the optimal policy and the optimal value of each of the individual component tasks. Using compositionality, the original task with multiple terminal states can be expressed as a weighted combination of the individual tasks. In particular, the composite optimal value and policy are [Todorov 2009a]
where the mixing weights for the component tasks are uniform, since each component task assigns the same pseudo-reward to non-goal terminal states.
The value function in Equation (9) is then given by
and the transition probability in Equation (8) is defined recursively for all states as
This completes the definition of the hierarchical LMDP framework for non-deterministic tasks. Note that each individual component task is still deterministic; its purpose is to avoid terminating in a state different from its designated goal state.
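To make the construction concrete, the sketch below mixes the desirability functions of the single-goal component tasks with uniform weights and derives the composite policy, in the spirit of [Todorov 2009a]. The per-component solutions are assumed to come from the power-iteration or Z-learning procedures above; the function signature is an illustrative assumption.

```python
import numpy as np

def compose_tasks(P, z_components, weights=None):
    """Composite desirability and policy from single-terminal-state component tasks.

    z_components: list of (n,) desirability vectors, one per terminal state.
    """
    k = len(z_components)
    if weights is None:
        weights = np.full(k, 1.0 / k)      # uniform mixing weights, as in the text
    z = sum(w * zi for w, zi in zip(weights, z_components))
    u = P * z[None, :]                     # composite policy: u*(s'|s) proportional to p(s'|s) z(s')
    return z, u / u.sum(axis=1, keepdims=True)
```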
3.2 Hierarchical Learning Algorithms
The aim of a hierarchical LMDP is to learn an estimated hierarchical control policy, i.e. an individual control policy for each task. Similar to MAXQ decomposition, there are two main alternatives for learning a hierarchical policy:
1. Learn each individual policy separately in a bottom-up fashion.
2. Learn all policies simultaneously using hierarchical execution in a top-down fashion.
Implementing an algorithm of the first type is straightforward: since each individual task is an LMDP, we can use the power iteration method or Z-learning. Since all subtasks of a given task are solved before the task itself, the rewards of the task are known and fixed when it is solved.
To implement an algorithm of the second type, similar to MAXQ-Q learning, we start at the root task and sample a subtask to execute using the current estimate of the root policy. We then execute the subtask until termination, possibly applying its own subtasks along the way. When the subtask terminates, control returns to the root task and another subtask is sampled. This continues until we reach an absorbing state of the root task. During this process, the value function estimates of each task are updated using Z-learning.
As in MAXQ-Q learning, if a task has pseudo-rewards different from zero, we have to learn two estimates of the value function for that task: one estimate of the optimal value function that excludes the pseudo-reward, and another estimate that includes it. The estimate that includes the pseudo-reward defines the policy of the task, while the estimate that excludes it is passed as reward to parent tasks.
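A rough sketch of the top-down scheme is given below, ignoring pseudo-rewards for simplicity. The Task interface (`is_terminal`, `is_primitive`, `sample_policy`, `step`, `value`, the `z` arrays) is hypothetical and only meant to illustrate the control flow; it is not the implementation used in the experiments.

```python
import numpy as np

def run_task(task, state, alpha, lam):
    """Execute a task top-down, updating its Z-learning estimate along the way."""
    while not task.is_terminal(state):
        subtask = task.sample_policy(state)      # sample a subtask from the current policy estimate
        if subtask.is_primitive():
            state_next, r = subtask.step(state)  # one primitive transition of the underlying LMDP
        else:
            state_next = run_task(subtask, state, alpha, lam)
            r = subtask.value(state)             # reward of a subtask = its value at the current state
        # Z-learning update of this task's desirability estimate (cf. Equation (2))
        task.z[state] = ((1 - alpha) * task.z[state]
                         + alpha * np.exp(r / lam) * task.z[state_next])
        state = state_next
    return state
```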
3.3 Intra-Task Z-Learning
In hierarchical MDPs, the aim is to learn a separate policy for each individual task. Since Q-learning is an off-policy algorithm, it is possible to use transitions recorded during one task to learn the policy of another task. Such intra-task learning is known to converge faster [Sutton and Precup 1998]. In this section we describe an algorithm for intra-task Z-learning.
As described in Subsection 2.3, we can use importance sampling to improve exploration. Let $(s,s',r)$ be a transition sampled using the estimated policy $\hat{u}_j$ of some task $j$, and consider an update to the estimated value $\hat{z}_i$ of another task $i$. Even though the sample distribution $\hat{u}_j$ is different from the estimated policy of task $i$, we consider the update in Equation (3):
$$\hat{z}_i(s) \leftarrow (1-\alpha)\,\hat{z}_i(s) + \alpha\, e^{r/\lambda}\, \hat{z}_i(s')\, \frac{p(s'\mid s)}{\hat{u}_j(s'\mid s)}. \qquad (10)$$
To see why the update rule is correct, substitute the expression for $\hat{u}_j$ and take the expectation over the next state $s'\sim\hat{u}_j(\cdot\mid s)$:
$$\mathbb{E}_{s'\sim \hat{u}_j(\cdot\mid s)}\Big[ e^{r/\lambda}\, \hat{z}_i(s')\, \frac{p(s'\mid s)}{\hat{u}_j(s'\mid s)} \Big] = \sum_{s'} p(s'\mid s)\, e^{r(s)/\lambda}\, \hat{z}_i(s') = e^{r(s)/\lambda}\, G[\hat{z}_i](s).$$
In other words, instead of moving in the direction of $e^{r/\lambda}\hat{z}_i(s')$, the update rule moves (in expectation) in the direction of $e^{r(s)/\lambda}\, G[\hat{z}_i](s)$, which is precisely the desired value of $\hat{z}_i(s)$. In particular, the importance weight $p(s'\mid s)/\hat{u}_j(s'\mid s)$ can be shared by the different tasks in intra-task learning.
For LMDPs with transition-dependent rewards, the same update rule can be used, but substituting the expressions for $\hat{u}_j$ and the reward leads to a slightly different result:
$$\mathbb{E}\Big[ e^{r/\lambda}\, \hat{z}_i(s')\, \frac{p(s'\mid s)}{\hat{u}_j(s'\mid s)} \Big] = \sum_{s'} p(s'\mid s)\, e^{r(s,s')/\lambda}\, \hat{z}_i(s') = G[\hat{z}_i](s).$$
Recall that $G[\hat{z}_i](s)=\sum_{s'} p(s'\mid s)\, e^{r(s,s')/\lambda}\, \hat{z}_i(s')$. The expectation results from the fact that the observed reward and the expected reward $r(s,s')$ may be different, but are equal in expectation. For LMDPs with transition-dependent rewards, $G[\hat{z}_i](s)$ is the desired value of $\hat{z}_i(s)$.
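The sketch below applies the update in Equation (10) to every task in which the observed transition is valid, reusing the single importance weight computed from the task that generated the sample. The task container and the `allows_transition` test are assumptions for illustration.

```python
import numpy as np

def intra_task_update(tasks, s, s_next, r, p_passive, u_sampling, alpha, lam):
    """Reuse one transition (s, s_next, r), sampled by one task's policy, for all tasks."""
    w = p_passive / u_sampling                   # shared importance weight p(s'|s) / u_hat_j(s'|s)
    for task in tasks:
        if task.allows_transition(s, s_next):    # the transition must exist in this task's LMDP
            task.z[s] = ((1 - alpha) * task.z[s]
                         + alpha * w * np.exp(r / lam) * task.z[s_next])
```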
3.4 State Abstraction
In hierarchical LMDPs, we can apply the same forms of state abstraction as in MAXQ decomposition [Dietterich 1999]. The most common form of state abstraction is projection, or max node irrelevance. This form of state abstraction assumes that the state is factored, i.e. described by a set of state variables, each with its own domain. Max node irrelevance identifies state variables that are irrelevant for a given task, implying that the values of these state variables remain the same until completion of the task. Irrelevant state variables can be ignored while learning the value function.
Dietterich (1999) identified other conditions under which it is safe to perform state abstraction. One condition, leaf irrelevance, does not apply to hierarchical LMDPs, since actions are no longer included as leaves of the task graph. Another condition, result distribution irrelevance, does apply: when two or more states have the same transition probabilities with respect to a given task, we only need to estimate a single value for these states.
4 Experiments
We evaluate the proposed framework in two tasks commonly used in hierarchical MDPs: the Taxi domain [Dietterich 2000] and an autonomous guided vehicle (AGV) task [Ghavamzadeh and Mahadevan 2007]. We compare the following methods:
- Z: Z-learning using naive sampling (i.e. a random walk) without the correction term, as in Equation (2).
- ZIS: Z-learning with importance sampling but without intra-task learning, as in Equation (3).
- ZISIL: Z-learning with importance sampling and intra-task learning, as in Equation (10).
- QG: $\epsilon$-greedy Q-learning without intra-task learning.
- QGIL: $\epsilon$-greedy Q-learning with intra-task learning.
The Z-learning variants are evaluated using task LMDPs as described in Section 3. To compare with the Q-learning variants, for each task LMDP we construct a traditional MDP following the methodology of Todorov (2009b). The resulting traditional MDP is guaranteed to have the same optimal value function as the original LMDP. Following Todorov (2006), we use dynamic learning rates that decay as $c/(c+k)$, where the constant $c$ is optimized separately for each algorithm and $k$ is the current trial. The exploration parameter $\epsilon$ for Q-learning is also optimized for best performance.
To compare performance, we calculate, for each iteration, the norm of the difference between the learned and the optimal value function, which can be computed exactly in the tasks we consider here. More details about the experiments are given in the supplementary material.
4.1 The Taxi Domain
The taxi domain is defined on a grid with four distinguished locations. There is a passenger at one of the four locations, and that passenger wishes to be transported to one of the other three locations. There is also a taxi that must navigate to the passenger's location, pick up the passenger, navigate to the destination location, and put down the passenger there. We use a variant of the taxi domain [Dietterich 2000] with a much larger state space, defined on a larger grid.
We decompose the taxi domain as shown in Figure 2. Just like Dietterich (2000), we apply state abstraction in the form of projection to the navigation tasks, ignoring the passenger's location and destination. This results in considerably smaller state spaces for the navigation tasks than for the full task.
The primitive tasks (NAVIGATE) contain all state transitions associated with the navigation actions NORTH, SOUTH, EAST, WEST and IDLE (i.e. the no-op action). There are four of these primitive tasks, one for each distinguished location (corner) of the grid. The corresponding LMDPs are very similar to the grid example of Todorov (2006): the passive dynamics is a random walk, and the state-cost term is zero for the terminal states (the corresponding corner) and constant elsewhere.
Figure 3 shows the performance of the different learning methods on the primitive tasks of this domain. The best results are obtained with Z-learning with importance sampling and intra-task learning (ZISIL). The Z-learning variants outperform the Q-learning variants mainly because Z-learning, unlike Q-learning, does not need a maximization operator or state-action values [Todorov 2006]. Remarkably, ZIS (without intra-task learning) still performs better than Q-learning with intra-task learning. Naive Z-learning (Z) performs better than greedy Q-learning (QG) because in this particular task random exploration is still useful for learning to reach a corner location of the grid.
The full task is composed of the four possible navigation subtasks plus the transitions resulting from applying the original PICKUP and PUTDOWN actions and the IDLE transition. Figure 4 shows the results for the full task. Since there is only one such task, intra-task learning does not apply. In this case, random exploration converges very slowly, as the curve for Z indicates. Also, the advantage of Z-learning with importance sampling over greedy Q-learning is less pronounced than in the primitive tasks.
From these results, we conclude that the proposed extensions of Z-learning outperform the state-of-the-art learning methods in this domain.
4.2 The Autonomous Guided Vehicle (AGV) Domain
The second domain we consider is a variant of the AGV domain [Ghavamzadeh and Mahadevan 2007]. In this problem, an AGV has to transport parts between machines in a warehouse. Different parts arrive at the warehouse at uncertain times, and these parts can be loaded from the warehouse and delivered to the specific machines that can process and assemble them. Once a machine terminates, the AGV can pick up the assembly and bring it to the unload location of the warehouse.
The state space of the full problem has nine components: three components for the position of the AGV (its two coordinates and its orientation), one for the type of the part that the AGV is carrying, and five to represent the different parts that are available to pick up from the warehouse or from the assembly locations. To convert the overall problem into a first-exit task, we do not allow new parts to arrive at the warehouse, and the task is to assemble all parts and deliver them to the unload station. For more details, see the supplementary material.
An important feature of this problem is that the AGV can only navigate using transitions corresponding to the primitive actions FORWARD, TURNLEFT, TURNRIGHT and STAY. Unlike the taxi domain, this significantly constrains the trajectories required to navigate from one location to another in the warehouse. As in the taxi domain, we define six primitive NAVIGATE tasks for navigating to the six drop-off and pick-up locations in the warehouse. As before, we apply state abstraction in the form of projection to these tasks, ignoring the location of all parts and assemblies.
Figure 5 shows the results of the different learning methods on the NAVIGATE tasks. They are similar to those in the taxi domain, although Q-learning with intra-task learning performs comparatively better, while naive Z-learning performs comparatively worse. The latter result can be explained by the need for guided exploration when navigating in this domain.
Since the total number of states is large, we also apply state abstraction in the form of result distribution irrelevance to the overall task. Since the NAVIGATE tasks always terminate in a predictable state, it is not necessary to maintain a value function for states other than the drop-off and pick-up locations.
We have also implemented an online algorithm similar to MAXQ-Q learning. Instead of using the value function of subtasks to estimate transition rewards, we execute each subtask until termination, recording the reward along the way. The reward accumulated during the subtask is then used as the observed immediate reward for the abstract task. As a performance measure we use the throughput, i.e. the number of assemblies delivered to the unload station per time step.
Figure 5 shows the relative performance of Z-learning with importance sampling and Q-learning for this MAXQ-Q variant. We omit naive Z-learning, since the throughput of a random walk is constant over time. The number of time steps includes all primitive transitions, including those of the NAVIGATE subtasks. As the figure shows, Z-learning converges more quickly to a suboptimal policy than Q-learning does, illustrating the benefits of hierarchical LMDPs.
5 Conclusions
We have presented a framework for hierarchical reinforcement learning that combines MAXQ decomposition with linearly-solvable MDPs, formulating each task as an LMDP. The framework has been illustrated in two domains, in which the hierarchical, intra-task Z-learning algorithm outperforms state-of-the-art methods for hierarchical MDPs.
Hierarchical Linearly-Solvable Markov Decision Problems
(Supplementary Material)
Anders Jonsson Vicenç Gómez
Department of Information and Communication Technologies
Universitat Pompeu Fabra
Roc Boronat 138, 08018 Barcelona, Spain
{anders.jonsson,vicen.gomez}@upf.edu
Appendix A Experimental Setup
To compare with the Q-learning variants, for each task LMDP we construct a traditional MDP following the methodology of Todorov (2009b). For each state, we define a symbolic action with a transition probability distribution matching the optimal control $u^*(\cdot\mid s)$, which is computed using the power iteration method in the original LMDP. We also define as many other symbolic actions as there are possible next states following the state; their transition probabilities are obtained from $u^*(\cdot\mid s)$ by circular shifting. The reward of each symbolic action is the state reward minus $\lambda$ times the KL divergence between the action's transition distribution and the passive dynamics. The resulting traditional MDP is guaranteed to have the same optimal value function as the original LMDP.
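The sketch below is one possible reading of this construction: for every state, the first symbolic action equals the optimal control restricted to the allowed next states, and the remaining actions are circular shifts of it; each action's reward is the state reward minus $\lambda$ times its KL divergence from the passive dynamics. The data layout and function name are assumptions, not the code used in the experiments.

```python
import numpy as np

def lmdp_to_mdp(u_star, P, r, lam):
    """Build symbolic MDP actions from the optimal LMDP control by circular shifting.

    Returns a dict: actions[s] is a list of (next-state probabilities, reward) pairs.
    """
    actions = {}
    for s in range(P.shape[0]):
        nxt = np.flatnonzero(P[s])          # next states allowed by the passive dynamics
        base = u_star[s, nxt]               # optimal control restricted to allowed next states
        for shift in range(len(nxt)):
            probs = np.roll(base, shift)    # circularly shifted transition probabilities
            kl = np.sum(probs * np.log(probs / P[s, nxt]))
            actions.setdefault(s, []).append((dict(zip(nxt.tolist(), probs)), r[s] - lam * kl))
    return actions
```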
Following Todorov (2006), we use dynamic learning rates that decay as $c/(c+k)$, where the constant $c$ is optimized separately for each algorithm and $k$ is the current trial. The exploration parameter $\epsilon$ for Q-learning is also optimized for best performance.
To compare performance, we calculate, for each iteration, the norm of the difference between the learned and the optimal value function.
A.1 The Taxi Domain
The taxi domain is defined on a grid with four distinguished locations. There is a passenger at one of the four locations, and that passenger wishes to be transported to one of the other three locations. There is also a taxi that must navigate to the passenger's location, pick up the passenger, navigate to the destination location, and put down the passenger there. We use a variant of the taxi domain [Dietterich 2000] with a much larger state space, defined on a larger grid, as shown in Figure 7.
The state space is composed of the horizontal and vertical coordinates of the taxi location and the location of the passenger, which ranges over the different pick-up locations plus an extra value for when the passenger is in the taxi. Just like Dietterich (2000), we apply state abstraction in the form of projection to the navigation tasks, ignoring the passenger's location and destination. This results in considerably smaller state spaces for the navigation tasks than for the full task.
The primitive tasks (NAVIGATE) are composed of all state transitions corresponding to the navigation actions NORTH, SOUTH, EAST, WEST and IDLE (i.e. the no-op action). There are four of these primitive tasks, one for each distinguished location (corner) of the grid. The corresponding LMDPs are very similar to the grid example of Todorov (2006): the passive dynamics is a random walk, and the state-cost term is zero for the terminal states (the corresponding corner) and constant elsewhere.
In terms of scalability, the bottleneck of the algorithm is the numerical precision required for computing the exact optimal value function. This precision strongly depends on the absolute difference between the maximum and the minimum values in the matrices $QP$ (or $G$). In order to obtain good estimates, one needs to set $\lambda$ sufficiently small, as mentioned by Todorov (2006), which increases the required numerical precision. Although we obtained correct policies for larger grids, numerical problems start to appear as the grid grows, since the threshold required to check convergence of the power iteration method becomes very small.
A.2 The Autonomous Guided Vehicle (AGV) Domain
The second domain we consider is a variant of the AGV domain [Ghavamzadeh and Mahadevan 2007]. In this problem, an AGV has to transport parts between machines in a warehouse. Different parts arrive at the warehouse at uncertain times, and these parts can be loaded from the warehouse and delivered to the specific machines that can process and assemble them. Once a machine terminates, the AGV can pick up the assembly and bring it to the unload location of the warehouse.
We simplified the original problem by reducing the number of machines and fixing the processing time of the machines so that the task is fully observable (see Figure 8).
The state space of the full problem has the following components:
- x coordinate of the AGV position
- y coordinate of the AGV position
- orientation of the AGV (up, right, down, left)
- Num. parts at the input buffer of Machine 1
- Num. parts at the output buffer of Machine 1
- Num. parts at the input buffer of Machine 2
- Num. parts at the output buffer of Machine 2
- Part of type 1 available at the warehouse
- Part of type 2 available at the warehouse
To convert the overall problem into a first-exit task, we do not allow new parts to arrive at the warehouse, and the task is to assemble all parts and deliver them to the unload station. The resulting task has a large state space.
The bottleneck of the algorithm is again the numerical precision required for calculating the optimal value function using the power iteration method.
References
[Abbasi-Yadkori et al. 2015] Abbasi-Yadkori, Y.; Bartlett, P. L.; Chen, X.; and Malek, A. 2015. Large-scale Markov decision problems with KL control cost and its application to crowdsourcing. In 32nd International Conference on Machine Learning (ICML), 1053–1062.
[Botvinick, Niv, and Barto 2009] Botvinick, M. M.; Niv, Y.; and Barto, A. C. 2009. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition 113(3):262–280.
[Dietterich 1999] Dietterich, T. G. 1999. State abstraction in MAXQ hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 12 (NIPS), 994–1000.
[Dietterich 2000] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
[Friston et al. 2013] Friston, K.; Schwartenbeck, P.; FitzGerald, T.; Moutoussis, M.; Behrens, T.; and Dolan, R. J. 2013. The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience 7:598.
[Ghavamzadeh and Mahadevan 2007] Ghavamzadeh, M., and Mahadevan, S. 2007. Hierarchical average reward reinforcement learning. Journal of Machine Learning Research 8:2629–2669.
[Kappen, Gómez, and Opper 2012] Kappen, H. J.; Gómez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Machine Learning 87(2):159–182.
[Kappen 2005] Kappen, H. J. 2005. Linear theory for control of nonlinear stochastic systems. Physical Review Letters 95:200201.
[Kinjo, Uchibe, and Doya 2013] Kinjo, K.; Uchibe, E.; and Doya, K. 2013. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task. Frontiers in Neurorobotics 7:1–13.
[Matsubara, Gómez, and Kappen 2014] Matsubara, T.; Gómez, V.; and Kappen, H. J. 2014. Latent Kullback Leibler control for continuous-state systems using probabilistic graphical models. In 30th Conference on Uncertainty in Artificial Intelligence (UAI), 583–592. AUAI Press.
[Ortega and Braun 2013] Ortega, P. A., and Braun, D. A. 2013. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A 469:20120683.
[Sutton and Precup 1998] Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the 15th International Conference on Machine Learning (ICML), 556–564. Morgan Kaufmann.
[Theodorou, Buchli, and Schaal 2010] Theodorou, E.; Buchli, J.; and Schaal, S. 2010. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research 11:3137–3181.
[Todorov 2006] Todorov, E. 2006. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems 19 (NIPS), 1369–1376.
[Todorov 2009a] Todorov, E. 2009a. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems 22 (NIPS), 1856–1864.
[Todorov 2009b] Todorov, E. 2009b. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences of the United States of America 106(28):11478–11483.
[Todorov 2009c] Todorov, E. 2009c. Eigenfunction approximation methods for linearly-solvable optimal control problems. In Proceedings of the 2nd IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 161–168.
[Watkins 1989] Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, King's College, Cambridge, UK.