\documentclass{article}
\usepackage[a4paper, margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{fancyhdr}
\pagestyle{fancy}
\usepackage{lastpage}
\usepackage{graphicx}
% \graphicspath{{./figures}}
\cfoot{Page \thepage\ of \pageref{LastPage}}
\rhead{Pavel Lutskov, 03654990}
\lhead{Programming Assignment}

\title{\huge Approximate Dynamic Programming and Reinforcement Learning \\ \Large Programming Assignment}
% \subtitle{Assignment 1}
\author{Pavel Lutskov, 03654990}

\begin{document}
\maketitle

\section{Environment modeling}

In my code the behavior of the maze is represented using the system equation model. First, I assign a numeric index to each valid (non-wall) state of the maze in row-major order. The possible actions are also assigned numeric indices. The space of possible disturbances in my implementation is the same as the space of actions, i.e.\ $\{\text{up}, \text{down}, \text{left}, \text{right}, \text{idle}\}$. Using this numerical indexing, the system equation and the system stochasticity can be represented as two 3-D matrices $F_{xuw}$ and $P_{xuw}$. The $(x,u,w)$-th element of $F$ gives the index of the state that results from state $x$ when action $u$ is taken under disturbance $w$. If action $u$ is impossible in state $x$, or if $w$ is impossible for the given $(x,u)$, the $(x,u,w)$-th entry is treated as invalid. This is achieved with a supporting matrix $U_{xu}$, whose $(x,u)$-th element contains a Boolean value indicating whether action $u$ is allowed in state $x$. Furthermore, the $(x,u,w)$-th element of $P$ gives the probability of disturbance $w$ when action $u$ is taken in state $x$; it is zero if such a configuration $(x,u,w)$ is impossible. These matrices are initialized before the execution of the dynamic programming algorithms begins. The advantage of this formulation is that the computations can be accelerated by using the \textit{NumPy} library for matrix operations.

The alternative formulation uses the Markovian state transition probabilities. This approach, however, has a number of drawbacks. If the transition probabilities $p_{ij}(u)$ were stored as a 3-D matrix $P_{iju}$, the size of this matrix would grow quadratically with the size of the state space, while the size of the matrices used to implement the system equation grows only linearly. Furthermore, this matrix would be very sparse, i.e.\ only a few entries would be non-zero. One would therefore need a more space-efficient representation of the transition probabilities and would then not be able to use a matrix library such as \textit{NumPy} to accelerate the computations.

The one-step costs in my implementation depend only on the target state, i.e.\ $g(x, u, w) = g(f(x, u, w))$. Therefore the one-step cost functions are represented as vectors $G^1_x$ and $G^2_x$, where the goal state has a lower cost than the rest of the states and the trap state incurs a high penalty. This formulation differs slightly from the formulation in the task, where for $g_2$ only the \textit{self-loop} in the goal state is free. However, this difference does not affect the resulting policy and only has a significant influence on the cost function of the states directly adjacent to the goal state.
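For illustration, the following sketch shows how such a representation could look in \textit{NumPy}. The sizes, variable names and initial values are hypothetical placeholders; in the actual code they are derived from the maze layout.

\begin{verbatim}
import numpy as np

# Hypothetical sizes; in the actual code they follow from the maze layout.
n_states, n_actions, n_dist = 23, 5, 5

# F[x, u, w]: index of the state reached from x under action u and
# disturbance w (to be filled in from the maze layout).
F = np.zeros((n_states, n_actions, n_dist), dtype=int)

# P[x, u, w]: probability of disturbance w given (x, u); zero whenever
# the combination (x, u, w) is impossible.
P = np.zeros((n_states, n_actions, n_dist))

# U[x, u]: True if action u is allowed in state x.
U = np.zeros((n_states, n_actions), dtype=bool)

# G1[x], G2[x]: one-step costs that depend only on the target state.
G1 = np.zeros(n_states)
G2 = np.ones(n_states)

# Expected one-step cost of every (x, u) pair under G1; the disturbance
# dimension is summed out in one vectorized NumPy expression.
expected_cost = np.sum(P * G1[F], axis=2)
expected_cost[~U] = np.inf  # mask actions that are not allowed
\end{verbatim}

The last two lines show the main benefit of this layout: the expected one-step cost of every state--action pair is obtained with a single vectorized expression, and disallowed actions are masked so that they never influence a minimization.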
If the one-step cost did depend on the action used to enter the goal state (i.e.\ a self-loop versus a transition from an adjacent state), the one-step cost could not be stored as a vector; a 2-D matrix would be needed instead, which would introduce unnecessary complexity into the code.

A policy is implemented as a vector $\Pi_x$, where the $x$-th element contains the index of the action that will be taken in state $x$. The convergence criteria differ for Value Iteration and Policy Iteration. The most sensible convergence criterion for Policy Iteration is that the policy stops changing between iterations of the algorithm, i.e.\ $\pi_{k+1} = \pi_k$. For Value Iteration I use the common criterion $\|J_{k+1} - J_k\|_{\infty} < \epsilon$. The value of $\epsilon$ depends on the discount factor $\alpha$; the relation $\epsilon = \alpha^{|S|}$, where $|S|$ is the number of possible states, has been found empirically to provide good results.

\section{Algorithm inspection}

For visualization I used a non-linear scale for the cost function: each distinct value in the cost vector was assigned a different color, so that the distinct values remain clearly visible even for small values of $\alpha$. The unnormalized representation is also provided as a reference.

If the termination criterion for Value Iteration is chosen correctly, i.e.\ the algorithm only terminates once it has converged to an optimal policy, then both PI and VI result in the same policy. The one-step cost $g_2$ is a constant shift of $g_1$ by $1$, except at the trap state. For this reason $g_1$ and $g_2$ produce the same result for most values of $\alpha$; however, there exist values of $\alpha$ for which the two one-step costs produce different policies in the vicinity of the trap. In general, the behavior under both one-step costs depends on $\alpha$: for large $\alpha$ the algorithms may favor risking the trap over going around it, whereas for smaller $\alpha$ the resulting policy plays it safe.

Furthermore, for very small $\alpha$, e.g.\ $0.01$, machine precision starts to play a role. A double-precision floating-point variable can store numbers over a large range of magnitudes, but its precision is limited by the 52-bit fraction. Precision is not an issue for $g_1$, because the negative cost of the goal state is propagated through the maze as a number of ever-decreasing magnitude, since the one-step costs inside the maze are $0$. For $g_2$, however, the dominating term of the cost function is the one-step cost of $1$ for the non-goal states, so the cost-free goal state is propagated as an ever-decreasing additive term, and the distance of this propagation is limited by the precision of the floating-point variable used to store the cost function. Hence, the algorithms may not converge to the optimal policy when $g_2$ is used in conjunction with small values of $\alpha$.

To compare Value Iteration and Policy Iteration I used a wide range of discount factors; the values I chose are $0.99$, $0.7$ and $0.1$. These values demonstrate the impact that $\alpha$ has on the optimization. For large $\alpha$ it can be seen that both algorithms stagnate for several iterations, after which they converge rapidly to the optimal policy and cost function. With decreasing $\alpha$ this effect becomes less pronounced and the algorithms converge more steadily.
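Before turning to the comparison graphs, the following sketch shows how the Value Iteration update and the stopping rule described in the previous section might look on top of the matrices introduced earlier; the function and variable names are again hypothetical.

\begin{verbatim}
import numpy as np

def value_iteration(F, P, U, G, alpha):
    """Minimal Value Iteration sketch on the matrices described above."""
    n_states = F.shape[0]
    # Expected one-step cost of each (x, u) pair; disallowed actions are
    # masked with +inf so that they never win the minimization.
    Q_cost = np.sum(P * G[F], axis=2)
    Q_cost[~U] = np.inf

    eps = alpha ** n_states              # empirically chosen threshold
    J = np.zeros(n_states)
    while True:
        # Bellman backup: expected cost-to-go of every (x, u) pair.
        Q = Q_cost + alpha * np.sum(P * J[F], axis=2)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < eps:  # sup-norm criterion
            return J_new, Q.argmin(axis=1)   # cost and greedy policy
        J = J_new
\end{verbatim}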
From these graphs it is apparent that Policy Iteration converges in two to three times fewer iterations than Value Iteration. Surprisingly, the number of iterations does not seem to depend on the discount factor, which could mean that the given maze problem is small and simple enough that $\alpha$ does not have to be chosen particularly carefully. Furthermore, the one-step cost $g_2$ allows both algorithms to converge faster. It is natural that PI converges in fewer iterations than VI, since the policy is guaranteed to improve on every iteration. However, finding the exact cost function $J_{\pi_k}$ on every iteration can become expensive as the state space grows. The given maze is small, so Policy Iteration remains affordable.

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/a09.png}
  \includegraphics[width=\linewidth]{figures/a09_norm.png}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/a05.png}
  \includegraphics[width=\linewidth]{figures/a05_norm.png}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/a001.png}
  \includegraphics[width=\linewidth]{figures/a001_norm.png}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/vi1.png}
\end{figure}

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{figures/pi1.png}
\end{figure}
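To make this trade-off concrete, the sketch below shows a minimal Policy Iteration loop in the same style as the earlier sketches; the exact policy evaluation is the linear solve, whose cost grows with the size of the state space. All names and shapes are again hypothetical.

\begin{verbatim}
import numpy as np

def policy_iteration(F, P, U, G, alpha):
    """Minimal Policy Iteration sketch on the same matrices."""
    n_states, _, n_dist = F.shape
    Q_cost = np.sum(P * G[F], axis=2)
    Q_cost[~U] = np.inf

    pi = U.argmax(axis=1)                # start with any allowed action
    while True:
        # Policy evaluation: build the transition matrix of the current
        # policy and solve (I - alpha * P_pi) J = g_pi exactly.
        P_pi = np.zeros((n_states, n_states))
        for x in range(n_states):
            for w in range(n_dist):
                P_pi[x, F[x, pi[x], w]] += P[x, pi[x], w]
        g_pi = Q_cost[np.arange(n_states), pi]
        J = np.linalg.solve(np.eye(n_states) - alpha * P_pi, g_pi)

        # Policy improvement: greedy action with respect to J.
        pi_new = (Q_cost + alpha * np.sum(P * J[F], axis=2)).argmin(axis=1)
        if np.array_equal(pi_new, pi):   # policy stopped changing
            return J, pi
        pi = pi_new
\end{verbatim}

The linear solve makes each iteration more expensive than a single Value Iteration sweep, but, as discussed above, far fewer iterations are needed for a problem of this size.

\end{document}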