diff --git a/report.latex b/report.latex
index c0c4b44..61c66a9 100644
--- a/report.latex
+++ b/report.latex
@@ -1,14 +1,20 @@
 \documentclass{article}
+
 \usepackage[a4paper, margin=1in]{geometry}
 \usepackage{amsmath}
+
 \usepackage{fancyhdr}
 \pagestyle{fancy}
+
 \usepackage{lastpage}
 \usepackage{graphicx}
-% \graphicspath{{./figures}}
+
 \cfoot{Page \thepage\ of \pageref{LastPage}}
 \rhead{Pavel Lutskov, 03654990}
 \lhead{Programming Assignment}
+
+\setlength{\parskip}{1em}
+
 \title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
 \Large Programming Assignment}
 % \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
 the element $(x,u,w)$ of matrix $P$ gives the probability of the disturbance
 $w$ when action $u$ is taken in state $x$, and is zero if such a configuration
 of $(x,u,w)$ is impossible. These matrices are initialized before
-the execution of the dynamic programming algorithm begins.
+the execution of the dynamic programming algorithm begins. \par
 The advantage of such a formulation is that the computations can be
 accelerated by leveraging the \textit{NumPy} library for matrix operations.
 
@@ -49,7 +55,7 @@ the transition probabilities, and therefore wouldn't be able to use a matrix
 library such as \textit{NumPy} to accelerate the computations.
 
 The one-step costs in my implementation only depend on the target state,
-meaning $g(x, u, w) = g(f(x, u, w))$, therefore the one-step cost functions are
+meaning $g(x,u,w) = g(f(x,u,w))$; therefore, the one-step cost functions are
 represented as vectors $G_x^1$ and $G_x^2$, where the goal state has a lower
 cost than the rest of the states, and the trap state incurs a high penalty.
 This formulation differs slightly from the formulation in the task, where for
@@ -76,6 +82,9 @@ good results.
 
 \section{Algorithm inspection}
 
+For the Value Iteration algorithm I initialize the cost function to the zero
+vector; in Policy Iteration the \textit{idle} policy is the starting point.
+
 For visualization I used a non-linear scale for the cost function. Each
 different value in the cost vector was assigned a different color in order to
 ensure that for small values of $\alpha$ the distinct values could be clearly
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
 an ever-decreasing additive term, and the distance of the propagation is
 restricted by the precision of the floating point variable used to store the
 cost function. Hence, the algorithms may not converge to the optimal policy
-when $g_2$ is used in conjunction with small values of $\alpha$.
+when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
+for $\alpha=0.01$ demonstrate this problem.
 
 For the comparison of Value Iteration and Policy Iteration I used a wide range
 of $\alpha$; the values that I chose are $0.99$, $0.7$ and $0.1$. Using these
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
 Iteration. Surprisingly, the number of iterations doesn't seem to depend on the
 discount factor, which could mean that the given maze problem is small and
 simple enough, so we don't have to choose $\alpha$ particularly carefully.
-Furthermore, the one-step cost $g_2$ allows both algorithms to converge faster.
+Furthermore, the one-step cost $g_2$ allows both algorithms to converge
+slightly faster.
 
 It is natural that PI converges in fewer iterations than VI, since the policy
 is guaranteed to improve on each iteration. However, finding the exact cost
 function $J_{\pi_k}$ on each iteration can get expensive when the state space
-grows. However, the given maze is small, so it is affordable to use the PI.
+grows. Since the given maze is small, it is preferable to use PI to solve
+this problem.
 
 \begin{figure}
 \includegraphics[width=\linewidth]{figures/a09.png}
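
To make the matrix formulation concrete, the following is a minimal NumPy sketch of the vectorized Value Iteration update that such a formulation allows, with the cost function initialized to the zero vector as stated in the report. The names value_iteration, P (disturbance probabilities), F (successor states $f(x,u,w)$), allowed (permitted actions) and g (one-step cost vector) are illustrative assumptions about the data layout, not the report's actual code.

import numpy as np

def value_iteration(P, F, allowed, g, alpha, tol=1e-9, max_iter=10_000):
    """Vectorized Value Iteration for the assumed matrix formulation.

    P       -- (n_x, n_u, n_w) array; P[x, u, w] is the probability of
               disturbance w when action u is taken in state x (zero for
               impossible (x, u, w) configurations)
    F       -- (n_x, n_u, n_w) integer array; F[x, u, w] = f(x, u, w) is the
               index of the successor state
    allowed -- (n_x, n_u) boolean array; True if action u is allowed in state x
    g       -- (n_x,) one-step cost vector; the cost depends only on the
               target state, i.e. g(x, u, w) = g(f(x, u, w))
    alpha   -- discount factor
    """
    n_x = P.shape[0]
    J = np.zeros(n_x)  # cost function initialized to the zero vector
    for _ in range(max_iter):
        # Expected cost-to-go of every (state, action) pair in one shot:
        # Q[x, u] = sum_w P[x, u, w] * (g(f(x, u, w)) + alpha * J(f(x, u, w)))
        Q = np.sum(P * (g[F] + alpha * J[F]), axis=2)
        Q[~allowed] = np.inf          # forbid actions not allowed in a state
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    return J, Q.argmin(axis=1)        # cost function and greedy policy

Indexing g and J with the integer array F evaluates $g(f(x,u,w))$ and $J(f(x,u,w))$ for all $(x,u,w)$ triples at once, which is where the speedup from the matrix formulation comes from.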
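
The remark that computing the exact cost function $J_{\pi_k}$ becomes expensive as the state space grows can also be made concrete: evaluating a fixed policy amounts to solving the linear system $(I - \alpha P_\pi) J_\pi = g_\pi$, whose cost grows roughly cubically with the number of states. The sketch below uses the same assumed layout as above; evaluate_policy and its arguments are hypothetical names, not the report's implementation.

import numpy as np

def evaluate_policy(P, F, g, policy, alpha):
    """Exact policy evaluation: solve (I - alpha * P_pi) J_pi = g_pi.

    P, F and g follow the layout assumed in the value_iteration sketch;
    policy is an (n_x,) integer array mapping each state to an action.
    """
    n_x = P.shape[0]
    idx = np.arange(n_x)
    P_u = P[idx, policy]   # (n_x, n_w) disturbance probabilities under pi
    F_u = F[idx, policy]   # (n_x, n_w) successor states under pi
    # Build the state-to-state transition matrix P_pi and the expected
    # one-step cost g_pi.
    P_pi = np.zeros((n_x, n_x))
    np.add.at(P_pi, (idx[:, None], F_u), P_u)
    g_pi = (P_u * g[F_u]).sum(axis=1)
    # Solving this dense n_x-by-n_x linear system (roughly O(n_x^3)) is the
    # step that makes each PI iteration more expensive than a VI sweep.
    return np.linalg.solve(np.eye(n_x) - alpha * P_pi, g_pi)

The np.add.at call accumulates probabilities when several disturbances lead to the same successor state; a plain fancy-indexed assignment would silently drop such duplicates.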