final touches to the report, I think?

2020-08-02 12:19:02 +02:00
parent 6fe398195c
commit c1f3e98c76


@@ -1,14 +1,20 @@
\documentclass{article}
\usepackage[a4paper, margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{fancyhdr}
\pagestyle{fancy}
\usepackage{lastpage}
\usepackage{graphicx}
% \graphicspath{{./figures}}
\cfoot{Page \thepage\ of \pageref{LastPage}}
\rhead{Pavel Lutskov, 03654990}
\lhead{Programming Assignment}
\setlength{\parskip}{1em}
\title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
\Large Programming Assignment}
% \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
the element $(x,u,w)$ of matrix $P$ gives the probability of the disturbance
$w$ when action $u$ is taken in state $x$, and is zero if such a
configuration of $(x,u,w)$ is impossible. These matrices are initialized before
the execution of the dynamic programming algorithm begins. \par
The advantage of such a formulation is that the computations can be
accelerated by leveraging the \textit{NumPy} library for matrix operations.
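A minimal sketch of such a vectorized backup (the sizes are hypothetical, the
maze data is replaced by random placeholders, and the disturbance $w$ is
assumed to index the successor state directly, which need not match the
actual implementation):
\begin{verbatim}
import numpy as np

n, m = 25, 4                        # hypothetical numbers of states and actions
rng = np.random.default_rng(0)      # placeholder data instead of the real maze

U = np.ones((n, m), dtype=bool)     # U[x, u]: is action u allowed in state x?
P = rng.random((n, m, n))           # P[x, u, w]: probability of disturbance w
P /= P.sum(axis=2, keepdims=True)   # normalize over disturbances

G = np.ones(n)                      # one-step cost per target state
alpha = 0.9                         # discount factor
J = np.zeros(n)                     # current cost function

# One Bellman backup over all states at once:
# Q[x, u] = sum_w P[x, u, w] * (G[w] + alpha * J[w])
Q = P @ (G + alpha * J)             # shape (n, m)
Q[~U] = np.inf                      # disallowed actions can never be chosen
J_new = Q.min(axis=1)               # minimize over actions
\end{verbatim}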
@@ -49,7 +55,7 @@ the transition probabilities, and therefore wouldn't be able to use a matrix
library such as \textit{NumPy} for acceleration of computations.
The one-step costs in my implementation only depend on the target state,
meaning $g(x,u,w) = g(f(x,u,w))$; therefore, the one-step cost functions are
represented as vectors $G_x^1$ and $G_x^2$, where the goal state has a lower
cost than the rest of the states, and the trap state incurs a high penalty.
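A sketch of this representation (the state count, the goal and trap indices,
and the concrete cost values are made up for illustration):
\begin{verbatim}
import numpy as np

n = 25                  # hypothetical number of maze states
goal, trap = 24, 12     # hypothetical goal and trap indices

G1 = np.ones(n)         # regular states share the same step cost
G1[goal] = 0.0          # the goal state is cheaper than the rest
G1[trap] = 50.0         # the trap state incurs a high penalty
# G2 would be built analogously with its own cost values.
\end{verbatim}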
This formulation differs slightly from the formulation in the task, where for
@@ -76,6 +82,9 @@ good results.
\section{Algorithm inspection}
For the Value Iteration algorithm I initialize the cost function to the zero
vector; in Policy Iteration the $idle$ policy is used as the starting point.
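In code, the two starting points amount to something like the following
(again with a hypothetical state count and a hypothetical index for the idle
action):
\begin{verbatim}
import numpy as np

n = 25                   # hypothetical number of states
IDLE = 0                 # hypothetical index of the idle action

J0 = np.zeros(n)         # Value Iteration: zero cost vector
pi0 = np.full(n, IDLE)   # Policy Iteration: the idle policy
\end{verbatim}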
For visualization I used a non-linear scale for the cost function. Each
distinct value in the cost vector was assigned its own color in order to
ensure that for small values of $\alpha$ the distinct values could be clearly
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
an ever-decreasing additive term, and the distance of the propagation is
restricted by the precision of the floating-point variable used to store the
cost function. Hence, the algorithms may not converge to the optimal policy
when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
for $\alpha=0.01$ demonstrate this problem.
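The effect is easy to check numerically; this snippet (with a unit step cost
standing in for the actual cost values) shows at which path length the goal's
contribution $\alpha^k$ is swallowed by \texttt{float64} rounding:
\begin{verbatim}
alpha, g = 0.01, 1.0        # small discount factor, unit step cost

base = g / (1 - alpha)      # cost of never reaching the goal
for k in range(5, 10):
    # once alpha**k drops below the rounding resolution of base,
    # states k steps away can no longer "see" the goal
    print(k, base + alpha ** k == base)
\end{verbatim}
With $\alpha=0.01$ the additive term vanishes after only about eight steps.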
For the comparison of Value Iteration and Policy Iteration I used a wide
range of $\alpha$; the values I chose are $0.99$, $0.7$ and $0.1$. Using these
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
Iteration. Surprisingly, the number of iterations doesn't seem to depend on the
discount factor, which could mean that the given maze problem is small and
simple enough that we don't have to choose $\alpha$ carefully.
Furthermore, the one-step cost $g_2$ allows both algorithms to converge
slightly faster.
It is natural that PI converges in fewer iterations than VI, since the policy
is guaranteed to improve on each iteration. However, finding the exact cost
function $J_{\pi_k}$ on each iteration can become expensive when the state
space grows. The given maze, however, is small, so it is preferable to use PI
to solve this problem.
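The expensive step is the exact policy evaluation, which amounts to solving
an $n \times n$ linear system; a sketch of it under placeholder transition
data (the names \texttt{P\_pi} and \texttt{g\_pi} are illustrative):
\begin{verbatim}
import numpy as np

n = 25                                   # hypothetical number of states
rng = np.random.default_rng(0)
P_pi = rng.random((n, n))                # transition matrix under policy pi
P_pi /= P_pi.sum(axis=1, keepdims=True)
g_pi = np.ones(n)                        # one-step costs under pi
alpha = 0.9

# Exact evaluation: solve (I - alpha * P_pi) J = g_pi.
# This O(n^3) solve is what dominates a PI iteration as n grows.
J_pi = np.linalg.solve(np.eye(n) - alpha * P_pi, g_pi)
\end{verbatim}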
\begin{figure}
\includegraphics[width=\linewidth]{figures/a09.png}