final touches to the report, I think?

2020-08-02 12:19:02 +02:00
parent 6fe398195c
commit c1f3e98c76


@@ -1,14 +1,20 @@
\documentclass{article}
\usepackage[a4paper, margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{fancyhdr}
\pagestyle{fancy}
\usepackage{lastpage}
\usepackage{graphicx}
% \graphicspath{{./figures}}
\cfoot{Page \thepage\ of \pageref{LastPage}}
\rhead{Pavel Lutskov, 03654990}
\lhead{Programming Assignment}
\setlength{\parskip}{1em}
\title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
\Large Programming Assignment}
% \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
the element $(x,u,w)$ of the matrix $P$ gives the probability of the disturbance
$w$ when action $u$ is taken in state $x$; this probability is zero if such a
configuration of $(x,u,w)$ is impossible. These matrices are initialized before
the execution of the dynamic programming algorithm begins. \par
The advantage of such a formulation is that the computations can be accelerated
by leveraging the \textit{NumPy} library for matrix operations.
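As an illustration of this formulation, a minimal sketch of one vectorized
backup over all states could look as follows; the cost array \texttt{G}, the
successor array \texttt{nxt} and the exact shapes are assumptions made for this
sketch, not necessarily the layout of the submitted code.
\begin{verbatim}
import numpy as np

# Assumed shapes (illustrative only):
#   P[x, u, w]    -- probability of disturbance w for action u in state x
#   G[x, u, w]    -- one-step cost of that (x, u, w) configuration
#   nxt[x, u, w]  -- index of the successor state
#   allowed[x, u] -- True if action u is admissible in state x
def backup(J, P, G, nxt, allowed, alpha):
    # Expected discounted cost of every (x, u) pair in one NumPy expression
    Q = np.sum(P * (G + alpha * J[nxt]), axis=2)
    Q[~allowed] = np.inf          # forbidden actions are never selected
    return Q.min(axis=1), Q.argmin(axis=1)
\end{verbatim}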
@@ -76,6 +82,9 @@ good results.
\section{Algorithm inspection}
For the Value Iteration algorithm I initialize the cost function to the zero
vector; in Policy Iteration the $idle$ policy is used as a starting point.
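For illustration, these two starting points could be set up as sketched below;
the variable names and the index of the idle action are assumptions of this
sketch.
\begin{verbatim}
import numpy as np

n_states = 100   # placeholder; in practice the number of maze states
IDLE = 0         # assumed index of the "idle" action

J0  = np.zeros(n_states)                   # Value Iteration: zero cost vector
pi0 = np.full(n_states, IDLE, dtype=int)   # Policy Iteration: idle everywhere
\end{verbatim}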
For visualization I used a non-linear scale for the cost function. Each
different value in the cost vector was assigned a different color in order to
ensure that, for small values of $\alpha$, the distinct values could be clearly
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
an ever-decreasing additive term, and the distance of the propagation is
restricted by the precision of the floating-point variable used to store the
cost function. Hence, the algorithms may not converge to the optimal policy
when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
for $\alpha=0.01$ demonstrate this problem.
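To see the scale of this effect, assume for simplicity that $g_2$ charges a
uniform cost of $1$ in every non-goal state; then a state $d$ steps from the
goal has cost-to-go $(1-\alpha^d)/(1-\alpha)$ under a shortest-path policy, and
for $\alpha=0.01$ the term $\alpha^d$ vanishes in \texttt{float64} after only a
handful of steps. The following sketch (an illustration under the stated
assumption, not the submitted code) makes this visible:
\begin{verbatim}
import numpy as np

alpha = 0.01
# Cost-to-go of a state d steps from the cost-free goal along the shortest
# path: 1 + alpha + ... + alpha**(d-1) = (1 - alpha**d) / (1 - alpha)
J = np.array([(1 - alpha**d) / (1 - alpha) for d in range(1, 15)])

# Past roughly 8 steps alpha**d drops below the resolution of float64, so the
# costs of more distant states become numerically identical and the influence
# of the goal stops propagating.
print(np.diff(J))       # the differences decay and eventually hit exactly 0.0
print(J[8] == J[9])     # True for alpha = 0.01
\end{verbatim}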
For the comparison of Value Iteration and Policy Iteration I used a wide range
of $\alpha$; the values I chose are $0.99$, $0.7$ and $0.1$. Using these
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
Iteration. Surprisingly, the number of iterations does not seem to depend on the
discount factor, which could mean that the given maze problem is small and
simple enough that we do not need to choose $\alpha$ particularly carefully.
Furthermore, the one-step cost $g_2$ allows both algorithms to converge
slightly faster.
It is natural that PI converges in fewer iterations than VI, since the policy is
guaranteed to improve on each iteration. However, finding the exact cost
function $J_{\pi_k}$ on each iteration can get expensive when the state space
grows. The given maze, however, is small, so it is preferable to use PI to
solve this problem.
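One way to make this trade-off concrete: exact policy evaluation amounts to
solving a linear system of the size of the state space, which scales roughly
cubically, whereas a single VI sweep is only a vectorized backup. A hedged
sketch of such an evaluation step is given below; the names \texttt{P\_pi} and
\texttt{g\_pi} (transition matrix and one-step cost under the fixed policy
$\pi_k$) are assumptions of this illustration.
\begin{verbatim}
import numpy as np

def evaluate_policy(P_pi, g_pi, alpha):
    # Solves J = g_pi + alpha * P_pi @ J exactly, i.e.
    # (I - alpha * P_pi) J = g_pi -- an O(n^3) dense linear solve in general,
    # which is what makes each PI iteration costly for large state spaces.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P_pi, g_pi)
\end{verbatim}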
\begin{figure}
\includegraphics[width=\linewidth]{figures/a09.png}