final touches to the report, I think?
report.latex
@@ -1,14 +1,20 @@
\documentclass{article}

\usepackage[a4paper, margin=1in]{geometry}
\usepackage{amsmath}

\usepackage{fancyhdr}
\pagestyle{fancy}

\usepackage{lastpage}
\usepackage{graphicx}
% \graphicspath{{./figures}}

\cfoot{Page \thepage\ of \pageref{LastPage}}
\rhead{Pavel Lutskov, 03654990}
\lhead{Programming Assignment}

\setlength{\parskip}{1em}

\title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
\Large Programming Assignment}
% \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
the element $(x,u,w)$ of matrix $P$ gives the probability of the disturbance
$w$, when action $u$ is taken in state $x$, being equal to zero if such
configuration of $(x,u,w)$ is impossible. These matrices are initialized before
-the execution of the dynamic programming algorithm begins.
+the execution of the dynamic programming algorithm begins. \par

The advantage of such formulation is the possibility to accelerate the
computations by leveraging the \textit{NumPy} library for matrix operations.
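As a rough sketch of how this matrix formulation can be exploited with NumPy, one value-iteration backup reduces to a single vectorized expression. The array names P, G and U below, and the reading of the disturbance index w as the successor state, are my own assumptions and not taken from the submitted code:

    import numpy as np

    def value_iteration_step(J, P, G, U, alpha):
        # Assumed layout (illustrative only):
        #   P[x, u, w] - probability of disturbance w for action u in state x,
        #                with w read here as the successor state,
        #   G[x, u, w] - one-step cost of that transition,
        #   U[x, u]    - 1 if action u is allowed in state x, 0 otherwise,
        #   J          - current cost vector over all states.
        Q = np.einsum('xuw,xuw->xu', P, G + alpha * J[None, None, :])
        Q = np.where(U.astype(bool), Q, np.inf)  # mask out disallowed actions
        return Q.min(axis=1)                     # minimize over actions

The single einsum call replaces per-state Python loops, which is presumably where the speed-up comes from.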
@@ -76,6 +82,9 @@ good results.

\section{Algorithm inspection}

+For Value Iteration algorithm I initialize the cost function to zero vector; in
+Policy Iteration the $idle$ policy is used as a starting point.
+
For visualization I used a non-linear scale for the cost function. Each
different value in the cost vector was assigned a different color in order to
ensure, that for small values for $\alpha$ the distinct values could be clearly
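The plotting code is not part of this commit; one possible reading of the per-value colouring described above is a simple rank transform of the cost vector. The name maze_shape and the colormap choice are assumptions for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_cost(J, maze_shape):
        # Rank-transform the costs: every distinct value gets its own colour bin,
        # so nearly identical costs (as with small alpha) remain distinguishable.
        _, ranks = np.unique(J, return_inverse=True)
        plt.imshow(ranks.reshape(maze_shape), cmap='viridis')
        plt.colorbar(label='cost rank (non-linear scale)')
        plt.show()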
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
an ever-decreasing additive term, and the distance of the propagation is
restricted by the precision of the floating point variable used to store the
cost function. Hence, the algorithms may not converge to the optimal policy,
-when $g_2$ is used in conjunction with small values of $\alpha$.
+when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
+for $\alpha=0.01$ demonstrate this problem.
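A back-of-the-envelope check of this argument (my own numbers, assuming $g_2$ charges a constant cost of 1 in every non-goal state): a state $k$ steps from the cost-free goal has cost $(1-\alpha^k)/(1-\alpha)$, and the goal's contribution $\alpha^k/(1-\alpha)$ falls below float64 resolution after only a few steps when $\alpha=0.01$.

    alpha = 0.01
    base = 1.0 / (1 - alpha)                      # a state "infinitely far" from the goal
    for k in range(1, 20):
        with_goal = (1 - alpha**k) / (1 - alpha)  # a state k steps from the goal
        if with_goal == base:                     # goal term lost to float64 rounding
            print(f'goal no longer visible beyond {k} steps')
            break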

For comparison of Value Iteration and Policy Iteration I used a wide range of
$\alpha$, the values that I chose are $0.99$, $0.7$ and $0.1$. Using these
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
Iteration. Surprisingly, the number of iterations doesn't seem to depend on the
discount factor, which could mean that the given maze problem is small and
simple enough, so we don't have to care about choosing the $\alpha$ carefully.
-Furthermore, the one-step cost $g_2$ allows both algorithms to converge faster.
+Furthermore, the one-step cost $g_2$ allows both algorithms to converge
+slightly faster.

It is natural, that PI converges in less iterations than VI, since policy is
guaranteed to improve on each iteration. However, finding the exact cost
function $J_{\pi_k}$ on each iteration can get expensive, when the state space
-grows. However, the given maze is small, so it is affordable to use the PI.
+grows. However, the given maze is small, so it is preferable to use the PI to
+solve this problem.
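For reference, a minimal sketch of one policy-iteration step under the same assumed P/G/U layout as in the earlier snippet; the exact policy evaluation is the linear solve that becomes expensive as the state space grows:

    import numpy as np

    def policy_iteration_step(policy, P, G, U, alpha):
        n = P.shape[0]
        idx = np.arange(n)
        # Transition matrix and expected one-step cost under the current policy.
        P_pi = P[idx, policy, :]                              # shape (n, n)
        g_pi = np.einsum('xw,xw->x', P_pi, G[idx, policy, :])
        # Exact policy evaluation: solve (I - alpha * P_pi) J = g_pi.
        J = np.linalg.solve(np.eye(n) - alpha * P_pi, g_pi)
        # Policy improvement: greedy in J, with disallowed actions masked out.
        Q = np.einsum('xuw,xuw->xu', P, G + alpha * J[None, None, :])
        return np.where(U.astype(bool), Q, np.inf).argmin(axis=1), J

The starting points mentioned above would then simply be J = np.zeros(n) for Value Iteration and policy = np.full(n, IDLE) for Policy Iteration, where IDLE stands for whatever index the idle action has in the actual implementation.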

\begin{figure}
\includegraphics[width=\linewidth]{figures/a09.png}