final touches to the report, I think?

2020-08-02 12:19:02 +02:00
parent 6fe398195c
commit c1f3e98c76


@@ -1,14 +1,20 @@
\documentclass{article}
\usepackage[a4paper, margin=1in]{geometry}
\usepackage{amsmath}
\usepackage{fancyhdr}
\pagestyle{fancy}
\usepackage{lastpage}
\usepackage{graphicx}
% \graphicspath{{./figures}}
\cfoot{Page \thepage\ of \pageref{LastPage}}
\rhead{Pavel Lutskov, 03654990}
\lhead{Programming Assignment}
\setlength{\parskip}{1em}
\title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
\Large Programming Assignment}
% \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
the element $(x,u,w)$ of the matrix $P$ gives the probability of the disturbance
$w$ when action $u$ is taken in state $x$; this probability is zero if such a
configuration of $(x,u,w)$ is impossible. These matrices are initialized before
the execution of the dynamic programming algorithm begins. \par
The advantage of such a formulation is that the computations can be accelerated
by leveraging the \textit{NumPy} library for matrix operations.
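As an illustration of this formulation, a minimal sketch of one vectorized
backup over all states could look as follows; the cost array \texttt{G}, the
successor array \texttt{nxt} and the exact shapes are assumptions made for this
sketch, not necessarily the layout of the submitted code.
\begin{verbatim}
import numpy as np

# Assumed shapes (illustrative only):
#   P[x, u, w]    -- probability of disturbance w for action u in state x
#   G[x, u, w]    -- one-step cost of that (x, u, w) configuration
#   nxt[x, u, w]  -- index of the successor state
#   allowed[x, u] -- True if action u is admissible in state x
def backup(J, P, G, nxt, allowed, alpha):
    # Expected discounted cost of every (x, u) pair in one NumPy expression
    Q = np.sum(P * (G + alpha * J[nxt]), axis=2)
    Q[~allowed] = np.inf          # forbidden actions are never selected
    return Q.min(axis=1), Q.argmin(axis=1)
\end{verbatim}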
@@ -76,6 +82,9 @@ good results.
\section{Algorithm inspection}
For the Value Iteration algorithm I initialize the cost function to the zero
vector; in Policy Iteration the $idle$ policy is used as a starting point.
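For illustration, these two starting points could be set up as sketched below;
the variable names and the index of the idle action are assumptions of this
sketch.
\begin{verbatim}
import numpy as np

n_states = 100   # placeholder; in practice the number of maze states
IDLE = 0         # assumed index of the "idle" action

J0  = np.zeros(n_states)                   # Value Iteration: zero cost vector
pi0 = np.full(n_states, IDLE, dtype=int)   # Policy Iteration: idle everywhere
\end{verbatim}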
For visualization I used a non-linear scale for the cost function. Each
different value in the cost vector was assigned a different color in order to
ensure that, for small values of $\alpha$, the distinct values could be clearly
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
an ever-decreasing additive term, and the distance of the propagation is
restricted by the precision of the floating-point variable used to store the
cost function. Hence, the algorithms may not converge to the optimal policy
when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
for $\alpha=0.01$ demonstrate this problem.
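To see the scale of this effect, assume for simplicity that $g_2$ charges a
uniform cost of $1$ in every non-goal state; then a state $d$ steps from the
goal has cost-to-go $(1-\alpha^d)/(1-\alpha)$ under a shortest-path policy, and
for $\alpha=0.01$ the term $\alpha^d$ vanishes in \texttt{float64} after only a
handful of steps. The following sketch (an illustration under the stated
assumption, not the submitted code) makes this visible:
\begin{verbatim}
import numpy as np

alpha = 0.01
# Cost-to-go of a state d steps from the cost-free goal along the shortest
# path: 1 + alpha + ... + alpha**(d-1) = (1 - alpha**d) / (1 - alpha)
J = np.array([(1 - alpha**d) / (1 - alpha) for d in range(1, 15)])

# Past roughly 8 steps alpha**d drops below the resolution of float64, so the
# costs of more distant states become numerically identical and the influence
# of the goal stops propagating.
print(np.diff(J))       # the differences decay and eventually hit exactly 0.0
print(J[8] == J[9])     # True for alpha = 0.01
\end{verbatim}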
For the comparison of Value Iteration and Policy Iteration I used a wide range
of $\alpha$; the values I chose are $0.99$, $0.7$ and $0.1$. Using these
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
Iteration. Surprisingly, the number of iterations does not seem to depend on the
discount factor, which could mean that the given maze problem is small and
simple enough that we do not need to choose $\alpha$ particularly carefully.
Furthermore, the one-step cost $g_2$ allows both algorithms to converge
slightly faster.
It is natural that PI converges in fewer iterations than VI, since the policy is
guaranteed to improve on each iteration. However, finding the exact cost
function $J_{\pi_k}$ on each iteration can get expensive when the state space
grows. The given maze, however, is small, so it is preferable to use PI to
solve this problem.
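One way to make this trade-off concrete: exact policy evaluation amounts to
solving a linear system of the size of the state space, which scales roughly
cubically, whereas a single VI sweep is only a vectorized backup. A hedged
sketch of such an evaluation step is given below; the names \texttt{P\_pi} and
\texttt{g\_pi} (transition matrix and one-step cost under the fixed policy
$\pi_k$) are assumptions of this illustration.
\begin{verbatim}
import numpy as np

def evaluate_policy(P_pi, g_pi, alpha):
    # Solves J = g_pi + alpha * P_pi @ J exactly, i.e.
    # (I - alpha * P_pi) J = g_pi -- an O(n^3) dense linear solve in general,
    # which is what makes each PI iteration costly for large state spaces.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P_pi, g_pi)
\end{verbatim}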
\begin{figure}
\includegraphics[width=\linewidth]{figures/a09.png}