final touches to the report, I think?
report.latex: 24 lines changed
@@ -1,14 +1,20 @@
 \documentclass{article}
+
 \usepackage[a4paper, margin=1in]{geometry}
 \usepackage{amsmath}
+
 \usepackage{fancyhdr}
 \pagestyle{fancy}
+
 \usepackage{lastpage}
 \usepackage{graphicx}
-% \graphicspath{{./figures}}
+
 \cfoot{Page \thepage\ of \pageref{LastPage}}
 \rhead{Pavel Lutskov, 03654990}
 \lhead{Programming Assignment}
+
+\setlength{\parskip}{1em}
+
 \title{\huge Approximate Dynamic Programming and Reinforcement Learning \\
 \Large Programming Assignment}
 % \subtitle{Assignment 1}
@@ -34,7 +40,7 @@ value, indicating whether action $u$ is allowed in the state $x$. Furthermore,
 the element $(x,u,w)$ of matrix $P$ gives the probability of the disturbance
 $w$, when action $u$ is taken in state $x$, being equal to zero if such
 configuration of $(x,u,w)$ is impossible. These matrices are initialized before
-the execution of the dynamic programming algorithm begins.
+the execution of the dynamic programming algorithm begins. \par
 
 The advantage of such formulation is the possibility to accelerate the
 computations by leveraging the \textit{NumPy} library for matrix operations.
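
Not taken from the submitted code, but roughly how the vectorized backup enabled by this formulation could look in NumPy. The array names, the (states, actions, states) layout of P, and the identification of the disturbance with the target state are all assumptions for the sake of the sketch:

    import numpy as np

    def bellman_backup(P, G, J, alpha):
        """One vectorized Bellman backup over all states at once (sketch).

        P: (S, U, S) array, P[x, u, w] = probability of disturbance w when
           action u is taken in state x (zero for impossible combinations).
        G: (S,) one-step cost vector, indexed by the target state.
        J: (S,) current cost-to-go estimate.
        alpha: discount factor.
        """
        # Expected cost of every state-action pair in one matrix product.
        Q = P @ (G + alpha * J)   # shape (S, U)
        # Disallowed (state, action) pairs could be masked with np.inf here.
        return Q.min(axis=1)      # greedy minimization over actions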
@@ -49,7 +55,7 @@ the transition probabilities, and therefore wouldn't be able to use a matrix
 library such as \textit{NumPy} for acceleration of computations.
 
 The one-step costs in my implementation only depend on the target state,
-meaning $g(x, u, w) = g(f(x, u, w))$, therefore the one-step cost functions are
+meaning $g(x,u,w) = g(f(x,u,w))$, therefore the one-step cost functions are
 represented as vectors $G_x^1$ and $G_x^2$, where the goal state has a lower
 cost than the rest of the states, and the trap state incurs a high penalty.
 This formulation differs slightly from the formulation in the task, where for
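
Written out (my paraphrase of the text above, not an equation taken from the report), the expected one-step cost of a state-action pair under this representation reduces to a weighted sum over entries of the cost vector:

    \[ \bar{g}_i(x, u) = \sum_{w} P(x, u, w)\, G^i_{f(x, u, w)}, \qquad i \in \{1, 2\} \]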
@@ -76,6 +82,9 @@ good results.
 
 \section{Algorithm inspection}
 
+For Value Iteration algorithm I initialize the cost function to zero vector; in
+Policy Iteration the $idle$ policy is used as a starting point.
+
 For visualization I used a non-linear scale for the cost function. Each
 different value in the cost vector was assigned a different color in order to
 ensure, that for small values for $\alpha$ the distinct values could be clearly
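
The two starting points named in the added lines could look like this in NumPy (a sketch with a placeholder state count and an assumed index for the idle action, not the author's code):

    import numpy as np

    num_states = 25            # placeholder: size of the maze's state space
    idle_action = 0            # assumption: index of the "idle" action

    J0 = np.zeros(num_states)                          # Value Iteration: zero cost vector
    pi0 = np.full(num_states, idle_action, dtype=int)  # Policy Iteration: idle policy everywhere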
@@ -103,7 +112,8 @@ for the non-goal states, therefore the cost-free final state is propagated as
 an ever-decreasing additive term, and the distance of the propagation is
 restricted by the precision of the floating point variable used to store the
 cost function. Hence, the algorithms may not converge to the optimal policy,
-when $g_2$ is used in conjunction with small values of $\alpha$.
+when $g_2$ is used in conjunction with small values of $\alpha$. The cost plots
+for $\alpha=0.01$ demonstrate this problem.
 
 For comparison of Value Iteration and Policy Iteration I used a wide range of
 $\alpha$, the values that I chose are $0.99$, $0.7$ and $0.1$. Using these
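
A back-of-envelope check of the precision argument (my own numbers, not from the report): with $\alpha = 0.01$ the advantage of the cost-free goal is attenuated by a factor of $100$ per step, so at distance $d$ it is on the order of $0.01^{d}$. Already at $d = 8$ this is $10^{-16}$, below double-precision machine epsilon ($\approx 2.2 \times 10^{-16}$) relative to the accumulated step costs, so states more than a few cells from the goal end up with effectively identical cost values.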
@@ -116,12 +126,14 @@ Policy Iteration converges in two to three times less iterations than Value
 Iteration. Surprisingly, the number of iterations doesn't seem to depend on the
 discount factor, which could mean that the given maze problem is small and
 simple enough, so we don't have to care about choosing the $\alpha$ carefully.
-Furthermore, the one-step cost $g_2$ allows both algorithms to converge faster.
+Furthermore, the one-step cost $g_2$ allows both algorithms to converge
+slightly faster.
 
 It is natural, that PI converges in less iterations than VI, since policy is
 guaranteed to improve on each iteration. However, finding the exact cost
 function $J_{\pi_k}$ on each iteration can get expensive, when the state space
-grows. However, the given maze is small, so it is affordable to use the PI.
+grows. However, the given maze is small, so it is preferable to use the PI to
+solve this problem.
 
 \begin{figure}
 \includegraphics[width=\linewidth]{figures/a09.png}
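
To make the cost of exact policy evaluation concrete: each PI iteration solves the linear system $(I - \alpha P_{\pi}) J_{\pi} = \bar{g}_{\pi}$, which with a dense solver scales roughly cubically in the number of states. A minimal sketch under the same assumed P, G, and alpha layout as above (not the implementation from the assignment):

    import numpy as np

    def evaluate_policy(P, G, pi, alpha):
        """Exact policy evaluation: solve (I - alpha * P_pi) J = g_pi (sketch).

        P:  (S, U, S) transition probabilities, G: (S,) target-state costs,
        pi: (S,) array of action indices, alpha: discount factor.
        The dense solve is the step that dominates as the state space grows.
        """
        S = P.shape[0]
        P_pi = P[np.arange(S), pi]     # (S, S): transition matrix under pi
        g_pi = P_pi @ G                # expected one-step cost under pi
        return np.linalg.solve(np.eye(S) - alpha * P_pi, g_pi)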