\documentclass[conference]{IEEEtran}
% \IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote.
% If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
% \usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{subcaption}
\usepackage[colorlinks]{hyperref}
\usepackage{fancyhdr}
\pagestyle{fancy}
\lhead{Humanoid Robotic Systems}
\rhead{Teleoperating NAO}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\title{TUM ICS Humanoid Robotic Systems \\ ``Teleoperating NAO''}
\author{Pavel Lutskov, Luming Li, Lukas Otter and Atef Kort}
\maketitle
\section{Project Description}
This semester, the task of our group was to program a routine for the
teleoperation of a NAO robot. Using ArUco markers placed on the operator's
chest and hands, the position and posture of the operator are determined by
detecting the markers' locations with a webcam, and the appropriate commands
are then sent to the robot so that it imitates the motions of the operator. An
overview of the process can be seen in \autoref{fig:overview}. The main
takeaway from fulfilling this objective was practicing the skills that we
acquired during the Humanoid Robotic Systems course and becoming familiar with
the NAO robot as a research and development platform.
\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{figures/teleoperation_overview.png}
\caption{Overview of the defined states and their transitions.}
\label{fig:overview}
\end{figure}
In closer detail, once the markers are detected, their coordinates relative to
the webcam are extracted. The position and the orientation of the user's
chest marker are used to control the movement of the NAO around the environment.
We call this approach a ``Human Joystick'' and we describe it in more detail in
\autoref{ssec:navigation}.
The relative locations of the chest and hand markers can be used to determine
the coordinates of the user's end effectors (i.e.\ hands) in the user's chest
frame. In order for the NAO to imitate the arm motions, these coordinates need
to be appropriately remapped into the NAO torso frame. With the knowledge of the
desired coordinates of the hands, the commands for the NAO joints can be
calculated by using the Cartesian control approach. We present a thorough
discussion of the issues we had to solve and the methods we used for arm motion
imitation in \autoref{ssec:imitation}.
Furthermore, in order to make the teleoperation as intuitive as possible, a user
interface had to be developed. In our system, we present the operator
with a current estimate of the operator's pose, a robot pose reconstructed from
sensor feedback, as well as with the camera feeds from both of NAO's cameras and
with the webcam view of the operator. In order for the user to be able to give explicit
commands to the robot, such as a request to open or close the hands or to
temporarily suspend the operation, we implemented a simple voice command
system. Finally, to be able to accommodate different users and to perform
control in different conditions, a small calibration routine was developed,
which would quickly take a user through the process of setting up the
teleoperation. We elaborate on the tools and approaches that we used for
implementation of the user-facing features in \autoref{ssec:interface}.
An example task that can be accomplished with our teleoperation package is the
following. The operator can safely and precisely navigate the robot through an
uncharted environment with a high number of obstacles to some lightweight
object, such as an empty bottle, then make the robot pick up that object and
bring it back to the operator. Thanks to the high precision of the arm
motions and the constant operator input, the robot is able to pick up objects
of different shapes and sizes, applying different strategies when needed. We
demonstrate the functioning of our system in the supporting video.
We used ROS \cite{ros} as a framework for our implementation. ROS is a
well-established framework for developing robot-targeted applications, with a
rich support infrastructure and a modular approach to organizing the application
logic. For interacting with the robot we mainly relied on the NAOqi \cite{naoqi}
Python API. The advantage of using Python compared to C++ is a much higher speed
of development and more concise and readable resulting code. We therefore used
C++ only for the most computationally intensive parts of our program, such as
the ArUco marker detection, because of the efficiency of C++.
\section{System Overview}
\subsection{Vision}\label{ssec:vision}
The foundational building block of our project is a computer vision system for
detection of the position and the orientation of ArUco markers \cite{aruco}. In
our implementation we follow closely the HRS Tutorial 4 and leverage the
functionality of the ROS ArUco library. One major difference from the lecture,
however, lies in finding the calibration matrix of the camera. In the tutorial
we could retrieve the camera intrinsics of the NAO's camera through a call to
the NAO API. In our case, however, a third-party webcam was used, the
intrinsics of which we didn't know. In order to find the camera matrix, we used
a common approach based on the calculation of a homography matrix through a
search for corresponding points in a series of planar scenes \cite{homography}.
In particular, we used three checkerboard patterns and the Camera Calibration
Toolbox for Matlab \cite{cam-toolbox}. Our experiments confirmed that the
positions and the orientations of the ArUco markers are calculated correctly,
and therefore that the calibration was successful.
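For reference, an equivalent calibration can also be scripted with OpenCV; the
following Python sketch illustrates the idea (it is not the toolbox we used,
and the board dimensions and image paths are placeholders):
\begin{verbatim}
# Sketch: checkerboard calibration with OpenCV
# (we used the Matlab toolbox; this is only an
# equivalent illustration with placeholder values).
import glob
import cv2
import numpy as np

BOARD = (8, 6)      # inner corners (placeholder)
SQUARE = 0.025      # square size in m (placeholder)

objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = (np.mgrid[0:BOARD[0], 0:BOARD[1]]
               .T.reshape(-1, 2) * SQUARE)

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.png"):   # placeholder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, BOARD)
    if ok:
        obj_pts.append(objp)
        img_pts.append(corners)

assert img_pts, "no checkerboard detections"
# K is the intrinsic matrix, dist the distortion
ret, K, dist, _, _ = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
\end{verbatim}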
At a higher level, we extract the coordinates of the ArUco markers in the
webcam frame and then apply a rotational transformation so that the Z-coordinate
of the markers correctly corresponds to the height\footnote{In the camera
frame the Z-axis is parallel to the camera axis.}. Finally, we publish
the transforms of the markers with respect to the \verb|odom|
frame\footnote{The choice of the parent frame is arbitrary as long as it is
consistent throughout the project.} using ROS \verb|tf|.
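A simplified sketch of this publishing step, with the frame names taken from
the text and the marker pose assumed to come from the detection node, looks as
follows:
\begin{verbatim}
# Sketch: publish a detected marker pose as a tf
# child of "odom". The pose is assumed to come from
# the ArUco detection, already rotated so that z
# points upwards; names are illustrative.
import rospy
import tf

rospy.init_node("marker_tf_publisher")
br = tf.TransformBroadcaster()

def publish_marker(position, orientation, name):
    # position: (x, y, z), orientation: quaternion
    br.sendTransform(position, orientation,
                     rospy.Time.now(),
                     name,     # e.g. "aruco_chest"
                     "odom")   # parent frame
\end{verbatim}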
\begin{figure}
\centerline{\includegraphics[width=0.8\linewidth]{figures/aruco.png}}
\caption{ArUco marker detection on the operator.}
\label{fig:aruco-detection}
\end{figure}
\subsection{Interface}\label{ssec:interface}
\paragraph{Speech Commands}
Based on the NAOqi API and NAO's built-in voice recognition, we built a Python
speech recognition server that exposes a ROS action as a means of accessing it.
In principle, we could have reused the results of HRS Tutorial 7, where a speech
recognition node was already implemented. Those results, however, were not
flexible enough for our purposes, and making the necessary adjustments would
have been more time-consuming than implementing a node in Python from scratch.
It was our design constraint that the robot only accepts commands which lead to
state changes that are reachable from the current state. We will provide further
detail on how the state dependency is implemented and how the speech
recognition is integrated with our system in \autoref{sec:integration}.
\begin{table}[h]
\centering
\begin{tabular}{|c|c|c|}
\hline
\textbf{Command}&\textbf{Action}&\textbf{Available in state} \\
\hline
``Go'' & Wake Up & Sleep \\
\hline
``Kill'' & Go to sleep & Idle, Imitation \\
\hline
``Arms'' & Start imitation & Idle \\
\hline
``Stop'' & Stop imitation & Imitation \\
\hline
``Open'' & Open hands & Idle, Imitation \\
\hline
``Close'' & Close hands & Idle, Imitation \\
\hline
\end{tabular}
\caption{Commands of the speech recognition module}
\label{tab:speech-states}
\end{table}
\autoref{tab:speech-states} lists the available commands,
depending on the state of the system. We tried to make them as short and
distinguishable as possible in order to minimize the number of misunderstood
commands. As a confirmation, the NAO repeats the recognized command, or says
``nope'' if it detected some speech but couldn't recognize a valid command.
Such brevity greatly speeds up the speech-based interaction compared to the
case in which the NAO talks in full sentences.
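The recognition step that our server wraps behind the ROS action can be
sketched as follows (robot address, subscriber name and confidence threshold
are illustrative placeholders):
\begin{verbatim}
# Sketch of the NAOqi recognition step wrapped by
# our ROS action server (address, subscriber name
# and threshold are placeholders).
import time
from naoqi import ALProxy

IP, PORT = "nao.local", 9559

asr = ALProxy("ALSpeechRecognition", IP, PORT)
mem = ALProxy("ALMemory", IP, PORT)

def recognize(vocab, timeout=5.0, thr=0.4):
    """Listen for one word of `vocab` or None."""
    asr.setVocabulary(vocab, False)
    asr.subscribe("teleop_asr")
    mem.insertData("WordRecognized", [])
    word, t0 = None, time.time()
    while time.time() - t0 < timeout:
        data = mem.getData("WordRecognized")
        if data and data[1] > thr:   # [word, conf]
            word = data[0]
            break
        time.sleep(0.1)
    asr.unsubscribe("teleop_asr")
    return word
\end{verbatim}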
\paragraph{Calibration}
In order to make our system more robust, we have included a routine to
calibrate it for different users. It can run as an optional step before the
execution of the main application. This routine determines the threshold
values required for the ``Human Joystick'' approach that is used to
control the NAO's walker module, as well as various key points that are
needed to properly map the operator's arm motions to the NAO.
When the module is started, the NAO guides the operator through a number of
recording steps via spoken prompts. After a successful completion of the
calibration process, the determined values are written to the
\textit{YAML} file \verb|config/default.yaml| \cite{yaml}. This file can then
be accessed by the other nodes in the system.
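Persisting the calibration result amounts to a simple YAML dump; a minimal
sketch, with illustrative parameter names rather than the exact keys of our
file, is:
\begin{verbatim}
# Sketch: persist the calibration results to a YAML
# file readable by the other nodes (the parameter
# names are illustrative placeholders).
import yaml

calibration = {
    "buffer_radius": 0.15,   # m, buffer zone
    "max_radius": 0.45,      # m, control zone edge
    "arm_length": 0.62,      # m, operator arm
    "shoulder_offset": [0.0, 0.18, 0.45],
}

with open("config/default.yaml", "w") as f:
    yaml.safe_dump(calibration, f,
                   default_flow_style=False)
\end{verbatim}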
\paragraph{Teleoperation Interface}
In order to make it possible to operate the NAO without visual contact, we have
developed a teleoperation interface. It allows the operator to receive visual
feedback on the NAO as well as an estimate of the operator's current pose and
of the buffer and movement zones which are needed to navigate the robot.
The NAO part contains the feeds of the top and bottom cameras on the robot's
head. These were created by subscribing to their respective topics using the
\verb|rqt_gui| package. Moreover, it additionally contains a
visualization of the NAO in rviz. For this, the robot's joint positions are
displayed by subscribing to the \verb|tf| topic, where the coordinates and the
different coordinate frames are published. We further used the
\verb|nao_meshes| package to render a predefined URDF 3D model of the NAO. It
is shown in \autoref{fig:rviz-nao-model}.
Furthermore, the interface also presents an estimate of the current pose of
the operator as well as the control zones for our ``Human Joystick'' approach in
an additional \textit{rviz} window. For this, we created a separate node that
repeatedly publishes a model of the operator and the zones, consisting of
concentric circles, to \textit{rviz}. Initially, the \textit{YAML} file that
contains the parameters determined within the system calibration is
read out. According to those, the sizes of the circles that represent the
control zones are set. Further, the height of the human model is set to 2.2
times the determined arm length of the operator. The size of the other body
parts is then scaled depending on that height parameter and predefined weights.
We tried to match the proportions of the human body as closely as possible with
this approach. The position of the resulting body model is bound to the
determined location of the ArUco marker on the operator's chest, which is again
received by subscribing to the \verb|tf| topic. Thus, since the model is
recreated and re-published in each iteration of the node, it dynamically moves
with the operator.
Moreover, for a useful interface it was crucial to have a dynamic
representation of the operator's arms in the model. After several attempts with
different marker types (e.g.\ cylinders and arrows) turned out to be too
elaborate to implement, we decided to use markers of the type
\textit{line-strip}, starting from points at the shoulders and ending at points
on the hands, for the model's arms. By using the shoulder points that were
defined in the body model and locking the points on the hands to the positions
that were determined for the markers in the operator's hands, we finally created
a model that represents the operator's arm positions and thereby provides
support for various tasks such as grabbing an object. The final model is shown
in \autoref{fig:rviz-human-model}. For reference, we also included a
marker of type \textit{sphere} that depicts the position of the recording
webcam.
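A condensed sketch of how one arm of the operator model is published as a
\textit{line-strip} marker (topic name, frame and dimensions are illustrative)
is given below:
\begin{verbatim}
# Sketch: one operator arm as a LINE_STRIP marker
# (topic, frame and dimensions are illustrative).
import rospy
from visualization_msgs.msg import Marker
from geometry_msgs.msg import Point

pub = rospy.Publisher("operator_model", Marker,
                      queue_size=1)

def arm_marker(shoulder_xyz, hand_xyz, marker_id):
    m = Marker()
    m.header.frame_id = "odom"
    m.header.stamp = rospy.Time.now()
    m.id = marker_id
    m.type = Marker.LINE_STRIP
    m.action = Marker.ADD
    m.scale.x = 0.03              # line width in m
    m.color.g, m.color.a = 1.0, 1.0
    m.points = [Point(*shoulder_xyz),
                Point(*hand_xyz)]
    return m
\end{verbatim}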
In addition, we added a camera feed showing the operator. Within this feed,
ArUco markers are highlighted once they are detected. This was done by including
the output of the ArUco detection module in the interface. A sample output is
shown in \autoref{fig:aruco-detection}.
\begin{figure}
\centering
%\hfill
\begin{subfigure}[b]{0.4\linewidth}
\includegraphics[width=\linewidth]{figures/interface_nao.png}
\caption{}
%{{\small $i = 1 \mu m$}}
\label{fig:rviz-nao-model}
\end{subfigure}
\begin{subfigure}[b]{0.4\linewidth}
\includegraphics[width=\linewidth]{figures/rviz_human.png}
\caption{}
%{{\small $i = -1 \mu A$}}
\label{fig:rviz-human-model}
\end{subfigure}
\caption{NAO and operator in rviz.}
\label{fig:interface}
\end{figure}
\subsection{Navigation}\label{ssec:navigation}
Next, our system needed a way for the operator to command the robot to a
desired location. Furthermore, the operator has to be able to adjust the speed
of the robot's movement. To achieve this we use the approach that we call the
``Human Joystick''. We implement this approach in a module called
\verb|walker|.
Through the calibration procedure we determine the initial position of the
operator. Furthermore, we track the position of the operator by locating the
ArUco marker on the operator's chest. Then, we can map the current position of
the user to the desired direction and speed of the robot. For example, if the
operator steps to the right from the initial position, then the robot will be
moving to the right until the operator returns to the initial position.
The further the operator is from the origin, the faster the robot will move. In
order to control the rotation of the robot, the operator can slightly turn the
body clockwise or counterclockwise while being in the initial position so that
the marker can still be detected by the webcam. The speed of the rotation can
also be controlled by the magnitude of the operator's rotation. The process is
schematically illustrated in \autoref{fig:joystick}.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{figures/usr_pt.png}
\caption{``Human Joystick''.}
\label{fig:joystick}
\end{figure}
There is a small region around the original position, called the Buffer Zone,
in which the operator can stay without causing the robot to move. As soon as
the operator exceeds the movement threshold into some direction, the robot will
slowly start moving in that direction. We use the following relationship for
calculating the robot's speed:
$$v = v_{min} + \frac{d - d_{thr}}{d_{max} - d_{thr}}(v_{max} - v_{min})$$
Here, $d$ denotes the operator's distance from the origin in that direction,
$d_{thr}$ is the minimum distance required for starting the movement and the
$d_{max}$ is the boundary of the control zone; $d_{thr}$ and $d_{max}$ are
determined through the calibration process. Currently, there can only be
movement in one direction at a time, so in case the operator exceeds the
threshold in more than one direction, the robot will move in the direction with
the highest precedence. The forward-backward motion has the highest precedence,
followed by the sideways motion and, finally, the rotation.
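A compact sketch of this mapping, including the precedence rule, is given
below; the velocity limits and the dictionary keys are illustrative
placeholders, while the thresholds come from the calibration file:
\begin{verbatim}
# Sketch of the "Human Joystick" mapping from the
# chest marker displacement to a walking command.
# Velocity limits and dictionary keys are
# placeholders; thresholds come from calibration.
V_MIN, V_MAX = 0.2, 1.0

def axis_speed(d, d_thr, d_max):
    """Speed along one axis, 0 inside the buffer."""
    if abs(d) < d_thr:
        return 0.0
    v = V_MIN + ((abs(d) - d_thr) / (d_max - d_thr)
                 * (V_MAX - V_MIN))
    return min(v, V_MAX) * (1 if d > 0 else -1)

def joystick_command(dx, dy, dyaw, cal):
    """(x, y, theta); forward > sideways > turn."""
    x = axis_speed(dx, cal["buffer_radius"],
                   cal["max_radius"])
    if x != 0.0:
        return (x, 0.0, 0.0)
    y = axis_speed(dy, cal["buffer_radius"],
                   cal["max_radius"])
    if y != 0.0:
        return (0.0, y, 0.0)
    return (0.0, 0.0,
            axis_speed(dyaw, cal["yaw_thr"],
                       cal["yaw_max"]))
\end{verbatim}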
Our tests have shown that having control over the speed is crucial for the
success of the teleoperation. Aligning to an object is impossible if the
robot walks at its maximum speed; on the other hand, walking around the
room at a small fraction of the maximum speed is too slow.
\subsection{Imitation}\label{ssec:imitation}
One of the main objectives of our project was the imitation of the operator's
arm motions by the NAO. In order to perform this, first the appropriate mapping
between the relative locations of the detected ArUco markers and the desired
hand positions of the robot needs to be calculated. Then, based on the
target coordinates, the robot joint rotations need to be calculated.
\paragraph{Posture Retargeting}
First, let us define the notation of the coordinates that we will use to
describe the posture retargeting procedure. Let $r$ denote the 3D $(x, y, z)$
coordinates, then the subscript defines the object which has these coordinates,
and the superscript defines the coordinate frame in which these coordinates are
taken. So, for example, $r_{hand,NAO}^{torso,NAO}$ gives the coordinates of the
hand of the NAO robot in the frame of the robot's torso.
\begin{figure}
\centering
%\hfill
\begin{subfigure}[b]{0.45\linewidth}
\includegraphics[width=\linewidth]{figures/operator_frames.png}
\caption{Operator's chest and shoulder frames.}
%{{\small $i = 1 \mu m$}}
\label{fig:operator-frames}
\end{subfigure}
\begin{subfigure}[b]{0.45\linewidth}
\includegraphics[width=\linewidth]{figures/robot_torso.png}
\caption{NAO's torso frame.}
%{{\small $i = -1 \mu A$}}
\label{fig:nao-frames}
\end{subfigure}
\caption{Coordinate frames.}
\label{fig:coord-frames}
\end{figure}
After the ArUco markers are detected and published on ROS \verb|tf|, as was
described in \autoref{ssec:vision}, we have the three vectors
$r_{aruco,chest}^{webcam}$, $r_{aruco,lefthand}^{webcam}$ and
$r_{aruco,righthand}^{webcam}$. We describe the retargeting for one hand, since
it is symmetrical for the other hand. We also assume that the user's coordinate
systems have the same orientation, with the z-axis pointing upwards, the x-axis
pointing straight into the webcam and the y-axis pointing to the left of the
webcam\footnote{This assumption holds, because for the imitation mode the user
always faces the camera directly and stands straight up. We need this assumption
for robustness against the orientation of the chest marker, since it can
accidentally get tilted. If we bound the coordinate system to the chest
marker completely, we would need to place the marker on the chest firmly and
carefully, which is time-consuming.}. Therefore, we can directly calculate
the hand position in the user's chest frame by means of the following
equation:
$$r_{hand,user}^{chest,user} = r_{aruco,hand}^{webcam} -
r_{aruco,chest}^{webcam}$$
Next, we remap the hand coordinates in the chest frame into the user shoulder
frame, using the following relation:
$$r_{hand,user}^{shoulder,user} =
r_{hand,user}^{chest,user} - r_{shoulder,user}^{chest,user}$$
We know the coordinates of the user's shoulder in the user's chest frame from
the calibration procedure, described in \autoref{ssec:interface}.
Now, we perform the retargeting of the user's hand coordinates to the desired
NAO's hand coordinates in the NAO's shoulder frame with the following formula:
$$r_{hand,NAO}^{shoulder,NAO} =
\frac{L_{arm,NAO}}{L_{arm,user}} r_{hand,user}^{shoulder,user}$$
As before, we know the length of the user's arm through calibration and the
length of the NAO's arm through the specification provided by the manufacturer.
The final step of the posture retargeting is to obtain the coordinates of the
end effector in the torso frame. This can be done through the following relation:
$$r_{hand,NAO}^{torso,NAO} =
r_{hand,NAO}^{shoulder,NAO} + r_{shoulder,NAO}^{torso,NAO}$$
The coordinates of the NAO's shoulder in the NAO's torso frame can be obtained
through a call to the NAOqi API.
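Put together, the retargeting chain reduces to a few vector operations; the
following sketch summarizes it, with variable names mirroring the notation
above:
\begin{verbatim}
# Sketch of the retargeting chain for one hand;
# variable names mirror the notation in the text.
import numpy as np

def retarget_hand(r_hand_cam, r_chest_cam,
                  r_shoulder_user_chest,
                  r_shoulder_nao_torso,
                  L_arm_nao, L_arm_user):
    # hand in the user's chest frame
    r_chest = (np.asarray(r_hand_cam)
               - np.asarray(r_chest_cam))
    # hand in the user's shoulder frame
    r_sh = r_chest - np.asarray(r_shoulder_user_chest)
    # scale into the NAO's shoulder frame
    r_nao_sh = (L_arm_nao / L_arm_user) * r_sh
    # express in the NAO's torso frame
    return r_nao_sh + np.asarray(r_shoulder_nao_torso)
\end{verbatim}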
Now that the desired positions of the NAO's hands are known, the appropriate
joint motions need to be calculated by means of Cartesian control.
\paragraph{Cartesian Control}
At first, we tried to employ the Cartesian controller that is shipped with the
NAOqi SDK. We soon realized, however, that this controller was unsuitable for
our task because of two significant limitations. The first problem with
the NAO's controller is that it freezes if the target is updated too
often: the arms of the robot start to stutter and then make a final erratic
motion once the program is terminated. However, arm teleoperation requires
smoothness and therefore frequent updates of the target position, and the NAO
controller didn't fit these requirements. A possible reason for such behavior
could be a bug in the implementation, and it is possible that this
problem was fixed in later versions of the NAOqi SDK.
Secondly, the controller of the NAO is not robust against
\textit{singularities}. Singularities occur when the kinematic chain loses one
or more degrees of freedom, so that in order to reach a desired position, the
joints would need to move at arbitrarily high speeds. Practically, for the
imitation task this would mean that once the robot has its arms fully stretched,
the arms would execute violent erratic motions which would hurt the robot or
cause it to lose balance. Therefore, we needed to implement our own Cartesian
controller, which would allow us to operate the robot smoothly without worrying
about singularities.
In our case, the outputs of the Cartesian controller are the four angles of the
rotational joints of the shoulder and the elbow of each arm of the NAO
robot. The angular velocities of the joints can be calculated using the
following formula:
$$\dot{\theta} = J^{-1}\dot{r}$$
Here $\dot{r}$ is the desired velocity of the end effector and $\dot{\theta}$ is
the vector of the necessary joint angular velocities. $J$ is the Jacobian matrix
\cite{jacobian}. The Jacobian matrix gives the relationship between the joint
angular velocities and the resulting velocity of the effector at the end of the
kinematic chain which the Jacobian matrix describes.
We now apply a common simplification and state that
$$\Delta \theta \approx J^{-1}\Delta r$$
Here $\Delta$ denotes a small change in the angle or the position. We use
$$\Delta r = \frac{r_{desired} - r_{current}}{K},\ K = 10$$
In this formula $r_{desired}$ denotes the 3D position of the target, which is
the result of the posture retargeting, namely
$r_{hand,NAO}^{torso,NAO}$\footnote{Here we mean not the real position of the
NAO's hand, but the desired position calculated from the user's hand position.
The real position of the NAO's hand is $r_{current}$. Distinguishing these
properly would require even further abuse of notation.}. We want $r$ to make a
small step in the direction of the desired position.
Now we need to calculate the Jacobian matrix. There are two main ways to
determine it. The first is the numerical method, where the Jacobian is
approximated by checking how the end effector moves under small changes of the
joint angles. For this we can approximate each column of the Jacobian matrix
as follows:
$$
J_j = \frac{\partial r}{\partial\theta_j} \approx
\frac{\Delta r}{\Delta\theta_j} =
\left(
\begin{array}{ccc}
\frac{\Delta r_x}{\Delta\theta_j} &
\frac{\Delta r_y}{\Delta\theta_j} &
\frac{\Delta r_z}{\Delta\theta_j}
\end{array}
\right)^{T}
$$
We tested this approach; the results, however, were rather unstable, and due to
the lack of time we didn't investigate possible ways to make it
perform better. A possible reason for the bad performance of this method could
be the imprecise readings from the NAO's joint sensors and the imprecise
calculation of the position of the end effector, which is also performed by the
NAO internally.
The other method that we employed was to calculate the Jacobian matrix
analytically. Since only rotational joints are involved, each column of the
Jacobian matrix, which is the tangential direction of motion induced by the
corresponding joint, can be calculated as the cross product between the rotation
axis of the joint, denoted by $e_j$, and the vector $r_{end}-r_{j}$, where
$r_{end}$ is the position of the end effector (i.e.\ the hand) and $r_{j}$ is
the position of the joint. The following relation gives one column of the
Jacobian matrix:
$$
J_j = \frac{\partial r_{end}}{\partial\theta_j} =
(e_j \times (r_{end}-r_j))
$$
We can obtain the rotation axis of a joint and the position of the joint in the
torso frame through the NAOqi API. This can be repeated for each rotational
joint until the whole matrix is filled.
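In code, building the Jacobian is a loop over the joints; a sketch is shown
below, assuming the axes and joint positions are already expressed in the torso
frame:
\begin{verbatim}
# Sketch: analytical position Jacobian of a chain
# of rotational joints. axes[j] and joint_pos[j]
# are the rotation axis and position of joint j in
# the torso frame (from the NAOqi API); r_end is
# the current end-effector position.
import numpy as np

def position_jacobian(axes, joint_pos, r_end):
    cols = [np.cross(e, np.asarray(r_end)
                     - np.asarray(p))
            for e, p in zip(axes, joint_pos)]
    return np.column_stack(cols)   # 3 x n_joints
\end{verbatim}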
The next step for the Cartesian controller is to determine the inverse of the
Jacobian matrix for the inverse kinematics. For this, the singular value
decomposition is used, which is given by
$$J = U\Sigma V^T$$
Then, the inverse can be calculated by
$$J^{-1} = V \Sigma^{-1} U^T$$
One advantage of this approach is that it can be employed to find a
pseudoinverse of a non-square matrix. Furthermore, the diagonal matrix $\Sigma$
has $J$'s singular values on its main diagonal. If any of the singular
values is close to zero, this means that $J$ has lost rank and therefore
a singularity occurs. We calculate $\Sigma^{-1}$ by taking the reciprocal of
each diagonal entry of $\Sigma$ and transposing the result.
Then we can avoid the singularity behavior by setting to $0$ the entries of
$\Sigma^{-1}$ that are above a threshold value $\tau = 50$, which we determined
through experimentation.
The final control objective for the current loop iteration can be stated as:
$$\theta_{targ} = \theta_{cur} + \Delta\theta$$
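One iteration of the resulting controller can be summarized by the following
sketch, where $K$ and $\tau$ take the values stated above and the joint angles
and hand positions are assumed to be read from the NAOqi API:
\begin{verbatim}
# Sketch of one controller iteration: SVD-based
# pseudoinverse with singularity clipping, then a
# small step towards the retargeted hand position.
import numpy as np

K, TAU = 10.0, 50.0

def clipped_pinv(J):
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    s_inv = np.zeros_like(s)
    mask = s > 1.0 / TAU    # keep 1/s below TAU
    s_inv[mask] = 1.0 / s[mask]
    return Vt.T.dot(np.diag(s_inv)).dot(U.T)

def control_step(theta_cur, r_cur, r_des, J):
    dr = (np.asarray(r_des) - np.asarray(r_cur)) / K
    dtheta = clipped_pinv(J).dot(dr)
    return np.asarray(theta_cur) + dtheta
\end{verbatim}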
Our tests have shown that our controller doesn't exhibit the freezing behavior
present in the NAO's own controller, and therefore the control target
can be updated with arbitrary frequency. Furthermore, our controller
shows no signs of producing violent arm motions, which means that our strategy
for handling singularities is effective. The implementation of the whole
imitation routine resides in the \verb|imitator| module of our system.
\subsection{Fall Recovery}
A final component of our system is a small node that is responsible for
recovering from falls. Using the NAOqi API, this node subscribes to a
NAOqi event called \verb|robot_has_fallen|, and as soon as such an event occurs,
it immediately suspends any other activities in the system and then indefinitely
tries to bring the robot back into the standing state, again by using a routine
provided by the NAOqi SDK. Thanks to this node, the teleoperation can in many
cases be resumed normally after a fall.
\section{System Implementation and Integration}\label{sec:integration}
Now that the individual modules were designed and implemented, the whole system
needed to be assembled together. The state machine that we designed can be
seen in \autoref{fig:overview}.
The software package was organized as a collection of ROS nodes, controlled by
a single master node. The master node keeps track of the current system state,
and the slave nodes consult the master node to check if they are allowed
to perform an action. To achieve this, the master node creates a server for a
ROS service named \verb|inform_masterloop|, with this service call taking as
arguments the name of the caller and the desired action and responding with a
Boolean value indicating whether permission to perform the action was
granted. The master node can then update the system state based on the received
action requests and the current state. Some slave nodes, such as the walking or
imitation nodes, run in a high-frequency loop and therefore consult the
master in each iteration of the loop. Other nodes, such as the fall detector,
only inform the master about the occurrence of certain events, such as a fall
or fall recovery, so that the master can deny requests for any activities
until the fall recovery is complete.
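The permission check in the master node can be sketched as follows;
\verb|InformMasterloop| stands for our custom service type (caller and action
strings in, a Boolean permission out), and the state table shown here is
abbreviated and illustrative:
\begin{verbatim}
# Sketch of the permission check in the master
# node. InformMasterloop stands for our custom
# service type (string caller, string action ->
# bool permission); the state table is abbreviated.
import rospy
from teleoperation.srv import (InformMasterloop,
                               InformMasterloopResponse)

state = "sleep"

ALLOWED = {   # (state, action) -> next state
    ("idle", "walk"): "walking",
    ("walking", "walk"): "walking",
    ("walking", "stop_walk"): "idle",
    ("idle", "imitate"): "imitation",
}

def handle(req):
    global state
    key = (state, req.action)
    if key in ALLOWED:
        state = ALLOWED[key]
        return InformMasterloopResponse(True)
    return InformMasterloopResponse(False)

rospy.init_node("master")
rospy.Service("inform_masterloop",
              InformMasterloop, handle)
rospy.spin()
\end{verbatim}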
\begin{figure}[h]
\centering
\includegraphics[width=0.9\linewidth]{figures/sys_arch.png}
\caption{Overview of the interactions in the system.}
\label{fig:impl_overview}
\end{figure}
We will now illustrate our architecture using the interaction between the walker
node and the master node as an example. This interaction is depicted in
\autoref{fig:master-walker}. The walker node subscribes to the \verb|tf|
transform of the chest ArUco marker, and requests a position update every 0.1
seconds. If in the current cycle the marker happens to be outside of the buffer
zone (see \autoref{fig:joystick}), or the rotation of the marker exceeds the
motion threshold, the walker node will ask the master node for a permission to
start moving. The master node will receive the request, and if the current
state of the system is either \textit{walking} or \textit{idle} (see
\autoref{fig:overview}), then the permission will be granted and the system
will transition into the \textit{walking} state. If the robot is currently
imitating the arm motions or has not yet recovered from a fall, then the
permission will not be granted and the system will remain in its current state
\footnote{We did research a possibility of automatic switching between walking
and imitating, so that the robot always imitates when the operator is within
the buffer zone, and stops imitating as soon as the operator leaves the
buffer zone, but this approach requires more skill and concentration from the
operator, so the default setting is to explicitly ask the robot to go into
imitating state and back into idle.}.
The walker node will then receive the
master's response, and in case it is negative, any current movement will be
stopped and the next cycle of the loop will begin. In case the permission is
granted, the walker will calculate the direction and the speed of the movement
based on the marker position and will send a command to the robot over the
NAOqi API to start moving. We use a non-blocking movement function, so that the
movement objective can be updated with every loop iteration. Finally, if the
marker is within the buffer zone, the robot will be commanded to stop by the
walker node, and the master will be informed that the robot has stopped
moving. Since in this case the walker node gives up control, the permission
from the master doesn't matter.
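The walker's main loop can then be sketched as follows; the service type, the
marker-reading helper and the calibration dictionary are illustrative
placeholders, and \verb|joystick_command| refers to the mapping sketched in
\autoref{ssec:navigation}:
\begin{verbatim}
# Sketch of the walker loop: ask the master for
# permission each cycle, then send a non-blocking
# walk command over NAOqi. The service type, the
# marker-reading helper and `cal` are placeholders.
import rospy
from naoqi import ALProxy
from teleoperation.srv import InformMasterloop

rospy.init_node("walker")
motion = ALProxy("ALMotion", "nao.local", 9559)
ask = rospy.ServiceProxy("inform_masterloop",
                         InformMasterloop)

rate = rospy.Rate(10)        # 0.1 s cycle
while not rospy.is_shutdown():
    dx, dy, dyaw = chest_marker_displacement()
    x, y, th = joystick_command(dx, dy, dyaw, cal)
    if (x, y, th) == (0.0, 0.0, 0.0):
        motion.stopMove()    # inside buffer zone
        ask("walker", "stop_walk")
    elif ask("walker", "walk").permission:
        motion.moveToward(x, y, th)  # non-blocking
    else:
        motion.stopMove()    # permission denied
    rate.sleep()
\end{verbatim}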
\begin{figure}[h]
\centering
\includegraphics[width=0.9\linewidth]{figures/master_walker.png}
\caption{Interaction between master and walker modules.}
\label{fig:master-walker}
\end{figure}
The final piece of our system is the speech-based command interface. Since in
our system the acceptable commands vary between states, the speech recognition
controller must be aware of the current state of the system; therefore, the
master node is responsible for this functionality. The master node runs an
auxiliary loop, in which a recognition target is sent to the speech server
node described in \autoref{ssec:interface}. If a relevant word is detected, the
master receives the result, updates the state accordingly and then sends a
new recognition target. If a state change occurred before any speech was
detected, then the master sends a cancellation request to the speech server for
the currently running objective and, again, sends a new target.
\section{Conclusion and Possible Drawbacks}
Upon completion of this project, our team has successfully applied the knowledge
acquired during the HRS lectures and tutorials to a complex practical
task. We implemented an easy-to-use prototype of a teleoperation system, which
is fairly robust to environmental conditions. Furthermore, we researched
several approaches to the implementation of the Cartesian control and were
able to create a Cartesian controller which is superior to the NAO's built-in
one. Finally, we used ROS extensively and can now confidently employ it in
future projects.
Our resulting system has a few drawbacks, however, and there is room for
future improvements. Some of these drawbacks are due to the time constraints,
while others have to do with limitations of the NAO itself. The first major
drawback is the reliance on the NAO's built-in speech recognition for
controlling the robot. Because of this, the operator has to be in the same room
as the robot, which severely constrains the applicability of the
teleoperation system. Furthermore, since the acting robot is the one detecting
the speech, it can be susceptible to the sounds it makes during the operation
(joint noises, warning notifications). Also, as the live demonstration
revealed, using voice-based control in a crowded environment can lead to a high
number of false positive detections and therefore to instability of the system.
A simple solution is to use two NAO robots, one of which is in the room with the
operator, acting solely as a speech detection tool, while the other one is in
another room performing the actions. A more sensible approach is to apply
third-party speech recognition software to a webcam microphone feed, since
speech recognition packages for ROS are available \cite{ros-speech}. However,
because speech recognition wasn't the main objective of our project, we
reserve this for possible future work.
Another important issue, which can be a problem for remote operation, is the
cabling. The NAO is connected to the controlling computer over an Ethernet
cable, and also, due to the low capacity of the NAO's battery, the power cord
needs to be plugged in most of the time. The problem with this is that without the
direct oversight of the operator, it is impossible to know where the cables are
relative to the robot, so it is impossible to prevent the robot from tripping
over the cables and falling. When it comes to battery power, the NAO has some
autonomy; the Ethernet cable, however, cannot be removed because the onboard
Wi-Fi of the NAO is too slow to allow streaming of the video feed and joint
telemetry.
A related issue is the relatively narrow field of view of the NAO's cameras. In
a cordless setup, the camera feed might be sufficient for the operator to
navigate the NAO through the environment. However, picking up objects when only
seeing them through the robot's cameras is extremely difficult because of the
narrow field of view and the lack of depth information. A possible solution to
this issue and the previous one, which would enable operating a NAO without
being in the same room with it, is to equip the robot's room with video cameras,
so that some oversight is possible.
Finally, there is a problem with the NAO's stability when it walks carrying an
object. Apparently, the NAOqi walking controller relies on the movement of the
arms to stabilize the walking. It seems that if the arms are occupied by some
other task during the walk, the built-in controller doesn't try to
intelligently compensate, which has led to a significant number of falls during
our experiments. Due to the time constraints, we weren't able to investigate
any approaches to make the walking more stable. This, however, can be an
interesting topic for future semester projects.
\bibliography{references}{}
\bibliographystyle{IEEEtran}
\end{document}