\documentclass[conference]{IEEEtran}
% \IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote.
% If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
% \usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{subcaption}
\usepackage{todonotes}
\usepackage{hyperref}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}

\begin{document}

\title{Humanoid Robotic Systems - ``Teleoperating NAO''}

\author{Pavel Lutskov, Luming Li, Lukas Otter and Atef Kort}

\maketitle

\section{Project Description}
This semester, the task of our group was to program a routine for the teleoperation of a NAO robot. Using ArUco markers placed on the operator's chest and hands, the position and the posture of the operator are determined by detecting the markers' locations with a webcam; the appropriate commands are then sent to the robot so that it imitates the motions of the operator. An overview of the process can be seen in \autoref{fig:overview}. The main takeaway from fulfilling this objective was practicing the skills that we acquired during the Humanoid Robotic Systems course and getting familiar with the NAO robot as a research and development platform.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{figures/teleoperation_overview.png}
\caption{Overview of the defined states and their transitions.}
\label{fig:overview}
\end{figure}

In closer detail, once the markers are detected, their coordinates relative to the webcam are extracted. The position and the orientation of the user's chest marker are used to control the movement of the NAO around the environment. We call this approach a ``Human Joystick'' and describe it in more detail in \autoref{ssec:navigation}. The relative locations of the chest and hand markers can be used to determine the coordinates of the user's end effectors (i.e.\ hands) in the user's chest frame. In order for the NAO to imitate the arm motions, these coordinates need to be appropriately remapped into the NAO torso frame. Knowing the desired coordinates of the hands, the commands for the NAO joints can be calculated using the Cartesian control approach. We present a thorough discussion of the issues we had to solve and the methods we used for arm motion imitation in \autoref{ssec:imitation}.

Furthermore, in order to make the teleoperation as intuitive as possible, a user interface had to be developed. In our system, we present the operator with a current estimate of the operator's pose, a robot pose reconstructed from sensor feedback, the camera feeds from both of the NAO's cameras, and the webcam view of the operator. To let the user give explicit commands to the robot, such as a request to open or close the hands or to temporarily suspend the operation, we implemented a simple voice command system. Finally, to accommodate different users and to perform control under different conditions, a small calibration routine was developed, which quickly takes a user through the process of setting up the teleoperation. We elaborate on the tools and approaches that we used to implement the user-facing features in \autoref{ssec:interface}.

An example task that can be accomplished with our teleoperation package might be the following.
The operator can safely and precisely navigate the robot through an uncharted environment with a high number of obstacles towards some lightweight object, such as an empty bottle, then make the robot pick up that object and bring it back to the operator. Thanks to the high precision of the arm motions and the constant operator input, the robot is able to pick up objects of different shapes and sizes, applying different strategies when needed. We demonstrate the functioning of our system in the supporting video.

We used ROS as the framework for our implementation. ROS is a well-established framework for developing robot-targeted applications, with a rich support infrastructure and a modular approach to organizing the logic. For interacting with the robot we mainly relied on the NAOqi Python API. The advantage of using Python compared to C++ is a much higher speed of development and more concise and readable code.

\section{System Overview}

\subsection{Vision}\label{ssec:vision}
The vision module is responsible for three tasks: calibrating the webcam, extracting the ArUco markers from the camera image (see \autoref{fig:aruco_detection}), and publishing the resulting marker coordinates as TF transforms, so that the other modules can access the marker poses in a common world frame.

\begin{figure}
\centerline{\includegraphics[width=0.8\linewidth]{figures/aruco.png}}
\caption{ArUco marker detection on the operator.}
\label{fig:aruco_detection}
\end{figure}

\subsection{Interface}\label{ssec:interface}

\paragraph{Speech State Machine}
The voice command system is based on the NAOqi API and the NAO's built-in speech recognition. The recognized commands, the actions they trigger and the states in which they are available are listed in \autoref{tab_speech_states}.

\begin{table}
\caption{Commands of the speech recognition module}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
\textbf{Command}&\textbf{Action}&\textbf{Available in states} \\
\hline
``Go'' & Wake up & Sleep \\
\hline
``Kill'' & Go to sleep & Idle, Imitation \\
\hline
``Arms'' & Start imitation & Idle \\
\hline
``Stop'' & Stop imitation & Imitation \\
\hline
``Open'' & Open hands & Idle, Imitation \\
\hline
``Close'' & Close hands & Idle, Imitation \\
\hline
\end{tabular}
\label{tab_speech_states}
\end{center}
\end{table}

\paragraph{Teleoperation Interface}
In order to make it possible to operate the NAO without visual contact, a teleoperation interface was developed. This interface allows the operator to receive visual feedback from the NAO as well as additional information regarding the operator's own position. The NAO part of the interface contains the video streams of the top and bottom cameras on the robot's head. These were created by subscribing to the corresponding camera image topics using the \textit{rqt\_gui} package. Moreover, the interface contains an rviz window which gives a visual representation of the NAO (\autoref{fig_interface}). For this, the robot's joint positions are displayed by subscribing to the \textit{tf} topic, where the transforms between the different coordinate frames are published. We further used the \textit{nao\_meshes} package to create the 3D model of the NAO.

\begin{figure}
\centering
\begin{subfigure}[b]{0.4\linewidth}
\includegraphics[width=\linewidth]{figures/rviz_human.png}
\caption{}
\label{fig_human_model}
\end{subfigure}
\begin{subfigure}[b]{0.4\linewidth}
\includegraphics[width=\linewidth]{figures/interface_nao.png}
\caption{}
\label{fig_nao_model}
\end{subfigure}
\caption{Operator and NAO in rviz.}
\label{fig_interface}
\end{figure}

\subsection{Navigation}\label{ssec:navigation}
One of the two main features of our system is an intuitive navigation tool, the ``Human Joystick'', which allows the operator to steer the robot in three degrees of freedom (forward/backward, sideways and turning) simply by moving around. By fixing an ArUco marker on the user's chest, we can continuously track its position and orientation in three-dimensional space and thus capture the operator's motion. To simplify the task, we define a buffer zone: as long as the user stays inside it, the robot only tracks the orientation of the user. Depending on the direction in which the user exits the zone, the robot walks forward, backward, left or right. The covered distance also influences the speed of the robot: the further the user is from the center of the buffer zone, the faster the robot moves. The extent of the movement range and of the buffer zone is determined automatically through calibration. The tracking model is illustrated in \autoref{fig_user_tracking}.

\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{figures/usr_pt.png}
\caption{User position tracking model.}
\label{fig_user_tracking}
\end{figure}
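As a minimal illustration of how the ``Human Joystick'' logic can be turned into a walking command, the Python sketch below maps the chest-marker offset and rotation to the normalized velocities expected by the NAOqi \verb|ALMotion.moveToward| call. The function name, the calibration constants and the thresholds are illustrative placeholders, not the exact identifiers used in our package.

\begin{verbatim}
import math
from naoqi import ALProxy

# Illustrative calibration values (set by the
# calibration routine).
CENTER_X, CENTER_Y = 0.0, 0.0  # buffer-zone center [m]
BUFFER_RADIUS = 0.15           # no translation inside [m]
MAX_OFFSET = 0.50              # offset giving full speed [m]
YAW_THRESHOLD = 0.2            # chest rotation dead band [rad]

motion = ALProxy("ALMotion", "nao.local", 9559)

def marker_to_walk_command(x, y, yaw):
    """Map the chest-marker pose to normalized velocities."""
    dx, dy = x - CENTER_X, y - CENTER_Y
    dist = math.hypot(dx, dy)

    vx, vy = 0.0, 0.0
    if dist > BUFFER_RADIUS:
        # Speed grows with the distance to the zone border.
        s = (dist - BUFFER_RADIUS) / (MAX_OFFSET - BUFFER_RADIUS)
        s = min(max(s, 0.0), 1.0)
        vx, vy = s * dx / dist, s * dy / dist

    vtheta = 0.0
    if abs(yaw) > YAW_THRESHOLD:
        vtheta = max(min(yaw, 1.0), -1.0)
    return vx, vy, vtheta

# One iteration of the walker loop:
vx, vy, vtheta = marker_to_walk_command(0.3, 0.1, 0.0)
motion.moveToward(vx, vy, vtheta)  # non-blocking walk command
\end{verbatim}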
\subsection{Imitation}\label{ssec:imitation}
One of the main objectives of our project was the imitation of the operator's arm motions by the NAO. To achieve this, the appropriate mapping between the relative locations of the detected ArUco markers and the desired hand positions of the robot first needs to be calculated. Then, based on the target coordinates, the robot joint rotations need to be computed.

\paragraph{Posture retargeting}
First, let us define the notation that we will use to describe the posture retargeting procedure. Let $r$ denote 3D $(x, y, z)$ coordinates; the subscript names the object whose coordinates are given, and the superscript names the coordinate frame in which these coordinates are expressed. So, for example, $r_{hand,NAO}^{torso,NAO}$ gives the coordinates of the hand of the NAO robot in the frame of the robot's torso.

\begin{figure}
\centering
\begin{subfigure}[b]{0.45\linewidth}
\includegraphics[width=\linewidth]{figures/operator_frames.png}
\caption{Operator's chest and shoulder frames}
\label{fig:operator-frames}
\end{subfigure}
\begin{subfigure}[b]{0.45\linewidth}
\includegraphics[width=\linewidth]{figures/robot_torso.png}
\caption{NAO's torso frame}
\label{fig:nao-frames}
\end{subfigure}
\caption{Coordinate frames}
\label{fig:coord-frames}
\end{figure}

After the ArUco markers are detected and published on ROS TF, as described in \autoref{ssec:vision}, we have the three vectors $r_{aruco,chest}^{webcam}$, $r_{aruco,lefthand}^{webcam}$ and $r_{aruco,righthand}^{webcam}$. We describe the retargeting for one hand, since it is symmetric for the other hand. We also assume that the user's coordinate systems all have the same orientation, with the z-axis pointing upwards, the x-axis pointing straight into the webcam and the y-axis pointing to the left of the webcam\footnote{This assumption holds because in the imitation mode the user always faces the camera directly and stands up straight. We need this assumption for robustness against the orientation of the chest marker, since it can accidentally get tilted. If we bound the coordinate system to the chest marker completely, we would need to place the marker on the chest firmly and carefully, which is time consuming.}. Therefore, we can directly calculate the hand position in the user's chest frame by means of the following equation:
$$r_{hand,user}^{chest,user} = r_{aruco,hand}^{webcam} - r_{aruco,chest}^{webcam}$$
Next, we remap the hand coordinates from the chest frame into the user's shoulder frame, using the following relation:
$$r_{hand,user}^{shoulder,user} = r_{hand,user}^{chest,user} - r_{shoulder,user}^{chest,user}$$
We know the coordinates of the user's shoulder in the user's chest frame from the calibration procedure described in \autoref{ssec:interface}. Now we retarget the user's hand coordinates to the desired NAO hand coordinates in the NAO's shoulder frame with the following formula:
$$r_{hand,NAO}^{shoulder,NAO} = \frac{L_{arm,NAO}}{L_{arm,user}}\, r_{hand,user}^{shoulder,user}$$
As before, we know the length of the user's arm through calibration and the length of the NAO's arm from the specification provided by the manufacturer. The final step of the posture retargeting is to obtain the coordinates of the end effector in the torso frame. This can be done through the following relation:
$$r_{hand,NAO}^{torso,NAO} = r_{hand,NAO}^{shoulder,NAO} + r_{shoulder,NAO}^{torso,NAO}$$
The coordinates of the NAO's shoulder in the NAO's torso frame can be obtained through a call to the NAOqi API. Now that the desired positions of the NAO's hands are known, the appropriate joint motions need to be calculated by means of Cartesian control.
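The complete retargeting chain amounts to only a few vector operations. The following Python sketch summarizes it, assuming that the marker positions are already available as NumPy vectors in the webcam frame; the calibration constants and function names are illustrative placeholders rather than the exact identifiers from our package.

\begin{verbatim}
import numpy as np

# Illustrative calibration values and NAO specification.
R_SHOULDER_CHEST_USER = np.array([0.0, 0.20, 0.25])  # [m]
L_ARM_USER = 0.60                                    # [m]
L_ARM_NAO = 0.22                                     # [m]

def retarget_hand(r_hand_webcam, r_chest_webcam,
                  r_shoulder_nao_torso):
    """Map a detected hand-marker position to the NAO hand
    target expressed in the NAO torso frame."""
    # Hand position in the user's chest frame.
    r_hand_chest = r_hand_webcam - r_chest_webcam
    # Hand position in the user's shoulder frame.
    r_hand_shoulder = r_hand_chest - R_SHOULDER_CHEST_USER
    # Scale by the ratio of arm lengths (user -> NAO).
    r_hand_nao = (L_ARM_NAO / L_ARM_USER) * r_hand_shoulder
    # Shift into the NAO torso frame.
    return r_hand_nao + r_shoulder_nao_torso
\end{verbatim}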
\paragraph{Cartesian control}
For this purpose, a singularity-robust Cartesian controller was built. The output of our Cartesian controller consists of the four angles of the rotational shoulder and elbow joints of each arm of the NAO robot, as described by the inverse kinematics relation
$$\Delta\theta = J^{-1}_{robust}\,\Delta r$$
To build the Cartesian controller, the Jacobian matrix is needed first. The Jacobian matrix describes, to first order, how the end effector moves when each joint of the robot moves. There are two main ways to determine the Jacobian matrix. The first one is the numerical method, where the approximation is obtained by applying a small rotation to each joint and observing how the end effector moves. In this case, each column of the Jacobian matrix can be approximated as follows:
$$\frac{\partial r}{\partial\theta} \approx \frac{\Delta r}{\Delta\theta} = \left( \begin{array}{ccc} \frac{\Delta r_x}{\Delta\theta} & \frac{\Delta r_y}{\Delta\theta} & \frac{\Delta r_z}{\Delta\theta} \end{array} \right)^{T}$$
The other method is the analytical one, which we used in this project. Since only rotational joints are involved, the column of the Jacobian matrix corresponding to a joint is the tangent of the circular motion induced by that joint, and it can be calculated as the cross product between the joint's rotation axis $e$ and the vector $r_{end}-r_{joint}$ from the joint to the end effector:
$$ \frac{\partial r_{end}}{\partial\theta _{joint}} = e \times (r_{end}-r_{joint}) $$
This is repeated for each rotational joint until the whole matrix is filled. The next step for the Cartesian controller is to determine the inverse of the Jacobian matrix for the inverse kinematics. For this, singular value decomposition is used, which allows singular values close to zero to be handled explicitly (for example by damping them), making the controller robust near singular configurations.
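As an illustration, the following Python sketch shows one way such a controller can be implemented with NumPy: the analytical Jacobian columns are built from the joint axes and positions, and a damped pseudo-inverse is obtained from the SVD. The damping constant and the helper names are assumptions made for this example rather than the exact values used in our implementation.

\begin{verbatim}
import numpy as np

def analytical_jacobian(joint_axes, joint_positions, r_end):
    """3xN positional Jacobian of a chain of rotational
    joints (axes and positions given in the torso frame)."""
    cols = [np.cross(e, r_end - r_j)
            for e, r_j in zip(joint_axes, joint_positions)]
    return np.array(cols).T          # shape (3, N)

def robust_pinv(J, damping=1e-2):
    """Singularity-robust (damped) pseudo-inverse via SVD."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    # Damp small singular values instead of inverting directly.
    s_inv = s / (s ** 2 + damping ** 2)
    return Vt.T.dot(np.diag(s_inv)).dot(U.T)

def cartesian_step(axes, positions, r_end, r_target):
    """One controller iteration: joint update that moves the
    end effector towards the target position."""
    J = analytical_jacobian(axes, positions, r_end)
    delta_r = r_target - r_end
    return robust_pinv(J).dot(delta_r)   # delta_theta
\end{verbatim}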
\section{System Implementation and Integration}
With the individual modules designed and implemented, the whole system needed to be assembled. It is crucial that the states of the robot and the transitions between the states are well defined and correctly executed. The state machine that we designed can be seen in \autoref{fig:overview}. The software package is organized as a collection of ROS nodes, controlled by a single master node.

The master node keeps track of the current system state, and the slave nodes consult the master node to check whether they are allowed to perform an action. To achieve this, the master node creates a server for a ROS service named \verb|inform_masterloop|. The service call takes as arguments the name of the caller and the desired action, and responds with a Boolean value indicating whether permission to perform the action was granted. The master node can then update the system state based on the received action requests and the current state. Some slave nodes, such as the walking or imitation nodes, run in a high-frequency loop and therefore consult the master in each iteration of the loop. Other nodes, such as the fall detector, only inform the master about the occurrence of certain events, such as a fall or a fall recovery, so that the master can deny requests for any activity until the fall recovery is complete.

We will now illustrate our architecture using the interaction between the walker node and the master node as an example. This interaction is depicted in \autoref{fig:integration-example}. The walker node subscribes to the TF transform of the chest ArUco marker and requests a position update every 0.1 seconds. If in the current cycle the marker happens to be outside of the buffer zone (see \autoref{fig_user_tracking}), or the rotation of the marker exceeds the motion threshold, the walker node asks the master node for permission to start moving. The master node receives the request, and if the current state of the system is either \textit{walking} or \textit{idle} (see \autoref{fig:overview}), the permission is granted and the system transitions into the \textit{walking} state. If the robot is currently imitating the arm motions or has not yet recovered from a fall, the permission is not granted and the system remains in its current state\footnote{We did investigate the possibility of automatic switching between walking and imitating, so that the robot always imitates when the operator is within the buffer zone and stops imitating as soon as the operator leaves the buffer zone, but this approach requires more skill and concentration from the operator, so the default setting is to explicitly ask the robot to go into the imitating state and back into idle.}. The walker node then receives the master's response; if it was negative, any current movement is stopped and the next cycle of the loop begins. If the permission was granted, the walker calculates the direction and the speed of the movement based on the marker position and sends a command to the robot over the NAOqi API to start moving. We use a non-blocking movement function, so that the movement objective can be updated with every loop iteration. Finally, if the marker is within the buffer zone, the walker node commands the robot to stop and informs the master that the robot has stopped moving. Since in this case the walker node gives up control, the master's permission is not required.
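For illustration, the sketch below shows how a slave node such as the walker could query the \verb|inform_masterloop| service in its loop. The service type \verb|InformMasterloop| (a request with \verb|caller| and \verb|action| strings and a Boolean \verb|permission| response) and the package name are assumptions made for this example; the actual service definition in our package may differ.

\begin{verbatim}
import rospy
# Assumed service definition: string caller, string action
# ---> bool permission (the real .srv file may differ).
from teleop_nao.srv import InformMasterloop

def chest_marker_outside_buffer():
    return False  # placeholder for the TF-based check

def send_walk_command():
    pass          # placeholder for ALMotion.moveToward

def stop_walking():
    pass          # placeholder for ALMotion.stopMove

rospy.init_node("walker")
rospy.wait_for_service("inform_masterloop")
ask_master = rospy.ServiceProxy("inform_masterloop",
                                InformMasterloop)

rate = rospy.Rate(10)  # 0.1 s cycle, as in the walker node
while not rospy.is_shutdown():
    if chest_marker_outside_buffer():
        reply = ask_master(caller="walker", action="walk")
        if reply.permission:
            send_walk_command()
        else:
            stop_walking()
    else:
        stop_walking()
        ask_master(caller="walker", action="stop")
    rate.sleep()
\end{verbatim}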
A final piece of our system is the speech-based command interface. Since the acceptable commands vary between states, the speech recognition controller must be aware of the current state of the system; therefore, the master node is responsible for this functionality. The master node runs an auxiliary loop in which a recognition target is sent to the speech server node. If a relevant word is detected, the master receives the result, updates the state accordingly and sends a new recognition target. If a state change occurred before any speech was detected, the master sends a cancellation request to the speech server for the currently running objective and, again, sends a new target. This interaction is schematically displayed in \autoref{fig:master-speech}.

\section{Conclusion and Possible Drawbacks}
Upon completion of this project, our team successfully applied the knowledge acquired during the HRS lectures and tutorials to a complex practical task. We implemented an easy-to-use prototype of a teleoperation system which is fairly robust to varying environmental conditions. Furthermore, we researched several approaches to the implementation of Cartesian control and were able to create a Cartesian controller which is superior to the NAO's built-in one. Finally, we used ROS extensively and can now confidently employ it in future projects.

Our resulting system has a few drawbacks, however, and there is room for future improvement. Some of these drawbacks are due to time constraints; others have to do with the limitations of the NAO itself.

The first major drawback is the reliance on the NAO's built-in speech recognition for controlling the robot. Because of this, the operator has to be in the same room as the robot, which severely constrains the applicability of the teleoperation system. Furthermore, since the acting robot is the one detecting the speech, it is susceptible to the sounds it makes itself during operation (joint noises, warning notifications). Also, as the live demonstration revealed, using voice-based control in a crowded environment can lead to a high number of false positive detections and therefore to instability of the system. A simple solution is to use two NAO robots, one of which stays in the room with the operator and acts solely as a speech detection tool, while the other one performs the actions in another room. A cleaner approach is to apply third-party speech recognition software to the webcam microphone feed, since speech recognition packages for ROS are available \cite{ros-speech}. However, because speech recognition was not the main objective of our project, we reserve this for possible future work.

Another important issue, which can be a problem for remote operation, are the cables. The NAO is connected to the controlling computer over an Ethernet cable, and, due to the low capacity of the NAO's battery, the power cord needs to be plugged in most of the time. The problem with this is that, without the direct oversight of the operator, it is impossible to know where the cables are relative to the robot, and therefore impossible to prevent the robot from tripping over the cables and falling. When it comes to battery power, the NAO has some autonomy; the Ethernet cable, however, cannot be removed, because the onboard Wi-Fi of the NAO is too slow to allow streaming of the video feed and the joint telemetry.

A related issue is the relatively narrow field of view of the NAO's cameras. In a cordless setup, the camera feed might be sufficient for the operator to navigate the NAO through the environment. However, picking up objects while seeing them only through the robot's cameras is extremely difficult because of the narrow field of view and the lack of depth information.
A possible solution to this issue and to the previous one, which would make it possible to operate the NAO without being in the same room with it, is to equip the robot's room with video cameras, so that some external oversight is possible.

Finally, there is a problem with the NAO's stability when it walks while carrying an object. Apparently, the NAOqi walking controller relies on the movement of the arms to stabilize the walk. It seems that if the arms are occupied by some other task during the walk, the built-in controller does not intelligently compensate for this, which led to a significant number of falls during our experiments. Due to the time constraints, we were not able to investigate approaches to making the walking more stable. This, however, can be an interesting topic for future semester projects.

\end{document}