added some figures to the docs

2019-12-15 09:29:01 -08:00
parent 086b021f26
commit 06d0b2d565
5 changed files with 41 additions and 22 deletions


@@ -1,6 +1,7 @@
\documentclass{article}
\usepackage[letterpaper, margin=1in]{geometry}
\usepackage[colorlinks]{hyperref}
+\usepackage{graphicx}
\usepackage{listings}
\lstset{basicstyle=\ttfamily}
@@ -62,14 +63,21 @@ perform some proxy task. The resulting embedding matrix is the weight matrix of
the first layer of the trained neural network.
-The text data, before being supplied to the neural network, has to pass several
-preprocessing stages. These stages, as implemented during in this project,
-form an \textit{input pipeline}, which is depicted in \autoref{fig:pipeline}.
-First, the pipeline node called \textit{Tokenizer} reads a character stream
-from a text file. This node is responsible for replacing all non-ASCII
-alphabetic characters in the stream with whitespace, normalizing the stream by
-setting all remaining alphabetic characters to lowercase, and finally splitting
-the stream into tokens (words) and passing the words one-by-one to the next
-pipeline stage.
+The text data, before being supplied to the neural network, has to pass several
+preprocessing stages. These stages, as implemented in this project, form an
+\textit{input pipeline}, which is depicted in \autoref{fig:pipeline}. First,
+the pipeline node called \textit{Tokenizer} reads a character stream from a
+text file. This node is responsible for replacing all non-ASCII alphabetic
+characters in the stream with whitespace, normalizing the stream by setting all
+remaining alphabetic characters to lowercase, and finally splitting the stream
+into tokens (words) and passing the words one by one to the next pipeline
+stage.
+\begin{figure}
+\centering
+\includegraphics[width=0.7\linewidth]{fig/input_pipeline.pdf}
+\caption{An Input Pipeline in the System}
+\label{fig:pipeline}
+\end{figure}
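To make the \textit{Tokenizer} stage concrete, here is a minimal Python sketch of the behaviour described above. It assumes that ``replacing all non-ASCII alphabetic characters'' means replacing every character that is not an ASCII letter with whitespace; the function name and the line-by-line file interface are illustrative and not taken from the project's code.
\begin{lstlisting}[language=Python]
import re

def tokenize(path):
    # Illustrative sketch of the Tokenizer stage, not the project's code.
    with open(path, encoding="utf-8", errors="replace") as stream:
        for line in stream:
            # Replace every character that is not an ASCII letter with a space.
            cleaned = re.sub(r"[^A-Za-z]", " ", line)
            # Lowercase the remaining letters and emit the tokens one by one.
            for token in cleaned.lower().split():
                yield token
\end{lstlisting}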
The next pipeline stage is filtering, for which the \textit{Filter} node is
responsible. When computing word embeddings using the CBOW model, only those
@@ -106,10 +114,18 @@ In the presented system, there is one central node, called the
\textit{Dispatcher}, that is responsible for storing the model weights,
distributing the weights to the \textit{Learner} nodes, which perform the
actual training, and collecting the weights at the end of a training round and
-computing their average. The system allows for each \textit{Learner} to have
-its own input pipeline, or for one single input pipeline to be shared among all
-Learners, or for some intermediate configuration. However, it is not currently
-possible for one Learner to access more than one input pipeline.
+computing their average. As shown in \autoref{fig:modes}, the system allows
+each \textit{Learner} to have its own input pipeline, a single input pipeline
+to be shared among all Learners, or some intermediate configuration. However,
+it is not currently possible for one Learner to access more than one input
+pipeline.
+\begin{figure}[h]
+\centering
+\includegraphics[width=\linewidth]{fig/modes.pdf}
+\caption{Two Configurable Modes of System Operation}
+\label{fig:modes}
+\end{figure}
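The round-based averaging performed by the \textit{Dispatcher} can be illustrated with the short NumPy sketch below. It shows only the arithmetic of a single round; the function and variable names are invented for this example and do not correspond to the actual Dispatcher implementation.
\begin{lstlisting}[language=Python]
import numpy as np

def average_round(learner_weights):
    # learner_weights: list of equally shaped weight matrices,
    # one per Learner, collected at the end of a training round.
    stacked = np.stack(learner_weights)
    # The averaged matrix is what gets redistributed to the Learners
    # at the start of the next round.
    return stacked.mean(axis=0)
\end{lstlisting}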
\section{Implementation Details}
@@ -154,8 +170,8 @@ To run this system, you will need the following software:
\item \textit{MPICH} 3;
-\item \textit{Python} 3.6 with headers and libraries (e.g.\@ on Ubuntu you
-need to install \verb|python3-dev|);
+\item \textit{Python} $\geq3.6$ with headers and libraries (e.g.\@ on Ubuntu
+you need to install \verb|python3-dev|);
\item \textit{Meson}, \textit{Cython} and \textit{ninja} for building;
@@ -383,14 +399,14 @@ Learner nodes would essentially result in sequential simulation of the parallel
processing, thus yielding no improvements in processing time.
The evaluations were performed on two datasets. The first one is the book
-``Moby Dick'' by Herman Melville (~200k words), obtained from the Project
-Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit. The
-vocabulary used for this dataset are all words from the book excluding English
-stop words, as defined by NLTK. The test part for this dataset were a 1000
-randomly selected context windows from the book.
+``Moby Dick'' by Herman Melville (approx.\@ 200k words), obtained from
+Project Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit.
+The vocabulary used for this dataset consists of all words from the book,
+excluding English stop words as defined by NLTK. The test set for this dataset
+consists of 1000 randomly selected context windows from the book.
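The preparation of this dataset can be sketched with NLTK roughly as follows. The lowercasing, the use of \verb|isalpha()|, and the context window radius are assumptions made for the example rather than details taken from the report.
\begin{lstlisting}[language=Python]
import random
import nltk
from nltk.corpus import gutenberg, stopwords

# One-time downloads of the corpora used below.
nltk.download("gutenberg")
nltk.download("stopwords")

stop = set(stopwords.words("english"))
words = [w.lower() for w in gutenberg.words("melville-moby_dick.txt")
         if w.isalpha()]
# Vocabulary: all words from the book except English stop words.
vocabulary = set(words) - stop

# Test data: 1000 randomly selected context windows (the radius is
# chosen arbitrarily here; the report does not state the window size).
radius = 2
centers = random.sample(range(radius, len(words) - radius), 1000)
test_windows = [words[c - radius:c + radius + 1] for c in centers]
\end{lstlisting}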
Another dataset was a part of a recent English Wikipedia dump~\cite{wikidump}
-(~90M words), which was transformed into plain text using the
+(approx.\@ 90M words), which was transformed into plain text using the
WikiExtractor~\cite{wikiextractor} tool. For this dataset, the vocabulary is
the list of the 10000 most frequently used English words obtained
from~\cite{10k-words}, again excluding the stop words. As test data, 5000
@@ -421,7 +437,7 @@ explained and probably has something to do with the particularities of the
training algorithm and the training data. This result also validates the use of
the number of context windows consumed by each Learner as a proxy for system
performance, since scaling within the number of available cores results in an
-almost perfect correlation between the amount of consumed data and the wall
+almost perfect correlation between the amount of data per Learner and the wall
time. Going from 2 to 4 Learners decreases the amount of data per Learner by
another 1.7x, with the wall time remaining the same, demonstrating that the
laptop's cores are exhausted. Further increasing the number of Learner nodes
results