added some figures to the docs
.gitignore
@@ -7,3 +7,4 @@ trained/
 __pycache__/
 config
 .dockerignore
+*.pdf
@@ -38,7 +38,7 @@ though.
 Compilation is supposed to be as simple as: (run in project root)
 
 ```sh
-meson build && cd build && ninja
+meson build && (cd build && ninja)
 ```
 
 If this fails then either fix it yourself or let me know I guess.
docs/fig/input_pipeline.drawio (new file)
@@ -0,0 +1 @@
<mxfile modified="2019-12-15T17:00:20.194Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="yK1R_JUEmaYINE8Mw2ns" version="11.1.1" type="device"><diagram id="7AXTfKl1m1ZNKUOlZLWa" name="Page-1">3ZrvU5swGMf/mr6cVwg/2pe26rY7t3mnN3XvIqSQmRIuBNvur18CAZoGrVNaGfXOIw9JIJ/n+zwhgRGYL9efGUzjbzREZGSPw/UInI1s259MxH9p2JQGb+yVhojhsDRZjeEa/0HKOFbWHIco0ypySgnHqW4MaJKggGs2yBhd6dUWlOhXTWGEDMN1AIlpvcUhj0vrxPYb+xeEo7i6suVNyzNLWFVWI8liGNLVlgmcj8CcUcrLo+V6johkV3Ep2108c7a+MYYS/poG3yc2vp//IhiCmX/q3VwtQPrJAWU3T5DkasTqbvmmQsBonoRI9jIegdkqxhxdpzCQZ1fC58IW8yURJUschjCLi7qysMCEzCmhTJQTmogWM3U5xDhaPzsQq8YjZIXoEnG2EVWqBmNFVEnKmqryqnGQU+kn1pyjjFCJIqr7briJA4XuHzB6+ymiUOhKFSnjMY1oAsl5Y53pnJs6l5SmCuhvxPlGBQnMOdXZl9eUF3qZrLgvmrMAvTCgKtQgixB/oZ7b7imGCOT4Sb+PzqlbBvUb+ogSgYe9T8RdiNTWRWq7pkgnLRp1DiXRydAk6r5Sov5HStQ1qF9gwnugTzER90uftd+Pqke0xvxu6/hednXiqtLZWvVcFDaqkAnJ8VP5bNFMa4VNuJZUDZKwqhEQmGU4KI2qSseh8F6Jq6ZXFIsr1xLxduZZ29/xfRl6qtWO++vbeLsifCN2ZpAHcQ+Cx+lbcp8aqH7SAD7kBKrOt2mJB9FUHgYbggU2th/ZQ8n38qE2wOAxKqj/yLnoBbU88I1ssHDlXxEeTMzMW2e84idb0IRv2ctfVynO0b1kmV7yj+kl50Om3LekOEGYbe62C1utZLFpVpT6mxorn+99THhGS0d6kh0bATyPIcuOGbsdRBzYSYuOfeIaIee2hJzrHgqsub69pSz8z7jWmasCC1pWvC1YwaEymWWuJvBR5xkxa1zAJSYSyJwmGSVyjJ1o2N3P2j4qa3NyHwxrv2esbTMPD4b1tG+sh6tr1/aria8vtMFwle06/aNtbk0OhrbXP9rmLttgaJuZpGUVe1za5suOwdA2M8mH0zZ3wQZD28wkH07bXNeIMYshym3RW8GWrgz4xVkNrr7bpbZBWt6FQoKjRLpNACucJnHiAJJTdWKJw7DYBGrzpr4x1IE/gKF+a2ou3z23xR+HWr4D8xmx2AMerhNcsOuClqV+iwvsQ4WE4xgu+JqkuQyIK5yiIlX00xv6pwgijNUWqeV146rXfIdgtX2HcLD05ZhT8xB2u1pmhY52u0Sx+QCnfG/VfMUEzv8C</diagram></mxfile>
docs/fig/modes.drawio (new file)
@@ -0,0 +1 @@
<mxfile modified="2019-12-15T17:12:53.280Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="wUOw0lOGHGc77jKsLJXB" version="11.1.1" type="device"><diagram id="HbwVHqc1XiIdVxvrsfh9" name="Page-1">7Vxbc5s8EP01fuw3NiCwH/sl6WUmnbaT6fTy0lFBMbQYMUKu7f76CiMZg4ghdqyLm5cMWi8Yds/Rrg6KR+7VYv2awDx+hyOUjpxxtB651yPHmXk++1saNpUhcILKMCdJVJkmteEu+YO4ccytyyRCRcORYpzSJG8aQ5xlKKQNGyQEr5pu9zhtfmsO50gy3IUwla2fk4jGlXUKxrX9DUrmsfjmyZh/soDCmRuKGEZ4tWdyb0buFcGYVkeL9RVKy9iJuFTnvXrg092NEZTRISf8oquP928/pd+SYv0eBS++f3u3eSEuU9CNeGIUsQDwISY0xnOcwfSmtv5P8DKLUHnZMRvVPrcY58w4YcafiNINzyZcUsxMMV2k/FO0TuiXveOv5aX+cwAfXq/5pbeDjRhklGy+7A+q04AY1qdtR+K86gHLp2oErsBLEqJDoQEcbpDMET3kGMipmOwSzIiB8AKxO2IuBKWQJr+bdwI5ROc7P37qS0LgZs8hx0lGi70rfygNzIGTbUcazrVJ0EJEj/+s4c4OqhsQo70nqU1bkD0GcI45gLMUb24H9fXgzWnhbdqDt7Y/UAE49xlwp05wswsBnJjwzgs4zxzABZYiznVNQVy7pM4eV1KdQAXieFh/w3TJ4/A2y5flbX1IcpQmGZIg2QTcKk4ousvhNnsr1sg3wcUvjwhF6we7zwfyIxYCrTj6fLyqm+pd6xzvNdTe+OGM7kX1iKD55tD0OJY6fTQtGM/oy3IRxAxhCosiCYX5VZI2s3sEm4OhbPYNYbPrHmp4e90nSsgcSGS+RZBkiGjn8I6MYi0vc9hXSuGp7RTWzeCZdQw+3NH1+rc6xjNReGYNhUU/o43DYop95vCRHBZqgL0cbskAZnBYhNUGDru6OWy9phfoJrFrHYndgwvd/l5ahdgswmoBiR2gm8QGyVZmqVbuYJ0UPD05T8upLAxJSS5imJeH4SZNWC5JP/p/VEm//bEzwPDXfAuF90u6VZuejib9spGnlCW+FNHrpMghDWMDJhUX9IcrUBouWcz4vP3ewjIcSpF1NEcWgEufrhstV4bLbDb6rQNTupTXAbO85wyc5T1P55QubtNkrR+0er0OnVCt1u9Z/w5Yt9bveQPZAUx5cxcETRD2aP1tdyVav4iqgesTf2qW1u/9W/XuDAz2rWOw31oi92j9bX8lOqEnLwhMpbB2rd8Lnjl8GoentnO4T+vXw+GpPRzWrfV7M9s5rFvrF0qCRSRu9cZ9Wr/US6vQ+kVYLSCxdq0fGPRfH2Zp/WCwCuQ8PTlPy6ksDElJNltj9fzeJadSrR/Ibw8N0vqDATtq1SrSsphhp9YvRVa31u//W9rH+bV+MFQLqVYG2qZ0WVswXuvvWqCoFfvBxasMRzc3g+WDM+wyOi2n8grdsqLSbm66eKK0u/Gt//8X04rK0M30lWSsjUry5nTzi0qHcq22qAj2Xi5bji0q/tD95wJ4xhQVcecXVFQ6eKK2qFx876W4qPhD9ahKUtRGJVl8Mr6odKmwiouK9duSzlZUhm43EsAzp6jIKpjlRaWLJ2qLyoDF36MmkwgW8da3HNyzefsKp5jU8/kTBLH1MxBd29adjhieL4jBgGbHtCB6QF0U2bD+KbbqnWT9e3buzV8=</diagram></mxfile>
@@ -1,6 +1,7 @@
 \documentclass{article}
 \usepackage[letterpaper, margin=1in]{geometry}
 \usepackage[colorlinks]{hyperref}
+\usepackage{graphicx}
 \usepackage{listings}
 \lstset{basicstyle=\ttfamily}
 
@@ -62,14 +63,21 @@ perform some proxy task. The resulting embedding matrix is the weight matrix of
 the first layer of the trained neural network.
 
 The text data, before being supplied to the neural network, has to pass several
-preprocessing stages. These stages, as implemented during in this project,
-form an \textit{input pipeline}, which is depicted in \autoref{fig:pipeline}.
-First, the pipeline node called \textit{Tokenizer} reads a character stream
-from a text file. This node is responsible for replacing all non-ASCII
-alphabetic characters in the stream with whitespace, normalizing the stream by
-setting all remaining alphabetic characters to lowercase, and finally splitting
-the stream into tokens (words) and passing the words one-by-one to the next
-pipeline stage.
+preprocessing stages. These stages, as implemented in this project, form an
+\textit{input pipeline}, which is depicted in \autoref{fig:pipeline}. First,
+the pipeline node called \textit{Tokenizer} reads a character stream from a
+text file. This node is responsible for replacing all non-ASCII alphabetic
+characters in the stream with whitespace, normalizing the stream by setting all
+remaining alphabetic characters to lowercase, and finally splitting the stream
+into tokens (words) and passing the words one-by-one to the next pipeline
+stage.
 
+\begin{figure}
+\centering
+\includegraphics[width=0.7\linewidth]{fig/input_pipeline.pdf}
+\caption{An Input Pipeline in the System}
+\label{fig:pipeline}
+\end{figure}
+
 The next pipeline stage is filtering, for which the \textit{Filter} node is
 responsible. When computing word embeddings using the CBOW model, only those
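A minimal sketch of the Tokenizer stage described in the hunk above, written as a plain Python generator (hypothetical code, not the project's implementation): every character that is not an ASCII letter becomes whitespace, the remaining letters are lowercased, and the result is split into word tokens that are passed on one by one.

```python
import re

def tokenize(stream):
    """Hypothetical sketch of the Tokenizer node, not the project's code."""
    for chunk in stream:
        # Replace every character that is not an ASCII letter with whitespace,
        # then normalize the remaining letters to lowercase.
        cleaned = re.sub(r"[^A-Za-z]+", " ", chunk).lower()
        # Split into tokens (words) and pass them on one by one.
        for token in cleaned.split():
            yield token

# Example usage: feed a few lines of text through the sketch.
print(list(tokenize(["Call me Ishmael.", "Some years ago, never mind how long"])))
```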
@@ -106,10 +114,18 @@ In the presented system, there is one central node, called the
 \textit{Dispatcher}, that is responsible for storing the model weights,
 distributing the weights to the \textit{Learner} nodes, which perform the
 actual training, and collecting the weights at the end of a training round and
-computing their average. The system allows for each \textit{Learner} to have
-its own input pipeline, or for one single input pipeline to be shared among all
-Learners, or for some intermediate configuration. However, it is not currently
-possible for one Learner to access more than one input pipeline.
+computing their average. \autoref{fig:modes} demonstrates that the system
+allows for each \textit{Learner} to have its own input pipeline, or for one
+single input pipeline to be shared among all Learners, or for some intermediate
+configuration. However, it is not currently possible for one Learner to access
+more than one input pipeline.
 
+\begin{figure}[h]
+\centering
+\includegraphics[width=\linewidth]{fig/modes.pdf}
+\caption{Two Configurable Modes of System Operation}
+\label{fig:modes}
+\end{figure}
+
 \section{Implementation Details}
 
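The round-end step described above (collect the weight matrices from all Learners and replace the model weights with their average) can be sketched as follows; this is an assumed NumPy illustration, not the project's Dispatcher API.

```python
import numpy as np

def average_round(learner_weights):
    """Hypothetical sketch: element-wise average of the weight matrices
    collected from the Learners at the end of a training round."""
    return np.mean(np.stack(learner_weights), axis=0)

# Example usage: three Learners report weights of the same shape; the averaged
# matrix would then be redistributed for the next round.
updated = average_round([np.full((4, 2), 1.0), np.zeros((4, 2)), np.full((4, 2), 0.5)])
print(updated)
```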
@@ -154,8 +170,8 @@ To run this system, you will need the following software:
 
 \item \textit{MPICH} 3;
 
-\item \textit{Python} 3.6 with headers and libraries (e.g.\@ on Ubuntu you
-need to install \verb|python3-dev|);
+\item \textit{Python} $\geq3.6$ with headers and libraries (e.g.\@ on Ubuntu
+you need to install \verb|python3-dev|);
 
 \item \textit{Meson}, \textit{Cython} and \textit{ninja} for building;
 
@@ -383,14 +399,14 @@ Learner nodes would essentially result in sequential simulation of the parallel
 processing, thus yielding no improvements in processing time.
 
 The evaluations were performed on two datasets. The first one being the book
-``Moby Dick'' by Herman Melville (~200k words), obtained from the Project
-Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit. The
-vocabulary used for this dataset are all words from the book excluding English
-stop words, as defined by NLTK. The test part for this dataset were a 1000
-randomly selected context windows from the book.
+``Moby Dick'' by Herman Melville (approx.\@ 200k words), obtained from the
+Project Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit.
+The vocabulary used for this dataset are all words from the book excluding
+English stop words, as defined by NLTK. The test part for this dataset were a
+1000 randomly selected context windows from the book.
 
 Another dataset was a part of a recent English Wikipedia dump~\cite{wikidump}
-(~90M words), which was transformed into plain text using the
+(approx.\@ 90M words), which was transformed into plain text using the
 WikiExtractor~\cite{wikiextractor} tool. For this dataset the vocabulary is the
 list of 10000 most frequently used English words, obtained
 from~\cite{10k-words}, again, excluding the stop words. As a test data, 5000
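One plausible way to reproduce the first dataset with NLTK is sketched below; the corpus file id and the filtering details are assumptions, not the project's actual preprocessing code.

```python
import nltk
from nltk.corpus import gutenberg, stopwords

nltk.download("gutenberg")
nltk.download("stopwords")

# Roughly 200k running words once punctuation tokens are dropped.
words = [w.lower() for w in gutenberg.words("melville-moby_dick.txt") if w.isalpha()]

# Vocabulary: all words from the book except the NLTK English stop words.
stops = set(stopwords.words("english"))
vocabulary = sorted(set(words) - stops)
print(len(words), len(vocabulary))
```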
@@ -421,7 +437,7 @@ explained and probably has to do something with the particularities of the
 training algorithm and the training data. This result also validates the use of
 the number of context windows consumed by each Learner as a proxy for system
 performance, since scaling within the number of available cores results in an
-almost perfect correlation between the amount of consumed data and the wall
+almost perfect correlation between the amount of data per Learner and the wall
 time. Going from 2 to 4 Learners decreases the amount of data per Learner by
 another 1.7x, with the wall time remaining the same, demonstrating the core
 depletion on the laptop. Further increasing the number of learner nodes results