added some figures to the docs
.gitignore (vendored) | 1 +
@@ -7,3 +7,4 @@ trained/
 __pycache__/
 config
 .dockerignore
+*.pdf
@@ -38,7 +38,7 @@ though.
 Compilation is supposed to be as simple as: (run in project root)
 
 ```sh
-meson build && cd build && ninja
+meson build && (cd build && ninja)
 ```
 
 If this fails then either fix it yourself or let me know I guess.
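As a note on the command in this hunk: the parentheses run `cd build && ninja` in a subshell, so the invoking shell stays in the project root once the build finishes. A hypothetical session illustrating this (not part of the repository):

```sh
meson build && (cd build && ninja)
pwd         # still the project root, because the cd ran in a subshell
ls build/   # build artifacts live here
```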
docs/fig/input_pipeline.drawio (new file) | 1 +
@@ -0,0 +1 @@
+<mxfile modified="2019-12-15T17:00:20.194Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="yK1R_JUEmaYINE8Mw2ns" version="11.1.1" type="device"><diagram id="7AXTfKl1m1ZNKUOlZLWa" name="Page-1">3ZrvU5swGMf/mr6cVwg/2pe26rY7t3mnN3XvIqSQmRIuBNvur18CAZoGrVNaGfXOIw9JIJ/n+zwhgRGYL9efGUzjbzREZGSPw/UInI1s259MxH9p2JQGb+yVhojhsDRZjeEa/0HKOFbWHIco0ypySgnHqW4MaJKggGs2yBhd6dUWlOhXTWGEDMN1AIlpvcUhj0vrxPYb+xeEo7i6suVNyzNLWFVWI8liGNLVlgmcj8CcUcrLo+V6johkV3Ep2108c7a+MYYS/poG3yc2vp//IhiCmX/q3VwtQPrJAWU3T5DkasTqbvmmQsBonoRI9jIegdkqxhxdpzCQZ1fC58IW8yURJUschjCLi7qysMCEzCmhTJQTmogWM3U5xDhaPzsQq8YjZIXoEnG2EVWqBmNFVEnKmqryqnGQU+kn1pyjjFCJIqr7briJA4XuHzB6+ymiUOhKFSnjMY1oAsl5Y53pnJs6l5SmCuhvxPlGBQnMOdXZl9eUF3qZrLgvmrMAvTCgKtQgixB/oZ7b7imGCOT4Sb+PzqlbBvUb+ogSgYe9T8RdiNTWRWq7pkgnLRp1DiXRydAk6r5Sov5HStQ1qF9gwnugTzER90uftd+Pqke0xvxu6/hednXiqtLZWvVcFDaqkAnJ8VP5bNFMa4VNuJZUDZKwqhEQmGU4KI2qSseh8F6Jq6ZXFIsr1xLxduZZ29/xfRl6qtWO++vbeLsifCN2ZpAHcQ+Cx+lbcp8aqH7SAD7kBKrOt2mJB9FUHgYbggU2th/ZQ8n38qE2wOAxKqj/yLnoBbU88I1ssHDlXxEeTMzMW2e84idb0IRv2ctfVynO0b1kmV7yj+kl50Om3LekOEGYbe62C1utZLFpVpT6mxorn+99THhGS0d6kh0bATyPIcuOGbsdRBzYSYuOfeIaIee2hJzrHgqsub69pSz8z7jWmasCC1pWvC1YwaEymWWuJvBR5xkxa1zAJSYSyJwmGSVyjJ1o2N3P2j4qa3NyHwxrv2esbTMPD4b1tG+sh6tr1/aria8vtMFwle06/aNtbk0OhrbXP9rmLttgaJuZpGUVe1za5suOwdA2M8mH0zZ3wQZD28wkH07bXNeIMYshym3RW8GWrgz4xVkNrr7bpbZBWt6FQoKjRLpNACucJnHiAJJTdWKJw7DYBGrzpr4x1IE/gKF+a2ou3z23xR+HWr4D8xmx2AMerhNcsOuClqV+iwvsQ4WE4xgu+JqkuQyIK5yiIlX00xv6pwgijNUWqeV146rXfIdgtX2HcLD05ZhT8xB2u1pmhY52u0Sx+QCnfG/VfMUEzv8C</diagram></mxfile>

docs/fig/modes.drawio (new file) | 1 +
@@ -0,0 +1 @@
+<mxfile modified="2019-12-15T17:12:53.280Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="wUOw0lOGHGc77jKsLJXB" version="11.1.1" type="device"><diagram id="HbwVHqc1XiIdVxvrsfh9" name="Page-1">7Vxbc5s8EP01fuw3NiCwH/sl6WUmnbaT6fTy0lFBMbQYMUKu7f76CiMZg4ghdqyLm5cMWi8Yds/Rrg6KR+7VYv2awDx+hyOUjpxxtB651yPHmXk++1saNpUhcILKMCdJVJkmteEu+YO4ccytyyRCRcORYpzSJG8aQ5xlKKQNGyQEr5pu9zhtfmsO50gy3IUwla2fk4jGlXUKxrX9DUrmsfjmyZh/soDCmRuKGEZ4tWdyb0buFcGYVkeL9RVKy9iJuFTnvXrg092NEZTRISf8oquP928/pd+SYv0eBS++f3u3eSEuU9CNeGIUsQDwISY0xnOcwfSmtv5P8DKLUHnZMRvVPrcY58w4YcafiNINzyZcUsxMMV2k/FO0TuiXveOv5aX+cwAfXq/5pbeDjRhklGy+7A+q04AY1qdtR+K86gHLp2oErsBLEqJDoQEcbpDMET3kGMipmOwSzIiB8AKxO2IuBKWQJr+bdwI5ROc7P37qS0LgZs8hx0lGi70rfygNzIGTbUcazrVJ0EJEj/+s4c4OqhsQo70nqU1bkD0GcI45gLMUb24H9fXgzWnhbdqDt7Y/UAE49xlwp05wswsBnJjwzgs4zxzABZYiznVNQVy7pM4eV1KdQAXieFh/w3TJ4/A2y5flbX1IcpQmGZIg2QTcKk4ousvhNnsr1sg3wcUvjwhF6we7zwfyIxYCrTj6fLyqm+pd6xzvNdTe+OGM7kX1iKD55tD0OJY6fTQtGM/oy3IRxAxhCosiCYX5VZI2s3sEm4OhbPYNYbPrHmp4e90nSsgcSGS+RZBkiGjn8I6MYi0vc9hXSuGp7RTWzeCZdQw+3NH1+rc6xjNReGYNhUU/o43DYop95vCRHBZqgL0cbskAZnBYhNUGDru6OWy9phfoJrFrHYndgwvd/l5ahdgswmoBiR2gm8QGyVZmqVbuYJ0UPD05T8upLAxJSS5imJeH4SZNWC5JP/p/VEm//bEzwPDXfAuF90u6VZuejib9spGnlCW+FNHrpMghDWMDJhUX9IcrUBouWcz4vP3ewjIcSpF1NEcWgEufrhstV4bLbDb6rQNTupTXAbO85wyc5T1P55QubtNkrR+0er0OnVCt1u9Z/w5Yt9bveQPZAUx5cxcETRD2aP1tdyVav4iqgesTf2qW1u/9W/XuDAz2rWOw31oi92j9bX8lOqEnLwhMpbB2rd8Lnjl8GoentnO4T+vXw+GpPRzWrfV7M9s5rFvrF0qCRSRu9cZ9Wr/US6vQ+kVYLSCxdq0fGPRfH2Zp/WCwCuQ8PTlPy6ksDElJNltj9fzeJadSrR/Ibw8N0vqDATtq1SrSsphhp9YvRVa31u//W9rH+bV+MFQLqVYG2qZ0WVswXuvvWqCoFfvBxasMRzc3g+WDM+wyOi2n8grdsqLSbm66eKK0u/Gt//8X04rK0M30lWSsjUry5nTzi0qHcq22qAj2Xi5bji0q/tD95wJ4xhQVcecXVFQ6eKK2qFx876W4qPhD9ahKUtRGJVl8Mr6odKmwiouK9duSzlZUhm43EsAzp6jIKpjlRaWLJ2qLyoDF36MmkwgW8da3HNyzefsKp5jU8/kTBLH1MxBd29adjhieL4jBgGbHtCB6QF0U2bD+KbbqnWT9e3buzV8=</diagram></mxfile>
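The LaTeX changes below include fig/input_pipeline.pdf and fig/modes.pdf, while only the .drawio sources are committed and *.pdf is now git-ignored, so the PDF figures are presumably exported locally. One possible export step, assuming the draw.io desktop CLI is installed (the commit itself does not show how the PDFs are produced):

```sh
drawio --export --format pdf --output docs/fig/input_pipeline.pdf docs/fig/input_pipeline.drawio
drawio --export --format pdf --output docs/fig/modes.pdf docs/fig/modes.drawio
```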
@@ -1,6 +1,7 @@
 \documentclass{article}
 \usepackage[letterpaper, margin=1in]{geometry}
 \usepackage[colorlinks]{hyperref}
+\usepackage{graphicx}
 \usepackage{listings}
 \lstset{basicstyle=\ttfamily}
 
@@ -62,14 +63,21 @@ perform some proxy task. The resulting embedding matrix is the weight matrix of
 the first layer of the trained neural network.
 
 The text data, before being supplied to the neural network, has to pass several
-preprocessing stages. These stages, as implemented during in this project,
-form an \textit{input pipeline}, which is depicted in \autoref{fig:pipeline}.
-First, the pipeline node called \textit{Tokenizer} reads a character stream
-from a text file. This node is responsible for replacing all non-ASCII
-alphabetic characters in the stream with whitespace, normalizing the stream by
-setting all remaining alphabetic characters to lowercase, and finally splitting
-the stream into tokens (words) and passing the words one-by-one to the next
-pipeline stage.
+preprocessing stages. These stages, as implemented in this project, form an
+\textit{input pipeline}, which is depicted in \autoref{fig:pipeline}. First,
+the pipeline node called \textit{Tokenizer} reads a character stream from a
+text file. This node is responsible for replacing all non-ASCII alphabetic
+characters in the stream with whitespace, normalizing the stream by setting all
+remaining alphabetic characters to lowercase, and finally splitting the stream
+into tokens (words) and passing the words one-by-one to the next pipeline
+stage.
 
+\begin{figure}
+\centering
+\includegraphics[width=0.7\linewidth]{fig/input_pipeline.pdf}
+\caption{An Input Pipeline in the System}
+\label{fig:pipeline}
+\end{figure}
+
 The next pipeline stage is filtering, for which the \textit{Filter} node is
 responsible. When computing word embeddings using the CBOW model, only those
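The Tokenizer stage described in the paragraph above is easy to prototype. The following is a minimal illustrative sketch in Python, not the project's implementation; it assumes that replacing "non-ASCII alphabetic characters" amounts to replacing every character outside [A-Za-z]:

```python
import re
from typing import Iterable, Iterator

def tokenize(chars: Iterable[str]) -> Iterator[str]:
    # Sketch of the Tokenizer: replace every character outside [A-Za-z]
    # with whitespace, lowercase the rest, and emit tokens one by one.
    text = re.sub(r"[^A-Za-z]+", " ", "".join(chars)).lower()
    yield from text.split()

# list(tokenize("Call me Ishmael. Some years ago..."))
# -> ['call', 'me', 'ishmael', 'some', 'years', 'ago']
```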
@@ -106,10 +114,18 @@ In the presented system, there is one central node, called the
 \textit{Dispatcher}, that is responsible for storing the model weights,
 distributing the weights to the \textit{Learner} nodes, which perform the
 actual training, and collecting the weights at the end of a training round and
-computing their average. The system allows for each \textit{Learner} to have
-its own input pipeline, or for one single input pipeline to be shared among all
-Learners, or for some intermediate configuration. However, it is not currently
-possible for one Learner to access more than one input pipeline.
+computing their average. \autoref{fig:modes} demonstrates that the system
+allows for each \textit{Learner} to have its own input pipeline, or for one
+single input pipeline to be shared among all Learners, or for some intermediate
+configuration. However, it is not currently possible for one Learner to access
+more than one input pipeline.
 
+\begin{figure}[h]
+\centering
+\includegraphics[width=\linewidth]{fig/modes.pdf}
+\caption{Two Configurable Modes of System Operation}
+\label{fig:modes}
+\end{figure}
+
 \section{Implementation Details}
 
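The Dispatcher/Learner round described in the hunk above maps naturally onto standard MPI collectives. The sketch below is an illustration only, using mpi4py and invented names (local_train, the weight shape, the number of rounds); the project's actual MPICH-based implementation may structure the communication differently:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_learners = comm.Get_size() - 1          # rank 0 acts as the Dispatcher (run with >= 2 ranks)

def local_train(w):
    # Placeholder for one round of CBOW training on this Learner's input pipeline.
    return w + 0.01 * np.random.randn(*w.shape)

weights = np.zeros((1000, 100))           # Dispatcher's copy of the model weights

for _ in range(10):                       # training rounds
    weights = comm.bcast(weights, root=0)                # distribute weights to all Learners
    update = local_train(weights) if rank > 0 else np.zeros_like(weights)
    total = comm.reduce(update, op=MPI.SUM, root=0)      # collect weights at the Dispatcher
    if rank == 0:
        weights = total / n_learners                     # average the Learners' weights
```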
@@ -154,8 +170,8 @@ To run this system, you will need the following software:
 
 \item \textit{MPICH} 3;
 
-\item \textit{Python} 3.6 with headers and libraries (e.g.\@ on Ubuntu you
-need to install \verb|python3-dev|);
+\item \textit{Python} $\geq3.6$ with headers and libraries (e.g.\@ on Ubuntu
+you need to install \verb|python3-dev|);
 
 \item \textit{Meson}, \textit{Cython} and \textit{ninja} for building;
 
@@ -383,14 +399,14 @@ Learner nodes would essentially result in sequential simulation of the parallel
 processing, thus yielding no improvements in processing time.
 
 The evaluations were performed on two datasets. The first one being the book
-``Moby Dick'' by Herman Melville (~200k words), obtained from the Project
-Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit. The
-vocabulary used for this dataset are all words from the book excluding English
-stop words, as defined by NLTK. The test part for this dataset were a 1000
-randomly selected context windows from the book.
+``Moby Dick'' by Herman Melville (approx.\@ 200k words), obtained from the
+Project Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit.
+The vocabulary used for this dataset are all words from the book excluding
+English stop words, as defined by NLTK. The test part for this dataset were a
+1000 randomly selected context windows from the book.
 
 Another dataset was a part of a recent English Wikipedia dump~\cite{wikidump}
-(~90M words), which was transformed into plain text using the
+(approx.\@ 90M words), which was transformed into plain text using the
 WikiExtractor~\cite{wikiextractor} tool. For this dataset the vocabulary is the
 list of 10000 most frequently used English words, obtained
 from~\cite{10k-words}, again, excluding the stop words. As a test data, 5000
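The Moby Dick dataset described in this hunk can be reproduced roughly with NLTK's corpus API. This is an illustrative reconstruction rather than the actual evaluation script; the corpus file id, the context-window size, and the sampling details are assumptions:

```python
import random
import nltk
from nltk.corpus import gutenberg, stopwords

nltk.download("gutenberg")
nltk.download("stopwords")

# "Moby Dick" as a flat token list (roughly 200k words) via NLTK's Project Gutenberg corpus.
words = [w.lower() for w in gutenberg.words("melville-moby_dick.txt") if w.isalpha()]

# Vocabulary: all words from the book excluding English stop words.
vocab = set(words) - set(stopwords.words("english"))

# Test data: 1000 randomly selected context windows (window size assumed here).
window = 5
starts = random.sample(range(len(words) - window), 1000)
test_windows = [words[s:s + window] for s in starts]
```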
@@ -421,7 +437,7 @@ explained and probably has to do something with the particularities of the
 training algorithm and the training data. This result also validates the use of
 the number of context windows consumed by each Learner as a proxy for system
 performance, since scaling within the number of available cores results in an
-almost perfect correlation between the amount of consumed data and the wall
+almost perfect correlation between the amount of data per Learner and the wall
 time. Going from 2 to 4 Learners decreases the amount of data per Learner by
 another 1.7x, with the wall time remaining the same, demonstrating the core
 depletion on the laptop. Further increasing the number of learner nodes results