added some figures to the docs
.gitignore (vendored) | 1 +
@@ -7,3 +7,4 @@ trained/
 __pycache__/
 config
 .dockerignore
+*.pdf
@@ -38,7 +38,7 @@ though.
 Compilation is supposed to be as simple as: (run in project root)
 
 ```sh
-meson build && cd build && ninja
+meson build && (cd build && ninja)
 ```
 
 If this fails then either fix it yourself or let me know I guess.
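As a note on the command in this hunk: the parentheses run `cd build && ninja` in a subshell, so the invoking shell stays in the project root once the build finishes. A hypothetical session illustrating this (not part of the repository):

```sh
meson build && (cd build && ninja)
pwd         # still the project root, because the cd ran in a subshell
ls build/   # build artifacts live here
```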
docs/fig/input_pipeline.drawio (new file) | 1 +
@@ -0,0 +1 @@
+<mxfile modified="2019-12-15T17:00:20.194Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="yK1R_JUEmaYINE8Mw2ns" version="11.1.1" type="device"><diagram id="7AXTfKl1m1ZNKUOlZLWa" name="Page-1">3ZrvU5swGMf/mr6cVwg/2pe26rY7t3mnN3XvIqSQmRIuBNvur18CAZoGrVNaGfXOIw9JIJ/n+zwhgRGYL9efGUzjbzREZGSPw/UInI1s259MxH9p2JQGb+yVhojhsDRZjeEa/0HKOFbWHIco0ypySgnHqW4MaJKggGs2yBhd6dUWlOhXTWGEDMN1AIlpvcUhj0vrxPYb+xeEo7i6suVNyzNLWFVWI8liGNLVlgmcj8CcUcrLo+V6johkV3Ep2108c7a+MYYS/poG3yc2vp//IhiCmX/q3VwtQPrJAWU3T5DkasTqbvmmQsBonoRI9jIegdkqxhxdpzCQZ1fC58IW8yURJUschjCLi7qysMCEzCmhTJQTmogWM3U5xDhaPzsQq8YjZIXoEnG2EVWqBmNFVEnKmqryqnGQU+kn1pyjjFCJIqr7briJA4XuHzB6+ymiUOhKFSnjMY1oAsl5Y53pnJs6l5SmCuhvxPlGBQnMOdXZl9eUF3qZrLgvmrMAvTCgKtQgixB/oZ7b7imGCOT4Sb+PzqlbBvUb+ogSgYe9T8RdiNTWRWq7pkgnLRp1DiXRydAk6r5Sov5HStQ1qF9gwnugTzER90uftd+Pqke0xvxu6/hednXiqtLZWvVcFDaqkAnJ8VP5bNFMa4VNuJZUDZKwqhEQmGU4KI2qSseh8F6Jq6ZXFIsr1xLxduZZ29/xfRl6qtWO++vbeLsifCN2ZpAHcQ+Cx+lbcp8aqH7SAD7kBKrOt2mJB9FUHgYbggU2th/ZQ8n38qE2wOAxKqj/yLnoBbU88I1ssHDlXxEeTMzMW2e84idb0IRv2ctfVynO0b1kmV7yj+kl50Om3LekOEGYbe62C1utZLFpVpT6mxorn+99THhGS0d6kh0bATyPIcuOGbsdRBzYSYuOfeIaIee2hJzrHgqsub69pSz8z7jWmasCC1pWvC1YwaEymWWuJvBR5xkxa1zAJSYSyJwmGSVyjJ1o2N3P2j4qa3NyHwxrv2esbTMPD4b1tG+sh6tr1/aria8vtMFwle06/aNtbk0OhrbXP9rmLttgaJuZpGUVe1za5suOwdA2M8mH0zZ3wQZD28wkH07bXNeIMYshym3RW8GWrgz4xVkNrr7bpbZBWt6FQoKjRLpNACucJnHiAJJTdWKJw7DYBGrzpr4x1IE/gKF+a2ou3z23xR+HWr4D8xmx2AMerhNcsOuClqV+iwvsQ4WE4xgu+JqkuQyIK5yiIlX00xv6pwgijNUWqeV146rXfIdgtX2HcLD05ZhT8xB2u1pmhY52u0Sx+QCnfG/VfMUEzv8C</diagram></mxfile>

docs/fig/modes.drawio (new file) | 1 +
@@ -0,0 +1 @@
+<mxfile modified="2019-12-15T17:12:53.280Z" host="" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/11.1.1 Chrome/76.0.3809.88 Electron/6.0.0 Safari/537.36" etag="wUOw0lOGHGc77jKsLJXB" version="11.1.1" type="device"><diagram id="HbwVHqc1XiIdVxvrsfh9" name="Page-1">7Vxbc5s8EP01fuw3NiCwH/sl6WUmnbaT6fTy0lFBMbQYMUKu7f76CiMZg4ghdqyLm5cMWi8Yds/Rrg6KR+7VYv2awDx+hyOUjpxxtB651yPHmXk++1saNpUhcILKMCdJVJkmteEu+YO4ccytyyRCRcORYpzSJG8aQ5xlKKQNGyQEr5pu9zhtfmsO50gy3IUwla2fk4jGlXUKxrX9DUrmsfjmyZh/soDCmRuKGEZ4tWdyb0buFcGYVkeL9RVKy9iJuFTnvXrg092NEZTRISf8oquP928/pd+SYv0eBS++f3u3eSEuU9CNeGIUsQDwISY0xnOcwfSmtv5P8DKLUHnZMRvVPrcY58w4YcafiNINzyZcUsxMMV2k/FO0TuiXveOv5aX+cwAfXq/5pbeDjRhklGy+7A+q04AY1qdtR+K86gHLp2oErsBLEqJDoQEcbpDMET3kGMipmOwSzIiB8AKxO2IuBKWQJr+bdwI5ROc7P37qS0LgZs8hx0lGi70rfygNzIGTbUcazrVJ0EJEj/+s4c4OqhsQo70nqU1bkD0GcI45gLMUb24H9fXgzWnhbdqDt7Y/UAE49xlwp05wswsBnJjwzgs4zxzABZYiznVNQVy7pM4eV1KdQAXieFh/w3TJ4/A2y5flbX1IcpQmGZIg2QTcKk4ousvhNnsr1sg3wcUvjwhF6we7zwfyIxYCrTj6fLyqm+pd6xzvNdTe+OGM7kX1iKD55tD0OJY6fTQtGM/oy3IRxAxhCosiCYX5VZI2s3sEm4OhbPYNYbPrHmp4e90nSsgcSGS+RZBkiGjn8I6MYi0vc9hXSuGp7RTWzeCZdQw+3NH1+rc6xjNReGYNhUU/o43DYop95vCRHBZqgL0cbskAZnBYhNUGDru6OWy9phfoJrFrHYndgwvd/l5ahdgswmoBiR2gm8QGyVZmqVbuYJ0UPD05T8upLAxJSS5imJeH4SZNWC5JP/p/VEm//bEzwPDXfAuF90u6VZuejib9spGnlCW+FNHrpMghDWMDJhUX9IcrUBouWcz4vP3ewjIcSpF1NEcWgEufrhstV4bLbDb6rQNTupTXAbO85wyc5T1P55QubtNkrR+0er0OnVCt1u9Z/w5Yt9bveQPZAUx5cxcETRD2aP1tdyVav4iqgesTf2qW1u/9W/XuDAz2rWOw31oi92j9bX8lOqEnLwhMpbB2rd8Lnjl8GoentnO4T+vXw+GpPRzWrfV7M9s5rFvrF0qCRSRu9cZ9Wr/US6vQ+kVYLSCxdq0fGPRfH2Zp/WCwCuQ8PTlPy6ksDElJNltj9fzeJadSrR/Ibw8N0vqDATtq1SrSsphhp9YvRVa31u//W9rH+bV+MFQLqVYG2qZ0WVswXuvvWqCoFfvBxasMRzc3g+WDM+wyOi2n8grdsqLSbm66eKK0u/Gt//8X04rK0M30lWSsjUry5nTzi0qHcq22qAj2Xi5bji0q/tD95wJ4xhQVcecXVFQ6eKK2qFx876W4qPhD9ahKUtRGJVl8Mr6odKmwiouK9duSzlZUhm43EsAzp6jIKpjlRaWLJ2qLyoDF36MmkwgW8da3HNyzefsKp5jU8/kTBLH1MxBd29adjhieL4jBgGbHtCB6QF0U2bD+KbbqnWT9e3buzV8=</diagram></mxfile>
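The LaTeX changes below include fig/input_pipeline.pdf and fig/modes.pdf, while only the .drawio sources are committed and *.pdf is now git-ignored, so the PDF figures are presumably exported locally. One possible export step, assuming the draw.io desktop CLI is installed (the commit itself does not show how the PDFs are produced):

```sh
drawio --export --format pdf --output docs/fig/input_pipeline.pdf docs/fig/input_pipeline.drawio
drawio --export --format pdf --output docs/fig/modes.pdf docs/fig/modes.drawio
```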
@@ -1,6 +1,7 @@
 \documentclass{article}
 \usepackage[letterpaper, margin=1in]{geometry}
 \usepackage[colorlinks]{hyperref}
+\usepackage{graphicx}
 \usepackage{listings}
 \lstset{basicstyle=\ttfamily}
 
@@ -62,14 +63,21 @@ perform some proxy task. The resulting embedding matrix is the weight matrix of
 the first layer of the trained neural network.
 
 The text data, before being supplied to the neural network, has to pass several
-preprocessing stages. These stages, as implemented during in this project,
-form an \textit{input pipeline}, which is depicted in \autoref{fig:pipeline}.
-First, the pipeline node called \textit{Tokenizer} reads a character stream
-from a text file. This node is responsible for replacing all non-ASCII
-alphabetic characters in the stream with whitespace, normalizing the stream by
-setting all remaining alphabetic characters to lowercase, and finally splitting
-the stream into tokens (words) and passing the words one-by-one to the next
-pipeline stage.
+preprocessing stages. These stages, as implemented in this project, form an
+\textit{input pipeline}, which is depicted in \autoref{fig:pipeline}. First,
+the pipeline node called \textit{Tokenizer} reads a character stream from a
+text file. This node is responsible for replacing all non-ASCII alphabetic
+characters in the stream with whitespace, normalizing the stream by setting all
+remaining alphabetic characters to lowercase, and finally splitting the stream
+into tokens (words) and passing the words one-by-one to the next pipeline
+stage.
 
+\begin{figure}
+\centering
+\includegraphics[width=0.7\linewidth]{fig/input_pipeline.pdf}
+\caption{An Input Pipeline in the System}
+\label{fig:pipeline}
+\end{figure}
+
 The next pipeline stage is filtering, for which the \textit{Filter} node is
 responsible. When computing word embeddings using the CBOW model, only those
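The Tokenizer stage described in the paragraph above is easy to prototype. The following is a minimal illustrative sketch in Python, not the project's implementation; it assumes that replacing "non-ASCII alphabetic characters" amounts to replacing every character outside [A-Za-z]:

```python
import re
from typing import Iterable, Iterator

def tokenize(chars: Iterable[str]) -> Iterator[str]:
    # Sketch of the Tokenizer: replace every character outside [A-Za-z]
    # with whitespace, lowercase the rest, and emit tokens one by one.
    text = re.sub(r"[^A-Za-z]+", " ", "".join(chars)).lower()
    yield from text.split()

# list(tokenize("Call me Ishmael. Some years ago..."))
# -> ['call', 'me', 'ishmael', 'some', 'years', 'ago']
```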
@@ -106,10 +114,18 @@ In the presented system, there is one central node, called the
 \textit{Dispatcher}, that is responsible for storing the model weights,
 distributing the weights to the \textit{Learner} nodes, which perform the
 actual training, and collecting the weights at the end of a training round and
-computing their average. The system allows for each \textit{Learner} to have
-its own input pipeline, or for one single input pipeline to be shared among all
-Learners, or for some intermediate configuration. However, it is not currently
-possible for one Learner to access more than one input pipeline.
+computing their average. \autoref{fig:modes} demonstrates that the system
+allows for each \textit{Learner} to have its own input pipeline, or for one
+single input pipeline to be shared among all Learners, or for some intermediate
+configuration. However, it is not currently possible for one Learner to access
+more than one input pipeline.
 
+\begin{figure}[h]
+\centering
+\includegraphics[width=\linewidth]{fig/modes.pdf}
+\caption{Two Configurable Modes of System Operation}
+\label{fig:modes}
+\end{figure}
+
 \section{Implementation Details}
 
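The Dispatcher/Learner round described in the hunk above maps naturally onto standard MPI collectives. The sketch below is an illustration only, using mpi4py and invented names (local_train, the weight shape, the number of rounds); the project's actual MPICH-based implementation may structure the communication differently:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_learners = comm.Get_size() - 1          # rank 0 acts as the Dispatcher (run with >= 2 ranks)

def local_train(w):
    # Placeholder for one round of CBOW training on this Learner's input pipeline.
    return w + 0.01 * np.random.randn(*w.shape)

weights = np.zeros((1000, 100))           # Dispatcher's copy of the model weights

for _ in range(10):                       # training rounds
    weights = comm.bcast(weights, root=0)                # distribute weights to all Learners
    update = local_train(weights) if rank > 0 else np.zeros_like(weights)
    total = comm.reduce(update, op=MPI.SUM, root=0)      # collect weights at the Dispatcher
    if rank == 0:
        weights = total / n_learners                     # average the Learners' weights
```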
@@ -154,8 +170,8 @@ To run this system, you will need the following software:
 
 \item \textit{MPICH} 3;
 
-\item \textit{Python} 3.6 with headers and libraries (e.g.\@ on Ubuntu you
-need to install \verb|python3-dev|);
+\item \textit{Python} $\geq3.6$ with headers and libraries (e.g.\@ on Ubuntu
+you need to install \verb|python3-dev|);
 
 \item \textit{Meson}, \textit{Cython} and \textit{ninja} for building;
 
@@ -383,14 +399,14 @@ Learner nodes would essentially result in sequential simulation of the parallel
 processing, thus yielding no improvements in processing time.
 
 The evaluations were performed on two datasets. The first one being the book
-``Moby Dick'' by Herman Melville (~200k words), obtained from the Project
-Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit. The
-vocabulary used for this dataset are all words from the book excluding English
-stop words, as defined by NLTK. The test part for this dataset were a 1000
-randomly selected context windows from the book.
+``Moby Dick'' by Herman Melville (approx.\@ 200k words), obtained from the
+Project Gutenberg~\cite{gutenberg}, using the API provided by the NLTK toolkit.
+The vocabulary used for this dataset are all words from the book excluding
+English stop words, as defined by NLTK. The test part for this dataset were a
+1000 randomly selected context windows from the book.
 
 Another dataset was a part of a recent English Wikipedia dump~\cite{wikidump}
-(~90M words), which was transformed into plain text using the
+(approx.\@ 90M words), which was transformed into plain text using the
 WikiExtractor~\cite{wikiextractor} tool. For this dataset the vocabulary is the
 list of 10000 most frequently used English words, obtained
 from~\cite{10k-words}, again, excluding the stop words. As a test data, 5000
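The Moby Dick dataset described in this hunk can be reproduced roughly with NLTK's corpus API. This is an illustrative reconstruction rather than the actual evaluation script; the corpus file id, the context-window size, and the sampling details are assumptions:

```python
import random
import nltk
from nltk.corpus import gutenberg, stopwords

nltk.download("gutenberg")
nltk.download("stopwords")

# "Moby Dick" as a flat token list (roughly 200k words) via NLTK's Project Gutenberg corpus.
words = [w.lower() for w in gutenberg.words("melville-moby_dick.txt") if w.isalpha()]

# Vocabulary: all words from the book excluding English stop words.
vocab = set(words) - set(stopwords.words("english"))

# Test data: 1000 randomly selected context windows (window size assumed here).
window = 5
starts = random.sample(range(len(words) - window), 1000)
test_windows = [words[s:s + window] for s in starts]
```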
@@ -421,7 +437,7 @@ explained and probably has to do something with the particularities of the
 training algorithm and the training data. This result also validates the use of
 the number of context windows consumed by each Learner as a proxy for system
 performance, since scaling within the number of available cores results in an
-almost perfect correlation between the amount of consumed data and the wall
+almost perfect correlation between the amount of data per Learner and the wall
 time. Going from 2 to 4 Learners decreases the amount of data per Learner by
 another 1.7x, with the wall time remaining the same, demonstrating the core
 depletion on the laptop. Further increasing the number of learner nodes results