\documentclass[]{article}
\usepackage{graphicx}
\usepackage{float}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{xspace}
\usepackage[inline]{enumitem}
\usepackage{times}
\usepackage{tikz}
\usepackage{mathrsfs,mathabx}
\usepackage{algorithm}
\usepackage{url}
\usepackage{rotating}
\usepackage[noend]{algpseudocode}
\usepackage{fancyref}
\usepackage{minted}
\usetikzlibrary{arrows,automata}

%\newtheorem{theorem}{Theorem}[section]
%\newtheorem{algorithm}[theorem]{Algorithm}
%\newtheorem{lemma}[theorem]{Lemma}
%\newtheorem{note}[theorem]{Note}
%\newtheorem{proposition}[theorem]{Proposition}
%\newtheorem{corollary}[theorem]{Corollary}

\newenvironment{definition}[1][Definition]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}
\newenvironment{example}[1][Example]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}
\newenvironment{remark}[1][Remark]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}

\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead[L]{\small Projekt}
\fancyhead[R]{\small Thomas Bracht Laumann Jespersen}

%\usepackage[thmmarks]{ntheorem}
%\theoremstyle{nonumberplain}
%\theoremheaderfont{\normalfont\bfseries}
%\theorembodyfont{\normalfont}
%theoremsymbol{\ensuremath{\square}}
%theoremseparator{.}
%newtheorem{proof}{Proof}

\setdescription{
  font=\normalfont\bfseries,
  style=sameline,
  leftmargin=\parindent,
  itemsep=0pt,
  listparindent=\parindent
}

\frenchspacing

\title{Implementing Base Change without Space Change}
\author{Thomas Bracht Laumann Jespersen\\\footnotesize\url{ntl316@alumni.ku.dk}}
\date{}

\begin{document}
\maketitle

%\begin{center}
%  {\bf Group}\\[3mm]
%  \begin{tabular}{ll}
%    Thomas Bracht Laumann Jespersen & \texttt{ntl316}
%  \end{tabular}
%\end{center}

\newcommand{\eof}{\textsc{eof}\xspace}

% Your name, CPR number and Alumni e-mail
% Name of internal supervisor
% Whether the project is to be approved as restricted elective or elective
% The topic of the project and its contents
% ECTS points, scope and any methodological requirements if the project has experimental content
% Form of examination
% Timetable for the project
% Extent of supervision, if any
% Submission deadline

\begin{abstract}
  We present a C library implementation of the SOLE encoding introduced in \cite{patrascu10trits}. First we demonstrate how the encoding scheme works and provide a proof-of-concept implementation in Haskell for 32-bit blocks. Next we examine the requirements for an online encoding and decoding implementation using 128-bit blocks. We discuss implementation details, and conclude with a small benchmark comparison to a ``naive'' encoding scheme.
\end{abstract}

\section{Introduction}

The implementation presented in this report requires arithmetic operations on numbers represented with up to 256 bits. If performance were of no concern, one could simply use an arbitrary-precision library such as GNU GMP, but we decided against this for two reasons:
\begin{enumerate*}[label=(\alph*)]
\item 256-bit numbers are relatively small compared to the arbitrary-size numbers that libraries such as GNU GMP are designed to handle, and the added overhead might detract too much from the running times; and
\item by implementing our own operations we can in some cases take advantage of the structure of the operations and compute the results much faster.
\end{enumerate*}

\Fref{sec:basechange} goes through the SOLE algorithm in detail and discusses some practical implementation considerations.
\Fref[plain]{sec:impl} describes implementation details for working with 128-bit blocks (on 64-bit architectures).

\section{Changing Base}\label{sec:basechange}

At the heart of the algorithm is the method of changing base. Say we have $a = a_1\cdot B + a_2$, where $B$ is some base. We can re-express $a$ in a different base $B'$ such that $a = a_1'\cdot B'+a_2'$.

\subsection{Encoding an input stream}

We assume we are given an input stream of $n$ blocks of $b$ bits. Each block $x_i$ can be viewed as a ``character'' in an alphabet consisting of $2^b$ characters, i.e.\ $x_i\in[2^b]$. Let $B = 2^b$.
\begin{center}
  \begin{tabular}{l*{5}{c}l}
    Input stream: & $x_0$ & $x_1$ & $x_2$ & $x_3$ & $x_4$ & $\cdots$\\
  \end{tabular}
\end{center}
To indicate the end of the stream (whose position we might not know in advance), we introduce a new character \textsc{eof}, which gives us an alphabet of size $B+1$. Encoding \eof naively would require $b+1$ bits per block, doubling the encoding space to $2^{b+1}$ and resulting in a huge loss of entropy (we would waste nearly half the possible encoding space). Instead the following method was invented in~\cite{patrascu10trits}, which we outline here in a little more detail.

\subsection{Overview of the algorithm}

Conceptually the encoding consists of two phases, each making a pass over the input, and we present them separately here. However, in the implementation these phases are combined into one. In the following $B$ is our input alphabet size, such that $B = 2^b$.

The first pass groups the input blocks in pairs $(x_{2i}, x_{2i+1})$ for $i = 0,1,\dots,\lceil\frac{n}{2}\rceil$ and computes a number $x_i'$:
\begin{align*}
  x_i' = x_{2i}(B+1) + x_{2i+1} \in [(B+1)^2],
\end{align*}
which is decomposed into two new numbers $y_i$ and $z_i$ by dividing by $(B-3i)$. We set $z_i$ as the remainder and $y_i$ as the quotient of this operation. By virtue of division we know $z_i \in [B-3i]$. The quotient $y_i$ is constrained by the ratio between $(B+1)^2$ and $(B-3i)$, which is shown in~\cite{patrascu10trits} to be at most $B+3i+3$, so $y_i\in [B+3i+3]$. After this first phase the numbers $z_i$ and $y_i$ are arranged as follows, with alphabet sizes indicated:
\begin{center}
  \begin{tabular}{l|*{5}{c|}l}
    Alphabet sizes & $B$ & $B+3$ & $B-3$ & $B+6$ & $B-6$ & $\cdots$\\\hline
    Post-phase 1 blocks & \;$z_0$\; & $y_0$ & $z_1$ & $y_1$ & $z_2$ & $\cdots$
  \end{tabular}
\end{center}
Phase 2 regroups the blocks produced in phase 1, pairing each spill $y_{i-1}$ with the following remainder $z_i$ for $i\geq 1$. Notice that the very first block $z_0$ is already a number in $[B]$, so we do not process it further, but simply output it. The remaining blocks are considered in pairs $(y_0, z_1), (y_1, z_2),\dots$ The important observation from~\cite{patrascu10trits} is that for a given pair $(y_{i-1}, z_i)$ we have $y_{i-1}\in [B+3i]$ and $z_i\in[B-3i]$. This allows us to rewrite $(y_{i-1}, z_i)$ as a single number, and this number can be represented in $[B^2]$ since $(B-3i)(B+3i) < B^2$. A straightforward transformation (which encodes an enumeration) is:
\begin{align*}
  w_i = z_i(B+3i) + y_{i-1}.
\end{align*}
The resulting $w_i\in [B^2]$ can now easily be decomposed into two output blocks of $b$ bits by taking the upper $b$ bits of $w_i$ as the first block and the lower $b$ bits as the second.
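To make the two phases concrete, the following sketch shows a single step of each phase in C for a small block size (say $b=16$, so everything fits in 64-bit arithmetic), using plain division and modulo; the shift-based variants used for large blocks are discussed in \Fref{sec:impl}. The function and variable names are purely illustrative and are not taken from the library.
\begin{minted}{c}
#include <stdint.h>

/* Phase 1: combine the input pair (x0, x1), each over the alphabet [B+1]
 * (B+1 because a block may also be the EOF character), into x0*(B+1) + x1
 * and divide by B - 3i. The quotient is the "spill" y in [B+3i+3] and the
 * remainder is z in [B-3i]. */
static void phase1_step(uint64_t B, uint64_t i, uint64_t x0, uint64_t x1,
                        uint64_t *y, uint64_t *z)
{
    uint64_t w = x0 * (B + 1) + x1;   /* w in [(B+1)^2]       */
    *y = w / (B - 3 * i);             /* spill, in [B+3i+3]   */
    *z = w % (B - 3 * i);             /* remainder, in [B-3i] */
}

/* Phase 2: combine a spill y in [B+3i] with the following remainder z in
 * [B-3i] into a single number in [B^2], which is then emitted as two
 * ordinary b-bit output blocks (its upper and lower halves). */
static uint64_t phase2_step(uint64_t B, uint64_t i, uint64_t y, uint64_t z)
{
    return z * (B + 3 * i) + y;       /* w in [B^2] */
}
\end{minted}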
\subsection{Encoding \eof}

In the above we never mentioned how \eof was to be handled. Assume $n$ is odd, i.e.\ we receive an odd number of input blocks. Appending \eof gives an even number of blocks in phase 1, the last block being in $[B+3j]$ for some $j$. During phase 2 of the encoding, when we reach the last block we artificially insert a zero block at the position immediately preceding it. This zero block is in the range $[B-3j]$ and can be combined with the last block and output. For even $n$ we simply append a second \eof and repeat the above termination scheme.

\subsection{Decoding}

The decoding basically reverses the operations performed in passes 1 and 2. This section discusses some details of reversing the computation. The entire algorithm essentially consists of three injective mappings.

\paragraph{The composition $x_j(B+1) + x_{j+1}$}
This composition is very fast, since it breaks down to a shift by $b$ and two additions. The inverse operation is to divide a number $w = x_j(B+1) + x_{j+1}$ by $(B+1)$. Doing so naively is not desirable, since we know how $w$ was composed and wish to exploit this knowledge.

Consider $x_j, x_{j+1}\in [B]$. The sum of these numbers is in $[2B-1]$. In other words, the sum will at most overflow by one bit, a carry that is added to the $(b+1)$'th position. Since $x_j$ was shifted left by $b$, the ``upper'' part of $w$ is either the value $x_j+1$ or $x_j$ (depending on whether $x_j + x_{j+1}$ did or did not overflow). Compute $x' = w \gg b$ and $y' = w - x'B$. Due to the nature of the composition, $x'$ either equals $x_j$ or $x_j+1$. We can determine which by comparing $x'$ and $y'$. There are three cases to consider:
\begin{itemize}[noitemsep]
\item $x' > y'$. This indicates that $x_j + x_{j+1}$ overflowed and we find $x_j$ by subtracting one from $x'$.
\item $x' = y'$. This is only possible when $x_{j+1} = 0$. In that case we have already found $x_j$ and now know $x_{j+1} = 0$.
\item $x' < y'$. The addition could not have overflowed the block size, and thus $x_j = x'$.
\end{itemize}
In the first and last case above, once we have determined the value for $x_j$, we can easily find $x_{j+1} = w - x_j(B+1)$.

\paragraph{The decomposition of $w = x_j(B+1) + x_{j+1}$ by $B-3i$}
This step could potentially be very slow if one were to attempt traditional long division; \Fref{sec:impl} shows how it can instead be reduced to a shift followed by a small correction loop.

\paragraph{The composition $w = r(B+3i) + q$}
From one decomposition we have a quotient $q\in[B+3i+3]$, also called ``the spill'', and we combine it with the remainder from the next decomposition, $r\in[B-3(i+1)]$ (so that, re-indexing, the spill lies in $[B+3i]$ and the remainder in $[B-3i]$). The encoding step is fairly straightforward, in the same vein as the composition step from pass 1. The decoding is in theory a division by $(B+3i)$, but we will again exploit the structure of the composition of $w$ to obtain a faster division method.

\section{Implementation}\label{sec:impl}

In order to implement an \emph{online}, constant-space encoding and decoding, the two phases must be combined into one, since keeping them separate would require remembering the entire input (which is not necessarily feasible).

\subsection{Haskell implementation}

We wrote a proof-of-concept implementation in Haskell, using \texttt{Word32} values as input blocks. This was then extended to support arbitrary block sizes (using the built-in \texttt{Integer} type). The Haskell implementation gives a good overview of the encoding; in particular it allows us to reproduce the example from an early version of the paper, in which alphabets of $B\pm 4i$ were used (instead of $B\pm 3i$).

\subsection{C implementation}

The C implementation works with 64-bit numbers and unsurprisingly represents 128-bit blocks as two 64-bit numbers.
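As an illustration, such a block could be declared as a pair of 64-bit words. The type name \texttt{blk128\_t} and its field names below are placeholders for this sketch, not necessarily those used in the library.
\begin{minted}{c}
#include <stdint.h>

/* Illustrative representation of a 128-bit block as two 64-bit words;
 * the type and field names are placeholders, not the library's own. */
typedef struct {
    uint64_t lo; /* least significant 64 bits */
    uint64_t hi; /* most significant 64 bits  */
} blk128_t;
\end{minted}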
For intermediate computations we need to represent and work with numbers of up to 257 bits; a representation for these, somewhat misleadingly called \texttt{blk256\_t}, is implemented as well. Since the C89/90 standards do not define 64-bit integer types, this implementation adheres to the C99 standard.
\begin{minted}{c}
typedef struct __blk256_t {
  byte mx;
  uint64_t x[4];
} blk256_t;
\end{minted}

\subsection{Bit tricks}

%\begin{enumerate}
%  \item Division by a number which is almost a power of two.
%  \item Multiplication by $B\pm k$ for small $k$
%\end{enumerate}

For the first transformation a multiplication by $B+1$ is required. Given two blocks $x_1$ and $x_2$ we must compute the following:
\begin{align*}
  x_{12} = x_2(B+1) + x_1,
\end{align*}
which we rewrite to $x_2\cdot B + x_2 + x_1$. Since $B$ is a power of two, this transformation can be implemented as
\begin{align*}
  x_{12} = (x_2 \ll b) + x_2 + x_1,
\end{align*}
where ``$\ll$'' is bit-wise left-shift.

\begin{example}
  Let $b=5$, $B = 2^b = 32$, $x_1=5$ and $x_2 = 7$. We compute $7\cdot (B+1) + 5 = 236_{10} = 00111|01100_2$.
\end{example}

\begin{example}[Largest possible non-\eof]
  Let $b=5$, $B = 2^b = 32$, $x_1=31$ and $x_2 = 31$. We compute $31\cdot (B+1) + 31 = 1054_{10} = 1|00000|11110_2$.
\end{example}

\begin{example}[Double-\eof]
  Let $b=5$, $B = 2^b = 32$. Then $\eof = 32$. In the case when $n$ is even, the last two blocks encoded are $(x_1, x_2) = (\eof, \eof)$. We compute $32\cdot (B+1) + 32 = 1088_{10} = 1|00010|00000_2$.
\end{example}

The next transformation requires dividing a large number $y$ by $B-3i$. We need to find $q$ and $r$ such that $y = q(B-3i) + r$. Consider dividing $y$ by $B$. Call this number $q'$, which is easily computed as $y \gg b$, and we can find a corresponding $r'$ such that $y = q'(B-3i) + r'$. Since $B\geq B-3i$, the value of $q'$ can only be smaller than or equal to the value of $q$ we are seeking, i.e.\ $q'\leq q$. This implies that our value for $r'$ is potentially too large, and we can iteratively find the value for $r$ by subtracting $B-3i$ enough times:
\begin{itemize}[noitemsep]
\item $q' \leftarrow y \gg b$
\item $r' \leftarrow y - q'(B-3i)$
\item \textbf{while} $r' \geq (B-3i)$
  \begin{itemize}[noitemsep,nolistsep]
  \item $r' \leftarrow r' - (B-3i)$
  \item $q' \leftarrow q' + 1$
  \end{itemize}
\item $q, r \leftarrow q', r'$
\end{itemize}
This effectively reduces the division to a bit-shift and a number of additions and subtractions. Furthermore, since $B-3i$ is a large number, the expected number of iterations is small.\footnote{In particular, when $i=0$ the loop is not executed at all.}

The above takes care of the operations required for pass 1. For pass 2 we have a quotient from one double-block and a remainder from the next double-block, respectively $q\in [B+3i]$ and $r\in[B-3i]$, and we need to compute:
\begin{align*}
  w = r(B+3i) + q \in [B^2].
\end{align*}
Again we can rewrite the multiplication as $rB + r\cdot 3i$, and reuse the fact that $rB$ can be implemented as $r\ll b$. This allows us to implement the computation of $w$ as $(r \ll b) + r\cdot 3i + q$. Finally, we have $w\in [B^2]$, which can be output directly.

\subsection{Reversing first pass}

For decoding we have to reverse the operations performed in each pass, in reverse order.

\paragraph{Reverse Pass 1}
Once the operations of pass 2 have been reversed, we have a stream of alternating quotients and remainders. For a given $i$ we take a quotient $q\in [B+3i+3]$ and a remainder $r\in [B-3i]$. These were obtained originally from blocks $x_{2i}$ and $x_{2i+1}$, by dividing the composition $x_{2i}(B+1) + x_{2i+1}$ by $B-3i$. To reverse this, we first compute $w = q(B-3i) + r$ and then divide $w$ by $(B+1)$. The calculation of $w$ is straightforward to implement as:
\begin{align*}
  w = (q \ll b) - q\cdot 3i + r.
\end{align*}
Division by $B+1$ would not be very efficient if implemented using the long division technique. Instead we can utilise our knowledge of the composition to quickly find the composite pair.
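The following is a small sketch of this reversal in the conceptual small-block setting of \Fref{sec:basechange}, where a composition fits in a single 64-bit word; the function and variable names are illustrative only, but the library applies the same comparison idea to its multi-word block representation.
\begin{minted}{c}
#include <stdint.h>

/* Reverse the composition w = x0*(B+1) + x1 without a general division,
 * following the case analysis from the Decoding subsection. Shown for a
 * small block size b <= 31, so that w fits in a uint64_t; the names are
 * illustrative and not taken from the library. */
static void reverse_b_plus_1(uint64_t w, unsigned b,
                             uint64_t *x0, uint64_t *x1)
{
    uint64_t B  = (uint64_t)1 << b;
    uint64_t xp = w >> b;        /* upper part: either x0 or x0 + 1 */
    uint64_t yp = w - xp * B;    /* lower part: (x0 + x1) mod B     */

    /* xp > yp means the sum x0 + x1 overflowed b bits, so xp = x0 + 1;
     * otherwise (xp <= yp) no overflow occurred and xp = x0.          */
    *x0 = (xp > yp) ? xp - 1 : xp;
    *x1 = w - *x0 * (B + 1);     /* recover x1 from the composition   */
}
\end{minted}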
\subsection{Large block sizes}

On standard hardware, a block size of $b = 32$ is easily handled with single-register arithmetic operations (using 64-bit registers). Larger block sizes allow greater input lengths (the alphabet size $B-3i$ must remain positive, which bounds the number of input blocks by roughly $2B/3$), so naturally we would be interested in using a relatively large block size, say 128 or 256 bits.

\subsection{API Design}

The design of the API draws heavily on that of the zlib library. zlib is a widely used compression library implementing the DEFLATE format, a combination of Lempel-Ziv (LZ77) compression and Huffman coding. In particular, the API requires the caller to set pointers to the input and output buffers and to indicate their sizes, which allows control over how much memory is used for buffering.

\section{Comparing to other methods}

This section details the implementation of an example ``naive'' encoding scheme and attempts to compare it to the SOLE encoding scheme. First we describe a simple but fast encoding scheme, which sacrifices space for encoding speed. Our hypothesis is that even though the computation step of an encoding may be extremely fast, it is insignificant compared to the IO-bound operations, where wasted space may be a determining factor in overall speed. This is discussed further in \Fref{sec:point-comp}.

\subsection{A naive encoding scheme}

A straightforward encoding scheme simply copies every bit of the input stream to its output destination, following each bit with a one if it was the last bit of the stream and with a zero otherwise. This scheme is extremely simple and can be implemented somewhat efficiently. Under the assumption of 64-bit registers and arithmetic operations, the scheme is implemented in the following manner:
\begin{itemize}[noitemsep]
\item Grab the next 32 bits of data, call this $x$.
\item Obtain a 64-bit number $z = \texttt{zmort}(x)$.
\item If $x$ contains the last bits of the input stream, set the bit immediately following the final data bit of $z$ to one.
\item Output the 64-bit block $z$.
\end{itemize}
The \texttt{zmort()} function intersperses the bits of its operand with zeroes, effectively computing a Morton number with one of the two operands being zero. In our implementation this means all odd bits of the output are zero, except at the end of the input. This transformation can be computed in 15 bit-wise operations, so it is very fast.

\subsection{Points of comparison}\label{sec:point-comp}

We should be careful about what we mean by ``speed'' and be specific about the grounds on which we compare the naive encoding to the SOLE encoding.

\bibliography{refs}
\bibliographystyle{plain}
\end{document}