\documentclass[]{article}
\usepackage{graphicx}
\usepackage{float}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{xspace}
\usepackage[inline]{enumitem}
\usepackage{times}
\usepackage{tikz}
\usepackage{mathrsfs,mathabx}
\usepackage{algorithm}
\usepackage{url}
\usepackage{rotating}
\usepackage[noend]{algpseudocode}
\usepackage{fancyref}
\usepackage{minted}
\usetikzlibrary{arrows,automata}

%\newtheorem{theorem}{Theorem}[section]
%\newtheorem{algorithm}[theorem]{Algorithm}
%\newtheorem{lemma}[theorem]{Lemma}
%\newtheorem{note}[theorem]{Note}
%\newtheorem{proposition}[theorem]{Proposition}
%\newtheorem{corollary}[theorem]{Corollary}

\newenvironment{definition}[1][Definition]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}
\newenvironment{example}[1][Example]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}
\newenvironment{remark}[1][Remark]{\begin{trivlist}
\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}

\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead[L]{\small Projekt}
\fancyhead[R]{\small Thomas Bracht Laumann Jespersen}

%\usepackage[thmmarks]{ntheorem}
%\theoremstyle{nonumberplain}
%\theoremheaderfont{\normalfont\bfseries}
%\theorembodyfont{\normalfont}
%theoremsymbol{\ensuremath{\square}}
%theoremseparator{.}
%newtheorem{proof}{Proof}

\setdescription{
  font=\normalfont\bfseries,
  style=sameline,
  leftmargin=\parindent,
  itemsep=0pt,
  listparindent=\parindent
}

\frenchspacing

\title{Implementing Base Change without Space Change}
\author{Thomas Bracht Laumann Jespersen\\\footnotesize\url{ntl316@alumni.ku.dk}}
\date{}

\begin{document}
\maketitle

%\begin{center}
%  {\bf Group}\\[3mm]
%  \begin{tabular}{ll}
%    Thomas Bracht Laumann Jespersen & \texttt{ntl316}
%  \end{tabular}
%\end{center}

\newcommand{\eof}{\textsc{eof}\xspace}

% Your name, CPR number and Alumni e-mail
% Name of internal supervisor
% Whether the project is to be approved as restricted elective or elective
% The topic of the project and its contents
% ECTS points, scope and any methodological requirements if the project has experimental content
% Form of examination
% Timetable for the project
% Extent of supervision, if any
% Submission deadline

\begin{abstract}
  We present a C library implementation of the SOLE encoding introduced in \cite{patrascu10trits}. First we demonstrate how the encoding scheme works and provide a proof-of-concept implementation in Haskell for 32-bit blocks. Next we examine the requirements for an online encoding and decoding implementation using 128-bit blocks. We discuss implementation details, and conclude with a small benchmark comparison to a ``naive'' encoding scheme.
\end{abstract}

\section{Introduction}

The implementation presented in this report requires arithmetic operations on numbers represented with up to 256 bits. If performance were of no concern, one could simply use an arbitrary-precision library such as GNU GMP, but we decided against this for two reasons:
\begin{enumerate*}[label=(\alph*)]
\item 256-bit numbers are relatively small compared to the arbitrary-size numbers that libraries such as GNU GMP are designed to handle, and the added overhead might detract too much from the running times; and
\item by implementing our own operations we can in some cases take advantage of the structure of the operations and compute the results much faster.
\end{enumerate*}

\Fref{sec:basechange} goes through the SOLE algorithm in detail and discusses some practical implementation considerations.
\Fref[plain]{sec:impl} describes implementation details for working with 128-bit blocks (on 64-bit architectures).

\section{Changing Base}\label{sec:basechange}

At the heart of the algorithm is the method of changing base. Say we have $a = a_1\cdot B + a_2$, where $B$ is some base. We can re-express $a$ in a different base $B'$ such that $a = a_1'\cdot B'+a_2'$.

\subsection{Encoding an input stream}

We assume we are given an input stream of $n$ blocks of $b$ bits. Each block $x_i$ can be viewed as a ``character'' in an alphabet consisting of $2^b$ characters, i.e.\ $x_i\in[2^b]$. Let $B = 2^b$.
\begin{center}
  \begin{tabular}{l*{5}{c}l}
    Input stream: & $x_0$ & $x_1$ & $x_2$ & $x_3$ & $x_4$ & $\cdots$\\
  \end{tabular}
\end{center}
To indicate the end of the stream (whose position we might not know in advance), we introduce a new character \textsc{eof}, which gives us an alphabet of size $B+1$. Encoding \eof naively would require $b+1$ bits per block, doubling the encoding space to $2^{b+1}$ and resulting in a huge loss of entropy (we would waste nearly half the possible encoding space). Instead the following method was invented in~\cite{patrascu10trits}, which we outline here in a little more detail.

\subsection{Overview of the algorithm}

Conceptually the encoding consists of two phases, each making a pass over the input, and we present them separately here. However, in the implementation these phases are combined into one. In the following $B$ is our input alphabet size, such that $B = 2^b$.

The first pass groups the input blocks in pairs $(x_{2i}, x_{2i+1})$ for $i = 0,1,\dots,\lceil\frac{n}{2}\rceil$ and computes a number $x_i'$:
\begin{align*}
  x_i' = x_{2i}(B+1) + x_{2i+1} \in [(B+1)^2],
\end{align*}
which is decomposed into two new numbers $y_i$ and $z_i$ by dividing by $(B-3i)$. We set $z_i$ as the remainder and $y_i$ as the quotient of this operation. By virtue of division we know $z_i \in [B-3i]$. The quotient $y_i$ is constrained by the ratio between $(B+1)^2$ and $(B-3i)$, which is shown in~\cite{patrascu10trits} to be at most $B+3i+3$, so $y_i\in [B+3i+3]$. After this first phase the numbers $z_i$ and $y_i$ are arranged as follows, with alphabet sizes indicated:
\begin{center}
  \begin{tabular}{l|*{5}{c|}l}
    Alphabet sizes & $B$ & $B+3$ & $B-3$ & $B+6$ & $B-6$ & $\cdots$\\\hline
    Post-phase 1 blocks & \;$z_0$\; & $y_0$ & $z_1$ & $y_1$ & $z_2$ & $\cdots$
  \end{tabular}
\end{center}
Phase 2 regroups the blocks produced in phase 1, pairing each spill $y_{i-1}$ with the following remainder $z_i$ for $i\geq 1$. Notice that the very first block $z_0$ is already a number in $[B]$, so we do not process it further, but simply output it. The remaining blocks are considered in pairs $(y_0, z_1), (y_1, z_2),\dots$ The important observation from~\cite{patrascu10trits} is that for a given pair $(y_{i-1}, z_i)$ we have $y_{i-1}\in [B+3i]$ and $z_i\in[B-3i]$. This allows us to rewrite $(y_{i-1}, z_i)$ as a single number, and this number can be represented in $[B^2]$ since $(B-3i)(B+3i) < B^2$. A straightforward transformation (which encodes an enumeration) is:
\begin{align*}
  w_i = z_i(B+3i) + y_{i-1}.
\end{align*}
The resulting $w_i\in [B^2]$ can now easily be decomposed into two output blocks of $b$ bits by taking the upper $b$ bits of $w_i$ as the first block and the lower $b$ bits as the second.
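To make the two phases concrete, the following sketch shows a single step of each phase in C for a small block size (say $b=16$, so everything fits in 64-bit arithmetic), using plain division and modulo; the shift-based variants used for large blocks are discussed in \Fref{sec:impl}. The function and variable names are purely illustrative and are not taken from the library.
\begin{minted}{c}
#include <stdint.h>

/* Phase 1: combine the input pair (x0, x1), each over the alphabet [B+1]
 * (B+1 because a block may also be the EOF character), into x0*(B+1) + x1
 * and divide by B - 3i. The quotient is the "spill" y in [B+3i+3] and the
 * remainder is z in [B-3i]. */
static void phase1_step(uint64_t B, uint64_t i, uint64_t x0, uint64_t x1,
                        uint64_t *y, uint64_t *z)
{
    uint64_t w = x0 * (B + 1) + x1;   /* w in [(B+1)^2]       */
    *y = w / (B - 3 * i);             /* spill, in [B+3i+3]   */
    *z = w % (B - 3 * i);             /* remainder, in [B-3i] */
}

/* Phase 2: combine a spill y in [B+3i] with the following remainder z in
 * [B-3i] into a single number in [B^2], which is then emitted as two
 * ordinary b-bit output blocks (its upper and lower halves). */
static uint64_t phase2_step(uint64_t B, uint64_t i, uint64_t y, uint64_t z)
{
    return z * (B + 3 * i) + y;       /* w in [B^2] */
}
\end{minted}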
\subsection{Encoding \eof}

In the above we never mentioned how \eof was to be handled. Assume $n$ is odd, i.e.\ we receive an odd number of input blocks. Appending \eof gives an even number of blocks in phase 1, the last block being in $[B+3j]$ for some $j$. During phase 2 of the encoding, when we reach the last block we artificially insert a zero block at the position immediately preceding it. This zero block is in the range $[B-3j]$ and can be combined with the last block and output. For even $n$ we simply append a second \eof and repeat the above termination scheme.

\subsection{Decoding}

The decoding basically reverses the operations performed in passes 1 and 2. This section discusses some details of reversing the computation. The entire algorithm essentially consists of three injective mappings.

\paragraph{The composition $x_j(B+1) + x_{j+1}$}
This composition is very fast, since it breaks down to a shift by $b$ and two additions. The inverse operation is to divide a number $w = x_j(B+1) + x_{j+1}$ by $(B+1)$. Doing so naively is not desirable, since we know how $w$ was composed and wish to exploit this knowledge.

Consider $x_j, x_{j+1}\in [B]$. The sum of these numbers is in $[2B-1]$. In other words, the sum will at most overflow by one bit, a carry that is added to the $(b+1)$'th position. Since $x_j$ was shifted left by $b$, the ``upper'' part of $w$ is either the value $x_j+1$ or $x_j$ (depending on whether $x_j + x_{j+1}$ did or did not overflow). Compute $x' = w \gg b$ and $y' = w - x'B$. Due to the nature of the composition, $x'$ either equals $x_j$ or $x_j+1$. We can determine which by comparing $x'$ and $y'$. There are three cases to consider:
\begin{itemize}[noitemsep]
\item $x' > y'$. This indicates that $x_j + x_{j+1}$ overflowed and we find $x_j$ by subtracting one from $x'$.
\item $x' = y'$. This is only possible when $x_{j+1} = 0$. In that case we have already found $x_j$ and now know $x_{j+1} = 0$.
\item $x' < y'$. The addition could not have overflowed the block size, and thus $x_j = x'$.
\end{itemize}
In the first and last case above, once we have determined the value for $x_j$, we can easily find $x_{j+1} = w - x_j(B+1)$.

\paragraph{The decomposition of $w = x_j(B+1) + x_{j+1}$ by $B-3i$}
This step could potentially be very slow if one were to attempt traditional long division; \Fref{sec:impl} shows how it can instead be reduced to a shift followed by a small correction loop.

\paragraph{The composition $w = r(B+3i) + q$}
From one decomposition we have a quotient $q\in[B+3i+3]$, also called ``the spill'', and we combine it with the remainder from the next decomposition, $r\in[B-3(i+1)]$ (so that, re-indexing, the spill lies in $[B+3i]$ and the remainder in $[B-3i]$). The encoding step is fairly straightforward, in the same vein as the composition step from pass 1. The decoding is in theory a division by $(B+3i)$, but we will again exploit the structure of the composition of $w$ to obtain a faster division method.

\section{Implementation}\label{sec:impl}

In order to implement an \emph{online}, constant-space encoding and decoding, the two phases must be combined into one, since keeping them separate would require remembering the entire input (which is not necessarily feasible).

\subsection{Haskell implementation}

We wrote a proof-of-concept implementation in Haskell, using \texttt{Word32} values as input blocks. This was then extended to support arbitrary block sizes (using the built-in \texttt{Integer} type). The Haskell implementation gives a good overview of the encoding; in particular it allows us to reproduce the example from an early version of the paper, in which alphabets of $B\pm 4i$ were used (instead of $B\pm 3i$).

\subsection{C implementation}

The C implementation works with 64-bit numbers and unsurprisingly represents 128-bit blocks as two 64-bit numbers.
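As an illustration, such a block could be declared as a pair of 64-bit words. The type name \texttt{blk128\_t} and its field names below are placeholders for this sketch, not necessarily those used in the library.
\begin{minted}{c}
#include <stdint.h>

/* Illustrative representation of a 128-bit block as two 64-bit words;
 * the type and field names are placeholders, not the library's own. */
typedef struct {
    uint64_t lo; /* least significant 64 bits */
    uint64_t hi; /* most significant 64 bits  */
} blk128_t;
\end{minted}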
For intermediate computations we need to represent and work with numbers of up to 257 bits; a representation for these, somewhat misleadingly called \texttt{blk256\_t}, is implemented as well. Since the C89/90 standards do not define 64-bit integer types, this implementation adheres to the C99 standard.
\begin{minted}{c}
typedef struct __blk256_t {
  byte mx;
  uint64_t x[4];
} blk256_t;
\end{minted}

\subsection{Bit tricks}

%\begin{enumerate}
%  \item Division by a number which is almost a power of two.
%  \item Multiplication by $B\pm k$ for small $k$
%\end{enumerate}

For the first transformation a multiplication by $B+1$ is required. Given two blocks $x_1$ and $x_2$ we must compute the following:
\begin{align*}
  x_{12} = x_2(B+1) + x_1,
\end{align*}
which we rewrite to $x_2\cdot B + x_2 + x_1$. Since $B$ is a power of two, this transformation can be implemented as
\begin{align*}
  x_{12} = (x_2 \ll b) + x_2 + x_1,
\end{align*}
where ``$\ll$'' is bit-wise left-shift.

\begin{example}
  Let $b=5$, $B = 2^b = 32$, $x_1=5$ and $x_2 = 7$. We compute $7\cdot (B+1) + 5 = 236_{10} = 00111|01100_2$.
\end{example}

\begin{example}[Largest possible non-\eof]
  Let $b=5$, $B = 2^b = 32$, $x_1=31$ and $x_2 = 31$. We compute $31\cdot (B+1) + 31 = 1054_{10} = 1|00000|11110_2$.
\end{example}

\begin{example}[Double-\eof]
  Let $b=5$, $B = 2^b = 32$. Then $\eof = 32$. In the case when $n$ is even, the last two blocks encoded are $(x_1, x_2) = (\eof, \eof)$. We compute $32\cdot (B+1) + 32 = 1088_{10} = 1|00010|00000_2$.
\end{example}

The next transformation requires dividing a large number $y$ by $B-3i$. We need to find $q$ and $r$ such that $y = q(B-3i) + r$. Consider dividing $y$ by $B$. Call this number $q'$, which is easily computed as $y \gg b$, and we can find a corresponding $r'$ such that $y = q'(B-3i) + r'$. Since $B\geq B-3i$, the value of $q'$ can only be smaller than or equal to the value of $q$ we are seeking, i.e.\ $q'\leq q$. This implies that our value for $r'$ is potentially too large, and we can iteratively find the value for $r$ by subtracting $B-3i$ enough times:
\begin{itemize}[noitemsep]
\item $q' \leftarrow y \gg b$
\item $r' \leftarrow y - q'(B-3i)$
\item \textbf{while} $r' \geq (B-3i)$
  \begin{itemize}[noitemsep,nolistsep]
  \item $r' \leftarrow r' - (B-3i)$
  \item $q' \leftarrow q' + 1$
  \end{itemize}
\item $q, r \leftarrow q', r'$
\end{itemize}
This effectively reduces the division to a bit-shift and a number of additions and subtractions. Furthermore, since $B-3i$ is a large number, the expected number of iterations is small.\footnote{In particular, when $i=0$ the loop is not executed at all.}

The above takes care of the operations required for pass 1. For pass 2 we have a quotient from one double-block and a remainder from the next double-block, respectively $q\in [B+3i]$ and $r\in[B-3i]$, and we need to compute:
\begin{align*}
  w = r(B+3i) + q \in [B^2].
\end{align*}
Again we can rewrite the multiplication as $rB + r\cdot 3i$, and reuse the fact that $rB$ can be implemented as $r\ll b$. This allows us to implement the computation of $w$ as $(r \ll b) + r\cdot 3i + q$. Finally, we have $w\in [B^2]$, which can be output directly.

\subsection{Reversing first pass}

For decoding we have to reverse the operations performed in each pass, in reverse order.

\paragraph{Reverse Pass 1}
Once the operations of pass 2 have been reversed, we have a stream of alternating quotients and remainders. For a given $i$ we take a quotient $q\in [B+3i+3]$ and a remainder $r\in [B-3i]$. These were obtained originally from blocks $x_{2i}$ and $x_{2i+1}$, by dividing the composition $x_{2i}(B+1) + x_{2i+1}$ by $B-3i$. To reverse this, we first compute $w = q(B-3i) + r$ and then divide $w$ by $(B+1)$. The calculation of $w$ is straightforward to implement as:
\begin{align*}
  w = (q \ll b) - q\cdot 3i + r.
\end{align*}
Division by $B+1$ would not be very efficient if implemented using the long division technique. Instead we can utilise our knowledge of the composition to quickly find the composite pair.
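The following is a small sketch of this reversal in the conceptual small-block setting of \Fref{sec:basechange}, where a composition fits in a single 64-bit word; the function and variable names are illustrative only, but the library applies the same comparison idea to its multi-word block representation.
\begin{minted}{c}
#include <stdint.h>

/* Reverse the composition w = x0*(B+1) + x1 without a general division,
 * following the case analysis from the Decoding subsection. Shown for a
 * small block size b <= 31, so that w fits in a uint64_t; the names are
 * illustrative and not taken from the library. */
static void reverse_b_plus_1(uint64_t w, unsigned b,
                             uint64_t *x0, uint64_t *x1)
{
    uint64_t B  = (uint64_t)1 << b;
    uint64_t xp = w >> b;        /* upper part: either x0 or x0 + 1 */
    uint64_t yp = w - xp * B;    /* lower part: (x0 + x1) mod B     */

    /* xp > yp means the sum x0 + x1 overflowed b bits, so xp = x0 + 1;
     * otherwise (xp <= yp) no overflow occurred and xp = x0.          */
    *x0 = (xp > yp) ? xp - 1 : xp;
    *x1 = w - *x0 * (B + 1);     /* recover x1 from the composition   */
}
\end{minted}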
\subsection{Large block sizes}

On standard hardware, a block size of $b = 32$ is easily handled with single-register arithmetic operations (using 64-bit registers). Larger block sizes allow greater input lengths (the alphabet size $B-3i$ must remain positive, which bounds the number of input blocks by roughly $2B/3$), so naturally we would be interested in using a relatively large block size, say 128 or 256 bits.

\subsection{API Design}

The design of the API draws heavily on that of the zlib library. zlib is a widely used compression library implementing the DEFLATE format, a combination of Lempel-Ziv (LZ77) compression and Huffman coding. In particular, the API requires the caller to set pointers to the input and output buffers and to indicate their sizes, which allows control over how much memory is used for buffering.

\section{Comparing to other methods}

This section details the implementation of an example ``naive'' encoding scheme and attempts to compare it to the SOLE encoding scheme. First we describe a simple but fast encoding scheme, which sacrifices space for encoding speed. Our hypothesis is that even though the computation step of an encoding may be extremely fast, it is insignificant compared to the IO-bound operations, where wasted space may be a determining factor in overall speed. This is discussed further in \Fref{sec:point-comp}.

\subsection{A naive encoding scheme}

A straightforward encoding scheme simply copies every bit of the input stream to its output destination, following each bit with a one if it was the last bit of the stream and with a zero otherwise. This scheme is extremely simple and can be implemented somewhat efficiently. Under the assumption of 64-bit registers and arithmetic operations, the scheme is implemented in the following manner:
\begin{itemize}[noitemsep]
\item Grab the next 32 bits of data, call this $x$.
\item Obtain a 64-bit number $z = \texttt{zmort}(x)$.
\item If $x$ contains the last bits of the input stream, set the bit immediately following the final data bit of $z$ to one.
\item Output the 64-bit block $z$.
\end{itemize}
The \texttt{zmort()} function intersperses the bits of its operand with zeroes, effectively computing a Morton number with one of the two operands being zero. In our implementation this means all odd bits of the output are zero, except at the end of the input. This transformation can be computed in 15 bit-wise operations, so it is very fast.

\subsection{Points of comparison}\label{sec:point-comp}

We should be careful about what we mean by ``speed'' and be specific about the grounds on which we compare the naive encoding to the SOLE encoding.

\bibliography{refs}
\bibliographystyle{plain}
\end{document}