This section discusses some of the implementation details for a C99 compliant
implementation. The choice of C99 over ANSI C boils down to C99 adding
definitions for 64-bit numbers (\texttt{uint64\_t} and \texttt{ULL}), not present
in ANSI C.

\subsection{Proof of concept in Haskell}
We implemented a proof-of-concept implementation in Haskell, using
\texttt{Word32} as input blocks. Intermediate results were represented using the
builtin \texttt{Integer} type, which implements arbitrary precision number
representation.

This allowed the representation to stay very close in spirit to the conceptual
description and give an overview of the required operations.

A Haskell version supporting 128-bit blocks (or an arbitrarily chosen
block-size) was also easily obtained, by replacing uses of \texttt{Word32}
by \texttt{Integer} and requiring these always be positive.

A command-line interface is also available, in \texttt{cmd-sole.hs}, providing
both encoding and decoding capabilites and a point of comparison for our C
implementation.

\subsection{Implementation in C}
It is mentioned in the paper that passes 1 and 2 can easily be combined into one
pass, and order to implement an \emph{online}, constant-space encoding and
decoding, they must be combined.

In our implementation we decided on implementing support for 128-bit blocks,
since this is a common block size in cryptographic applications. A basic block
of $b=128$ bits is then represented as two 64-bit integers.

For intermediate computations, we need to be able to represent and work with
257-bit numbers. This is both necessary and sufficient, since we can (and will)
overflow 256 bits for some of the transformations, but none of our
transformations can overflow the space of 257 bits.

In addition, in order to implement division we require a representation of
negative numbers, so in total we arrived at the following definition for an
intermediate double-block:

\begin{ccode}
typedef struct __blk256_t {
    byte     mx;
    uint64_t x[4];
} blk256_t;
\end{ccode}
The \texttt{mx} field encodes both the sign of the number and the 257'th bit.

\subsection{Arithmetic operations}
Standard arithmetic operations addition, subtraction and multiplication are
implemented, both for 128- and 256-bit blocks.

The multiplication procedures were implemented according to the algorithms in
\cite{cryptobook}.

\subsubsection{Division}

By far the most challenging operation to implement of the arithmetic operations,
was division. It relies on moving from our integral representation into a
floating point representation and repeatedly reduce the division problem.

Thus we define a function \texttt{blk256\_as\_double()}, which takes a 256-bit
block and returns a \texttt{double} representation of its argument. We also
define an inverse function, \texttt{double\_as\_blk256()} transforming
a \texttt{double} to our integral representation. Analogous functions exist
for \texttt{blk128\_t}.

The division algorithm then proceeds as follows (argument $x$)

% \begin{itemize}[noitemsep]
% \item $q \leftarrow 0$
% \item $r \leftarrow x$
% \item $d_d \leftarrow \texttt{blk128\_as\_double}(d)$
% \item \textbf{while} $r > d \vee r < 0$
%       \begin{itemize}[noitemsep,nolistsep]
%       \item $r_d\leftarrow \texttt{blk256\_as\_double}(r)$
%       \item $z_d \leftarrow r_d / d_d$
%       \item $z \leftarrow \texttt{double\_as\_blk256}(z_d)$
%       \item $q \leftarrow q + z$
%       \item $r \leftarrow r - z*d$
%       \end{itemize}
% \item Output $q$ and $r$
% \end{itemize}

\begin{algorithm}
%\caption{Euclid's algorithm}\label{euclid}
\begin{algorithmic}
\Procedure{BLK256\_div\_BLK128}{$x,d$}
   \State $q \gets 0$
   \State $r \gets x$
   \State $d_d \leftarrow \texttt{blk128\_as\_double}(d)$
   \While{$r > d \vee r < 0$}
       \State $r_d\leftarrow \texttt{blk256\_as\_double}(r)$
       \State $z_d \leftarrow r_d / d_d$
       \State $z \leftarrow \texttt{double\_as\_blk256}(z_d)$
       \State $q \leftarrow q + z$
       \State $r \leftarrow r - z\cdot d$
   \EndWhile
   \State \textbf{return} $q$, $r$
\EndProcedure
\end{algorithmic}
\end{algorithm}


\subsection{API Design}
The design of the API draws heavily from that of the zlib
library. zlib is a widely used compression library implementing the
DEFLATE variant of the Lempel-Ziv algorithm.

In particular the design of the API requires setting pointers to the input and
output, and indicate their sizes, which allows for control over the size of
these buffers.

The exported \texttt{sole\_stream} struct has the following definition:

\begin{ccode}
typedef struct sole_stream_s {
    byte flags;
    uint offset;

    blk128_t *i3;           /* Current value of 3*i */
    blk128_t *b3i;          /* Current (B - 3i) */
    blk128_t *r;            /* Current remainder */
    blk256_t *q;            /* Current spill */
    blk128_t *b1;           /* First block  */
    blk128_t *b2;           /* Second block */

    size_t avail_in;        /* Number of bytes available */
    byte   *inp;            /* Next input bytes */
    size_t avail_out;       /* Remaining free space at outp */
    byte   *outp;           /* Pointer to next output slot */

    size_t total_in;        /* Total number of bytes read */
    size_t total_out;       /* Total number of bytes written */
} sole_stream;
\end{ccode}


\subsection{BigNum libraries}
Instead of implementing basic arithmetic operations on a, we could also have
opted for using a library such as GNU's GMP, which implements multi-precision
arithmetic, but we decided against this for the reasons that
\begin{enumerate*}[label=(\alph*)]
\item 256-bit numbers are relatively small when considering that
  libraries such as GNU GMP are written to handle and operate on numbers of
  arbitrary size, and the added overhead might detract too much from the running
  times
\item by implementing our own operations we can in some cases take
  advantage of the structure of the operations, and compute our results much
  faster; and
\item adding a library introduces a dependency that users are also required to
  have---this reduces portability in some cases.
\end{enumerate*}


% For the first transformation a multiplication by $B+1$ is required. Given two
% blocks $x_1$ and $x_2$ we must compute the following:
% \begin{align*}
% 	x_{12} = x_2(B+1) + x_1,
% \end{align*}
% which we rewrite to $x_2\cdot B +x_2 + x_1$. Since $B$ is a power of two, this
% transformation can be implemented as
% \begin{align*}
% 	x_{12} = x_2 \ll b + x_2 + x_1,
% \end{align*}
% where ``$\ll$'' is bit-wise left-shift.

% \begin{example}
% Let $b=5$, $B = 2^b = 32$, $x_1=5$ and $x_2 = 7$. We compute $7\cdot (B+1) + 5 =
% 236_{10} = 00111|01100_2$.
% \end{example}

% \begin{example}[Largest possible non-\eof]
% Let $b=5$, $B = 2^b = 32$, $x_1=31$ and $x_2 = 31$. We compute $31\cdot (B+1) +
% 31 = 1054_{10} = 1|00000|11110_2$.
% \end{example}

% \begin{example}[Double-\eof]
% Let $b=5$, $B = 2^b = 32$. Then $\eof = 32$. In the case when $n$ is even, the
% last two blocks encoded are $(x_1, x_2) = (\eof, \eof)$.  We compute $32\cdot (B+1) +
% 32 = 1088_{10} = 1|00010|00000_2$.
% \end{example}

% The next transformation requires dividing a large number $y$ by $B-3i$. We need
% to find $q$ and $r$ such that $y = q(B-3i) + r$.

% Consider dividing $y$ by $B$. Call this number $q'$ which is easily implemented
% as $y \gg b$, and we can find a corresponding $r'$ such that $y = q'(B-3i) +
% r'$. Since $B\geq B-3i$, the value of $q'$ can only be smaller than or equal to
% the value of $q$ we are seeking, i.e. $q'\leq q$. This implies our value for
% $r'$ is potentially too large, and we can iteratively find the value for $r$ by
% subtracting $B-3i$ enough times:
% \begin{itemize}[noitemsep]
% 	\item $q' \leftarrow y \gg b$
% 	\item $r' \leftarrow y - q'(B-3i)$
% 	\item \textbf{while} $r' \geq (B-3i)$
% 		\begin{itemize}[noitemsep,nolistsep]
% 			\item $r' \leftarrow r' - (B-3i)$
% 			\item $q' \leftarrow q' + 1$
% 		\end{itemize}
% 	\item $q, r \leftarrow q', r'$
% \end{itemize}
% This effectively reduces a division to bit-shifts are a number of additions and
% subtractions. Furthermore, since $B-3i$ is a large number, the expected number
% of iterations is small.\footnote{In particular, when $i=0$ the loop is not
% executed at all.}(NOTE Must be a way to analyse this.)

% The above takes care of the operations required for Pass 1. For pass 2 we have a
% quotient from one double-block and a remainder from the next double-block,
% respectively, $q\in [B+3i]$ and $r\in[B-3i]$, and we need to compute:
% \begin{align*}
% 	w = r(B+3i) + q \in [B^2].
% \end{align*}
% Again we can rewrite the multiplication as $rB + r\cdot 3i$, and reuse the fact
% that $rB$ can be implemented as $r\ll b$. This allows us to implement the
% computation of $w$ as $r \ll b + r\cdot3i + q$.

% Finally, we have $w\in [B^2]$ which can be output directly.

% \subsection{Reversing first pass}
% For decoding we have to reverse the operations performed in

% \paragraph{Reverse Pass 1}
% Once the input blocks have been reversed from pass 2, we have a stream of
% alternating quotients and remainders. We take, for a given $i$, quotient
% $q\in [B+3i+3]$ and remainder $r\in [B-3i]$. These were obtained originally
% from blocks $x_{2i}$ and $x_{2i+1}$, by dividing the composition
% $x_{2i}(B+1) +x_{2i+1}$ by $B-3i$.

% To reverse this, we first compute $w = q(B-3i) + r$ and then divide $w$ by
% $(B+1)$. The calculation of $w$ is straight-forward to implement as:
% \begin{align*}
% w = q \ll b - q\cdot 3i + r.
% \end{align*}

% Division by $B+1$ would not be very efficient if implemented using the long
% division technique. Instead we can utilise our knowledge of the composition to
% quickly find the composite pair.