# Understanding Pijul file graphs and change files

Posted on <b><time id="post-date">2023-10-19</time></b>.

<p id="post-excerpt">
Pijul represents files internally as graphs, but how does it actually work?
</p>

This document is meant to give a broad, general understanding of
Pijul's representation of files in a repository, and how changes are
modeled. The goal is to develop the reader's intuition about the way
that Pijul internally represents files and keep track of their states.

NOTE: This is a work in progress, and will updated from time to time.

## Files and directories in a repo as graphs.

A Pijul repository tracks files and folders, and changes to those. The
way it does that is by modeling those objects as a graph.

<!--
[this bit should explain how the graph model works...]
-->

## Change files

Change files consist of three parts:

 1. bits that are hashed
 2. bits that aren't hashed
 3. raw byte contents (also hashed, its hash is included as a field in part 1)

Key insight: a change file is *not* a patch or diff in the sense that
git or other diff tools might produce. A change file contains contains
precise changes to an existing collection-of-files-as-a-graph (call it
"repo-graph" for short).

For example: a change file never contains deleted text (like deleted
lines that might appear in a unified diff). A change file may refer to
content introduced by other changes and indicate parts that should be
marked deleted in the repo-graph. In other words, a change file never
contains deleted text, only references to text to delete.

Conversely, a given change file in isolation may only contain
text/content that is being added to the existing repo-graph.

## Initial graph

The initial graph is empty so there's not really much to show, the
repository graph can essentially be thought of as containing just a
single root node:

<div style="text-align: center">

![Root node](/ani/root.png)

</div>

This node is not introduced by any changes, it exist only in the
internal Pijul graph structure to anchor new graph nodes. Running
`pijul debug` on an empty repository produces no graph output.

## Adding files

The only thing we can really do with an empty repository is to begin
adding files to it. To do that, we create a file, add some contents to
it, run `pijul add <file>`, then `pijul record -m <message>`.

Our first file graph looks like this:

<div style="text-align: center">

![File addition: add main.c](/ani/add-file.png)

</div>

The root change (also known as AAAAAAAAAAAAA) is a synthetic node not
introduced by any changes, all the other nodes are introduced by
changes. The first little surprise here is that we've only recorded
change, but two were created. The Pijul log command confirms this:

    $ pijul log
    Change YS4YZQKSU3ZUXKGRK65W52EKET2UIF4V6SOZMVOEFTMXFOXJEBHQC
    Author: ************ <****@*****>
    Date: Wed, 18 Oct 2023 04:17:49 +0000
    
        add main.c
    
    Change WLNTADNCUR2X6BU4AFUIVAPTX77CORJ3T4OEXW63XPZ4ACCJ7H2AC
    Author:
    Date: Wed, 18 Oct 2023 04:17:49 +0000

The author-less change looks like this:

    $ pijul change WLNTA
    message = ''
    timestamp = '2023-10-18T04:17:49.400906844Z'
    authors = []
    
    # Hunks
    
    1. Root add
      up 1.0, new 0:0

The C1 change is introduced by Pijul to "anchor" other changes. In
Pijul it's possible to join histories from different repositories and,
as I understand it, the separate, explicit root addition is what makes
that possible. The C2 change is the actual file addition:

    1. File addition: "main.c" in "" "UTF-8"
      up 2.1, new 0:30
    + #include <stdio.h>
    + 
    + int
    + main(int argc, char *argv[])
    + {
    + 	printf("hello world\n");
    + }

<!--
0: #E1BE6A (sand)
1: #40B0A6 (light cyan)
2: #DEC2BB (salmon)
3: #126526 (dark green)
4: #E95C9C (hot pink)
5: #3D7491 (slate blue/grey)
6: #1D7404 (forest green)
7: #ACB10B (light green)
8: #B5CD7A (washed green)
9: #E9C012 (yellow)
10: #4B49E5 (blue/purple)
11: #E128C7 (another pink)
-->


## Edits & Replacements

By far the most common operations after adding files is editing
them. Pijul has two flavours of file edits: edits and
replacements. The difference between the two is essentially that an
edit only introduces new lines where replacements additionally remove
lines.

### Simple edits

The edit is rendered like this with `pijul change`:

    1. Edit in "main.c":2 2.31 "UTF-8"
      up 2.51, new 0:41, down 2.51
    + 
    + void saymsg(char *msg) { printf(msg); }

So all we've done is add two lines. The internal graph representation
in Pijul ends up looking like this:

<div style="text-align: center">

![Simple edit adding new lines](/ani/edit-file.png)

</div>

From the C2 change, Pijul has split the node `C2 [32;115[` into two
nodes: `C2 [32;51[` and `C2 [51;115[`, and C3 introduces a new node
whose contents go between them. The range notation on a node label
refers to a slice of bytes from the contents section of a given
change.

The green arrows indicate file content nodes and the order between
them, and the red back arrows indicate the immediate predecessor. To
render a file from this file graph, Pijul would attempt to do a total
topological ordering of the content nodes, traverse them, grab the
bytes from respective change files and concatenate them together to
produce the final file.

In our example, the content nodes are now `C2 [32;51]`, `C2 [51;115[`,
and `C3 [0;41[`, and Pijul orders them by the outgoing green arrows:

<div style="text-align: center">

![Topologically sorted nodes](/ani/sorted-nodes.png)

</div>

If you look closely you'll note that the there's one edge labeled `C3`
instead of `[C3]`. This means the edge exists only to provide a total
ordering on the nodes. The current contents of our file are literally
the concatenation of the byte sequences indicated by the three nodes.

### Replacements

Next we'll try replacing some of the existing file content with new
content. The change itself is quite simple:

    1. Replacement in "main.c":8 2.31 "UTF-8"
    B:BD 2.87 -> 2.87:113/2
      up 2.87, new 0:26, down 2.113
    - 	printf("hello world\n");
    + 	saymsg("hello world\n");

Now our file graph looks like this:

<div style="text-align: center">

![Topologically sorted nodes](/ani/replacement-file.png)

</div>

Like C3, a new change is added with new contents, but there are some
new things to observe:

 - As with C3, the nodes of C2 have again been split. The `C2
 [51;115[` node has been split into three nodes with the ranges
 [51;87[, [87;113[, and [113;115[.

 - The middle node of range [87;113[ is the text that we have removed,
   and it is indicated in the graph by dashed edges introduced by C4.

 - The node with deleted content retains edges to other content
   nodes. Pijul considers a node "dead" if all its incoming (ie
   green) edges are marked deleted. It is still allowed for a dead
   node to have live edges to other nodes.
   
 - There are new dotted edges between nodes `C2 [51;87[` and `C2
   [113;115[`. These are pseudo-edges added by Pijul and are not
   explicitly introduced by any change. These edges exist to make it
   faster for Pijul to compute the live nodes without having to
   traverse all the dead nodes.

In general, Pijul "deletes" text by splitting a node into three: a
node to hold the deleted contents, and two nodes for the remaining
(alive) contents before and after. Pseudo-edges are added to make file
rendering faster by allowing it to skip dead nodes.

The total number of nodes that make up the contents of our file is now
five and are ordered as follows:

<div style="text-align: center">

![Topologically sorted nodes again](/ani/sorted-nodes-2.png)

</div>

## Adding another file

Files exist as independent subgraphs that only share a common root in
the repository structure, so if we add another file to the repository
it will not depend on any of the changes that introduced or modified
our first file, it will only depend on the first change that
introduced the root.

This it what it looks like:

<div style="text-align: center">

![Another file](/ani/add-another-file.png)

</div>

In the above graph, I have removed the subgraphs that grouped nodes
from a given change and instead shown the subgraphs that represents
each of the two files now tracked by Pijul. This display is much
closer to the result from `pijul debug` piped through dot(1).

Do note that a single change can introduce more than one file at a
time, and also modify many files at the same time. The presentation I
have chosen here is in a sense the "cleanest" in that we have
considered small incremental changes at a time.

One of the powers of the graph representation begins to come to light:
If we decided that we wanted to undo the change called C4 (the last
change on main.c) we could undo it without needing to consider the C5
change. Our last change (adding readme.txt) does not in any way depend
on C4, so clearly we should be able to modify the part of the file
graph that represents main.c without touching the part that represents
readme.txt. If, in our last change, we had also edited main.c, then we
could have ended up depending on C4 and undoing it would be more
cumbersome, but in the current file graph that is not the case.

## Conflicts

The file graph can quickly become complicated and challenging to read,
so for this section we will reset the file graph to something simpler:

_file graph here_

How, then, are conflicts modeled? _TBD_

Idea: reset to a simpler file graph, fork and make conflicting changes.

## References

 1. <https://pijul.org/manual/theory.html>
 2. <https://jneem.github.io/pijul/>
 3. <https://jneem.github.io/pseudo/>
 4. [colour palette](https://davidmathlogic.com/colorblind/#%23E1BE6A-%2340B0A6-%23dec2bb-%23126526-%23e95c9c-%233d7491-%231d7404-%23acb10b-%23b5cd7a-%23e9c012-%234b49e5-%23e128c7)