When you create a new Pijul repository, a .pijul folder
is created with two subfolders: “change” and “pristine”. Here we will
look into the structure of the change directory and change files.
* "change" appears to be organized much like Git's "objects" folder in that the
changes are referenced by a hash. A hash like
U6TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC is stored in the file:
`.pijul/change/U6/TQX5Z2NF6GX3SRLUBQGCZ7WAXNYMWWZ2YMADUSG4EWVKNV2BIAC.change`.
* "pristine" contains only... a "db" file?
We'll forget about the "pristine" folder for now and get back to that later. In
this write-up we'll focus on the structure of change files.
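The hash-to-path mapping described above (first two characters as a subdirectory, the remainder as the file name) can be sketched in C. This only mirrors the one example path shown earlier; `change_path` is a name made up for this write-up, not something taken from Pijul.

```c
#include <stdio.h>
#include <string.h>

/* Build the on-disk path for a change hash, following the example above:
 * the first two characters become a subdirectory, the rest the file name.
 * Returns 0 on success, -1 if the hash is too short or buf is too small. */
int change_path(const char *hash, char *buf, size_t buflen)
{
    if (strlen(hash) < 3)
        return -1;

    int written = snprintf(buf, buflen, ".pijul/change/%.2s/%s.change",
                           hash, hash + 2);
    if (written < 0 || (size_t)written >= buflen)
        return -1; /* output error or truncated */

    return 0;
}
```

Calling it with the hash from the list above and a 128-byte buffer should give back the same path as in the example.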
## Structure of a .change file
The overall format of a `.change` file is something like this:
┌────────────────┐
│ offsets │ fixed 56 bytes
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ hashed parts │ = zstd_compress(bincode(hashed))
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ unhashed parts │ = zstd_compress(bincode(unhashed | [])) - may be zero
├┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┤
│ contents │ = zstd_compress(contents)
└────────────────┘
The first part contains information about offsets into the rest of the file. The
remaining three segments: hashed, unhashed, and contents are all compressed with
a seekable variant of zstd. We will refer to this as _zstdseek_.
The offsets section also provides the uncompressed sizes of the three compressed
segments. This allows allocating sufficient buffer space for zstd to decompress
a segment into.
### Offsets
The offsets "header" section of a change file can be represented as the
following struct:
#include <stdint.h>

struct offsets {
    uint64_t version;
    uint64_t hashed_len;   /* length of the hashed contents */
    uint64_t unhashed_off; /* offset of the unhashed segment */
    uint64_t unhashed_len; /* length of unhashed contents */
    uint64_t contents_off; /* offset of the contents segment */
    uint64_t contents_len; /* length of the contents */
    uint64_t total;
};
With seven components each eight bytes in size we get: 56 bytes. But how are
they actually written to the file? The answer is: **bincode**.
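Setting the encoding aside for a moment, the header fields should be enough to locate each compressed segment. The sketch below assumes the segments sit back to back in the order shown in the diagram: hashed from byte 56 up to `unhashed_off`, unhashed up to `contents_off`, and contents up to `total`. I have not verified that assumption against every change file. `struct extent`, `enum segment` and `segment_extent` are names invented for this example.

```c
#include <stdint.h>

#define OFFSETS_SIZE 56u /* seven uint64_t fields */

struct offsets {
    uint64_t version;
    uint64_t hashed_len;
    uint64_t unhashed_off;
    uint64_t unhashed_len;
    uint64_t contents_off;
    uint64_t contents_len;
    uint64_t total;
};

/* Byte range [start, end) of a compressed segment within the file. */
struct extent {
    uint64_t start;
    uint64_t end;
};

enum segment { SEG_HASHED, SEG_UNHASHED, SEG_CONTENTS };

/* Assumption: segments are contiguous and appear in diagram order.
 * Returns 0 on success, -1 if the offsets do not fit that layout. */
int segment_extent(const struct offsets *o, enum segment seg,
                   struct extent *out)
{
    if (o->unhashed_off < OFFSETS_SIZE ||
        o->contents_off < o->unhashed_off ||
        o->total < o->contents_off)
        return -1;

    switch (seg) {
    case SEG_HASHED:
        out->start = OFFSETS_SIZE;
        out->end = o->unhashed_off;
        break;
    case SEG_UNHASHED:
        out->start = o->unhashed_off;
        out->end = o->contents_off; /* may be empty */
        break;
    case SEG_CONTENTS:
        out->start = o->contents_off;
        out->end = o->total;
        break;
    default:
        return -1;
    }
    return 0;
}
```

The `*_len` fields are not used here: they describe the uncompressed sizes, so they are what you would hand to the decompressor as the output buffer size, while the extents above say how many compressed bytes to read.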
### Bincode
An interesting feature of Pijul is that everything that's serialized to disk is
serialized with bincode. What's
interesting about bincode is that it appears to be Rust-only.
The specification
is a [Markdown document in the source code
repo](https://github.com/bincode-org/bincode/blob/trunk/docs/spec.md).
Judging by the specification, bincode focuses on cutting as much fluff as
possible and just producing and consuming byte streams. This means a struct with
seven u64 members carries no information other than the values of those seven
members, one after the other.
The primary thing to keep in mind is that values must be encoded and decoded in
**little-endian byte order**.
All the segments of a change file are bincode-encoded _except_ the contents
section which is already just an array of bytes.
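Putting the two points together (seven u64 values back to back, each little-endian), decoding the offsets header could look roughly like the sketch below. It assumes fixed-width integers, which is what the fixed 56-byte size suggests; bincode can also be configured for variable-length integers, and this sketch does not handle that. `read_u64_le` and `decode_offsets` are helper names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct offsets {
    uint64_t version;
    uint64_t hashed_len;
    uint64_t unhashed_off;
    uint64_t unhashed_len;
    uint64_t contents_off;
    uint64_t contents_len;
    uint64_t total;
};

/* Read one little-endian u64 from p, independent of host byte order. */
static uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode the 56-byte header at the start of buf.
 * Assumes fixed-width (not varint) integer encoding.
 * Returns 0 on success, -1 if fewer than 56 bytes are available. */
int decode_offsets(const uint8_t *buf, size_t len, struct offsets *out)
{
    if (len < 56)
        return -1;

    out->version      = read_u64_le(buf + 0);
    out->hashed_len   = read_u64_le(buf + 8);
    out->unhashed_off = read_u64_le(buf + 16);
    out->unhashed_len = read_u64_le(buf + 24);
    out->contents_off = read_u64_le(buf + 32);
    out->contents_len = read_u64_le(buf + 40);
    out->total        = read_u64_le(buf + 48);
    return 0;
}
```

Reading byte by byte instead of casting the buffer to a struct pointer keeps the code correct on big-endian hosts and avoids alignment problems.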
### Hashed
The hashed segment is the most structured part of a change file. In a similar
vein to Git, it gathers everything that should be hashed. This includes:
* A "version" field - currently always the value `6` (u64)
* A change header struct (more on that later)
* A list of dependencies, just the hashes
* Another list of hashes - extra known "context" changes (assists with recovery
from deleted contexts)
* The hash of the contents
Notably the contents themselves are not in the hashed structure, only the hash
of the contents.
In C, the hashed structure looks something like this:
struct hashed {
    uint64_t version;             /* currently always 6 */
    struct change_header header;
    struct hash *dependencies;    /* Vec<Hash> */
    struct hash *extra_known;     /* Vec<Hash>: extra known "context" changes (recovery from deleted contexts) */
    uint8_t *metadata;            /* Vec<u8>: space for application-specific data */
    struct hunk *changes;         /* Vec<Hunk>, "Hunk" being a generic argument */
    struct hash contents_hash;    /* hash of the contents */
};
Computing the hash happens on the bincode-encoded form of the hashed struct,
something like this: `hash = blake3(bincode(hashed))`.
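Several fields of the hashed struct are variable-length, so the byte stream has to say how long each one is. Going by the bincode spec, a sequence is written as its length followed by its elements. Below is a minimal sketch for a byte vector such as `metadata`, assuming the length is a fixed-width little-endian u64 like the header fields. I have not checked this against an actual change file, and `struct byte_slice` and `read_byte_vec` are names invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* A view into the input buffer; no bytes are copied. */
struct byte_slice {
    const uint8_t *data;
    uint64_t len;
};

/* Read one little-endian u64 from p. */
static uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Parse a length-prefixed byte vector from the start of buf.
 * Assumption: the length prefix is a fixed-width little-endian u64.
 * Returns the number of bytes consumed, or 0 if the input is truncated. */
size_t read_byte_vec(const uint8_t *buf, size_t len, struct byte_slice *out)
{
    if (len < 8)
        return 0;

    uint64_t n = read_u64_le(buf);
    if (n > len - 8)
        return 0; /* declared length runs past the end of the buffer */

    out->data = buf + 8;
    out->len = n;
    return 8 + (size_t)n;
}
```

Returning the number of bytes consumed lets a caller walk through consecutive fields by advancing its cursor after each read.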
### Unhashed
I haven't looked into the unhashed segment yet, but supposedly it's for putting
stuff associated with a change that you don't want hashed.
### Contents
Just an array of bytes by the looks of it, I have yet to figure out what it
actually holds.