Writing Git in Rust

Git is an indispensable tool in every programmer's toolkit. Yet, most of us only know barely enough to make Git work. As part of the Codecrafters challenge, I took to writing an asynchronous implementation of the Git internals in Rust, partly because I wanted to do something with Rust and partly to understand how this godsend of a tool works.

In this post, I will talk about a few nice parts of the project, focusing mostly on Git's object storage. Do keep in mind that this is a very barebones implementation that leaves out a number of key features. At best, it implements a small part of Git.

Note that most of the implementation's details will be snipped. If you want to see the full implementation and try it out on your machine, check out the repository.

Structuring the Project

The challenge proceeded in steps, and each step was about implementing one of Git's so-called "plumbing commands". To make it easier to add a new command at every step, I split the project into three parts:

  • The internal implementations
  • Command Line Interface
  • Wrapper functions to be called by the CLI

Doing this lets us reduce fn main() to a single line:

mod cli;
mod clone;
mod commands;
mod objects;
mod packfile;
mod utils;

use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    cli::CLI::run().await
}
src/main.rs

The objects module holds the Git object types (blob, tree, and commit); we'll come back to these later. cli.rs defines the CLI, and commands.rs defines convenience functions that tie everything together.

Also note that the #[tokio::main] attribute is there because we are using Tokio, the most popular asynchronous runtime for Rust.

The CLI

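Writing CLIs in Rust is an absolute pleasure. Using a crate like structopt, you declare the structure of your CLI as a type, and the crate parses the arguments, validates them, and converts them to the appropriate types. (structopt is now in maintenance mode; the same derive-style API lives on in clap itself.)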

#[derive(Debug, StructOpt)]
#[structopt(name = "TGit", about = "HedonHermDev's implementation of Git")]
pub enum CLI {

    #[structopt(name = "init", about = "Initialize an empty git repository")]
    Init { git_dir: Option<PathBuf> },
    
    #[structopt(name = "cat-file", about = "Cat the contents of a git object")]
    CatFile {
        // Arguments take `help`; `about` is meant for the app and its subcommands.
        #[structopt(
            name = "pretty_print",
            short = "p",
            help = "Pretty print the contents"
        )]
        pretty_print: bool,
        #[structopt(name = "OBJECT SHA")]
        object_sha: String,
    },
    
   // -- snip --
}

impl CLI {
    pub async fn run() -> Result<()> {
        let args: Self = Self::from_args();

        match args {
            CLI::Init { git_dir } => commands::init(git_dir).await,
            
            CLI::CatFile {
                pretty_print,
                object_sha,
            } => commands::cat_file(pretty_print, object_sha).await,

            // -- snip --
        }
    }
}

The above code should be fairly readable even if you don't know Rust. It defines the CLI as a set of subcommands, with each enum variant carrying that subcommand's arguments as fields. This leverages the power of Rust's enums: the match in run() hands every variant to its own handler. We can now write the functions that execute these subcommands in the commands.rs file. For the init and cat-file commands, the implementation looks something like this:

pub async fn init(git_dir: Option<PathBuf>) -> Result<()> {

// -- snip --

}

pub async fn cat_file(pretty_print: bool, object_sha: String)
-> Result<()> {

// -- snip --

}
src/commands.rs

Similar functions are defined for the other commands.
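To make the enum-per-subcommand idea concrete without pulling in structopt, here is a minimal std-only sketch. The Command enum mirrors two variants of the CLI enum above; the parse function is a hypothetical, hand-rolled stand-in for the parser structopt generates, not the project's code.

```rust
use std::path::PathBuf;

/// Mirrors two variants of the post's `CLI` enum.
#[derive(Debug, PartialEq)]
enum Command {
    Init { git_dir: Option<PathBuf> },
    CatFile { pretty_print: bool, object_sha: String },
}

/// Hypothetical hand-written parser; structopt derives the real one.
fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["init"] => Ok(Command::Init { git_dir: None }),
        ["init", dir] => Ok(Command::Init {
            git_dir: Some(PathBuf::from(*dir)),
        }),
        // The three-element pattern must come first so that "-p" is
        // not mistaken for an object SHA.
        ["cat-file", "-p", sha] => Ok(Command::CatFile {
            pretty_print: true,
            object_sha: sha.to_string(),
        }),
        ["cat-file", sha] => Ok(Command::CatFile {
            pretty_print: false,
            object_sha: sha.to_string(),
        }),
        other => Err(format!("unrecognised arguments: {:?}", other)),
    }
}
```

Because every subcommand is a variant, the compiler checks that the dispatching match handles all of them, and each handler receives exactly the fields it needs.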

The Internals of Git Object Storage

To understand the implementation further, you will need a rough idea of how Git stores its data, so here is a brief introduction. Everything Git knows about a repository is stored inside a directory called the gitdir, usually named .git.

Example contents of the .git directory
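As a rough illustration of that layout, the sketch below creates the handful of entries a fresh repository needs: an objects directory, a refs directory, and a HEAD file. The function name init_git_dir and the branch name main are assumptions made for this example; the project's own init command is not shown in this post.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create a minimal gitdir: objects/, refs/heads/ and a HEAD file.
/// Hypothetical helper; the branch name "main" is an assumption.
fn init_git_dir(git_dir: &Path) -> io::Result<()> {
    // Loose objects (blobs, trees, commits) live under objects/.
    fs::create_dir_all(git_dir.join("objects"))?;
    // Branches and tags live under refs/.
    fs::create_dir_all(git_dir.join("refs").join("heads"))?;
    // HEAD names the branch that is currently checked out.
    fs::write(git_dir.join("HEAD"), "ref: refs/heads/main\n")?;
    Ok(())
}
```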

Think of Git as a database that stores its data in the form of objects. This post deals with three types of Git objects: Blob, Tree, and Commit. (Git has a fourth type, the annotated tag, which is not covered here.)

  • Blob: the object type used to store the contents of each file in a repository.
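  • Tree: the object type used to store the directory structure of a repository. Each entry maps a name to a blob or to another tree.
  • Commit: the human-readable object type used to store a snapshot of a tree, together with metadata such as the author, the message, and the parent commits.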

Each object is stored as a file. Most of the data stored by Git is compressed to save space. For compression, Git uses zlib. The format of a blob object is:

blob <blob_length>\0<blob_content>

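Here, <blob_length> is the size of the file's contents in bytes, written as a decimal number, and <blob_content> is the contents themselves. Git zlib-compresses this whole byte string, header included, before writing it to disk. The object is named after the SHA-1 hash of the uncompressed bytes: the first two hex characters of the hash become a directory under .git/objects and the remaining 38 become the file name. Note that to save further space, Git periodically packs the object files into "packfiles", which have their own format.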
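To see how an object ID comes about, here is a std-only sketch that builds the bytes Git hashes and computes their SHA-1. In stock Git the ID is the SHA-1 of the uncompressed header plus content; compression only happens when the object is written to disk. The names object_payload, sha1, and object_id are made up for this example, and the hand-rolled SHA-1 is for illustration only: a real project would normally use a crate for it.

```rust
/// Bytes Git hashes for an object: "<type> <size>\0<content>", uncompressed.
fn object_payload(kind: &str, content: &[u8]) -> Vec<u8> {
    let mut payload = format!("{} {}\0", kind, content.len()).into_bytes();
    payload.extend_from_slice(content);
    payload
}

/// Minimal SHA-1, written out only to keep the example dependency-free.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];

    // Pad: a 0x80 byte, zeros up to 56 mod 64, then the bit length (big-endian).
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for chunk in msg.chunks(64) {
        // Expand the 16 input words of this block into 80 schedule words.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }

        let [mut a, mut b, mut c, mut d, mut e] = h;
        for i in 0..80 {
            let (f, k): (u32, u32) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let temp = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = temp;
        }

        // Fold this block's result into the running state.
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Hex object ID, computed over the uncompressed payload.
fn object_id(kind: &str, content: &[u8]) -> String {
    sha1(&object_payload(kind, content))
        .iter()
        .map(|byte| format!("{:02x}", byte))
        .collect()
}
```

With these definitions, object_id("blob", b"hello world\n") gives 3b18e512dba79e4c8300dd08aeb37f8e728b8dad, the same ID that git hash-object reports for a file containing that single line.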

The Implementation

Since all three types of objects share behaviour that differs only in its implementation, we can define a trait Object that declares these methods.

#[async_trait]
pub trait Object {
    async fn from_object_sha(object_sha: String) -> Result<Self>
    where
        Self: Sized;

    fn sha1_hash(&self) -> [u8; 20];

    fn write_data(&self) -> &Vec<u8>;

    async fn write(&self) -> Result<PathBuf> {
        let mut path = PathBuf::from(".git/objects");

        let blob_hex = hex::encode(self.sha1_hash());
        let (dirname, filename) = blob_hex.split_at(2);

        path.push(dirname);

        fs::create_dir_all(&path).await?;
        path.push(filename);

        let encoded_content = utils::zlib_compress(&self.write_data())?;

        fs::write(&path, encoded_content).await?;

        Ok(path)
    }

    fn encoded_hash(&self) -> String {
        hex::encode(&self.sha1_hash())
    }
}
src/objects/object.rs

The from_object_sha, sha1_hash, and write_data functions have type-specific implementations, while write and encoded_hash have default implementations that are defined once in the trait and shared by every object type.
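The split between required and provided methods can be seen in isolation in the following synchronous, std-only sketch. FakeBlob and object_path are hypothetical names used only here, and the hex encoding is done by hand instead of with the hex crate.

```rust
use std::path::PathBuf;

/// Simplified, synchronous stand-in for the post's `Object` trait.
trait Object {
    // Required: every object type supplies its own hash.
    fn sha1_hash(&self) -> [u8; 20];

    // Provided: written once, inherited by every implementor.
    fn encoded_hash(&self) -> String {
        self.sha1_hash()
            .iter()
            .map(|byte| format!("{:02x}", byte))
            .collect()
    }

    // Provided: first two hex characters name the directory, the rest the file.
    fn object_path(&self) -> PathBuf {
        let hex = self.encoded_hash();
        let (dirname, filename) = hex.split_at(2);
        PathBuf::from(".git/objects").join(dirname).join(filename)
    }
}

/// Hypothetical type that only stores a precomputed hash.
struct FakeBlob {
    sha1_hash: [u8; 20],
}

impl Object for FakeBlob {
    fn sha1_hash(&self) -> [u8; 20] {
        self.sha1_hash
    }
}
```

FakeBlob implements a single method, yet it can call encoded_hash and object_path as well. The Blob, Tree, and Commit types in the project benefit from the trait's default write method in the same way.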

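The #[async_trait] macro comes from the async-trait crate. It was needed because, at the time of writing, Rust did not natively support asynchronous functions in traits. Rust 1.75 has since stabilized async fn in traits, although the crate remains useful when you need trait objects (dyn Object).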

Now that we have an Object trait, we can implement it for each of Blob, Tree, and Commit. Let's look at the impl blocks for the Blob type.

impl Blob {
    pub async fn new(file: PathBuf) -> Result<Self> {
    // -- snip --
    }
}

#[async_trait]
impl Object for Blob {
    async fn from_object_sha(object_sha: String) -> Result<Self> {
    // -- snip --
    }

    fn sha1_hash(&self) -> [u8; 20] {
        let mut hash: [u8; 20] = [0; 20];
        hash.copy_from_slice(&self.sha1_hash);

        hash
    }

    fn write_data(&self) -> &Vec<u8> {
        &self.write_data
    }
}



impl Display for Blob {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let out = String::from_utf8_lossy(&self.contents);

        f.write_fmt(format_args!("{}", out))
    }
}
src/blob.rs

Now that we have a Blob type, we can use it in commands.rs to implement the git cat-file and git hash-object commands like this:

pub async fn cat_file(pretty_print: bool, object_sha: String) -> Result<()> {
    let blob = Blob::from_object_sha(object_sha).await?;

    if pretty_print {
        print!("{}", blob);
    }

    Ok(())
}

pub async fn hash_object(file: PathBuf, write: bool) -> Result<()> {
    let blob = Blob::new(file).await?;

    if write {
        blob.write().await?;
    }
    print!("{}", blob.encoded_hash());

    Ok(())
}
src/commands.rs

The implementations of the Tree and Commit types follow similar lines.

Bonus: Better Error Handling

If you have written any Rust, you may have noticed that the above code uses Result<T> instead of Result<T, E>. This is because we have used a crate called anyhow, whose anyhow::Result<T> is an alias for Result<T, anyhow::Error>. Consider this function:

pub fn write_blob(self) {
    let blob_path = self.get_path();
    create_dir_all(blob_path.parent().unwrap()).unwrap();
    let mut file = File::create(blob_path).unwrap();
    file.write_all(&self.content).unwrap();
    file.flush().unwrap();        
}

There are a number of ways this can fail. However, we don't care much about the exact type of each error; we only want to know where the error occurred and why. The anyhow crate is a good fit for such a scenario. Here's the same function with anyhow::Result:

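use anyhow::Context; // adds .context() to Option and Result

pub fn write_blob(self) -> anyhow::Result<()> {
    let blob_path = self.get_path();
    // parent() returns an Option, so turn None into an error before using `?`
    let parent = blob_path.parent().context("blob path has no parent directory")?;
    create_dir_all(parent)?;
    let mut file = File::create(&blob_path)?;
    file.write_all(&self.content)?;
    file.flush()?;
    Ok(())
}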

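As you can see, we are able to use the handy little ? operator throughout. anyhow::Error can be built from any error type that implements std::error::Error, so calls that fail with different error types can all share one return type.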
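The conversion that ? performs can be tried out with nothing but the standard library. The sketch below uses Box<dyn Error> as a std-only stand-in for anyhow::Error (it lacks anyhow's context and backtrace support), and header_len is a hypothetical helper that parses the length out of an object header such as "blob 11".

```rust
use std::error::Error;

/// Parse the length out of an object header such as b"blob 11".
/// Three different failure types flow through the same `?` operator.
fn header_len(header: &[u8]) -> Result<usize, Box<dyn Error>> {
    // Fails with std::str::Utf8Error on invalid UTF-8.
    let text = std::str::from_utf8(header)?;
    // An Option has no error of its own, so supply one with ok_or.
    let (_kind, len) = text
        .split_once(' ')
        .ok_or("header has no space separator")?;
    // Fails with std::num::ParseIntError on non-numeric input.
    let len = len.parse::<usize>()?;
    Ok(len)
}
```

Each ? converts the concrete error into the boxed error type through the From trait, which is the same mechanism anyhow relies on.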

Endnote

If you have reached here, you probably care enough about either Rust or Git. To build something similar, head over to Codecrafters and apply for early access. The challenges are framed such that a beginner without any experience in programming complex systems can easily follow along while also leaving all the implementation details to the user.

If you are a BITSian who likes Rust or likes to talk about systems in general, join us at #systems (read: #rust-circlejerk) on Silica :)