Are 15% of all commits fixes?

· 3min · Giovanni Carvalho

tl;dr no.

I recently started a new job, and one of the first things I like to do soon after getting access to the repo is to check various statistics about it.

Things such as how old the repo is, how many commits it has, who the top committers are, which files have been touched the most, whether it uses merge commits or a linear history, etc.

In particular, the proportion of "fix" commits is something I find very interesting: out of all commits, how many are fixes? I'll refer to this as the "fix-ratio" from now on, for simplicity.

Across multiple repositories and in different companies, most productionized software I've come across seems to hover around a 15% fix-ratio.

So I ran the command in the main repo:

λ numbat -e $(git log --oneline | grep -iw fix | wc -l)/$(git rev-list --count --all)
0.148675

There it is. The good old 15%.

This left me wondering: are repos in the wild also in that ballpark?

I went looking for a list of reasonably popular repos on GitHub, well aware that:

  • a) open-source and closed-source software (where I have observed the 15%) are different beasts;
  • b) not all popular repositories are applications. Many are simply mostly-text "awesome" lists;
  • c) any number of repositories that I can fetch in a reasonable amount of time will still be an unrepresentative sample.

I sourced the list from the top 200 repositories here.

Shrugging these caveats away, and 114GB later[1], I was ready to calculate the answer.

Setup

First, I dumped the number of "fix" and total commits from each repo into output.csv.

Note that the script below depends on GNU Parallel.

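#!/usr/bin/env bash
# file: count.sh
set -euo pipefail

function count_repo {
    repo=$1
    # commits reachable from HEAD whose one-line log contains the word "fix"
    num_fixes=$(git -C "$repo" log --oneline | grep -iw fix | wc -l)
    # commits reachable from any ref (--all), which can be more than HEAD alone
    num_commits=$(git -C "$repo" rev-list --count --all)
    echo "$repo,$num_fixes,$num_commits"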
}
# export function to use it with 'parallel' below
export -f count_repo

function main {
    # write csv header
    echo 'repo,num_fixes,num_commits'
    # for each repo, count fixes and total commits
    find . -maxdepth 1 -type d -execdir test -d {}/.git \; -print -prune |
        parallel --jobs 8 count_repo
}

main | tee output.csv

λ ./count.sh
repo,num_fixes,num_commits
[...]
./opencv,4338,36425
./next.js,5405,38047
./rust,28281,306211
./rails,11212,113647
./vscode,23092,145918
./tensorflow,18848,191525
./linux,198048,1369404
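
As a quick sanity check on the rows above, the per-repo fix-ratio can be computed straight from the CSV with awk. This is a minimal sketch and not part of the original script; the function name `fix_ratios` is made up for illustration, and it assumes the header and column order that count.sh writes.

```shell
# Hypothetical helper (not in count.sh): print "repo ratio" for each CSV row.
fix_ratios() {
    # NR > 1 skips the header line; $3 > 0 avoids dividing by zero.
    awk -F, 'NR > 1 && $3 > 0 { printf "%s %.3f\n", $1, $2 / $3 }' "$@"
}

# Three of the sample rows from the output above.
fix_ratios <<'EOF'
repo,num_fixes,num_commits
./rust,28281,306211
./vscode,23092,145918
./linux,198048,1369404
EOF
# prints:
# ./rust 0.092
# ./vscode 0.158
# ./linux 0.145
```

With no here-document, `fix_ratios output.csv` would read the real file instead.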

And then I computed the fix-ratio and analyzed the results with DuckDB. The SUMMARIZE command is very useful for these quick analyses.

# file: analyze.sql
.mode line

summarize
select num_fixes/num_commits as fix_ratio from 'output.csv';

λ duckdb < analyze.sql
    column_name = fix_ratio
    column_type = DOUBLE
            min = 0.0
            max = 0.35591287490021667
  approx_unique = 172
            avg = 0.10526396442532608
            std = 0.07332227658614995
            q25 = 0.05261715331694921
            q50 = 0.09326059309410577
            q75 = 0.14624197878941886
          count = 200
null_percentage = 0.00
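
If DuckDB isn't at hand, the mean and median can be approximated with standard tools. The following is a rough sketch, assuming output.csv has the header and column order shown earlier; `ratio_stats` is a hypothetical name. It skips rows with zero commits, and its exact median may differ slightly from the quantiles SUMMARIZE reports.

```shell
# Hypothetical helper: mean and median fix-ratio from a count.sh-style CSV.
ratio_stats() {
    # One ratio per line, fixed-point so that sort -n orders them correctly.
    awk -F, 'NR > 1 && $3 > 0 { printf "%.6f\n", $2 / $3 }' "$@" |
        LC_ALL=C sort -n |
        awk '{ v[NR] = $1; sum += $1 }
             END {
                 if (NR == 0) exit 1
                 # middle value, or the average of the two middle values
                 median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
                 printf "mean=%.4f median=%.4f\n", sum / NR, median
             }'
}

# Usage (file name as in the post):
#   ratio_stats output.csv
```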

Results

I didn't bother doing any kind of cleanup, so it's no wonder there is so much variation. But from this small sample of 200 open-source repos, the proportion of fix-commits looks closer to 10% than to 15%: the mean is about 10.5% and the median about 9.3%.

What does this mean?

I don't feel as validated as I did the other times I saw a ~15% ratio, but I think it's still a good guesstimate. If you're at a 10-20% fix-ratio, I imagine you probably sleep well and don't often wake up at 3AM to fix production. If the project you work on is a dumpster fire, I'm curious: what's your fix-ratio?

In general, should you care? Probably not.

Caveats

  • You need to use semantic commits somewhat, or at least include the word "fix" in the commit subject.
  • Merge-heavy repos probably don't follow this ratio very cleanly.
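
To make the first caveat concrete, here is a small sketch of which subjects the `grep -iw fix` filter actually counts. The commit subjects are invented for illustration. Because `-w` matches whole words only, `fixes`, `fixed`, and `hotfix` are not counted, while a revert of a fix is.

```shell
# Made-up commit subjects, one per line (not from any real repo).
subjects='fix: handle empty input
Fix typo in README
hotfix for login
fixes #123
fixed race condition
Revert "fix: handle empty input"
add dark mode'

# Same filter as in count.sh: -i ignores case, -w matches "fix" as a whole word.
printf '%s\n' "$subjects" | grep -iw fix
# prints:
# fix: handle empty input
# Fix typo in README
# Revert "fix: handle empty input"
```

So the measured fix-ratio depends on the team's commit-message habits as much as on how often things break.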

Takeaways

  1. Use GNU Parallel: it is a super handy tool.
  2. Use DuckDB, especially the SUMMARIZE command, for quick stats.
  3. Accept that ~10% of changes may require fixes.

[1] To be fair, 32.2GB of that is from nerd-fonts alone.