Predicting code bug risk with git metadata

One of the perks of working at Civis is the quarterly ‘Hack Time’. For one week each quarter, you get to explore an offbeat idea of your choice and then present the results to your colleagues. This past quarter I spent my time exploring some off-label uses for the version control tool git. Git is widely used in the tech industry, but often in a fairly superficial manner.

That level of use is enough for basic version control tasks, but if you dig a bit deeper, git also has lots of bells and whistles that can expose valuable information about your coding practices.

In this post, I’ll describe a tool I built that uses these bells and whistles (more specifically, the metadata git collects and several git utilities that access it) to predict the likelihood that a change to some code will introduce a bug. The tool (called gitrisky) is open source and available here. Check it out! I also presented this at PyData NYC 2017, and you can see a recording of the talk here.

What metadata does git collect?

Git tracks changes to a codebase as a series of ‘commits’. Each commit contains a set of changes along with a unique identifier. When a new commit is recorded, git also records who made the commit, when it was recorded, and an explanatory note written by the user who made the commit.

This process produces two types of metadata: commit-level data and history-level data. Commit-level data includes all the data associated with a single commit, e.g. the explanatory note, author, timestamp, or the code changes included in that commit. History-level data is information that can be extracted from the sequence of commits but not from any single commit, for example, the changes in a particular file between two different points in the commit history. We need both types of metadata for our bug risk prediction tool. Fortunately, git makes it easy to get at this information.

How do you access git metadata?

Git comes with several extra command line utilities that expose different bits of metadata. For this project I used three of them: git log, git diff, and git blame.

git log

This command exposes all the metadata associated with a particular commit. For example, to see the metadata associated with the most recent commit we can use:

$ git log -1 --stat
commit d87039fad9dcfd846cd3ac65883cd5fde8d759b3
Author: someuser <[email protected]>
Date:   Thu Nov 9 09:54:19 2017 -0600

    BUG Fix mistyped parameter foo in bar.py

 bar.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

The output includes the unique identifier for the commit, the author, a timestamp, a custom message the author included explaining the changes, and a summary of the files and lines changed.
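To make this concrete, here is a minimal sketch of how that output could be split into fields in Python. It is an illustration rather than gitrisky's actual parsing code; the function name `parse_log_entry` and the example.com email address are made up for the example, and it relies on git's default format, where commit message lines are indented by four spaces.

```python
def parse_log_entry(text):
    """Split the text printed by `git log -1` into its metadata fields."""
    entry = {"commit": None, "author": None, "date": None}
    message_lines = []
    for line in text.splitlines():
        if line.startswith("commit "):
            # The second token is the hash; anything after it is decoration.
            entry["commit"] = line.split()[1]
        elif line.startswith("Author:"):
            entry["author"] = line[len("Author:"):].strip()
        elif line.startswith("Date:"):
            entry["date"] = line[len("Date:"):].strip()
        elif line.startswith("    "):
            # git indents commit message lines by four spaces.
            message_lines.append(line.strip())
    entry["message"] = "\n".join(message_lines)
    return entry


# Sample input modeled on the output above (the email address is a placeholder).
SAMPLE_LOG = (
    "commit d87039fad9dcfd846cd3ac65883cd5fde8d759b3\n"
    "Author: someuser <someuser@example.com>\n"
    "Date:   Thu Nov 9 09:54:19 2017 -0600\n"
    "\n"
    "    BUG Fix mistyped parameter foo in bar.py\n"
    "\n"
    " bar.py | 3 ++-\n"
    " 1 file changed, 2 insertions(+), 1 deletion(-)\n"
)

entry = parse_log_entry(SAMPLE_LOG)
print(entry["commit"], "|", entry["message"])
```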

git diff

This command lists all the changes that were introduced between two different commits. For example, to see what changed between commits 8b2d3726a and d270c2b745 we use:

$ git diff 8b2d3726a d270c2b745
--- a/gitrisky/cli.py
+++ b/gitrisky/cli.py
@@ -61 +61,2 @@ def predict(commit):
-    score = model.predict_proba(features)
+    # pull out just the postive class probability
+    [(_, score)] = model.predict_proba(features)

The output tells us that between those two commits the file gitrisky/cli.py was modified at line 61. One line was deleted (the one marked with a minus sign) and two lines were added (the lines marked with a plus sign).
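The hunk header (the line starting with @@) is the part worth parsing, because it says which lines of the old file were removed. Below is a rough sketch of one way to read it; `deleted_line_ranges` is a hypothetical helper, not part of gitrisky. It assumes unquoted file paths and treats a "--- " line as a file header only when a "+++ " line follows it, which is a simplification.

```python
import re

# "@@ -61 +61,2 @@": old start (and optional count), then new start (and count).
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def deleted_line_ranges(diff_text):
    """Return {old_path: [(start, count), ...]} for lines removed by a diff."""
    ranges = {}
    path = None
    lines = diff_text.splitlines()
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if line.startswith("--- ") and next_line.startswith("+++ "):
            old = line[4:]
            # New files show up as "--- /dev/null" and have nothing to blame.
            path = old[2:] if old.startswith("a/") else None
        elif path is not None:
            match = HUNK_RE.match(line)
            if match:
                start = int(match.group(1))
                # A missing count means exactly one line.
                count = int(match.group(2)) if match.group(2) is not None else 1
                if count > 0:
                    ranges.setdefault(path, []).append((start, count))
    return ranges


SAMPLE_DIFF = "\n".join([
    "--- a/gitrisky/cli.py",
    "+++ b/gitrisky/cli.py",
    "@@ -61 +61,2 @@ def predict(commit):",
    "-    score = model.predict_proba(features)",
    "+    # pull out just the postive class probability",
    "+    [(_, score)] = model.predict_proba(features)",
])

print(deleted_line_ranges(SAMPLE_DIFF))
```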

git blame

This command lists the last commit which modified each line in a file. You can also get this information as of some point in the past by specifying a previous commit. For example, to see which commits last modified each line in the file cli.py as of commit e355e3e97 we use:

$ git blame -s e355e3e97 -- cli.py
d5f970f0 cli.py  1) """This module contains cli commands to train and score gitrisky models"""
d5f970f0 cli.py  2)
d5f970f0 cli.py  3) import click
d5f970f0 cli.py  4)
dc95b215 cli.py  5) from .model import create_model, save_model, load_model
209879e0 cli.py  6) from .gitcmds import get_latest_commit
209879e0 cli.py  7) from .parsing import get_features, get_labels

This output tells us that as of commit e355e3e97 lines 1-4 of cli.py were last modified by commit d5f970f0, line 5 was last modified by commit dc95b215, and lines 6-7 were last modified by commit 209879e0.
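For illustration, the same mapping can be pulled out of the `git blame -s` text with a small regular expression. This is a sketch under the assumption that file names contain no spaces; `last_touching_commits` is a name I made up for the example, not a gitrisky function.

```python
import re

# <commit> [<file name>] <line number>) <content>
# A leading "^" marks a boundary commit; the file name column is optional.
BLAME_RE = re.compile(r"^\^?([0-9a-f]+)\s+(?:(\S+)\s+)?(\d+)\)")


def last_touching_commits(blame_text):
    """Map each line number to the commit that last modified that line."""
    commits = {}
    for line in blame_text.splitlines():
        match = BLAME_RE.match(line)
        if match:
            commits[int(match.group(3))] = match.group(1)
    return commits


SAMPLE_BLAME = "\n".join([
    'd5f970f0 cli.py  1) """This module contains cli commands to train and score gitrisky models"""',
    "d5f970f0 cli.py  2)",
    "d5f970f0 cli.py  3) import click",
    "d5f970f0 cli.py  4)",
    "dc95b215 cli.py  5) from .model import create_model, save_model, load_model",
    "209879e0 cli.py  6) from .gitcmds import get_latest_commit",
    "209879e0 cli.py  7) from .parsing import get_features, get_labels",
])

for number, commit in sorted(last_touching_commits(SAMPLE_BLAME).items()):
    print(number, commit)
```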

How does the bug risk prediction work?

To predict code bug risk we want to build a model which predicts whether a given commit will introduce a bug. To build any kind of model we need labeled data, so our first challenge is to somehow label each commit in the history of our repository by whether it introduced a bug or not. This is where the git tools mentioned above come in handy. Using these tools we can identify which commits introduced bugs with a four-step process:

1. Identify all the commits which fix bugs by using git log to find all the commits which have messages that start with ‘Bug’ or ‘Fix’.

2. Figure out what each bugfix commit modified by using git diff to compare each bugfix commit to its immediate predecessor.

3. Find the last commit to touch the corrected lines by using git blame to inspect the corrected file as of right before the bugfix commit.

4. Label that commit as having introduced a bug.
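The four steps above can be strung together roughly as follows. This is a simplified sketch and not gitrisky's actual implementation: the function names are hypothetical, merge commits are skipped, every bugfix commit is assumed to have a parent, file paths are assumed to be unquoted, and I use `git blame --porcelain` rather than `-s` because its output is easier to parse reliably.

```python
import re
import subprocess

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? ")
# In porcelain output every blamed line starts with "<40-hex hash> <n> <n>".
PORCELAIN_RE = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def run_git(*args):
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def find_bugfix_commits(git=run_git):
    """Step 1: commits whose message starts with 'bug' or 'fix'."""
    fixes = []
    for line in git("log", "--no-merges", "--pretty=format:%H %s").splitlines():
        commit, _, subject = line.partition(" ")
        if subject.lower().startswith(("bug", "fix")):
            fixes.append(commit)
    return fixes


def find_bug_commits(git=run_git):
    """Steps 2-4: blame the lines each bugfix removed and collect the culprits."""
    bug_commits = set()
    for fix in find_bugfix_commits(git):
        parent = fix + "^"
        path, in_header = None, False
        # Step 2: compare the fix to its predecessor, with no context lines.
        for line in git("diff", "-U0", parent, fix).splitlines():
            if line.startswith("diff --git "):
                path, in_header = None, True
            elif in_header and line.startswith("--- "):
                old = line[4:]
                path = old[2:] if old.startswith("a/") else None
                in_header = False
            elif path is not None:
                match = HUNK_RE.match(line)
                if not match:
                    continue
                start = int(match.group(1))
                count = int(match.group(2)) if match.group(2) is not None else 1
                if count == 0:
                    continue  # the hunk only added lines
                # Step 3: who last touched those lines before the fix?
                blame = git("blame", "--porcelain", "-L",
                            f"{start},+{count}", parent, "--", path)
                for blame_line in blame.splitlines():
                    found = PORCELAIN_RE.match(blame_line)
                    if found:
                        bug_commits.add(found.group(1))  # Step 4: label it
    return bug_commits
```

Calling `find_bug_commits()` inside a repository returns the set of commit hashes to label as 1; every other commit gets a 0. The `git` parameter exists only so the functions can be exercised with a stand-in for the real command.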

This process comes with a caveat: in order to be effective the git repository being analyzed must have good ‘commit hygiene’. This means that each commit must have a descriptive commit message and must only include changes related to the task described in its commit message. If a commit has the message ‘Fix a bug in foo.py’ then that commit should only contain changes related to fixing that bug in foo.py, and not (for example) any unrelated typo corrections in the documentation.

The second thing we need to build a model is a set of predictive features for each commit. Here we can get a little bit creative — anything we can derive from the metadata associated with a single commit is fair game. A few easy options include the number of files changed, the number of lines added and deleted, the length of the commit message, whether the commit message contains specific words (e.g. ‘refactor’), and (if you really want to make friends) which of your coworkers authored the commit. To build the features we can parse the git log output for each commit and extract whatever features we came up with.
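As an example of what this parsing might look like, the sketch below turns the `git log -1 --stat` text for one commit into a handful of numeric features. The function name `commit_features` and the exact feature set are my own choices for illustration; gitrisky's real features may differ.

```python
import re


def commit_features(log_text):
    """Derive simple numeric features from `git log -1 --stat` output."""
    lines = log_text.splitlines()
    # Commit message lines are indented by four spaces.
    message = " ".join(l.strip() for l in lines if l.startswith("    "))
    # The stat summary line is indented by a single space.
    summary = next(
        (l for l in lines if re.match(r" \d+ files? changed", l)), ""
    )

    def count(pattern):
        match = re.search(pattern, summary)
        return int(match.group(1)) if match else 0

    return {
        "files_changed": count(r"(\d+) files? changed"),
        "insertions": count(r"(\d+) insertions?\(\+\)"),
        "deletions": count(r"(\d+) deletions?\(-\)"),
        "message_length": len(message),
        "mentions_refactor": int("refactor" in message.lower()),
    }


SAMPLE_STAT = (
    "commit d87039fad9dcfd846cd3ac65883cd5fde8d759b3\n"
    "Date:   Thu Nov 9 09:54:19 2017 -0600\n"
    "\n"
    "    BUG Fix mistyped parameter foo in bar.py\n"
    "\n"
    " bar.py | 3 ++-\n"
    " 1 file changed, 2 insertions(+), 1 deletion(-)\n"
)

print(commit_features(SAMPLE_STAT))
```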

Once we have labels and features for each commit, we can train a model on the data and use that model to score new commits. This is a binary classification task (each label is either 1 for ‘introduced a bug’ or 0 for ‘did not introduce a bug’), so for the model I used a default RandomForestClassifier from scikit-learn. Finally, I wrapped everything up into a nice command line tool with the click library. In action it looks like this:

$ cd repo/
$ gitrisky train
Model trained on 69 training examples with 14 positive cases
<add a new commit>
$ gitrisky predict
Commit 910cdb3c has a bug score of 0.2 / 1.0
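Under the hood, the modeling step amounts to something like the sketch below. The feature rows and labels here are made-up toy values, purely to show the shape of the data; a real run would build them from the repository's history. `random_state=0` is only there to make the toy example repeatable.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data (invented for illustration).
# Each row: [files_changed, insertions, deletions]
X = [
    [1, 2, 1],
    [5, 120, 40],
    [2, 10, 3],
    [8, 300, 150],
    [1, 1, 1],
    [6, 90, 60],
]
# 1 = commit introduced a bug, 0 = it did not
y = [0, 1, 0, 1, 0, 1]

model = RandomForestClassifier(random_state=0)
model.fit(X, y)

# Score one new commit; predict_proba returns one [P(0), P(1)] row per input.
new_commit = [[3, 40, 12]]
[(_, score)] = model.predict_proba(new_commit)
print(f"Bug score: {score:.1f} / 1.0")
```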