
Predicting code bug risk with git metadata

by ITECHNEWS
December 21, 2021
in Data Science, Leading Stories

One of the perks of working at Civis is the quarterly ‘Hack Time’. For one week each quarter, you get to explore an offbeat idea of your choice and then present the results to your colleagues. This past quarter I spent my time exploring some off-label uses for the version control tool git. Git is widely used in the tech industry, but often in a fairly superficial manner.

A handful of basic commands is enough for everyday version control tasks, but if you dig a bit deeper, git also has lots of bells and whistles that can expose valuable information about your coding practices.

In this post, I’ll describe the tool I built, which uses these bells and whistles (more specifically, the metadata git collects, along with several git utilities that access it) to predict the likelihood that a change to some code will introduce a bug. The tool, called gitrisky, is open source. I also presented this work at PyData NYC 2017, and a recording of the talk is available.

What metadata does git collect?

Git tracks changes to a codebase as a series of ‘commits’. Each commit contains a set of changes along with a unique identifier. When a new commit is recorded, git also records who made the commit, when it was recorded, and an explanatory note written by the user who made the commit.

This process produces two types of metadata: commit-level data and history-level data. Commit-level data includes all the data associated with a single commit, e.g. the explanatory note, author, timestamp, or the code changes included in that commit. History-level data is information that can be extracted from the sequence of commits but not from any single commit, for example, the changes in a particular file between two different points in the commit history. We need both types of metadata for our bug risk prediction tool. Fortunately, git makes it easy to get at this information.

How do you access git metadata?

Git comes with several extra command line utilities that expose different bits of metadata. For this project I used three of them: git log, git diff, and git blame.

git log

This command exposes all the metadata associated with a particular commit. For example, to see the metadata associated with the most recent commit we can use:

$ git log -1 --stat
commit d87039fad9dcfd846cd3ac65883cd5fde8d759b3
Author: someuser <[email protected]>
Date:   Thu Nov 9 09:54:19 2017 -0600

    BUG Fix mistyped parameter foo in bar.py

 bar.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

The output includes the unique identifier for the commit, the author, a timestamp, a custom message the author included explaining the changes, and a summary of the files and lines changed.
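
To work with this output programmatically, it helps to split it into named fields. The following is a minimal Python sketch, not gitrisky's actual code: the function name `parse_log_entry`, the field names, and the placeholder email address in the sample are all assumptions made for illustration. It handles only a single entry in git's default log format.

```python
SAMPLE_LOG = """\
commit d87039fad9dcfd846cd3ac65883cd5fde8d759b3
Author: someuser <someuser@example.com>
Date:   Thu Nov 9 09:54:19 2017 -0600

    BUG Fix mistyped parameter foo in bar.py

 bar.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
"""


def parse_log_entry(text):
    """Split one `git log --stat` entry into its metadata fields."""
    entry = {"commit": None, "author": None, "date": None,
             "message": [], "files": []}
    for line in text.splitlines():
        if line.startswith("commit "):
            entry["commit"] = line.split()[1]
        elif line.startswith("Author:"):
            entry["author"] = line[len("Author:"):].strip()
        elif line.startswith("Date:"):
            entry["date"] = line[len("Date:"):].strip()
        elif line.startswith("    "):
            # Commit message lines are indented by four spaces.
            entry["message"].append(line.strip())
        elif "|" in line:
            # Per-file stat lines look like " bar.py | 3 ++-".
            entry["files"].append(line.split("|")[0].strip())
    entry["message"] = "\n".join(entry["message"])
    return entry


entry = parse_log_entry(SAMPLE_LOG)
print(entry["commit"], entry["files"])
```

Parsing the human-readable format is fragile. A custom `--format` string with explicit placeholders (for example `%H` for the hash and `%s` for the subject) is usually easier to split reliably.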

git diff

This command lists all the changes that were introduced between two different commits. For example, to see what changed between commits 8b2d3726a and d270c2b745 we use:

$ git diff 8b2d3726a d270c2b745
--- a/gitrisky/cli.py
+++ b/gitrisky/cli.py
@@ -61 +61,2 @@ def predict(commit):
-    score = model.predict_proba(features)
+    # pull out just the postive class probability
+    [(_, score)] = model.predict_proba(features)

The output tells us that between those two commits the file gitrisky/cli.py was modified at line 61. One line was deleted (the one marked with a minus sign) and two lines were added (the lines marked with a plus sign).
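
The line numbers live in the hunk header (the line starting with `@@`), which has the form `-old_start[,old_count] +new_start[,new_count]`, where an omitted count means 1. Below is a rough Python sketch of pulling out the old-file line numbers each hunk covers. The function name `removed_line_numbers` is hypothetical. The sketch also assumes the diff was produced without context lines (e.g. with `-U0`), since otherwise each hunk range also includes unchanged neighbouring lines.

```python
import re

# Matches hunk headers such as "@@ -61 +61,2 @@ def predict(commit):"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

SAMPLE_DIFF = """\
--- a/gitrisky/cli.py
+++ b/gitrisky/cli.py
@@ -61 +61,2 @@ def predict(commit):
-    score = model.predict_proba(features)
+    # pull out just the positive class probability
+    [(_, score)] = model.predict_proba(features)
"""


def removed_line_numbers(diff_text):
    """Return {path: [old-file line numbers covered by the hunks]}."""
    touched = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            # "--- a/path" names the old version of the file.
            path = line[4:]
            current = path[2:] if path.startswith("a/") else path
            touched.setdefault(current, [])
        elif current is not None:
            match = HUNK_RE.match(line)
            if match:
                start = int(match.group(1))
                # A missing count in the header means exactly one line.
                count = int(match.group(2)) if match.group(2) is not None else 1
                touched[current].extend(range(start, start + count))
    return touched


print(removed_line_numbers(SAMPLE_DIFF))
```

This simple version would misread a deleted line whose own content begins with `-- `, so a sturdier parser would track how many lines remain in the current hunk.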

git blame

This command lists the last commit which modified each line in a file. You can also get this information as of some point in the past by specifying a previous commit. For example, to see which commits last modified each line in the file cli.py as of commit e355e3e97 we use:

$ git blame -s e355e3e97 -- cli.py
d5f970f0 cli.py  1) """This module contains cli commands to train and score gitrisky models"""
d5f970f0 cli.py  2)
d5f970f0 cli.py  3) import click
d5f970f0 cli.py  4)
dc95b215 cli.py  5) from .model import create_model, save_model, load_model
209879e0 cli.py  6) from .gitcmds import get_latest_commit
209879e0 cli.py  7) from .parsing import get_features, get_labels

This output tells us that as of commit e355e3e97 lines 1-4 of cli.py were last modified by commit d5f970f0, line 5 was last modified by commit dc95b215, and lines 6-7 were last modified by commit 209879e0.
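
To look up the commit behind a given line number, the blame output can be turned into a dictionary. This is an illustrative Python sketch only: the name `parse_blame` and the regular expression are assumptions based on the sample output above, not gitrisky's real parser.

```python
import re

# "<hash> [<file>] <line_no>) <content>"; boundary commits carry a leading "^".
BLAME_RE = re.compile(r"^\^?([0-9a-f]+)\s+(?:\S+\s+)?(\d+)\)")

SAMPLE_BLAME = '''\
d5f970f0 cli.py  1) """This module contains cli commands to train and score gitrisky models"""
d5f970f0 cli.py  2)
d5f970f0 cli.py  3) import click
d5f970f0 cli.py  4)
dc95b215 cli.py  5) from .model import create_model, save_model, load_model
209879e0 cli.py  6) from .gitcmds import get_latest_commit
209879e0 cli.py  7) from .parsing import get_features, get_labels
'''


def parse_blame(blame_text):
    """Map each line number to the abbreviated hash that last modified it."""
    line_to_commit = {}
    for line in blame_text.splitlines():
        match = BLAME_RE.match(line)
        if match:
            line_to_commit[int(match.group(2))] = match.group(1)
    return line_to_commit


blame = parse_blame(SAMPLE_BLAME)
print(blame[5])
```

File names containing spaces would break this pattern. The machine-readable `git blame --porcelain` output is a safer input if that matters.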

How does the bug risk prediction work?

To predict code bug risk we want to build a model which predicts whether a given commit will introduce a bug. To build any kind of model we need labeled data, so our first challenge is to somehow label each commit in the history of our repository by whether it introduced a bug or not. This is where the git tools mentioned above come in handy. Using these tools we can identify which commits introduced bugs with a four-step process:

1. Identify all the commits which fix bugs by using git log to find all the commits which have messages that start with ‘Bug’ or ‘Fix’.

2. Figure out what each bugfix commit modified by using git diff to compare each bugfix commit to its immediate predecessor.

3. Find the last commit to touch the corrected lines by using git blame to inspect the corrected file as of right before the bugfix commit.

4. Label that commit as having introduced a bug.
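
The four steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's real implementation: the function names and data shapes are hypothetical, the prefix match is case-insensitive (my choice, since the sample message starts with ‘BUG’), and the git diff and git blame results are passed in as pre-parsed dictionaries rather than obtained by calling git.

```python
def find_fix_commits(log_lines):
    """Step 1: keep commits whose subject starts with 'bug' or 'fix'.

    Each input line is "<hash> <subject>", e.g. from `git log --format="%h %s"`.
    """
    fixes = []
    for line in log_lines:
        commit, _, subject = line.partition(" ")
        if subject.lower().startswith(("bug", "fix")):
            fixes.append(commit)
    return fixes


def label_commits(all_commits, fix_commits, corrected_lines, blame):
    """Steps 2-4: label the commits that last touched lines a fix corrected.

    corrected_lines: {fix_commit: [(path, line_no), ...]}  (from git diff)
    blame: {(fix_commit, path, line_no): commit}  (from git blame, run on the
           state of the file just before the fix)
    """
    buggy = set()
    for fix in fix_commits:
        for path, line_no in corrected_lines.get(fix, []):
            culprit = blame.get((fix, path, line_no))
            if culprit is not None:
                buggy.add(culprit)
    # 1 = introduced a bug, 0 = did not.
    return {commit: int(commit in buggy) for commit in all_commits}


# Toy history: commit c3 fixes a line in bar.py that c1 last modified.
log_lines = [
    "c3 BUG Fix mistyped parameter foo in bar.py",
    "c2 Add baz option",
    "c1 Add bar.py",
]
corrected = {"c3": [("bar.py", 10)]}
blame = {("c3", "bar.py", 10): "c1"}

fixes = find_fix_commits(log_lines)
labels = label_commits(["c1", "c2", "c3"], fixes, corrected, blame)
print(labels)
```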

This process comes with a caveat: in order to be effective the git repository being analyzed must have good ‘commit hygiene’. This means that each commit must have a descriptive commit message and must only include changes related to the task described in its commit message. If a commit has the message ‘Fix a bug in foo.py’ then that commit should only contain changes related to fixing that bug in foo.py, and not (for example) any unrelated typo corrections in the documentation.

The second thing we need to build a model is a set of predictive features for each commit. Here we can get a little bit creative — anything we can derive from the metadata associated with a single commit is fair game. A few easy options include the number of files changed, the number of lines added and deleted, the length of the commit message, whether the commit message contains specific words (e.g. ‘refactor’), and (if you really want to make friends) which of your coworkers authored the commit. To build the features we can parse the git log output for each commit and extract whatever features we came up with.
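
As a concrete illustration, a few of these features can be computed from a commit message and the summary line that `git log --stat` prints. This is a hedged Python sketch: the function name `extract_features` and the feature names are made up for this example, and the author feature is left out for brevity.

```python
import re


def extract_features(message, stat_summary):
    """Build a feature dict from a commit message and a --stat summary line."""

    def count(pattern):
        # git omits the insertions or deletions part when it is zero,
        # so a missing match is treated as 0.
        match = re.search(pattern, stat_summary)
        return int(match.group(1)) if match else 0

    return {
        "files_changed": count(r"(\d+) files? changed"),
        "lines_added": count(r"(\d+) insertions?\(\+\)"),
        "lines_deleted": count(r"(\d+) deletions?\(-\)"),
        "message_length": len(message),
        "mentions_refactor": int("refactor" in message.lower()),
    }


message = "BUG Fix mistyped parameter foo in bar.py"
summary = " 1 file changed, 2 insertions(+), 1 deletion(-)"
features = extract_features(message, summary)
print(features)
```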

Once we have labels and features for each commit, we can train a model on the data and use that model to score new commits. This is a binary classification task (each label is either 1, introduced a bug, or 0, did not introduce a bug), so for the model I used a default RandomForestClassifier from scikit-learn. Finally, I wrapped everything up into a nice command line tool with the click library. In action it looks like this:

$ cd repo/
$ gitrisky train
Model trained on 69 training examples with 14 positive cases
<add a new commit>
$ gitrisky predict
Commit 910cdb3c has a bug score of 0.2 / 1.0
Source: Civis Analytics Team
Tags: gitmetadata