rm(list = ls())
seed <- 1
set.seed(seed)

require(kaiaulu)
require(data.table)
require(stringi)
require(knitr)
require(yaml)
require(magrittr)
require(visNetwork)
This notebook uses the JIRA issue data collected by the R Notebook download_jira_data.Rmd to calculate bug counts per file.
As usual, the first step is to load the project configuration file.
tool <- yaml::read_yaml("../tools.yml")
conf <- yaml::read_yaml("../conf/geronimo.yml")
perceval_path <- tool[["perceval"]]
git_repo_path <- conf[["version_control"]][["log"]]

# Issue ID Regex on Commit Messages
issue_id_regex <- conf[["commit_message_id_regex"]][["issue_id"]]
# Path to Jira Issues (obtained using `download_jira_data Notebook`)
jira_issues_path <- conf[["issue_tracker"]][["jira"]][["issues"]]

# Filters
file_extensions <- conf[["filter"]][["keep_filepaths_ending_with"]]
substring_filepath <- conf[["filter"]][["remove_filepaths_containing"]]
To establish bug count, we must map each file to its associated issue. In general (but not always), an open source project will adopt a commit message convention to label issue ids. For example, Kaiaulu uses i #<issue number>. JIRA dictates issue ids in the format PROJECT-<issue number>.
Since commits are in turn associated to files, we can establish a map from file to issue. Additionally, projects which use JIRA commonly annotate on their issue type whether the issue is a feature or bug.
By connecting both git log and JIRA information, therefore, we can count the number of bugs per file.
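To make the commit-message convention concrete, here is a minimal sketch of pulling a JIRA-style issue id out of commit messages with base R. The messages and the regex are made up for illustration; the notebook itself reads its regex from the project configuration file, and this is not how Kaiaulu implements the extraction internally.

```r
# Hypothetical commit messages; only the first two carry a JIRA-style id.
commit_messages <- c(
  "GERONIMO-123 Fix null pointer in deployer",
  "Refactor logging (see GERONIMO-45)",
  "Update README"
)

# Assumed regex for ids of the form PROJECT-<issue number>.
issue_id_regex <- "[A-Z]+-[0-9]+"

# regexpr() locates the first match in each message (-1 when there is none).
match_pos <- regexpr(issue_id_regex, commit_messages)

# regmatches() returns only the matched substrings, so fill them into a
# vector that keeps NA for messages without an id.
issue_ids <- rep(NA_character_, length(commit_messages))
issue_ids[match_pos > 0] <- regmatches(commit_messages, match_pos)

issue_ids
```

Messages without an id stay as NA, which is what later lets a join against the issue tracker leave those commits unmatched.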
After specifying the necessary configuration file, we first parse the project git log:
project_git <- parse_gitlog(perceval_path,git_repo_path)
We then keep only the file extensions of interest and remove test files:
project_git <- project_git %>%
  filter_by_file_extension(file_extensions,"file_pathname") %>%
  filter_by_filepath_substring(substring_filepath,"file_pathname")
Timestamps are standardized to UTC so that we can select a slice of the data (commits authored in 2012) for visualization:
project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz,
                                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz,
                                               format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

project_git_slice <- project_git[author_datetimetz >= as.POSIXct("2012-01-01", format = "%Y-%m-%d",tz = "UTC") &
                                 author_datetimetz < as.POSIXct("2012-12-31", format = "%Y-%m-%d",tz = "UTC")]
We can use a built-in Kaiaulu function to search for a regular expression (regex) of the issue id. First we use the regex to calculate how many commits contain issue ids. Ideally, you should consider projects with a high enough coverage, or the results may not be representative.
The total number of commits with issue ids in the chosen git slice is:
commit_message_id_coverage(project_git_slice,issue_id_regex)
Proportion of commit messages containing issue ids relative to all commits in the slice:
normalized_coverage <- commit_message_id_coverage(project_git_slice,issue_id_regex)/length(unique(project_git_slice$commit_hash))
normalized_coverage
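As a rough illustration of what the coverage proportion measures, the sketch below computes it by hand on a made-up table with base R. The column name `commit_message`, the data, and the regex are assumptions for this example; it does not claim to reproduce the internals of `commit_message_id_coverage`.

```r
# Hypothetical slice: one row per (commit, file), so commit hashes repeat.
git_slice <- data.frame(
  commit_hash = c("a1", "a1", "b2", "c3", "d4"),
  commit_message = c("PROJ-1 fix parser", "PROJ-1 fix parser",
                     "cleanup", "PROJ-7 add test", "bump version"),
  stringsAsFactors = FALSE
)

# Assumed id format PROJECT-<issue number>.
issue_id_regex <- "[A-Z]+-[0-9]+"

# Deduplicate to one row per commit so multi-file commits count once.
commits <- unique(git_slice[, c("commit_hash", "commit_message")])

# Flag the commits whose message contains an issue id.
has_id <- grepl(issue_id_regex, commits$commit_message)

n_with_id <- sum(has_id)               # commits that carry an issue id
coverage <- n_with_id / nrow(commits)  # share of all commits in the slice
coverage
```

A low proportion means most commits cannot be linked to an issue, so any per-file bug count built on the mapping would rest on a small and possibly unrepresentative subset of the history.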
To get a better idea of the mapping, we can also visualize the issue-file network we just established using the regex as follows:
project_commit_message_id_edgelist <- transform_commit_message_id_to_network(project_git_slice,
                                                                             commit_message_id_regex = issue_id_regex)

project_commit_message_id_network <- igraph::graph_from_data_frame(d = project_commit_message_id_edgelist[["edgelist"]],
                                                                   directed = TRUE,
                                                                   vertices = project_commit_message_id_edgelist[["nodes"]])
visIgraph(project_commit_message_id_network,randomSeed = 1)
As the network shows, visual inspection can identify interesting outliers in the data mapping. Remember that you can zoom in to read the individual issue ids and file paths.
Now that we explored the file to issue mapping, let's focus on calculating the total bug count per file for the entire git log.
First, we can extract only the detected issue id from the commit message, and add it in a separate column:
project_git <- parse_commit_message_id(project_git, issue_id_regex)
Now that we know the issue id of each commit, we can add the issue_type information obtained directly from JIRA. This tells us which files were involved in a bug, and hence gives us the file bug count. We will also consider only issues that are already closed, in case an open bug is later found to be invalid.
The jira table contains many fields (see download_jira_data.Rmd for all columns). Here we show a sample of the fields relevant to bug count:
jira_issues <- parse_jira(jira_issues_path)[["issues"]]
jira_issues <- jira_issues[issue_status == "Closed" & issue_type == "Bug"]
kable(head(jira_issues[,.(issue_key,issue_type,issue_status,issue_summary)]))
We can then perform a left join using the extracted key of the git log against the issue key from the issue data:
project_git <- merge(project_git,jira_issues,all.x=TRUE,by.x="commit_message_id", by.y="issue_key")
For each file, we now know the commits that were associated with fixing bugs. We then group the number of bug-related issues by file path. Our result should show the number of bugs per file path. The following outputs the top 20 files with the most bugs.
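# The left join keeps commits that matched no closed bug (issue_type is NA);
# exclude them, then count distinct bug issue ids per file.
file_bug_count <- project_git[!is.na(issue_type),
                              .(bug_count = uniqueN(commit_message_id)),
                              by = "file_pathname"]
kable(head(file_bug_count[order(-bug_count)],20))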
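The join-then-count step can be checked on a small made-up example. The sketch below uses data.table with hypothetical file names and issue ids. It counts distinct issue ids among rows that matched a bug, which is one reasonable reading of "bugs per file"; counting rows instead would count bug-fixing commits per file.

```r
library(data.table)

# Hypothetical git log: one row per (commit, file) with the extracted issue id.
git_log <- data.table(
  commit_hash = c("a1", "a1", "b2", "c3", "d4", "e5"),
  file_pathname = c("Parser.java", "Lexer.java", "Parser.java",
                    "Parser.java", "Lexer.java", "Lexer.java"),
  commit_message_id = c("PROJ-1", "PROJ-1", "PROJ-1", "PROJ-3", "PROJ-2", NA)
)

# Hypothetical closed bugs; PROJ-2 is absent (imagine it is a feature request).
bugs <- data.table(issue_key = c("PROJ-1", "PROJ-3"),
                   issue_type = c("Bug", "Bug"))

# The left join keeps every git row; rows without a matching bug get NA.
joined <- merge(git_log, bugs, all.x = TRUE,
                by.x = "commit_message_id", by.y = "issue_key")

# Drop unmatched rows, then count distinct bug ids per file.
file_bug_count <- joined[!is.na(issue_type),
                         .(bug_count = uniqueN(commit_message_id)),
                         by = "file_pathname"]
file_bug_count[order(-bug_count)]
```

Here Parser.java is touched by two commits for PROJ-1 and one for PROJ-3, so it has two distinct bugs, while Lexer.java has one. Without the `!is.na(issue_type)` filter, the two unmatched Lexer.java rows would also be counted.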