In this vignette, we show how to download GitHub comments from both issues and pull requests. This data may be useful if development communication occurs through issues or pull requests on GitHub, in addition to or instead of mailing lists. For details on how to merge various communication sources, see the reply_communication_showcase.Rmd
notebook. The communication table parsed from GitHub comments by parse_github_comments
has the same format as the mailing list and JIRA comment tables. In turn, that function expects as input the table generated at the end of this notebook.
This notebook assumes you are familiar with obtaining a GitHub token. For details, see the github_api_showcase.Rmd
notebook. The functions in Kaiaulu assume you have a token available, which can be passed as a parameter.
rm(list = ls())
require(kaiaulu)
require(data.table)
require(jsonlite)
require(knitr)
require(magrittr)
require(gt)
In this notebook, we are interested in obtaining comment data from the GitHub API. Development communication may occur in either issues or pull requests. The GitHub API Pulls documentation states that 'Comments on pull requests can be managed via the Issue Comments API. Every pull request is an issue, but not every issue is a pull request. For this reason, "shared" actions for both features, like manipulating assignees, labels and milestones, are provided within the Issues API.'
Further details are noted on the issue endpoint: 'Note: GitHub's REST API v3 considers every pull request an issue, but not every issue is a pull request. For this reason, "Issues" endpoints may return both issues and pull requests in the response. You can identify pull requests by the pull_request key. Be aware that the id of a pull request returned from "Issues" endpoints will be an issue id. To find out the pull request id, use the "List pull requests" endpoint.'
While the above is true for comments, the first message of every issue and every pull request (which also includes a title) is not regarded by GitHub as a comment. For example, suppose an issue was opened by Author A and contains only one reply by Author B. In this case, A and B established communication, and we would like this interaction to be reflected in the final table. However, if we were to use only the comments endpoint, we would obtain only Author B's comment, and not Author A's. The same is true for pull requests.
Therefore, in this notebook we rely on three endpoints from the GitHub API: the Issue endpoint
to obtain the "first comment" of every issue, the Pull Request endpoint
to obtain the "first comment" of every pull request, and finally the Issue and Pull Request Comment endpoint,
which provides comments for both issues and pull requests together.
To use the pipeline, you must specify the organization and project of interest, and your token. Obtain a GitHub token following the instructions here.
conf <- parse_config("../conf/kaiaulu.yml")

# Paths where the raw data will be saved. A folder with the repo name and sub-folders will be created.
save_path_issue_refresh <- get_github_issue_search_path(conf, "project_key_1")
save_path_issue <- get_github_issue_path(conf, "project_key_1")
save_path_pull_request <- get_github_pull_request_path(conf, "project_key_1")
save_path_issue_or_pr_comments <- get_github_issue_or_pr_comment_path(conf, "project_key_1")
save_path_commit <- get_github_commit_path(conf, "project_key_1")

owner <- get_github_owner(conf, "project_key_1") # Has to match the GitHub organization (e.g. github.com/sailuh)
repo <- get_github_repo(conf, "project_key_1") # Has to match the GitHub repository (e.g. github.com/sailuh/perceive)

# Your github_token file (a text file) contains the GitHub API token
token <- scan("~/.ssh/github_token", what = "character", quiet = TRUE)
Kaiaulu offers the ability to download issue data by date range. Using github_api_project_issue_by_date(),
you can download issues based on their 'created' dates. The acceptable formats are:
YYYY-MM-DD Example: 2023-04-15 (15th April 2023)
YYYY-MM-DDTHH:MM Example: 2023-04-15T13:45 (15th April 2023 at 13:45 UTC)
YYYY-MM-DDTHH:MM:SS Example: 2023-04-15T13:45:30 (15th April 2023 at 13:45:30 UTC)
YYYY-MM-DDTHH:MM:SSZ or YYYY-MM-DDTHH:MM:SS+00:00 Example: 2023-04-15T13:45:30Z or 2023-04-15T13:45:30+00:00 (15th April 2023 at 13:45:30 UTC)
If both date range parameters are set to NULL,
then all issues will be retrieved. Alternatively, if the lower bound
is set to "2000-01-01" and the upper bound
is set to "2005-01-01", then only issues created between these two dates will be retrieved.
If you would like to retrieve only issues created after a certain date, set the upper bound to NULL
and the lower bound
to the date.
If you would like to retrieve only issues created before a certain date, set the lower bound to NULL
and the upper bound to the date. An example with only a lower bound is sketched below.
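For example, a minimal hypothetical call that retrieves only issues created on or after 2020-01-01 could leave the upper bound as NULL, mirroring the full chunk below (the date and the "is:issue" value are illustrative only):

# Hypothetical example: only a lower bound on the created date is given;
# the upper bound is NULL, so all issues created after 2020-01-01 are retrieved.
gh_response <- github_api_project_issue_by_date(owner, repo, token,
                                                "2020-01-01", # lower bound on created date
                                                NULL,         # no upper bound
                                                "is:issue",
                                                verbose = TRUE)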
created_lower_bound_issue <- "1990-01-01"
created_upper_bound_issue <- "2021-01-01"

# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
# issue_or_pr <- "is:issue"
issue_or_pr <- "is:pull-request"

if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

dir.create(issue_date_save_path)

# Make the initial API call
gh_response <- github_api_project_issue_by_date(owner,
                                                repo,
                                                token,
                                                created_lower_bound_issue,
                                                created_upper_bound_issue,
                                                issue_or_pr,
                                                verbose = TRUE)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
# Read all JSON files from the directory
all_issue_files <- lapply(list.files(save_path_issue_refresh, full.names = TRUE), read_json)

# Parse each JSON file using the refresh parser
all_issue <- lapply(all_issue_files, github_parse_search_issues_refresh)

# Combine all the data tables
all_issues_combined <- rbindlist(all_issue, fill = TRUE)

# Display the head of the combined issues table
head(all_issues_combined) %>%
  gt(auto_align = FALSE)
You may alternatively have full flexibility in the choice of search parameters using github_api_project_issue_search().
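For illustration, a hypothetical query using GitHub's search qualifiers (which are defined by the GitHub search API, not by Kaiaulu) might look as follows:

# Hypothetical search query: closed items labeled "bug" created during 2020.
query <- "label:bug state:closed created:2020-01-01..2020-12-31"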
query <- "<parameters_of_interest>"

# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
# issue_or_pr <- "is:issue"
issue_or_pr <- "is:pull-request"

# Check which folder to use
if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

gh_response <- github_api_project_issue_search(owner,
                                               repo,
                                               token,
                                               query,
                                               issue_or_pr,
                                               verbose = TRUE)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
The above function call downloads all available issues up to the specified limit (which defaults to the API limit). The following chunk downloads only issues created after the most recently created issue already downloaded. This allows the user to 'refresh' their data, or to continue downloading if a rate limit was previously reached.
There are a few instances in which downloading the issue data with comments does not capture all the issues:
This function relies on the file naming convention the downloaders use when saving data to perform this operation. For details, see the function documentation.
# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
issue_or_pr <- "is:issue"
# issue_or_pr <- "is:pull-request"

if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

# dir.create(save_path_issue_refresh)

# API call bounded by the most recent created date already downloaded
gh_response <- github_api_project_issue_refresh(owner,
                                                repo,
                                                token,
                                                issue_date_save_path,
                                                issue_or_pr,
                                                verbose = TRUE)

github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
# Read all JSON files from the directory
all_issue_files <- lapply(list.files(save_path_issue_refresh, full.names = TRUE), read_json)

# Parse each JSON file using the refresh parser
all_issue <- lapply(all_issue_files, github_parse_search_issues_refresh)

# Combine all the data tables
all_issues_combined <- rbindlist(all_issue, fill = TRUE)

# Display the head of the combined issues table
head(all_issues_combined) %>%
  gt(auto_align = FALSE)
The refresh parser also parses issue data, but from the 'refresh_issues' folder, that is, data retrieved from the search endpoint. In this notebook, this is the data from the refresh issues section above, saved to the issue_search directory.
Sometimes it is helpful to retrieve issue comments created after a particular date. In the GitHub API, issue comments have the drawback that only a lower bound can be specified: all comments created or updated on or after that date (inclusive) are downloaded. If there are multiple revisions of a comment, only the most recent version is downloaded. The acceptable date formats are:
YYYY-MM-DD Example: 2023-04-15 (15th April 2023)
YYYY-MM-DDTHH:MM Example: 2023-04-15T13:45 (15th April 2023 at 13:45 UTC)
YYYY-MM-DDTHH:MM:SS Example: 2023-04-15T13:45:30 (15th April 2023 at 13:45:30 UTC)
YYYY-MM-DDTHH:MM:SSZ or YYYY-MM-DDTHH:MM:SS+00:00 Example: 2023-04-15T13:45:30Z or 2023-04-15T13:45:30+00:00 (15th April 2023 at 13:45:30 UTC)
updated_lower_bound_comment <- "2024-04-25"

# Make the initial API call
gh_response <- github_api_project_issue_or_pr_comments_by_date(owner = owner,
                                                               repo = repo,
                                                               token = token,
                                                               since = updated_lower_bound_comment,
                                                               verbose = TRUE)

# dir.create(save_path_issue_or_pr_comments)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         save_path_issue_or_pr_comments,
                         prefix = "issue",
                         verbose = TRUE)
all_issue_or_pr_comments <- lapply(list.files(save_path_issue_or_pr_comments, full.names = TRUE), read_json)
all_issue_or_pr_comments <- lapply(all_issue_or_pr_comments, github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- rbindlist(all_issue_or_pr_comments, fill = TRUE)

head(all_issue_or_pr_comments, 2) %>%
  gt(auto_align = FALSE)
Similar to the issue refresh, this chunk downloads comments that have been created or updated since the most recent created date among the data already downloaded. This allows us to 'refresh' the comments, or to continue downloading if a rate limit was reached.
This function relies on the file naming convention the downloaders use when saving data to perform this operation. For details, see the function documentation.
Because the endpoint this function relies on filters by the updated timestamp, running the refresher downloads the most recent version of each changed comment; only the latest version is downloaded, not every revision. However, if the same comment is modified again before the next refresh call, executing the refresher again will result in two comments with the same comment id being present in the table. This can be addressed by grouping the parsed table by comment_id and keeping the row with the maximum updated_at, resulting in a table that contains only the most recent version of each comment as of the latest time the refresher was executed.
# API call bounded by the most recent updated date already downloaded
gh_response_issue_or_pr_comment <- github_api_project_issue_or_pr_comment_refresh(owner,
                                                                                  repo,
                                                                                  token,
                                                                                  save_path_issue_or_pr_comments,
                                                                                  verbose = TRUE)

# dir.create(save_path_issue_or_pr_comments)

# Iterate over the remaining pages and write each to a JSON file
github_api_iterate_pages(token, gh_response_issue_or_pr_comment,
                         save_path_issue_or_pr_comments,
                         prefix = "issue_or_pr_comment",
                         verbose = TRUE)
all_issue_or_pr_comments <- lapply(list.files(save_path_issue_or_pr_comments, full.names = TRUE), read_json)
all_issue_or_pr_comments <- lapply(all_issue_or_pr_comments, github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- rbindlist(all_issue_or_pr_comments, fill = TRUE)

head(all_issue_or_pr_comments, 2) %>%
  gt(auto_align = FALSE)
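The group by described above could be performed with data.table. The following is a minimal sketch, assuming the parsed table contains comment_id and updated_at columns (the actual column names depend on the parser output):

# Keep only the most recent version of each comment id.
# Assumes `comment_id` and `updated_at` columns exist; ISO 8601 timestamps
# sort correctly as strings, so order() on the raw values is sufficient.
all_issue_or_pr_comments <- all_issue_or_pr_comments[
  order(updated_at, decreasing = TRUE)
][, .SD[1], by = "comment_id"]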
The following API endpoints do not currently implement the refresher, but may provide useful information.
The three endpoints used above do not contain author name and e-mail information, only the developers' GitHub IDs. This is a problem if the project being studied contains communication data outside the GitHub ecosystem (e.g. it uses JIRA as its issue tracker). In order to link developers across other sources, we need both author name and e-mail information.
To do so, we can use the commits endpoint.
gh_response <- github_api_project_commits(owner, repo, token)
dir.create(save_path_commit)
github_api_iterate_pages(token, gh_response,
                         save_path_commit,
                         prefix = "commit",
                         verbose = TRUE)
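As an illustration of what the raw commit JSON contains, the following minimal sketch extracts the GitHub login, author name, and author e-mail directly from the saved files. The field names follow the GitHub commits API; this is not Kaiaulu's own parser, and it assumes each saved file holds one page (a JSON array) of commits:

# Sketch: extract login, name, and e-mail from the raw commit JSON pages.
commit_files <- list.files(save_path_commit, full.names = TRUE)
raw_commits <- unlist(lapply(commit_files, read_json), recursive = FALSE)

commit_authors <- rbindlist(lapply(raw_commits, function(commit) {
  data.table(
    # The top-level "author" object is NULL when GitHub cannot match a user.
    github_login = if (!is.null(commit[["author"]])) commit[["author"]][["login"]] else NA_character_,
    author_name  = commit[["commit"]][["author"]][["name"]],
    author_email = commit[["commit"]][["author"]][["email"]]
  )
}), fill = TRUE)

head(unique(commit_authors))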
The refresh search endpoint does not include pull requests. If the intent is to download both issues and pull requests, then the issues endpoint should be used instead:
gh_response <- github_api_project_issue(owner, repo, token)
dir.create(save_path_issue)
github_api_iterate_pages(token, gh_response,
                         save_path_issue,
                         prefix = "issue")
all_issue_or_pr <- lapply(list.files(save_path_issue, full.names = TRUE), read_json)
all_issue_or_pr <- lapply(all_issue_or_pr, github_parse_project_issue_or_pr_comments)
all_issue_or_pr <- rbindlist(all_issue_or_pr, fill = TRUE)

head(all_issue_or_pr, 2) %>%
  gt(auto_align = FALSE)
Similarly to the Issue API, we can also obtain other metadata from pull requests, including their labels.
# save_path_pull_request is already defined above from the project configuration.
# save_path_pull_request <- paste0(save_path, "/pull_request/")
gh_response <- github_api_project_pull_request(owner, repo, token)
dir.create(save_path_pull_request)
github_api_iterate_pages(token, gh_response,
                         save_path_pull_request,
                         prefix = "pull_request")
all_pr <- lapply(list.files(save_path_pull_request, full.names = TRUE), read_json)
all_pr <- lapply(all_pr, github_parse_project_pull_request)
all_pr <- rbindlist(all_pr, fill = TRUE)

tail(all_pr, 1) %>%
  gt(auto_align = FALSE)
If our interest is to observe all the development communication, we may regard the opening post of each issue, the opening post of each pull request, and the comments as simply "replies" in a single table.
Note that because we obtain the author and committer name and e-mail from the commit data, only comments made by developers who made at least one commit will contain their name and e-mail. That is, people who only post issues and comments will not have that information available; instead, their GitHub username will appear in the reply_from
column. Therefore, identity matching is likely not to work when no author name and e-mail is available. If you are analyzing social smells, this will not be a problem for org silo and missing link (as their conditions require code changes). However, since radio silence only considers the mailing list network, caution must be exercised.
Below we show the result of such a merge, including the name and e-mail fields obtained from the commit table. As before, we do not display the body column to prevent breaking the HTML format.
replies <- parse_github_replies(save_path_issue,
                                save_path_pull_request,
                                save_path_issue_or_pr_comments,
                                save_path_commit)

tail(replies, 2) %>%
  gt(auto_align = FALSE)
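As a quick check related to the identity caveat above, the following sketch counts how many replies appear to carry an e-mail address, assuming reply_from holds either a "Name e-mail" string or a bare GitHub username (the exact contents depend on parse_github_replies()):

# Rough check: which replies include an e-mail address in reply_from?
# Assumes reply_from holds either "Name e-mail" or a bare GitHub username.
has_email <- grepl("@", replies[["reply_from"]])
table(has_email)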