In this vignette, we show how to download GitHub comments from both issues and pull requests. This data may be useful if development communication occurs through issues or pull requests on GitHub, in addition to or instead of mailing lists. For details on how to merge various communication sources, see the reply_communication_showcase.Rmd
notebook. The communication table parsed from GitHub comments by parse_github_comments
has the same format as the mailing list and JIRA comment tables. In turn, that function expects as input the table generated at the end of this notebook.
This notebook assumes you are familiar with obtaining a GitHub token. For details, see the github_api_showcase.Rmd
notebook. The functions in Kaiaulu assume you have a token available, which can be passed as a parameter.
rm(list = ls())
require(kaiaulu)
require(data.table)
require(jsonlite)
require(knitr)
require(magrittr)
require(gt)
In this notebook, we are interested in obtaining comment data from the GitHub API. Development communication may occur in either issues or pull requests. The GitHub API Pulls documentation states that 'Comments on pull requests can be managed via the Issue Comments API. Every pull request is an issue, but not every issue is a pull request. For this reason, "shared" actions for both features, like manipulating assignees, labels and milestones, are provided within the Issues API.'
Further details are noted on the issue endpoint: 'Note: GitHub's REST API v3 considers every pull request an issue, but not every issue is a pull request. For this reason, "Issues" endpoints may return both issues and pull requests in the response. You can identify pull requests by the pull_request key. Be aware that the id of a pull request returned from "Issues" endpoints will be an issue id. To find out the pull request id, use the "List pull requests" endpoint.'
While the above is true for comments, the first message of every issue and every pull request (which also includes a title) is not regarded by GitHub as a comment. For example, suppose an issue was opened by Author A and contains only one reply by Author B. In this case, A and B established communication, and we would like this interaction to be reflected in the final table. However, if we were to use only the comments endpoint, we would obtain only Author B's comment, and not Author A's. The same is true for pull requests.
Therefore, in this notebook we rely on three endpoints from the GitHub API: the Issue endpoint
to obtain the "first comment" of every issue, the Pull Request endpoint
to obtain the "first comment" of every pull request, and finally the Issue and Pull Request Comment endpoint,
which provides comments for both issues and pull requests together.
To use the pipeline, you must specify the organization and project of interest, and your token. Obtain a GitHub token following the instructions here.
conf <- parse_config("../conf/kaiaulu.yml")

# Paths where the raw data will be saved. A folder with the repo name and sub-folders will be created.
save_path_issue_refresh <- get_github_issue_search_path(conf, "project_key_1")
save_path_issue <- get_github_issue_path(conf, "project_key_1")
save_path_pull_request <- get_github_pull_request_path(conf, "project_key_1")
save_path_issue_or_pr_comments <- get_github_issue_or_pr_comment_path(conf, "project_key_1")
save_path_commit <- get_github_commit_path(conf, "project_key_1")

owner <- get_github_owner(conf, "project_key_1") # Has to match the GitHub organization (e.g. github.com/sailuh)
repo <- get_github_repo(conf, "project_key_1") # Has to match the GitHub repository (e.g. github.com/sailuh/perceive)

# Your github_token file (a text file) contains the GitHub API token
token <- scan("~/.ssh/github_token", what = "character", quiet = TRUE)
Kaiaulu offers the ability to download issue data by date range. Using github_api_project_issue_by_date(),
you can download issues based on their 'created' dates. The acceptable formats are:
YYYY-MM-DD Example: 2023-04-15 (15th April 2023)
YYYY-MM-DDTHH:MM Example: 2023-04-15T13:45 (15th April 2023 at 13:45 UTC)
YYYY-MM-DDTHH:MM:SS Example: 2023-04-15T13:45:30 (15th April 2023 at 13:45:30 UTC)
YYYY-MM-DDTHH:MM:SSZ or YYYY-MM-DDTHH:MM:SS+00:00 Example: 2023-04-15T13:45:30Z or 2023-04-15T13:45:30+00:00 (15th April 2023 at 13:45:30 UTC)
If both date range parameters are set to NULL,
then all issues will be retrieved. Alternatively, if the lower bound
is set to "2000-01-01" and the upper bound
is set to "2005-01-01", then only issues created between these two dates will be retrieved.
If you would like to retrieve only issues created after a certain date, set the upper bound to NULL
and the lower bound
to the date.
If you would like to retrieve only issues created before a certain date, set the lower bound to NULL
and the upper bound to the date. An example with only a lower bound is sketched below.
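For example, a minimal hypothetical call that retrieves only issues created on or after 2020-01-01 could leave the upper bound as NULL, mirroring the full chunk below (the date and the "is:issue" value are illustrative only):

# Hypothetical example: only a lower bound on the created date is given;
# the upper bound is NULL, so all issues created after 2020-01-01 are retrieved.
gh_response <- github_api_project_issue_by_date(owner, repo, token,
                                                "2020-01-01", # lower bound on created date
                                                NULL,         # no upper bound
                                                "is:issue",
                                                verbose = TRUE)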
created_lower_bound_issue <- "1990-01-01"
created_upper_bound_issue <- "2021-01-01"

# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
# issue_or_pr <- "is:issue"
issue_or_pr <- "is:pull-request"

if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

dir.create(issue_date_save_path)

# Make the initial API call
gh_response <- github_api_project_issue_by_date(owner,
                                                repo,
                                                token,
                                                created_lower_bound_issue,
                                                created_upper_bound_issue,
                                                issue_or_pr,
                                                verbose = TRUE)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
# Read all JSON files from the directory
all_issue_files <- lapply(list.files(save_path_issue_refresh, full.names = TRUE), read_json)

# Parse each JSON file using the refresh parser
all_issue <- lapply(all_issue_files, github_parse_search_issues_refresh)

# Combine all the data tables
all_issues_combined <- rbindlist(all_issue, fill = TRUE)

# Display the head of the combined issues table
head(all_issues_combined) %>%
  gt(auto_align = FALSE)
You may alternatively have full flexibility in the choice of search parameters using github_api_project_issue_search().
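For illustration, a hypothetical query using GitHub's search qualifiers (which are defined by the GitHub search API, not by Kaiaulu) might look as follows:

# Hypothetical search query: closed items labeled "bug" created during 2020.
query <- "label:bug state:closed created:2020-01-01..2020-12-31"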
query <- "<parameters_of_interest>"

# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
# issue_or_pr <- "is:issue"
issue_or_pr <- "is:pull-request"

# Check which folder to use
if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

gh_response <- github_api_project_issue_search(owner,
                                               repo,
                                               token,
                                               query,
                                               issue_or_pr,
                                               verbose = TRUE)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
The above function call downloads all available issues up to the specified limit (which defaults to the API limit). The following chunk downloads only issues created after the most recently created issue already downloaded. This allows the user to 'refresh' their data, or to continue downloading if a rate limit was previously reached.
There are a few instances in which downloading the issue data with comments does not capture all the issues:
This function relies on the file naming convention the downloaders use when saving data to perform this operation. For details, see the function documentation.
# Acceptable values for issue_or_pr:
# "is:issue"
# "is:pull-request"
issue_or_pr <- "is:issue"
# issue_or_pr <- "is:pull-request"

if (issue_or_pr == "is:issue") {
  issue_date_save_path <- save_path_issue_refresh
} else {
  issue_date_save_path <- save_path_pull_request
}

# dir.create(save_path_issue_refresh)

# API call bounded by the most recent created date already downloaded
gh_response <- github_api_project_issue_refresh(owner,
                                                repo,
                                                token,
                                                issue_date_save_path,
                                                issue_or_pr,
                                                verbose = TRUE)

github_api_iterate_pages(token, gh_response,
                         issue_date_save_path,
                         prefix = "issue",
                         verbose = TRUE)
# Read all JSON files from the directory
all_issue_files <- lapply(list.files(save_path_issue_refresh, full.names = TRUE), read_json)

# Parse each JSON file using the refresh parser
all_issue <- lapply(all_issue_files, github_parse_search_issues_refresh)

# Combine all the data tables
all_issues_combined <- rbindlist(all_issue, fill = TRUE)

# Display the head of the combined issues table
head(all_issues_combined) %>%
  gt(auto_align = FALSE)
The refresh parser also parses issue data, but from the 'refresh_issues' folder, that is, data retrieved from the search endpoint. In this notebook, this is the data from the refresh issues section above, saved to the issue_search directory.
Sometimes it is helpful to retrieve issue comments created after a particular date. In the GitHub API, issue comments have the drawback that only a lower bound can be specified: all comments created or updated on or after that date (inclusive) are downloaded. If there are multiple revisions of a comment, only the most recent version is downloaded. The acceptable date formats are:
YYYY-MM-DD Example: 2023-04-15 (15th April 2023)
YYYY-MM-DDTHH:MM Example: 2023-04-15T13:45 (15th April 2023 at 13:45 UTC)
YYYY-MM-DDTHH:MM:SS Example: 2023-04-15T13:45:30 (15th April 2023 at 13:45:30 UTC)
YYYY-MM-DDTHH:MM:SSZ or YYYY-MM-DDTHH:MM:SS+00:00 Example: 2023-04-15T13:45:30Z or 2023-04-15T13:45:30+00:00 (15th April 2023 at 13:45:30 UTC)
updated_lower_bound_comment <- "2024-04-25"

# Make the initial API call
gh_response <- github_api_project_issue_or_pr_comments_by_date(owner = owner,
                                                               repo = repo,
                                                               token = token,
                                                               since = updated_lower_bound_comment,
                                                               verbose = TRUE)

# dir.create(save_path_issue_or_pr_comments)

# Make subsequent API calls and write each page to a JSON file along the save path
github_api_iterate_pages(token, gh_response,
                         save_path_issue_or_pr_comments,
                         prefix = "issue",
                         verbose = TRUE)
all_issue_or_pr_comments <- lapply(list.files(save_path_issue_or_pr_comments, full.names = TRUE), read_json)
all_issue_or_pr_comments <- lapply(all_issue_or_pr_comments, github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- rbindlist(all_issue_or_pr_comments, fill = TRUE)

head(all_issue_or_pr_comments, 2) %>%
  gt(auto_align = FALSE)
Similar to the issue refresh, this chunk downloads comments that have been created or updated since the most recent created date among the data already downloaded. This allows us to 'refresh' the comments, or to continue downloading if a rate limit was reached.
This function relies on the file naming convention the downloaders use when saving data to perform this operation. For details, see the function documentation.
Because the endpoint this function relies on filters by the updated timestamp, running the refresher downloads the most recent version of each changed comment; only the latest version is downloaded, not every revision. However, if the same comment is modified again before the next refresh call, executing the refresher again will result in two comments with the same comment id being present in the table. This can be addressed by grouping the parsed table by comment_id and keeping the row with the maximum updated_at, resulting in a table that contains only the most recent version of each comment as of the latest time the refresher was executed.
# API call bounded by the most recent updated date already downloaded
gh_response_issue_or_pr_comment <- github_api_project_issue_or_pr_comment_refresh(owner,
                                                                                  repo,
                                                                                  token,
                                                                                  save_path_issue_or_pr_comments,
                                                                                  verbose = TRUE)

# dir.create(save_path_issue_or_pr_comments)

# Iterate over the remaining pages and write each to a JSON file
github_api_iterate_pages(token, gh_response_issue_or_pr_comment,
                         save_path_issue_or_pr_comments,
                         prefix = "issue_or_pr_comment",
                         verbose = TRUE)
all_issue_or_pr_comments <- lapply(list.files(save_path_issue_or_pr_comments, full.names = TRUE), read_json)
all_issue_or_pr_comments <- lapply(all_issue_or_pr_comments, github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- rbindlist(all_issue_or_pr_comments, fill = TRUE)

head(all_issue_or_pr_comments, 2) %>%
  gt(auto_align = FALSE)
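The group by described above could be performed with data.table. The following is a minimal sketch, assuming the parsed table contains comment_id and updated_at columns (the actual column names depend on the parser output):

# Keep only the most recent version of each comment id.
# Assumes `comment_id` and `updated_at` columns exist; ISO 8601 timestamps
# sort correctly as strings, so order() on the raw values is sufficient.
all_issue_or_pr_comments <- all_issue_or_pr_comments[
  order(updated_at, decreasing = TRUE)
][, .SD[1], by = "comment_id"]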
The following API endpoints do not currently implement the refresher, but may provide useful information.
The three endpoints used above do not contain author name and e-mail information, only the developers' GitHub IDs. This is a problem if the project being studied contains communication data outside the GitHub ecosystem (e.g. it uses JIRA as its issue tracker). In order to link developers across other sources, we need both author name and e-mail information.
To do so, we can use the commits endpoint.
gh_response <- github_api_project_commits(owner, repo, token)
dir.create(save_path_commit)
github_api_iterate_pages(token, gh_response,
                         save_path_commit,
                         prefix = "commit",
                         verbose = TRUE)
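As an illustration of what the raw commit JSON contains, the following minimal sketch extracts the GitHub login, author name, and author e-mail directly from the saved files. The field names follow the GitHub commits API; this is not Kaiaulu's own parser, and it assumes each saved file holds one page (a JSON array) of commits:

# Sketch: extract login, name, and e-mail from the raw commit JSON pages.
commit_files <- list.files(save_path_commit, full.names = TRUE)
raw_commits <- unlist(lapply(commit_files, read_json), recursive = FALSE)

commit_authors <- rbindlist(lapply(raw_commits, function(commit) {
  data.table(
    # The top-level "author" object is NULL when GitHub cannot match a user.
    github_login = if (!is.null(commit[["author"]])) commit[["author"]][["login"]] else NA_character_,
    author_name  = commit[["commit"]][["author"]][["name"]],
    author_email = commit[["commit"]][["author"]][["email"]]
  )
}), fill = TRUE)

head(unique(commit_authors))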
The refresh search endpoint does not include pull requests. If the intent is to download both issues and pull requests, then the issues endpoint should be used instead:
gh_response <- github_api_project_issue(owner, repo, token)
dir.create(save_path_issue)
github_api_iterate_pages(token, gh_response,
                         save_path_issue,
                         prefix = "issue")
all_issue_or_pr <- lapply(list.files(save_path_issue, full.names = TRUE), read_json)
all_issue_or_pr <- lapply(all_issue_or_pr, github_parse_project_issue_or_pr_comments)
all_issue_or_pr <- rbindlist(all_issue_or_pr, fill = TRUE)

head(all_issue_or_pr, 2) %>%
  gt(auto_align = FALSE)
Similarly to the Issue API, we can also obtain other metadata from pull requests, including their labels.
# save_path_pull_request is already defined above from the project configuration.
# save_path_pull_request <- paste0(save_path, "/pull_request/")
gh_response <- github_api_project_pull_request(owner, repo, token)
dir.create(save_path_pull_request)
github_api_iterate_pages(token, gh_response,
                         save_path_pull_request,
                         prefix = "pull_request")
all_pr <- lapply(list.files(save_path_pull_request, full.names = TRUE), read_json)
all_pr <- lapply(all_pr, github_parse_project_pull_request)
all_pr <- rbindlist(all_pr, fill = TRUE)

tail(all_pr, 1) %>%
  gt(auto_align = FALSE)
If our interest is to observe all the development communication, we may regard the opening post of each issue, the opening post of each pull request, and the comments as simply "replies" in a single table.
Note that because we obtain the author and committer name and e-mail from the commit data, only comments made by developers who made at least one commit will contain their name and e-mail. That is, people who only post issues and comments will not have that information available; instead, their GitHub username will appear in the reply_from
column. Therefore, identity matching is likely not to work when no author name and e-mail is available. If you are analyzing social smells, this will not be a problem for org silo and missing link (as their conditions require code changes). However, since radio silence only considers the mailing list network, caution must be exercised.
Below we show the result of such a merge, including the name and e-mail fields obtained from the commit table. As before, we do not display the body column to prevent breaking the HTML format.
replies <- parse_github_replies(save_path_issue,
                                save_path_pull_request,
                                save_path_issue_or_pr_comments,
                                save_path_commit)

tail(replies, 2) %>%
  gt(auto_align = FALSE)
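As a quick check related to the identity caveat above, the following sketch counts how many replies appear to carry an e-mail address, assuming reply_from holds either a "Name e-mail" string or a bare GitHub username (the exact contents depend on parse_github_replies()):

# Rough check: which replies include an e-mail address in reply_from?
# Assumes reply_from holds either "Name e-mail" or a bare GitHub username.
has_email <- grepl("@", replies[["reply_from"]])
table(has_email)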