This notebook explains how to compute social smells for a given open source project. Before we begin, it is important to understand what data is required to compute social smells, and what Kaiaulu will and will not do for you.
Social smell metrics require both collaboration and communication data. Collaboration data is extracted from version control systems, while communication data can be obtained from whatever medium the project of interest uses for developer communication.
Obtaining collaboration data is relatively painless: you need only clone the repository locally. Currently Kaiaulu only supports Git, but in the future it may support other version control systems. Obtaining communication data, however, requires more effort on your part depending on the project you choose, because open source projects use a large variety of communication media and archive types.
Broadly, open source projects use mailing lists, issue tracker comments, or both for communication.
Examples of mailing list archive types are the GNU Mailing List Manager (Mailman), Google Groups, Mail Archive, Apache's MOD Mbox, FreeLists, Discourse, etc. In March 2021, even GitHub launched its own built-in communication medium, Discussions.
Examples of issue tracker types include JIRA, GitHub's built-in Issue Tracker, GitLab's Issue Tracker, Monorail (used in Google's Chromium), Trac, Bugzilla, etc. You may also be interested in including the discussion that occurs in GitHub pull requests / GitLab merge requests. Of the above, GitHub Issue Tracker comments, GitHub Pull Request comments (download_github_comments.Rmd), and JIRA comments (download_jira_data.Rmd) are currently supported. See the associated vignettes for details on how to download the data.
It is of course not viable for Kaiaulu to implement interfaces to every single archive type out there. Therefore, to calculate social smells, we expect you can obtain a .mbox representation of the mailing list of interest. This may be available from the open source project directly (e.g. Apache projects' mod_mbox (see download_mod_mbox.Rmd) and pipermail (see download_pipermail() in R/download.R)), or via a crawler someone already made. For example, gg_scraper outputs an mbox file from Google Groups mailing lists (although it can only obtain partial information of the e-mails, as Google Groups truncates them, which may limit some steps of the analysis, such as the identity matching discussed below).
The bottom line is that the effort required to obtain the mailing list data will vary depending on the open source project of interest, as projects may even transition through different archive types over time. Once you have both the git log and at least one source of communication data (e.g. mbox, JIRA, or GitHub) available on your computer, you are ready to proceed with the social smells analysis in this notebook.
Please ensure the following R packages are installed on your computer.
rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(visNetwork)
The parameters necessary for analysis are kept in a project configuration file to ensure reproducibility. In this notebook, we will use Apache's Helix open source project. Refer to the conf/ folder in Kaiaulu's git repository for the Helix and other project configuration files. It is in this configuration file that we specify where Kaiaulu can find the git log and communication sources for Helix. We also specify filters of interest here: for example, whether Kaiaulu should ignore test files, or anything that is not source code.
Within the scope of this notebook, only the first branch (top) specified in the project configuration file will be analyzed. Refer to the CLI interface if you are interested in executing this analysis over multiple branches.
We also provide the path to tools.yml. Kaiaulu does not implement all available functionality from scratch; however, it also does not expect all dependencies to be installed. Every function defined in the API expects as a parameter a filepath to the external dependency binary. tools.yml is a convenience file that stores all the binary paths, so they can be set once during setup and reused across analyses. You can find an example of tools.yml in the root directory of Kaiaulu's GitHub repo. For this notebook, you will need to install Perceval (use version 0.12.24) and OSLOM. Instructions to do so are available in the Kaiaulu README.md. Once you are finished, set the "perceval", "oslom_dir", and "oslom_undir" paths in your tools.yml.
tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/helix.yml")

scc_path <- get_tool_project("scc", tool)
oslom_dir_path <- get_tool_project("oslom_dir", tool)
oslom_undir_path <- get_tool_project("oslom_undir", tool)
perceval_path <- get_tool_project("perceval", tool)

git_repo_path <- get_git_repo_path(conf)
git_branch <- get_git_branches(conf)[1]
start_commit <- get_window_start_commit(conf)
end_commit <- get_window_end_commit(conf)
window_size <- get_window_size(conf)

mbox_path <- get_mbox_path(conf, "project_key_1")
issues_json_folder_path <- get_github_issue_path(conf, "project_key_1")
pull_requests_json_folder_path <- get_github_pull_request_path(conf, "project_key_1")
comments_json_folder_path <- get_github_issue_or_pr_comment_path(conf, "project_key_1")
commit_json_folder_path <- get_github_commit_path(conf, "project_key_1")
jira_issue_comments_path <- get_jira_issues_comments_path(conf, "project_key_1")

# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)
The remainder of this notebook does not require modifications. If you encounter an error in any code block below, chances are one or more parameters above have been specified incorrectly, or the project of choice has led to an outlier case. Please open an issue if you encounter an error, or, if unsure, post in the Discussions section of Kaiaulu's GitHub. E-mailing bugs is discouraged, as they are hard to track.
As stated in the introduction, we need both the git log and at least one communication source (here named replies) to compute social smells. Therefore, the first step is to parse the raw data.
To get started, we use the parse_gitlog function to extract a table from the git log. You can examine the project_git variable to see what information is available from the git log.
git_checkout(git_branch, git_repo_path)
project_git <- parse_gitlog(perceval_path, git_repo_path)
project_git <- project_git %>%
  filter_by_file_extension(file_extensions, "file_pathname") %>%
  filter_by_filepath_substring(substring_filepath, "file_pathname")
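As a quick sanity check, you can preview the parsed table directly; this is just one possible inspection (any of str(), names(), or head() works), and the columns shown below are ones this notebook relies on later:

# Preview the git log columns used later in this notebook
head(project_git[, .(commit_hash, author_name_email, author_datetimetz, file_pathname)])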
Next, we parse the various communication channels the project uses. Similarly to parse_gitlog, the returned object is a table, which we can inspect directly in R to see what information is available.
We also have to parse and normalize the timezones across the different data sources. Since one of the social metrics in the quality framework is the count of different timezones, we separate the timezone information before normalizing the timestamps to UTC (for git, the timezone offset, e.g. "-0700", is the sixth whitespace-separated token of the timestamp).
project_git$author_tz <- sapply(stringi::stri_split(project_git$author_datetimetz, regex=" "), "[[", 6)
project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz,
                                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

project_git$committer_tz <- sapply(stringi::stri_split(project_git$committer_datetimetz, regex=" "), "[[", 6)
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz,
                                               format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
We apply the same logic to the available communication channels. Not all of the communication channels may be in use in a given project, but you will want to ensure you accurately select all the development communication channels used in the project. You can usually find a "How to contribute" page on the project website, which specifies the sources in use.
Remember: Kaiaulu will not throw errors if you omit relevant sources of developer communication; instead, the computed smells will be higher than they should be (developers will be deemed not to have communicated simply because the source where they communicate was not included in the analysis!).
project_mbox <- NULL
project_jira <- NULL
project_github_replies <- NULL

if(!is.null(mbox_path)){
  project_mbox <- parse_mbox(perceval_path, mbox_path)
  project_mbox$reply_tz <- sapply(stringi::stri_split(project_mbox$reply_datetimetz, regex=" "), "[[", 6)
  project_mbox$reply_datetimetz <- as.POSIXct(project_mbox$reply_datetimetz,
                                              format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")
}
if(!is.null(jira_issue_comments_path)){
  project_jira <- parse_jira_replies(parse_jira(jira_issue_comments_path))
  # Timezone is embedded in a separate field. All times shown in UTC.
  project_jira$reply_tz <- "0000"
  project_jira$reply_datetimetz <- as.POSIXct(project_jira$reply_datetimetz,
                                              format = "%Y-%m-%dT%H:%M:%S.000+0000", tz = "UTC")
}
github_reply_paths <- !is.null(issues_json_folder_path) &
                      !is.null(pull_requests_json_folder_path) &
                      !is.null(comments_json_folder_path) &
                      !is.null(commit_json_folder_path)
if(github_reply_paths){
  project_github_replies <- parse_github_replies(issues_json_folder_path,
                                                 pull_requests_json_folder_path,
                                                 comments_json_folder_path,
                                                 commit_json_folder_path)
  # Timezone is not available in GitHub timestamps; all in UTC.
  project_github_replies$reply_tz <- "0000"
  project_github_replies$reply_datetimetz <- as.POSIXct(project_github_replies$reply_datetimetz,
                                                        format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
}
We then combine the included reply types into a single table. While we sort both tables here for clarity should they be explored, sorting is not assumed in any of the remaining analysis.
# All replies are combined into a single reply table.
project_reply <- rbind(project_mbox, project_jira, project_github_replies)
project_git <- project_git[order(author_datetimetz)]
project_reply <- project_reply[order(reply_datetimetz)]
#project_reply <- project_reply[reply_datetimetz >= start_date & reply_datetimetz <= end_date]
Having parsed both git log and replies, we are ready to start computing the social smells. Social smells are computed at a "time window" granularity. For example, we may ask "between January 2020 and April 2020, how many organizational silos are identified in Helix?". This means we will inspect both the git log and mailing list for the associated time period, perform the necessary transformations on the data, and compute the number of organizational silos.
So we begin by specifying how large our time window should be, in days:
window_size <- window_size # 90 days, as set in the project configuration file
Kaiaulu will then apply a non-overlapping time window over every 90 days (roughly 3 months) of git log and mailing list history to identify the number of organizational silo, missing link, and radio silence social smells.
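For intuition, the window boundaries are simply a sequence of dates spaced window_size days apart; the analysis loop below computes them the same way with seq.POSIXt. A minimal sketch using the demonstration dates adopted later in this notebook:

start_date <- as.POSIXct("2012-10-17 18:19:46", tz = "UTC")
end_date <- as.POSIXct("2013-02-17 18:19:46", tz = "UTC")
# Non-overlapping 90-day boundaries; a trailing incomplete window is discarded.
boundaries <- seq.POSIXt(from = start_date, to = end_date, by = "90 day")
# Consecutive boundary pairs [i, i+1) delimit each analysis slice.
boundaries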
We "slice" the git log and mailing list tables parsed earlier in window_size
chunks, and iterate in a for loop on each "slice".
Within a slice we do the following:
There is a large variety of customization available in the above 5 steps. We discuss it briefly here, as these choices directly impact the quality of your results.
It is very common for authors to have multiple names and e-mails within and across the git log, mailing lists, issue trackers, etc. There is no perfect way to identify all identities of an individual, only heuristics. Kaiaulu has a large number of unit tests, each capturing a different example of how people use their e-mails. However, this is not magic. It is possible, and has been encountered before, that tooling used in an open source project entirely compromises the analysis if done blindly. For example, a common heuristic for identity matching is to consider two accounts to be the same identity if either the first+last name OR the e-mail is an exact match. In one open source project we analyzed, a piece of software masked all users with commit access as "admin@project.org". Using this heuristic blindly would therefore have compromised the entire dataset.
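To make the pitfall concrete, here is a toy sketch (not Kaiaulu's implementation; the accounts are hypothetical) of the name-OR-e-mail exact-match heuristic, showing how a shared e-mail such as "admin@project.org" chains distinct developers into a single identity:

library(data.table)

# Three accounts, two real people; a bot masks committers as admin@project.org.
accounts <- data.table(name  = c("Jane Doe", "Jane Doe", "John Smith"),
                       email = c("jane@corp.com", "admin@project.org", "admin@project.org"))

# Exact e-mail match merges accounts 2 and 3 (two different people!), while
# exact name match merges accounts 1 and 2: under the name-OR-e-mail heuristic
# all three accounts end up chained into one identity.
accounts[, email_group := .GRP, by = email]
accounts[, name_group := .GRP, by = name]
print(accounts)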
Using the code below, you can manually inspect in R the identities assigned to the various users in project_git and project_reply. I strongly encourage you to do so. It is also possible to configure the identity_match() function to consider only names, instead of names and e-mails, to avert cases like the example above. If you find an edge case where the identity is incorrectly assigned, please open an issue so we can add the edge case. You may also manually correct the identity numbers before executing the remaining code blocks, to improve the accuracy of the results.
At this point, it is also important to consider how the author name and e-mail (if available) are obtained from the various sources. In particular, Kaiaulu can currently only obtain name data from JIRA. For GitHub issue and pull request comments, only users who have at least one commit will have a name and e-mail (otherwise the user name will be their GitHub ID). These limitations stem directly from the sources' respective APIs. If you believe additional information can be obtained, please open an issue with suggestions!
# Identity matching
project_log <- list(project_git = project_git, project_reply = project_reply)
project_log <- identity_match(project_log,
                              name_column = c("author_name_email", "reply_from"),
                              assign_exact_identity,
                              use_name_only = TRUE,
                              label = "raw_name")
project_git <- project_log[["project_git"]]
project_reply <- project_log[["project_reply"]]
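To inspect the assigned identities, you can list the raw names mapped to each id. This is a minimal sketch, assuming identity_match() added the identity_id column that the analysis loop below also relies on:

# Raw author names grouped by the identity id assigned above
git_identities <- unique(project_git[, .(identity_id, author_name_email)])
reply_identities <- unique(project_reply[, .(identity_id, reply_from)])
# Identities that merged more than one raw name deserve a manual look
merged <- git_identities[, .N, by = identity_id][N > 1]
git_identities[identity_id %in% merged$identity_id][order(identity_id)]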
Remember: social smells rely heavily on patterns of collaboration and communication. If the identities are poorly assigned, the social smells will not correctly reflect the project status (since, in essence, several people considered to be communicating with one another are actually the same individual!).
As mentioned in the introduction, there are multiple types of mailing list archives out there, and depending on the project it may be more sensible to use an issue tracker instead of a mailing list, or a combination of both. Beyond data types, there are also different transformations that can be applied when turning the data into networks. This notebook implements the bipartite transformation (see parse_gitlog_network()). It is also possible to use a temporal transformation (see parse_gitlog_temporal_network()). The choice of transformation impacts the direction and overall type of network that will be generated, so it is important you understand how this impacts your research conclusions. A similar transformation could be applied to mailing lists, but it is not yet implemented. Because we use the bipartite transformation in the code block below, we also perform a bipartite projection. These are well-known operations in graph theory which also impact the interpretability of the results.
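As a small illustration of the projection operation itself (a toy igraph example with made-up node names, independent of Kaiaulu's API): two authors who modified the same file become directly connected in the projected author-author network.

library(igraph)

# Toy author-file bipartite network: a1 and a2 both touch f1; a3 touches f2.
g <- graph_from_edgelist(cbind(c("a1", "a2", "a3"),
                               c("f1", "f1", "f2")), directed = FALSE)
V(g)$type <- grepl("^f", V(g)$name)  # TRUE = file node, FALSE = author node

# Projection onto authors: a1 -- a2 (shared file f1); a3 ends up isolated.
author_projection <- bipartite_projection(g)$proj1
print(author_projection)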
Another choice is whether the analysis should be done on files or on entities (e.g. functions). See parse_gitlog_entity_network() and parse_gitlog_entity_temporal_network() for entities. You may choose functions, classes, and other more specific types of entities depending on the language of interest (e.g. typedef structs for C).
A third choice we make here is whether the collaboration being analyzed is by authors or committers. Normally an open source project has both. In the code block below, we analyze authors. If you are interested in committers, or potentially their interaction, see the available parameters of parse_gitlog().
For some social smells, such as radio silence and primma donna, community detection must be applied to the constructed networks. Do consider the implications of the algorithm chosen below for your results. We will use a sample of the data here for demonstration instead of the full dataset:
# Define all timestamps in number of days since the very first commit of the repo.
# Note here the start_date and end_date are with respect to the git log.
# Transform commit hashes into datetimes so window_size can be used.
#start_date <- get_date_from_commit_hash(project_git, start_commit)
#end_date <- get_date_from_commit_hash(project_git, end_commit)
start_date <- as.POSIXct("2012-10-17 18:19:46", tz = "UTC")
end_date <- as.POSIXct("2013-02-17 18:19:46", tz = "UTC")
datetimes <- project_git$author_datetimetz
reply_datetimes <- project_reply$reply_datetimetz

# Format time window for POSIXt
window_size_f <- stringi::stri_c(window_size, " day")

# Note if end_date is not (and will likely not be) a multiple of window_size,
# then the ending incomplete window is discarded so the metrics are not calculated
# in a smaller interval.
time_window <- seq.POSIXt(from = start_date, to = end_date, by = window_size_f)

# Create a list where each element is the social smells calculated for a given commit interval.
smells <- list()
size_time_window <- length(time_window)
for(j in 2:size_time_window){
  # Initialize
  commit_interval <- NA
  start_day <- NA
  end_day <- NA
  org_silo <- NA
  missing_links <- NA
  radio_silence <- NA
  primma_donna <- NA
  st_congruence <- NA
  communicability <- NA
  num_tz <- NA
  code_only_devs <- NA
  code_files <- NA
  ml_only_devs <- NA
  ml_threads <- NA
  code_ml_both_devs <- NA

  i <- j - 1

  # If the time window is of size 1, then there have been fewer than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Note: The start and end commits in your project config file should be set so
  # that the dates cover overlapping date ranges in both project_git_slice and
  # project_reply_slice. Double-check your project_git and project_reply to
  # ensure this is the case if an error arises.

  # Obtain all commits from the git log which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                   (author_datetimetz < end_day)]

  # Obtain all e-mail posts from the replies which are within a particular window_size
  project_reply_slice <- project_reply[(reply_datetimetz >= start_day) &
                                       (reply_datetimetz < end_day)]

  # Check if slices contain data
  gitlog_exist <- (nrow(project_git_slice) != 0)
  ml_exist <- (nrow(project_reply_slice) != 0)

  # Create Networks
  if(gitlog_exist){
    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash
    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash

    # Parse network edgelists from the extracted data
    network_git_slice <- transform_gitlog_to_bipartite_network(project_git_slice,
                                                               mode = "author-file")

    # Community smell functions are defined based on the projection networks of
    # dev-thread => dev-dev and dev-file => dev-dev. This creates both dev-dev
    # networks via graph projections.
    git_network_authors <- bipartite_graph_projection(network_git_slice,
                                                      mode = TRUE,
                                                      weight_scheme_function = weight_scheme_sum_edges)
    code_clusters <- community_oslom(oslom_undir_path,
                                     git_network_authors,
                                     seed = seed,
                                     n_runs = 1000,
                                     is_weighted = TRUE)
  }
  if(ml_exist){
    network_reply_slice <- transform_reply_to_bipartite_network(project_reply_slice)
    reply_network_authors <- bipartite_graph_projection(network_reply_slice,
                                                        mode = TRUE,
                                                        weight_scheme_function = weight_scheme_sum_edges)
    # Community Detection
    mail_clusters <- community_oslom(oslom_undir_path,
                                     reply_network_authors,
                                     seed = seed,
                                     n_runs = 1000,
                                     is_weighted = TRUE)
  }

  # Metrics
  if(gitlog_exist){
    commit_interval <- stri_c(i_commit_hash, "-", j_commit_hash)

    # Social Network Metrics
    code_only_devs <- length(unique(project_git_slice$identity_id))
    code_files <- length(unique(project_git_slice$file_pathname))
  }
  if(ml_exist){
    # Smell
    radio_silence <- length(smell_radio_silence(mail.graph = reply_network_authors,
                                                clusters = mail_clusters))
    # Socio-Technical Metrics
    ml_only_devs <- length(unique(project_reply_slice$identity_id))
    ml_threads <- length(unique(project_reply_slice$reply_subject))
  }
  if(ml_exist & gitlog_exist){
    # Smells
    org_silo <- length(smell_organizational_silo(mail.graph = reply_network_authors,
                                                 code.graph = git_network_authors))
    missing_links <- length(smell_missing_links(mail.graph = reply_network_authors,
                                                code.graph = git_network_authors))
    # Socio-Technical Metrics
    st_congruence <- smell_sociotechnical_congruence(mail.graph = reply_network_authors,
                                                     code.graph = git_network_authors)
    # communicability <- community_metric_mean_communicability(reply_network_authors, git_network_authors)
    num_tz <- length(unique(c(project_git_slice$author_tz,
                              project_git_slice$committer_tz,
                              project_reply_slice$reply_tz)))
    code_ml_both_devs <- length(intersect(unique(project_git_slice$identity_id),
                                          unique(project_reply_slice$identity_id)))
  }

  # Aggregate Metrics
  smells[[stringi::stri_c(start_day, "|", end_day)]] <- data.table(commit_interval,
                                                                   start_datetime = start_day,
                                                                   end_datetime = end_day,
                                                                   org_silo,
                                                                   missing_links,
                                                                   radio_silence,
                                                                   #primma_donna,
                                                                   st_congruence,
                                                                   #communicability,
                                                                   num_tz,
                                                                   code_only_devs,
                                                                   code_files,
                                                                   ml_only_devs,
                                                                   ml_threads,
                                                                   code_ml_both_devs)
}
smells_interval <- rbindlist(smells)
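Before the complementary metrics computed further below are merged in, you can already preview the per-window smell counts:

kable(head(smells_interval))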
This shows the last loop slice:
project_collaboration_network <- recolor_network_by_community(git_network_authors, code_clusters)

gcid <- igraph::graph_from_data_frame(d = project_collaboration_network[["edgelist"]],
                                      directed = FALSE,
                                      vertices = project_collaboration_network[["nodes"]])
visIgraph(gcid, randomSeed = 1)
It may appear counter-intuitive that the 5 connected nodes are not considered a single community; however, recall that the algorithm also considers the weights among the nodes. In this case, the algorithm did not deem the weights sufficiently high to consider them a community.
You can also observe the identity match algorithm in action and its potential implications (different identities matched to the same author are separated by "|"). Had it not performed as intended, single nodes would appear separately, and very likely connected, thus biasing the social metrics.
project_collaboration_network <- recolor_network_by_community(reply_network_authors, mail_clusters)

gcid <- igraph::graph_from_data_frame(d = project_collaboration_network[["edgelist"]],
                                      directed = FALSE,
                                      vertices = project_collaboration_network[["nodes"]])
visIgraph(gcid, randomSeed = 1)
The remainder of this notebook does not compute any social smells. It provides some popular metrics commonly reported in the software engineering literature, which may be useful when interpreting the social smells. While their natural granularity is not the "time window" level, they are computed per window here so they can be aggregated and placed in the same table as the social smells.
churn <- list()
for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there have been fewer than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Obtain all commits from the git log which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                   (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){
    # The start and end commits
    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash
    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash

    # The start and end datetime
    #start_datetime <- first(project_git_slice)$author_datetimetz
    #end_datetime <- last(project_git_slice)$author_datetimetz

    commit_interval <- stri_c(i_commit_hash, "-", j_commit_hash)

    churn[[commit_interval]] <- data.table(commit_interval,
                                           # start_datetime,
                                           # end_datetime,
                                           churn = metric_churn_per_commit_interval(project_git_slice),
                                           n_commits = length(unique(project_git_slice$commit_hash)))
  }
}
churn_interval <- rbindlist(churn)
time_window <- seq.POSIXt(from = start_date, to = end_date, by = window_size_f)
#time_window <- seq(from=start_daydiff, to=end_daydiff, by=window_size)

line_metrics <- list()
for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there have been fewer than "window_size_f"
  # days from the start date.
  if(length(time_window) == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j]
  }

  # Obtain all commits from the git log which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) &
                                   (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){
    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash
    # Use the ending hash of that window_size to calculate the flaws
    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz, na.rm = TRUE)])$commit_hash

    # Checkout the commit of interest
    git_checkout(j_commit_hash, git_repo_path)

    # Run line metrics against the checked-out commit
    commit_interval <- stri_c(i_commit_hash, "-", j_commit_hash)
    line_metrics[[commit_interval]] <- parse_line_metrics(scc_path, git_repo_path)
    line_metrics[[commit_interval]]$commit_interval <- commit_interval
    line_metrics[[commit_interval]]$git_checkout <- j_commit_hash
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]][, .(commit_interval,
                                                                           git_checkout,
                                                                           Location,
                                                                           Lines,
                                                                           Code,
                                                                           Comments,
                                                                           Blanks,
                                                                           Complexity)]
    # Filter Files
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]] %>%
      filter_by_file_extension(file_extensions, "Location") %>%
      filter_by_filepath_substring(substring_filepath, "Location")
  }
}
# Reset Repo to HEAD
git_checkout(git_branch, git_repo_path)

line_metrics_file <- rbindlist(line_metrics)
line_metrics_interval <- line_metrics_file[, .(Lines = sum(Lines),
                                               Code = sum(Code),
                                               Comments = sum(Comments),
                                               Blanks = sum(Blanks),
                                               Complexity = sum(Complexity)),
                                           by = c("commit_interval", "git_checkout")]
dt <- merge(smells_interval, churn_interval, by = "commit_interval")
dt <- merge(dt, line_metrics_interval, by = "commit_interval")
kable(dt)