Introduction

This notebook explains how to compute social smells for a given open source project. Before we begin, it is important to understand what data is required to compute social smells, what Kaiaulu will do and also what will not do for you.

Social smell metrics requires both collaboration and communication data. Collaboration data is extracted from version control systems, while communication data can be obtained from whatever the project of interest uses for developers to communicate.

Obtaining collaboration data is relatively painless: You need only clone the version control system locally. Currently Kaiaulu only supports analysis in Git, but in the future it may support other version control systems. Obtaining communication data, however, requires more effort on your part depending on the project you choose. This is because there is a large variety of communication mediums and archive types open source projects use.

Broadly, open source projects use mailing lists, issue tracker (comments), or both for communication.

Example of mailing list archive types are GNU Mailing List Manager, Google Groups, Mail Archive, Apache's MOD Mbox, Free Lists, Discourse, etc. On March 2021, even GitHub launched its own built-in communication medium, Discussions.

Examples of issue tracker types include JIRA, GitHub's built-in Issue Tracker, GitLab's Issue Tracker, Monorail (used in Google's Chromium), Trac, Bugzilla, etc. You may also be interested in including discussion that occurs in GitHub's pull requests / GitLab merge requests. Of the above, GitHub Issue Tracker comments, GitHub Pull Requests comments (download_github_comments.Rmd), and JIRA comments (download_jira_data.Rmd) are currently supported. See the associated vignettes for details on how to download the data.

It is of course not viable for Kaiaulu to implement interfaces to every single archive type out there. Therefore, to calculate social smells, we expect you can obtain a .mbox representation of the mailing list of interest. This may be available from the open source project directly (e.g. Apache projects mod_mbox (see download_mod_mbox.Rmd) and pipermail (see the R/download.R/download_pipermail()), or via a crawler someone already made. For example, gg_scraper outputs a mbox file from Google Group mailing lists (although it can only obtain partial information of the e-mails, as Google Groups truncate them, which may pose limitations to some steps of the analysis of the identity matching discussed below).

The bottom line is, the required effort to obtain the mailing list data will vary depending on the open source project of interest, as open source projects may even transition over time through different archive types. Once you have available in your computer both git log, and at least one source of communication data (e.g. mbox, jira, or github), you are ready to proceed with the social smells analysis of this notebook.

Libraries

Please ensure the following R packages are installed on your computer.

rm(list = ls())
seed <- 1
set.seed(seed)

require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(visNetwork)

Project Configuration File (Parameters Needed)

The parameters necessary for analysis are kept in a project configuration file to ensure reproducibility. In this project, we will use the Apache's Helix open source project. Refer to the conf/ folder on Kaiaulu's git repository for Helix and other project configuration files. It is in this project configuration file we specify where Kaiaulu can find the git log and communication sources from Helix. We also specify here filters of interest: For example, if Kaiaulu should ignore test files, or anything that is not source code.

At the scope of this notebook, only the first branch (top) specified in the project configuration file will be analyzed. Refer to the CLI interface if you are interested in executing this analysis over multiple branches.

We also provide the path for tools.yml. Kaiaulu does not implement all available functionality from scratch. Conversely, it will also not expect all dependencies to be installed. Every function defined in the API expects as parameter a filepath to the external dependency binary. Tools.yml is a convenience file that stores all the binary paths, so it can be set once during setup and reused multiple times for analysis. You can find an example of tools.yml on the github repo from Kaiaulu root directory. For this notebook, you will need to install Perceval (use version 0.12.24) and OSLOM. Instructions to do so are available in the Kaiaulu README.md. Once you are finished, set the "perceval," "oslom_dir," and "oslom_undir" paths in your tools.yml.

tools_path <- "../tools.yml"
conf_path <- "../conf/helix.yml"

tool <- yaml::read_yaml(tools_path)
scc_path <- tool[["scc"]]

oslom_dir_path <- tool[["oslom_dir"]]
oslom_undir_path <- tool[["oslom_undir"]]

conf <- yaml::read_yaml(conf_path)

perceval_path <- tool[["perceval"]]
git_repo_path <- conf[["version_control"]][["log"]]
git_branch <- conf[["version_control"]][["branch"]][1]

start_commit <- conf[["analysis"]][["window"]][["start_commit"]]
end_commit <- conf[["analysis"]][["window"]][["end_commit"]]
window_size <- conf[["analysis"]][["window"]][["size_days"]]

mbox_path <- conf[["mailing_list"]][["mbox"]]
github_replies_path <- conf[["issue_tracker"]][["github"]][["replies"]]
jira_issue_comments_path <- conf[["issue_tracker"]][["jira"]][["issue_comments"]]




# Filters
file_extensions <- conf[["filter"]][["keep_filepaths_ending_with"]]
substring_filepath <- conf[["filter"]][["remove_filepaths_containing"]]

The remainder of this notebook does not require modifications. If you encounter an error in any code block below, chances are one or more parameters above have been specified incorrectly, or the project of choice may have led to an outlier case. Please open an issue if you encounter an error, or if not sure post on discussions in Kaiaulu's GitHub. E-mailing bugs is discouraged as it is hard to track.

Parsing Input Data

As stated in the introduction, we need both git log and at least one communication source (here named replies) to compute social smells. Therefore, the first step is to parse the raw data.

Parse Gitlog

To get started, we use the parse_gitlog function to extract a table from the git log. You can inspect the project_git variable to inspect what information is available from the git log.

git_checkout(git_branch,git_repo_path)
project_git <- parse_gitlog(perceval_path,git_repo_path)
project_git <- project_git  %>%
  filter_by_file_extension(file_extensions,"file_pathname")  %>% 
  filter_by_filepath_substring(substring_filepath,"file_pathname")

Parse Replies

Next, we parse the various communication channels the project use. Similarly to parse_gitlog, the returned object is a table, which we can inspect directly in R to see what information is available.

We also have to parse and normalize the timezone across the different projects. Since one of the social metrics in the quality framework is the count of different timezones, we separate the timezone information before normalizing them.

project_git$author_tz <- sapply(stringi::stri_split(project_git$author_datetimetz,
                                                          regex=" "),"[[",6)
project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz,
                                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")


project_git$committer_tz <- sapply(stringi::stri_split(project_git$committer_datetimetz,
                                                          regex=" "),"[[",6)
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz,
                                            format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

We apply the same logic above to the communication channels available. Not all of the communication channels may be in use in a given project, but you will want to ensure you accurately chose all the development communication used in a project. You can usually find a "How to contribute" page on the project website, which specifies the used sources.

Remember: Kaiaulu will not throw errors if you omit relevant sources of developer communication, instead the computed smells will be higher than what they should (as developers will deem to have not communicated because the source where they communicate was simply not included in the analysis!).

project_mbox <- NULL
project_jira <- NULL
project_github_replies <- NULL



if(!is.null(mbox_path)){
  project_mbox <- parse_mbox(perceval_path,mbox_path)  

  project_mbox$reply_tz <- sapply(stringi::stri_split(project_git$reply_datetimetz,
                                                            regex=" "),"[[",6)

  project_mbox$reply_datetimetz <- as.POSIXct(project_mbox$reply_datetimetz,
                                        format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")


}
if(!is.null(jira_issue_comments_path)){
  project_jira <- parse_jira_replies(parse_jira(jira_issue_comments_path)) 

  # Timezone is embedded on separated field. All times shown in UTC.
  project_jira$reply_tz <- "0000"

  project_jira$reply_datetimetz <- as.POSIXct(project_jira$reply_datetimetz,
                                        format = "%Y-%m-%dT%H:%M:%S.000+0000", tz = "UTC")
}
if(!is.null(github_replies_path)){
  project_github_replies <- parse_github_replies(github_replies_path)  


  # Timezone is not available on GitHub timestamp, all in UTC
  project_github_replies$reply_tz <- "0000"

  project_github_replies$reply_datetimetz <- as.POSIXct(project_github_replies$reply_datetimetz,
                                        format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

}

For the included reply types, we combine them. While we sort both tables here for clarity if the tables are explored, sorting is not assumed in any of the remaining analysis.

# All replies are combined into a single reply table. 
project_reply <- rbind(project_mbox,
                      project_jira,
                      project_github_replies)

project_git <- project_git[order(author_datetimetz)]

project_reply <- project_reply[order(reply_datetimetz)]

#project_reply <- project_reply[reply_datetimetz >= start_date & reply_datetimetz <= end_date]

Smells

Having parsed both git log and replies, we are ready to start computing the social smells. Social smells are computed on a "time window" granularity. For example, we may ask "between January 2020 and April 2020, how many organizational silos are identified in Helix?". This means we will inspect both the git log and mailing list for the associated time period, perform the necessary transformations in the data, and compute the number of organizational silos.

So we begin by specifying how large our time window should be, in days:

window_size <- window_size # 90 days

Kaiaulu will then perform a non-overlapping time window of every 3 months of git log history and mailing list to identify the number of organizational silos, missing link and radio silence social smells.

We "slice" the git log and mailing list tables parsed earlier in window_size chunks, and iterate in a for loop on each "slice".

Within a slice we do the following:

  1. Apply identity matching (the same person may have multiple identities) inter and intra sources.
  2. Construct a network representation of both the git log and the mailing list.
  3. Apply, if necessary, a projection of the networks (depending on the network function used)
  4. We apply community detection algorithms, necessary to calculate some of the social smells
  5. We compute the social smells

There is a large variety of customization in the above 5 steps. We will discuss them briefly here, as they directly impact the quality of your results.

Identity Matching

It is very common authors have multiple names and e-mails within and across git log, mailing lists, issue trackers etc. There is no perfect way to identify all identities of an individual, only heuristics. Kaiaulu has a large number of unit tests, each capturing a different example of how people use their e-mails. However, this is not magic. It is possible, and it has been encountered before, cases where a software is used in an open source project that may entirely compromise the analysis if done blind. For example, a common heuristic used for identity matching is to consider two accounts to have the same identity if either the First+Last name OR E-mail have an exact match. In one open source project we analyzed, we found a software was being used that masked all users with commit access as "admin@project.org". Using this heuristic would have, therefore, compromised the entire data.

Using the code below, you can manually inspect project_git and project_reply in R the assigned identity to the various users. I strongly encourage you to do so. It is also possible to specify the identity_match() function to consider only names, instead of name and e-mails, to avert the example above. If you find a edge case where the identity is incorrectly assigned, please open an issue so we can add the edge case. You may also manually correct the identity numbers, before executing the remaining code blocks, to improve the accuracy of the results.

At this point, it is also important to consider how the author and e-mail (if available) are obtained on the various sources. In particular, currently Kaiaulu can only obtain name data from JIRA. For GitHub and Pull Request comments, only users who had at least one commit will contain name and e-mail (otherwise the user name will be their github ID). These limitations are due directly to limitations of the sources respective API. If you believe additional information can be obtained, please open an issue with suggestions!

#Identity matching
project_log <- list(project_git=project_git,project_reply=project_reply)
project_log <- identity_match(project_log,
                                name_column = c("author_name_email","reply_from"),
                                assign_exact_identity,
                                use_name_only=TRUE,
                                label = "raw_name")

project_git <- project_log[["project_git"]]
project_reply <- project_log[["project_reply"]]

Remember: Social smells rely heavily on patterns of collaboration and communication. If the identities are poorly assigned, the social smells will not reflect correctly the project status (since in essence several people considered to be communicating with one another, are the same individual!).

Constructing a Network Representation

As mentioned in the introduction, there are multiple types of mailing list archives out there, and it may be more sensible to use an issue tracker instead of a mailing list, or a combination of both depending on the project. Besides data types, there are also different types of transformations that can be done, when we transform the data in networks. This notebook implements the bipartite transformation (see parse_gitlog_network()). It is also possible to use a temporal transformation (see parse_gitlog_temporal_network()). The choice of transformation impacts the direction and overall type of network that will be generated, so it is important you understand how this impact your research conclusions. A similar transformation could be applied to mailing lists, but it is not yet implemented. Because we use bipartite in the code block below, we also perform a bipartite projection. These are well known operations in graph theory which also impact the interpretability of the results.

Another transformation that you can choose is whether the analysis should be done on files, or entities (e.g. functions). See parse_gitlog_entity_network() and parse_gitlog_entity_temporal_network() for entities. You may choose functions, classes, and other more specific types of entities depending on the language of interest (e.g. typedef structs for C).

A third choice we make here is whether the collaboration being analyzed is done by authors or committers. Normally a open source project has both. In the code block below, we analyze authors. If you are interested in committers, or potentially their interaction, see the available parameters of parse_gitlog().

Community Detection

For some social smells, such as radio silence and primma donna, community detection is required to be applied to the constructed networks. Do consider the implications of the one chosen below in your results.

# Define all timestamp in number of days since the very first commit of the repo 
# Note here the start_date and end_date are in respect to the git log.

# Transform commit hashes into datetime so window_size can be used
start_date <- get_date_from_commit_hash(project_git,start_commit)
end_date <- get_date_from_commit_hash(project_git,end_commit)
datetimes <- project_git$author_datetimetz
reply_datetimes <- project_reply$reply_datetimetz

# Format time window for posixT
window_size_f <- stringi::stri_c(window_size," day")

# Note if end_date is not (and will likely not be) a multiple of window_size, 
# then the ending incomplete window is discarded so the metrics are not calculated 
# in a smaller interval
time_window <- seq.POSIXt(from=start_date,to=end_date,by=window_size_f)

# Create a list where each element is the social smells calculated for a given commit hash
smells <- list()
size_time_window <- length(time_window)
for(j in 2:size_time_window){

   # Initialize
  commit_interval <- NA
  start_day <- NA
  end_day <- NA
  org_silo <- NA
  missing_links <- NA
  radio_silence <- NA
  primma_donna <- NA
  st_congruence <- NA
  communicability <- NA
  num_tz <- NA
  code_only_devs <- NA
  code_files <- NA
  ml_only_devs <- NA
  ml_threads <- NA
  code_ml_both_devs <- NA

  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window)  == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date 
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j] 
  }


  # Note: The start and end commits in your project config file should be set so 
  # that the dates cover overlapping date ranges in bothproject_git_slice and project_reply_slice dates.
  # Double-check your project_git and project_reply to ensure this is the case if an error arises.

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) & 
                     (author_datetimetz < end_day)]

  # Obtain all email posts from the reply which are within a particular window_size
  project_reply_slice <- project_reply[(reply_datetimetz >= start_day) & 
                       (reply_datetimetz < end_day)]

  # Check if slices contain data
  gitlog_exist <- (nrow(project_git_slice) != 0)
  ml_exist <- (nrow(project_reply_slice) != 0)


  # Create Networks 
  if(gitlog_exist){
    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash

    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash

    # Parse networks edgelist from extracted data
    network_git_slice <- transform_gitlog_to_bipartite_network(project_git_slice,
                                                               mode="author-file")

    # Community Smells functions are defined base of the projection networks of 
    # dev-thread => dev-dev, and dev-file => dev-dev. This creates both dev-dev via graph projections

    git_network_authors <- bipartite_graph_projection(network_git_slice,
                                                      mode = TRUE,
                                                      weight_scheme_function = weight_scheme_sum_edges)

    code_clusters <- community_oslom(oslom_undir_path,
                                     git_network_authors,
                                     seed=seed,
                                     n_runs = 1000,
                                     is_weighted = TRUE)

  }
  if(ml_exist){
    network_reply_slice <- transform_reply_to_bipartite_network(project_reply_slice)


    reply_network_authors <- bipartite_graph_projection(network_reply_slice,
                                                      mode = TRUE,
                                                      weight_scheme_function = weight_scheme_sum_edges)    

    # Community Detection

    mail_clusters <- community_oslom(oslom_undir_path,
              reply_network_authors,
              seed=seed,
              n_runs = 1000,
              is_weighted = TRUE)

  }
  # Metrics #

  if(gitlog_exist){
    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    # Social Network Metrics 
    code_only_devs <- length(unique(project_git_slice$identity_id))
    code_files <- length(unique(project_git_slice$file_pathname))

  }  
  if(ml_exist){
    # Smell

    radio_silence <- length(smell_radio_silence(mail.graph=reply_network_authors, 
                                                          clusters=mail_clusters))

    # Social Technical Metrics
    ml_only_devs <- length(unique(project_reply_slice$identity_id))
    ml_threads <- length(unique(project_reply_slice$reply_subject))
  }
  if (ml_exist & gitlog_exist){
    # Smells 
    org_silo <- length(smell_organizational_silo(mail.graph=reply_network_authors,
                                                          code.graph=git_network_authors))

    missing_links <- length(smell_missing_links(mail.graph=reply_network_authors,
                                                code.graph=git_network_authors))
    # Social Technical Metrics
    st_congruence <- smell_sociotechnical_congruence(mail.graph=reply_network_authors,
                                                     code.graph=git_network_authors)
#    communicability <- community_metric_mean_communicability(reply_network_authors,git_network_authors)
    num_tz <- length(unique(c(project_git_slice$author_tz,
                              project_git_slice$committer_tz,
                              project_reply_slice$reply_tz)))
    code_ml_both_devs <- length(intersect(unique(project_git_slice$identity_id),
                                          unique(project_reply_slice$identity_id)))

  }

  # Aggregate Metrics
  smells[[stringi::stri_c(start_day,"|",end_day)]] <- data.table(commit_interval,
                                                                 start_datetime = start_day,
                                                                 end_datetime = end_day,
                                                                 org_silo,
                                                                 missing_links,
                                                                 radio_silence,
                                                                 #primma_donna,
                                                                 st_congruence,
                                                                 #communicability,
                                                                 num_tz,
                                                                 code_only_devs,
                                                                 code_files,
                                                                 ml_only_devs,
                                                                 ml_threads,
                                                                 code_ml_both_devs)
}
smells_interval <- rbindlist(smells)

Community Inspection per Time Slice

This shows the last loop slice:

project_collaboration_network <- recolor_network_by_community(git_network_authors,code_clusters)

gcid <- igraph::graph_from_data_frame(d=project_collaboration_network[["edgelist"]], 
                      directed = FALSE,
                      vertices = project_collaboration_network[["nodes"]])

visIgraph(gcid,randomSeed = 1)

It may appear counter-intuitive the 5 connected nodes are not considered a single community, however recall the algorithm also consider the weights among the nodes. In this case, the algorithm did not consider the weight sufficiently high among the nodes to consider it a community.

You can also observe the identity match algorihtm in action and its potential implications: Different identities matched to the same author are separated by the | ). Had it not performed as intended, single nodes would appear separately and very likely connected, thus biasing the social metrics.

project_collaboration_network <- recolor_network_by_community(reply_network_authors,mail_clusters)

gcid <- igraph::graph_from_data_frame(d=project_collaboration_network[["edgelist"]], 
                      directed = FALSE,
                      vertices = project_collaboration_network[["nodes"]])

visIgraph(gcid,randomSeed = 1)

Other Metrics

The remainder of this notebook does not compute any social smells. It provide some popular metrics commonly reported in software engineering literature, which may be useful to you when interpreting the social smells. While their granularity is not at "time window" level, they are computed like so in order to be placed in the same table of social smells after being aggregated to the same granularity.

Churn

churn <- list()

for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window)  == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date 
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j] 
  }

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) & 
                     (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){
    # The start and end commit
    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash
    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash


    # The start and end datetime
    #start_datetime <- first(project_git_slice)$author_datetimetz
    #end_datetime <- last(project_git_slice)$author_datetimetz

    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    churn[[commit_interval]] <- data.table(commit_interval,
                                #           start_datetime,
                                #           end_datetime,
                        churn=metric_churn_per_commit_interval(project_git_slice),
                        n_commits = length(unique(project_git_slice$commit_hash)))  
  }
}
churn_interval <- rbindlist(churn)

Line Metrics

time_window <- seq.POSIXt(from=start_date,to=end_date,by=window_size_f)
#time_window <- seq(from=start_daydiff,to=end_daydiff,by=window_size)
line_metrics <- list()
for(j in 2:length(time_window)){
  i <- j - 1

  # If the time window is of size 1, then there has been less than "window_size_f"
  # days from the start date.
  if(length(time_window)  == 1){
    # Below 3 month size
    start_day <- start_date
    end_day <- end_date 
  }else{
    start_day <- time_window[i]
    end_day <- time_window[j] 
  }

  # Obtain all commits from the gitlog which are within a particular window_size
  project_git_slice <- project_git[(author_datetimetz >= start_day) & 
                     (author_datetimetz < end_day)]

  gitlog_exist <- (nrow(project_git_slice) != 0)
  if(gitlog_exist){

    i_commit_hash <- data.table::first(project_git_slice[author_datetimetz == min(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash
    # Use the ending hash of that window_size to calculate the flaws
    j_commit_hash <- data.table::first(project_git_slice[author_datetimetz == max(project_git_slice$author_datetimetz,na.rm=TRUE)])$commit_hash

    # Checkout to commit of interest
    git_checkout(j_commit_hash, git_repo_path)
    # Run line metrics against the checkedout commit
    commit_interval <- stri_c(i_commit_hash,"-",j_commit_hash)
    line_metrics[[commit_interval]] <- parse_line_metrics(scc_path,git_repo_path)
    line_metrics[[commit_interval]]$commit_interval <- commit_interval
    line_metrics[[commit_interval]]$git_checkout <- j_commit_hash
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]][,.(commit_interval,
                                                                          git_checkout,
                                                                          Location,
                                                                          Lines,
                                                                          Code,
                                                                          Comments,
                                                                          Blanks,
                                                                          Complexity)]

    # Filter Files
    line_metrics[[commit_interval]] <- line_metrics[[commit_interval]]  %>%
    filter_by_file_extension(file_extensions,"Location")  %>% 
    filter_by_filepath_substring(substring_filepath,"Location")
  }
}
# Reset Repo to HEAD
git_checkout(git_branch,git_repo_path)

line_metrics_file <- rbindlist(line_metrics)
line_metrics_interval <- line_metrics_file[,.(Lines = sum(Lines),
                                 Code = sum(Code),
                                 Comments = sum(Comments),
                                 Blanks = sum(Blanks),
                                 Complexity = sum(Complexity)),
                              by = c("commit_interval","git_checkout")]

Merge Churn, Smells, and Line Metrics

dt <- merge(smells_interval,churn_interval,by="commit_interval")
dt <- merge(dt,line_metrics_interval,by="commit_interval")
kable(dt)


sailuh/kaiaulu documentation built on April 17, 2024, 2:21 a.m.