Please ensure the following R packages are installed on your computer.
rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(visNetwork)
require(igraph)
This notebook re-implements in Kaiaulu the motif analysis presented in our prior work from Codeface:
W. Mauerer, M. Joblin, D. A. Tamburri, C. Paradis, R. Kazman and S. Apel, "In Search of Socio-Technical Congruence: A Large-Scale Longitudinal Study," in IEEE Transactions on Software Engineering, vol. 48, no. 8, pp. 3159-3184, 1 Aug. 2022, doi: 10.1109/TSE.2021.3082074.
The Kaiaulu re-implementation is easy to extend, and allows for any combination of motifs across the various networks already supported, such as dependency networks (file, function, class, etc.), communication networks (mailing lists and issue trackers such as Bugzilla, JIRA, and GitHub), and collaboration networks (git log at file, function, class, etc.).
We demonstrate here both the triangle and square motifs as originally defined in our paper, which together leverage all 3 types of networks. The project under analysis is Kaiaulu itself; however, the same analysis can be applied to other open source projects.
tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/kaiaulu.yml")
scc_path <- get_tool_project("scc", tool)
oslom_dir_path <- get_tool_project("oslom_dir", tool)
oslom_undir_path <- get_tool_project("oslom_undir", tool)
perceval_path <- get_tool_project("perceval", tool)
git_repo_path <- get_git_repo_path(conf)
git_branch <- get_git_branches(conf)[1]
issues_json_folder_path <- get_github_issue_path(conf, "project_key_1")
pull_requests_json_folder_path <- get_github_pull_request_path(conf, "project_key_1")
comments_json_folder_path <- get_github_issue_or_pr_comment_path(conf, "project_key_1")
commit_json_folder_path <- get_github_commit_path(conf, "project_key_1")
# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)
To get started, we use the parse_gitlog function to extract a table from the git log. You can inspect the project_git variable to see what information is available from the git log. Note that we also apply filter functions. The patterns we wish to filter on are specified in the project configuration file.
git_checkout(git_branch,git_repo_path)
project_git <- parse_gitlog(perceval_path,git_repo_path)
project_git <- project_git %>%
  filter_by_file_extension(file_extensions,"file_pathname") %>%
  filter_by_filepath_substring(substring_filepath,"file_pathname")
We also can normalize the timezone across the different projects, although we do not use the time information in this notebook.
project_git$author_tz <- sapply(stringi::stri_split(project_git$author_datetimetz, regex=" "),"[[",6)
project_git$author_datetimetz <- as.POSIXct(project_git$author_datetimetz, format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
project_git$committer_tz <- sapply(stringi::stri_split(project_git$committer_datetimetz, regex=" "),"[[",6)
project_git$committer_datetimetz <- as.POSIXct(project_git$committer_datetimetz, format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")
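As a small illustration of the conversion above, the following sketch applies the same split and format string to a single hand-written timestamp. The sample value is made up for illustration, base R's strsplit stands in for stringi::stri_split, and the sketch assumes an English locale so that %a and %b match the day and month abbreviations.

```r
# Hypothetical timestamp in the layout the format string above expects.
raw_datetimetz <- "Tue Aug 4 10:15:30 2020 -1000"

# The sixth space-separated token is the UTC offset.
toy_author_tz <- sapply(strsplit(raw_datetimetz, " ", fixed = TRUE), "[[", 6)

# %z applies the offset while parsing; tz = "UTC" sets the timezone used for display.
toy_author_datetime <- as.POSIXct(raw_datetimetz,
                                  format = "%a %b %d %H:%M:%S %Y %z",
                                  tz = "UTC")

print(toy_author_tz)
print(format(toy_author_datetime, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

Keeping the offset in its own column before the conversion preserves the author's local timezone, which would otherwise be lost once every timestamp is expressed in UTC.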
Next, we parse the various communication channels the project uses. Similarly to parse_gitlog, the returned object is a table, which we can browse directly in R to see what information is available.
In this Notebook, we only use GitHub replies (issue and pull request comments), since Kaiaulu does not use mailing lists. However, download and parser functions for mbox, JIRA, and Bugzilla are also available. Refer to the API for details.
project_github_replies <- parse_github_replies(issues_json_folder_path, pull_requests_json_folder_path, comments_json_folder_path, commit_json_folder_path)
# Timezone is not available on GitHub timestamp, all in UTC
project_github_replies$reply_tz <- "0000"
project_github_replies$reply_datetimetz <- as.POSIXct(project_github_replies$reply_datetimetz, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
For projects which use more than one communication channel (e.g. a developer mailing list and an issue tracker), Kaiaulu abstracts the various communication channels as a table of replies. Refer to vignettes/reply_communicaton_showcase.Rmd for details.
# All replies are combined into a single reply table.
project_reply <- project_github_replies
project_git <- project_git[order(author_datetimetz)]
project_reply <- project_reply[order(reply_datetimetz)]
#project_reply <- project_reply[reply_datetimetz >= start_date & reply_datetimetz <= end_date]
Because developers may use different identities across the git log and the communication channel(s), we apply a number of heuristics (encoded as unit tests in Kaiaulu) to assign a single ID to each developer. You can also inspect the generated table to perform manual corrections.
# Identity matching
project_log <- list(project_git=project_git,project_reply=project_reply)
project_log <- identity_match(project_log,
                              name_column = c("author_name_email","reply_from"),
                              assign_exact_identity,
                              use_name_only=FALSE,
                              label = "raw_name")
project_git <- project_log[["project_git"]]
project_reply <- project_log[["project_reply"]]
Having performed the necessary transformations on our data sources, we are ready to transform them to networks. Our goal is to create a single graph containing all the information of interest in order to search for sub-graphs of interest (i.e. our defined motifs).
A number of transformation functions are available in Kaiaulu to transform the various logs into networks. First, we transform our git log data into a bipartite author-file network:
git_network <- transform_gitlog_to_bipartite_network(project_git, mode="author-file")
Next, we apply the same transformation to obtain our reply network. Note this reply network is also a bipartite graph, of the type developer-thread. Since the communication occurs on GitHub, an issue is equivalent to an e-mail thread. Because we wish to "add" communication edges between developers to the git log network, we perform a bipartite projection over developer-thread to obtain a developer-developer network. Here, we choose the weight scheme that sums the existing edge weights (i.e. the number of replies to a thread) of the edges that connected the developers to the deleted thread node.
reply_network <- transform_reply_to_bipartite_network(project_reply)
reply_network <- bipartite_graph_projection(reply_network,
                                            mode = TRUE,
                                            weight_scheme_function = weight_scheme_sum_edges)
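To make the projection step concrete, here is a minimal base R sketch on a toy developer-thread edgelist. It is not Kaiaulu's implementation: the helper project_threads, the column names, and the data are hypothetical, and the sketch assumes that the projected weight of a developer pair is the sum of the two edges that connected them to the removed thread.

```r
# Toy developer-thread edgelist; weight = number of replies by that developer in the thread.
toy_dev_thread <- data.frame(
  from = c("alice", "bob", "carol", "alice"),
  to = c("issue_1", "issue_1", "issue_1", "issue_2"),
  weight = c(2, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace every thread node by edges between each pair of its developers.
project_threads <- function(edgelist) {
  per_thread <- split(edgelist, edgelist$to)
  pairs <- lapply(per_thread, function(thread_edges) {
    # A thread with a single developer yields no developer-developer edge.
    if (nrow(thread_edges) < 2) return(NULL)
    idx <- utils::combn(nrow(thread_edges), 2)
    data.frame(
      from = thread_edges$from[idx[1, ]],
      to = thread_edges$from[idx[2, ]],
      # Assumed weight scheme: sum of the two developer-thread edge weights.
      weight = thread_edges$weight[idx[1, ]] + thread_edges$weight[idx[2, ]],
      stringsAsFactors = FALSE
    )
  })
  pairs <- do.call(rbind, pairs)
  rownames(pairs) <- NULL
  pairs
}

toy_dev_dev <- project_threads(toy_dev_thread)
print(toy_dev_dev)
```

In this toy data, issue_2 has only one participant and therefore contributes no edge, while issue_1 yields one edge per pair of its three participants.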
We can then add the developer-developer network nodes and edges to the developer-file network:
git_reply_network <- list()
git_reply_network[["nodes"]] <- unique(rbind(git_network[["nodes"]],
                                             reply_network[["nodes"]]))
git_reply_network[["edgelist"]] <- rbind(git_network[["edgelist"]],
                                         reply_network[["edgelist"]])
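The merge above can be tried on toy tables. In the sketch below, the column names (name, type, color, from, to, weight), the color values, and the data are assumptions for illustration and may not match the tables Kaiaulu produces.

```r
# Hypothetical node tables for the two networks; developers appear in both.
toy_git_nodes <- data.frame(
  name = c("alice", "bob", "R/a.R"),
  type = c(TRUE, TRUE, FALSE),
  color = c("black", "black", "#f4dbb5"),
  stringsAsFactors = FALSE
)
toy_reply_nodes <- data.frame(
  name = c("alice", "bob"),
  type = c(TRUE, TRUE),
  color = c("black", "black"),
  stringsAsFactors = FALSE
)

# Hypothetical edge tables: two author-file edges and one developer-developer edge.
toy_git_edges <- data.frame(from = c("alice", "bob"), to = c("R/a.R", "R/a.R"),
                            weight = c(1, 2), stringsAsFactors = FALSE)
toy_reply_edges <- data.frame(from = "alice", to = "bob",
                              weight = 3, stringsAsFactors = FALSE)

# unique() drops a repeated node only if every column of the row is identical.
toy_nodes <- unique(rbind(toy_git_nodes, toy_reply_nodes))
toy_edges <- rbind(toy_git_edges, toy_reply_edges)

# Guard: duplicated vertex names would make igraph::graph_from_data_frame() fail later.
stopifnot(!anyDuplicated(toy_nodes$name))
print(toy_nodes)
print(toy_edges)
```

Because unique() compares whole rows, a developer whose attributes differ between the two node tables would survive as two rows, which is why the guard on duplicated names is a useful check before building the graph.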
To perform motif search, we rely on the igraph library. First, we transform the networks to igraph's network representation:
i_git_reply_network <- igraph::graph_from_data_frame(d=git_reply_network[["edgelist"]], directed = FALSE, vertices = git_reply_network[["nodes"]])
visIgraph(i_git_reply_network,randomSeed = 1)
We also create the triangle motif sub-graph and display it:
motif_triangle <- motif_factory("triangle")
i_triangle_motif <- igraph::graph_from_data_frame(d=motif_triangle[["edgelist"]],
                                                  directed = FALSE,
                                                  vertices = motif_triangle[["nodes"]])
visIgraph(i_triangle_motif)
Because the motif search expects the "color" node attribute to be numeric, we convert the node color to 1 if black, or 2 otherwise in the igraph network representation:
V(i_triangle_motif)$color <- ifelse(V(i_triangle_motif)$color == "black",1,2)
V(i_git_reply_network)$color <- ifelse(V(i_git_reply_network)$color == "black",1,2)
We can then count the motifs:
## Count subgraph isomorphisms
motif_count <- igraph::count_subgraph_isomorphisms(i_triangle_motif, i_git_reply_network, method="vf2", edge.color1 = NULL, edge.color2 = NULL)
motif_count
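To see what the count means, the following self-contained sketch runs the same igraph call on a hand-made network. The node names and the non-black color value are made up, and the triangle pattern is built by hand (two developers and one file, all connected) rather than with motif_factory, so it is an assumption about the motif's shape.

```r
library(igraph)

# Toy target: alice and bob both edit R/a.R and reply to each other; carol only edits R/a.R.
toy_nodes <- data.frame(
  name = c("alice", "bob", "carol", "R/a.R"),
  color = c("black", "black", "black", "#f4dbb5"),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(
  from = c("alice", "bob", "carol", "alice"),
  to = c("R/a.R", "R/a.R", "R/a.R", "bob"),
  stringsAsFactors = FALSE
)
toy_target <- graph_from_data_frame(d = toy_edges, directed = FALSE, vertices = toy_nodes)

# Hand-built triangle pattern (assumed shape of the triangle motif).
toy_pattern_nodes <- data.frame(
  name = c("dev_1", "dev_2", "file"),
  color = c("black", "black", "#f4dbb5"),
  stringsAsFactors = FALSE
)
toy_pattern_edges <- data.frame(
  from = c("dev_1", "dev_1", "dev_2"),
  to = c("dev_2", "file", "file"),
  stringsAsFactors = FALSE
)
toy_pattern <- graph_from_data_frame(d = toy_pattern_edges, directed = FALSE,
                                     vertices = toy_pattern_nodes)

# VF2 matches on a numeric "color" vertex attribute: 1 = developer, 2 = anything else.
V(toy_target)$color <- ifelse(V(toy_target)$color == "black", 1, 2)
V(toy_pattern)$color <- ifelse(V(toy_pattern)$color == "black", 1, 2)

toy_triangle_count <- count_subgraph_isomorphisms(toy_pattern, toy_target, method = "vf2")
print(toy_triangle_count)
```

Only alice, bob, and R/a.R form a triangle here, yet the count is 2 rather than 1, because each assignment of the two developer positions is a separate isomorphism. Keep this in mind when interpreting motif_count as a number of distinct triangles.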
Or obtain the list of every sub-graph match of the triangle motif:
i_motif_vertice_sequence <- subgraph_isomorphisms(i_triangle_motif, i_git_reply_network, method="vf2", edge.color1 = NULL, edge.color2 = NULL)
motif_vertice_sequence <- lapply(i_motif_vertice_sequence,igraph::as_ids)
motif_triangle_dt <- rbindlist(lapply((lapply(motif_vertice_sequence,t)),data.table))
kable(motif_triangle_dt)
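The last transformation step may look cryptic, so here is a small sketch of it in isolation. The list of matches is hand-written to stand in for the output of igraph::as_ids, and the names in it are made up.

```r
library(data.table)

# Hypothetical matches: one character vector of vertex names per isomorphism.
toy_matches <- list(
  c("alice", "bob", "R/a.R"),
  c("bob", "alice", "R/a.R")
)

# t() turns each vector into a 1-row matrix, so data.table() yields one row
# with one column per motif vertex instead of one column with three rows.
toy_rows <- lapply(lapply(toy_matches, t), data.table)

# rbindlist() stacks the 1-row tables into a single table of matches.
toy_matches_dt <- rbindlist(toy_rows)
print(toy_matches_dt)
```

Each row of the resulting table is one match, and the default column names V1, V2, V3 follow the order of the vertices in each match.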
For the square motif, we now also consider the file dependencies. In this Notebook, we use Kaiaulu's built-in R parser. However, parse_dependencies() can be used to obtain dependencies for the languages which Depends supports.
folder_path <- "../R"
dependencies <- parse_r_dependencies(folder_path)
file_network <- transform_r_dependencies_to_network(dependencies,dependency_type="file")
# Prefix file names with "R/" so they match the file paths in the git log network
file_network[["nodes"]]$name <- paste0("R/",file_network[["nodes"]]$name)
file_network[["edgelist"]]$from <- paste0("R/",file_network[["edgelist"]]$from)
file_network[["edgelist"]]$to <- paste0("R/",file_network[["edgelist"]]$to)
Similar to before, we combine the networks, now also including the file network.
file_git_reply_network <- list()
file_git_reply_network[["nodes"]] <- unique(rbind(git_network[["nodes"]],
                                                  reply_network[["nodes"]],
                                                  file_network[["nodes"]]))
file_git_reply_network[["edgelist"]] <- rbind(git_network[["edgelist"]],
                                              reply_network[["edgelist"]],
                                              file_network[["edgelist"]])
We then use igraph for visualization and for calculating the motif. First, the visualization of the network:
i_file_git_reply_network <- igraph::graph_from_data_frame(d=file_git_reply_network[["edgelist"]],
                                                          directed = FALSE,
                                                          vertices = file_git_reply_network[["nodes"]])
visIgraph(i_file_git_reply_network,randomSeed = 1)
And then of the square motif:
motif_square <- motif_factory("square")
i_square_motif <- igraph::graph_from_data_frame(d=motif_square[["edgelist"]],
                                                directed = FALSE,
                                                vertices = motif_square[["nodes"]])
visIgraph(i_square_motif)
Once more, we transform the color of the nodes to numeric to perform the motif search.
V(i_square_motif)$color <- ifelse(V(i_square_motif)$color == "black",1,2)
V(i_file_git_reply_network)$color <- ifelse(V(i_file_git_reply_network)$color == "black",1,2)
We can then count the motif occurrences:
## Count subgraph isomorphisms
motif_count <- count_subgraph_isomorphisms(i_square_motif, i_file_git_reply_network, method="vf2", edge.color1 = NULL, edge.color2 = NULL)
motif_count
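As with the triangle, a toy network helps in reading the square count. The sketch below assumes the square motif is a 4-cycle of two developers who communicate, each editing a different file, with a dependency between the two files. That shape, the node names, and the non-black color value are assumptions for illustration; the notebook itself obtains the motif from motif_factory.

```r
library(igraph)

# Toy target: alice edits R/a.R, bob edits R/b.R, the two files depend on each other,
# and alice and bob reply to each other. carol edits R/a.R but talks to no one.
toy_nodes <- data.frame(
  name = c("alice", "bob", "carol", "R/a.R", "R/b.R"),
  color = c("black", "black", "black", "#f4dbb5", "#f4dbb5"),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(
  from = c("alice", "bob", "carol", "alice", "R/a.R"),
  to = c("R/a.R", "R/b.R", "R/a.R", "bob", "R/b.R"),
  stringsAsFactors = FALSE
)
toy_target <- graph_from_data_frame(d = toy_edges, directed = FALSE, vertices = toy_nodes)

# Hand-built square pattern (assumed shape): dev_1 - dev_2 - file_2 - file_1 - dev_1.
toy_pattern_nodes <- data.frame(
  name = c("dev_1", "dev_2", "file_1", "file_2"),
  color = c("black", "black", "#f4dbb5", "#f4dbb5"),
  stringsAsFactors = FALSE
)
toy_pattern_edges <- data.frame(
  from = c("dev_1", "dev_1", "dev_2", "file_1"),
  to = c("dev_2", "file_1", "file_2", "file_2"),
  stringsAsFactors = FALSE
)
toy_pattern <- graph_from_data_frame(d = toy_pattern_edges, directed = FALSE,
                                     vertices = toy_pattern_nodes)

# Numeric colors: 1 = developer, 2 = file.
V(toy_target)$color <- ifelse(V(toy_target)$color == "black", 1, 2)
V(toy_pattern)$color <- ifelse(V(toy_pattern)$color == "black", 1, 2)

toy_square_count <- count_subgraph_isomorphisms(toy_pattern, toy_target, method = "vf2")
print(toy_square_count)
```

The single square formed by alice, bob, R/a.R, and R/b.R is counted twice, once for each way of assigning the two developers (and their files) to the pattern positions. Without the file dependency edge between R/a.R and R/b.R, no square would be found at all, which is why the file network has to be merged in before the search.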
Or enumerate where it occurred:
i_motif_vertice_sequence <- subgraph_isomorphisms(i_square_motif, i_file_git_reply_network, method="vf2", edge.color1 = NULL, edge.color2 = NULL)
motif_vertice_sequence <- lapply(i_motif_vertice_sequence,igraph::as_ids)
motif_square_dt <- rbindlist(lapply((lapply(motif_vertice_sequence,t)),data.table))
kable(motif_square_dt)
We can also very easily highlight a particular set of motifs of interest in the graph. For example, consider the following detected motif:
kable(motif_triangle_dt[1])
We can color its nodes red on the original network for further exploration as follows:
triangle_motif_example <- unlist(motif_triangle_dt[1])
colored_git_reply_network <- git_reply_network
is_in_motif_example <- colored_git_reply_network[["nodes"]]$name %in% triangle_motif_example
colored_git_reply_network[["nodes"]][is_in_motif_example]$color <- "red"
i_git_reply_network <- igraph::graph_from_data_frame(d=colored_git_reply_network[["edgelist"]],
                                                     directed = FALSE,
                                                     vertices = colored_git_reply_network[["nodes"]])
visIgraph(i_git_reply_network,randomSeed = 1)