This notebook provides a brief introduction to the Text module and how it can be used to identify design patterns and other keywords in class names.
We will use srcML for this first step in order to extract source code identifiers. You can install it through this link. Once installed, the command srcml should be available via terminal. Use which srcml to obtain its path, and include it in your tools.yml.
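The same lookup can be done from within R, which is convenient when the notebook runs inside RStudio. This is only a minimal sketch using base R's Sys.which(); it assumes srcml is on the PATH seen by the R session, which may differ from the PATH of your terminal.

```r
# Sys.which() returns the full path of an executable, or "" when it is not found.
srcml_binary <- Sys.which("srcml")

if (nzchar(srcml_binary)) {
  message("srcml found at: ", srcml_binary)
} else {
  message("srcml was not found on the PATH of this R session.")
}
```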
You should also install the identifier splitter, Spiral, a Python library. The recommended way to set it up is:
sudo pip3 install git+https://github.com/casics/spiral.git
Finally, because using this library requires interacting with Python, you should install the reticulate R package. If install.packages('reticulate') fails due to any error, try install.packages('Rcpp') and then re-attempt. When using RStudio, you must specify the local Python version in which you installed Spiral. See: https://stackoverflow.com/a/71044068/1260232. Otherwise, reticulate will be unable to load the Spiral Python library, because the library will not be installed in the Python version reticulate is using.
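One way to confirm that reticulate is bound to the Python installation that holds Spiral is sketched below. The helper name check_spiral is hypothetical and not part of Kaiaulu; it only wraps reticulate functions, and the path you pass to it is an assumption about your own machine.

```r
# Hypothetical helper: bind reticulate to one interpreter and check for Spiral.
check_spiral <- function(python_path) {
  # required = TRUE makes reticulate raise an error instead of silently
  # falling back to a different Python installation.
  reticulate::use_python(python_path, required = TRUE)
  # Show which interpreter reticulate ended up using.
  print(reticulate::py_config())
  # TRUE only if the 'spiral' module can be found by that interpreter.
  reticulate::py_module_available("spiral")
}
```

Calling check_spiral() with the path of your Python binary should return TRUE once Spiral is installed for that interpreter. It has to run in a fresh R session, because reticulate cannot switch interpreters after Python has been initialized.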
rm(list = ls())
seed <- 1
set.seed(seed)
Analyzing open source projects often requires some manual work on your part to find where the open source project hosts its codebase and mailing list. Instead of hard-coding this in Notebooks, we keep this information in a project configuration file. Here's the minimal information this Notebook requires in a project configuration file:
project:
  website: https://github.com/junit-team/junit5/
  #openhub: https://www.openhub.net/p/apache_portable_runtime

version_control:
  # Where is the git log located locally?
  # This is the path to the .git of the project repository you are analyzing.
  # The .git is hidden, so you can see it using `ls -a`
  log: ../../rawdata/git_repo/junit5/.git
  # From where the git log was downloaded?
  log_url: https://github.com/junit-team/junit5/
  # List of branches used for analysis
  branch:
    - main

filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc
  remove_filepaths_containing:
    - test
    - java_code_examples

tool:
  # srcML allow to parse src code as text (e.g. identifiers)
  srcml:
    # The file path to where you wish to store the srcml output of the project
    srcml_path: ../../analysis/depends/srcml_depends.xml

# Analysis Configuration #
analysis:
  # A list of topic and keywords (see src_text_showcase.Rmd).
  topics:
    topic_1:
      - model
      - view
      - controller
    topic_2:
      - visitor
    topic_3:
      - observer
      - listener
    topic_4:
      - adapter
    topic_5:
      - decorator
    topic_6:
      - factory
      - builder
    topic_7:
      - facade
    topic_8:
      - strategy
    topic_9:
      - command
require(kaiaulu)
require(data.table)
require(yaml)
require(stringi)
require(knitr)
require(reticulate)
require(magrittr)
require(gt)
tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/junit5.yml")

srcml_path <- get_tool_project("srcml", tool)
git_repo_path <- get_git_repo_path(conf)
folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")

# Tool Parameters
srcml_filepath <- get_srcml_filepath(conf)

# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)

# Analysis
topics <- get_topics(conf)
This is all the project configuration files are used for. If you inspect the variables above, you will see they are just strings. As a reminder, the tools.yml is where you store the filepaths to third-party software on your computer. Please see Kaiaulu's README.md for details. As a rule of thumb, any R Notebook in Kaiaulu loads the project configuration file at the start, much like you would normally initialize variables at the start of your source code.
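To see why these variables are "just strings", it can help to look at what a YAML file becomes once it is loaded into R. The sketch below uses the yaml package directly on a trimmed-down, hypothetical excerpt of a configuration file; it assumes the configuration is read as plain YAML and is not a description of how parse_config or the get_* functions are implemented.

```r
library(yaml)

# A trimmed-down, hypothetical excerpt of a project configuration file.
conf_text <- "
filter:
  remove_filepaths_containing:
    - test
    - java_code_examples
analysis:
  topics:
    topic_3:
      - observer
      - listener
"

# yaml.load() turns the YAML text into a nested R list.
conf_sketch <- yaml::yaml.load(conf_text)

# Each field is reached by name and holds plain character values.
conf_sketch[["filter"]][["remove_filepaths_containing"]]
conf_sketch[["analysis"]][["topics"]][["topic_3"]]
```

The nested list mirrors the indentation of the YAML file, so a getter for topics only needs to walk down to the analysis and topics fields.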
To use srcml, we leverage the git path to specify the folder on which srcml should run. The srcml library will then generate a single file, saved at srcml_filepath, that contains all the information of the project.
srcml_filepath <- annotate_src_text(srcml_path = srcml_path, src_folder = folder_path, srcml_filepath)
We can then use srcml to query against this generated XML file. For example, we can query the class names. There is much more that can be parsed. Indeed, you can even use srcml to modify the source code, and output runnable code out of it. The following is a convenience function that will also tabulate the output as an R table:
query_table <- query_src_text_class_names(srcml_path = srcml_path, srcml_filepath = srcml_filepath)
head(query_table) %>% gt(auto_align = FALSE)
We can see that both the file name and class name were output here. To perform keyword matching, we must now split the class name identifiers into tokens. This is where the Spiral Python library comes in. First, we load the Ronin method in R, via the reticulate library:
reticulate::use_python("/usr/local/bin/python3")
spiral_library <- reticulate::import("spiral.ronin", convert = TRUE)
collections_library <- reticulate::import("collections", convert = TRUE)
# May be required for newer Python versions. Comment if causes errors.
collections_library$Iterable <- collections_library$abc$Iterable
Then, we use Spiral's split method over each classname in our prior table. To maintain the table format, we combine the tokens with ";" in each row, but they can be split again for token matching.
split_token_list <- sapply(query_table$classname,spiral_library$split)
query_table$tokens <- sapply(split_token_list,stringi::stri_c,collapse=";")
head(query_table) %>% gt(auto_align = FALSE)
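To get a feel for what the tokens column holds without setting up Python, the following sketch splits identifiers with two regular expressions in base R. It is a deliberately naive camel-case splitter and not Spiral's Ronin algorithm, which handles many more naming conventions; the function name split_camel_case and the sample class names are hypothetical.

```r
# Naive camel-case splitter: a stand-in for illustration, not Spiral's Ronin.
split_camel_case <- function(identifier) {
  # Insert a space between a lowercase letter or digit and an uppercase letter.
  spaced <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", identifier)
  # Insert a space between an acronym and a following capitalized word.
  spaced <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", spaced)
  strsplit(spaced, " ", fixed = TRUE)[[1]]
}

sample_classnames <- c("ObserverListenerFactory", "XMLParser")
sample_tokens <- lapply(sample_classnames, split_camel_case)

# Join the tokens of each class with ";" so they fit in one table cell.
sample_joined <- sapply(sample_tokens, paste, collapse = ";")
sample_joined
# "Observer;Listener;Factory" "XML;Parser"
```

Joining with ";" keeps one row per class, at the cost of having to split the string again before matching.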
Since we have a table, we can use Kaiaulu's filter functions to do some pre-processing.
Note that the project configuration file already specifies the filepath patterns to be removed:
substring_filepath
Should we wish to remove such filepaths, we can do so as follows:
nrow(query_table)
query_table <- query_table %>%
  filter_by_file_extension(file_extensions,"filepath") %>%
  filter_by_filepath_substring(substring_filepath,"filepath")
nrow(query_table)
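The effect of the two filters can be approximated in base R as follows. This sketch only mirrors what the configuration keys describe (keep file paths ending with an extension, remove file paths containing a substring); it is not Kaiaulu's implementation, and the file paths are made up for illustration.

```r
# Hypothetical file paths; not taken from the junit5 repository.
sketch_table <- data.frame(
  filepath = c("src/main/Engine.java",
               "src/test/EngineTest.java",
               "docs/README.md",
               "src/main/util.py"),
  stringsAsFactors = FALSE
)
keep_extensions <- c("java", "py")
remove_substrings <- c("test")

# Keep rows whose path ends with one of the extensions.
extension_regex <- paste0("\\.(", paste(keep_extensions, collapse = "|"), ")$")
has_extension <- grepl(extension_regex, sketch_table$filepath)

# Drop rows whose path contains any of the substrings
# (the substrings are treated as regular expressions here).
substring_regex <- paste(remove_substrings, collapse = "|")
has_substring <- grepl(substring_regex, sketch_table$filepath)

sketch_filtered <- sketch_table[has_extension & !has_substring, , drop = FALSE]
sketch_filtered$filepath
# "src/main/Engine.java" "src/main/util.py"
```

Comparing the row count before and after, as done above with nrow, is a quick check that the filters are neither too strict nor too loose.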
Finally, we lowercase all class name tokens for topic matching:
query_table$tokens <- tolower(query_table$tokens)
What is left is to use each class' tokens column and the list of topics provided in the project configuration file for comparison.
topics
First, let's split the tokens again. Here's a sample for clarity:
split_tokens <- stringi::stri_split_regex(query_table$tokens,pattern = ";")
split_tokens[1:2]
Each topic's keywords will be compared against a class' tokens. If any keyword of a topic is equal to any token of a class, we mark that topic as TRUE for the class.
is_a_topic_match <- function(split_token,topic){
  is_match <- any(topic %in% split_token)
  return(is_match)
}
for(i in 1:length(topics)){
  topic_column_name <- paste0("topic_",as.character(i))
  query_table[[topic_column_name]] <- sapply(split_tokens,is_a_topic_match,topics[[topic_column_name]])
}
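The matching rule can be tried on a small, self-contained example. The token strings and the two topics below are hypothetical; the sketch applies the same any(topic %in% tokens) test to every pair of class and topic and collects the results in a logical matrix.

```r
# Hypothetical class tokens, already lowercased and joined with ";".
sketch_tokens <- c("event;listener;registry", "test;engine", "command;factory")
sketch_topics <- list(
  topic_3 = c("observer", "listener"),
  topic_6 = c("factory", "builder")
)

split_sketch <- strsplit(sketch_tokens, ";", fixed = TRUE)

# One column per topic, one row per class.
match_matrix <- sapply(sketch_topics, function(topic) {
  sapply(split_sketch, function(tokens) any(topic %in% tokens))
})
rownames(match_matrix) <- sketch_tokens
match_matrix
```

Because %in% compares whole tokens, a token such as "listeners" would not match the keyword "listener"; stemming the tokens first would be one way to relax this.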
And finally, a sample of the final table:
head(query_table) %>% gt(auto_align = FALSE)
The full list of identified matched classes is given in the following code block, where we check if any class has any topic as TRUE:
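# Select the topic columns by name, so that every topic is included
topic_columns <- paste0("topic_",seq_along(topics))
any_topic_true <- apply(query_table[, topic_columns, with=FALSE],1,sum)
any_topic_true <- ifelse(any_topic_true == 0,FALSE,TRUE)
query_table[any_topic_true] %>% gt(auto_align = FALSE)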
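The row-wise check can also be written against a small stand-in table. The sketch below uses a base R data.frame with made-up class names and three hypothetical topic columns, and it picks the topic columns by name so that the check stays valid when the number of topics or of leading columns changes.

```r
# Hypothetical match table with three topic columns.
sketch_matches <- data.frame(
  classname = c("EventListenerRegistry", "TestEngine", "CommandFactory"),
  topic_1 = c(FALSE, FALSE, FALSE),
  topic_2 = c(TRUE, FALSE, FALSE),
  topic_3 = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Pick every column whose name starts with "topic_".
topic_columns <- grep("^topic_", names(sketch_matches), value = TRUE)

# A class is kept if at least one of its topic columns is TRUE.
any_topic <- rowSums(sketch_matches[, topic_columns, drop = FALSE]) > 0
matched <- sketch_matches[any_topic, ]
matched$classname
# "EventListenerRegistry" "CommandFactory"
```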