Introduction

This notebook provides a brief introduction to the Text module and shows how it can be used to identify design patterns and other keywords in class names.

Setup

Text Approach

We will use srcML for this first step to extract source code identifiers. You can install it from the srcML website. Once installed, the command srcml should be available in your terminal. Use which srcml to obtain its path, and include that path in your tools.yml.

You should also install Spiral, a Python library for splitting identifiers. The recommended installation method is:

sudo pip3 install git+https://github.com/casics/spiral.git

Finally, because we need to interact with Python to use this library, you should install the reticulate R package. If install.packages('reticulate') fails, try install.packages('Rcpp') and then re-attempt. When using RStudio, you must specify the local Python installation in which you installed Spiral (see https://stackoverflow.com/a/71044068/1260232). Otherwise, reticulate will be unable to load the Spiral Python library, because it will look for it in a different Python installation.
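Before continuing, it can help to confirm that reticulate is bound to the Python installation where Spiral was installed. The sketch below is only an illustration: check_spiral is a hypothetical helper name (not part of Kaiaulu), and the Python path in the example call is an assumption you should replace with your own. It relies on the reticulate functions use_python() and py_module_available().

```r
# Hypothetical helper (not part of Kaiaulu): checks whether the Spiral module
# can be imported from a given Python interpreter.
check_spiral <- function(python_path) {
  # Bind reticulate to the chosen interpreter; errors early if it is unusable
  reticulate::use_python(python_path, required = TRUE)
  # TRUE if the "spiral" module can be found by that interpreter
  reticulate::py_module_available("spiral")
}

# Example call (the path is an assumption; use the output of `which python3`):
# check_spiral("/usr/local/bin/python3")
```

If the call returns FALSE, Spiral was most likely installed under a different Python installation than the one reticulate is using.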

rm(list = ls())
seed <- 1
set.seed(seed)

Project Configuration File

Analyzing open source projects often requires some manual work on your part to find where the project hosts its codebase and mailing list. Instead of hard-coding this information in Notebooks, we keep it in a project configuration file. Here is the minimal information this Notebook requires in a project configuration file:

project:
  website: https://github.com/junit-team/junit5/
  #openhub: https://www.openhub.net/p/apache_portable_runtime

version_control:
  # Where is the git log located locally?
  # This is the path to the .git of the project repository you are analyzing.
  # The .git is hidden, so you can see it using `ls -a`
  log: ../../rawdata/git_repo/junit5/.git
  # From where the git log was downloaded?
  log_url: https://github.com/junit-team/junit5/
  # List of branches used for analysis
  branch:
    - main

filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc
  remove_filepaths_containing:
    - test
    - java_code_examples

tool:
  # srcML allows parsing source code as text (e.g. identifiers)
  srcml:
    # The file path to where you wish to store the srcml output of the project
    srcml_path: ../../analysis/depends/srcml_depends.xml
# Analysis Configuration #
analysis:
  # A list of topic and keywords (see src_text_showcase.Rmd).
  topics:
    topic_1:
      - model
      - view
      - controller
    topic_2:
      - visitor
    topic_3:
      - observer
      - listener
    topic_4:
      - adapter
    topic_5:
      - decorator
    topic_6:
      - factory
      - builder
    topic_7:
      - facade
    topic_8:
      - strategy
    topic_9:
      - command

require(kaiaulu)
require(data.table)
require(yaml)
require(stringi)
require(knitr)
require(reticulate)
require(magrittr)
require(gt)

tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/junit5.yml")
srcml_path <- get_tool_project("srcml", tool)

git_repo_path <- get_git_repo_path(conf)
# Match ".git" literally (as a regex, "." would match any character)
folder_path <- stri_replace_last(git_repo_path,replacement="",fixed=".git")

# Tool Parameters 
srcml_filepath <- get_srcml_filepath(conf)

# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)

# Analysis
topics <- get_topics(conf)

This is all the project configuration files are used for. If you inspect the variables above, you will see they are just strings. As a reminder, tools.yml is where you store the filepaths to third-party software on your computer. Please see Kaiaulu's README.md for details. As a rule of thumb, every R Notebook in Kaiaulu loads the project configuration file at the start, much like you would normally initialize variables at the start of your source code.
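To make the relationship between the YAML file and the R variables concrete, here is a minimal sketch that parses a trimmed copy of the configuration directly with the yaml package. It assumes the getter functions above essentially index into the parsed YAML; that is an assumption, and Kaiaulu's actual implementation may differ. The names conf_text, example_conf, example_website, and example_topics are made up for this illustration.

```r
library(yaml)

# A trimmed copy of the configuration shown earlier, inlined as a string
conf_text <- "
project:
  website: https://github.com/junit-team/junit5/
filter:
  keep_filepaths_ending_with:
    - java
    - py
analysis:
  topics:
    topic_3:
      - observer
      - listener
    topic_6:
      - factory
      - builder
"

# yaml.load() turns mappings into named lists and sequences of strings
# into character vectors
example_conf <- yaml::yaml.load(conf_text)

# Indexing into the nested list retrieves individual settings
example_website <- example_conf[["project"]][["website"]]
example_topics <- example_conf[["analysis"]][["topics"]]

str(example_topics)
```

Here example_topics is a named list with one character vector of keywords per topic, which is the shape the topic matching step later in this Notebook relies on.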

Extracting Class Names for Token Matching

To use srcml, we use the git path to specify the folder on which srcml should run. srcml will then generate a single XML file, saved at srcml_filepath, that contains all the information of the project.

srcml_filepath <- annotate_src_text(srcml_path = srcml_path,
                                     src_folder = folder_path,
                                     srcml_filepath)

We can then use srcml to query this generated XML file. For example, we can query the class names. Much more can be parsed. Indeed, you can even use srcml to modify the source code and output runnable code from it. The following is a convenience function that also tabulates the output as an R table:

query_table <- query_src_text_class_names(srcml_path = srcml_path,
                                     srcml_filepath = srcml_filepath)
head(query_table)  %>%
  gt(auto_align = FALSE)

Using Spiral's Ronin to Split Identifiers

We can see that both the file name and class name were output here. To perform keyword matching, we must now split the class name identifiers into tokens. This is where the Spiral Python library comes in. First, we load the Ronin method in R, via the reticulate library:

reticulate::use_python("/usr/local/bin/python3")
spiral_library <- reticulate::import("spiral.ronin", convert = TRUE)
collections_library <- reticulate::import("collections", convert = TRUE)

# May be required for newer Python versions. Comment out if it causes errors.
collections_library$Iterable <- collections_library$abc$Iterable

Then, we use Spiral's split method over each classname in our prior table. To maintain the table format, we combine the tokens with ";" in each row, but they can be split again for token matching.

# lapply always returns a list, even if every class name yields the same number of tokens
split_token_list <- lapply(query_table$classname,spiral_library$split)
query_table$tokens <- sapply(split_token_list,stringi::stri_c,collapse=";")
head(query_table)  %>%
  gt(auto_align = FALSE)
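To give a feel for what identifier splitting produces, the sketch below uses a deliberately simplified, regex-based splitter written in base R. It is a stand-in for illustration only and is not Ronin's algorithm, which handles far more cases. The function name split_identifier_simple and the sample class names are made up for this example.

```r
# Simplified stand-in for an identifier splitter (NOT Ronin's algorithm)
split_identifier_simple <- function(identifier) {
  # Break between a lowercase letter or digit and a following uppercase letter
  spaced <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", identifier, perl = TRUE)
  # Break an acronym from a following capitalized word, e.g. "XMLHttp"
  spaced <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", spaced, perl = TRUE)
  # Treat underscores as separators
  spaced <- gsub("_+", " ", spaced, perl = TRUE)
  strsplit(trimws(spaced), " +")[[1]]
}

example_classnames <- c("JupiterTestEngine", "XMLHttpRequest", "snake_case_name")

# Split each name, then join its tokens with ";" as done for the table above
example_tokens <- vapply(
  lapply(example_classnames, split_identifier_simple),
  paste, character(1), collapse = ";"
)
example_tokens
```

For these three inputs the result is "Jupiter;Test;Engine", "XML;Http;Request", and "snake;case;name", which mirrors the ";"-joined tokens column built above.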

File Filtering

Since we have a table, we can use Kaiaulu's filter functions to do some pre-processing.

Note that the project configuration file lists the filepath substrings to be removed:

substring_filepath

Should we wish to remove such filepaths, we can do so as follows:

nrow(query_table)

query_table <- query_table  %>%
  filter_by_file_extension(file_extensions,"filepath")  %>%
  filter_by_filepath_substring(substring_filepath,"filepath")
nrow(query_table)
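The effect of the two filters can be sketched in base R on a toy table. This is an approximation of the idea (keep filepaths ending with a listed extension, then drop filepaths containing a listed substring) and is not Kaiaulu's implementation; the exact matching rules of filter_by_file_extension and filter_by_filepath_substring may differ. All example_* names and file paths are made up for this illustration.

```r
example_files <- data.frame(
  filepath = c("src/main/java/Foo.java",
               "src/test/java/FooTest.java",
               "docs/readme.md"),
  stringsAsFactors = FALSE
)
example_extensions <- c("java", "py")
example_substrings <- c("test", "java_code_examples")

# Keep rows whose filepath ends with one of the extensions
extension_regex <- paste0("\\.(", paste(example_extensions, collapse = "|"), ")$")
has_extension <- grepl(extension_regex, example_files$filepath)

# Flag rows whose filepath contains any of the substrings (literal match)
has_substring <- vapply(
  example_files$filepath,
  function(path) {
    any(vapply(example_substrings, grepl, logical(1), x = path, fixed = TRUE))
  },
  logical(1)
)

example_filtered <- example_files[has_extension & !has_substring, , drop = FALSE]
example_filtered$filepath
```

Only src/main/java/Foo.java survives: the Markdown file has no listed extension, and the test file contains the substring "test".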

Finally, we lowercase all class name tokens for topic matching:

query_table$tokens <- tolower(query_table$tokens)

Topic Matching

What is left is to compare each class's tokens column against the list of topics provided in the project configuration file.

topics

First, let's split the tokens again. Here's a sample for clarity:

split_tokens <- stringi::stri_split_regex(query_table$tokens,pattern = ";")
split_tokens[1:2]

Each topic's keywords will be compared against a class's tokens. If any of a topic's keywords equals any of the class's tokens, we mark that topic as TRUE for the class.

is_a_topic_match <- function(split_token,topic){
  is_match <- any(topic %in% split_token)  
  return(is_match)
}

for(i in seq_along(topics)){
  topic_column_name <- paste0("topic_",as.character(i))
  query_table[[topic_column_name]] <-
    sapply(split_tokens,is_a_topic_match,topics[[topic_column_name]])  
}
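A small worked example may help clarify the matching rule. The sketch below repeats the same one-line matching function and applies it to made-up tokens and a two-topic subset of the configuration; toy_split_tokens, toy_topics, and toy_matches are names invented for this illustration.

```r
# Same matching rule as above: TRUE if any topic keyword is among the tokens
is_a_topic_match <- function(split_token, topic) {
  any(topic %in% split_token)
}

toy_split_tokens <- list(
  c("test", "execution", "listener"),
  c("jupiter", "engine"),
  c("listeners", "registry")
)
toy_topics <- list(
  topic_1 = c("model", "view", "controller"),
  topic_3 = c("observer", "listener")
)

# One column per topic, one row per class
toy_matches <- sapply(
  toy_topics,
  function(topic) sapply(toy_split_tokens, is_a_topic_match, topic)
)
toy_matches
```

Only the first class matches topic_3. The third class does not, because matching is on whole tokens: "listeners" is not equal to "listener". If plural or inflected forms should count, the tokens would need to be normalized (for example, stemmed) before matching.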

And finally, a sample of the final table:

head(query_table)   %>%
  gt(auto_align = FALSE)

The full list of matched classes is given in the following code block, where we keep every class for which at least one topic is TRUE:

# Select the topic columns by name; 4:length(topics) would miss the last topic columns
topic_columns <- paste0("topic_",seq_along(topics))
any_topic_true <- rowSums(query_table[, topic_columns, with=FALSE]) > 0
query_table[any_topic_true]   %>%
  gt(auto_align = FALSE)