library(xml2)
library(tidyverse)
library(gt)
library(knitr)
library(kableExtra)
library(tidygraph)
library(ggraph)
library(patchwork)
library(magrittr)
library(scales)
library(feather)
library(ggrepel)
library(lubridate)
library(git2r)
library(furrr)

knitr::opts_chunk$set(
  echo = FALSE,
  size = "small",
  warning = FALSE,
  message = FALSE,
  cache = FALSE,
  fig.pos = "H"
)

def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(
    options$size != "normalsize",
    paste0("\n \\", options$size, "\n\n", x, "\n\n \\normalsize"),
    x
  )
})
size_line_of_code <- 160
length_alert_name <- 35
length_alert_name_side_by_side <- 14
size_line_of_code_side_by_side <- 77

pmd_path <- "pmd/bin/pmd.bat"
rule_path <- "rulesets/java/quickstart.xml"
output_path <- ""

examples <- tribble(
  ~name,                 ~path, ~output,
  "Versão Old Original", "old", "old_original",
  "Versão New 1",        "new", "new_1"
) %>%
  mutate(id = row_number())
\section{Introduction}\label{intro}
This document is part of a research project about software degradation caused by careless developers' behavior and about strategies to deal with such undesired behavior. These strategies will possibly be inspired by concepts from game theory.
We assume that software degradation can be measured by the number and the types of \textit{kludges} made by software developers in the code. A kludge is code that
```{=tex}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
  Partially fixes a bug or partially implements a feature.
\end{enumerate}
```

```{=tex}
\setlength{\parindent}{1.0cm}
\hangindent=1.0cm
```
The term \textit{partial} can be understood as in \textit{partial functions}. A partial function is undefined for some elements of its formal domain. For instance, consider the square root function restricted to the integers: $f(25)$ is defined, but $f(26)$ is undefined. In terms of features, we can think of a developer who computes the point at which two lines cross and neglects the case of parallel lines.

```{=tex}
\begin{enumerate}
\setcounter{enumi}{1}
\tightlist
\item
  The developer knows, with high probability, that the code is only a partial solution.~\footnote{We need to study technical debt papers to enrich the conceptual background.}
\end{enumerate}
```
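To make the notion of a partial solution concrete, the following sketch contrasts a partial and a total version of the line-crossing example. It is only an illustration under our own assumptions: lines are written as $y = ax + b$, and the function names \texttt{intersect\_partial} and \texttt{intersect\_total} are hypothetical.

```r
# Hypothetical illustration: each line is given as y = a * x + b.

# Partial ("kludge") version: silently assumes the lines are not parallel.
intersect_partial <- function(a1, b1, a2, b2) {
  x <- (b2 - b1) / (a1 - a2)  # division by zero when a1 == a2
  c(x = x, y = a1 * x + b1)
}

# Total version: handles the parallel case explicitly.
intersect_total <- function(a1, b1, a2, b2) {
  if (a1 == a2) {
    return(NULL)  # parallel or coincident lines: no single crossing point
  }
  intersect_partial(a1, b1, a2, b2)
}

p <- intersect_total(1, 0, -1, 2)   # y = x and y = -x + 2 cross at (1, 1)
q <- intersect_total(2, 1, 2, 5)    # parallel lines: NULL
r <- intersect_partial(2, 1, 2, 5)  # partial version: non-finite coordinates
```

The partial version works for every input a developer is likely to try first, which is what makes this kind of kludge easy to introduce and hard to notice.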
This project aims to study how software projects evolve in terms of the number and kinds of kludges. So far, we are trying to identify kludges by looking at alerts generated by the PMD source code analyzer. PMD is a static source code analyzer that is commonly used to find potential programming flaws. PMD is a good choice for a researcher who wants to analyze bad practices, because it supports multiple languages and is very flexible. It allows researchers to provide their own rules for finding interesting patterns in source code written in Java or other languages. These are the planned steps for this research project:

```{=tex}
\begin{itemize}
\item Confirm the assumption that the frequency of PMD alerts is an accurate measure of the prevalence of kludges;
\item Confirm the assumption that kludges harm software development;
\item Confirm the assumption that there is a game in which, in Nash equilibrium, a developer chooses a strategy that brings personal benefits while harming the project by making kludges;
\item If all these assumptions are true, use mechanism design to devise how we can change the environment so that developers do not choose to make so many kludges, increasing the quality of the project in the long run;
\item Implement this mechanism by building a plugin for a prominent CI tool, such as Travis, Jenkins or GitLab.
\end{itemize}
```
In this document, we evaluate PMD source code alerts as proxies for kludges. In Section \ref{pmd}, we present the PMD source code analyzer and show how we use the tool to generate alerts that are possible kludges. We also use this tool to create a simplified AST that supports the algorithm that infers how many new alerts were created, how many were fixed and how many remain open in a transition from an old version to a new version. This algorithm is described in Section \ref{alg}. In Section \ref{results}, we compare the creation of new alerts with the insertion of Self-Admitted Technical Debt (SATD) comments.
% =========================================================
% PMD SOURCE CODE ANALYZER
% =========================================================
\section{The PMD Source Code Analyzer}\label{pmd}
We use PMD to list the alerts that represent \textit{possible kludges} in source code. PMD receives source code as input and generates a list of the bad programming practices it contains, i.e., the alerts.
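As a rough sketch of how such a run can be launched from R, the snippet below builds the command-line arguments for PMD. It assumes the \texttt{-d}, \texttt{-R} and \texttt{-f} options of the PMD command line and reuses the paths from our setup; \texttt{pmd\_args} is a hypothetical helper, and the actual call is left commented out because it requires a local PMD installation.

```r
# Build the argument vector for a PMD run (PMD 6-style options; an assumption).
pmd_args <- function(source_path, ruleset_path, format = "xml") {
  c(
    "-d", source_path,   # file or directory to analyse
    "-R", ruleset_path,  # rule set (XML) that defines the alerts
    "-f", format         # report format
  )
}

args <- pmd_args("little-tree/code.java", "rulesets/java/quickstart.xml")

# Requires a local PMD installation, so the call is not executed here:
# report <- system2("pmd/bin/pmd.bat", args, stdout = TRUE)
```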
PMD traverses the AST of the source code searching for violations of rules that are configured by the user. PMD comes with a default rule set for the Java programming language. The default rule set finds common programming flaws such as unused variables, empty catch blocks, unnecessary object creation, and so forth. It is possible to configure a different set of rules by creating a custom XML file. Below, we show a simple piece of code and the alerts that were generated by the default rule set. In this example, PMD generates two alerts of the type ControlStatementBraces (CSB), on lines 11 and 20. This alert means that the body of a control statement is not enclosed in braces.
MUDOU <- TRUE
saida_alg2 <- kludgenudger::calculate_features_from_versions(
  code_file_old = "little-tree/code.java",
  code_file_new = "little-tree-new/code.java",
  pmd_path = pmd_path,
  glue_string = "{.data$id_alert}:line:{.data$beginline},\n{.data$small_rule}.{if_else(is.na(.data$rule_alert),'',paste0('\n',.data$rule_alert))}",
  mostra_new = c(2, 3, 4, 16),
  mostra_old = c(2, 3, 5, 14, 12),
  blockrules_location = "data/blockrules/blockrules_simple.xml",
  optimize_feature_calculation = FALSE
)
\newpage
```{java code old simple, code=kludgenudger::read_and_decorate_code_and_alerts("little-tree/code.java", saida_alg2$versions_executed$pmd_output[[1]], FALSE, 10, use_mnemonic = TRUE), echo = TRUE}
```
\subsection{Using PMD to capture the history of alerts}\label{history}

To evaluate how the number of alerts evolved throughout the history of a software project, we must be able to analyze a pair of versions of the source code (an old and a new version) and categorise each alert contained in the code as either \textbf{new}, \textbf{fixed} or \textbf{open}.

A PMD alert generated for the old version is either \textbf{open} or \textbf{fixed} in the new version. An \textbf{open} alert remains in the new version of the code. A \textbf{fixed} alert does not exist in the new version.

A PMD alert generated for the new version is either \textbf{open} or \textbf{new}. An \textbf{open} alert indicates that the same alert was identified in the old version of the source code. A \textbf{new} alert implies that the same alert cannot be identified in the old version.

No alert belongs to more than one of the categories \textbf{fixed}, \textbf{new} and \textbf{open}. The alerts identified as \textbf{open} are equivalent in both the new and the old versions.

To decide whether an alert is \textbf{open}, \textbf{fixed} or \textbf{new}, one has to determine whether the alert in the old version is equivalent to an alert in the new version. This document describes an algorithm to make this classification.

\subsection{Using PMD to generate a simplified Abstract Syntax Tree}\label{ast}

To classify alerts as \textbf{new}, \textbf{open} and \textbf{fixed}, we could match the lines of the old version to the lines of the new version of the source code using a diff tool. Diff is useful, but not sufficient. Frequently, the source code changes are more complex than the ones we could address by using only diff. An alert $o$ in the old version could be essentially the same as an alert $n$ in the new version, but the piece of code where $o$ resides could have been moved in such a way that it is impossible to match $o$ to $n$ using only the information from diff.
Section \ref{alg} presents an algorithm that uses extra information from a simplified AST created using PMD. PMD traverses the source code by visiting many different kinds of elements. We do not use all the types of nodes recognized by PMD to generate the simplified AST, because many kinds of nodes are not needed by the current heuristic that classifies the alerts, as described in Section \ref{heuristic}. If we used all the kinds of nodes, we would end up with a tree that would not add value to our analysis but would add complexity to our algorithm. The kinds of elements that were selected are listed below:

```{=tex}
\begin{itemize}
\item \textbf{Block}: a block of statements enclosed by braces;
\item \textbf{ClassOrInterfaceBody}: the body of an interface or a class, excluding the declaration;
\item \textbf{CompilationUnit}: the root of an AST;
\item \textbf{Method}: a method, including body and declaration;
\item \textbf{Constructor}: a constructor, including body and declaration;
\item \textbf{Statement}: any statement, like an if statement or an assignment.
\end{itemize}
```
PMD accepts an optional XML rule file along with the source code to be analyzed. The default XML file generates alerts when it finds common bad practices, but to create the simplified AST we use an XML file with instructions to report all the nodes of the kinds listed above. The PMD output lists these nodes and their location in the source code, as shown in Table \ref{tab_nodes}.
saida_alg2$graph_old_with_alert %>%
  as_tibble() %>%
  select(
    `Kind of node` = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>%
  kable(
    caption = "Output from PMD when creating a simplified AST\\label{tab_nodes}"
  ) %>%
  kable_styling(
    latex_options = c("striped", "hold_position")
  )
% MARCIO: what does it mean to be a "descendant" in the paragraph below?
% Bruno: I believe I have improved the explanation.
% MARCIO: I am not sure "descendant" is the best term, since there is no inheritance here. Wouldn't it be a "component" or "part of B"?
% Bruno: I kept "descendant" only in the context of the AST. In this case I believe it is the correct term. I could also use "subchild", but it is a less common term.
Table \ref{tab_nodes} gives the location of each node in terms of lines and columns. From this information we can infer which nodes are contained in other nodes: if a node A lies within the range of lines and columns of a node B, then A is a descendant of B in the AST. However, we cannot tell directly whether A is a child, a grandchild, or a more distant descendant of B. We follow three steps to recreate the AST:
```{=tex}
\begin{enumerate}
\item Link each node (a) to the set of nodes (X) that are fully contained between the begin line / begin column and end line / end column of node (a). We can construct a directed graph in which the vertices are the nodes and the edges are these links. This is not a tree yet, because each node has edges directed to all its descendants and not only to its children in the AST;

\item Sort the nodes in decreasing order of their number of outgoing edges, i.e., of descendants. The objective is to ensure that, in a search through this graph, the first node chosen from a given node is one of its children in the AST, and not merely a descendant;

\item Perform a depth-first search starting from the compilation unit node.
\end{enumerate}
```
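The three steps above can be sketched in base R as follows. The node table is a hypothetical example rather than PMD output, and the sketch assumes that no two nodes cover exactly the same range (otherwise containment would be mutual and a tie-breaking rule would be needed).

```r
# Hypothetical node list in the shape of the PMD output table.
nodes <- data.frame(
  kind = c("CompilationUnit", "ClassOrInterfaceBody", "Method", "Block",
           "Statement", "Method", "Block"),
  beginline   = c(1, 1, 2, 2, 3, 6, 6),
  begincolumn = c(1, 14, 3, 16, 5, 3, 16),
  endline     = c(10, 10, 5, 5, 3, 9, 9),
  endcolumn   = c(1, 1, 3, 3, 10, 3, 3)
)

# TRUE when position (l1, c1) is at or before position (l2, c2).
at_or_before <- function(l1, c1, l2, c2) l1 < l2 | (l1 == l2 & c1 <= c2)

n <- nrow(nodes)

# Step 1: descendants[[b]] holds every node fully contained in node b.
descendants <- lapply(seq_len(n), function(b) {
  inside <- at_or_before(nodes$beginline[b], nodes$begincolumn[b],
                         nodes$beginline, nodes$begincolumn) &
    at_or_before(nodes$endline, nodes$endcolumn,
                 nodes$endline[b], nodes$endcolumn[b])
  setdiff(which(inside), b)
})

# Step 2: candidates with more descendants are visited first.
n_desc <- lengths(descendants)

# Step 3: depth-first search from the root; the first visit fixes the parent.
parent <- rep(NA_integer_, n)
visited <- rep(FALSE, n)
visit <- function(node) {
  visited[node] <<- TRUE
  candidates <- descendants[[node]]
  for (child in candidates[order(-n_desc[candidates])]) {
    if (!visited[child]) {
      parent[child] <<- node
      visit(child)
    }
  }
}
visit(which(nodes$kind == "CompilationUnit"))

parent  # parent index of each node; NA marks the root
```

The ordering in step 2 matters: a node that is not a direct child always has an ancestor with strictly more descendants among the candidates, so that ancestor is visited first and claims the node during its own search.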
% =========================================================
% RESEARCH QUESTIONS
% =========================================================

```{=tex}
\section{Research questions} \label{as_whole}
```
\noindent \textbf{RQ: Is the frequency of PMD Alerts an accurate measure of the prevalence of kludges?} \label{PMD_Kludge}
In a given transition between an old version and a new version, we want to identify whether there was an intense introduction of PMD alerts. This evidence of possible kludge introduction must be normalized by the size of the change in source code between the two versions. We do this using the following formula:
%\footnote{Another possibility is to use this formula [ \frac{#NewAlerts - #FixedAlerts}{Change} ], where Change can be a measure based on the differences between the versions}:
% MARCIO: this is odd. You only mention the size of the change in the footnote.
% Bruno: I made explicit what I call change.
% MARCIO: ok, but wouldn't it be better to put everything in the same equation, as I did below?
% Bruno: OK
$$ \frac{\#NewAlerts - \#FixedAlerts}{Change} $$

where

$$Change = \#NewAlerts + \#FixedAlerts$$
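As a small worked example of the formula above, the sketch below computes the measure for hypothetical transitions. The function name and the choice to return \texttt{NA} when no alert was introduced or fixed (where the ratio is undefined) are our assumptions.

```r
# Normalized alert introduction for one version transition.
alert_intensity <- function(n_new, n_fixed) {
  change <- n_new + n_fixed
  if (change == 0) {
    return(NA_real_)  # assumption: undefined when no alert changed
  }
  (n_new - n_fixed) / change
}

alert_intensity(3, 1)  # (3 - 1) / (3 + 1) = 0.5
alert_intensity(0, 1)  # (0 - 1) / (0 + 1) = -1
```

Because both counts are non-negative, the measure always lies between $-1$ (alerts were only fixed) and $1$ (alerts were only introduced).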
With each version transition measured in terms of the introduction of PMD alerts, we can correlate these events with other evidence of kludges. In Section \ref{results}, we calculate this correlation using \textit{Self-Admitted Technical Debt} (SATD) comments.
% MARCIO: you should use and reference the equation above in Section 5.
\vspace{16px}
\noindent \textbf{RQ: Do kludges harm software development?} \label{kludge_harm}
We need some way to measure degradation after a heavy introduction of kludges. A drop in popularity may not be proper evidence. An increase in the number of issues and bug fixes does not necessarily represent degradation either. Churn could be used here.
% =========================================================
% ALGORITHMS
% =========================================================

```{=tex}
\section{An algorithm to classify alerts} \label{alg}
```
This section discusses the algorithm to classify alerts as open, fixed, or new. The algorithm uses the simplified AST described in Section \ref{ast} to create features that help to infer whether two alerts in different versions must be considered the same.

We use the term \textit{feature} as it is used in the field of statistical learning (Kuhn and Johnson, 2019). In this field, the variables that are used to predict the outcome of an event are called \emph{independent variables}, \emph{predictors}, or \emph{features}. The term \emph{feature} is most appropriate when it refers to a variable that is a composite of one or more variables or the result of a treatment applied to raw data.

Given a pair of alerts, one from the old version and one from the new version, we use the mapping between the lines of the old and the new versions and their ASTs as raw data to create the features that will be used to infer whether they are the same alert. At the moment, we use a heuristic to infer whether the pair of alerts is the same, as described in Section \ref{heuristic}. Figure \ref{fig:diag} shows the steps of the algorithm we are describing.

```{=tex}
\begin{figure}
\centering
\smartdiagram[sequence diagram]{Get alerts for each version, Create AST for each version, Map new lines to old lines, Calculate features for each pair of alert, Decide if the alerts are the same, Classify alerts }
\vspace{5mm}\par
\caption{Steps of the algorithm to classify alerts}\label{fig:diag}
\end{figure}
```
\subsection{An illustrative example}\label{source_used}
Hereafter, we consider the old and new versions of the example source code presented below. In the new version, the alert generated on line 11 of the old version was fixed, but the other alert remained. So we expect one \textbf{Fixed} alert and one \textbf{Open} alert in the old version, and the same \textbf{Open} alert in the new version.
```{java showing codes, code=kludgenudger::read_and_decorate_code_and_alerts_mapped("little-tree/code.java", saida_alg2$versions_executed$pmd_output[[1]], "little-tree-new/code.java", saida_alg2$versions_executed$pmd_output[[2]], saida_alg2$versions_crossed$lines_map[[1]], TRUE, 20, TRUE, 60), echo=TRUE, size="scriptsize"}
```
\subsection{Get alerts for each version}

For the new and the old versions, we run PMD using the default rule set, as described in Section \ref{history}. Table \ref{old_alerts} is created for the old version and Table \ref{new_alerts} for the new version.

```r
saida_alg2$versions_executed$pmd_output[[1]] %>%
  select(
    `Kind of node` = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>%
  kable(
    caption = "Old version's alerts\\label{old_alerts}"
  ) %>%
  kable_styling(
    latex_options = c("striped", "HOLD_position")
  )

saida_alg2$versions_executed$pmd_output[[2]] %>%
  select(
    `Kind of node` = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>%
  kable(
    caption = "New version's alerts\\label{new_alerts}"
  ) %>%
  kable_styling(
    latex_options = c("striped", "HOLD_position")
  )
```
\subsection{Create AST for each version}
For each version of the source code the algorithm creates a simplified AST as described in Section \ref{ast}. In Figure \ref{AST_compare_id_alerts} we can see the ASTs for the old and the new versions. In this figure, the numbers in the nodes are meaningless and are presented only for reference.
```{r, fig.pos="H", cache=TRUE}
chart_graph_new <- kludgenudger::show_ast(
  saida_alg2$graph_new_with_alert,
  size_label = 3,
  show_label = TRUE,
  alpha_label = "mostra",
  name_field = "glue",
  aspect = 0.5,
  title = "Simplified AST for the new version"
)

chart_graph_old <- kludgenudger::show_ast(
  saida_alg2$graph_old_with_alert,
  size_label = 3,
  show_label = TRUE,
  alpha_label = "mostra",
  name_field = "glue",
  aspect = 0.5,
  title = "Simplified AST for the old version"
)

chart_graph_old / chart_graph_new
```
\subsection{Map new lines to old lines}\label{map}

For each difference stated in the output of \textit{git diff} (the sections of the diff file starting with ``@@''), there is an indication of the number of lines removed from the old version and the number of lines added to the new one. The diff also indicates the line at which lines are removed from the old version and the line at which lines are added to the new version. Using this information, we create a mapping from the lines in the old version to the equivalent lines in the new version. For the new and old versions presented in Section \ref{source_used}, the relation is shown in Table \ref{table_map}.\footnote{The mapping shown begins at line 5 in order to save space, since lines 1-4 in the old version map to lines 1-4 in the new version.}

```r
saida_alg2$versions_crossed$lines_map[[1]] %>%
  ungroup() %>%
  mutate(
    row = row_number(),
    na_mark = if_else(is.na(map_remove) | is.na(map_add), row, NA_integer_),
    next_na = na_mark,
    last_na = na_mark
  ) %>%
  fill(next_na, .direction = "up") %>%
  fill(last_na, .direction = "down") %>%
  replace_na(
    list(
      last_na = 0,
      next_na = nrow(saida_alg2$versions_crossed$lines_map[[1]]) + 1
    )
  ) %>%
  mutate(
    dist_next = next_na - row,
    dist_last = row - last_na + 0.1
  ) %>%
  rowwise() %>%
  mutate(min_dist = min(dist_next, dist_last)) %>%
  filter(min_dist < 4) %>%
  ungroup() %>%
  mutate(
    map_remove = case_when(
      min_dist == 3.1 ~ str_glue("{lag(map_remove)+1}-"),
      min_dist == 3.0 ~ str_glue("-{lead(map_remove)-1}"),
      TRUE ~ map_remove %>% as.character()
    ),
    map_add = case_when(
      min_dist == 3.1 ~ str_glue("{lag(map_add)+1}-"),
      min_dist == 3.0 ~ str_glue("-{lead(map_add)-1}"),
      TRUE ~ map_add %>% as.character()
    )
  ) %>%
  select(old = map_remove, new = map_add) %>%
  mutate(
    old = if_else(is.na(old), str_glue("\\textcolor{{white}}{{{row_number()}}}"), old),
    new = if_else(is.na(new), str_glue("\\textcolor{{white}}{{{row_number()}}}"), new)
  ) %>%
  pivot_wider(
    names_from = old,
    values_from = new,
    names_repair = "minimal"
  ) %>%
  kable(
    caption = "Relation between lines of the old version and lines of the new version\\label{table_map}",
    escape = FALSE
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("HOLD_position", "scale_down")
  )
```
\subsection{Calculate features for each pair of new and old alert}
For the proposed example, we calculate features for $2 \cdot 1 = 2$ combinations of new and old alerts, since we have 2 old alerts and 1 new alert. Table \ref{combination} shows the combinations for which features are calculated.
old_alerts <- saida_alg2$versions_executed$pmd_output[[1]] %>%
  select(
    `Begin Line Old` = beginline,
    # `End Line Old` = endline,
    `Rule Old` = rule
  )

new_alerts <- saida_alg2$versions_executed$pmd_output[[2]] %>%
  select(
    `Begin Line New` = beginline,
    # `End Line New` = endline,
    `Rule New` = rule
  )

combinations <- old_alerts %>%
  crossing(new_alerts)

combinations %>%
  kable(
    caption = "Combinations of new and old alerts for which the features must be calculated \\label{combination}",
    escape = FALSE,
    col.names =
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("HOLD_position")
  )
\vspace{16pt}
For each alert, PMD Source Code Analyzer returns the following attributes:
\vspace{16pt}
We propose the following features to calculate for each combination:
\textbf{Same Rule}: a Boolean indicator that tells if the alerts are of the same type;
\textbf{Same Group ID}: a Boolean indicator that tells if the alerts are equivalent in terms of begin line and end line, considering the mapping described in Section \ref{map}. For each combination, Table \ref{same_group} shows the begin line of the old alert, the corresponding line in the new version, and the begin line of the new alert\footnote{We suppress the end lines because, for all alerts in the example, the begin and end lines are the same.}. The alert that begins on line 20 of the old version corresponds to the alert that begins on line 22 of the new version. So, for this combination, the ``Same Group ID'' feature is TRUE. For the other combination, it is FALSE.
map_lines <- saida_alg2$versions_crossed$lines_map[[1]] %>%
  select(
    map_line_old = map_remove,
    map_line_new = map_add
  )

map_old <- map_lines %>%
  rename(`Corresponding line in new version` = map_line_new)

combinations %>%
  left_join(
    map_old,
    by = c("Begin Line Old" = "map_line_old")
  ) %>%
  select(
    `Begin Line Old`,
    `Rule Old`,
    `Corresponding line in new version`,
    `Begin Line New`,
    `Rule New`
  ) %>%
  mutate(
    `Same group` = `Corresponding line in new version` == `Begin Line New`
  ) %>%
  kable(
    caption = "Same group feature \\label{same_group}",
    escape = FALSE,
    col.names = c(
      "Begin Line Old",
      "Rule Old",
      "Corresponding line\n in new version",
      "Begin Line New",
      "Rule New",
      "Same group"
    ) %>% linebreak()
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("scale_down", "HOLD_position")
  )
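As a rough illustration of how the line mapping of Section \ref{map} feeds the ``Same Rule'' and ``Same Group ID'' features, the sketch below parses unified-diff hunk headers, maps old lines to new lines, and evaluates both features. It assumes that the diff was produced without context lines (for example with \texttt{git diff -U0}), uses a hypothetical hunk header \texttt{@@ -11 +11,3 @@} chosen to be consistent with the running example, and compares begin lines only. The function names are illustrative and are not part of the kludgenudger package.

```r
# Parse one unified-diff hunk header such as "@@ -11 +11,3 @@".
# An omitted count means 1, as in the unified diff format.
parse_hunk_header <- function(header) {
  pattern <- "^@@ -([0-9]+)(,([0-9]+))? \\+([0-9]+)(,([0-9]+))? @@"
  m <- regmatches(header, regexec(pattern, header))[[1]]
  if (length(m) == 0) stop("not a hunk header: ", header)
  count <- function(x) if (nzchar(x)) as.integer(x) else 1L
  list(
    old_start = as.integer(m[2]),
    old_count = count(m[4]),
    new_count = count(m[7])
  )
}

# Map every old line to its new line number; removed lines get NA.
# Assumption: hunks carry no context lines (zero-context diff).
map_old_to_new <- function(headers, n_old) {
  shift <- integer(n_old)
  removed <- rep(FALSE, n_old)
  for (h in lapply(headers, parse_hunk_header)) {
    if (h$old_count > 0) {
      removed[seq(h$old_start, length.out = h$old_count)] <- TRUE
    }
    # Lines after the hunk move by (added - removed) lines.
    last_touched <- h$old_start + max(h$old_count - 1L, 0L)
    after <- seq_len(n_old) > last_touched
    shift[after] <- shift[after] + (h$new_count - h$old_count)
  }
  line_map <- seq_len(n_old) + shift
  line_map[removed] <- NA_integer_
  line_map
}

line_map <- map_old_to_new("@@ -11 +11,3 @@", n_old = 25)

# Cross the two old alerts with the single new alert of the running example.
pairs <- merge(
  data.frame(begin_old = c(11, 20), rule_old = "ControlStatementBraces"),
  data.frame(begin_new = 22, rule_new = "ControlStatementBraces")
)
pairs$same_rule <- pairs$rule_old == pairs$rule_new
pairs$corresponding <- line_map[pairs$begin_old]
pairs$same_group <- !is.na(pairs$corresponding) &
  pairs$corresponding == pairs$begin_new
```

Under these assumptions, old line 11 has no counterpart and old line 20 maps to new line 22, so only the second combination has the ``Same Group ID'' feature set to TRUE.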
\noindent \textbf{Same Method Group ID}: a Boolean indicator that tells if the alerts belong to methods that are in the same group in the sense of the ``Same Group ID'' feature we described above. First, we find each alert's method by following the path from the alert's node to the root. The first node of the kind ``method'' or ``constructor'' found on this path defines the alert's method. Considering the proposed example, Figure \ref{path_node_to_root_1} shows the ASTs for the first combination of alerts (see Table \ref{combination}). In this combination, the method of the old alert is in node 5 of the left tree and the method for the new alert is in node 4 of the right tree. Table \ref{tab_same_method} shows that the begin and end lines of the method in the old version do not correspond to the lines in the new version. The correspondence uses the mapping defined in Section \ref{map}. For this combination, the ``Same Method Group ID'' feature is FALSE.
% MARCIO: above, is it really "begin and end lines of the alert", or should it be "begin and end lines of the methods"?
% Bruno: yes
```{r, fig.pos="H", cache=TRUE}
path_old <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_old %>%
    rename(id_alert = id_alert_old, graph = graph_old) %$%
    graph[[1]],
  size_label = 3,
  aspect = 4,
  nudge_x = 0.5,
  title = "Path from alert in old version, \nin line 11, to root"
)

path_new <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_new %>%
    rename(id_alert = id_alert_new, graph = graph_new) %$%
    graph[[1]],
  size_label = 3,
  aspect = 4,
  nudge_x = 0.5,
  title = "Path from alert in new version, \nin line 22, to root"
)

(path_old + plot_spacer() + path_new)
```

```r
tribble(
  ~"-", ~"Old version", ~"New version",
  "Begin line", 8, 19,
  "End line", 15, 27
) %>%
  left_join(
    map_lines,
    by = c("Old version" = "map_line_old")
  ) %>%
  select(
    `-`,
    `Old version`,
    `Corresponding line in the new version` = map_line_new,
    `New version`
  ) %>%
  kable(
    caption = "Defining if Same Method Group ID \\label{tab_same_method}",
    escape = FALSE,
    col.names = c(
      "-",
      "Old version",
      "Corresponding line\nin the new version",
      "New version"
    ) %>% linebreak()
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("HOLD_position")
  )
```
Figure \ref{path_node_to_root_2} shows the ASTs for the second combination of alerts (see Table \ref{combination}). In this combination the method of the old alert is in node 3 of the left tree and the method for the new alert is in node 4 of the right tree. Table \ref{tab_same_method_2} shows that the begin and end lines of the method in the old version do correspond to the lines in the new version. The correspondence uses the map defined in Section \ref{map}. For this combination, the ``Same Method Group ID'' feature is TRUE.
% MARCIO: same remark as above ...
% Bruno: yes
```{r, fig.pos="H", cache=TRUE}
path_old <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_old %>%
    rename(id_alert = id_alert_old, graph = graph_old) %$%
    graph[[2]],
  size_label = 3,
  aspect = 4,
  nudge_x = 0.5,
  title = "Path from alert in old version, \nin line 20, to root"
)

path_new <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_new %>%
    rename(id_alert = id_alert_new, graph = graph_new) %$%
    graph[[1]],
  size_label = 3,
  aspect = 4,
  nudge_x = 0.5,
  title = "Path from alert in new version, \nin line 22, to root"
)

(path_old + plot_spacer() + path_new)
```

```r
tribble(
  ~"-", ~"Old version", ~"New version",
  "Begin line", 17, 19,
  "End line", 25, 27
) %>%
  left_join(
    map_lines,
    by = c("Old version" = "map_line_old")
  ) %>%
  select(
    `-`,
    `Old version`,
    `Corresponding line in the new version` = map_line_new,
    `New version`
  ) %>%
  kable(
    caption = "Defining if Same Method Group ID \\label{tab_same_method_2}",
    escape = FALSE,
    col.names = c(
      "-",
      "Old version",
      "Corresponding line\nin the new version",
      "New version"
    ) %>% linebreak()
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("HOLD_position")
  )
```
\noindent \textbf{Same Method Name}: a Boolean indicator that tells if the alerts were found in a method with the same name. The methods for the alerts are found as for the ``Same Method Group ID''. However, instead of the corresponding lines, this feature evaluates the name of the methods. If the method related to the old alert and the method related to the new alert have the same name (even if the begin and end lines are not corresponding), then this feature is set to TRUE.
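The method lookup shared by the ``Same Method Group ID'' and ``Same Method Name'' features can be sketched as a walk from a node towards the root. The tree below is hypothetical and is stored as a vector of parent indices; \texttt{enclosing\_method} is an illustrative name.

```r
# Hypothetical simplified AST: kind of each node and index of its parent.
kind <- c("CompilationUnit", "ClassOrInterfaceBody", "Method", "Block",
          "Statement", "Method", "Block", "Statement")
parent <- c(NA, 1, 2, 3, 4, 2, 6, 7)  # NA marks the root

# Follow the path to the root and return the first method or constructor.
enclosing_method <- function(node, parent, kind) {
  while (!is.na(node)) {
    if (kind[node] %in% c("Method", "Constructor")) {
      return(node)
    }
    node <- parent[node]
  }
  NA  # the node is not inside any method or constructor
}

m_first  <- enclosing_method(5, parent, kind)  # statement inside the first method
m_second <- enclosing_method(8, parent, kind)  # statement inside the second method
m_none   <- enclosing_method(2, parent, kind)  # class body: no enclosing method
```

Once the two method nodes are known, ``Same Method Group ID'' compares their mapped begin and end lines, whereas ``Same Method Name'' compares only their names.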
\noindent \textbf{Same Block}: a Boolean indicator that shows if the alerts belong to the same block. The blocks are defined in a similar way as described above, following the path from the node where each alert is towards the root until we find a ``block'', ``method'', or ``constructor'' node. Then, the begin and end lines of the old alert and the corresponding lines in the new version are compared with the begin and end lines of the new alert, as for the ``Same Method Group ID''.
\noindent \textbf{Same Code}: for this feature we compare the source code that is contained in the nodes related to the alerts. The comparison is performed by lexicographically comparing the textual version of the source code residing within the lines and columns ranges (begin and end) of the alerts.
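One possible way to perform this comparison is sketched below. It assumes that lines and columns are 1-based and that the end column is inclusive, uses exact string equality, and works on two hypothetical snippets; \texttt{extract\_range} is an illustrative helper.

```r
# Return the text between (beginline, begincolumn) and (endline, endcolumn).
extract_range <- function(lines, beginline, begincolumn, endline, endcolumn) {
  piece <- lines[beginline:endline]
  n <- length(piece)
  # Cut the tail first so that a single-line range keeps its original columns.
  piece[n] <- substr(piece[n], 1, endcolumn)
  piece[1] <- substr(piece[1], begincolumn, nchar(piece[1]))
  paste(piece, collapse = "\n")
}

# Hypothetical old and new versions: the new one has an extra first line.
old_src <- c("if (x > 0)", "    y = 1;")
new_src <- c("// added comment", "if (x > 0)", "    y = 1;")

old_code <- extract_range(old_src, 2, 5, 2, 10)
new_code <- extract_range(new_src, 3, 5, 3, 10)
same_code <- identical(old_code, new_code)
```

Comparing the extracted text rather than the positions makes the feature insensitive to code that merely moved between the two versions.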
\noindent \textbf{Same Method Code}: for this feature we compare the source code of the methods related to the alerts. The methods related to the alerts are found the way we described for the ``Same Method Group ID'' above.
\noindent \textbf{Line distance}: the distance between the ``mean line'' (\(\frac{beginline + endline}{2}\)) of the new alert and the corresponding ``mean line'' of the old alert. Table \ref{tab_line_distance} shows the line distance for the combinations of new and old alerts in the example we are following.
old_alerts_ld <- saida_alg2$versions_executed$pmd_output[[1]] %>%
  select(
    `Begin Line Old` = beginline,
    `End Line Old` = endline,
    `Rule Old` = rule
  )

new_alerts_ld <- saida_alg2$versions_executed$pmd_output[[2]] %>%
  select(
    `Begin Line New` = beginline,
    `End Line New` = endline,
    `Rule New` = rule
  )

combinations_ld <- old_alerts_ld %>%
  crossing(new_alerts_ld)

map_lines <- saida_alg2$versions_crossed$lines_map[[1]] %>%
  select(
    map_line_old = map_remove,
    map_line_new = map_add
  )

map_old_begin <- map_lines %>%
  rename(`Corresponding begin line in new version` = map_line_new)

map_old_end <- map_lines %>%
  rename(`Corresponding end line in new version` = map_line_new)

combinations_ld %>%
  left_join(
    map_old_begin,
    by = c("Begin Line Old" = "map_line_old")
  ) %>%
  left_join(
    map_old_end,
    by = c("End Line Old" = "map_line_old")
  ) %>%
  select(
    `Begin Line Old`,
    `End Line Old`,
    `Corresponding begin line in new version`,
    `Corresponding end line in new version`,
    `Begin Line New`,
    `End Line New`
  ) %>%
  mutate(
    mean = (`Corresponding begin line in new version` + `Corresponding end line in new version`) / 2,
    `Corresponding mean line` = str_glue("({`Corresponding begin line in new version`} + {`Corresponding end line in new version`}) / 2 = {mean}"),
    `Mean line of new alert` = (`Begin Line New` + `End Line New`) / 2,
    `Line distance` = abs(mean - `Mean line of new alert`)
  ) %>%
  select(-mean) %>%
  kable(
    caption = "Line distance feature \\label{tab_line_distance}",
    escape = FALSE,
    col.names = c(
      "Begin Line\nOld",
      "End Line\nOld",
      "Corresponding\nbegin line\nin new version",
      "Corresponding\nend line\nin new version",
      "Begin Line\nNew",
      "End Line\nNew",
      "Corresponding\nmean line",
      "Mean line of\nnew alert",
      "Line distance"
    ) %>% linebreak()
  ) %>%
  kable_styling(
    font_size = ,
    latex_options = c("scale_down", "HOLD_position")
  )
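For reference, the same computation can be written compactly in base R. The line map below is a hypothetical one in which old line 11 has no counterpart and old lines 12 to 25 move down by two lines; under that assumption the first combination has no defined distance.

```r
# Mean line of an alert.
mean_line <- function(beginline, endline) (beginline + endline) / 2

# Distance between the mapped mean line of the old alert and the mean line
# of the new alert; NA when the old lines have no counterpart.
line_distance <- function(old_begin, old_end, new_begin, new_end, line_map) {
  abs(
    mean_line(line_map[old_begin], line_map[old_end]) -
      mean_line(new_begin, new_end)
  )
}

# Hypothetical map: index = old line, value = corresponding new line.
line_map <- c(1:10, NA, 14:27)

d_first  <- line_distance(11, 11, 22, 22, line_map)  # old line 11: no counterpart
d_second <- line_distance(20, 20, 22, 22, line_map)  # old line 20 maps to line 22
```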
Table \ref{table_features} shows the features calculated for the combinations of new and old alerts in our example.
MUDOU <- TRUE
kludgenudger::report_features(
  saida_alg2,
  "Resulting features\\label{table_features} ",
  types_to_show = c(
    "same_rule",
    "same_id_group",
    "same_method_group",
    "same_method_name",
    "same_block",
    "same_code",
    "same_method_code",
    "dist_line"
  )
)
\subsection{Decide if two alerts are the same based on heuristic}\label{heuristic}
With the features at hand, we must decide whether each combination of old and new alerts refers to the same alert. As a starting point, we use a rule of thumb (a manually devised heuristic) to make this decision. A new and an old alert are declared the same if any of the following rules applies.
All the Boolean features are TRUE and \textit{Line Distance} is equal to 0;
\scriptsize
$$ \begin{aligned} (SameRule \land SameGroupID \land SameMethodGroupID \land \\ SameMethodName \land SameBlock \land SameCode \land \\ SameMethodCode \land LineDistance = 0) \Rightarrow SameAlert \end{aligned}$$
\normalsize
If \textit{Same Method Code} is TRUE, we consider the methods to be the same, even if the method name differs. In this case, the alert code and the kind of the alert must also be the same;
% MARCIO: this rule is a bit odd. Why use Same Method Group ID if the code comparison is being used here?
% Bruno: I am not using Method Group ID, but the method's code. Is it confusing to use the words Code and ID? I can use SourceCode to make it clear that it is not a numeric identifier but the source code.

\scriptsize

$$ \begin{aligned} (SameRule \land SameCode \land SameMethodCode) \Rightarrow SameAlert \end{aligned}$$
\normalsize
If the kind of alert is the same and at least one of the features about the method is TRUE, we consider that the alerts are the same if the line distance is less than 5 lines;
\scriptsize
$$ \begin{aligned} (SameRule \land (SameMethodCode \lor SameMethodGroupID \lor SameMethodName) \land LineDistance < 5) \Rightarrow SameAlert \end{aligned}$$
\normalsize
% MARCIO: what does "we must have evidence that the method is the same" mean?
% MARCIO: doesn't "Same Group ID" = TRUE imply that Line Distance = 0? If so, the last two rules are redundant.
% BRUNO: yes, I can merge them. I was thinking of giving a different weight when I was more certain, but I gave up on that.
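As an illustration, the three rules above can be written as one vectorised predicate. This is only a sketch under stated assumptions: the features are taken to be available in wide format (one row per combination of an old and a new alert, one column per feature, using the feature names shown in Table \ref{table_features}), `dist_line` is taken to be a non-negative distance, and the function name `is_same_alert` and the example values are hypothetical.

```r
# Hypothetical helper (not part of kludgenudger): applies the three heuristic
# rules to a wide data frame with one row per (old alert, new alert) pair.
is_same_alert <- function(features) {
  with(features, {
    # Rule 1: every Boolean feature is TRUE and the line distance is 0.
    rule_1 <- same_rule & same_id_group & same_method_group &
      same_method_name & same_block & same_code & same_method_code &
      dist_line == 0
    # Rule 2: same kind of alert, same alert code and same method code.
    rule_2 <- same_rule & same_code & same_method_code
    # Rule 3: same kind of alert, some evidence that the method is the
    # same, and a line distance below 5.
    rule_3 <- same_rule &
      (same_method_code | same_method_group | same_method_name) &
      dist_line < 5
    rule_1 | rule_2 | rule_3
  })
}

# Illustrative values only (assumed, not taken from the running example).
example_features <- data.frame(
  same_rule         = c(TRUE, TRUE),
  same_id_group     = c(FALSE, FALSE),
  same_method_group = c(FALSE, TRUE),
  same_method_name  = c(FALSE, TRUE),
  same_block        = c(FALSE, TRUE),
  same_code         = c(FALSE, TRUE),
  same_method_code  = c(FALSE, FALSE),
  dist_line         = c(11, 2)
)

same_alert <- is_same_alert(example_features)
```

Because the rules are combined with a logical OR, a pair needs to satisfy only one of them; in the illustrative data only the second pair qualifies, through the third rule.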
Table \ref{table_features_with_decision} shows the resulting features of the two combinations of alerts in the example we are following, together with the final decision on whether they are the same alert.
MUDOU <- TRUE

kludgenudger::report_features(
  saida_alg2,
  "Resulting features\\label{table_features} ",
  types_to_show = c(
    "same_rule",
    "same_id_group",
    "same_method_group",
    "same_method_name",
    "same_block",
    "same_code",
    "same_method_code",
    "dist_line"
  ),
  return_raw_data = TRUE
) %>%
  mutate(
    same_alert = if_else(row_number() <= 8, FALSE, TRUE)
  ) %>%
  kable(
    format = "latex",
    caption = "Resulting features\\label{table_features_with_decision} ",
    escape = TRUE,
    # booktabs = TRUE,
    # align = "r",
    # linesep = "",
    col.names = c(
      "Alert combination",
      "Feature",
      "Feature value",
      "Same alert"
    )
  ) %>%
  kableExtra::collapse_rows(columns = c(1, 4), latex_hline = "major", valign = "top") %>%
  kableExtra::kable_styling(
    latex_options = c("HOLD_position", "striped")
  )
\subsection{Classify alerts}\label{classify}
Finally, we can decide whether the alerts in the old version are open or fixed and whether the alerts in the new version are open or new. In our example, we have two alerts in the old version, and each of them is combined with the alert in the new version. Table \ref{tab_categorizing_old} shows the two old alerts combined with the new alert. The first old alert is not classified as the same as any of the new alerts, so it is marked as \textbf{Fixed}. The second old alert is classified as the same as the new alert, so it is declared \textbf{Open}.

tribble(
  ~begin_line, ~begin_line_new, ~same_alert, ~conclusion,
  11, 22, FALSE, "FIXED",
  20, 22, TRUE, "OPEN"
) %>%
  kable(
    format = "latex",
    caption = "Categorizing alerts of the old version \\label{tab_categorizing_old} ",
    escape = FALSE,
    # booktabs = TRUE,
    # align = "r",
    # linesep = "",
    col.names = c(
      "Begin line of the alert\nin the old version",
      "Begin line of the alert\nin the new version which\nis combined with the old alert",
      "Same alert according\nto heuristic,\n based on features",
      "Category"
    ) %>% linebreak()
  ) %>%
  kableExtra::collapse_rows(columns = c(1, 4), latex_hline = "major", valign = "top") %>%
  kableExtra::kable_styling(
    latex_options = c("HOLD_position", "striped")
  )
Table \ref{tab_categorizing_new} shows the new alert combined with the two old alerts. The new alert is classified as the same as one of the old alerts, so it is declared \textbf{Open}.
tribble(
  ~begin_line, ~begin_line_new, ~same_alert, ~conclusion,
  22, 11, FALSE, "OPEN",
  22, 20, TRUE, "OPEN"
) %>%
  kable(
    format = "latex",
    caption = "Categorizing alerts of the new version \\label{tab_categorizing_new} ",
    escape = FALSE,
    # booktabs = TRUE,
    # align = "r",
    # linesep = "",
    col.names = c(
      "Begin line of the alert\nin the new version",
      "Begin line of the alert\nin the old version which\nis combined with the new alert",
      "Same alert according\nto heuristic,\n based on features",
      "Category"
    ) %>% linebreak()
  ) %>%
  kableExtra::collapse_rows(columns = c(1, 4), latex_hline = "major", valign = "top") %>%
  kableExtra::kable_styling(
    latex_options = c("HOLD_position", "striped")
  )
Table \ref{tab_summary_categories} shows the final categories for the two alerts in the old version and the alert in the new version.
tribble(
  ~alert, ~category,
  "Alert in the old version, line 11", "FIXED",
  "Alert in the old version, line 20", "OPEN",
  "Alert in the new version, line 22", "OPEN"
) %>%
  kable(
    format = "latex",
    caption = "Alerts and their categorization\\label{tab_summary_categories} ",
    escape = FALSE,
    # booktabs = TRUE,
    # align = "r",
    # linesep = "",
    col.names = c(
      "Alert",
      "Category"
    ) %>% linebreak()
  ) %>%
  kableExtra::kable_styling(
    latex_options = c("HOLD_position", "striped")
  )
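The categorization step can be sketched as follows. The sketch assumes a data frame with one row per combination of an old and a new alert and a logical column `same_alert` produced by the heuristic; the function name `classify_alerts` and the column names `begin_line_old` and `begin_line_new` are hypothetical. An old alert with no matching new alert is taken as fixed, a new alert with no matching old alert as new, and every matched alert as open, which reproduces the categories in Table \ref{tab_summary_categories}.

```r
# Hypothetical helper: derives the category of every old and new alert from
# the pairwise "same alert" decisions.
classify_alerts <- function(pairs) {
  # For each alert, check whether it matched at least one alert on the other side.
  old_matched <- tapply(pairs$same_alert, pairs$begin_line_old, any)
  new_matched <- tapply(pairs$same_alert, pairs$begin_line_new, any)

  old <- data.frame(
    version = "old",
    begin_line = as.numeric(names(old_matched)),
    category = ifelse(as.vector(old_matched), "OPEN", "FIXED"),
    stringsAsFactors = FALSE
  )
  new <- data.frame(
    version = "new",
    begin_line = as.numeric(names(new_matched)),
    category = ifelse(as.vector(new_matched), "OPEN", "NEW"),
    stringsAsFactors = FALSE
  )
  rbind(old, new)
}

# The two combinations of the running example.
pairs <- data.frame(
  begin_line_old = c(11, 20),
  begin_line_new = c(22, 22),
  same_alert = c(FALSE, TRUE)
)

result <- classify_alerts(pairs)
```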
\section{Comparing new alerts with new SATD comments}\label{results}
In this section, we study the correlation between the creation of new alerts and the insertion of \textit{Self-Admitted Technical Debt} comments in a transition between an old and a new version of source code.
In [@Potdar2014], the authors discuss that the existence of comments that contain some specific patterns may indicate what they call Self-Admitted Technical Debt (SATD). In [@Sierra2019], Self-Admitted Technical Debt is defined as the event in which the developer consciously introduces debt. According to these two works, the developer acknowledges the SATD in the form of comments. In [@Wehaibi2016] we can find some patterns based on the work of Potdar and Shihab. For instance, some of these patterns are "hack", "retarded", "remove this code", "treat this as a soft error", "kludge", "fixme", "this isn't quite right", "fix this crap", "abandon all hope" and "kaboom". We used all the terms used by the authors and added other terms that we found in the code and that we considered indicators of SATD. To find these additional terms, we sampled 5,000 comments and inspected them to recognize expressions that we judged to be indicators of SATD. All the regular expressions representing these terms are listed in Section \ref{sec_SATDs}.
We select 39 tagged and released versions of the project ArgoUML. They were released between 2001-04-06 and 2011-12-15. For each pair of sequential versions, we generate the PMD alerts and categorise the identified alerts as \textbf{New}, \textbf{Fixed} or \textbf{Open} using the algorithm described in Section \ref{alg}. We want to understand whether the number of new alerts, normalized by the magnitude of the change between two versions, is a good proxy for the amount of kludge introduced in the code base. A first approach we try in this preliminary investigation is to measure the correlation between the normalized amount of new alerts and the comments that indicate \textit{Self-Admitted Technical Debt}.
versions_used <- c(
  "9_8", "9_9", "10", "11_1", "11_2", "11_3", "11_4", "12",
  "13_1", "13_2", "13_3", "13_4", "13_5", "13_6", "14", "15",
  "16", "17", "18", "19", "20", "21", "22", "23", "24", "25",
  "26", "27", "28", "29", "30", "31", "32", "33"
)

categorised <- read_rds("categorised.rds") %>%
  filter(version_old %in% versions_used)

versions_executed <- read_rds("versions_executed.rds")

versions <- versions_executed$version_new %>% unique()
Figure \ref{timeseries} shows, in the first plot, the number of alerts at the end of each version transition. The second plot shows the number of new and fixed alerts. The third plot shows the number of comments that contain the expressions listed in Section \ref{sec_SATDs}. The fourth plot shows the number of new and fixed comments in each version transition. We classify each comment as \textbf{New}, \textbf{Fixed} or \textbf{Open} by using the following procedure: if the text of the comment is the same as in the preceding version, the comment is classified as \textbf{Open}; the remaining ones are classified as \textbf{Fixed} if they are in the old version and as \textbf{New} if they are in the new version.
% MARCIO: are you sure about these numbers? The first peak seems to fix more errors than there are in stock. The same happens for the second peak. Please check whether at some point Fixed > Accum(Open+New). I also think it is worth showing the number of open alerts.
% MARCIO: besides, a piece of the red line is missing in the second plot. Why?
% BRUNO: there was no stock there, it was always a transition. There are earlier versions. Some items from files that did not exist on both sides were missing. I think the plots are clearer now that I added the stock.
% MARCIO: add the line of Open comments.
% BRUNO: I added the stock of comments (and of alerts) at each version. If the transition is version x to version y, the number of alerts/comments in version y = open + new.
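The comment classification procedure can be sketched with set operations on the comment texts. This is a simplified illustration: the function name `classify_comments` and the example comments are hypothetical, and because set operations ignore repeated texts, identical comments that occur several times in one version are counted once here.

```r
# Hypothetical helper: classifies SATD comments by comparing their text
# between the old and the new version.
classify_comments <- function(old_comments, new_comments) {
  list(
    open  = intersect(old_comments, new_comments), # same text in both versions
    fixed = setdiff(old_comments, new_comments),   # only in the old version
    new   = setdiff(new_comments, old_comments)    # only in the new version
  )
}

# Illustrative comments only.
old_comments <- c(
  "// FIXME: handle null",
  "// hack for parallel lines",
  "// TODO remove this code"
)
new_comments <- c(
  "// hack for parallel lines",
  "// kludge: retry twice"
)

comment_categories <- classify_comments(old_comments, new_comments)
```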
```r
results_alert <- analyse_alerts_and_categories()

results_comments <- create_version_comparisons_comment()

# Suffix every comment column so it cannot clash with the alert columns.
comments <- results_comments %>%
  rename_with(
    ~ str_glue("{.x}_comments")
  )

# Join alerts and comments per version transition and compute the ratios.
comparison <- alerts %>%
  inner_join(
    comments,
    by = c(
      "major_version_old_alerts" = "major_version_old_comments",
      "minor_version_old_alerts" = "minor_version_old_comments",
      "major_version_new_alerts" = "major_version_new_comments",
      "minor_version_new_alerts" = "minor_version_new_comments"
    )
  ) %>%
  mutate(
    ratio_alerts = new_alerts / (new_alerts + fixed_alerts),
    ratio_comments = new_comments / (new_comments + fixed_comments),
    version = str_glue("{major_version_alerts}.{minor_version_alerts}")
  ) %>%
  filter(
    major_version_alerts > 10,
    !(ratio_alerts == 0 & ratio_comments == 0)
  ) %>%
  rename_with(
    .cols = matches("_version_"),
    .fn = ~ str_remove(.x, pattern = "_alerts")
  )
```
```{r, fig.cap="\\label{timeseries}Changes, alerts and comments", fig.height=0.9}
fixed_new_data <- comparisons_tidy %>% filter(atribute %in% c("n_fixed_alerts", "n_new_alerts")) %>% mutate( category = if_else(atribute == "n_fixed_alerts", "Fixed", "New") )
ggplot_fixed_new <- ggplot(
  fixed_new_data,
  aes(
    x = comparison,
    y = value,
    color = category,
    group = category
  )
) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5) +
  theme_minimal() +
  scale_color_manual(
    values = c(Fixed = "darkgreen", New = "darkred")
  ) +
  theme(
    axis.text.x = element_text(angle = 90),
    legend.position = "top"
  ) +
  scale_y_continuous(
    labels = number_format(big.mark = ","),
    limits = c(0, NA)
  ) +
  ggtitle("Number of fixed/new alerts per version transition") +
  labs(
    x = "Transition",
    y = "Number of alerts",
    color = "Category"
  )
fixed_new_comments <- comparisons_tidy %>%
  filter(atribute %in% c("n_comments_new", "n_comments_fixed")) %>%
  mutate(
    category = if_else(atribute == "n_comments_fixed", "Fixed", "New")
  )
ggplot_fixed_new_comments <- ggplot(fixed_new_comments, aes( x = comparison, y = value, color = category, group = category ) ) + geom_line( size = 1.2 ) + geom_point( size = 2.5 ) + theme_minimal() + scale_color_manual( values = c(Fixed = "darkgreen", New = "darkred") ) + theme( axis.text.x = element_text(angle = 90) , legend.position = "top" ) + scale_y_continuous( labels = number_format(big.mark = ","), limits = c(0,NA) ) + ggtitle( "Number of fixed/new comments per version transition" ) + labs( x = "Transition", y = "Number of comments", color = "Category" )
ggplot_total_alerts <- ggplot(total_open_new_alerts, aes( x = comparison, y = total_alerts, ) ) + geom_line( size = 1.2, color = "darkblue", group = 1, aes( x = comparison, y = total_alerts ) ) + geom_point( size = 2.5, color = "darkblue" ) + theme_minimal() + theme( axis.text.x = element_text(angle = 90) , legend.position = "top" ) + scale_y_continuous( labels = number_format(big.mark = ","), limits = c(0,NA) ) + ggtitle( "Number of total alerts after version transition" ) + labs( x = "Transition", y = "Number of alerts" )
total_comments_comparison <- fixed_new_comments %>% separate( col = comparison, into = c("old", "new"), sep = " to ", remove = FALSE ) %>% group_by( comparison, old, new ) %>% summarise() %>% left_join( total_comments, by = c("new" = "version") )
ggplot_total_comments <- ggplot(total_comments_comparison, aes( x = comparison, y = n, ) ) + geom_line( size = 1.2, color = "darkblue", group = 1, aes( x = comparison, y = n ) ) + geom_point( size = 2.5, color = "darkblue" ) + theme_minimal() + theme( axis.text.x = element_text(angle = 90) , legend.position = "top" ) + scale_y_continuous( labels = number_format(big.mark = ","), limits = c(0,NA) ) + ggtitle( "Number of total comments after version transition" ) + labs( x = "Transition", y = "Number of comments" )
ggplot_total_alerts / ggplot_fixed_new / ggplot_total_comments / ggplot_fixed_new_comments + plot_layout(heights = unit(c(3, 3, 3, 3), c("cm", "cm", "cm", "cm") ), widths = unit(c(15, 15, 15, 15), c("cm", "cm", "cm", "cm") ))
```

\newpage

```r
# Keep only the transitions with at least one new or fixed comment,
# so that the proportion of new comments is defined.
comparisons_non_zero <- comparisons %>%
  filter(new_and_fixed_comments > 0)

correl_prop <- cor(
  comparisons_non_zero$prop_new_alerts,
  comparisons_non_zero$prop_new_comments
)
```
% MARCIO: how large is this correlation? How do we read the plots in Figures 6 and 7?
% BRUNO: I explained this part a bit better.

Hereafter we analyse the relation between the amount of new alerts and the amount of new comments. For each version transition, we compute the following proportions:

$$PropNewAlerts = \frac{NewAlerts}{NewAlerts + FixedAlerts}$$

$$PropNewComments = \frac{NewComments}{NewComments + FixedComments}$$
The correlation between the proportion of new alerts and the proportion of new comments is `r correl_prop %>% number(accuracy = 0.01, decimal.mark = ".")`.
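The computation of the two proportions and of their correlation can be sketched on a small table. All counts below are invented for illustration, and the column names are assumptions modelled on the ones used in this document. The last transition has no new or fixed comments, so its proportion of new comments is `0 / 0` and must be dropped before computing the correlation.

```r
# Invented counts for four hypothetical version transitions.
toy <- data.frame(
  comparison     = c("A to B", "B to C", "C to D", "D to E"),
  new_alerts     = c(30, 10, 5, 8),
  fixed_alerts   = c(10, 10, 15, 0),
  new_comments   = c(6, 3, 1, 0),
  fixed_comments = c(2, 5, 3, 0)
)

# Proportion of new items among all new and fixed items.
toy$prop_new_alerts <- toy$new_alerts / (toy$new_alerts + toy$fixed_alerts)
toy$prop_new_comments <- toy$new_comments / (toy$new_comments + toy$fixed_comments)

# Drop transitions without any new or fixed comment (their proportion is NaN).
toy_non_zero <- toy[(toy$new_comments + toy$fixed_comments) > 0, ]

# Pearson correlation between the two proportions.
toy_correlation <- cor(toy_non_zero$prop_new_alerts, toy_non_zero$prop_new_comments)
```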
Figure \ref{scatter_prop} shows a scatter plot with the relation between the proportion of new alerts and the proportion of new comments. We added a regression line obtained from the following model:
$$ PropNewComments = \alpha + \beta \, PropNewAlerts $$
The shaded region represents the 95\% confidence interval around the fitted regression line.
```{r, fig.cap="\\label{scatter_prop}Proportion of new alerts x Proportion of new comments", fig.pos="H"}
ggplot(
  comparisons_non_zero,
  aes(
    y = prop_new_comments,
    x = prop_new_alerts
  )
) +
  geom_point() +
  geom_text_repel(aes(label = comparison), nudge_y = 0.05, size = 2) +
  geom_smooth(method = "lm") +
  ggtitle("Proportion of new comments x Proportion of new alerts") +
  # Axis titles follow the aesthetics: comments on y, alerts on x.
  labs(
    y = "Proportion of new comments",
    x = "Proportion of new alerts",
    size = "Change"
  ) +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = percent_format()) +
  scale_size_continuous(
    labels = number_format(big.mark = ",", accuracy = 1)
  ) +
  theme_minimal() +
  theme(legend.position = "top") +
  NULL
```
```r
library(parsnip)
library(gtsummary)

lm_prop <- linear_reg() %>%
  set_engine("lm")

lm_prop_fit <- lm_prop %>%
  fit(ratio_comments ~ ratio_alerts, data = comparison_prepared)

tbl_prop <- tbl_regression(
  lm_prop_fit$fit,
  pvalue_fun = function(x) style_pvalue(x, digits = 2),
  # The label must refer to the predictor used in the model formula.
  label = ratio_alerts ~ "New Alerts Proportion",
  intercept = TRUE
)

tbl_prop %>%
  as_kable_extra(caption = "\\label{tab_reg} Regression: comments on alerts")
```

We can evaluate whether the positive relation we see between new comments and new alerts was detected by chance. Table \ref{tab_reg} shows the results of this regression. The p-value of $\beta$ is near zero. This means that, if there were actually no linear relation between the proportion of new alerts and the proportion of new comments, it would be very unlikely to get a result as extreme as the one we got. The estimate for $\beta$ is 0.57, meaning that for each additional 1 pp in the proportion of new alerts we observe, on average, an additional 0.53 pp in the proportion of new comments. The $R^2$ is `r broom::glance(lm_prop_fit$fit)$r.squared %>% number(accuracy = 0.01, decimal.mark = ".")`, so we could say that `r broom::glance(lm_prop_fit$fit)$r.squared %>% percent(accuracy = 1, decimal.mark = ".")` of the variability in the proportion of new comments can be explained by the variation in the proportion of new alerts.
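To make explicit where the slope, its p-value and the $R^2$ come from, the following sketch fits the same kind of model with base `lm()` instead of `parsnip`. The data are invented and the object names are hypothetical; the sketch only shows which elements of the fitted model the quantities are read from.

```r
# Invented proportions for six hypothetical version transitions.
toy_reg <- data.frame(
  prop_new_alerts   = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  prop_new_comments = c(0.20, 0.22, 0.35, 0.41, 0.48, 0.60)
)

# Simple linear regression: PropNewComments = alpha + beta * PropNewAlerts.
toy_fit <- lm(prop_new_comments ~ prop_new_alerts, data = toy_reg)
toy_summary <- summary(toy_fit)

# Slope estimate and its p-value, taken from the coefficient table.
beta <- coef(toy_summary)["prop_new_alerts", "Estimate"]
p_value <- coef(toy_summary)["prop_new_alerts", "Pr(>|t|)"]

# Share of the variability of the response explained by the model.
r_squared <- toy_summary$r.squared
```

With a single predictor and an intercept, the $R^2$ equals the squared correlation between the two proportions, so the regression and the correlation reported earlier describe the same linear association.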
In the next steps, we can refine the way we select the SATD comments. Some papers use more sophisticated schemes to identify them, for instance Natural Language Processing. By following this path, we may gain more certainty about the relation between alerts and SATD comments, and we may be able to explain a greater part of the variability in the proportion of new SATD comments.
\section{Appendix: expressions to find SATDs}\label{sec_SATDs}
List of regular expressions used to find SATDs:
kludgenudger::get_kludge_expressions() %>% enframe( name = "#", value = "Expression" ) %>% kable( longtable = TRUE ) %>% kable_styling( latex_options = c("repeat_header") )
\section{References}