```r
library(xml2)
library(tidyverse)
library(gt)
library(knitr)
library(kableExtra)
library(tidygraph)
library(ggraph)
library(patchwork)
library(magrittr)
library(scales)
library(feather)
library(ggrepel)
library(lubridate)
library(git2r)
library(furrr)

knitr::opts_chunk$set(echo = FALSE, size = "small", warning = FALSE, message = FALSE, cache = FALSE, fig.pos="H")

def.chunk.hook  <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})
size_line_of_code <- 160

length_alert_name <- 35

length_alert_name_side_by_side <- 14

size_line_of_code_side_by_side <- 77


pmd_path <- "pmd/bin/pmd.bat"

rule_path <- "rulesets/java/quickstart.xml"

output_path <-  ""  

examples <- tribble(
    ~name,                   ~path,  ~output,
    "Old Original Version",  "old",  "old_original",
    "New Version 1",         "new",  "new_1"
) %>% 
    mutate(id = row_number())
```

\section{Introduction}\label{intro}

This document is part of a research project about software degradation caused by careless developers' behavior and about strategies to deal with such undesired behavior. These strategies will possibly be inspired by concepts from game theory.

We assume that software degradation can be measured by the number and the types of \textit{kludges} made by software developers in the code. A kludge is code that

```{=tex}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item Partially fixes a bug or partially implements a feature.
\end{enumerate}
```

```{=tex}
\setlength{\parindent}{1.0cm}
\hangindent=1.0cm
```

The term partial can be understood as in \textit{partial functions}. A partial function is undefined for some elements of its formal domain. For instance, consider the square root function restricted to the integers: $f(25)$ is defined, but $f(26)$ is undefined (a toy sketch of such a partial function follows this definition). In terms of features, we can think of a developer calculating the point at which two lines cross while neglecting the case of parallel lines;

```{=tex}
\begin{enumerate}
\setcounter{enumi}{1}
\tightlist
\item The developer knows, with high probability, that the code is only a partial solution.\footnote{We need to study technical debt papers to enrich the conceptual background.}
\end{enumerate}
```

This project aims to study how software projects evolve in terms of the
number and kinds of kludges. So far, we are trying to identify kludges
by looking at alerts generated by the PMD source code analyzer. PMD is a
static source code analyzer that is commonly used to find potential
programming flaws. PMD is a good choice for analyzing bad practices
because it supports multiple languages and is very flexible: it allows
the researcher to provide their own rules for finding interesting
patterns in source code written in Java or other languages. These are
the planned steps for this research project:

```{=tex}
\begin{itemize}
\item
  Confirm the assumption that the frequency of PMD alerts is an accurate
  measure of the prevalence of kludges;

\item
  Confirm the assumption that kludges harm software development;

\item
  Confirm the assumption that there is a game in which, in Nash
  equilibrium, a developer chooses a strategy that yields personal
  benefits while harming the project by making kludges;

\item
  If all these assumptions hold, use mechanism design to devise how
  we can change the environment so that developers do not choose
  to make so many kludges, increasing the quality of the project in the
  long run;

\item
  Implement this mechanism by building a plugin for a prominent CI tool,
  such as Travis, Jenkins or GitLab.
\end{itemize}
```

In this document, we evaluate PMD source code alerts as proxies for kludges. In Section \ref{pmd}, we present the PMD Source Code Analyzer and show how we use the tool to generate alerts that are possible kludges. We also use this tool to create a simplified AST that supports the algorithm that infers how many alerts were created, how many were fixed, and how many remain open in a transition from an old version to a new version. This algorithm is described in Section \ref{alg}. In Section \ref{results}, we compare the creation of new alerts with the insertion of Self-Admitted Technical Debt (SATD) comments.

% =========================================================
% PMD SOURCE CODE ANALYZER
% =========================================================

\section{The PMD Source Code Analyzer}\label{pmd}

We use PMD to list the alerts that represent \textit{possible kludges} in source code. PMD receives source code as input and generates a list of bad programming practices contained in it, i.e., the alerts.

PMD traverses the AST of the source code searching for violations of rules configured by the user. PMD comes with a default rule set for the Java programming language, which finds common programming flaws such as unused variables, empty catch blocks, unnecessary object creation, and so forth. It is possible to configure a different set of rules by creating a custom XML file. Below, we can see a simple piece of code and the alerts generated by the default rule set. In this example, PMD generates two alerts of the type ControlStatementBraces (CSB), at lines 11 and 20. This alert means that braces are missing in a statement that is inside a control statement.

```r
MUDOU <- TRUE

saida_alg2 <- kludgenudger::calculate_features_from_versions(
  code_file_old = "little-tree/code.java",
  code_file_new = "little-tree-new/code.java",
  pmd_path = pmd_path,
  glue_string = "{.data$id_alert}:line:{.data$beginline},\n{.data$small_rule}.{if_else(is.na(.data$rule_alert),'',paste0('\n',.data$rule_alert))}",
  mostra_new = c(2, 3, 4, 16),
  mostra_old =  c(2, 3, 5, 14, 12),
  blockrules_location = "data/blockrules/blockrules_simple.xml",
  optimize_feature_calculation = FALSE
)
```

\newpage

```{java code old simple, code=kludgenudger::read_and_decorate_code_and_alerts("little-tree/code.java", saida_alg2$versions_executed$pmd_output[[1]], FALSE, 10, use_mnemonic = TRUE), echo = TRUE}
```

\subsection{Using PMD to capture the history of alerts}\label{history}

To evaluate how the number of alerts evolved throughout the history of a
software project, we must be able to analyze a pair of different
versions of a source code (an old and a new version) and categorise each
alert contained in the code as either \textbf{new}, \textbf{fixed} or
\textbf{open}.

We define a PMD alert generated for the old version as either
\textbf{open} or \textbf{fixed} in the new version. An \textbf{open}
alert remains in the new version of the code. A \textbf{fixed} alert
does not exist in the new version.

A PMD alert generated for the new version is either \textbf{open} or
\textbf{new}. An \textbf{open} alert indicates that the same alert was
identified in the old version of the source code. A \textbf{new} alert
implies that the same alert cannot be identified in the old version.

The intersection between \textbf{fixed} alerts, \textbf{new} alerts and
\textbf{open} alerts is empty. The alerts identified as \textbf{open}
are equivalent in both new and old versions. To decide whether an alert
is \textbf{open}, \textbf{fixed} or \textbf{new}, one has to identify if
this alert in the old version is equivalent to its occurrence in the new
version. This document describes an algorithm to make this
classification.

\subsection{Using PMD to generate a simplified Abstract Syntax Tree}\label{ast}

To classify alerts as \textbf{new}, \textbf{open} and \textbf{fixed}, we
could match the lines of the old version to the lines of the new version
of a source file using a diff tool. Diff is useful, but not sufficient.
Frequently, source code changes are more complex than the ones we could
address using only diff. An alert $o$ in the old version could be
essentially the same as an alert $n$ in the new version, but the piece
of code where $o$ resides could have been moved in a way that makes it
impossible to match $o$ to $n$ using information from diff alone.
Section \ref{alg} presents an algorithm that uses extra information from
a simplified AST created using PMD.

PMD traverses the source code by visiting many different kinds of
elements. We do not use all the types of nodes recognized by PMD to
generate the simplified AST, because many kinds of nodes are not needed
by the current heuristic that classifies the alerts, as described in
Section \ref{heuristic}. If we used all the kinds of nodes, we would end
up with a tree that would add complexity to our algorithm without adding
value to our analysis. The kinds of elements selected are listed below:

```{=tex}
\begin{itemize}


\item \textbf{Block}: a block of statements enclosed by braces;

\item \textbf{ClassOrInterfaceBody}: the body of an interface or a class, excluding the declaration;

\item \textbf{CompilationUnit}: the root of an AST tree;

\item \textbf{Method}: a method, including body and declaration;

\item \textbf{Constructor}: a constructor, including body and declaration;

\item \textbf{Statement}: any statement, like an if statement or an assignment;

\end{itemize}
```

PMD accepts an optional XML rule file along with the source code. The default rule set generates alerts when it finds common bad practices; to create the simplified AST, instead, we use an XML file with rules that emit one entry for every node of the kinds listed above. The PMD output lists these nodes and their locations in the source code, as shown in Table \ref{tab_nodes}.

```r
saida_alg2$graph_old_with_alert %>% 
  as_tibble() %>% 
  select(
    `Kind of node` = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>% 
  kable(
    caption = "Output from PMD when creating a simplified AST\\label{tab_nodes}"
  ) %>% 
  kable_styling(
    latex_options = c("striped", "hold_position")
  )
```


Looking at Table \ref{tab_nodes}, we see information about the location of the nodes in terms of lines and columns. We can infer which nodes are contained in other nodes: by comparing the begin and end lines and columns, we know that if a node A lies inside the contents of a node B (within B's range of lines and columns), then A is a descendant of B in the AST. But we cannot tell whether A is a child or a grandchild of B (or a great-grandchild\ldots). We follow three steps to recreate the AST (a sketch in R follows the steps):

```{=tex}
\begin{enumerate}
\item Link each node (a) to the set of nodes (X) that are fully contained between the begin line/column and end line/column of node (a). We can construct a directed graph in which the elements are the nodes and the links described are the edges. This is not a tree yet, because each node has edges directed to all its descendants, not only to its children in the AST;

\item Sort the nodes in decreasing order of their number of children. The objective is to establish that, in a search through this graph, the first child chosen will be the one that is a child in the AST, and not merely a descendant;

\item Perform a depth-first search starting from the compilation unit node.
\end{enumerate}
```

% =========================================================
% RESEARCH QUESTIONS
% =========================================================

\section{Research questions}\label{as_whole}

\noindent \textbf{RQ: Is the frequency of PMD Alerts an accurate measure of the prevalence of kludges?} \label{PMD_Kludge}

In a given transition between an old version and a new version, we want to identify whether there was an intense introduction of PMD alerts. This evidence of possible kludge introduction must be normalized by the size of the change in source code between the two versions. We do this with the following formula:


$$ \frac{\#NewAlerts - \#FixedAlerts}{Change} $$

where

$$ Change = \#NewAlerts + \#FixedAlerts $$

With the version transitions measured in terms of the introduction of PMD alerts, we can correlate these events with other evidence of kludges. In Section \ref{results}, we calculate this correlation using \textit{Self-Admitted Technical Debt} (SATD) comments.


\vspace{16pt}

\noindent \textbf{RQ: Do kludges harm software development?} \label{kludge_harm}

We need some way to measure degradation after a heavy introduction of kludges. A drop in popularity may not be proper evidence, and an increase in the number of issues and bug fixes does not necessarily represent degradation. Code churn could be used here.

% =========================================================
% ALGORITHMS
% =========================================================

\section{An algorithm to classify alerts}\label{alg}

This section discusses the algorithm to classify alerts as open, fixed,
or new. The algorithm uses the simplified AST described in Section
\ref{ast} to create features that help to infer if two alerts in
different versions must be considered the same. We use the term
\textit{feature} as it is used in the field of statistical learning
(Kuhn and Johnson, 2019). In this field, the variables that are used to
predict the outcome of an event are called \emph{independent variables},
\emph{predictors}, or \emph{features}.

The term \emph{feature} is most appropriate when referring to a variable
that is a composite of one or more variables, or the result of some
treatment applied to raw data. Given a pair of alerts, one from the old
version and one from the new version, we use the mapping between the
lines of the old and the new version and their ASTs as raw data to
create the features that will be used to infer if they are the same
alert. At the moment, we use a heuristic to infer if the pair of alerts
is the same, as described in Section \ref{heuristic}. Figure
\ref{fig:diag} shows the steps of the algorithm we are describing.

```{=tex}
\begin{figure}
  \centering
  \smartdiagram[sequence diagram]{Get alerts for each version, Create AST for each version, Map new lines to old lines, Calculate features for each pair of alert, Decide if the alerts are the same, Classify alerts }
\vspace{5mm}\par
  \caption{Steps of the algorithm to classify alerts}\label{fig:diag}
\end{figure}
```

\subsection{An illustrative example}\label{source_used}

Hereafter, we consider the old and new versions of the example source code presented below. In the new version, the alert generated at line 11 of the old version was fixed, but the other one remained. So we expect one \textbf{Fixed} alert in the old version (line 11), one \textbf{Open} alert in the old version (line 20), and the same \textbf{Open} alert in the new version (line 22).

```{java showing codes, code=kludgenudger::read_and_decorate_code_and_alerts_mapped("little-tree/code.java", saida_alg2$versions_executed$pmd_output[[1]], "little-tree-new/code.java", saida_alg2$versions_executed$pmd_output[[2]], saida_alg2$versions_crossed$lines_map[[1]], TRUE, 20, TRUE, 60), echo=TRUE, size="scriptsize"}
```

\subsection{Get alerts for each version}

For the new and the old versions, we run PMD using the default rule set,
as described in Section \ref{history}. Table \ref{old_alerts} is created
for the old version and Table \ref{new_alerts} for the new version.

```r

saida_alg2$versions_executed$pmd_output[[1]] %>% 
  select(
    Rule = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>% 
  kable(
    caption = "Old version's alerts\\label{old_alerts}"
  ) %>% 
  kable_styling(
    latex_options = c("striped", "HOLD_position")
  ) 

saida_alg2$versions_executed$pmd_output[[2]] %>% 
  select(
    Rule = rule,
    `Begin line` = beginline,
    `Begin column` = begincolumn,
    `End line` = endline,
    `End column` = endcolumn
  ) %>% 
  kable(
    caption = "New version's alerts\\label{new_alerts}"
  ) %>% 
  kable_styling(
    latex_options = c("striped", "HOLD_position")
  ) 
```

\subsection{Create AST for each version}

For each version of the source code, the algorithm creates a simplified AST, as described in Section \ref{ast}. Figure \ref{AST_compare_id_alerts} shows the ASTs for the old and the new versions. In this figure, the numbers in the nodes are meaningless and are shown only for reference.

```{r, fig.cap="Simplified ASTs for the old and the new versions\\label{AST_compare_id_alerts}", fig.pos="H", cache=TRUE}

chart_graph_old <- kludgenudger::show_ast(
  saida_alg2$graph_old_with_alert,
  size_label = 3,
  show_label = TRUE,
  alpha_label = "mostra",
  name_field = "glue",
  aspect = 0.5,
  title = "Simplified AST for the old version"
)

chart_graph_new <- kludgenudger::show_ast(
  saida_alg2$graph_new_with_alert,
  size_label = 3,
  show_label = TRUE,
  alpha_label = "mostra",
  name_field = "glue",
  aspect = 0.5,
  title = "Simplified AST for the new version"
)

chart_graph_old / chart_graph_new
```

\subsection{Map new lines to old lines}\label{map}

For each difference reported in the output of \textit{git diff} (the
sections of the diff file starting with ``@@''), there is an indication
of the number of lines removed from the old version and the number of
lines added to the new one, together with the positions at which the
removals and additions happen. Using this information, we create a
mapping from the lines in the old version to the equivalent lines in the
new version. For the new and old versions presented in Section
\ref{source_used}, the relation is shown in Table \ref{table_map}
\footnote{The mapping shown begins at line 5 in order to save space, since lines 1-4 in the old version map to lines 1-4 in the new version.}.

```r
saida_alg2$versions_crossed$lines_map[[1]] %>% 
  ungroup() %>% 
  mutate(
    row = row_number(),
    na_mark = if_else(is.na(map_remove) | is.na(map_add), row , NA_integer_  ),
    next_na = na_mark,
    last_na = na_mark
  ) %>% 
  fill(
    next_na, .direction = "up"
  ) %>% 
  fill(
    last_na, .direction = "down"
  ) %>% 
  replace_na(
    list(
      last_na = 0,
      next_na = nrow(saida_alg2$versions_crossed$lines_map[[1]]) + 1
    )
  ) %>%
  mutate(
    dist_next = next_na - row,
    dist_last = row - last_na + 0.1
  ) %>%
  rowwise() %>% 
  mutate(
    min_dist = min(dist_next, dist_last)
  ) %>% 
  filter(
    min_dist < 4
  ) %>%
  ungroup() %>% 
  mutate(    
    map_remove = 
      case_when(
        min_dist == 3.1 ~ str_glue("{lag(map_remove)+1}-"),
        min_dist == 3.0 ~ str_glue("-{lead(map_remove)-1}"),
        TRUE ~ map_remove %>% as.character()
      ),
    map_add = 
      case_when(
        min_dist == 3.1 ~ str_glue("{lag(map_add)+1}-"),
        min_dist == 3.0 ~ str_glue("-{lead(map_add)-1}"),
        TRUE ~ map_add %>% as.character()
      )
  ) %>% 
  select(old = map_remove, new = map_add) %>% 
  mutate(  
    old = if_else(is.na(old), str_glue("\\textcolor{{white}}{{{row_number()}}}"), old),
    new = if_else(is.na(new), str_glue("\\textcolor{{white}}{{{row_number()}}}"), new)
  ) %>% 
  pivot_wider(
    names_from = old,
    values_from = new,
    names_repair = "minimal"
  ) %>% 
  kable(
    caption = "Relation between lines of the old version and lines of the new version\\label{table_map}",
    escape = FALSE
  ) %>% 
  kable_styling(
    latex_options = c("HOLD_position", "scale_down")
  )
```

\subsection{Calculate features for each pair of new and old alert}

For the proposed example, we calculate features for $2 \cdot 1 = 2$ combinations of new and old alerts, since we have 2 old alerts and 1 new alert. Table \ref{combination} shows the combinations for which features are calculated.

```r
old_alerts <-  saida_alg2$versions_executed$pmd_output[[1]] %>%
  select(
    `Begin Line Old` = beginline,
    # `End Line Old` = endline,
    `Rule Old` = rule
  )

new_alerts <-  saida_alg2$versions_executed$pmd_output[[2]] %>%
  select(
    `Begin Line New` = beginline,
    # `End Line New` = endline,
    `Rule New` = rule
  )

combinations <- old_alerts %>%
  crossing(new_alerts)

combinations %>%
  kable(
    caption = "Combinations of new and old alerts for which the features must be calculated \\label{combination}",
    escape = FALSE
  ) %>%
  kable_styling(
    latex_options = c("HOLD_position")
  )
```

\vspace{16pt}

For each alert, the PMD Source Code Analyzer returns the rule that was violated and the location of the violation: begin line, begin column, end line, and end column.

\vspace{16pt}

We propose the following features to calculate for each combination:

\noindent \textbf{Same Rule}: a Boolean indicator that tells if the alerts are of the same type.

\noindent \textbf{Same Group ID}: a Boolean indicator that tells if the alerts are equivalent in terms of begin line and end line, considering the mapping described in Section \ref{map}. For each combination, Table \ref{same_group} shows the begin line in the old version, its corresponding line in the new version, and the begin line in the new version\footnote{We suppress the end lines because, for all alerts in the example, the begin lines and the end lines are the same.}. The alert that begins at line 20 of the old version corresponds to the alert that begins at line 22 of the new version, so for this combination the ``Same Group ID'' feature is true. For the other combination, it is false.

```r
map_lines <- saida_alg2$versions_crossed$lines_map[[1]] %>% 
  select(
    map_line_old = map_remove,
    map_line_new = map_add
  )

map_old <- map_lines %>% rename(`Corresponding line in new version` =  map_line_new)

combinations %>% 
  left_join(
    map_old,
    by = c("Begin Line Old"="map_line_old")
  ) %>% 
  select(
    `Begin Line Old`,
    `Rule Old`,
    `Corresponding line in new version`,
    `Begin Line New`,
    `Rule New`
  ) %>% 
  mutate(
    `Same group` = `Corresponding line in new version` == `Begin Line New`
  ) %>% 
  kable(
    caption = "Same group feature \\label{same_group}",
    escape = FALSE,
    col.names = 
      c(
        "Begin Line Old",
        "Rule Old",
        "Corresponding line\n in new version",
        "Begin Line New",
        "Rule New",
        "Same group"
      ) %>% linebreak()
  ) %>%
  kable_styling(
    latex_options = c("scale_down", "HOLD_position")
  )
```

\noindent \textbf{Same Method Group ID}: a Boolean indicator that tells if the alerts belong to methods that are in the same group, in the sense of the ``Same Group ID'' feature described above. First, we find each alert's method by following the path from the alert's node to the root. The first node of the kind ``method'' or ``constructor'' found in this path defines the alert's method. Considering the proposed example, Figure \ref{path_node_to_root_1} shows the AST for the first combination of alerts (see Table \ref{combination}).

In this combination, the method of the old alert is in node 5 of the left tree and the method for the new alert is in node 4 of the right tree. Table \ref{tab_same_method} shows that the begin and end lines of the method in the old version do not correspond to the lines in the new version. The correspondence uses the mapping defined in Section \ref{map}. For this combination, the ``Same Method Group ID'' feature is FALSE.


```{r, fig.cap="Paths from the alerts to the root for the first combination\\label{path_node_to_root_1}", fig.pos="H", cache=TRUE}

path_old <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_old %>%
    rename(id_alert = id_alert_old, graph = graph_old) %$%
    graph[[1]],
  size_label = 3, aspect = 4, nudge_x = 0.5,
  title = "Path from alert in old version, \nin line 11, to root"
)

path_new <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_new %>%
    rename(id_alert = id_alert_new, graph = graph_new) %$%
    graph[[1]],
  size_label = 3, aspect = 4, nudge_x = 0.5,
  title = "Path from alert in new version, \nin line 22, to root"
)

(path_old + plot_spacer() + path_new)
```

```r

tribble(
  ~"-",  ~"Old version", ~"New version",
  "Begin line", 8, 19,
  "End line", 15, 27
) %>% 
  left_join(
    map_lines,
    by = c("Old version" = "map_line_old")
  ) %>% 
  select(
    `-`,
    `Old version`,
    `Corresponding line in the new version` = map_line_new,
    `New version`
  ) %>% 
    kable(
    caption = "Defining if Same Method Group ID \\label{tab_same_method}",
    escape = FALSE,
    col.names = 
      c(
        "-",
        "Old version",
        "Corresponding line\nin the new version",
        "New version"
      ) %>% linebreak()
  ) %>%
  kable_styling(
    latex_options = c("HOLD_position")
  )
```

Figure \ref{path_node_to_root_2} shows the ASTs for the second combination of alerts (see Table \ref{combination}). In this combination the method of the old alert is in node 3 of the left tree and the method for the new alert is in node 4 of the right tree. Table \ref{tab_same_method_2} shows that the begin and end lines of the method in the old version do correspond to the lines in the new version. The correspondence uses the map defined in Section \ref{map}. For this combination, the ``Same Method Group ID'' feature is TRUE.


```{r, fig.cap="Paths from the alerts to the root for the second combination\\label{path_node_to_root_2}", fig.pos="H", cache=TRUE}

path_old <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_old %>%
    rename(id_alert = id_alert_old, graph = graph_old) %$%
    graph[[2]],
  size_label = 3, aspect = 4, nudge_x = 0.5,
  title = "Path from alert in old version, \nin line 20, to root"
)

path_new <- kludgenudger::show_ast(
  saida_alg2$graphs_from_alerts_new %>%
    rename(id_alert = id_alert_new, graph = graph_new) %$%
    graph[[1]],
  size_label = 3, aspect = 4, nudge_x = 0.5,
  title = "Path from alert in new version, \nin line 22, to root"
)

(path_old + plot_spacer() + path_new)
```

```r

tribble(
  ~"-",  ~"Old version", ~"New version",
  "Begin line", 17, 19,
  "End line", 25, 27
) %>%  
  left_join(
    map_lines,
    by = c("Old version" = "map_line_old")
  ) %>% 
  select(
    `-`,
    `Old version`,
    `Corresponding line in the new version` = map_line_new,
    `New version`
  ) %>% 
    kable(
    caption = "Defining if Same Method Group ID \\label{tab_same_method_2}",
    escape = FALSE,
    col.names = 
      c(
        "-",
        "Old version",
        "Corresponding line\nin the new version",
        "New version"
      ) %>% linebreak()
  ) %>%
  kable_styling(
    latex_options = c("HOLD_position")
  )
```

\noindent \textbf{Same Method Name}: a Boolean indicator that tells if the alerts were found in a method with the same name. The methods for the alerts are found as for the ``Same Method Group ID''. However, instead of the corresponding lines, this feature evaluates the name of the methods. If the method related to the old alert and the method related to the new alert have the same name (even if the begin and end lines are not corresponding), then this feature is set to TRUE.

\noindent \textbf{Same Block}: a Boolean indicator that shows if the alerts belong to the same block. The blocks are defined in a similar way as described above, following the path from the node where each alert is towards the root until we find a ``block'', ``method'', or ``constructor'' node. Then, the begin and end lines of the old alert and the corresponding lines in the new version are compared with the begin and end lines of the new alert, as for the ``Same Method Group ID''.

\noindent \textbf{Same Code}: for this feature we compare the source code that is contained in the nodes related to the alerts. The comparison is performed by lexicographically comparing the textual version of the source code residing within the lines and columns ranges (begin and end) of the alerts.

\noindent \textbf{Same Method Code}: for this feature we compare the source code of the methods related to the alerts. The methods related to the alerts are found the way we described for the ``Same Method Group ID'' above.

\noindent \textbf{Line distance}: the distance between the ``mean line'' (\(\frac{beginline + endline}{2}\)) of the new alert and the ``corresponding mean line'' of the old alert. Table \ref{tab_line_distance} shows the line distance for the combinations of new and old alerts in the example we are following.

```r
old_alerts_ld <-  saida_alg2$versions_executed$pmd_output[[1]] %>%
  select(
    `Begin Line Old` = beginline,
    `End Line Old` = endline,
    `Rule Old` = rule
  ) 

new_alerts_ld <-  saida_alg2$versions_executed$pmd_output[[2]] %>%
  select(
    `Begin Line New` = beginline,
    `End Line New` = endline,
    `Rule New` = rule
  )

combinations_ld <- old_alerts_ld %>%
  crossing(new_alerts_ld)

map_lines <- saida_alg2$versions_crossed$lines_map[[1]] %>% 
  select(
    map_line_old = map_remove,
    map_line_new = map_add
  )

map_old_begin <- map_lines %>% rename(`Corresponding begin line in new version` =  map_line_new)
map_old_end <- map_lines %>% rename(`Corresponding end line in new version` =  map_line_new)


combinations_ld  %>% 
  left_join(
    map_old_begin,
    by = c("Begin Line Old"="map_line_old")
  ) %>% 
  left_join(
    map_old_end,
    by = c("End Line Old"="map_line_old")
  ) %>% 
  select(
    `Begin Line Old`,
    `End Line Old`,
    `Corresponding begin line in new version`,
    `Corresponding end line in new version`,
    `Begin Line New`,
    `End Line New`
  ) %>% 
  mutate(
    mean =  
      (`Corresponding begin line in new version` + `Corresponding end line in new version`) / 2,
    `Corresponding mean line` = str_glue("({`Corresponding begin line in new version`} + {`Corresponding end line in new version`}) / 2 = {mean}"),
    `Mean line of new alert` = (`Begin Line New` + `End Line New`) / 2,
    `Line distance` = abs(mean - `Mean line of new alert`)
  ) %>% 
  select(
    -mean
  ) %>% 
  kable(
    caption = "Line distance feature \\label{tab_line_distance}",
    escape = FALSE,
    col.names = 
      c(
        "Begin Line\nOld",
        "End Line\nOld",
        "Corresponding\nbegin line\nin new version",
        "Corresponding\nend line\nin new version",
        "Begin Line\nNew",
        "End Line\nNew",
        "Corresponding\nmean line",
        "Mean line of\nnew alert",
        "Line distance"
      ) %>% linebreak()
  ) %>%
  kable_styling(
    latex_options = c("scale_down", "HOLD_position")
  )
```

Table \ref{table_features} shows the features calculated for the combinations of new and old alerts in our example.

```r
MUDOU <-  TRUE

kludgenudger::report_features(
  saida_alg2, 
  "Resulting features\\label{table_features} ",
  types_to_show = c(
    "same_rule",
    "same_id_group",
    "same_method_group",
    "same_method_name",
    "same_block",
    "same_code",
    "same_method_code",
    "dist_line"
  )

  )
```

\subsection{Decide if two alerts are the same based on a heuristic}\label{heuristic}

With the features at hand, we must decide if each combination of old and new alerts refers to the same alert. As a starting point, we are using a rule of thumb (a manually devised heuristic) to make this decision. A new and an old alert are declared the same if any of the following rules applies.

All the Boolean features are TRUE and \textit{Line Distance} is equal to 0;

\scriptsize

$$
\begin{aligned}
(\, & SameRule \land SameGroupID \land SameMethodGroupID \; \land \\
& SameMethodName \land SameBlock \land SameCode \; \land \\
& SameMethodCode \land LineDistance = 0 \,) \Rightarrow SameAlert
\end{aligned}
$$

\normalsize

If \textit{Same Method Code} is TRUE, then we consider it is the same method, even if the name of the method is not the same. But if the method code is the same, then the alert code and the kind of the alert must also be the same;

\scriptsize

$$ (SameRule \land SameCode \land SameMethodCode) \Rightarrow SameAlert $$

\normalsize

If the kind of alert is the same, and at least one of the features about the method is TRUE, we consider that the alerts are the same if the line distance is less than 5 lines.

\scriptsize

$$
\begin{aligned}
(\, & SameRule \land (SameMethodCode \lor SameMethodGroup \lor SameMethodName) \; \land \\
& LineDistance < 5 \,) \Rightarrow SameAlert
\end{aligned}
$$

\normalsize


Table \ref{table_features_with_decision} shows the resulting features of the two combinations of alerts in the example we are following and the final decision if they are the same alert.

```r
MUDOU <-  TRUE

kludgenudger::report_features(
  saida_alg2, 
  "Resulting features\\label{table_features} ",
  types_to_show = c(
    "same_rule",
    "same_id_group",
    "same_method_group",
    "same_method_name",
    "same_block",
    "same_code",
    "same_method_code",
    "dist_line"
  ),
  return_raw_data = TRUE

  ) %>% 
  mutate(
    same_alert = if_else( row_number() <= 8, FALSE, TRUE )
  ) %>% 
    kable(
          format = "latex",
          caption = "Resulting features\\label{table_features_with_decision} ",
          escape = TRUE,
          # booktabs = TRUE,
          # align = "r",
          # linesep = "",
          col.names = c(
            "Alert combination",
            "Feature",
            "Feature value",
            "Same alert"
          )
    ) %>%
      kableExtra::collapse_rows(columns = c(1,4), latex_hline = "major", valign =  "top") %>% 
      kableExtra::kable_styling(
        latex_options = c("HOLD_position", "striped")
      )
```

\subsection{Classify alerts}\label{classify}

Finally, we can decide if the alerts in the old version are open or fixed, and if the alerts in the new version are open or new. In our example, we have two alerts in the old version, and they are combined with the alert in the new version. Table \ref{tab_categorizing_old} shows the two old alerts combined with the new alert. The first old alert (line 11) is not classified as the same as any new alert, so it is marked as \textbf{Fixed}. The second old alert (line 20) is classified as the same as the new alert, so it is declared \textbf{Open}.

```r
tribble(

  ~begin_line, ~begin_line_new, ~same_alert, ~conclusion, 
  11,         22,               FALSE,      "FIXED",
  20,         22,               TRUE,       "OPEN"

) %>% 
    kable(
          format = "latex",
          caption = "Categorizing alerts of the old version \\label{tab_categorizing_old} ",
          escape = FALSE,
          # booktabs = TRUE,
          # align = "r",
          # linesep = "",
          col.names = c(
            "Begin line of the alert\nin the old version",
            "Begin line of the alert\nin the new version which\nis combined with the old alert",
            "Same alert according\nto the heuristic,\nbased on features",
            "Category"
          ) %>% linebreak()
    ) %>%
      kableExtra::collapse_rows(columns = c(1,4), latex_hline = "major", valign =  "top") %>% 
      kableExtra::kable_styling(
        latex_options = c("HOLD_position", "striped")
      )
```

Table \ref{tab_categorizing_new} shows the new alert combined with the two old alerts. The new alert is classified as the same as one of the old alerts, so it is declared \textbf{Open}.

```r
tribble(

  ~begin_line, ~begin_line_new, ~same_alert, ~conclusion, 
  22,         11,               FALSE,      "OPEN",
  22,         20,               TRUE,       "OPEN"

) %>% 
    kable(
          format = "latex",
          caption = "Categorizing alerts of the old version \\label{tab_categorizing_new} ",
          escape = FALSE,
          # booktabs = TRUE,
          # align = "r",
          # linesep = "",
          col.names = c(
            "Begin line of the alert\nin the new version",
            "Begin line of the alert\nin the old version which\nis combined with the new alert",
            "Same alert according\nto the heuristic,\nbased on features",
            "Category"
          ) %>% linebreak()
    ) %>%
      kableExtra::collapse_rows(columns = c(1,4), latex_hline = "major", valign =  "top") %>% 
      kableExtra::kable_styling(
        latex_options = c("HOLD_position", "striped")
      )
```

Table \ref{tab_summary_categories} shows the final categories for the two alerts in the old version and the alert in the new version.

```r
tribble(

  ~alert,  ~category, 
  "Alert in the old version, line 11",         "FIXED",   
  "Alert in the old version, line 20",         "OPEN",   
  "Alert in the new version, line 22",         "OPEN"

) %>% 
    kable(
          format = "latex",
          caption = "Alerts and their categorization\\label{tab_summary_categories} ",
          escape = FALSE,
          # booktabs = TRUE,
          # align = "r",
          # linesep = "",
          col.names = c(
            "Alert",
            "Category"
          ) %>% linebreak()
    ) %>%
      kableExtra::kable_styling(
        latex_options = c("HOLD_position", "striped")
      )
```

\section{Comparing new alerts with new SATD comments}\label{results}

In this section, we study the correlation between the creation of new alerts and the insertion of \textit{Self-Admitted Technical Debt} comments in a transition between an old and a new version of source code.

In [@Potdar2014], the authors discuss that the existence of comments containing certain patterns may indicate what they call Self-Admitted Technical Debt (SATD). In [@Sierra2019], Self-Admitted Technical Debt is defined as the event in which the developer consciously introduces debt. According to these two works, the developer acknowledges the SATD in the form of comments. In [@Wehaibi2016] we can find some patterns based on the work of Potdar and Shihab. For instance, some of these patterns are "hack", "retarded", "remove this code", "treat this as a soft error", "kludge", "fixme", "this isn't quite right", "fix this crap", "abandon all hope" and "kaboom". We used all the terms used by the authors and added other terms that we found in the code and considered indicators of SATD. To find these additional terms, we sampled 5,000 comments and inspected them in order to recognize expressions that we judged as indicators of SATD. All the regular expressions representing these terms are listed in Section \ref{sec_SATDs}.

We select 32 tagged and released versions of the project ArgoUML. They were released between 2001-04-06 and 2011-12-15. For each pair of sequential versions, we generate the PMD alerts and categorise the identified alerts as \textbf{New}, \textbf{Fixed} or \textbf{Open} using the algorithm described in Section \ref{alg}. We want to understand if the number of new alerts, normalized by the magnitude of the change between two versions, is a good proxy for the amount of kludge introduced in the code base. A first approach we try in this preliminary investigation is to measure the correlation between the normalized amount of new alerts and comments that indicate \textit{Self-Admitted Technical Debt}.

```r
versions_used <- c(
  "9_8",  "9_9",  "10",   "11_1", "11_2", "11_3", "11_4", "12",   "13_1",
  "13_2", "13_3", "13_4", "13_5", "13_6", "14",   "15",   "16",   "17",
  "18",   "19",   "20",   "21",   "22",   "23",   "24",   "25",   "26",
  "27",   "28",   "29",   "30",   "31",   "32",   "33"
)


categorised <-  read_rds("categorised.rds") %>% 
  filter(
    version_old %in% versions_used
  )

versions_executed <-  read_rds("versions_executed.rds")

versions <-  versions_executed$version_new %>% unique()
```



Figure \ref{timeseries} shows, in the first plot, the number of alerts at the end of each version transition. The second plot shows the number of new and fixed alerts. The third plot shows the number of comments that contain expressions listed in Section \ref{sec_SATDs}. The fourth plot shows the number of new and fixed comments in each version transition. We classify each comment as \textbf{New}, \textbf{Fixed}, or \textbf{Open} using the following procedure: if the text of the comment is the same as in the preceding version, the comment is classified as \textbf{Open}; the remaining ones are classified as \textbf{Fixed} if they are in the old version and \textbf{New} if they are in the new version.


```r





results_alert <- analyse_alerts_and_categories()

results_comments <- create_version_comparisons_comment()

# alerts mirrors the _comments renaming below (definition inferred from it)
alerts <- results_alert %>% 
  rename_with(
    ~str_glue("{.x}_alerts")
  )

comments <- results_comments %>% 
  rename_with(
    ~str_glue("{.x}_comments")
  ) 
comparison <- alerts %>% 
  inner_join(
    comments,
    by = c(
      "major_version_old_alerts" = "major_version_old_comments" ,
      "minor_version_old_alerts" = "minor_version_old_comments" ,
      "major_version_new_alerts" = "major_version_new_comments" ,
      "minor_version_new_alerts" = "minor_version_new_comments" 
    )
  ) %>% 
  mutate(
    ratio_alerts = new_alerts/(new_alerts + fixed_alerts),
    ratio_comments = new_comments/(new_comments + fixed_comments),
    version = str_glue("{major_version_alerts}.{minor_version_alerts}")
  ) %>% 
  filter(
    major_version_alerts > 10,
    !(ratio_alerts == 0 & ratio_comments == 0 )
  ) %>% 
  rename_with(
    .cols = matches("_version_"),
    .fn = ~str_remove(.x, pattern = "_alerts")
  )
```

```{r, fig.cap="Changes, alerts and comments\\label{timeseries}", fig.height=0.9}

fixed_new_data <- comparisons_tidy %>% 
  filter(atribute %in% c("n_fixed_alerts", "n_new_alerts")) %>% 
  mutate(
    category = if_else(atribute == "n_fixed_alerts", "Fixed", "New")
  )

ggplot_fixed_new <- ggplot(
  fixed_new_data,
  aes(x = comparison, y = value, color = category, group = category)
) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5) +
  theme_minimal() +
  scale_color_manual(values = c(Fixed = "darkgreen", New = "darkred")) +
  theme(axis.text.x = element_text(angle = 90), legend.position = "top") +
  scale_y_continuous(labels = number_format(big.mark = ","), limits = c(0, NA)) +
  ggtitle("Number of fixed/new alerts per version transition") +
  labs(x = "Transition", y = "Number of alerts", color = "Category")

changed <- comparisons_tidy %>% filter(atribute %in% c("friction"))

ggplot_changed <- ggplot(changed, aes(x = comparison, y = value, group = 1)) +
  geom_line(size = 1.2, color = "darkblue") +
  geom_point(size = 2.5, color = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), legend.position = "top") +
  scale_y_continuous(labels = number_format(big.mark = ","), limits = c(0, NA)) +
  ggtitle("Change per version transition") +
  labs(x = "Transition", y = "Change")

fixed_new_comments <- comparisons_tidy %>% 
  filter(atribute %in% c("n_comments_new", "n_comments_fixed")) %>% 
  mutate(
    category = if_else(atribute == "n_comments_fixed", "Fixed", "New")
  )

ggplot_fixed_new_comments <- ggplot(
  fixed_new_comments,
  aes(x = comparison, y = value, color = category, group = category)
) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5) +
  theme_minimal() +
  scale_color_manual(values = c(Fixed = "darkgreen", New = "darkred")) +
  theme(axis.text.x = element_text(angle = 90), legend.position = "top") +
  scale_y_continuous(labels = number_format(big.mark = ","), limits = c(0, NA)) +
  ggtitle("Number of fixed/new comments per version transition") +
  labs(x = "Transition", y = "Number of comments", color = "Category")

ggplot_total_alerts <- ggplot(
  total_open_new_alerts,
  aes(x = comparison, y = total_alerts)
) +
  geom_line(size = 1.2, color = "darkblue", group = 1) +
  geom_point(size = 2.5, color = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), legend.position = "top") +
  scale_y_continuous(labels = number_format(big.mark = ","), limits = c(0, NA)) +
  ggtitle("Number of total alerts after version transition") +
  labs(x = "Transition", y = "Number of alerts")

total_comments_comparison <- fixed_new_comments %>% 
  separate(col = comparison, into = c("old", "new"), sep = " to ", remove = FALSE) %>% 
  group_by(comparison, old, new) %>% 
  summarise() %>% 
  left_join(total_comments, by = c("new" = "version"))

ggplot_total_comments <- ggplot(
  total_comments_comparison,
  aes(x = comparison, y = n)
) +
  geom_line(size = 1.2, color = "darkblue", group = 1) +
  geom_point(size = 2.5, color = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), legend.position = "top") +
  scale_y_continuous(labels = number_format(big.mark = ","), limits = c(0, NA)) +
  ggtitle("Number of total comments after version transition") +
  labs(x = "Transition", y = "Number of comments")

ggplot_total_alerts / ggplot_fixed_new / ggplot_total_comments / ggplot_fixed_new_comments +
  plot_layout(
    heights = unit(c(3, 3, 3, 3), "cm"),
    widths = unit(c(15, 15, 15, 15), "cm")
  )
```

\newpage

```r

correl_prop <- cor(
  comparison$ratio_alerts,
  comparison$ratio_comments
)
```

Hereafter we analyse the relation between the amount of new alerts and the amount of new comments:

$$PropNewAlerts = \frac{NewAlerts}{NewAlerts + FixedAlerts}$$

$$PropNewComments = \frac{NewComments}{NewComments + FixedComments}$$

The correlation between the proportion of new alerts and the proportion of new comments is `r correl_prop %>% number(accuracy = 0.01, decimal.mark = ".")`.

Figure \ref{scatter_prop} shows a scatter plot of the relation between the proportion of new alerts and the proportion of new comments. We added a regression line obtained from the following model:

$$ PropNewComments = \alpha + \beta \, PropNewAlerts $$

The shaded region represents the 95\% confidence interval around the regression line.

```{r, fig.cap="Proportion of new alerts x Proportion of new comments\\label{scatter_prop}", fig.pos="H"}

ggplot(
  comparison,
  aes(y = ratio_comments, x = ratio_alerts)
) +
  geom_point() +
  geom_text_repel(aes(label = version), nudge_y = 0.05, size = 2) +
  geom_smooth(method = "lm") +
  ggtitle("Proportion of new comments x Proportion of new alerts") +
  labs(
    y = "Proportion of new comments",
    x = "Proportion of new alerts",
    size = "Change"
  ) +
  scale_x_continuous(label = percent_format()) +
  scale_y_continuous(label = percent_format()) +
  scale_size_continuous(label = number_format(big.mark = ",", accuracy = 1)) +
  theme_minimal() +
  theme(legend.position = "top")
```

We can evaluate whether the positive relation we see between the new
comments and the new alerts was detected by chance. Table \ref{tab_reg}
shows the results of this regression. The p-value of the $\beta$
coefficient is near zero. This means that, if there actually were no
linear relation between the proportion of new alerts and the proportion
of new comments, it would be very unlikely to get a result as extreme as
the one we got. The estimate for $\beta$ is 0.57, meaning that for each
additional 1 pp in the proportion of new alerts we observe, on average,
an additional 0.57 pp in the proportion of new comments. The $R^2$ is
`r broom::glance(lm_prop_fit$fit)$r.squared %>% number(accuracy = 0.01, decimal.mark = ".")`,
so we could say that
`r broom::glance(lm_prop_fit$fit)$r.squared %>% percent(accuracy = 1, decimal.mark = ".")`
of the variability in the proportion of new comments can be explained by
the variation in the proportion of new alerts.

```r

library(parsnip)
library(gtsummary)

lm_prop <-  linear_reg() %>% 
  set_engine("lm")

lm_prop_fit <- lm_prop %>% 
  fit(ratio_comments ~ ratio_alerts, data = comparison)
tbl_prop <- tbl_regression(
  lm_prop_fit$fit,
  pvalue_fun = function(x) style_pvalue(x, digits = 2),
  label = ratio_alerts ~ "New Alerts Proportion",
  intercept = TRUE
) 

tbl_prop %>% 
  as_kable_extra(caption = "\\label{tab_reg} Regression: comments on alerts")
```

In the next steps, we can refine the way we select the SATD comments. There are papers that use more sophisticated schemes to identify them, for instance using Natural Language Processing. Following these steps, we may be able to establish the relation between alerts and SATD comments with more certainty, and perhaps explain a greater part of the variability in the proportion of new SATD comments.

\section{Appendix: expressions to find SATDs}\label{sec_SATDs}

List of regular expressions used to find SATDs:

```r
kludgenudger::get_kludge_expressions() %>%
  enframe(
    name = "#",
    value = "Expression"
  ) %>% 
  kable(
    longtable = TRUE
  ) %>% 
  kable_styling(
    latex_options = c("repeat_header")
  )
```

\section{References}



