In russHyde/dupree: Identify Duplicated R Code in a Project

dupree

The goal of dupree is to identify chunks / blocks of highly duplicated code within a set of R scripts.

A very lightweight approach is used:

The user provides a set of *.R and/or *.Rmd files;
All R-code in the user-provided files is read and code-blocks are identified;
The non-trivial symbols from each code-block are retained (for instance, really common symbols like <-, ,, +, ( are dropped);
Similarity between different blocks is calculated using stringdist::seq_sim with the longest-common-subsequence method (symbols are compared as whole words, so "my_data", "my_Data", "my.data" and "myData" are all treated as different symbols, and all non-trivial symbols have equal weight in the similarity calculation);
Code-block pairs (both between and within the files) are returned in order of decreasing similarity.
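
The steps above can be sketched in base R. This is an illustration only, not dupree's actual implementation: the helper names (block_tokens, lcs_length, lcs_similarity), the list of "trivial" tokens, and the normalisation of the score (twice the longest-common-subsequence length divided by the total number of tokens) are all assumptions made for the example.

```r
# Hypothetical helper: the non-trivial tokens of one code block.
# The `trivial` set is illustrative, not the list dupree uses.
block_tokens <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(parsed)
  # keep the leaves of the parse tree, ignoring comments
  pd <- pd[pd$terminal & pd$token != "COMMENT", ]
  trivial <- c(
    "<-", ",", "+", "-", "(", ")", "{", "}", "[", "]",
    "=", "$", "@", ":", "&&"
  )
  pd$text[!pd$text %in% trivial]
}

# Length of the longest common subsequence of two token vectors.
# m[i + 1, j + 1] holds the LCS length of a[1:i] and b[1:j].
lcs_length <- function(a, b) {
  m <- matrix(0L, length(a) + 1L, length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      m[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        m[i, j] + 1L
      } else {
        max(m[i, j + 1L], m[i + 1L, j])
      }
    }
  }
  m[length(a) + 1L, length(b) + 1L]
}

# Assumed normalisation: 0 means no shared tokens, 1 means identical sequences.
lcs_similarity <- function(a, b) {
  if (length(a) + length(b) == 0L) {
    return(1)
  }
  2 * lcs_length(a, b) / (length(a) + length(b))
}

block_a <- "result <- summarise(group_by(my_data, id), total = sum(value))"
block_b <- "output <- summarise(group_by(other_data, id), total = sum(value))"

similarity <- lcs_similarity(block_tokens(block_a), block_tokens(block_b))
similarity
```

Each block here has eight non-trivial tokens and the two blocks share six of them in order, so the sketch scores the pair at 0.75.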

To prevent the results from being dominated by high-identity blocks that contain very few symbols (e.g., library(dplyr)), the user can specify a min_block_size. Only code-blocks containing at least this many non-trivial symbols are kept.

Installation

You can install dupree from GitHub with:

# Install only if the package is not already available
if (!requireNamespace("dupree", quietly = TRUE)) {
  # Alternatively:
  # install.packages("dupree")
  remotes::install_github("russHyde/dupree")
}

Example

To run dupree over a set of R files, you can use the dupree(), dupree_dir() or dupree_package() functions. For example, to identify duplication within all of the .R and .Rmd files for the dupree package you could run the following:

## basic example code
library(dupree)

# `pattern` is a regular expression: match file names ending in .R or .Rmd
files <- dir(pattern = "\\.R(md)?$", recursive = TRUE)

dupree(files)

Any top-level code blocks that contain at least min_block_size non-trivial tokens are included in the above analysis (the default value is given by formals(dupree)$min_block_size). A token is a function or variable name, an operator and so on; comments, white-space and some really common tokens ([](){}-+$@:,=, <-, && etc) are ignored. To be more restrictive, you could consider only larger code-blocks (by increasing min_block_size) within just the ./R/ source code directory:

# R-source code files in the ./R/ directory of the dupree package:
source_files <- dir(path = "./R", pattern = "\\.R(md)?$", full.names = TRUE)

# analyse any code blocks that contain at least 50 non-trivial tokens
dupree(source_files, min_block_size = 50)

For each (sufficiently big) code block in the provided files, dupree will return the code-block that is most-similar to it (although any given block may be present in the results multiple times if it is the closest match for several other code blocks).

Code-block pairs with a higher score value are more similar. score lies in the range [0, 1] and is calculated by the stringdist package. Matching occurs at the token level: the token "my_data" is no more similar to the token "myData" than it is to "x".
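
To make the token-level point concrete, the following base-R sketch contrasts a character-level edit distance (utils::adist, which is not what dupree uses) with the all-or-nothing comparison of whole tokens. The variable names are hypothetical.

```r
# Character-level view: "myData" is only a couple of edits away from
# "my_data", whereas "x" is far away.
char_dist <- utils::adist("my_data", c("myData", "x"))
char_dist[1, ]

# Token-level view: a token either matches exactly or it does not,
# so "myData" and "x" are equally different from "my_data".
tokens_match <- "my_data" == c("myData", "x")
tokens_match
```

Under the token-level view, renaming a variable from my_data to myData costs exactly as much as renaming it to x.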

If you find code-block pairs with a similarity score much greater than 0.5, there is probably some commonality that could be abstracted away.
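
As a small illustration of what "abstracting away" might look like, the sketch below takes two near-duplicate blocks and moves the shared logic into a function. The data and the function name standardise are hypothetical, and real duplicated blocks flagged by dupree would typically be much longer.

```r
heights <- c(150, 160, 170)
weights <- c(55, 65, 75)

# Two near-duplicate blocks: only the variable name differs
scaled_heights <- (heights - mean(heights)) / sd(heights)
scaled_weights <- (weights - mean(weights)) / sd(weights)

# The shared logic, written once
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}

# The duplicated blocks become single calls
scaled_heights_2 <- standardise(heights)
scaled_weights_2 <- standardise(weights)
```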

Note that you can do something similar using the functions dupree_dir and (if you are analysing a package) dupree_package.

# Analyse all R files in the R/ directory:
dupree_dir(".", filter = "R/")

# Analyse all R files except those in the tests / presentations directories:
# `dupree_dir` uses grep-like arguments
dupree_dir(
  ".",
  filter = "tests|presentations", invert = TRUE
)

# Analyse all R source code in the package (only looking at the ./R/ directory)
dupree_package(".")

russHyde/dupree documentation built on April 8, 2024, 10:37 a.m.
