The goal of dupree
is to identify chunks / blocks of highly duplicated
code within a set of R scripts.
A very lightweight approach is used:
The user provides a set of *.R
and/or *.Rmd
files;
All R-code in the user-provided files is read and code-blocks are identified;
The non-trivial symbols from each code-block are retained (for
instance, really common symbols like <-
, ,
, +
, (
are
dropped);
Similarity between different blocks is calculated using
stringdist::seq_sim
by longest-common-subsequence (symbol-identity
is at whole-word level - so “my_data”, “my_Data”, “my.data” and
“myData” are not considered to be identical in the calculation -
and all non-trivial symbols have equal weight in the similarity
calculation);
Code-blocks pairs (both between and within the files) are returned in order of highest similarity
To prevent the results being dominated by high-identity blocks
containing very few symbols (eg, library(dplyr)
) the user can specify
a min_block_size
. Any code-block containing at least this many
non-trivial symbols will be kept.
You can install dupree
from github with:
if (!"dupree" %in% installed.packages()) {
# Alternatively:
# install.packages("dupree")
remotes::install_github("russHyde/dupree")
}
To run dupree
over a set of R files, you can use the dupree()
,
dupree_dir()
or dupree_package()
functions. For example, to identify
duplication within all of the .R
and .Rmd
files for the dupree
package you could run the following:
## basic example code
library(dupree)
files <- dir(pattern = "*.R(md)*$", recursive = TRUE)
dupree(files)
#> # A tibble: 14 x 7
#> file_a file_b block_a block_b line_a line_b score
#> <chr> <chr> <int> <int> <int> <int> <dbl>
#> 1 R/dupree_classes.R tests/testthat/tes… 33 8 57 13 0.296
#> 2 tests/testthat/tes… tests/testthat/tes… 8 10 13 118 0.248
#> 3 R/dupree_classes.R R/dupree_classes.R 33 61 57 117 0.218
#> 4 tests/testthat/tes… tests/testthat/tes… 8 11 13 64 0.216
#> 5 R/dupree_classes.R R/dupree_classes.R 33 88 57 180 0.215
#> 6 tests/testthat/tes… tests/testthat/tes… 11 1 64 1 0.185
#> 7 tests/testthat/tes… tests/testthat/tes… 1 2 1 132 0.172
#> 8 R/dupree_classes.R R/dupree.R 33 111 57 124 0.146
#> 9 tests/testthat/tes… tests/testthat/tes… 8 6 13 25 0.120
#> 10 R/dupree_classes.R tests/testthat/hel… 33 4 57 4 0.114
#> 11 R/dupree_classes.R R/dupree_code_enum… 88 48 180 90 0.111
#> 12 presentations/clea… R/dupree_classes.R 28 61 316 117 0.105
#> 13 tests/testthat/tes… tests/testthat/tes… 6 3 25 11 0.0972
#> 14 R/dupree_code_enum… tests/testthat/tes… 48 1 90 1 0.00298
Any top-level code blocks that contain at least 40 non-trivial tokens
are included in the above analysis (a token being a function or variable
name, an operator etc; but ignoring comments, white-space and some
really common tokens: [](){}-+$@:,=
, <-
, &&
etc). To be more
restrictive, you could consider larger code-blocks (increase
min_block_size
) within just the ./R/
source code directory:
# R-source code files in the ./R/ directory of the dupree package:
source_files <- dir(path = "./R", pattern = "*.R(md)*$", full.names = TRUE)
# analyse any code blocks that contain at least 50 non-trivial tokens
dupree(source_files, min_block_size = 50)
#> # A tibble: 1 x 7
#> file_a file_b block_a block_b line_a line_b score
#> <chr> <chr> <int> <int> <int> <int> <dbl>
#> 1 ./R/dupree_classes.R ./R/dupree_classes.R 61 88 117 180 0.104
For each (sufficiently big) code block in the provided files, dupree
will return the code-block that is most-similar to it (although any
given block may be present in the results multiple times if it is the
closest match for several other code blocks).
Code block pairs with a higher score
value are more similar. score
lies in the range [0, 1]; and is calculated by the
stringdist
package:
matching occurs at the token level: the token “my_data” is no more
similar to the token “myData” than it is to “x”.
If you find code-block-pairs with a similarity score much greater than 0.5 there is probably some commonality that could be abstracted away.
Note that you can do something similar using the functions dupree_dir
and (if you are analysing a package) dupree_package
.
# Analyse all R files in the R/ directory:
dupree_dir(".", filter = "R/")
#> # A tibble: 6 x 7
#> file_a file_b block_a block_b line_a line_b score
#> <chr> <chr> <int> <int> <int> <int> <dbl>
#> 1 ./R/dupree_classes.R ./R/dupree_classes… 33 61 57 117 0.218
#> 2 ./R/dupree_classes.R ./R/dupree_classes… 33 88 57 180 0.215
#> 3 ./tests/testthat/te… ./tests/testthat/t… 1 2 1 132 0.172
#> 4 ./R/dupree_classes.R ./R/dupree.R 33 111 57 124 0.146
#> 5 ./R/dupree_classes.R ./R/dupree_code_en… 88 48 180 90 0.111
#> 6 ./R/dupree_code_enu… ./tests/testthat/t… 48 1 90 1 0.00298
# Analyse all R files except those in the tests / presentations directories:
# `dupree_dir` uses grep-like arguments
dupree_dir(
".",
filter = "tests|presentations", invert = TRUE
)
#> # A tibble: 4 x 7
#> file_a file_b block_a block_b line_a line_b score
#> <chr> <chr> <int> <int> <int> <int> <dbl>
#> 1 ./R/dupree_class… ./R/dupree_classes.R 33 61 57 117 0.218
#> 2 ./R/dupree_class… ./R/dupree_classes.R 33 88 57 180 0.215
#> 3 ./R/dupree_class… ./R/dupree.R 33 111 57 124 0.146
#> 4 ./R/dupree_class… ./R/dupree_code_enumera… 88 48 180 90 0.111
# Analyse all R source code in the package (only looking at the ./R/ directory)
dupree_package(".")
#> # A tibble: 6 x 7
#> file_a file_b block_a block_b line_a line_b score
#> <chr> <chr> <int> <int> <int> <int> <dbl>
#> 1 ./R/dupree_classes.R ./R/dupree_classes… 33 61 57 117 0.218
#> 2 ./R/dupree_classes.R ./R/dupree_classes… 33 88 57 180 0.215
#> 3 ./tests/testthat/te… ./tests/testthat/t… 1 2 1 132 0.172
#> 4 ./R/dupree_classes.R ./R/dupree.R 33 111 57 124 0.146
#> 5 ./R/dupree_classes.R ./R/dupree_code_en… 88 48 180 90 0.111
#> 6 ./R/dupree_code_enu… ./tests/testthat/t… 48 1 90 1 0.00298
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.