knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README-" )
The goal of dupree
is to identify chunks / blocks of highly duplicated code
within a set of R scripts.
A very lightweight approach is used:
The user provides a set of *.R
and/or *.Rmd
files;
All R-code in the user-provided files is read and code-blocks are identified;
The non-trivial symbols from each code-block are retained (for instance,
really common symbols like <-
, ,
, +
, (
are dropped);
Similarity between different blocks is calculated using stringdist::seq_sim
by longest-common-subsequence (symbol-identity is at whole-word level - so
"my_data", "my_Data", "my.data" and "myData" are not considered to be identical
in the calculation - and all non-trivial symbols have equal weight in the
similarity calculation);
Code-blocks pairs (both between and within the files) are returned in order of highest similarity
To prevent the results being dominated by high-identity blocks containing very
few symbols (eg, library(dplyr)
) the user can specify a min_block_size
. Any
code-block containing at least this many non-trivial symbols will be kept.
You can install dupree
from github with:
if (!"dupree" %in% installed.packages()) { # Alternatively: # install.packages("dupree") remotes::install_github("russHyde/dupree") }
To run dupree
over a set of R files, you can use the dupree()
,
dupree_dir()
or dupree_package()
functions. For example, to identify
duplication within all of the .R
and .Rmd
files for the dupree
package
you could run the following:
## basic example code library(dupree) files <- dir(pattern = "*.R(md)*$", recursive = TRUE) dupree(files)
Any top-level code blocks that contain at least
r formals(dupree)$min_block_size
non-trivial tokens are
included in the above analysis (a token being a function or variable name, an
operator etc; but ignoring comments, white-space and some really common tokens:
[](){}-+$@:,=
, <-
, &&
etc). To be more restrictive, you could consider
larger code-blocks (increase min_block_size
) within just the ./R/
source
code directory:
# R-source code files in the ./R/ directory of the dupree package: source_files <- dir(path = "./R", pattern = "*.R(md)*$", full.names = TRUE) # analyse any code blocks that contain at least 50 non-trivial tokens dupree(source_files, min_block_size = 50)
For each (sufficiently big) code block in the provided files, dupree
will
return the code-block that is most-similar to it (although any given block
may be present in the results multiple times if it is the closest match for
several other code blocks).
Code block pairs with a higher score
value are more similar. score
lies in
the range [0, 1]; and is calculated by the
stringdist
package: matching
occurs at the token level: the token "my_data" is no more similar to the token
"myData" than it is to "x".
If you find code-block-pairs with a similarity score much greater than 0.5 there is probably some commonality that could be abstracted away.
Note that you can do something similar using the functions dupree_dir
and
(if you are analysing a package) dupree_package
.
# Analyse all R files in the R/ directory: dupree_dir(".", filter = "R/")
# Analyse all R files except those in the tests / presentations directories: # `dupree_dir` uses grep-like arguments dupree_dir( ".", filter = "tests|presentations", invert = TRUE )
# Analyse all R source code in the package (only looking at the ./R/ directory) dupree_package(".")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.