SimilaR_fromDirectory: Quantify the Similarity of Pairs of R Functions

View source: R/SimilaR_fromDirectory.R

Description

An implementation of the SimilaR algorithm, a method to quantify the similarity of R functions based on Program Dependence Graphs. Possible use cases include detecting code clones to improve software quality and detecting plagiarism in students' homework assignments.

SimilaR_fromDirectory scans for function definitions in all *.R source files in a given directory and performs pairwise comparisons.

SimilaR_fromTwoFunctions compares the code of two function objects.
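
To make the directory scan described above concrete, here is a minimal base R sketch that lists the top-level function definitions found in the *.R files of a directory. It is not SimilaR's own implementation: the helper name list_function_definitions and the demo files are hypothetical, and only definitions of the form name <- function(...) or name = function(...) are recognized.

```r
# Sketch only: find top-level function definitions in *.R files with base R.
# `list_function_definitions` is a hypothetical helper, not part of SimilaR.
list_function_definitions <- function(dirname) {
  files <- list.files(dirname, pattern = "\\.R$", full.names = TRUE)
  found <- lapply(files, function(path) {
    exprs <- parse(file = path, keep.source = FALSE)
    names_in_file <- character(0)
    for (e in exprs) {
      # Match `name <- function(...)` and `name = function(...)`
      is_assignment <- is.call(e) && length(e) == 3 &&
        (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("=")))
      if (is_assignment && is.name(e[[2]]) &&
          is.call(e[[3]]) && identical(e[[3]][[1]], as.name("function"))) {
        names_in_file <- c(names_in_file, as.character(e[[2]]))
      }
    }
    data.frame(file = rep(basename(path), length(names_in_file)),
               name = names_in_file,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, found)
}

# Demo on a temporary directory with two small source files
demo_dir <- tempfile("similar_demo_")
dir.create(demo_dir)
writeLines(c("f1 <- function(x) {x*x}", "y <- 3"), file.path(demo_dir, "a.R"))
writeLines("f2 = function(x, y) {x+y}", file.path(demo_dir, "b.R"))
defs <- list_function_definitions(demo_dir)
print(defs)
```

The non-function assignment y <- 3 is skipped, so the result has one row per function definition together with the file it came from.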

Usage

SimilaR_fromDirectory(
  dirname,
  returnType = c("data.frame", "matrix"),
  fileTypes = c("function", "file"),
  aggregation = c("tnorm", "sym", "both")
)

SimilaR_fromTwoFunctions(
  function1,
  function2,
  functionNames,
  returnType = c("data.frame", "matrix"),
  aggregation = c("tnorm", "sym", "both")
)

Arguments

dirname

path to a directory with source files named *.R

returnType

"data.frame" or "matrix"; indicates the output object type

fileTypes

"function" or "file"; indicates which pairs of functions extracted from the source files in dirname should be compared; "function" compares each function against every other function; "file" compares only the functions defined in different source files

aggregation

"sym", "tnorm", or "both"; specifies which model of similarity asymmetry should be used; "sym" means that one (overall) similarity degree is computed; "both" evaluates and returns the degree to which the first function in a function pair is similar ("contained in", "is subset of") to the second one, and, separately, the extent to which the second function is similar to the first one; "tnorm" computes two similarity values and aggregates them to a single number

function1

the first function object to compare

function2

the second function object to compare

functionNames

optional names of the two functions, to be included in the output
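
The difference between the two fileTypes settings can be illustrated without running SimilaR at all. The sketch below enumerates the pairs that each setting would select from a small, made-up inventory of functions; the data frame funs and its contents are hypothetical, and the pairs are treated as unordered for simplicity.

```r
# Hypothetical inventory of functions and the files that define them
funs <- data.frame(
  name = c("f1", "f2", "g1"),
  file = c("a.R", "a.R", "b.R"),
  stringsAsFactors = FALSE
)

# All unordered pairs of functions, as row indices into `funs`
idx <- t(combn(nrow(funs), 2))
pairs <- data.frame(
  name1 = funs$name[idx[, 1]],
  name2 = funs$name[idx[, 2]],
  same_file = funs$file[idx[, 1]] == funs$file[idx[, 2]],
  stringsAsFactors = FALSE
)

# fileTypes = "function": every pair is compared
pairs_function <- pairs
# fileTypes = "file": only pairs defined in different source files
pairs_file <- pairs[!pairs$same_file, ]

print(pairs_function)
print(pairs_file)
```

Here the pair (f1, f2) is dropped under "file" because both functions live in a.R, which is the behaviour one usually wants when each file is a separate student submission.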

Details

Depending on the aggregation argument, the method returns either a single value representing the overall (symmetric) similarity between a pair of functions, or two different values measuring the (non-symmetric) degrees of "subsethood". In the latter case, the user may wish to combine the two values by means of a custom aggregation function.
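
As a rough illustration of such custom aggregation, the sketch below combines two directional similarity degrees in three common ways. The data frame is made up for the example, and the column names SimilaR12 and SimilaR21 are an assumption about how the two degrees are labelled. Which aggregation the "tnorm" option applies internally is not stated here, so none of these should be read as reproducing it.

```r
# Hypothetical result with two directional similarity degrees per pair;
# the column names SimilaR12/SimilaR21 and all values are illustrative only.
res <- data.frame(
  name1 = c("f1", "f1"),
  name2 = c("f2", "f3"),
  SimilaR12 = c(0.9, 0.4),
  SimilaR21 = c(0.5, 0.2),
  stringsAsFactors = FALSE
)

res$agg_min  <- pmin(res$SimilaR12, res$SimilaR21)   # minimum t-norm
res$agg_prod <- res$SimilaR12 * res$SimilaR21        # product t-norm
res$agg_mean <- (res$SimilaR12 + res$SimilaR21) / 2  # arithmetic mean

# Rank the pairs by the most conservative aggregate
res <- res[order(res$agg_min, decreasing = TRUE), ]
print(res)
```

The minimum is the strictest of the three: a pair scores high only if each function is largely contained in the other, whereas the mean still rewards a pair in which one function is merely a subset of a much larger one.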

Value

If returnType is equal to "data.frame", a data frame that gives the information about the similarity of the inspected pairs of functions, row by row, is returned. The data frame has the following columns:

Rows in the data frame are sorted by the SimilaR column in descending order. SimilaR_fromTwoFunctions returns a data frame with only one row.

If returnType is equal to "matrix", a square matrix is returned. The element at index (i,j) equals the similarity degree between the i-th and the j-th function. When aggregation is equal to "sym" or "tnorm", the matrix is symmetric. Column names and row names of the matrix are generated from the names of the functions being compared.
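
A matrix result can be inspected with ordinary base R tools. The sketch below uses a made-up 3 x 3 matrix (all values, including those on the diagonal, are illustrative and not produced by SimilaR) to check symmetry and to pick out the most similar pair of distinct functions.

```r
# Hypothetical similarity matrix; every value here is made up
fn <- c("f1", "f2", "f3")
m <- matrix(c(1.0, 0.8, 0.1,
              0.8, 1.0, 0.3,
              0.1, 0.3, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(fn, fn))

# Expected to hold for aggregation = "sym" or "tnorm"
symmetric <- isSymmetric(m)

# Ignore self-comparisons, then locate the largest remaining entry
off <- m
diag(off) <- NA
best <- which(off == max(off, na.rm = TRUE), arr.ind = TRUE)[1, ]
top_pair <- c(rownames(m)[best[["row"]]], colnames(m)[best[["col"]]])

print(symmetric)
print(top_pair)
```

For a symmetric matrix the maximum appears twice, at (i,j) and (j,i); the sketch simply keeps the first occurrence in column-major order.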

References

Bartoszuk M., A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods, Ph.D. thesis, Warsaw University of Technology, Warsaw, Poland, 2018.

Bartoszuk M., Gagolewski M., Binary aggregation functions in software plagiarism detection, In: Proc. FUZZ-IEEE'17, IEEE, 2017.

Bartoszuk M., Beliakov G., Gagolewski M., James S., Fitting aggregation functions to data: Part II - Idempotentization, In: Carvalho J.P. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 611), Springer, 2016, pp. 780-789. doi:10.1007/978-3-319-40581-0_63.

Bartoszuk M., Beliakov G., Gagolewski M., James S., Fitting aggregation functions to data: Part I - Linearization and regularization, In: Carvalho J.P. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part II (Communications in Computer and Information Science 611), Springer, 2016, pp. 767-779. doi:10.1007/978-3-319-40581-0_62.

Bartoszuk M., Gagolewski M., Detecting similarity of R functions via a fusion of multiple heuristic methods, In: Alonso J.M., Bustince H., Reformat M. (Eds.), Proc. IFSA/EUSFLAT 2015, Atlantis Press, 2015, pp. 419-426.

Bartoszuk M., Gagolewski M., A fuzzy R code similarity detection algorithm, In: Laurent A. et al. (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, Part III (CCIS 444), Springer-Verlag, Heidelberg, 2014, pp. 21-30.

Examples

f1 <- function(x) {x*x}
f2 <- function(x,y) {x+y}

## A data frame is returned: 1 row, 4 columns
SimilaR_fromTwoFunctions(f1,
                         f2,
                         returnType = "data.frame",
                         aggregation = "tnorm")

## Custom names in the returned data frame
SimilaR_fromTwoFunctions(f1,
                         f2,
                         functionNames = c("first", "second"),
                         returnType = "data.frame",
                         aggregation = "tnorm")

## A data frame is returned: 1 row, 5 columns
SimilaR_fromTwoFunctions(f1,
                         f2,
                         returnType = "data.frame",
                         aggregation = "both")

## A non-symmetric square matrix is returned,
## with 2 rows and 2 columns
SimilaR_fromTwoFunctions(f1,
                         f2,
                         returnType = "matrix",
                         aggregation = "both")


## Typical example, where we wish to compare the functions from different files,
## but we do not want to compare the functions from the same file.
## There will be one value describing the overall similarity level.
SimilaR_fromDirectory(system.file("testdata", "data", package = "SimilaR"),
                      returnType = "data.frame",
                      fileTypes = "file",
                      aggregation = "sym")

## In this example we want to compare every pair of functions: even those
## defined in the same file. Two (non-symmetric) similarity degrees
## are reported.
SimilaR_fromDirectory(system.file("testdata", "data2", package = "SimilaR"),
                      returnType = "data.frame",
                      fileTypes = "function",
                      aggregation = "both")

SimilaR documentation built on July 1, 2020, 6 p.m.