In mattansb/cheatR: Catch Cheaters

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  # fig.path = "/man/figures/"
)

Scripting

Create a list of files:

# `pattern` is a regular expression, so the dot is escaped to match a literal "."
my_files <- list.files(path = '../man/files/', pattern = '\\.doc', full.names = TRUE)
my_files

The first 3 documents are different drafts of the same paper, so we would expect them to be similar to each other. The last document is a draft of a different paper, so it should be dissimilar to the first 3. All files are about 45K words long.

Now we can use cheatR to find duplicates.

The only function, catch_em, takes the following input arguments:

flist - a list of documents (.doc/.docx/.pdf). A full/relative path must be provided.
n_grams - the n-gram size used when comparing documents (see the ngram package).
time_lim - max time in seconds for each comparison (we found that some corrupt files run forever and crash R, so a time limit might be needed).
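For a rough intuition of what n_grams controls, here is a minimal base-R sketch of word n-gram overlap between two short texts. It is not cheatR's or ngram's implementation: the helpers word_ngrams and ngram_overlap are hypothetical, and the overlap measure (the share of one text's n-grams found in the other) is an assumption chosen for illustration.

```r
# Hypothetical helpers for illustration only; not part of cheatR or ngram.

# Split a text into overlapping word n-grams.
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Share of the unique n-grams in `a` that also appear in `b`
# (an assumed similarity measure, not necessarily the one cheatR uses).
ngram_overlap <- function(a, b, n) {
  grams_a <- unique(word_ngrams(a, n))
  grams_b <- unique(word_ngrams(b, n))
  if (length(grams_a) == 0) return(NA_real_)
  mean(grams_a %in% grams_b)
}

draft_1 <- "the quick brown fox jumps over the lazy dog"
draft_2 <- "the quick brown fox leaps over the lazy dog"

ngram_overlap(draft_1, draft_2, n = 2)  # short n-grams: most are shared
ngram_overlap(draft_1, draft_2, n = 5)  # long n-grams: one changed word breaks them all
```

In this toy example a single changed word leaves most 2-grams intact but breaks every 5-gram, which suggests why larger n-gram sizes make for a stricter comparison.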

library(cheatR)
results <- catch_em(flist = my_files,
                    n_grams = 10, time_lim = 1) # defaults

The resulting list contains a matrix with the similarity values between each pair of documents:

results
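To show how a pairwise similarity matrix of this kind can be read, the sketch below builds a made-up symmetric matrix and lists the document pairs above a cut-off. The values, document names, and the 0.7 threshold are all assumptions for illustration; they are not output from catch_em, and the actual structure of results may differ.

```r
# Made-up similarity values for illustration; not output from catch_em().
docs <- c("draft_a", "draft_b", "draft_c", "other_paper")
sim <- matrix(c(1.00, 0.90, 0.85, 0.02,
                0.90, 1.00, 0.80, 0.01,
                0.85, 0.80, 1.00, 0.03,
                0.02, 0.01, 0.03, 1.00),
              nrow = 4, byrow = TRUE, dimnames = list(docs, docs))

threshold <- 0.7  # arbitrary cut-off chosen for this example

# Look only at the upper triangle so each pair is reported once
# and the diagonal (a document compared with itself) is skipped.
idx <- which(sim >= threshold & upper.tri(sim), arr.ind = TRUE)

suspicious <- data.frame(
  doc_1 = rownames(sim)[idx[, "row"]],
  doc_2 = colnames(sim)[idx[, "col"]],
  similarity = sim[idx]
)
suspicious
```

Here the three drafts form the only pairs above the cut-off, while the unrelated paper does not appear at all.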

You can also plot the relational graph to get a clearer picture of who copied from whom.

plot(results, weight_range = c(0.7, 1), remove_lonely = FALSE)

mattansb/cheatR documentation built on April 22, 2022, 4:43 p.m.
