knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  # fig.path = "/man/figures/"
)
Create a list of files:
my_files <- list.files(path = '../man/files/', pattern = '.doc', full.names = TRUE)
my_files
The first 3 documents are different drafts of the same paper, so we would expect them to be similar to each other. The last document is a draft of a different paper, so it should be dissimilar to the first 3. All files are about 45K words long.
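One detail about the `pattern` argument used with `list.files` above: it is interpreted as a regular expression, so an unescaped `.` matches any character. The sketch below uses made-up file names (they are not the files in this example) to show the difference between the loose pattern and an escaped, anchored one.

```r
# Hypothetical file names, used only to illustrate the matching rules.
candidates <- c("draft1.docx", "draft2.doc", "mydoc_notes.txt", "paper.pdf")

# Unescaped: "." matches any character, so "mydoc_notes.txt" also
# matches (the "y" followed by "doc").
loose <- grepl(".doc", candidates)

# Escaped and anchored: only names ending in .doc or .docx match.
strict <- grepl("\\.docx?$", candidates)

candidates[loose]
candidates[strict]
```

If a folder holds only documents, the loose pattern is harmless; the stricter form is mainly useful when other files share the directory.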
Now we can use cheatR to find duplicates.
The only function, catch_em, takes the following input arguments:
flist- a list of documents (
time_lim- max time in seconds for each comparison (we found that some corrupt files run forever and crash R, so a time limit might be needed).
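To illustrate why a per-comparison time limit helps, here is a minimal sketch of one way to cap the run time of an expression in base R with `setTimeLimit`. The helper `with_time_limit` is hypothetical and is not part of cheatR; it is not necessarily how `catch_em` enforces `time_lim` internally.

```r
# Hypothetical helper (not part of cheatR): evaluate `expr`, but give up
# after `seconds` of elapsed time and return `on_timeout` instead.
# Note: any error raised by `expr`, not only a timeout, yields `on_timeout`.
with_time_limit <- function(expr, seconds, on_timeout = NA) {
  # The limit applies only to the rest of the current computation.
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again when the helper returns.
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE), add = TRUE)
  # `expr` is evaluated lazily here, i.e. after the limit is in place.
  tryCatch(expr, error = function(e) on_timeout)
}

# A quick computation finishes well within the limit.
fast <- with_time_limit(1 + 1, seconds = 1)

# A slow computation is abandoned and replaced by NA.
slow <- with_time_limit({ Sys.sleep(5); "finished" }, seconds = 0.2)
```

With a wrapper of this kind, one stalled comparison yields a missing value instead of blocking the whole run, which is the behaviour the `time_lim` argument is described as providing.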
library(cheatR)
results <- catch_em(flist = my_files, n_grams = 10, time_lim = 1) # defaults
The resulting list contains a matrix with the similarity values between each pair of documents.
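As a rough illustration of what a pairwise similarity matrix of this kind holds, the sketch below computes n-gram overlap between three toy texts in base R. The function names, the toy documents, and the overlap measure itself are assumptions made for this example; cheatR's actual computation may differ.

```r
# Split a text into overlapping word n-grams.
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Share of the distinct n-grams in `a` that also occur in `b`
# (an assumed measure, chosen for simplicity).
ngram_overlap <- function(a, b, n) {
  grams_a <- unique(word_ngrams(a, n))
  grams_b <- unique(word_ngrams(b, n))
  if (length(grams_a) == 0) return(NA_real_)
  mean(grams_a %in% grams_b)
}

# Toy documents, invented for this illustration.
docs <- c(
  draft1 = "the quick brown fox jumps over the lazy dog",
  draft2 = "the quick brown fox leaps over the lazy dog",
  other  = "an entirely different sentence about something else"
)

# Pairwise matrix: rows supply the n-grams, columns are the comparison target.
idx <- seq_along(docs)
sim <- outer(idx, idx, Vectorize(function(i, j) {
  ngram_overlap(docs[[i]], docs[[j]], n = 3)
}))
dimnames(sim) <- list(names(docs), names(docs))
round(sim, 2)
```

The two drafts share most of their 3-grams, while the unrelated text shares none with either. The same pattern would be expected, at a much larger scale, for the three related drafts and the one unrelated paper used in this example.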
You can also plot the relational graph if you'd like to get a clearer picture of who copied from whom.
plot(results, weight_range = c(0.7, 1), remove_lonely = FALSE)