knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
library(playjareyesores)
library(nomnoml)

Some brief background. Occassionally, I have students cheat on their essays or long-answer questions, typically by copying their answers from somewhere. My university provides access to plagiarism detection services like turnitin and safeassign, however these tools can be cumbersome to use. I have also sometimes resorted to using R for plagiarism detection. I am writing this package to collect my code for this purpose and put it one place.

Also, I am using this vignette as a pseudo blog to work out which kinds of functions to focus on building. So, this is a tutorial for myself.

Two academic papers

I'm reviewing a paper right now. It's very similar to another paper I've read by the same author. Whole chunks are copied from one previous paper to another. I can't be bothered to "manually" figure out which parts are the same. It would be nice to have some R functions for this.

#direction: right
nomnoml::nomnoml(
"[paper 1 | get into R] -> [clean]
[paper 2 | get into R] -> [clean]
[clean | keep parts you want] -> [compare]
[compare | lots of methods to try] -> [Report | Stuff you want to learn]",
height = 150, width=400)

Ok, did some googling and tried out a bunch things. I found some great packages that already exist, so am going to play around with them a bit. These packages include textreuse, and tabulizer (which unfortunately depends on Java, but allows for importing pdfs with 2 columns, useful for scientific papers).

Import and cleaning functions

Here are some wrapper functions for importing pdfs or plain text, and doing some basic cleaning, involving, deleting new lines, deleting "- " which happens when lines run over in a .pdf, and converting everything to lowercase.

p1 <- clean_1_col_pdf(file="data/test1col.pdf")
p2 <- clean_2_col_pdf(file="data/CGM2006.pdf")
p3 <- clean_plain_txt(file="data/sometext.txt")

Compare documents using textreuse

The textreuse package can do some heavy lifting in terms of comparing documents for overlap. Below I use the align_local function to compare two of my papers. Looks like my self-plagiarism is limited to one of the methods sections, which makes sense because these two papers would have used the same methods.

library(textreuse)
p1 <- clean_2_col_pdf(file="data/CGM2006.pdf")
p2 <- clean_2_col_pdf(file="data/Crump et al. - 2008.pdf")
test <- align_local(p1,p2)
## Crump et al. 2006 wzxhzdk:5
## Crump et al. 2008 wzxhzdk:6

Estimating number of same words

The align_local function is interesting, it produces to strings, one for each document, that are lined up as well as possible, and identical in length. When words don't line up, there is a fudge factor, and missing, deleted or different words are replaced with a hashtag. Here's a quick and dirty way to figure out how many of the words exactly line up.

al_summary <- align_local_sum(test)
al_summary$sum
al_summary$sentence

This seems useful to me. The sentence that you are looking at didn't necessarilly appear as consecutive words, however it gives a reasonable bird's eye view of the overlap between two documents. If there are a big chunks here, one document was copied from another.

N gram methods

Another useful way to detect overlap is to use n-gram methods. Here, I use the tm package to create a document term matrix for two texts. The function allows you to set the number of n-grams. All unique n-grams between the texts are computed and counted for each text. Then I fid the proportion of common n-grams between the two texts. In general, if texts use the same words, they will have some overlap, but as the number of n-grams grows, texts that are not the same will have vanishingly small overlap.

ngram_proportion_same(p1,p2,1)
ngram_proportion_same(p1,p2,2)
ngram_proportion_same(p1,p2,3)
ngram_proportion_same(p1,p2,4)
ngram_proportion_same(p1,p2,5)

By default, the overlapping ngrams are not returned, but they can be using show="ngrams". Note, that I've done some further cleaning the texts, this was necessary after I added another feature to show="ngrams", see next section. In general, it's not clear to me how "clean" the text needs to be, and I'll try to get back to this and figure it out.

p1 <- qdapRegex::rm_non_ascii(p1) %>%
  tm::removeNumbers() %>%
  qdapRegex::rm_non_words()
p2 <- qdapRegex::rm_non_ascii(p2) %>%
  tm::removeNumbers() %>%
  qdapRegex::rm_non_words()
out <- ngram_proportion_same(p1,p2,10,show="ngrams")
out$the_grams[1:10] # show first 10 ngrams

These two papers were on the same topic, and had many of the same references, which account for much of the overlap.

Reporting

t1 <- "here is some text. I'd like to write about a few things. Then I'm going to compare what I wrote here with what I'm going to write in a little bit. After that, I'm going to make a function to look at the documents side by side, and bold the ngrams that are the same between the texts."
t2 <-"And some more text for you. This time I'm not as certain what I'm going to say, but I'm going to compare what I write here with what I wrote before. I'll do that in a little. The purpose is to get some text that I can use to make a report that lines up the documents, showing which ngrams were the same."

out <- ngram_proportion_same(t1,t2,3,show="ngrams", meta=c("A title","B title"))
## A wzxhzdk:12
## B wzxhzdk:13

Range of n grams

out <- ngrams_analysis(t1,t2,range = 2:5)
attributes(out)
attributes(out$ngram2)

Full report using ngrams_report

The ngrams_report function takes the output from the ngrams_analysis function and returns a print object for use in an .RMD document that prints to HTML. The print out is a summary of the analysis.

Here is an example using two papers published by Robert Sternberg, who has been identified as self-plagiarizing in some of his work (for example see). I thought this would be a good test case. I obtained that Sternberg published in 2010, where there was supposed overlap between the texts. Let's use the ngrams_report function to find out. Remember, to set the knitr chunk option to results ="asis".

# load in papers
p1 <- clean_1_col_pdf("data/Sternberg2010.pdf")
p2 <- clean_1_col_pdf("data/Sternberg2010b.pdf")

# clean them up a a  bit more
p1 <- LSAfun::breakdown(p1) %>%
  qdapRegex::rm_white()
p2 <- LSAfun::breakdown(p2) %>%
  qdapRegex::rm_white()

# run the ngram analysis
meta_titles <- c("Sternberg 2010A, School Psychology International",
                 "Sternberg 2010B, Journal of Cognitive Education and Psychology")
out <- ngrams_analysis(p1,p2,3:8, meta = meta_titles)

# print out results below

ngrams_report(out, print_n = 5, highlight_n = "ngram5", color="red")

Reporting style options

The color option changes the font color of the highlighted text. It should respect any color that would normally work with CSS. Note, that the higlighted words are also wrapped by an HTML span code with id = ngram. As a result, css could be used to further stylize the highlighted text.

t1 <- "here is some text. I'd like to write about a few things. Then I'm going to compare what I wrote here with what I'm going to write in a little bit. After that, I'm going to make a function to look at the documents side by side, and bold the ngrams that are the same between the texts."
t2 <-"And some more text for you. This time I'm not as certain what I'm going to say, but I'm going to compare what I write here with what I wrote before. I'll do that in a little. The purpose is to get some text that I can use to make a report that lines up the documents, showing which ngrams were the same."

out <- ngrams_analysis(t1,t2,range = 2:5, meta=c("A title","B title"))

# print out results below

ngrams_report(out, print_n = 5, highlight_n = "ngram5", color="blue")


CrumpLab/playjareyesores documentation built on June 25, 2019, 8:29 a.m.