knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
library(playjareyesores)
library(magrittr)
library(dplyr)

The main article shows how to compare two documents and get a report of the text they share at a given n-gram level. You might have several documents that you want to compare. This tutorial develops methods for multiple-document comparison.
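To make "overlap at a given n-gram level" concrete, here is a minimal base R sketch. It is not how the package computes its report; `word_ngrams()` and `ngram_overlap()` are hypothetical helpers written for illustration, and the tokenization (lowercase, split on anything that is not a letter or digit) is an assumption.

```r
# Hypothetical helper: split a text into lowercase word n-grams (base R only).
word_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]              # drop empty tokens
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: proportion of unique n-grams in `a` that also occur in `b`.
ngram_overlap <- function(a, b, n = 3) {
  grams_a <- unique(word_ngrams(a, n))
  grams_b <- unique(word_ngrams(b, n))
  if (length(grams_a) == 0) return(NA_real_)
  mean(grams_a %in% grams_b)
}

doc_a <- "the quick brown fox jumps over the lazy dog"
doc_b <- "a quick brown fox jumps over a sleeping cat"
overlap <- ngram_overlap(doc_a, doc_b, n = 3)  # 3 of the 7 trigrams in doc_a are shared
```

Larger values of `n` make accidental matches rarer, so shared n-grams are more likely to reflect copied passages than common phrases.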

Read in some pdfs

I don't have a function for batch reading a bunch of pdfs just yet. Reading text from pdfs can be tricky business. It might be better to write your own code for this so you can flag pdfs that are behaving badly.

In any case, what we have here are a bunch of my papers, and two from Sternberg. The first step is to read the papers in. Hopefully I don't self-plagiarize very much, and Sternberg's texts will pop out as having lots of overlap.

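# List the pdf files to compare (file names only, relative to "pdfs/")
file_paths <- list.files("pdfs/")

# Read each pdf, strip non-ASCII characters, and normalize the text
pdf_txts <- list()
for(i in seq_along(file_paths)){
  paper <- clean_2_col_pdf(paste0("pdfs/",file_paths[i]))
  pdf_txts[i] <- qdapRegex::rm_non_ascii(paper) %>%
    LSAfun::breakdown()
}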
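A loop like the one above stops at the first pdf that fails to parse. If you want to flag badly behaved pdfs instead, one option is to wrap the reader in `tryCatch()`. This is only a sketch: `read_many()` and `toy_reader()` are hypothetical names, not part of the package, and the toy reader stands in for a real pdf reader so the example runs without any files.

```r
# Hypothetical helper: apply `reader` to every path and record failures
# instead of stopping. `reader` is any function that takes a file path
# and returns a character vector of text.
read_many <- function(paths, reader) {
  texts <- vector("list", length(paths))
  names(texts) <- basename(paths)
  failed <- character(0)
  for (i in seq_along(paths)) {
    result <- tryCatch(reader(paths[i]), error = function(e) NULL)
    if (is.null(result) || !any(nzchar(result))) {
      failed <- c(failed, paths[i])          # flag for manual inspection
    } else {
      texts[[i]] <- result
    }
  }
  keep <- !vapply(texts, is.null, logical(1))
  list(texts = texts[keep], failed = failed)
}

# Stand-in reader for the demo: fails on any path containing "broken".
toy_reader <- function(path) {
  if (grepl("broken", path)) stop("could not parse ", path)
  paste("text from", basename(path))
}

res <- read_many(c("pdfs/a.pdf", "pdfs/broken.pdf", "pdfs/c.pdf"), toy_reader)
res$failed        # paths that need a closer look
names(res$texts)  # files that were read successfully
```

In practice you would pass a real reader, for example a small function that calls `clean_2_col_pdf()` followed by the cleaning steps used in the loop.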

Compare the documents

the_texts <- unlist(pdf_txts)
out <- multi_doc_compare(text=the_texts,
                         n_grams = 3,
                         sd_criterion = 3)

The function returns several things, including the out$similarities and out$check_these elements used below.

Let's look at the similarities between the articles. They are not particularly compelling, as you will see:

knitr::kable(round(out$similarities, digits=2))

Check these

Well, are there any pairs we should check more closely? Yup, the Sternberg papers.

knitr::kable(out$check_these)
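I can't show the internals of the flagging here, but the name of the `sd_criterion` argument suggests a standard-deviation cutoff. The sketch below assumes one plausible rule: flag any pair whose similarity is more than `sd_criterion` standard deviations above the mean of all pairwise similarities. `flag_pairs()` and the toy matrix are hypothetical, and the package may do this differently.

```r
# Hypothetical helper: flag document pairs with unusually high similarity.
# Assumed rule (not confirmed by the package): similarity greater than
# mean + sd_criterion * sd, computed over the unique off-diagonal pairs.
flag_pairs <- function(sim, sd_criterion = 3) {
  upper <- upper.tri(sim)                    # each pair once, no diagonal
  vals <- sim[upper]
  cutoff <- mean(vals) + sd_criterion * sd(vals)
  idx <- which(upper & sim > cutoff, arr.ind = TRUE)
  data.frame(doc1 = rownames(sim)[idx[, "row"]],
             doc2 = colnames(sim)[idx[, "col"]],
             similarity = sim[idx],
             stringsAsFactors = FALSE)
}

# Toy similarity matrix: six papers with low baseline overlap, one pair high.
docs <- paste0("paper_", 1:6)
sim <- matrix(0.05, 6, 6, dimnames = list(docs, docs))
diag(sim) <- 1
sim["paper_5", "paper_6"] <- sim["paper_6", "paper_5"] <- 0.60

flagged <- flag_pairs(sim, sd_criterion = 3)
flagged
```

One thing to keep in mind with a rule like this: with only a handful of documents, a single extreme pair inflates the standard deviation enough that it may never clear a cutoff of 3, so a lower criterion can be more useful for small sets.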


CrumpLab/playjareyesores documentation built on June 25, 2019, 8:29 a.m.