etwtc: Package for processing text from government PDF documents

#' Convert PDF documents to a corpus
#'
#' Reads every PDF file in a directory into R and returns a corpus for text analysis.
#' @param path Path to a directory containing one or more .pdf documents.
#' @return A \code{tm} \code{VCorpus} with one document per PDF file, named after the files (without the .pdf extension).
#' @keywords pdf Corpus
#' @export
#' @examples
#' \dontrun{
#' pdf_corpus(path = '~/User/folder/')
#' }

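pdf_corpus <- function(path){
  # List the PDF files once, with full paths, so texts and names stay aligned.
  # The escaped dot avoids matching names that merely end in "pdf".
  files <- list.files(path = path, pattern = "\\.pdf$", full.names = TRUE)

  # Fail early with a clear message instead of erroring inside pdf_text()
  if (length(files) == 0) {
    stop("No .pdf files found in: ", path)
  }

  # Read each PDF and collapse its pages into a single string
  texts <- vapply(files, function(file) {
    pages <- pdftools::pdf_text(file, opw = "", upw = "")
    # Replace line breaks (Windows or Unix style) with spaces
    pages <- gsub("\r?\n", " ", pages)
    paste(pages, collapse = " ")
  }, character(1), USE.NAMES = FALSE)

  corpus <- tm::VCorpus(tm::VectorSource(texts))

  # Name each document after its file, dropping only the trailing .pdf
  names(corpus) <- sub("\\.pdf$", "", basename(files))

  return(corpus)
}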
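
pdf_corpus selects files and derives document names with regular expressions, so the exact patterns matter. The sketch below is an illustration only: it uses base R and hypothetical file names (none of them come from the package) to compare the loose patterns "pdf$" and ".pdf" with escaped, anchored ones.

```r
# Hypothetical file names, chosen only to illustrate the patterns
files <- c("budget_2019.pdf", "a_pdf_report.pdf", "notespdf", "readme.txt")

# "pdf$" keeps any name ending in "pdf", including "notespdf"
loose_matches <- grep("pdf$", files, value = TRUE)

# "\\.pdf$" requires a literal dot before the extension
pdfs <- grep("\\.pdf$", files, value = TRUE)

# An unescaped "." matches any character, so gsub() also removes "_pdf"
# from the middle of "a_pdf_report.pdf"
loose_names <- gsub(".pdf", "", pdfs)

# The escaped, anchored pattern removes only the trailing extension
doc_names <- sub("\\.pdf$", "", pdfs)

print(loose_matches)
print(loose_names)
print(doc_names)
```

With these inputs, the loose patterns would pick up "notespdf" as a PDF and shorten "a_pdf_report.pdf" to "a_report", while the anchored patterns leave the rest of each name intact.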

palesl/etwtc documentation built on May 24, 2019, 6:14 p.m.

R/pdf_corpus.R
In palesl/etwtc: Package for processing text from government PDF documents