remove_boilerplate: Remove repetitive "boilerplate" text from documents


View source: R/remove_boilerplate.R

Description

Remove repetitive "boilerplate" text from documents to reduce noise in the structural topic model (STM) analysis.

Usage

remove_boilerplate(input_dir, ngram_dir, output_dir, rep_text_dir,
  header_footer_dir, language = "en")

Arguments

input_dir

Directory containing text files to extract ngrams from.

ngram_dir

Directory in which to find ngrams.

output_dir

Directory in which to save texts with boilerplate removed.

rep_text_dir

Directory in which to save repetitive text for review.

header_footer_dir

Directory in which to save header and footer text for review.

language

Language in which the documents are written. Defaults to "en".
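
Examples

The following is a minimal sketch of how a call might look, not an example taken from the package. The directory names under tempdir() are hypothetical, and it assumes that remove_boilerplate is exported from gensci.stm, that input_dir already holds the text files, and that ngram_dir already holds ngrams from an earlier step. Creating the directories up front is a precaution; whether the function creates them itself is not documented here.

```r
# Hypothetical directory layout; none of these paths come from the package.
base_dir <- file.path(tempdir(), "boilerplate_demo")
dirs <- c(
  input_dir         = file.path(base_dir, "texts"),
  ngram_dir         = file.path(base_dir, "ngrams"),
  output_dir        = file.path(base_dir, "cleaned"),
  rep_text_dir      = file.path(base_dir, "repetitive_text"),
  header_footer_dir = file.path(base_dir, "headers_footers")
)

# Make sure every directory exists before the call.
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Only attempt the call when the package is installed and there are
# input files to process (assumption: an empty input_dir is not useful).
has_package <- requireNamespace("gensci.stm", quietly = TRUE)
has_input <- length(list.files(dirs[["input_dir"]])) > 0

if (has_package && has_input) {
  gensci.stm::remove_boilerplate(
    input_dir         = dirs[["input_dir"]],
    ngram_dir         = dirs[["ngram_dir"]],
    output_dir        = dirs[["output_dir"]],
    rep_text_dir      = dirs[["rep_text_dir"]],
    header_footer_dir = dirs[["header_footer_dir"]],
    language          = "en"
  )
}
```

Keeping rep_text_dir and header_footer_dir separate from output_dir matches the argument descriptions above: the cleaned texts go to one place, while the removed repetitive text and the header and footer text are saved elsewhere for review.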


dtburk/gensci.stm documentation built on Nov. 13, 2019, 12:33 a.m.