read_dir_transcript: Read In Multiple Transcript Files From a Directory
In textreadr: Read Text Documents into R


Read in multiple transcript files from a directory and create a base::data.frame().

read_dir_transcript(
  path,
  col.names = c("Document", "Person", "Dialogue"),
  pattern = NULL,
  all.files = FALSE,
  recursive = FALSE,
  skip = 0,
  merge.broke.tot = TRUE,
  header = FALSE,
  dash = "",
  ellipsis = "...",
  quote2bracket = FALSE,
  rm.empty.rows = TRUE,
  na = "",
  sep = NULL,
  comment.char = "",
  max.person.nchar = 20,
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

`path`	Path to the directory.
`col.names`	A character vector specifying the column names of the transcript columns (document, person, dialogue).
`pattern`	An optional regular expression. Only file names which match the regular expression will be returned.
`all.files`	Logical. If `FALSE`, only the names of visible files are returned. If `TRUE`, all file names will be returned.
`recursive`	Logical. Should the listing recurse into directories?
`skip`	Integer; the number of lines of the data file to skip before beginning to read data.
`merge.broke.tot`	Logical. If `TRUE` and the file being read in is a .docx in which a single turn of talk is broken across lines, `read_transcript()` will attempt to merge the pieces into a single turn of talk.
`header`	Logical. If `TRUE`, the file contains the names of the variables as its first line.
`dash`	A character string to replace en and em dash special characters (the default, `""`, removes them).
`ellipsis`	A character string to replace the ellipsis special character.
`quote2bracket`	Logical. If `TRUE`, replaces curly quotes with curly braces (default is `FALSE`). If `FALSE`, curly quotes are removed.
`rm.empty.rows`	Logical. If `TRUE`, `read_transcript()` attempts to remove empty rows.
`na`	A character string to be interpreted as an `NA` value.
`sep`	The field separator character. Values on each line of the file are separated by this character. The default of `NULL` instructs `read_transcript()` to use a separator suitable for the file type being read in.
`comment.char`	A character vector of length one containing a single character or an empty string. Use `""` to turn off the interpretation of comments altogether.
`max.person.nchar`	The maximum number of characters that person names are expected to contain. This is used to warn the user if a separator appears beyond this length in the text.
`ignore.case`	Logical. If `TRUE`, case in the `pattern` argument will be ignored.
`verbose`	Logical. Should each iteration of the read-in be reported?
`...`	Ignored.
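
The file-selection arguments (`pattern`, `all.files`, `recursive`, `ignore.case`) are described in the same terms as the arguments of base R's `list.files()`. The sketch below uses `list.files()` directly to show how each one changes which files are picked up. It assumes, without confirmation from this page, that `read_dir_transcript()` selects files in a comparable way. The directory and file names are made up for illustration.

```r
# Build a throwaway directory with a few empty files (all names are hypothetical).
demo_dir <- file.path(tempdir(), "transcript_demo")
dir.create(file.path(demo_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(
    demo_dir,
    c("trans1.docx", "TRANS2.DOCX", "notes.txt", ".hidden.docx")
))
file.create(file.path(demo_dir, "nested", "trans3.docx"))

# pattern: keep only file names that match the regular expression.
docx_only <- list.files(demo_dir, pattern = "\\.docx$")

# ignore.case: match the pattern without regard to case.
docx_any_case <- list.files(demo_dir, pattern = "\\.docx$", ignore.case = TRUE)

# all.files: also return hidden files (names that start with a dot).
with_hidden <- list.files(demo_dir, pattern = "\\.docx$", all.files = TRUE)

# recursive: descend into subdirectories; paths are relative to demo_dir.
with_nested <- list.files(demo_dir, pattern = "\\.docx$", recursive = TRUE)
```

Running a listing like this first can be a cheap way to check which files a given combination of arguments would match before reading them in.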

Returns a data frame with one column each for document, person, and dialogue, named according to `col.names`.
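
As a rough illustration of that shape, the sketch below builds a stand-in data frame by hand with the default `col.names` and summarizes it with base R. The rows are invented for the example and are not output from `read_dir_transcript()`. The object names (`transcripts`, `turns_per_doc`, `by_doc`) are likewise hypothetical.

```r
# Hand-built stand-in for the returned data frame; every value here is
# invented for illustration and does not come from a real transcript.
transcripts <- data.frame(
    Document = c("trans1", "trans1", "trans2"),
    Person   = c("Teacher", "Sam", "Teacher"),
    Dialogue = c("Good morning.", "Morning!", "Open your books."),
    stringsAsFactors = FALSE
)

# Number of turns of talk in each document.
turns_per_doc <- table(transcripts$Document)

# Dialogue grouped by document, as a named list of character vectors.
by_doc <- split(transcripts$Dialogue, transcripts$Document)
```

Because each row is one turn of talk, ordinary data frame tools (`table()`, `split()`, `aggregate()`) apply without reshaping.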

read_transcript

skips <- c(0, 1, 1, 0, 0, 1)
path <- system.file("docs/transcripts", package = 'textreadr')
textreadr::peek(read_dir_transcript(path, skip = skips), Inf)

## Not run: 
## with additional cleaning
## library() attaches one package per call
library(tidyverse)
library(textshape)
library(textclean)

path %>%
    read_dir_transcript(skip = skips) %>%
    textclean::filter_row("Person", "^\\[") %>%
    mutate(
        Person = stringi::stri_replace_all_regex(Person, "(^/\\s*)|(:\\s*$)", "") %>%
            trimws(),
        Dialogue = stringi::stri_replace_all_regex(Dialogue, "(^/\\s*)", "")
    ) %>%
    peek(Inf)

## End(Not run)