corpus_files: Get a comprehensive data frame describing the files of your...

View source: R/corpus_files.R

Description

The function translates the given hierarchy definition into a data frame with one row for each file, including the generated document ID.

Usage

corpus_files(
  dir,
  hierarchy = list(),
  fsep = .Platform$file.sep,
  full_list = FALSE
)

Arguments

dir

File path to the root directory of the text corpus, or a TIF[1] compliant data frame.

hierarchy

A named list of named character vectors describing the directory hierarchy level by level. If TRUE instead, the hierarchy structure is taken directly from the directory tree. See section Hierarchy of readCorpus for details.

fsep

Character string defining the path separator to use.

full_list

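Logical. If TRUE, a list holding the data frame of files plus additional data frames describing the hierarchy is returned instead of the single data frame; see the Value section.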

Value

Either a data frame with columns doc_id, file, path and one further factor column for each hierarchy level, or (if full_list=TRUE) a list containing that data frame (all_files) and also data frames describing the hierarchy by given names (hier_names), directories (hier_dirs) and relative paths (hier_paths).
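
To make the shape of the default return value more concrete, the following base R sketch builds a comparable data frame by hand from a temporary directory tree. It is not the package's implementation: the directory and file names are made up for illustration, the doc_id scheme is hypothetical, and the content of the path column is an assumption, since the documentation above only names the columns.

```r
# Build a throwaway corpus tree: <root>/<Topic>/<Source>/<file>.txt
# (all names below are made up for illustration)
root <- file.path(tempdir(), "corpus_sketch")
topics <- c("Winner", "Edwards")
sources <- c("Wikipedia_prev", "Wikipedia_new")
for (topic in topics) {
  for (src in sources) {
    leaf <- file.path(root, topic, src)
    dir.create(leaf, recursive = TRUE, showWarnings = FALSE)
    writeLines("placeholder text", file.path(leaf, paste0(topic, "_", src, ".txt")))
  }
}

# One row per file; each directory level becomes a factor column.
rel_paths <- list.files(root, recursive = TRUE)
parts <- strsplit(rel_paths, "/", fixed = TRUE)
files_df <- data.frame(
  doc_id = sub("\\.txt$", "", basename(rel_paths)),  # hypothetical ID scheme
  file = basename(rel_paths),
  path = file.path(root, rel_paths),                 # assumption: full file path
  Topic = factor(vapply(parts, `[`, character(1), 1)),
  Source = factor(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
print(files_df)
```

The real function additionally maps directory names to the descriptive labels given in hierarchy, which this sketch leaves out.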

References

[1] Text Interchange Formats (https://github.com/ropensci/tif)

Examples

myCorpusFiles <- corpus_files(
  dir=file.path(
    path.package("tm.plugin.koRpus"), "examples", "corpus"
  ),
  hierarchy=list(
    Topic=c(
      Winner="Reality Winner",
      Edwards="Natalie Edwards"
    ),
    Source=c(
      Wikipedia_prev="Wikipedia (old)",
      Wikipedia_new="Wikipedia (new)"
    )
  )
)
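
A variant of the call above with full_list=TRUE might look like the following sketch. The element names all_files, hier_names, hier_dirs and hier_paths are taken from the Value section; the variable names corpus_dir and myCorpusList are only illustrative, and the call is skipped if the package (and with it the example corpus) is not installed.

```r
# Locate the example corpus shipped with the package ("" if not installed).
corpus_dir <- system.file("examples", "corpus", package = "tm.plugin.koRpus")

if (nzchar(corpus_dir)) {
  myCorpusList <- tm.plugin.koRpus::corpus_files(
    dir = corpus_dir,
    hierarchy = list(
      Topic = c(
        Winner = "Reality Winner",
        Edwards = "Natalie Edwards"
      ),
      Source = c(
        Wikipedia_prev = "Wikipedia (old)",
        Wikipedia_new = "Wikipedia (new)"
      )
    ),
    full_list = TRUE
  )
  # The list should contain the file table plus the hierarchy descriptions.
  print(names(myCorpusList))
  str(myCorpusList$all_files)
  print(myCorpusList$hier_names)
}
```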

tm.plugin.koRpus documentation built on May 18, 2021, 5:07 p.m.