docList-class: An S4 class to represent a document collection.


Description

A docList is a special tei2r object that holds a list of information about your document collection.

Slots

directory

A string that gives the filepath to the main directory (folder), which holds all the files in the collection.

filenames

A vector containing all of the filenames for the documents in the collection.

paths

A vector containing the full path to each file in the collection.

indexFile

A string that gives the filepath to the index file for the corpus. This file should house the meta-data for each file in the corpus.

index

A data frame that holds the meta-data for each document in the corpus. This data frame is created by reading the file found at indexFile.

stopwordsFile

A string that gives the filepath to the file that contains a comma-separated list of words to be removed during text cleanup.

stopwords

A vector of words derived from the stopwordsFile. It is passed to the text cleanup functions, which remove these words from the text.

texts

A list of character vectors, each drawn from documents in the collection, and each placed in the order provided by the index.
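
The slot layout above can be pictured with a small stand-in class. This is only a sketch: the class name docListSketch and the slot types are assumptions inferred from the slot descriptions, not the definition used inside tei2r.

```r
library(methods)

# Hypothetical stand-in class; NOT the tei2r definition.
# Slot types are guesses based on the slot descriptions.
setClass("docListSketch",
  representation(
    directory     = "character",   # path to the collection folder
    filenames     = "character",   # one entry per document
    paths         = "character",   # full path to each document
    indexFile     = "character",   # path to the meta-data file
    index         = "data.frame",  # meta-data read from indexFile
    stopwordsFile = "character",   # path to the stopwords file
    stopwords     = "character",   # words to drop during cleanup
    texts         = "list"         # one character vector per document
  )
)

# Fill two slots by hand and derive a third from them.
dl <- new("docListSketch",
          directory = "corpus",
          filenames = c("a.xml", "b.xml"))
dl@paths <- file.path(dl@directory, dl@filenames)

# Slots of an S4 object are read with the @ operator.
dl@paths
slotNames(dl)
```

Reading slots with @ works the same way on any S4 object, so a real docList should expose its slots in the same manner.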

What it does

The docList is the foundation of the tei2r package and should be the first object created when working with the package. The object is constructed by calling the buildDocList function. This function builds the object by storing the path to the collection's files (directory), the file containing the collection's meta-data (indexFile), and the stopwords file (stopwordsFile). From these pieces of information, the function automatically determines the filenames and paths for the collection's files.
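
As a rough illustration of how filenames and paths can be derived from the three inputs, the sketch below uses only base R. The helper sketchDocList is hypothetical and is not part of tei2r; it also assumes that the documents are .xml files and that the index file is a CSV, neither of which is stated above. See the buildDocList help page for the actual arguments and behavior.

```r
# Hypothetical helper, not part of tei2r. Assumes .xml documents
# and a CSV index file; both are assumptions made for this sketch.
sketchDocList <- function(directory, indexFile, stopwordsFile) {
  filenames <- list.files(directory, pattern = "\\.xml$")
  paths     <- file.path(directory, filenames)
  index     <- read.csv(indexFile, stringsAsFactors = FALSE)
  # The stopwords file is a comma-separated list of words.
  stopwords <- trimws(scan(stopwordsFile, what = "character",
                           sep = ",", quiet = TRUE))
  list(directory = directory, filenames = filenames, paths = paths,
       indexFile = indexFile, index = index,
       stopwordsFile = stopwordsFile, stopwords = stopwords)
}

# Build a tiny throwaway collection to try the helper on.
corpus <- file.path(tempdir(), "sketch_corpus")
dir.create(corpus, showWarnings = FALSE)
writeLines("<TEI/>", file.path(corpus, "doc1.xml"))
writeLines("<TEI/>", file.path(corpus, "doc2.xml"))

indexFile <- file.path(tempdir(), "sketch_index.csv")
write.csv(data.frame(filename = c("doc1.xml", "doc2.xml"),
                     title    = c("First", "Second")),
          indexFile, row.names = FALSE)

stopFile <- file.path(tempdir(), "sketch_stopwords.txt")
writeLines("the, and, of", stopFile)

dl <- sketchDocList(corpus, indexFile, stopFile)
dl$filenames
dl$stopwords
```

The helper returns a plain list rather than an S4 object to keep the sketch short; the point is only that the filenames and paths follow from the directory once it is known.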

See Also

buildDocList


michaelgavin/tei2r documentation built on May 22, 2019, 9:50 p.m.