Representing and computing on text documents.
Text documents are documents containing (natural language)
text. In packages which employ the infrastructure provided by package
NLP, such documents are represented via the virtual S3 class
"TextDocument": such packages then provide S3 text document
classes extending the virtual base class (such as the
AnnotatedPlainTextDocument objects provided by package NLP itself).
All extension classes must provide an as.character()
method which extracts the natural language text in documents of the
respective classes in a “suitable” (not necessarily structured)
form, as well as content() and meta()
methods for accessing the (possibly raw) document content and metadata.
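
As an illustration of these requirements, the following is a minimal sketch of an extension class. The class name SimpleTextDocument, its constructor, and its internal list layout are hypothetical and chosen only for this example; the sketch assumes package NLP is installed and that its content() and meta() generics have the argument lists used here.

```r
library(NLP)  # assumed to provide the content() and meta() generics

# Hypothetical constructor for a toy class extending "TextDocument".
SimpleTextDocument <- function(text, meta = list()) {
  structure(list(content = text, meta = meta),
            class = c("SimpleTextDocument", "TextDocument"))
}

# Required: extract the natural language text as a character vector.
as.character.SimpleTextDocument <- function(x, ...) x$content

# Required: access the (here unprocessed) document content.
content.SimpleTextDocument <- function(x) x$content

# Required: access all metadata, or a single tag if one is given.
meta.SimpleTextDocument <- function(x, tag = NULL, ...) {
  if (is.null(tag)) x$meta else x$meta[[tag]]
}

doc <- SimpleTextDocument("A short text. It has two sentences.",
                          meta = list(author = "unknown"))
as.character(doc)
meta(doc, "author")
```

Putting the hypothetical class before "TextDocument" in the class vector means its methods are dispatched first, while inherits(doc, "TextDocument") still holds.
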
In addition, the infrastructure features the generic functions words(),
sents(), etc., for which
extension classes can provide methods giving a structured view of the
text contained in documents of these classes (returning, e.g., a
character vector with the word tokens in these documents, and a list
of such character vectors, one per sentence, respectively).
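
The structured views can be sketched in the same way. The class TokenizedDocument below is hypothetical and simply stores sentences that are already tokenized; a real extension class would also supply the as.character(), content() and meta() methods, which are omitted here for brevity. The sketch assumes package NLP is installed and exports words() and sents() as S3 generics.

```r
library(NLP)  # assumed to provide the words() and sents() generics

# Hypothetical toy class holding pre-tokenized sentences.
TokenizedDocument <- function(sentences) {
  structure(list(sentences = sentences),
            class = c("TokenizedDocument", "TextDocument"))
}

# sents(): a list with one character vector of word tokens per sentence.
sents.TokenizedDocument <- function(x, ...) x$sentences

# words(): all word tokens of the document as one character vector.
words.TokenizedDocument <- function(x, ...) {
  unlist(x$sentences, use.names = FALSE)
}

doc <- TokenizedDocument(list(c("A", "short", "text", "."),
                              c("It", "has", "two", "sentences", ".")))
words(doc)
sents(doc)
```
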