readMail | R Documentation |
Return a function which reads in an electronic mail document.
readMail(DateFormat = character())
DateFormat
a character vector giving date-time formats for the “Date” header field in the mail document. By default, the “basic” formats of RFC 5322 are tried.
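To illustrate what a DateFormat entry looks like, the following base R sketch tries strptime formats in order on a “Date” header value. The helper parse_date and the sample header are assumptions made for this example and are not part of the package. The LC_TIME locale is set to "C" so that English day and month names match.

```r
## Use English day/month names regardless of the session locale.
invisible(Sys.setlocale("LC_TIME", "C"))

## Hypothetical helper: return the first successful parse, or NA.
parse_date <- function(x, formats) {
  for (f in formats) {
    d <- strptime(x, format = f, tz = "UTC")
    if (!is.na(d)) return(as.POSIXct(d))
  }
  as.POSIXct(NA)
}

## Sample header value (made up for illustration).
hdr <- "Mon, 01 Jan 2018 10:15:30 +0000"

## The first format lacks the day name and fails; the second one matches.
formats <- c("%d %b %Y %H:%M:%S %z",
             "%a, %d %b %Y %H:%M:%S %z")
parsed <- parse_date(hdr, formats)
parsed_chr <- format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Formats written in this style are what one would pass as DateFormat when the mails at hand use a non-standard “Date” layout.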
Formally, this function is a function generator: it returns a function (which reads in a mail document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the “Date” header format) via lexical scoping.
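The generator pattern can be illustrated with a minimal base R sketch. The name make_reader and the fields of the returned list are hypothetical stand-ins. This is not the package's implementation, and only the closure mechanism is the point.

```r
## Hypothetical generator mirroring the pattern described above.
make_reader <- function(DateFormat = character()) {
  ## The returned function has a fixed signature (elem, language, id)
  ## but still sees DateFormat through lexical scoping.
  function(elem, language, id) {
    list(id = id,
         language = language,
         n_lines = length(elem$content),
         date_formats = DateFormat)
  }
}

reader <- make_reader(DateFormat = "%d %b %Y %H:%M:%S")
res <- reader(elem = list(content = c("Subject: hi", "", "Hello")),
              language = "en", id = "msg-1")
```

Here res$date_formats holds the format given to the generator, although the format was never passed to reader directly.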
In version 0.3.0 of the tm.plugin.mail package, the reader code was switched to use the Python email library via the CRAN package reticulate. Compared to previous versions, this allows the reader to
handle textual message bodies in character sets other than US-ASCII and the use of base64 or quoted-printable transfer encodings (RFC 2045)
handle non-US-ASCII text data in message header fields (RFC 2047)
correctly handle the metadata in structured header fields (RFC 5322)
For messages using the Multipurpose Internet Mail Extensions (MIME), the texts extracted from the messages are the (suitably decoded) bodies when using the ‘text/plain’ or ‘text/html’ content types, or the body parts using these types when using ‘multipart/mixed’ or ‘multipart/alternative’ (see RFC 2046 for more information).
Non-MIME messages are treated like ‘text/plain’.
The extracted texts are represented as character vectors whose length is the number of extracted body parts and whose names give the MIME subtype ("plain" or "html").
This allows text mining applications to flexibly handle HTML content “as appropriate” by filtering on the names of the content of the MailDocument objects.
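Filtering on the names can be sketched in base R as follows. The vector parts is made-up sample data shaped like the content described above (one element per body part, named by MIME subtype). It is not output produced by the reader.

```r
## Made-up content: one element per extracted body part, named by subtype.
parts <- c(plain = "Hello, world.",
           html  = "<p>Hello, world.</p>",
           plain = "Second part.")

## Keep only the plain-text parts, or only the HTML parts.
plain_only <- parts[names(parts) == "plain"]
html_only  <- parts[names(parts) == "html"]
```

With a real document, the same subsetting would be applied to the character vector returned by content() for that document.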
In case the Python processing fails or its results cannot be transferred to R (in particular, when text body parts contain embedded NULs), the reader falls back to simple header field processing appropriate for unstructured headers, and/or extracts no text. Information about problems is provided in the problems element of the metadata.
A function with the following formals:
elem
a named list with the component content which must hold the document to be read in.
language
a string giving the language.
id
a character giving a unique identifier for the created text document.
The function returns a MailDocument representing the text and metadata extracted from elem$content. The argument id is used as fallback if no corresponding metadata entry is found in elem$content.
Ingo Feinerer and Kurt Hornik
Reader for basic information on the reader infrastructure employed by package tm.
strptime for date-time format specifications.
RFC 5322, RFC 2045, RFC 2046, RFC 2047.
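require("tm")
require("tm.plugin.mail")  # provides readMail()
## Directory of sample mails shipped with the package.
newsgroup <- system.file("mails", package = "tm.plugin.mail")
news <- VCorpus(DirSource(newsgroup),
                readerControl = list(reader = readMail))
inspect(news)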
## Use the high-level content and metadata accessors from package 'NLP':
require("NLP")
content(news[[2]])
meta(news[[2]])
## Processed header fields of the message.
meta(news[[2]])$header