View source: R/LexisNexisSource.R
Construct a source for an input containing a set of articles exported from LexisNexis in the HTML format.
LexisNexisSource(x, encoding = "UTF-8")
x
: Either a character identifying the file or a connection.
encoding
: A character giving the encoding of x.
This function imports the body of the articles, but also sets several meta-data variables on individual documents:
datetimestamp
: The publication date.
heading
: The title of the article.
origin
: The newspaper the article comes from.
intro
: The short introduction accompanying the article.
section
: The part of the newspaper containing the article.
subject
: One or several keywords defining the subject.
coverage
: One or several keywords identifying the covered regions.
company
: One or several keywords identifying the covered companies.
stocksymbol
: One or several keywords identifying the stock exchange symbols of the covered companies.
industry
: One or several keywords identifying the covered industries.
type
: The type of source from which the document originates.
wordcount
: The number of words in the article.
publisher
: The publisher of the newspaper.
rights
: The copyright information associated with the article.
language
: This information is set automatically if readerControl = list(language = NA) is passed. Otherwise, the language specified manually is set for all articles. If omitted, the default, "en", is used.
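The automatic language setting described for the language field could be requested roughly as follows. This is a minimal sketch, assuming the English test file shipped with the package and that readerControl is passed through tm's Corpus(); the object name corpus_auto is illustrative only.

```r
library(tm)
library(tm.plugin.lexisnexis)

# Test file bundled with the package (same file as in the Examples)
file <- system.file("texts", "lexisnexis_test_en.html",
                    package = "tm.plugin.lexisnexis")

# language = NA asks for the language to be set automatically
# for each article instead of using the default "en"
corpus_auto <- Corpus(LexisNexisSource(file),
                      readerControl = list(language = NA))

# Inspect the language recorded for the first article
meta(corpus_auto[[1]], "language")
```

Leaving out readerControl entirely gives the default behaviour, where "en" is set for all articles.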
Please note that dates are not guaranteed to be parsed correctly if the machine from which the HTML file was exported uses a locale different from that of the machine where it is read.
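One way to notice this date problem is to check the datetimestamp field right after import. The sketch below assumes that a date which could not be parsed ends up missing or NA, and that the LC_TIME locale category is the relevant one; neither is stated in this documentation, and the object names are illustrative.

```r
library(tm)
library(tm.plugin.lexisnexis)

file <- system.file("texts", "lexisnexis_test_en.html",
                    package = "tm.plugin.lexisnexis")
corpus <- Corpus(LexisNexisSource(file))

# Locale R currently uses for dates and times (assumed to be the one that matters)
Sys.getlocale("LC_TIME")

# Publication date of each article as text, NA when it is missing
dates <- vapply(corpus, function(doc) {
  d <- meta(doc, "datetimestamp")
  if (length(d) == 0L || is.na(d)) NA_character_ else format(d)
}, character(1))

# Positions of articles whose date should be checked by hand
which(is.na(dates))
```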
Currently, only HTML files saved in English and French are supported. Please send the maintainer examples of LexisNexis files in your language if you want it to be supported.
An object of class LexisNexisSource, which extends the class Source, representing a set of articles from LexisNexis.
Milan Bouchet-Valat
readLexisNexisHTML for the function actually parsing individual articles.
getSources to list available sources.
library(tm)
library(tm.plugin.lexisnexis)
file <- system.file("texts", "lexisnexis_test_en.html",
                    package = "tm.plugin.lexisnexis")
corpus <- Corpus(LexisNexisSource(file))

# See the contents of the documents
inspect(corpus)

# See meta-data associated with first article
meta(corpus[[1]])
Loading required package: NLP
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 17
Content: chars: 82
[[2]]
<<PlainTextDocument>>
Metadata: 17
Content: chars: 74
author : By PAPER AUTHOR
datetimestamp: 1991-12-19
description : character(0)
heading : Heading One
id : SomeNewsp199112191
language : en
origin : Some Newspaper
intro : character(0)
section : Section 5; Part 2; Page 16; Column 2; National Desk
subject : character(0)
coverage : character(0)
company : character(0)
stocksymbol : character(0)
industry : character(0)
type : character(0)
wordcount : 584 words
rights : Copyright 1991 My Company
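Individual meta-data fields such as those printed above can also be read by name with tm's meta(). The following is a small sketch of collecting a few fields into a data frame; the helper field() and the object overview are hypothetical names introduced here, and empty fields (shown as character(0) above) are turned into NA.

```r
library(tm)
library(tm.plugin.lexisnexis)

file <- system.file("texts", "lexisnexis_test_en.html",
                    package = "tm.plugin.lexisnexis")
corpus <- Corpus(LexisNexisSource(file))

# Hypothetical helper: one field of one document as a single string,
# NA when the field is empty, multiple keywords joined with "; "
field <- function(doc, name) {
  value <- meta(doc, name)
  if (length(value) == 0L) NA_character_
  else paste(as.character(value), collapse = "; ")
}

# One row per article with a few of the documented fields
overview <- data.frame(
  id      = vapply(corpus, field, character(1), name = "id"),
  heading = vapply(corpus, field, character(1), name = "heading"),
  origin  = vapply(corpus, field, character(1), name = "origin"),
  stringsAsFactors = FALSE
)
overview
```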