LexisNexisSource: LexisNexis Source

View source: R/LexisNexisSource.R

Description

Construct a source for an input containing a set of articles exported from LexisNexis in the HTML format.

Usage

  LexisNexisSource(x, encoding = "UTF-8")

Arguments

x

Either a character string identifying the file, or a connection.

encoding

A character string giving the encoding of x. It is used only when the HTML input does not declare its own encoding, which should not normally happen with files exported from LexisNexis.

Details

This function imports the body of the articles and also sets several meta-data variables on individual documents. The example output at the end of this page shows the variables set for a sample article.

Please note that dates are not guaranteed to be parsed correctly if the machine from which the HTML file was exported uses a locale different from that of the machine where it is read.
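The locale dependence mentioned above comes from how R matches month names when parsing dates. As a minimal sketch in base R, the hypothetical helper with_time_locale() below (not part of this package) temporarily switches LC_TIME before evaluating an expression. The date string and its format are assumptions chosen for illustration, and whether wrapping a call to LexisNexisSource() this way fixes a given file has not been verified here.

```r
# Hypothetical helper (not part of tm.plugin.lexisnexis): evaluate `expr`
# with the LC_TIME locale temporarily set to `locale`.
with_time_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous locale when the function exits, even on error.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  # `expr` is a lazy promise, so it is only evaluated here,
  # after the locale has been switched.
  force(expr)
}

# English month names parse in the "C" locale regardless of the
# machine's default locale. The input string is an illustrative assumption.
d <- with_time_locale("C",
                      as.Date("December 19, 1991", format = "%B %d, %Y"))
print(d)
```

If dates come out as NA after import, comparing Sys.getlocale("LC_TIME") on the exporting and importing machines is a reasonable first check.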

Currently, only HTML files saved in English and French are supported. Please send the maintainer examples of LexisNexis files in your language if you want it to be supported.

Value

An object of class LexisNexisSource which extends the class Source, representing a set of articles from LexisNexis.

Author(s)

Milan Bouchet-Valat

See Also

readLexisNexisHTML for the function that actually parses individual articles.

getSources to list available sources.

Examples

    library(tm)
    library(tm.plugin.lexisnexis)

    # Locate the sample English export shipped with the package
    file <- system.file("texts", "lexisnexis_test_en.html",
                        package = "tm.plugin.lexisnexis")

    # Build a corpus from the LexisNexis source
    corpus <- Corpus(LexisNexisSource(file))

    # See the contents of the documents
    inspect(corpus)

    # See meta-data associated with first article
    meta(corpus[[1]])

Example output

Loading required package: NLP
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  17
Content:  chars: 82

[[2]]
<<PlainTextDocument>>
Metadata:  17
Content:  chars: 74

  author       : By PAPER AUTHOR
  datetimestamp: 1991-12-19
  description  : character(0)
  heading      : Heading One
  id           : SomeNewsp199112191
  language     : en
  origin       : Some Newspaper
  intro        : character(0)
  section      : Section 5; Part 2; Page 16; Column 2; National Desk
  subject      : character(0)
  coverage     : character(0)
  company      : character(0)
  stocksymbol  : character(0)
  industry     : character(0)
  type         : character(0)
  wordcount    : 584 words
  rights       : Copyright 1991 My Company
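In the output above, wordcount is shown as a string ("584 words") rather than a number. As a minimal sketch, assuming the value always has the form "<digits> words", the hypothetical helper parse_wordcount() below (not part of this package) converts it to an integer using base R only.

```r
# Hypothetical helper (not part of tm.plugin.lexisnexis): turn a
# wordcount string such as "584 words" into an integer.
# Assumes the "<digits> words" form seen in the example output;
# anything else yields NA (with a coercion warning).
parse_wordcount <- function(x) {
  as.integer(sub("^\\s*([0-9]+)\\s+words\\s*$", "\\1", x))
}

n <- parse_wordcount("584 words")
print(n)  # 584
```

With a corpus loaded, the same helper could be applied to the value returned by meta(corpus[[1]], "wordcount").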

tm.plugin.lexisnexis documentation built on Oct. 30, 2019, 10:33 a.m.