Construct a source for an input containing a set of articles exported from Factiva in the XML or HTML formats.
FactivaSource(x, encoding = "UTF-8",
              format = c("auto", "XML", "HTML"))
x
: Either a character identifying the file or a connection.
encoding
: A character giving the encoding of x.
format
: The format of the file or connection identified by x.
This function can be used to import both XML and HTML files.
If format is set to “auto” (the default), the file extension is used to guess the format: if the file name ends with “.xml” or “.XML”, XML is assumed; otherwise, the file is assumed to be in the HTML format.
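The extension rule described above can be sketched in a few lines of R. The helper name guess_format is hypothetical and is not part of the package; it only mirrors the documented behaviour, and the actual implementation may differ.

```r
# Hypothetical helper mirroring the documented "auto" rule.
guess_format <- function(path) {
  # Only the exact suffixes ".xml" and ".XML" select the XML format.
  if (grepl("\\.(xml|XML)$", path)) "XML" else "HTML"
}

guess_format("export.xml")
guess_format("export.XML")
guess_format("export.html")
guess_format("export.Xml")  # mixed case falls through to HTML under this rule
```

If a file has an unusual extension, passing format = "XML" or format = "HTML" explicitly avoids relying on the guess.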
It is advised to export articles from Factiva in the XML format rather than in HTML when possible, since the latter does not provide completely clean information. In particular, dates are not guaranteed to be parsed correctly if the machine from which the HTML file was exported uses a locale different from that of the machine where it is read.
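When an HTML export comes from a machine with a different locale, one possible workaround is to switch LC_TIME temporarily while the corpus is built. The following is only a sketch: the locale name "fr_FR.UTF-8" is an assumption (names differ between platforms), and as.Date() stands in here for whatever date parsing the package performs internally.

```r
# Remember the current setting so it can be restored afterwards.
old_locale <- Sys.getlocale("LC_TIME")

# "fr_FR.UTF-8" is an assumed locale name; Sys.setlocale() returns ""
# (with a warning) when the requested locale is not installed.
new_locale <- suppressWarnings(Sys.setlocale("LC_TIME", "fr_FR.UTF-8"))

if (nzchar(new_locale)) {
  # With a French LC_TIME, French month names can be matched by %B.
  print(as.Date("29 d\u00e9cembre 2011", format = "%d %B %Y"))
} else {
  message("French locale not available on this machine.")
}

# Restore the original setting.
invisible(Sys.setlocale("LC_TIME", old_locale))
```

In practice, the call to Corpus() would take the place of the as.Date() line, between setting and restoring the locale.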
The following screencast illustrates how to export articles in the correct HTML format from the Factiva website: http://rtemis.hypotheses.org/files/2017/02/Factiva-animated-tutorial.gif. Note that if you do not follow this procedure, you will obtain an HTML file that cannot be imported by this package.
This function imports the body of the articles, but also sets several meta-data variables on individual documents:
datetimestamp
: The publication date.
heading
: The title of the article.
origin
: The newspaper the article comes from.
edition
: The (local) variant of the newspaper.
section
: The part of the newspaper containing the article.
subject
: One or several keywords defining the subject.
company
: One or several keywords identifying the covered companies.
industry
: One or several keywords identifying the covered industries.
infocode
: One or several Information Provider Codes (IPC).
infodesc
: One or several Information Provider Descriptions (IPD).
coverage
: One or several keywords identifying the covered regions.
page
: The number of the page on which the article appears (if applicable).
wordcount
: The number of words in the article.
publisher
: The publisher of the newspaper.
rights
: The copyright information associated with the article.
language
: This information is set automatically if readerControl = list(language = NA) is passed (see the example below). Otherwise, the language specified manually is set for all articles. If omitted, the default, "en", is used.
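Once a corpus exists, these variables can be read per document with meta() from the tm package. The sketch below uses a toy corpus with made-up headings in place of a real Factiva import, so that it runs without an export file; the tag names are those listed above.

```r
library(tm)

# Toy corpus standing in for one built with FactivaSource();
# texts and headings are invented for illustration.
corpus <- VCorpus(VectorSource(c("First article body.",
                                 "Second article body.")))
meta(corpus[[1]], "heading") <- "Test 1"
meta(corpus[[2]], "heading") <- "Test 2"

# Read a single variable from one document.
meta(corpus[[1]], "heading")

# Collect the same variable across all documents.
headings <- unname(vapply(corpus,
                          function(doc) meta(doc, "heading"),
                          character(1)))
headings
```

Variables that can hold several values (such as subject or coverage) are character vectors, so lapply() is a safer choice than vapply() when collecting them.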
An object of class XMLSource which extends the class Source, representing a set of articles from Factiva.
It has been found that some Factiva articles contain unescaped characters that are not allowed in XML files. If such articles are included in the input you are trying to import, the XML parser will fail, printing a few error messages, and the corpus will not be created at all.
If you experience this bug, please report it to the Factiva Customer Service, which will fix the offending article; feel free to ask the maintainer of the present package for help if needed. In the meantime, you can exclude the problematic article from the XML file: to identify it, export only one half of the original corpus at a time, as many times as needed, and see which half fails; you will eventually find the culprit. (If you know XML, you can use an XML validator to find the relevant part of the file, and fix it by hand.)
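As a rough stand-in for an external validator, the file can be parsed directly and the parser's error message inspected. This sketch assumes the xml2 package is installed; check_xml is a hypothetical helper, and the two small files are invented to show a failing and a passing case.

```r
library(xml2)

# Hypothetical helper: returns NULL if the file parses,
# otherwise the parser's error message as a string.
check_xml <- function(path) {
  tryCatch({
    read_xml(path)
    NULL
  }, error = function(e) conditionMessage(e))
}

# An unescaped "&" is not allowed in XML text content.
bad <- tempfile(fileext = ".xml")
writeLines("<articles><article>R & D spending</article></articles>", bad)

# The same content with the character escaped.
good <- tempfile(fileext = ".xml")
writeLines("<articles><article>R &amp; D spending</article></articles>", good)

check_xml(bad)   # an error message describing the problem
check_xml(good)  # NULL
```

The message returned for a failing file usually points at the kind of problem encountered, which can help locate the article to exclude or fix.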
Milan Bouchet-Valat
readFactivaXML and readFactivaHTML for the functions actually parsing individual articles.
getSources to list available sources.
## Not run:
## For an XML file
library(tm)
file <- system.file("texts", "reut21578-factiva.xml",
package = "tm.plugin.factiva")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
# See meta-data associated with first article
meta(corpus[[1]])
## End(Not run)
## For an HTML file
library(tm)
file <- system.file("texts", "factiva_test.html",
package = "tm.plugin.factiva")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
# See meta-data associated with first article
meta(corpus[[1]])
Loading required package: NLP
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 19
[[1]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 572
[[2]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2686
[[3]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 376
[[4]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 440
[[5]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 598
[[6]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2824
[[7]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2801
[[8]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 972
[[9]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2159
[[10]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2118
[[11]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 2292
[[12]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 647
[[13]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 675
[[14]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 644
[[15]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 674
[[16]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 926
[[17]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 514
[[18]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 1845
[[19]]
<<PlainTextDocument>>
Metadata: 19
Content: chars: 417
author : character(0)
datetimestamp: 1987-02-26
description : character(0)
heading : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
id : REUT-870226-01
language : en
origin : Reuters-21578
Author : character(0)
edition : character(0)
section : character(0)
subject : character(0)
coverage : c("United States", "North America")
company : character(0)
industry : character(0)
infocode : character(0)
infodesc : character(0)
wordcount : character(0)
publisher : character(0)
rights : Copyright 1987 Reuters
Warning messages:
1: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
Could not parse document date "29 d<U+00E9>cembre 2011". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
2: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
Could not parse document date "29 d<U+00E9>cembre 2011". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 18
Content: chars: 134
[[2]]
<<PlainTextDocument>>
Metadata: 18
Content: chars: 43
author : character(0)
datetimestamp: NA
description : character(0)
heading : Test 1
id : TESTFR-111229-e
language : fr
origin : Test newspaper
edition : character(0)
section : character(0)
subject : c("National/Presidential Elections", "Domestic Politics", "Elections", "Political/General News", "Politics/International Relations")
coverage : c("France", "European Union Countries", "Europe", "Mediterranean", "Western Europe")
company : character(0)
industry : character(0)
infocode : c("INGE", "VOTE", "GEN", "PIL", "POL", "LANGFR", "FA", "FB", "RTF", "DNP")
infodesc : character(0)
wordcount : 295
publisher : Reuters Limited
rights : (c) Test company.