EuropresseSource: Europresse Source


Description

Construct a source for an input containing a set of articles exported from Europresse in the HTML format.

Usage

  EuropresseSource(x, encoding = "UTF-8")

Arguments

x

Either a character identifying the file or a connection.

encoding

A character giving the encoding of x. Files exported from Europresse often specify an incorrect encoding, in which case you will need to find out the correct one.

Details

This function imports the body of the articles and also sets several meta-data variables on individual documents, including heading, origin, section, datetimestamp and id.

Note that the encoding declared in Europresse HTML files often does not match the one actually used in the text. In that case, you will need to find out the correct encoding and pass it explicitly via the encoding argument.
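As a rough illustration of finding a workable encoding, the sketch below checks whether a file is valid UTF-8 and otherwise falls back to a Latin-1 guess. The helper pick_encoding() is hypothetical and not part of tm.plugin.europresse, and the "ISO-8859-1" fallback is only an assumption: the actual encoding of your export may differ.

```r
# Hypothetical helper (not part of tm.plugin.europresse): return "UTF-8" if
# every line of the file is valid UTF-8, otherwise fall back to a guess.
pick_encoding <- function(path, fallback = "ISO-8859-1") {
  lines <- readLines(path, warn = FALSE)
  # validUTF8() inspects the bytes regardless of the declared encoding
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

# Use the result when building the source, if the packages are installed
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tm.plugin.europresse", quietly = TRUE)) {
  library(tm)
  library(tm.plugin.europresse)
  file <- system.file("texts", "europresse_test2.html",
                      package = "tm.plugin.europresse")
  corpus <- Corpus(EuropresseSource(file, encoding = pick_encoding(file)))
}
```

If accented characters still look wrong after import, try another candidate (for instance "windows-1252") in place of the fallback.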

Value

An object of class EuropresseSource, which extends the class Source and represents a set of articles from Europresse.

Author(s)

Milan Bouchet-Valat

See Also

readEuropresseHTML2 for the function actually parsing individual articles.

getSources to list available sources.

Examples

    library(tm)
    file <- system.file("texts", "europresse_test2.html",
                        package = "tm.plugin.europresse")
    corpus <- Corpus(EuropresseSource(file))

    # See the contents of the documents
    inspect(corpus)

    # See meta-data associated with first article
    meta(corpus[[1]])

    

Example output

Loading required package: NLP
Warning messages:
1: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
  Could not parse document date from "Communication", "mardi 14 mars 2006", "p. 16". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
2: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
  Could not parse document date from "Politique", "mardi 14 mars 2006", "p. 12". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
3: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
  Could not parse document date from "Economie", "lundi 13 mars 2006", "p. MDE6". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
4: In readerControl$reader(elem, readerControl$language, as.character(counter)) :
  Could not parse document date from "Dernière heure", "lundi 13 mars 2006", "p. 32". You may need to change the system locale to match that of the corpus. See LC_TIME in ?Sys.setlocale.
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

[[1]]
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 18

[[2]]
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 10

[[3]]
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 0

[[4]]
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 10

  author       : character(0)
  datetimestamp: NA
  description  : character(0)
  heading      : Title
  id           : 20060315LM0q15031256056
  language     : en
  origin       : Newspaper
  section      : Communication
  pages        : character(0)
  rights       : © 2006 Owner. Tous droits réservés.
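
The warnings above, and the resulting datetimestamp of NA, arise because the French day and month names cannot be parsed under the current LC_TIME setting. A minimal sketch of switching the time locale before importing follows. The locale name "fr_FR.UTF-8" is an assumption that is typical on Linux and macOS (on Windows it is usually "French"), and the strptime() format string is only an illustration, not necessarily the one the package uses internally.

```r
# Remember the current time locale so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# Try to switch to a French time locale; the name is platform-dependent
# (assumed here), and Sys.setlocale() returns "" if it is unavailable
new_locale <- suppressWarnings(Sys.setlocale("LC_TIME", "fr_FR.UTF-8"))

if (nzchar(new_locale)) {
  # With a French LC_TIME, names such as "mardi" and "mars" can be parsed
  parsed <- strptime("mardi 14 mars 2006", "%A %d %B %Y")
  print(parsed)
  # Import the corpus here, while the French locale is active, e.g.:
  # corpus <- Corpus(EuropresseSource(file))
} else {
  message("French locale not available on this system")
}

# Restore the original locale
invisible(Sys.setlocale("LC_TIME", old_locale))
```

Restoring the original locale afterwards keeps date formatting in the rest of the session unchanged.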

tm.plugin.europresse documentation built on May 1, 2019, 8:18 p.m.