readRCV1: Read In a Reuters Corpus Volume 1 Document


Description

Read in a Reuters Corpus Volume 1 XML document.

Usage

readRCV1(elem, language, id)
readRCV1asPlain(elem, language, id)

Arguments

elem

a named list with a component named content, which must hold the document to be read in.

language

a string giving the language.

id

Not used.

Value

An XMLTextDocument for readRCV1, or a PlainTextDocument for readRCV1asPlain, representing the text and metadata extracted from elem$content.
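The difference between the two return types can be checked directly. The following is a minimal sketch, assuming tm is installed and using the sample file rcv1_2330.xml that ships with the package; the object names doc_xml and doc_plain are illustrative only.

```r
library(tm)

# Read the bundled sample document as raw bytes
f <- system.file("texts", "rcv1_2330.xml", package = "tm")
f_bin <- readBin(f, raw(), file.size(f))
elem <- list(content = f_bin)

# Same input, two readers: one keeps the XML tree, one extracts plain text
doc_xml   <- readRCV1(elem = elem, language = "en", id = "id1")
doc_plain <- readRCV1asPlain(elem = elem, language = "en", id = "id1")

# Inspect the class of each result
class(doc_xml)
class(doc_plain)

# Both carry document-level metadata
meta(doc_plain, "heading")
```

readRCV1asPlain is likely the more convenient choice when the document feeds into text transformations, whereas readRCV1 keeps the full XML structure available.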

References

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

See Also

Reader for basic information on the reader infrastructure employed by package tm.
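In practice a reader is usually passed to a corpus constructor rather than called by hand. The sketch below assumes tm is installed and builds a one-document corpus from the bundled sample file. The file pattern, the choice of mode = "binary", and the name rcv1_corpus are assumptions made for illustration, not requirements stated on this page.

```r
library(tm)

# Directory holding the sample texts shipped with tm
texts_dir <- system.file("texts", package = "tm")

# Select only the RCV1 sample file; read it in binary mode so the
# reader receives the raw XML bytes
src <- DirSource(texts_dir, pattern = "^rcv1_.*\\.xml$", mode = "binary")

# Let the corpus constructor call the reader once per source element
rcv1_corpus <- VCorpus(src, readerControl = list(reader = readRCV1asPlain,
                                                 language = "en"))

length(rcv1_corpus)
meta(rcv1_corpus[[1]])
```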

Examples

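library(tm)

# Locate the sample RCV1 file shipped with tm and read it as raw bytes
f <- system.file("texts", "rcv1_2330.xml", package = "tm")
f_bin <- readBin(f, raw(), file.size(f))
# Parse the document; the id argument is not used by this reader
rcv1 <- readRCV1(elem = list(content = f_bin), language = "en", id = "id1")
content(rcv1)  # parsed XML tree
meta(rcv1)     # metadata extracted from the XML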

Example output

Loading required package: NLP
{xml_document}
<newsitem itemid="2330" id="root" date="1996-08-20" lang="en">
[1] <title>USA: Tylan stock jumps; weighs sale of company.</title>
[2] <headline>Tylan stock jumps; weighs sale of company.</headline>
[3] <dateline>SAN DIEGO</dateline>
[4] <text>\n  <p>The stock of Tylan General Inc. jumped Tuesday after the mak ...
[5] <copyright>(c) Reuters Limited 1996</copyright>
[6] <metadata>\n  <codes class="bip:countries:1.0">\n    <code code="USA"> </ ...
  author       : 
  datetimestamp: 1996-08-20
  description  : 
  heading      : USA: Tylan stock jumps; weighs sale of company.
  id           : 2330
  language     : en
  origin       : Reuters Corpus Volume 1
  publisher    : Reuters Holdings Plc
  topics       : c("C15", "C152", "C18", "C181", "CCAT")
  industries   : I34420
  countries    : USA

tm documentation built on April 7, 2021, 3:01 a.m.