readSZ: Read the SZ corpus

View source: R/readSZ.R

readSZ R Documentation

Read the SZ corpus

Description

Reads the XML files of the SZ corpus and separates the text and meta data.

Usage

readSZ(path = getwd(), file = list.files(path = path, pattern =
  "*.xml$", full.names = FALSE, recursive = TRUE, ignore.case = TRUE),
  do.meta = TRUE, do.text = TRUE)

Arguments

path

Path where the data files are.

file

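Character vector with the names of the XML files. By default, all files below path whose names match the pattern "*.xml$" (searched recursively, ignoring case).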

do.meta

Logical: Should the algorithm collect meta data?

do.text

Logical: Should the algorithm collect text data?
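
The default of the file argument can be tried out on its own. The following is a minimal sketch, assuming a throwaway directory with made-up file names (corpus_dir and the dummy files are hypothetical and not part of the package); it only uses base R and repeats the list.files call shown under Usage.

```r
# Build a throwaway directory with a few dummy files (all names are made up).
corpus_dir <- file.path(tempdir(), "sz_demo")
dir.create(file.path(corpus_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(corpus_dir, c("a.xml", "notes.txt", "sub/b.XML")))

# Same call as the default of the file argument in Usage:
# recursive search, case-insensitive match on the file extension.
files <- list.files(path = corpus_dir, pattern = "*.xml$",
                    full.names = FALSE, recursive = TRUE, ignore.case = TRUE)
print(files)
```

Because full.names = FALSE, the returned names are relative to path, which is presumably why path and file are passed separately.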

Value

meta

id, date, rubrik, page, AnzChar, AnzWoerter, dachzeile, title, zwischentitel, untertitel

text

Text (paragraph by paragraph)

Examples


##---- Should be DIRECTLY executable !! ----
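
A call might look like the sketch below. It is not taken from the package: sz_dir is a hypothetical placeholder path, and the code assumes that tmT is installed, that readSZ is exported, and that the result is a list with the components meta and text named under Value. The guard makes the block skip the call when the package or the directory is missing.

```r
# Hypothetical location of the SZ XML files; replace with a real directory.
sz_dir <- "path/to/SZ"
sz <- NULL

if (requireNamespace("tmT", quietly = TRUE) && dir.exists(sz_dir)) {
  # Collect both meta data and text (the documented defaults, spelled out).
  sz <- tmT::readSZ(path = sz_dir, do.meta = TRUE, do.text = TRUE)
  # Assumption: the result is a list with the components named under Value.
  str(sz$meta)
  length(sz$text)
} else {
  message("tmT or the corpus directory is not available; nothing was read.")
}
```

Setting do.text = FALSE should restrict the work to the meta data, which may be useful for a first look at a large corpus.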

Docma-TU/tmT documentation built on May 5, 2022, 12:45 a.m.