get_nexis_html: extract texts and meta data from Nexis HTML files
In readtext: Import and Handling for Plain and Formatted Text Files

get_nexis_html

R Documentation

extract texts and meta data from Nexis HTML files

Description

Extracts headings, body texts and metadata (date, byline, length, section, edition) from items in HTML files downloaded by the scraper.

Usage

get_nexis_html(path, paragraph_separator = "\n\n", verbosity, ...)

Arguments

`path`	either the path to an HTML file or a directory that contains HTML files
`paragraph_separator`	a character string used to separate paragraphs in body texts
`verbosity`	0: output errors only; 1: output errors and warnings (default); 2: output a brief summary message; 3: output detailed file-related messages
`...`	only to trap extra arguments
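
As a rough illustration of what `paragraph_separator` controls, the sketch below joins a few paragraphs into a single body text using base R only. It assumes that the function collapses the paragraphs of each item with this string, which is not confirmed by this page; `join_paragraphs` and the sample paragraphs are hypothetical and not part of readtext.

```r
# Hypothetical paragraphs, standing in for the paragraphs of one item.
paragraphs <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Hypothetical helper (not part of readtext): collapse paragraphs into
# one body text, placing the separator between consecutive paragraphs.
join_paragraphs <- function(paragraphs, paragraph_separator = "\n\n") {
  paste(paragraphs, collapse = paragraph_separator)
}

# Default separator: paragraphs are divided by a blank line.
body_default <- join_paragraphs(paragraphs)

# Alternative separator: paragraphs are divided by a single newline.
body_single <- join_paragraphs(paragraphs, paragraph_separator = "\n")

# Splitting on the separator recovers the original paragraphs.
recovered <- strsplit(body_default, "\n\n", fixed = TRUE)[[1]]
```

Under that assumption, choosing a separator that does not occur inside the paragraphs themselves makes it possible to split the body text back into paragraphs later.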

Examples

## Not run: 
irt <- readtext:::get_nexis_html('tests/data/nexis/irish-times_1995-06-12_0001.html')
afp <- readtext:::get_nexis_html('tests/data/nexis/afp_2013-03-12_0501.html')
gur <- readtext:::get_nexis_html('tests/data/nexis/guardian_1986-01-01_0001.html')
sun <- readtext:::get_nexis_html('tests/data/nexis/sun_2000-11-01_0001.html')
spg <- readtext:::get_nexis_html('tests/data/nexis/spiegel_2012-02-01_0001.html', 
                                  language_date = 'german')

all <- readtext('tests/data/nexis', source = 'nexis')

## End(Not run)

readtext documentation built on May 29, 2024, 5:13 a.m.