read.atekst.dir: Parse all atekst .txt files in a directory


View source: R/read.atekst.dir_function.R

Description

Parses all .txt files downloaded from atekst in a directory, optionally including subfolders. A regular expression (regex) selects which files to parse. The function returns a data frame with the headline, paper, date, time, mode (net/print), url, and text of each article. To speed up parsing, the function can run in parallel: set parallel to TRUE and cores to the number of cores to use. For large corpora, it is recommended to run the function once and save the resulting data frame as an .RData file. The data frame can then be loaded into R (using load()) in a fraction of the time it takes to parse the whole corpus.
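
The parse-once-then-load workflow described above could be wrapped in a small helper. The following is only a sketch: cached_parse() is a hypothetical helper that is not part of the package, and fake_parse() is a stand-in parser so that the code runs without any atekst files.

```r
# Hypothetical helper (not part of the package): parse once, then reuse the
# saved .RData file on later calls.
cached_parse <- function(cache_file, parse_fun) {
  if (file.exists(cache_file)) {
    env <- new.env()
    load(cache_file, envir = env)  # restores the object saved as `corpus`
    return(env$corpus)
  }
  corpus <- parse_fun()
  save(corpus, file = cache_file)
  corpus
}

# Stand-in parser so the sketch is runnable on its own. In practice this would
# be something like: function() read.atekst.dir("some/directory")
calls <- 0
fake_parse <- function() {
  calls <<- calls + 1  # count how often the (slow) parser actually runs
  data.frame(headline = c("A", "B"), text = c("first", "second"),
             stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".RData")
first  <- cached_parse(cache_file, fake_parse)  # parses and writes the cache
second <- cached_parse(cache_file, fake_parse)  # loads from the cache instead
```

The second call never invokes the parser, which is where the time saving for a large corpus comes from.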

Usage

read.atekst.dir(dir, recursive = TRUE,
  regex = "^Utvalgte_dokumenter.*.txt$", parallel = FALSE, cores = 1)

Arguments

dir

Directory containing atekst .txt files.

recursive

If TRUE, the function also parses files within subfolders.

regex

Regular expression (pattern) to use for selecting files to parse.

parallel

If TRUE, the function tries to parse the files in parallel (using the packages foreach, iterators, doParallel and parallel).

cores

The number of cores to use (only relevant if parallel is TRUE).
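
To illustrate what the default regex selects, the sketch below applies it to a few made-up file names with base R's grepl(). How the function applies the pattern internally is not stated here, so this only shows the behaviour of the pattern itself. The file names and the stricter alternative pattern are assumptions for illustration.

```r
# Default pattern from the Usage section
pattern <- "^Utvalgte_dokumenter.*.txt$"

# Made-up file names for illustration
files <- c(
  "Utvalgte_dokumenter.txt",
  "Utvalgte_dokumenter (3).txt",
  "utvalgte_dokumenter.txt",      # lower-case "u": the pattern is case-sensitive
  "Utvalgte_dokumenter.txt.bak",  # does not end in "txt"
  "notes.txt",                    # missing the required prefix
  "Utvalgte_dokumenter_txt"       # matches: the "." before "txt" is unescaped
)

matched <- grepl(pattern, files)
names(matched) <- files
print(matched)

# A stricter, hypothetical alternative that requires a literal ".txt" suffix
strict_pattern <- "^Utvalgte_dokumenter.*\\.txt$"
strict_matched <- grepl(strict_pattern, files)
```

Because the dot before txt in the default pattern matches any character, a stricter pattern such as the one above can be passed as regex if the directory contains similarly named files that are not .txt files.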

Examples

corpus <- read.atekst.dir("some/directory")
save(corpus, file = "atekst-corpus.RData")
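
A parallel run might look like the sketch below. It assumes that the package is installed under the name parseAtekst and that "some/directory" exists, so the call is guarded and skipped otherwise. Leaving one core free is a common convention, not a recommendation from the package.

```r
# Pick a core count: all detected cores minus one, but at least one.
# parallel::detectCores() can return NA on some platforms.
n_detected <- parallel::detectCores()
n_cores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)

# Assumption: the package is installed as "parseAtekst" and the directory exists.
if (requireNamespace("parseAtekst", quietly = TRUE) && dir.exists("some/directory")) {
  corpus <- parseAtekst::read.atekst.dir("some/directory",
                                         parallel = TRUE, cores = n_cores)
  save(corpus, file = "atekst-corpus.RData")
}
```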

mikaelpoul/parseAtekst documentation built on May 22, 2017, 7:41 a.m.