read.segments: Read and Segment Multiple Texts
In lingmatch: Linguistic Matching and Accommodation

Description

Split texts by word count or specific characters. Input texts directly, or read them in from files.

Usage

read.segments(path = ".", segment = NULL, ext = ".txt", subdir = FALSE,
  segment.size = -1, bysentence = FALSE, end_in_quotes = TRUE,
  preclean = FALSE, text = NULL)

Arguments

`path`	Path to a folder containing files, or a vector of paths to files. If no folders or files are recognized in `path`, it is treated as `text`.
`segment`	Specifies how the text of each file should be segmented. If a character, texts are split at that character. If a number, texts are broken into that many segments, each with a roughly equal number of words. If left as `NULL` (the default), texts are split at line breaks ('\n').
`ext`	The extension of the files you want to read in. '.txt' by default.
`subdir`	Logical; if `TRUE`, files in folders in `path` will also be included.
`segment.size`	Number; if specified (the default, `-1`, leaves it unset), `segment` will be ignored, and texts will be broken into segments containing roughly `segment.size` words.
`bysentence`	Logical; if `TRUE`, and `segment` is a number or `segment.size` is specified, sentences will be kept together, rather than potentially being broken across segments.
`end_in_quotes`	Logical; if `FALSE`, sentence-ending marks (`.?!`) will not be treated as ending a sentence when immediately followed by a quotation mark. For example, with `end_in_quotes = FALSE`, `'"Word." Word.'` would be considered one sentence.
`preclean`	Logical; if `TRUE`, text will be cleaned with `lma_dict(special)` before segmentation.
`text`	A character vector with text to be split, used in place of `path`. Each entry is treated as a file.
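
The two numeric modes (`segment` as a number versus `segment.size`) are easy to confuse, so here is a rough base R analogue of each. This is only a sketch of the idea, not the package's implementation: `split_words`, `into_n_segments`, and `into_sized_segments` are hypothetical helpers, and they split on whitespace only, which may differ from how `read.segments` counts words.

```r
# Hypothetical helpers (not part of lingmatch); words are split on whitespace.
split_words <- function(text) strsplit(trimws(text), "\\s+")[[1]]

# Analogue of a numeric `segment`: n segments of roughly equal word count.
into_n_segments <- function(text, n) {
  words <- split_words(text)
  # assign each word to one of n contiguous groups
  group <- ceiling(seq_along(words) * n / length(words))
  unname(vapply(split(words, group), paste, "", collapse = " "))
}

# Analogue of `segment.size`: segments of up to `size` words each.
into_sized_segments <- function(text, size) {
  words <- split_words(text)
  # start a new group every `size` words
  group <- ceiling(seq_along(words) / size)
  unname(vapply(split(words, group), paste, "", collapse = " "))
}

text <- "split this text into two segments"
by_count <- into_n_segments(text, 2)
by_size <- into_sized_segments(text, 4)
```

In the first mode the number of segments is fixed and their length varies with the text; in the second the segment length is fixed and the number of segments varies.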

Value

A `data.frame` with columns for file names (`input`), segment number within file (`segment`), word count for each segment (`WC`), and the text of each segment (`text`).
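
Because the result is an ordinary data.frame, standard base R tools can summarize or reassemble it. The sketch below uses a hand-built data.frame with the documented columns as a stand-in; the file names and values are made up for illustration and are not real `read.segments` output.

```r
# Hand-built stand-in with the documented columns (illustrative values only).
segs <- data.frame(
  input = c("a.txt", "a.txt", "b.txt"),
  segment = c(1, 2, 1),
  WC = c(3, 2, 4),
  text = c("one two three", "four five", "six seven eight nine"),
  stringsAsFactors = FALSE
)

# total word count per input
totals <- tapply(segs$WC, segs$input, sum)

# reassemble each input's text in segment order
ordered <- segs[order(segs$input, segs$segment), ]
full <- tapply(ordered$text, ordered$input, paste, collapse = " ")
```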

Examples

# split preloaded text
read.segments("split this text into two segments", 2)

## Not run: 

# read in all files from the package directory
texts <- read.segments(path.package("lingmatch"), ext = "")
texts[, -4]

# segment .txt files in dir in a few ways:
dir <- "path/to/files"

## into one segment per line
texts_lines <- read.segments(dir)

## into 5 even segments each
texts_5segs <- read.segments(dir, 5)

## into segments of roughly 50 words
texts_50words <- read.segments(dir, segment.size = 50)

## into one segment per sentence
texts_1sent <- read.segments(dir, segment.size = 1, bysentence = TRUE)

## End(Not run)

lingmatch documentation built on May 29, 2024, 11:48 a.m.