slice.corpus: Subset documents using their positions

View source: R/slice.R

Description

slice() lets you index documents by their (integer) locations. It allows you to select, remove, and duplicate documents. It is accompanied by a number of helpers for common use cases:

  • slice_head() and slice_tail() select the first or last documents.

  • slice_sample() randomly selects documents.

  • slice_min() and slice_max() select the documents with the lowest or highest values of a document variable, respectively.
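The three uses of slice() named above (select, remove, duplicate) can be sketched as follows. This is a minimal illustration using data_corpus_inaugural from quanteda; dropping documents with negative positions is assumed to follow the dplyr convention, and the object names kept, dropped, and dup are only illustrative.

```r
library(quanteda)
library(quanteda.tidy)

# Select: keep the documents at positions 2 to 5
kept <- slice(data_corpus_inaugural, 2:5)

# Remove: drop the first document
# (negative positions are assumed to behave as in dplyr::slice())
dropped <- slice(data_corpus_inaugural, -1)

# Duplicate: repeat the first document, then take the third
dup <- slice(data_corpus_inaugural, c(1, 1, 3))

# Compare document counts with the original corpus
c(original = ndoc(data_corpus_inaugural),
  kept     = ndoc(kept),
  dropped  = ndoc(dropped),
  dup      = ndoc(dup))
```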

Usage

## S3 method for class 'corpus'
slice(.data, ..., .preserve = FALSE)

## S3 method for class 'corpus'
slice_head(.data, ..., n, prop)

## S3 method for class 'corpus'
slice_tail(.data, ..., n, prop)

## S3 method for class 'corpus'
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)

## S3 method for class 'corpus'
slice_min(.data, ..., n, prop, with_ties = TRUE)

## S3 method for class 'corpus'
slice_max(.data, ..., n, prop, with_ties = TRUE)

Arguments

.data

A corpus object. (The dplyr generics that these methods extend also accept a data frame, a data frame extension such as a tibble, or a lazy data frame, e.g. from dbplyr or dtplyr.)

...

For slice(): <data-masking> Integer document positions. Provide either positive values to keep or negative values to drop; the values must be either all positive or all negative. For slice_min() and slice_max(), the document variable to order by, as in the examples below.

.preserve

Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is.

n, prop

Provide either n, the number of documents, or prop, the proportion of documents to select. If neither is supplied, n = 1 will be used.

If n is greater than the number of rows in the group (or prop > 1), the result will be silently truncated to the group size. If the proportion of a group size is not an integer, it is rounded down.
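The rounding and truncation rules above can be checked directly. The following is a sketch that compares the requested and returned document counts; n_docs, tail_5pct, and oversized are illustrative names, and no specific corpus size is assumed.

```r
library(quanteda)
library(quanteda.tidy)

n_docs <- ndoc(data_corpus_inaugural)

# prop: per the rule above, a non-integer count is rounded down
tail_5pct <- slice_tail(data_corpus_inaugural, prop = .05)
c(requested = .05 * n_docs, returned = ndoc(tail_5pct))

# n larger than the corpus: per the rule above, the result is
# silently truncated to the number of documents available
oversized <- slice_head(data_corpus_inaugural, n = n_docs + 10)
c(requested = n_docs + 10, returned = ndoc(oversized))
```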

weight_by

<data-masking> Sampling weights. This must evaluate to a vector of non-negative numbers the same length as the input. Weights are automatically standardised to sum to 1.

replace

Should sampling be performed with (TRUE) or without (FALSE, the default) replacement.

with_ties

Should ties be kept together? The default, TRUE, may return more rows than you request. Use FALSE to ignore ties, and return the first n rows.
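To see the effect of with_ties, it helps to use a small corpus in which the lowest value is shared. The sketch below builds a hypothetical four-document corpus with an invented document variable, score; it assumes the ordering variable can be passed positionally, as in the examples for this page.

```r
library(quanteda)
library(quanteda.tidy)

# Hypothetical toy corpus: the two lowest scores are tied
corp <- corpus(
    c("one", "two", "three", "four"),
    docvars = data.frame(score = c(1, 1, 2, 3))
)

# Default (with_ties = TRUE): tied documents are kept together,
# so more than n documents may be returned
ties_kept <- slice_min(corp, score, n = 1)

# with_ties = FALSE: ties are ignored and only n documents are returned
ties_ignored <- slice_min(corp, score, n = 1, with_ties = FALSE)

c(ties_kept = ndoc(ties_kept), ties_ignored = ndoc(ties_ignored))
```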

Value

An object of the same type as .data. The output has the following properties:

  • Each document may appear 0, 1, or many times in the output. (If duplicated, then document names will be modified to remain unique.)

  • Document variables are not modified.
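Both properties can be inspected on a corpus that contains a repeated document. This is a minimal sketch; the exact form of the modified document names is not specified here, so the code only checks that they are unique.

```r
library(quanteda)
library(quanteda.tidy)

# Repeat the first document, then take the second
dup <- slice(data_corpus_inaugural, c(1, 1, 2))

# Document names: inspect how the repeated document was renamed
docnames(dup)

# 0 means no document name occurs more than once
anyDuplicated(docnames(dup))

# Document variables travel with each copy of a document
docvars(dup, "Year")
```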

Examples

slice(data_corpus_inaugural, 2:5)
slice(data_corpus_inaugural, 55:n())
slice_head(data_corpus_inaugural, n = 2)
slice_tail(data_corpus_inaugural, n = 3)
slice_tail(data_corpus_inaugural, prop = .05)

set.seed(42)
slice_sample(data_corpus_inaugural, n = 3)
slice_sample(data_corpus_inaugural, prop = .10, replace = TRUE)

data_corpus_inaugural <- data_corpus_inaugural %>%
    mutate(ntoks = ntoken(data_corpus_inaugural))
# shortest three texts
slice_min(data_corpus_inaugural, ntoks, n = 3)
# longest three texts
slice_max(data_corpus_inaugural, ntoks, n = 3)

quanteda/quanteda.tidy documentation built on April 5, 2025, 2:50 p.m.