fslice: Faster 'dplyr::slice()'


fslice    R Documentation

Faster dplyr::slice()

Description

When there are many groups, the fslice() functions are much faster than their dplyr::slice() counterparts.

Usage

fslice(data, ..., .by = NULL, keep_order = FALSE, sort_groups = TRUE)

fslice_head(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_tail(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_min(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_max(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_sample(
  data,
  n,
  replace = FALSE,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE,
  weights = NULL,
  seed = NULL
)

Arguments

data

A data frame.

...

See ?dplyr::slice for details.

.by

(Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.

keep_order

Should the sliced data frame be returned in its original order? The default is FALSE.

sort_groups

If TRUE (the default), the by-group slices are performed in the order of the sorted groups. If FALSE, the group order is determined by first appearance in the data.

n

Number of rows.

prop

Proportion of rows.

order_by

Variables to order by.

with_ties

Should ties be kept together? The default is TRUE.

na_rm

Should missing values in fslice_max() and fslice_min() be removed? The default is FALSE.

replace

Should fslice_sample() sample with or without replacement? Default is FALSE, without replacement.

weights

Probability weights used in fslice_sample().

seed

Seed number defining the RNG state. If supplied, the seed is applied only locally within the function: the RNG state that was in place before the call is restored afterwards, so seed continuity is preserved. If left NULL (the default), the seed is never modified.
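
The local-seed behaviour described for seed can be sketched in base R as follows. This is only an illustration of the idea, not timeplyr's implementation; with_local_seed() is a hypothetical helper name introduced for this example.

```r
# Hypothetical helper (not part of timeplyr): evaluate `expr` under a
# temporary seed, then restore whatever RNG state existed before the call.
with_local_seed <- function(seed, expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      # Put the previous RNG state back
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      # There was no RNG state before the call, so remove the one we created
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  force(expr)  # `expr` is evaluated lazily, i.e. after set.seed()
}

set.seed(1)
expected <- runif(1)  # the draw that follows set.seed(1)

set.seed(1)
a <- with_local_seed(42, sample(10))
after <- runif(1)     # continues from set.seed(1), unaffected by seed 42
b <- with_local_seed(42, sample(10))
```

Here a and b are identical because both are drawn under the same temporary seed, while after equals expected because the outer RNG stream is left untouched.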

Details

fslice() and friends allow for more flexibility in how you order the by-group slicing.
Furthermore, you can control whether the returned data frame is sliced in the order of the supplied row indices, or whether the original order is retained (like dplyr::filter()).
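
The three orderings discussed here (sorted groups, groups by first appearance, and original row order) can be illustrated with a small base R sketch. It does not call timeplyr and is not how the package computes its result; the toy data frame and variable names are assumptions made for the example. It takes the last row of each group and arranges the resulting row numbers in each of the three ways.

```r
df <- data.frame(g = c("b", "a", "b", "c", "a"), id = 1:5)

# Row numbers of each group; split() orders the list by sorted group value
rows_by_group <- split(seq_len(nrow(df)), df$g)

# Last row number within each group
last_row <- vapply(rows_by_group, function(i) i[length(i)], integer(1))

# Groups in sorted order (a, b, c)
sorted_groups <- unname(last_row)

# Groups in order of first appearance in the data (b, a, c)
first_appearance <- unname(last_row[unique(df$g)])

# Rows in their original order, as a filter would return them
original_order <- sort(unname(last_row))

df[sorted_groups, ]
df[first_appearance, ]
df[original_order, ]
```

Going by the argument descriptions, the first result corresponds to what sort_groups = TRUE describes, the second to sort_groups = FALSE, and the third to keep_order = TRUE.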

In fslice(), when length(n) == 1, an optimised method is implemented that internally uses list_subset(), a fast function for extracting single elements from single-level lists that contain vectors of the same type, e.g. integer.
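
To make the idea of extracting a single element from each vector of a single-level list concrete, here is a minimal base R sketch. It uses vapply() as a stand-in and makes no claim about how list_subset() is implemented or what its signature is; the list contents are made up for the example.

```r
# Row numbers per group, stored as a single-level list of integer vectors
rows_by_group <- list(a = c(2L, 5L), b = c(1L, 3L), c = 4L)

# Extract the i-th element of every vector; groups that are too short give NA
i <- 2L
ith_row <- vapply(rows_by_group, function(x) x[i], integer(1))

# Keep only the groups that actually have an i-th row
kept <- unname(ith_row[!is.na(ith_row)])
```

Because every vector has the same type, the result can be a plain integer vector rather than a list, which is what makes a specialised single-element method attractive.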

fslice_head() and fslice_tail() are very fast with large numbers of groups.

fslice_sample() is arguably more intuitive because, by default, it resamples each entire group without replacement, with no need to specify a maximum group size as in dplyr::slice_sample().
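
Resampling each entire group without replacement amounts to shuffling the rows within every group. The following base R sketch shows that idea under stated assumptions (a toy data frame, groups processed in sorted order); it is not timeplyr's implementation.

```r
set.seed(123)
df <- data.frame(g = c("a", "a", "a", "b", "b"), id = 1:5)

# Row numbers of each group
rows_by_group <- split(seq_len(nrow(df)), df$g)

# Permute the row numbers within each group; indexing with sample.int()
# avoids the surprise of sample(i) when a group has a single row
shuffled <- unlist(
  lapply(rows_by_group, function(i) i[sample.int(length(i))]),
  use.names = FALSE
)

resampled <- df[shuffled, ]
```

Every row appears exactly once in resampled and group sizes are unchanged; only the within-group order is random.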

Value

A data.frame containing the specified rows.

Examples

library(timeplyr)
library(dplyr)
library(nycflights13)

flights <- flights %>%
  group_by(origin, dest)

# First row repeated for each group
flights %>%
  fslice(1, 1)
# First row per group
flights %>%
  fslice_head(n = 1)
# Last row per group
flights %>%
  fslice_tail(n = 1)
# Earliest flight per group
flights %>%
  fslice_min(time_hour, with_ties = FALSE)
# Latest flight per group
flights %>%
  fslice_max(time_hour, with_ties = FALSE)
# Random sample without replacement by group
# (or stratified random sampling)
flights %>%
  fslice_sample()


timeplyr documentation built on Sept. 12, 2024, 7:37 a.m.