top_perc: Select Top Percentage of Data and Statistical Summarization
In mintyr: Streamlined Data Processing Tools for Genomic Selection

top_perc

R Documentation

Description

The top_perc function selects the top percentage of data based on a specified trait and computes summary statistics. It allows for grouping by additional columns and offers flexibility in the type of statistics calculated. The function can also retain the selected data if needed.
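The idea of "select the top fraction, then summarise" can be sketched in base R. This is an illustration only, not mintyr's implementation: the rounding of the row count and the handling of tied values are assumptions made for the sketch.

```r
# Base-R sketch of "select the top fraction of rows, then summarise".
# Not mintyr's implementation: rounding and tie handling are assumptions.
perc  <- 0.1
trait <- "Petal.Width"

# Number of rows to keep (nearest whole row, at least one).
k <- max(1L, round(nrow(iris) * perc))

# Rank rows by the trait, largest first, and keep the first k.
ranked   <- iris[order(iris[[trait]], decreasing = TRUE), ]
selected <- head(ranked, k)

# A summary comparable in spirit to type = "mean_sd".
stat <- data.frame(
  variable = trait,
  n        = nrow(selected),
  mean     = mean(selected[[trait]]),
  sd       = sd(selected[[trait]])
)
print(stat)
```

With ties at the cut-off, a rank-based rule like this one keeps an arbitrary subset of the tied rows; the package may treat ties differently.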

Usage

top_perc(data, perc, trait, by = NULL, type = "mean_sd", keep_data = FALSE)

Arguments

`data`	A `data.frame` containing the source dataset for analysis. Supports various data frame-like structures; non-data-frame inputs are converted automatically.
`perc`	Numeric vector of percentages for data selection. Range: `-1` to `1`. Positive values select the top percentiles; negative values select the bottom percentiles. Multiple percentiles are supported.
`trait`	Character string specifying the selection column. Must be a valid column name in the input `data`. Used as the basis for top/bottom percentage selection.
`by`	Optional character vector of grouping columns. Default: `NULL`. Enables stratified analysis, so that the percentage selection is made within each group.
`type`	Statistical summary type. Default: `"mean_sd"`. Controls which summary statistics are computed. Supports the summary methods provided by `rstatix`.
`keep_data`	Logical flag for data retention. Default: `FALSE`. `TRUE`: return both summary statistics and the selected data. `FALSE`: return only summary statistics.

Value

A list or data frame:

If keep_data is FALSE, a data frame with summary statistics.
If keep_data is TRUE, a list where each element is a list containing summary statistics (stat) and the selected top data (data).
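Because the exact nesting and element names of the returned object are only summarised above, one cautious approach is to inspect the result with `str()` rather than assume its layout. The sketch below runs only if mintyr is installed; the variable names `has_mintyr`, `res_stats`, and `res_full` are chosen here for illustration.

```r
# Inspect the two return shapes instead of assuming their layout.
# The calls mirror the documented usage; they run only if mintyr is available.
has_mintyr <- requireNamespace("mintyr", quietly = TRUE)

if (has_mintyr) {
  # keep_data = FALSE (default): summary statistics only.
  res_stats <- mintyr::top_perc(iris, perc = 0.1, trait = "Petal.Width")
  str(res_stats, max.level = 1)

  # keep_data = TRUE: summary statistics plus the selected rows.
  res_full <- mintyr::top_perc(iris, perc = 0.1, trait = "Petal.Width",
                               keep_data = TRUE)
  str(res_full, max.level = 2)
}
```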

Note

The perc parameter accepts values between -1 and 1. Positive values select the top percentage, while negative values select the bottom percentage.
The function performs initial checks to ensure required arguments are provided and valid.
Grouping by additional columns (by) is optional and allows for more granular analysis.
The type parameter specifies the type of summary statistics to compute, with "mean_sd" as the default.
If keep_data is set to TRUE, the function will return both the summary statistics and the selected top data for each percentage.
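The sign convention for perc, the use of several percentages, and selection within groups can be illustrated with a small base-R sketch. The helper `select_fraction` is hypothetical and is not part of mintyr; its rounding and tie handling are assumptions.

```r
# Hypothetical helper, not part of mintyr: a positive perc keeps the largest
# values of `trait`, a negative perc keeps the smallest.
select_fraction <- function(data, perc, trait) {
  stopifnot(is.numeric(perc), length(perc) == 1, abs(perc) <= 1, perc != 0)
  # Nearest whole number of rows, at least one (an assumption of this sketch).
  k   <- max(1L, round(nrow(data) * abs(perc)))
  idx <- order(data[[trait]], decreasing = perc > 0)
  data[head(idx, k), , drop = FALSE]
}

# Several percentages at once: one selection per value of perc.
percs   <- c(0.1, -0.2)
by_perc <- lapply(percs, function(p) select_fraction(iris, p, "Petal.Width"))
names(by_perc) <- paste0("perc_", percs)

# Stratified selection: apply the same rule within each Species.
by_group <- lapply(split(iris, iris$Species),
                   select_fraction, perc = 0.1, trait = "Petal.Width")

sapply(by_perc, nrow)   # rows kept for each percentage
sapply(by_group, nrow)  # rows kept within each group
```

Selecting within groups keeps the same fraction of every group, whereas an ungrouped selection can draw all of its rows from a single group.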

Examples

# Example 1: Basic usage with single trait
# This example selects the top 10% of observations based on Petal.Width
# keep_data=TRUE returns both summary statistics and the filtered data
top_perc(iris, 
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         keep_data = TRUE)          # Return both stats and filtered data

# Example 2: Using grouping with 'by' parameter
# This example performs the same analysis but separately for each Species
# keep_data defaults to FALSE, so only the per-group summary statistics are returned
top_perc(iris, 
         perc = 0.1,                # Select top 10%
         trait = c("Petal.Width"),  # Column to analyze
         by = "Species")            # Group by Species

# Example 3: Complex example with multiple percentages and grouping variables
# Reshape data from wide to long format for Sepal.Length and Sepal.Width
iris |> 
  tidyr::pivot_longer(1:2,
                      names_to = "names", 
                      values_to = "values") |> 
  mintyr::top_perc(
    perc = c(0.1, -0.2),          # Top 10% and bottom 20%
    trait = "values",
    by = c("Species", "names"),
    type = "mean_sd")

mintyr documentation built on April 4, 2025, 2:56 a.m.