```r
knitr::opts_chunk$set(
  echo = TRUE,
  eval = FALSE,
  message = TRUE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```
Sample size planning for sequential tests differs fundamentally from fixed-design studies. In sequential ANOVA, the final sample size is determined by the evidence in the data itself and consequently remains unknown beforehand: data collection continues until either the upper or the lower decision boundary is reached.
The challenge: While this data-driven stopping rule is very efficient, it creates practical difficulties. Resource planning requires knowing whether you might need 100 observations or 1,000. Budget constraints, time limitations, and logistical considerations all demand some advance estimate of required resources.
The solution: Although the exact final sample size cannot be known in advance, simulation-based planning bridges the gap between statistical theory and practical constraints.
The sprtt package provides the plan_sample_size() function, which generates HTML reports summarizing simulation results for sequential ANOVAs.
Researchers can obtain guidance on:
The decision boundaries of the sequential ANOVA control Type I ($\alpha$) and Type II ($\beta$) errors in the long run. However, introducing a maximum sample size $N_{\text{max}}$ for practical resource planning creates an important complication: it reduces the achievable power below the nominal $1-\beta$.
When $N_{\text{max}}$ is reached before a decision boundary is crossed, this results in a non-decision. The non-decision rate depends directly on the chosen maximum sample size. This introduces a new metric: the decision rate (the probability of reaching a decision) given resource limitations.
While non-decisions are undesirable, they represent a crucial conceptual distinction from accepting the null hypothesis. SPRTs like the sequential ANOVA differentiate between stopping data collection to accept the null hypothesis and stopping because more evidence would be required to make a decision but resources are exhausted. Importantly, as long as no decision has been reached, data collection can continue if additional resources become available.
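To make the decision-rate idea concrete, here is a small sketch in base R, independent of sprtt. It pretends we already know the sample sizes at which a set of simulated sequential tests would cross a boundary (the geometric distribution used here is purely illustrative, not the package's actual simulation model) and shows how imposing a maximum sample size turns some would-be decisions into non-decisions:

```r
set.seed(1)

# Illustrative only: pretend these are the sample sizes at which
# 10,000 simulated sequential tests crossed a decision boundary.
n_at_decision <- rgeom(10000, prob = 0.02) + 10

n_max <- 150  # candidate resource limit (N_max)

# Runs still undecided when n_max is reached end as non-decisions.
decision_rate     <- mean(n_at_decision <= n_max)
non_decision_rate <- 1 - decision_rate

decision_rate
```

Raising `n_max` increases the decision rate (fewer runs are cut off before reaching a boundary), which is exactly the trade-off the planning reports quantify.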
## The `plan_sample_size()` Function

The `plan_sample_size()` function generates interactive HTML reports for sample size planning based on a large simulation database.
Reports include recommended maximum sample sizes, expected sample sizes, power curves, and comparisons to traditional ANOVA designs.
To make sample size planning fast and accessible, sprtt includes access to extensive simulation results.
These simulations were conducted by:
This simulation database is stored externally to keep the package installation size small.
The data are downloaded automatically on first use of plan_sample_size() and cached locally for future sessions.
Source code of the simulation database: [MeikeSteinhilber/sprtt_plan_sample_size](https://github.com/MeikeSteinhilber/sprtt_plan_sample_size)
Let's walk through a practical example. Imagine you're planning a study to compare three groups. You want to detect medium-sized effects (Cohen's $f = 0.25$) or larger with specific error control.
You set $\alpha = 0.05$ to control Type I errors at the standard 5% level, ensuring that rejections of the null hypothesis are trustworthy in the long run. To minimize Type II errors, you also set $\beta = 0.05$, limiting false acceptances of $H_0$ to 5%. However, given limited resources, you're willing to accept a 15% non-decision rate, meaning you'll reach a decision 85% of the time.
Critically, this setup reflects a deliberate trade-off: by keeping both error rates as low as 5%, you accept that a decision will not always be reached, but when it is, it can be trusted. Non-decisions, by contrast, indicate that the available evidence was insufficient given your error constraints, and more data are required.
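As a quick back-of-the-envelope check of this trade-off, consider what the chosen 85% decision rate implies for 1,000 hypothetical studies run under these settings:

```r
n_studies     <- 1000
decision_rate <- 0.85  # chosen via the non-decision trade-off

decisions     <- n_studies * decision_rate  # studies reaching a boundary
non_decisions <- n_studies - decisions      # studies undecided at N_max

decisions      # 850
non_decisions  # 150
```

The 150 undecided studies are not failures in the fixed-design sense: as noted above, they can resume data collection if resources become available.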
Now let's see how to generate a sample size planning report for this scenario:
```r
plan_sample_size(
  f_expected    = 0.25, # expected effect size
  k_groups      = 3,    # number of groups
  beta          = 0.05, # beta error rate
  decision_rate = 0.85  # desired percentage of decisions
)
```
When you run this code for the first time, several things happen:
The initial download typically takes a couple of seconds; after that, generating a report takes only a few seconds more.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| f_expected | numeric | required | Expected standardized effect size (Cohen's f). Must be one of: 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, or 0.40. |
| k_groups | integer | required | Number of groups to compare. Must be 2, 3, or 4. |
| beta | numeric | 0.05 | Beta error rate. Must be 0.20, 0.10, or 0.05. |
| output_dir | character | tempdir() | Directory where the HTML report will be saved. |
| output_file | character | "sprtt-report-sample-size-planning.html" | Filename for the generated report. |
| open | logical | interactive() | Whether to open the report in your browser after generation. Set to FALSE for batch processing. |
| overwrite | logical | FALSE | Whether to overwrite an existing file with the same name without prompting. |
The function validates all inputs before generating the report. If you specify a parameter value that doesn't exist in the simulation database, you'll receive an informative error message listing the available options. For example:
```r
# This will produce an error:
plan_sample_size(f_expected = 0.22, k_groups = 3)
#> Error: `f_expected` = 0.22 is not available.
#> Please choose one of 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, or 0.4
```
The expected effect size has a large impact on required sample size. Here's how to generate reports for different scenarios:
```r
# report 1
plan_sample_size(f_expected = 0.15, k_groups = 3, beta = 0.05)

# report 2
plan_sample_size(f_expected = 0.35, k_groups = 3, beta = 0.05)
```
By default, reports are saved to a temporary directory. For reports you want to keep, specify a custom location:
```r
plan_sample_size(
  f_expected  = 0.25,
  k_groups    = 4,
  output_dir  = "~/Documents/research/sample_size_planning",
  output_file = "study1_anova.html",
  open        = TRUE
)
```
This is particularly useful when preparing documentation for grant applications, pre-registrations, or manuscript supplementary materials.
When preparing grant applications or pre-registrations, you might want to explore multiple scenarios (e.g., different effect size assumptions):
```r
# Define scenarios to compare
scenarios <- data.frame(
  effect = c(0.15, 0.20, 0.25),
  label  = c("conservative", "expected", "optimistic")
)

# Generate a report for each scenario
for (i in seq_len(nrow(scenarios))) {
  plan_sample_size(
    f_expected  = scenarios$effect[i],
    k_groups    = 3,
    beta        = 0.10,
    output_dir  = "sample_size_reports",
    output_file = sprintf("plan_sample_size_%s.html", scenarios$label[i]),
    open        = FALSE, # don't open each one
    overwrite   = TRUE
  )
}

message("Generated ", nrow(scenarios), " sample size reports")
```
This approach creates a set of reports that document your planning across different assumptions.
## Downloading Data Explicitly
While plan_sample_size() downloads data automatically when needed, you can also download it explicitly:
```r
# Download simulation data manually
download_sample_size_data()
```
This is useful if you want to:
To force a re-download (for example, after a package update with new simulation data):
```r
download_sample_size_data(force = TRUE)
```
## Checking Cache Status
To see whether data are cached and how much disk space they occupy:
```r
cache_info()
```
This displays:
## Clearing the Cache
If you need to free up disk space or suspect corrupted data, you can clear the cache:
```r
cache_clear()
```
The data will be re-downloaded automatically the next time you run `plan_sample_size()`.
## Working with Simulation Data Directly
Advanced users may want to access the raw simulation data for custom analyses or visualizations. You can load the data directly into your R session:
```r
# Load the complete simulation dataset
# (downloads automatically if not yet cached)
loaded <- load_sample_size_data()

# Access the simulation data frame
df_all <- loaded$data

# Check which dataset version this report is based on
loaded$description # short description
loaded$version     # e.g. "v0.1.0-data"
loaded$created     # date the dataset was created
loaded$n_rep       # number of simulation iterations per condition
```
The data frame `df_all` contains all simulation results and can be filtered, summarized, or visualized using standard R tools. See `?load_sample_size_data` for a full description of all available columns.
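As a sketch of such a custom analysis, the snippet below mocks a tiny stand-in for `df_all` so it runs on its own; the column names (`f`, `k_groups`, `beta`, `n_final`) are assumptions for illustration only, so check `?load_sample_size_data` for the actual schema before adapting it:

```r
# Mocked stand-in for df_all; real column names may differ
# (see ?load_sample_size_data for the actual schema).
df_all <- data.frame(
  f        = rep(c(0.25, 0.40), each = 3),
  k_groups = 3,
  beta     = 0.05,
  n_final  = c(120, 95, 140, 60, 55, 70)
)

# Filter to one design condition...
df_sub <- subset(df_all, f == 0.25 & k_groups == 3 & beta == 0.05)

# ...and summarize the simulated final sample sizes.
mean(df_sub$n_final)
```

The same `subset()`/`aggregate()` pattern (or a dplyr pipeline, if you prefer) applies unchanged to the real, much larger dataset.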