ps_streaks_get_max_rank_by_sampling: ps_streaks_get_max_rank_by_sampling
In tor-gu/streakexplorer: Baseball Streak Explorer

Estimates the rank returned by ps_streaks_get_max_rank_simple, using the following method:

First, apply the algorithm of ps_streaks_get_max_rank_simple to a limited set of intensity levels (e.g. c(25,50,75) instead of 1:101).
Second, take the returned rank and increase it by a scaling factor (e.g. 1.5).
Third, restrict the full streaks table to rows whose Rank is below the scaled initial estimate.
Finally, apply ps_streaks_get_max_rank_simple to the restricted streaks table, this time across all intensity levels.
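
The four steps above can be sketched in plain R on an in-memory data frame. This is only an illustration under stated assumptions, not the package's implementation: max_rank_simple is a hypothetical stand-in for ps_streaks_get_max_rank_simple (assumed here to take the n-th smallest Rank at each intensity level and return the largest of those), the Level and Rank column names are assumed, the Rank cutoff is assumed to be inclusive, and the year and team filters are omitted.

```r
# Toy streaks table: 10 intensity levels, 20 streaks per level.
# The column names Level and Rank are assumptions made for this sketch.
streaks <- expand.grid(Level = 1:10, k = 1:20)
streaks$Rank <- streaks$Level * streaks$k

# Hypothetical stand-in for ps_streaks_get_max_rank_simple: for each level,
# take the n-th smallest Rank (or the largest available Rank if the level has
# fewer than n rows), then return the maximum over all levels.
max_rank_simple <- function(streaks, n, levels = sort(unique(streaks$Level))) {
  per_level <- vapply(levels, function(lv) {
    ranks <- sort(streaks$Rank[streaks$Level == lv])
    if (length(ranks) == 0) return(NA_real_)
    as.numeric(ranks[min(n, length(ranks))])
  }, numeric(1))
  max(per_level, na.rm = TRUE)
}

# Sketch of the sampling approach described in the steps above.
max_rank_by_sampling <- function(streaks, n, levels, scaling) {
  initial <- max_rank_simple(streaks, n, levels)    # step 1: sampled levels only
  cutoff <- initial * scaling                       # step 2: scale the estimate
  restricted <- streaks[streaks$Rank <= cutoff, ]   # step 3: restrict by Rank (inclusive, assumed)
  max_rank_simple(restricted, n)                    # step 4: all levels, smaller table
}

exact <- max_rank_simple(streaks, n = 3)
generous <- max_rank_by_sampling(streaks, n = 3, levels = c(3, 5, 8), scaling = 1.5)
tight <- max_rank_by_sampling(streaks, n = 3, levels = c(3, 5, 8), scaling = 1.1)
print(c(exact = exact, generous = generous, tight = tight))
```

On this toy table the exact value is 30. A scaling factor of 1.5 recovers it, while a scaling factor of 1.1 cuts off too many rows and returns 24, an underestimate.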

ps_streaks_get_max_rank_by_sampling(
  lzy_streaks,
  n,
  min_year,
  max_year,
  teams,
  levels,
  scaling
)

`lzy_streaks`	Lazy streaks table
`n`	Function will maximize value of `n`th highest rank
`min_year`	Minimum year for filter
`max_year`	Maximum year for filter
`teams`	Vector of team IDs for filter
`levels`	Intensity levels for the sampling, e.g. `c(25,50,75)`
`scaling`	Scaling factor, e.g. `1.5`

Notes:

This estimate will always be less than or equal to the true value.
This function calls ps_streaks_get_max_rank_simple twice, each time on a filtered version of the lazy streaks table (lzy_streaks): first restricted to the sampled intensity levels, then restricted by Rank. It is less efficient than ps_streaks_get_max_rank_simple on smaller datasets, but much faster on larger ones.
Increasing the scaling factor or the intensity sample space increases the accuracy at the cost of speed.
Smaller datasets require larger scaling factors, and larger datasets require smaller scaling factors.