ncaa_bb_pbp_parse: ncaa_bb_pbp_parse

View source: R/ncaa_bb.R

ncaa_bb_pbp_parse    R Documentation

ncaa_bb_pbp_parse

Description

Parses the box score and play-by-play (pbp) HTML for a game into a play-by-play table.

Usage

ncaa_bb_pbp_parse(box_html, pbp_html)

Arguments

box_html

An rvest HTML object for the game's box score page.

pbp_html

An rvest HTML object for the game's play-by-play page.

Value

A data frame of cleaned, parsed play-by-play data for the specified game_id.

Each play is checked for which team has possession. This is effectively the instantaneous "who has possession" at the moment a given stat is recorded. For example, if team A makes a shot, that shot is recorded as a possession of team A, not of team B, who has the ball immediately after the made shot. However, if team B forces a steal, team B is recorded as the possessing team. One very annoying edge case is fixed explicitly: a team wins the jump ball and then immediately loses it on a turnover. When possession cannot be guessed for a play, the gap is filled in based on who had possession before and after that play.

Some columns are on temporary hold because V1 PBP completely breaks them. They are deliberately commented out for a few reasons:

  1. As stringer data, these designators are somewhat noisy.

  2. As Seth Partnow pointed out in The Midrange Theory, these designators can be biased.

  3. These designators are only available in V2 of the PBP, not V1.

Lineups are built by grouping by period, because some lineups change between periods without being noted in the pbp. Whenever a player is subbed in, they have a 1 in that row and a 0 in the row before. Whenever a player is subbed out, they have a 0 in that row and a 1 in the row before. The 1s and 0s are then cascaded up and down to create a map of who is in the game at any given time. Player names are then mapped to the roster data frame. This could be noisy in theory, but the author has not seen any issues in practice. Some character-encoding cleanup is applied, and players with misspelled names in the pbp are dropped.

For V1, columns that are not recorded are set to NA so that they do not register as false negatives.
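The possession gap-filling and the substitution cascade described above can be illustrated with a minimal base R sketch. This is not the package's actual code: the helper names `fill_gaps()` and `on_court_flags()`, the `"in"`/`"out"` event coding, and the example vectors are all hypothetical, and the rule of preferring the earlier known value when filling a gap is an assumption.

```r
# Hypothetical helper: carry the last known value forward, then fill any
# leading gaps from the next known value (an assumed tie-breaking rule).
fill_gaps <- function(x) {
  fill_forward <- function(v) c(NA, v[!is.na(v)])[cumsum(!is.na(v)) + 1]
  before <- fill_forward(x)            # nearest known value at or before each row
  after  <- rev(fill_forward(rev(x)))  # nearest known value at or after each row
  out <- before
  out[is.na(out)] <- after[is.na(out)]
  out
}

# Hypothetical helper: turn one player's substitution events within a single
# period into on-court flags (1 = in the game, 0 = on the bench).
on_court_flags <- function(sub_event) {
  flag <- rep(NA_integer_, length(sub_event))
  ins  <- which(sub_event %in% "in")
  outs <- which(sub_event %in% "out")
  # Mark the row before each event first, so the event row itself wins.
  flag[ins[ins > 1] - 1]   <- 0L   # not in the game just before subbing in
  flag[outs[outs > 1] - 1] <- 1L   # in the game just before subbing out
  flag[ins]  <- 1L
  flag[outs] <- 0L
  # Cascade the known 1s and 0s down and up through the remaining rows.
  # A player with no substitution events in the period stays NA here.
  fill_gaps(flag)
}

# Possession: NA marks plays where the possessing team could not be guessed.
poss <- c(NA, "A", NA, NA, "B", NA)
poss_filled <- fill_gaps(poss)

# Lineups: apply the cascade separately within each period.
period    <- c(1, 1, 1, 1, 2, 2, 2)
sub_event <- c(NA, "in", NA, "out", NA, NA, "in")
on_court  <- unsplit(lapply(split(sub_event, period), on_court_flags), period)

print(poss_filled)
print(on_court)
```

Splitting by period before cascading keeps a flag from one period from leaking into the next, which matches the note that lineups can change between periods without a recorded substitution.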


ehess/ncaascrapR documentation built on March 28, 2022, 3:33 a.m.