View source: R/volume_over_time.R
growth_over_time | R Documentation
A start-to-finish download and analysis! This function, given a range of dates, a subset of data, and a grouping set, will produce an estimate of how foot traffic to those groups has changed over that date range within that subset.
growth_over_time(
  dates,
  by,
  ma = 7,
  dir = ".",
  old_dir = NULL,
  new_dir = NULL,
  filelist = NULL,
  filelist_norm = NULL,
  start_dates = NULL,
  filter = NULL,
  naics_link = NULL,
  origin = 0,
  key = NULL,
  secret = NULL,
  make_graph = FALSE,
  graph_by = NULL,
  line_labels = NULL,
  graph_by_titles = NULL,
  test_run = TRUE,
  read_opts = NULL,
  processing_opts = NULL,
  graph_opts = list(
    title = data.table::fcase(
      is.null(graph_by) & is.null(by), "SafeGraph Foot Traffic Growth",
      is.null(graph_by), paste("SafeGraph Foot Traffic Growth by",
                               paste(by, collapse = ", ")),
      min(by %in% graph_by) == 1, "SafeGraph Foot Traffic Growth",
      default = paste("SafeGraph Foot Traffic Growth by",
                      paste(by[!(by %in% graph_by)], collapse = ", "))
    )
  ),
  patterns_backfill_date = "2020/12/14/21/",
  norm_backfill_date = "2020/12/14/21/",
  ...
)
dates
The range of dates to cover in the analysis. Note that (1) analysis will track growth relative to the first date listed here, and (2) if additional, earlier dates are necessary to compute the moving average, they will be included as well.
by
A character vector of variable names to calculate growth separately by. You will get back a data set with one observation per date in dates for each combination of the by variables.
ma
Number of days over which to take the moving average.
dir
The folder where the SafeGraph data files are stored, and where any newly downloaded files will go.
old_dir
Where "old" (pre-December 7, 2020) files go, if not the same as dir.
new_dir
Where "new" (post-December 7, 2020) files go, if not the same as dir.
filelist
If your data is not structured as downloaded from AWS, use this option to pass a vector of (full) filenames for patterns CSV.GZ data instead of looking in dir.
filelist_norm
If your data is not structured as downloaded from AWS, use this option to pass a vector of (full) filenames for normalization CSV data instead of looking in dir.
start_dates
If using the filelist option, a vector of the start dates of the data in each of those files.
filter
A character variable describing a subset of the data to include, for example filter = 'brands %in% c("Macy\'s", "Target")' (see the Examples).
naics_link
Necessary only if filter refers to NAICS codes: a table linking POIs to their naics_code.
origin
The value indicating no growth/initial value. The first date for each group will have this value. Usually 0 (for "0 percent growth") or 1 ("100 percent of initial value"). See the sketch after this list for how ma and origin fit together.
key
A character string containing an AWS Access Key ID, necessary if your range of dates extends beyond the files already in dir and new files must be downloaded.
secret
A character string containing an AWS Secret Access Key, necessary if your range of dates extends beyond the files already in dir and new files must be downloaded.
make_graph
Set to TRUE to produce (and return) a nicely formatted graph showing growth over time, with separate lines for each by group.
graph_by
A character vector, which must be a subset of by, of the variables to make separate graphs for, with the remaining by variables becoming separate lines on each graph.
line_labels
A table linking values of the by variables to the labels to use for each line on the graph.
graph_by_titles
A table linking values of the graph_by variables to the titles to use for each graph, as with state_info in the Examples.
test_run
Runs your analysis for only the first week of data, just to make sure the output looks the way you want before you commit to the full download and processing.
read_opts
A named list of options to be sent to read_many_patterns when reading in the data.
processing_opts
A named list of options to be sent to processing_template.
graph_opts
A named list of options to be sent to graph_template, overriding the defaults shown in the Usage section above.
patterns_backfill_date
Character variable with the folder structure for the most recent backfill pull of the patterns data (see the default value for the expected format).
norm_backfill_date
A character string containing the series of dates that fills the X in the folder path of the normalization data on AWS (see the default value for the expected format).
...
Parameters to be passed on to the underlying reading and downloading functions.
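To make the moving-average and growth arguments concrete, here is a minimal sketch of the kind of calculation ma and origin control, run on a toy data.table. The column names mirror the Value section below, but the values are invented and the exact computation is an assumption for illustration, not the package's own code.

library(data.table)

# Toy stand-in for one group's sample-size-adjusted daily visits
# (the adj_visits variable in the output); values are invented
dt <- data.table(
  date       = seq(as.Date("2020-12-07"), by = "day", length.out = 14),
  adj_visits = c(100, 103, 98, 110, 115, 120, 117, 125, 130, 128, 135, 140, 138, 145)
)

ma     <- 7   # as in the ma argument: days in the moving average
origin <- 0   # as in the origin argument: 0 = "0 percent growth", 1 = "100 percent of initial value"

# Right-aligned ma-day moving average, then growth relative to the first
# available value, shifted so the starting point equals origin
dt[, ma_visits := frollmean(adj_visits, ma)]
dt[, growth_visits := ma_visits / first(na.omit(ma_visits)) - 1 + origin]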
This goes from start to finish, downloading any necessary files from AWS, reading them in and processing them, normalizing the data by sample size, calculating a moving average, and returning the processed data by group and date. It will even make you a nice graph, if you want, using graph_template.
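The *_opts lists let you tweak individual steps without calling the component functions yourself. For instance, title is the one graph_opts element whose existence the default value in Usage confirms, so an override might look like this (an illustrative sketch; running it requires data on hand or AWS credentials):

p <- growth_over_time(
  lubridate::ymd('2020-12-07') + lubridate::days(0:6),
  by = 'brands',
  make_graph = TRUE,
  graph_opts = list(title = 'Foot Traffic Growth, Week of December 7')
)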
Returns a data.table with all the variables in by, the date, the raw visits_by_day, the total_devices_seen normalization variable, the adj_visits variable adjusted for sample size, and growth_visits, which calculates growth from the start of the dates range. If make_graph is TRUE, it will instead return a list where the first element is that data.table, and the second is a ggplot graph object.
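When graph_by is used, the Examples below index the second element as p[[2]][[6]], which suggests that element is then a list of graphs, one per graph_by group. A short sketch of unpacking the return value under that assumption (again, running it requires data or AWS credentials):

out <- growth_over_time(lubridate::ymd('2020-12-07') + lubridate::days(0:6),
                        by = 'brands', make_graph = TRUE)
growth_dt <- out[[1]]   # the data.table described above
g         <- out[[2]]   # a ggplot object (or, with graph_by set, a list of them)
g                       # print the graph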
Be aware:
1. This will only work with the visits_by_day variable. Or at least it's only designed to. Maybe you can get it to work with something else.
2. This uses processing_template, so all the caveats of that function apply here. No attempt will be made to handle outliers, oddities in the data, etc. You get what you get. If you want anything more complex, you'll have to do it by hand! You might try mining this function's source code (just type growth_over_time in the console) to get started.
3. Each week of included data means a roughly 1GB AWS download unless it's already on your system. Please don't ask for more than you need, and if you have already downloaded the data, please input the directory properly to avoid re-downloading.
4. This requires data to be downloaded from AWS, and will not work on Shop data. See read_many_shop followed by processing_template for that.
5. Very long time frames, for example those crossing multiple years, will always be just a little suspect here. The sample changed structure considerably from 2019 to 2020. Usually this is handled by normalizing within each year and then calculating year-over-year change on top of that. This function doesn't do that, but you could take its output and do it yourself if you wanted, as sketched below.
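For instance, one rough way to do that by hand (a sketch under the assumption of a single group spanning two years, not package functionality) is to align days of the year across adjacent years:

library(data.table)
library(lubridate)

# 'res' stands in for growth_over_time() output with date and adj_visits columns
res[, `:=`(yr = year(date), doy = yday(date))]

# Shift each observation forward one year, then join on day-of-year
# (rough around leap years; weekday alignment would need more care)
prior <- res[, .(yr = yr + 1, doy, adj_visits_lastyr = adj_visits)]
res   <- merge(res, prior, by = c("yr", "doy"), all.x = TRUE)
res[, yoy_growth := adj_visits / adj_visits_lastyr - 1]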
TO BE ADDED SOON: Sample size adjustments to equalize sampling rates, and labeling.
## Not run:
data(state_info)

p <- growth_over_time(
  lubridate::ymd('2020-12-07') + lubridate::days(0:6),
  by = c('region', 'brands'),
  filter = 'brands %in% c("Macy\'s", "Target")',
  make_graph = TRUE,
  graph_by = 'region',
  graph_by_titles = state_info[, .(region, statename)],
  test_run = FALSE
)

# The overall growth data for Target and Macy's in this week
p[[1]]

# The graph of the growth of Target and Macy's in this week in Colorado
p[[2]][[6]]

## End(Not run)