View source: R/read_many_files.R
read_many_patterns    R Documentation
This accepts a directory. It will use read_patterns to load every .csv.gz file in that folder, assuming they are all patterns files. It then row-binds together each of the processed files. Finally, if post_by is specified, it re-performs the aggregation, which is handy for new-format patterns files that split the same week's data across multiple files.
read_many_patterns(
  dir = ".",
  recursive = TRUE,
  filelist = NULL,
  start_date = NULL,
  post_by = !is.null(by),
  by = NULL,
  fun = sum,
  na.rm = TRUE,
  filter = NULL,
  expand_int = NULL,
  expand_cat = NULL,
  expand_name = NULL,
  multi = NULL,
  naics_link = NULL,
  select = NULL,
  gen_fips = TRUE,
  silent = FALSE,
  ...
)
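For orientation, a minimal sketch of a default call, assuming the package is loaded; the folder name here is a hypothetical placeholder, not from the original examples:

library(SafeGraphR)
# Read every .csv.gz patterns file in the folder, keeping only two columns
# (poi_cbg is auto-added since gen_fips = TRUE by default)
patterns <- read_many_patterns(dir = 'my_patterns_folder',
                               select = c('brands', 'visits_by_day'))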
dir
Name of the directory the files are in.

recursive
Search in all subdirectories as well, as with the since-June-24-2020 format of the AWS downloads. There is not currently a way to include only a subset of these subdirectory files. Perhaps run list.files(recursive = TRUE) yourself and pass the result to filelist.

filelist
A vector of filenames to read in, OR a named list of options to send to patterns_lookup() to determine those filenames.

start_date
A vector of dates giving the first date present in each zip file, to be passed to read_patterns.

post_by
After reading in all the files, re-perform the aggregation to this level. Use a character vector of variable names (or a list of vectors if using multi). By default this re-aggregation happens whenever by is specified; see the sketch just after this argument list.

by, fun, na.rm, filter, expand_int, expand_cat, expand_name, multi, naics_link, select, gen_fips, silent, ...
Arguments to be passed to read_patterns, applied separately to each file.
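Because post_by defaults to !is.null(by), the by aggregation is re-run on the combined data automatically. A sketch of what that looks like (the folder name is hypothetical):

# Each file is aggregated to the state-brand-day level by read_patterns;
# post_by then re-runs that aggregation on the row-bound result, so a week
# split across several files collapses back into one set of rows
dt <- read_many_patterns(dir = 'my_patterns_folder',
                         by = c('state_fips', 'brands'),
                         expand_int = 'visits_by_day',
                         select = c('brands', 'visits_by_day'))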
Note that after reading in data, if gen_fips = TRUE, state and county names can be merged in using data(fips_to_names).
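For instance, a sketch of that merge, assuming the usual data.table output and that fips_to_names keys on state_fips and county_fips:

data(fips_to_names)
# patterns has state_fips and county_fips columns thanks to gen_fips = TRUE
patterns <- merge(patterns, fips_to_names,
                  by = c('state_fips', 'county_fips'),
                  all.x = TRUE)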
## Not run:
# Our current working directory is full of .csv.gz files!
# Too many... we will probably run out of memory if we try to read them all in at once, so let's chunk it
files <- list.files(pattern = '\\.gz$', recursive = TRUE)
patterns <- read_many_patterns(
  filelist = files[1:10],
  # We only need these variables (poi_cbg is auto-added with gen_fips = TRUE)
  select = c('brands', 'visits_by_day'),
  # We want two formatted files to come out. The first aggregates to the
  # state-brand-day level, getting visits by day
  multi = list(
    list(name = 'by_brands', by = c('state_fips', 'brands'),
         expand_int = 'visits_by_day'),
    # The second aggregates to the state-county-day level, but only for
    # Colorado and Connecticut (see the filter)
    list(name = 'co_and_ct', by = c('state_fips', 'county_fips'),
         filter = 'state_fips %in% 8:9', expand_int = 'visits_by_day')
  )
)
patterns_brands <- patterns[[1]]
patterns_co_and_ct <- patterns[[2]]
# Alternately, find the files we need for the seven days starting December 7, 2020,
# read them all in (and if we'd given key and secret too, download them first),
# and then aggregate to the state-date level
dt <- read_many_patterns(
  filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
  by = "state_fips", expand_int = 'visits_by_day',
  select = 'visits_by_day'
)
# Don't forget that if you want weekly data but AREN'T using visits_by_day
# (for example, if you're using visitor_home_cbgs)
# you want start_date in your by option, as in the second list in multi here
dt <- read_many_patterns(
  filelist = list(dates = lubridate::ymd("2020-12-07") + lubridate::days(0:6)),
  select = c('visits_by_day', 'visitor_home_cbgs'),
  multi = list(
    list(name = 'visits', by = 'state_fips',
         expand_int = 'visits_by_day', filter = 'state_fips == 6'),
    list(name = 'cbg', by = c('start_date', 'state_fips'),
         expand_cat = 'visitor_home_cbgs', filter = 'state_fips == 6')
  )
)
## End(Not run)
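To carry the chunking idea from the first example through the whole file list, one possible approach (a sketch; the chunk size and the output column names are assumptions) is to read ten files at a time and row-bind the results with data.table:

library(data.table)
chunks <- split(files, ceiling(seq_along(files) / 10))
dt_all <- rbindlist(lapply(chunks, function(fl) {
  read_many_patterns(filelist = fl,
                     by = 'state_fips',
                     expand_int = 'visits_by_day',
                     select = 'visits_by_day')
}))
# Groups split across chunks still need one final aggregation;
# 'date' here is an assumption about the expanded output's column name
dt_all <- dt_all[, .(visits_by_day = sum(visits_by_day, na.rm = TRUE)),
                 by = .(state_fips, date)]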