#' covidcast: Client for Delphi's COVIDcast API
#'
#' The covidcast package provides access to numerous COVID-19 data streams,
#' updated daily, covering the United States of America. These include publicly
#' reported cases and deaths data, along with data the Delphi research group
#' collects or obtains from partners.
#'
#' @section Finding data sources and documentation:
#'
#' The COVIDcast API includes:
#'
#' * publicly reported COVID case and death data
#' * insurance claims data reporting on COVID-related doctor's visits and
#'   hospitalizations, obtained from health partners
#' * aggregate results from massive COVID symptom surveys conducted by Delphi
#' * mobility data aggregated from SafeGraph
#' * symptom search trends from Google
#'
#' and numerous other important signals, most available daily at the county
#' level.
#'
#' Each data stream is identified by its data source and signal names. These are
#' documented on the COVIDcast API website:
#' <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>
#'
#' Each data stream has a page giving detailed technical documentation on how
#' the data is collected, how it is aggregated, and any limitations or known
#' problems with the data.
#'
#' @section Getting started:
#'
#' We recommend browsing the vignettes, which include numerous examples:
#' `browseVignettes(package = "covidcast")`.
#'
#' See also `covidcast_signal()` for details on how to obtain COVIDcast data as
#' a data frame.
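#'
#' As a minimal sketch of a first query (the source and signal here are just
#' one example; any documented pair may be substituted):
#'
#' ```
#' library(covidcast)
#' cli <- covidcast_signal("fb-survey", "smoothed_cli",
#'                         start_day = "2020-05-01", end_day = "2020-05-07",
#'                         geo_type = "state")
#' ```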
#'
#' @references A. Reinhart et al., An open repository of real-time COVID-19
#'   indicators. Proc. Natl. Acad. Sci. U.S.A. 118, e2111452118 (2021).
#'   \doi{10.1073/pnas.2111452118}
#'
#' @keywords internal
"_PACKAGE"

# API base url
COVIDCAST_BASE_URL <- 'https://api.covidcast.cmu.edu/epidata/'

# Max rows returned by API
MAX_RESULTS <- 1000000

# Package-local environment for shared storage
pkg_env <- new.env()

# Cache the httr covidcast_meta response when available
pkg_env$META_RESPONSE <- NA

.onAttach <- function(libname, pkgname) {
  msg <- c("We encourage COVIDcast API users to register on our mailing list:",
           "https://lists.andrew.cmu.edu/mailman/listinfo/delphi-covidcast-api",
           "We'll send announcements about new data sources, package updates,",
           "server maintenance, and new features.")
  packageStartupMessage(paste(msg, collapse = "\n"))
}

#' Obtain a data frame for one COVIDcast signal
#'
#' Obtains data for selected date ranges for all geographic regions of the
#' United States. Available data sources and signals are documented in the
#' COVIDcast signal documentation.
#' Most (but not all) data sources are available at the county level, but the
#' API can also return data aggregated to metropolitan statistical areas,
#' hospital referral regions, or states, as desired, by using the `geo_type`
#' argument.
#'
#' For data on counties, metropolitan statistical areas, and states, this
#' package provides the [`county_census`], [`msa_census`], and [`state_census`]
#' datasets. These include each area's unique identifier, used in the
#' `geo_values` argument to select specific areas, and basic information on
#' population and other Census data.
#'
#' Downloading large amounts of data may be slow, so this function prints
#' messages for each chunk of data it downloads. To suppress these, use
#' [base::suppressMessages()], as in
#' `suppressMessages(covidcast_signal("fb-survey", ...))`.
#'
#' @section Metadata:
#'
#' The returned object has a `metadata` attribute attached containing basic
#' information about the signal. Use `attributes(x)$metadata` to access this
#' metadata. The metadata is stored as a data frame of one row, and contains the
#' same information that `covidcast_meta()` would return for a given signal.
#'
#' Note that not all `covidcast_signal` objects may have all fields of metadata
#' attached; for example, an object created with `as.covidcast_signal()` using
#' data from another source may only contain the `geo_type` variable, along with
#' `data_source` and `signal`. Before using the metadata of a `covidcast_signal`
#' object, always check for the presence of the attributes you need.
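#'
#' For example, a sketch of inspecting the attached metadata (the exact fields
#' available depend on what the API reports for the signal):
#'
#' ```
#' cli <- covidcast_signal("fb-survey", "smoothed_cli",
#'                         start_day = "2020-05-01", end_day = "2020-05-07")
#' attributes(cli)$metadata$geo_type
#' ```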
#'
#' @section Issue dates and revisions:
#'
#' The COVIDcast API tracks updates and changes to its underlying data, and
#' records the first date each observation became available. For example, a data
#' source may report its estimate for June 3rd in a specific state on June 5th,
#' once the underlying records become available. This data is considered
#' "issued" on June 5th.
#' Later, the data source may update its estimate for June 3rd based on revised
#' data, creating a new issue on June 8th. By default, `covidcast_signal()`
#' returns the most recent issue available for every observation. The `as_of`,
#' `issues`, and `lag` parameters allow the user to select specific issues
#' instead, or to see all updates to observations. These options are mutually
#' exclusive, and you should only specify one; if you specify more than one, you
#' may get an error or confusing results.
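#'
#' For example, a sketch of requesting data as it would have appeared on June
#' 10th, 2020, and of requesting every issue of the same observations (dates
#' chosen purely for illustration):
#'
#' ```
#' covidcast_signal("fb-survey", "smoothed_cli", start_day = "2020-06-01",
#'                  end_day = "2020-06-03", geo_type = "state",
#'                  as_of = "2020-06-10")
#'
#' covidcast_signal("fb-survey", "smoothed_cli", start_day = "2020-06-01",
#'                  end_day = "2020-06-03", geo_type = "state",
#'                  issues = c("2020-06-01", "2020-06-10"))
#' ```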
#'
#' Note that the API only tracks the initial value of an estimate and *changes*
#' to that value. If a value was first issued on June 5th and never updated,
#' asking for data issued on June 6th (using `issues` or `lag`) would *not*
#' return that value, though asking for data `as_of` June 6th would. See
#' `vignette("covidcast")` for examples.
#'
#' Note also that the API enforces a maximum result row limit; results beyond
#' the maximum limit are truncated. This limit is sufficient to fetch
#' observations in all counties in the United States on one day. This client
#' automatically splits queries for multiple days across multiple API calls.
#' However, if data for one day has been issued many times, using the `issues`
#' argument may return more results than the query limit. A warning will be
#' issued in this case. To see all results, split your query across multiple
#' calls with different `issues` arguments.
#'
#' @section API keys:
#'
#' By default, `covidcast_signal()` submits queries to the API anonymously. All
#' the examples in the package documentation are compatible with anonymous use
#' of the API, but [there are some limits on anonymous
#' queries](https://cmu-delphi.github.io/delphi-epidata/api/api_keys.html),
#' including a rate limit. If you regularly query large amounts of data, please
#' consider [registering for a free API
#' key](https://api.delphi.cmu.edu/epidata/admin/registration_form), which lifts
#' these limits. Even if your usage falls within the anonymous usage limits,
#' registration helps us understand who is using the Delphi Epidata API and
#' how, which may in turn inform future research, data partnerships, and
#' funding.
#'
#' If you have an API key, you can use it by setting the `covidcast.auth`
#' option once before calling `covidcast_signal()` or `covidcast_signals()`:
#'
#' ```
#' options(covidcast.auth = "your_api_key")
#'
#' cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_cli",
#'                         start_day = "2020-05-01", end_day = "2020-05-07",
#'                         geo_type = "state")
#' ```
#'
#' @param data_source String identifying the data source to query. See
#'   <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>
#'   for a list of available data sources.
#' @param signal String identifying the signal from that source to query. Again,
#'   see <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>
#'   for a list of available signals.
#' @param start_day Query data beginning on this date. Date object, or string in
#'   the form "YYYY-MM-DD". If `start_day` is `NULL`, defaults to first day data
#'   is available for this signal.
#' @param end_day Query data up to this date, inclusive. Date object or string
#'   in the form "YYYY-MM-DD". If `end_day` is `NULL`, defaults to the most
#'   recent day data is available for this signal.
#' @param geo_type The geography type for which to request this data, such as
#'   "county" or "state". Defaults to "county". See
#'   <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html>
#'   for details on which types are available.
#' @param geo_values Which geographies to return. The default, "*", fetches all
#'   geographies. To fetch specific geographies, specify their IDs as a vector
#'   or list of strings. See
#'   <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html>
#'   for details on how to specify these IDs.
#' @param as_of Fetch only data that was available on or before this date,
#'   provided as a `Date` object or string in the form "YYYY-MM-DD". If `NULL`,
#'   the default, return the most recent available data. Note that only one of
#'   `as_of`, `issues`, and `lag` should be provided; it does not make sense to
#'   specify more than one. For more on data revisions, see
#'   "Issue dates and revisions" below.
#' @param issues Fetch only data that was published or updated ("issued") on
#'   these dates. Provided as either a single `Date` object (or string in the
#'   form "YYYY-MM-DD"), indicating a single date to fetch data issued on, or a
#'   vector specifying two dates, start and end. In this case, return all data
#'   issued in this range. There may be multiple rows for each observation,
#'   indicating several updates to its value. If `NULL`, the default, return the
#'   most recently issued data.
#' @param lag Integer. If, for example, `lag = 3`, fetch only data that was
#'   published or updated exactly 3 days after its `time_value`: a row with
#'   `time_value` of June 3 will only be included in the results if its data
#'   was issued or updated on June 6. If `NULL`, the default, return the most
#'   recently issued data regardless of its lag.
#' @param time_type The temporal resolution at which to request this data. Most
#'   signals are available at the "day" resolution (the default); some are only
#'   available at the "week" resolution, representing an MMWR week ("epiweek").
#'
#' @return `covidcast_signal` object with matching data. The object is a data
#'   frame with additional metadata attached. Each row is one observation of one
#'   signal on one day in one geographic location. Contains the following
#'   columns:
#'
#'   \item{data_source}{Data source from which this observation was obtained.}
#'   \item{signal}{Signal from which this observation was obtained.}
#'   \item{geo_value}{String identifying the location, such as a state name or
#'   county FIPS code.}
#'   \item{time_value}{Date object identifying the date of this observation. For
#'   data with `time_type = "week"`, this is the first day of the corresponding
#'   epiweek.}
#'   \item{issue}{Date object identifying the date this estimate was issued.
#'   For example, an estimate with a `time_value` of June 3 might have been
#'   issued on June 5, after the data for June 3rd was collected and ingested
#'   into the API.}
#'   \item{lag}{Integer giving the difference between `issue` and `time_value`,
#'   in days.}
#'   \item{value}{Signal value being requested. For example, in a query for the
#'   "confirmed_cumulative_num" signal from the "usa-facts" source, this would
#'   be the cumulative number of confirmed cases in the area, as of the given
#'   `time_value`.}
#'   \item{stderr}{Associated standard error of the signal value, if available.}
#'   \item{sample_size}{Integer indicating the sample size available in that
#'   geography on that day; sample size may not be available for all signals,
#'   due to privacy or other constraints, in which case it will be `NA`.}
#'
#'   Consult the signal documentation for more details on how values and
#'   standard errors are calculated for specific signals.
#'
#'   The returned data frame has a `metadata` attribute containing metadata
#'   about the signal contained within; see "Metadata" below for details.
#'
#' @references COVIDcast API documentation:
#'   \url{https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html}
#'
#' Documentation of all COVIDcast sources and signals:
#' \url{https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html}
#'
#' COVIDcast public dashboard: \url{https://delphi.cmu.edu/covidcast/}
#'
#' @examples \dontrun{
#' ## Fetch all counties from 2020-05-10 to the most recent available data
#' covidcast_signal("fb-survey", "smoothed_cli", start_day = "2020-05-10")
#' ## Fetch all counties on just 2020-05-10 and no other days
#' covidcast_signal("fb-survey", "smoothed_cli", start_day = "2020-05-10",
#'                  end_day = "2020-05-10")
#' ## Fetch all states on 2020-05-10, 2020-05-11, 2020-05-12
#' covidcast_signal("fb-survey", "smoothed_cli", start_day = "2020-05-10",
#'                  end_day = "2020-05-12", geo_type = "state")
#' ## Fetch all available data for just Pennsylvania and New Jersey
#' covidcast_signal("fb-survey", "smoothed_cli", geo_type = "state",
#'                  geo_values = c("pa", "nj"))
#' ## Fetch all available data in the Pittsburgh metropolitan area
#' covidcast_signal("fb-survey", "smoothed_cli", geo_type = "msa",
#'                  geo_values = name_to_cbsa("Pittsburgh"))
#' }
#'
#' @seealso [plot.covidcast_signal()], [covidcast_signals()],
#'   [`as.covidcast_signal()`], [`county_census`], [`msa_census`],
#'   [`state_census`]
#' @export
#' @importFrom rlang abort `:=`
#' @importFrom dplyr `%>%`
covidcast_signal <- function(data_source, signal,
                             start_day = NULL, end_day = NULL,
                             geo_type = c("county", "hrr", "msa", "dma",
                                          "state", "hhs", "nation"),
                             geo_values = "*",
                             as_of = NULL, issues = NULL, lag = NULL,
                             time_type = c("day", "week")) {
  geo_type <- match.arg(geo_type)
  time_type <- match.arg(time_type)

  relevant_meta <- specific_meta(data_source, signal, geo_type, time_type)

  if (is.null(start_day) || is.null(end_day)) {
    if (is.null(relevant_meta$max_time) || is.null(relevant_meta$min_time)) {
      abort(
        paste0("No match in metadata for source '", data_source,
               "', signal '", signal, "', and geo_type '", geo_type,
               "' at the ", time_type, " level. ",
               "Check that the source and signal are correctly spelled and ",
               "that the signal is available at this geographic level."),
        data_source = data_source,
        signal = signal,
        geo_type = geo_type,
        time_type = time_type,
        class = "covidcast_meta_not_found"
      )
    }
  }

  if (is.null(start_day)) {
    start_day <- relevant_meta$min_time
  } else {
    start_day <- as.Date(start_day)
  }

  if (is.null(end_day)) {
    end_day <- relevant_meta$max_time
  } else {
    end_day <- as.Date(end_day)
  }

  if (start_day > end_day) {
    stop("`end_day` must be on or after `start_day`.")
  }

  if (!is.null(as_of)) {
    as_of <- as.Date(as_of)

    # By definition, there can never be data from the future. So clamp `end_day`
    # to be no larger than `as_of`.
    end_day <- min(as_of, end_day)
  }

  if (!is.null(issues)) {
    issues <- as.Date(issues)
  }
  if (identical(geo_values, "*")) {
    max_geos <- relevant_meta$num_locations
  } else {
    max_geos <- length(geo_values)
  }
  df <- covidcast_days(data_source, signal, start_day, end_day, geo_type,
                       geo_values, time_type, as_of, issues, lag, max_geos)

  # Drop direction column (if it still exists)
  df$direction <- NULL

  return(as.covidcast_signal(df, signal, geo_type, time_type, data_source,
                             metadata = relevant_meta))
}

#' Convert data from an external source into a form compatible with
#' `covidcast_signal`.
#'
#' Several methods are provided to convert common objects (such as data frames)
#' into `covidcast_signal` objects, which can be used with the various
#' `covidcast_signal` methods (such as `plot.covidcast_signal()` or
#' `covidcast_cor()`). See `vignette("external-data")` for examples.
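#'
#' As a minimal sketch (the values below are made up for illustration), a data
#' frame with `geo_value`, `time_value`, and `value` columns can be converted
#' directly:
#'
#' ```
#' df <- data.frame(geo_value = c("pa", "nj"),
#'                  time_value = as.Date("2020-05-10"),
#'                  value = c(0.51, 0.32))
#' as.covidcast_signal(df, signal = "my_signal", geo_type = "state")
#' ```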
#'
#' @param x Object to be converted. See Methods section below for details on
#'   formatting of each input type.
#' @param geo_type The geography type stored in this object.
#' @param time_type The time resolution stored in this object. If "day", the
#'   default, each observation covers one day. If "week", each time value is
#'   assumed to be the start date of the epiweek (MMWR week) that the data
#'   represents.
#' @param data_source The name of the data source to use as a label for this
#'   data.
#' @param signal The signal name to use for this data.
#' @param issue Issue date to use for this data, if not present in `x`, as a
#'   `Date` object. If no issue date is present in `x` and `issue` is `NULL`,
#'   today's date will be used.
#' @param metadata List of metadata to attach to the `covidcast_signal` object.
#'   See the "Metadata" section of `covidcast_signal()`. All objects will have
#'   `data_source`, `signal`, `geo_type`, and `time_type` columns included in
#'   their metadata; named entries in this list are added as additional columns.
#' @param ... Additional arguments passed to methods.
#' @return `covidcast_signal` object; see `covidcast_signal()` for documentation
#'   of fields and structure.
#' @seealso [`covidcast_signal()`]
#' @export
as.covidcast_signal <- function(x, ...) {
  UseMethod("as.covidcast_signal")
}

#' @method as.covidcast_signal covidcast_signal
#' @describeIn as.covidcast_signal Simply returns the `covidcast_signal` object
#'   unchanged.
#' @export
as.covidcast_signal.covidcast_signal <- function(x, ...) {
  return(x)
}

#' @method as.covidcast_signal data.frame
#' @describeIn as.covidcast_signal The input data frame `x` must contain the
#'   columns `time_value`, `value`, and `geo_value`. If an `issue` column is
#'   present in `x`, it will be used as the issue date for each observation; if
#'   not, the `issue` argument will be used. Other columns will be preserved
#'   as-is.
#' @export
as.covidcast_signal.data.frame <- function(x,
                                           signal = NULL,
                                           geo_type = c(
                                             "county",
                                             "msa",
                                             "hrr",
                                             "dma",
                                             "state",
                                             "hhs",
                                             "nation"
                                           ),
                                           time_type = c("day", "week"),
                                           data_source = "user",
                                           issue = NULL,
                                           metadata = list(),
                                           ...) {
  if (is.null(signal)) {
    abort(paste0(c(
      "when `x` is a data frame, signal name must",
      " be provided to as.covidcast_signal"
    ), collapse = ""), class = "covidcast_coerce_signal")
  }

  geo_type <- match.arg(geo_type)
  time_type <- match.arg(time_type)

  metadata$data_source <- data_source
  metadata$signal <- signal
  metadata$geo_type <- geo_type
  metadata$time_type <- time_type
  metadata <- as.data.frame(metadata)

  if (!("data_source" %in% names(x))) {
    x$data_source <- data_source
  }

  if (!("signal" %in% names(x))) {
    x$signal <- signal
  }

  class(x) <- c("covidcast_signal", "data.frame")

  if (nrow(x) == 0) {
    # No data; columns don't matter.
    return(x)
  }

  if (!("time_value" %in% names(x))) {
    abort(paste0(c(
      "`x` must contain a `time_value` column",
      " containing the time of each observation"
    ), collapse = ""), class = "covidcast_coerce_time_value")
  }

  if (!("value" %in% names(x))) {
    abort(paste0(c(
      "`x` must contain a `value` column",
      " containing the value of each observation"
    ), collapse = ""), class = "covidcast_coerce_value")
  }

  if (!("geo_value" %in% names(x))) {
    abort(paste0(c(
      "`x` must contain a `geo_value` column",
      " containing the location of each observation"
    ), collapse = ""), class = "covidcast_coerce_geo_value")
  }

  # issue is optional; if omitted, use default
  if (!("issue" %in% names(x))) {
    if (is.null(issue)) {
      x$issue <- Sys.Date()
    } else {
      x$issue <- issue
    }
  }

  # Reorder data_source, signal, geo_value, time_value, so that they appear in
  # this order.
  x <- dplyr::relocate(x, "data_source", "signal", "geo_value", "time_value")

  attributes(x)$metadata <- metadata

  return(x)
}

#' Print `covidcast_signal` object
#'
#' Prints a brief summary of the data source, signal, and geographic level, and
#' then prints the underlying data frame, for an object returned by
#' `covidcast_signal()`.
#'
#' @param x The `covidcast_signal` object.
#' @param ... Additional arguments passed to `print.data.frame()` to print the
#'   data.
#' @return The `covidcast_signal` object, unchanged.
#'
#' @method print covidcast_signal
#' @export
print.covidcast_signal <- function(x, ...) {
  cat(sprintf("A `covidcast_signal` dataframe with %i rows and %i columns.\n\n",
              nrow(x), ncol(x)))
  cat(sprintf("%-12s: %s\n", "data_source", x$data_source[1]))
  cat(sprintf("%-12s: %s\n", "signal", x$signal[1]))
  cat(sprintf("%-12s: %s\n\n", "geo_type", attributes(x)$metadata$geo_type))

  # forward to print the data as well
  NextMethod("print")
}

#' @method head covidcast_signal
#' @importFrom utils head
#' @export
head.covidcast_signal <- function(x, ...) {
  head(as.data.frame(x), ...)
}

#' Summarize `covidcast_signal` object
#'
#' Prints a variety of summary statistics about the underlying data, such as
#' median values, the date range included, sample sizes, and so on, for an
#' object returned by `covidcast_signal()`.
#'
#' @param object The `covidcast_signal` object.
#' @param ... Additional arguments, for compatibility with `summary()`.
#'   Currently unused.
#' @return No return value; called only to print summary statistics.
#'
#' @method summary covidcast_signal
#' @importFrom stats median
#' @export
summary.covidcast_signal <- function(object, ...) {
  x <- object
  cat(sprintf("A `covidcast_signal` dataframe with %i rows and %i columns.\n\n",
              nrow(x), ncol(x)))
  cat(sprintf("%-12s: %s\n", "data_source", x$data_source[1]))
  cat(sprintf("%-12s: %s\n", "signal", x$signal[1]))
  cat(sprintf("%-12s: %s\n\n", "geo_type", attributes(x)$metadata$geo_type))
  cat(sprintf("%-36s: %s\n", "first date", min(x$time_value)))
  cat(sprintf("%-36s: %s\n", "last date", max(x$time_value)))
  cat(sprintf("%-36s: %i\n", "median number of geo_values per day",
              as.integer(x %>% dplyr::group_by(.data$time_value) %>%
                         dplyr::summarize(num = dplyr::n()) %>%
                         dplyr::summarize(median(.data$num)))))
}

##########

#' Obtain multiple COVIDcast signals at once
#'
#' This convenience function uses `covidcast_signal()` to obtain multiple
#' signals, potentially from multiple data sources.
#'
#' @details The argument structure is just as in `covidcast_signal()`, except
#'   the first four arguments `data_source`, `signal`, `start_day`, `end_day`
#'   are permitted to be vectors. The first two arguments `data_source` and
#'   `signal` are recycled appropriately in the calls to `covidcast_signal()`;
#'   see example below. The next two arguments `start_day`, `end_day`, unless
#'   `NULL`, must be either length 1 or N.
#'
#' See `vignette("multi-signals")` for additional examples.
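#'
#' For example, a sketch of the recycling behavior: because `start_day` and
#' `end_day` below have length N = 2, each signal is fetched over its own date
#' range (dates chosen purely for illustration):
#'
#' ```
#' covidcast_signals("usa-facts",
#'                   signal = c("confirmed_incidence_num",
#'                              "deaths_incidence_num"),
#'                   start_day = c("2020-08-15", "2020-09-01"),
#'                   end_day = c("2020-10-01", "2020-10-15"),
#'                   geo_type = "state")
#' ```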
#'
#' @return A list of `covidcast_signal` data frames, of length `N =
#'     max(length(data_source), length(signal))`. This list can be aggregated
#'     into a single data frame of either "wide" or "long" format using
#'     `aggregate_signals()`.
#'
#' @inheritParams covidcast_signal
#'
#' @seealso [covidcast_signal()], [aggregate_signals()]
#' @examples
#' \dontrun{
#' ## Fetch USAFacts confirmed cases and deaths over the same time period
#' covidcast_signals("usa-facts", signal = c("confirmed_incidence_num",
#'                                           "deaths_incidence_num"),
#'                    start_day = "2020-08-15", end_day = "2020-10-01")
#' }
#' @export
covidcast_signals <- function(data_source, signal,
                              start_day = NULL, end_day = NULL,
                              geo_type = c(
                                "county",
                                "hrr",
                                "msa",
                                "dma",
                                "state",
                                "hhs",
                                "nation"
                              ),
                              time_type = c("day", "week"),
                              geo_values = "*",
                              as_of = NULL, issues = NULL, lag = NULL) {
  N <- max(length(data_source), length(signal))
  df_list <- vector("list", N)

  if (!(length(start_day) %in% c(0, 1, N))) {
    stop("When `start_day` is not `NULL`, it should have length 1 or N, where ",
         "`N = max(length(data_source), length(signal))`.")
  }

  if (!(length(end_day) %in% c(0, 1, N))) {
    stop("When `end_day` is not `NULL`, it should have length 1 or N, where ",
         "`N = max(length(data_source), length(signal))`.")
  }

  data_source <- rep(data_source, length.out = N)
  signal <- rep(signal, length.out = N)

  # rep() throws a warning on NULLs (and you can't make a vector of NULLs), so
  # only replicate these when provided. When NULL, leave as NULL so
  # `covidcast_signal()` can determine correct dates from metadata.
  if (!is.null(start_day)) {
    start_day <- rep(start_day, length.out = N)
  }

  if (!is.null(end_day)) {
    end_day <- rep(end_day, length.out = N)
  }

  for (i in 1:N) {
    df_list[[i]] <- covidcast_signal(data_source = data_source[i],
                                     signal = signal[i],
                                     start_day = start_day[i],
                                     end_day = end_day[i],
                                     geo_type = geo_type,
                                     time_type = time_type,
                                     geo_values = geo_values,
                                     as_of = as_of, issues = issues,
                                     lag = lag)
  }

  return(df_list)
}

##########

#' Obtain COVIDcast metadata
#'
#' Obtains a data frame of metadata describing all publicly available data
#' streams from the COVIDcast API.
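#'
#' For example, a sketch of listing the signals offered by one data source
#' (any ordinary data frame operations work on the result):
#'
#' ```
#' meta <- covidcast_meta()
#' unique(meta[meta$data_source == "fb-survey", "signal"])
#' ```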
#'
#' @return Data frame containing one row per signal, with the following columns:
#'   \item{data_source}{Data source name.}
#'   \item{signal}{Signal name.}
#'   \item{min_time}{First day for which this signal is available.}
#'   \item{max_time}{Most recent day for which this signal is available.}
#'   \item{geo_type}{Geographic level for which this signal is available, such
#'   as county, state, msa, or hrr. Most signals are available at multiple
#'   geographic levels and will hence be listed in multiple rows with their
#'   own metadata.}
#'   \item{time_type}{Temporal resolution at which this signal is reported.
#'   "day", for example, means the signal is reported daily.}
#'   \item{num_locations}{Number of distinct geographic locations available for
#'   this signal. For example, if `geo_type` is county, the number of counties
#'   for which this signal has ever been reported.}
#'   \item{min_value}{Smallest value that has ever been reported.}
#'   \item{max_value}{Largest value that has ever been reported.}
#'   \item{mean_value}{Arithmetic mean of all reported values.}
#'   \item{stdev_value}{Sample standard deviation of all reported values.}
#'   \item{max_issue}{Most recent issue date for this signal.}
#'   \item{min_lag}{Smallest lag from observation to issue, in `time_type`
#'   units.}
#'   \item{max_lag}{Largest lag from observation to issue, in `time_type`
#'   units.}
#'
#' @references COVIDcast API sources and signals documentation:
#'  \url{https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html}
#'
#' @seealso [summary.covidcast_meta()]
#'
#' @importFrom utils read.csv
#' @export
covidcast_meta <- function() {
  meta <- .request_meta()

  if (nchar(meta) == 0) {
    abort("Failed to obtain metadata", class = "covidcast_meta_fetch_failed")
  }

  # Helper to apply api_to_date() row by row, since api_to_date() does not
  # accept a vectorized time_type.
  adjust_dates <- function(col, time_type) {
    # purrr::map2() returns a list and does not automatically simplify to a
    # vector of Dates. However, c()-ing the elements together restores the
    # correct Date type.
    do.call("c", purrr::map2(col, time_type, api_to_date))
  }

  meta <- read.csv(textConnection(meta), stringsAsFactors = FALSE) %>%
    dplyr::mutate(min_time = adjust_dates(.data$min_time, .data$time_type),
                  max_time = adjust_dates(.data$max_time, .data$time_type),
                  max_issue = adjust_dates(.data$max_issue, .data$time_type))

  class(meta) <- c("covidcast_meta", "data.frame")
  return(meta)
}

#' Print `covidcast_meta` object
#'
#' Prints a brief summary of the metadata, and then prints the underlying data
#' frame, for an object returned by `covidcast_meta()`.
#'
#' @param x The `covidcast_meta` object.
#' @param ... Additional arguments passed to `print.data.frame()` to print the
#'   data.
#' @return The `covidcast_meta` object, unchanged.
#'
#' @method print covidcast_meta
#' @export
print.covidcast_meta <- function(x, ...) {
  cat(sprintf("A `covidcast_meta` data frame with %i rows and %i columns.\n\n",
              nrow(x), ncol(x)))
  cat(sprintf("%-23s: %s\n", "Number of data sources",
              length(unique(x$data_source))))
  cat(sprintf("%-23s: %s\n\n", "Number of signals",
              length(unique(paste(x$data_source, x$signal)))))

  # forward to print the data as well
  NextMethod("print")
}

#' @method head covidcast_meta
#' @importFrom utils head
#' @export
head.covidcast_meta <- function(x, ...) {
  head(as.data.frame(x), ...)
}

#' Summarize `covidcast_meta` object
#'
#' Prints a tabular summary of the object returned by `covidcast_meta()`,
#' containing each source and signal and a summary of the geographic levels it
#' is available at.
#'
#' @param object The `covidcast_meta` object.
#' @param ... Additional arguments, for compatibility with `summary()`.
#'   Currently unused.
#' @return A data frame with one row per unique signal in the metadata, with
#'   the following columns:
#' \item{data_source}{Data source name}
#' \item{signal}{Signal name}
#' \item{county}{`"*"` if this signal is available at the county level, `""`
#' otherwise}
#' \item{msa}{`"*"` if this signal is available at the Metropolitan Statistical
#' Area level, `""` otherwise}
#' \item{dma}{`"*"` if this signal is available at the Designated Market Area
#' level, `""` otherwise}
#' \item{hrr}{`"*"` if this signal is available at the Hospital Referral Region
#' level, `""` otherwise}
#' \item{state}{`"*"` if this signal is available at the state level, `""`
#' otherwise}
#' \item{hhs}{`"*"` if this signal is available at the Health and Human Services
#' region level, `""` otherwise}
#' \item{nation}{`"*"` if this signal is available at the national level, `""`
#' otherwise}
#'
#' @method summary covidcast_meta
#' @export
summary.covidcast_meta <- function(object, ...) {
  x <- object
  cat(sprintf("A `covidcast_meta` data frame with %i rows and %i columns.\n\n",
              nrow(x), ncol(x)))
  cat(sprintf("%-23s: %s\n", "Number of data sources",
              length(unique(x$data_source))))
  cat(sprintf("%-23s: %s\n\n", "Number of signals",
              length(unique(paste(x$data_source, x$signal)))))
  cat("Summary:\n\n")
  df <- suppressMessages(
    x %>% dplyr::group_by(.data$data_source, .data$signal) %>%
    dplyr::summarize(county = ifelse("county" %in% .data$geo_type, "*", ""),
                     msa = ifelse("msa" %in% .data$geo_type, "*", ""),
                     dma = ifelse("dma" %in% .data$geo_type, "*", ""),
                     hrr = ifelse("hrr" %in% .data$geo_type, "*", ""),
                     state = ifelse("state" %in% .data$geo_type, "*", ""),
                     hhs = ifelse("hhs" %in% .data$geo_type, "*", ""),
                     nation = ifelse("nation" %in% .data$geo_type, "*", "")
                     ) %>%
    dplyr::ungroup()
  )
  print(as.data.frame(df), right = FALSE, row.names = FALSE)
  invisible(df)
}

# Retrieve only the metadata that is specific to the values of interest (signal,
# geo_type, etc.)
specific_meta <- function(data_source, signal, geo_type, time_type = "day") {
  meta_info <- covidcast_meta()
  relevant_meta <- meta_info[meta_info$data_source == data_source &
                             meta_info$signal == signal &
                             meta_info$geo_type == geo_type &
                             meta_info$time_type == time_type, ]

  # If no metadata for source/signal/geo_type combo, still return minimal data.
  # Use maximum observed values of desired geo_type as an upper bound for
  # num_locations.
  if (nrow(relevant_meta) == 0) {
    geo_nums <- max(meta_info[meta_info$geo_type == geo_type, ]$num_locations)
    relevant_meta <- list(geo_type = geo_type, num_locations = geo_nums)
  }
  return(relevant_meta)
}


# Helper function, not user-facing, to loop through a sequence of days. Chooses
# batches of days based on expected number of results, queries covidcast for
# each batch and combines the results.
covidcast_days <- function(data_source, signal, start_day, end_day, geo_type,
                           geo_value, time_type, as_of, issues, lag,
                           max_geos) {
  days <- date_sequence(start_day, end_day, time_type)
  ndays <- length(days)

  # issues is either a single date, or a vector with a start and end date.
  if (length(issues) == 2) {
    nissues <- length(date_sequence(issues[1], issues[2], time_type))
  } else {
    nissues <- 1
  }

  max_results <- getOption("covidcast.max_results", default = MAX_RESULTS)

  # In the worst case, each geo_value could have data issued on every day in
  # the requested issue range, so this likely overestimates the row count when
  # multiple issue dates are requested, producing more (smaller) batches than
  # strictly necessary.
  max_days_at_time <- floor(max_results / (max_geos * nissues))
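  # For example (illustrative numbers only): with the default limit of
  # 1,000,000 rows, max_geos = 3200, and a single issue per day, each batch
  # covers floor(1e6 / (3200 * 1)) = 312 days.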

  # In theory, we could exceed max rows with 1 day, but try anyway
  if (max_days_at_time == 0) {
    max_days_at_time <- 1
  }
  num_batches <- ceiling(ndays / max_days_at_time)
  dat <- list()

  # The API limits the number of rows that can be returned at once, so we query
  # in batches.
  for (i in seq_len(num_batches)) {
    start_offset <- (i - 1) * max_days_at_time
    end_offset <- min(i * max_days_at_time, ndays) - 1
    query_start_day <- start_day + start_offset
    query_end_day <- start_day + end_offset

    # Expected number of unique dates fetched, to check against later
    num_time_values <- length(days[(start_offset + 1):(end_offset + 1)])

    # Use range calls where possible for speed.
    time_values <- paste0(days[(start_offset + 1)], "-", days[(end_offset + 1)])
    response <- covidcast(data_source = data_source,
                          signal = signal,
                          time_type = time_type,
                          geo_type = geo_type,
                          time_values = time_values,
                          geo_value = geo_value,
                          as_of = as_of,
                          issues = issues,
                          lag = lag)

    if (is.null(response)) {
      warn(paste0("Fetching ", signal, " from ", data_source, " for ",
                  query_start_day, " to ", query_end_day,
                  " in geography '", geo_value, "': no results"),
           data_source = data_source,
           signal = signal,
           start_day = query_start_day,
           end_day = query_end_day,
           geo_value = geo_value,
           time_type = time_type,
           class = "covidcast_fetch_failed")

      next
    }

    dat[[i]] <- response

    summary <- sprintf(
      "Fetched day %s to %s: num_entries = %s",
      query_start_day,
      query_end_day,
      nrow(response))

    if (length(summary) != 0) {
      message(summary)
    }

    if (nrow(response) > 0) {
      desired_geos <- tolower(unique(geo_value))

      returned_geo_array <- response %>%
        dplyr::select("geo_value", "time_value") %>%
        dplyr::group_by(.data$time_value) %>%
        dplyr::summarize(geo_value = list(geo_value))
      returned_time_values <- returned_geo_array$time_value

      if (length(returned_time_values) != num_time_values) {
        missing_time_values <- setdiff(days[(start_offset + 1):(end_offset + 1)],
                                       returned_time_values)
        missing_dates <- api_to_date(missing_time_values, time_type)

        warn(sprintf("Data not fetched for the following days: %s",
                     paste(missing_dates, collapse = ", ")),
             data_source = data_source,
             signal = signal,
             day = missing_dates,
             geo_value = geo_value,
             time_type = time_type,
             class = "covidcast_missing_time_values"
        )
      }

      if (!identical("*", geo_value)) {
        missing_geo_array <- returned_geo_array[
          lapply(returned_geo_array$geo_value, length) < length(desired_geos), ]

        if (nrow(missing_geo_array) > 0) {
          missing_geo_array$warning <-
            unlist(apply(returned_geo_array,
                         1,
                         FUN = function(row) {
                           geo_warning_message(row, desired_geos, time_type)
                         }))
          warn(missing_geo_array$warning,
               data_source = data_source,
               signal = signal,
               day = api_to_date(missing_geo_array$time_value, time_type),
               geo_value = geo_value,
               time_type = time_type,
               class = "covidcast_missing_geo_values")
        }
      }
    }
  }

  df <- dat %>%
    purrr::map(purrr::compact) %>% # remove the list elements that are NULL
    dplyr::bind_rows() # make this into a data frame

  if (nrow(df) > 0) {
    # If no data is found, there is no time_value column to report
    df$time_value <- api_to_date(df$time_value, time_type)
    df$issue <- api_to_date(df$issue, time_type)
  }

  return(df)
}

# Get a sequence of dates, in the format required by the API, representing the
# range from start_day to end_day.
#' @importFrom MMWRweek MMWRweek
date_sequence <- function(start_day, end_day, time_type = c("day", "week")) {
  time_type <- match.arg(time_type)

  if (time_type == "day") {
    return(date_to_string(seq(start_day, end_day, by = 1)))
  } else if (time_type == "week") {
    dates <- seq(start_day, end_day, by = "1 week")
    return(date_to_string(dates, time_type))
  }
}

# Helper function (not user facing) to create warning messages when geo_values
# are missing.
geo_warning_message <- function(row, desired_geos, time_type) {
  missing_geos <- setdiff(desired_geos, unlist(row$geo_value))
  if (length(missing_geos) > 0) {
    missing_geos_str <- paste0(missing_geos, collapse = ", ")
    err_msg <- sprintf("Data not fetched for some geographies on %s: %s",
                       api_to_date(row$time_value, time_type), missing_geos_str)
    return(err_msg)
  }
}

# Fetch Delphi's COVID-19 indicators
covidcast <- function(data_source, signal, time_type, geo_type, time_values,
                      geo_value, as_of, issues, lag) {
  # Check parameters
  if (missing(data_source) || missing(signal) || missing(time_type) ||
       missing(geo_type) || missing(time_values) || missing(geo_value)) {
    stop("`data_source`, `signal`, `time_type`, `geo_type`, `time_values`, ",
         "and `geo_value` are all required.")
  }

  # Set up request
  params <- list(
    data_source = data_source,
    signal = signal,
    time_type = time_type,
    geo_type = geo_type,
    time_values = .list(time_values),
    geo_value = geo_value,
    format = "csv"
  )

  if (length(params$geo_value) > 1) {
    # convert to string
    params$geo_values <- paste0(params$geo_value, collapse = ",")
    params$geo_value <- NULL
  }
  if (!is.null(as_of)) {
    params$as_of <- date_to_string(as_of, time_type)
  }

  if (!is.null(issues)) {
    if (length(issues) == 2) {
      params$issues <- paste0(date_to_string(issues[1], time_type),
                              "-",
                              date_to_string(issues[2], time_type))
    } else if (length(issues) == 1) {
      params$issues <- date_to_string(issues, time_type)
    } else {
      stop("`issues` must be either a single date or a date interval.")
    }
  }

  if (!is.null(lag)) {
    params$lag <- lag
  }

  # Make the API call. If the API returns a non-200 status code, indicating e.g.
  # a database error, .request() raises an error. It returns an empty string if
  # there are no results for our query.
  response <- .request("covidcast", params)
  if (nchar(response) == 0) {
    # empty if no results
    return(NULL)
  }

  # geo_value must be read as character so FIPS codes are returned as character,
  # not numbers (with leading 0s potentially removed)
  return(read.csv(textConnection(response), stringsAsFactors = FALSE,
                  colClasses = c("geo_value" = "character")))
}

# Helper function to cast values and/or ranges to strings
.listitem <- function(value) {
  if(is.list(value) && 'from' %in% names(value) && 'to' %in% names(value)) {
    return(paste0(toString(value$from), '-', toString(value$to)))
  } else {
    return(toString(value))
  }
}

# Helper function to build a list of values and/or ranges
.list <- function(values) {
  if(!is.list(values) || (
    'from' %in% names(values) && 'to' %in% names(values))) {
    values <- list(values)
  }
  return(paste0(unlist(lapply(values, .listitem)), collapse=','))
}
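
# For example (a sketch of the two helpers above):
#   .list(list("pa", "nj"))                      # "pa,nj"
#   .list(list(from = 20200601, to = 20200607))  # "20200601-20200607"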

# from testthat::is_testing(), under MIT license
.is_testing <- function() {
  identical(Sys.getenv("TESTTHAT"), "true")
}

# Helper function to use cached metadata whenever possible
.request_meta <- function() {
  # temporary check while we wait for rerequest support in httptest: always
  # request while testing. see
  # https://github.com/nealrichardson/httptest/issues/84
  pkg_env$META_RESPONSE <- if(identical(pkg_env$META_RESPONSE, NA) || .is_testing()) {
    .request("covidcast_meta", list(format = "csv"), raw = TRUE)
  } else {
    httr::rerequest(pkg_env$META_RESPONSE)
  }

  report_api_errors(pkg_env$META_RESPONSE)

  return(httr::content(pkg_env$META_RESPONSE, as = "text",
                       encoding = "utf-8"))
}

# Helper function to request and parse epidata
.request <- function(endpoint, params, raw = FALSE) {
  # API call. Allow base API URL to be replaced, e.g. to use a staging/testing
  # server when needed.
  url <- paste0(getOption("covidcast.base_url", default = COVIDCAST_BASE_URL),
                endpoint, "/")
  auth <- getOption("covidcast.auth", default = NA)
  user_agent <- httr::user_agent("covidcastR")
  headers <- httr::add_headers(Authorization = ifelse(is.na(auth), "", paste("Bearer", auth)))

  response <- httr::GET(url,
                        user_agent,
                        headers,
                        query = params)

  # Query URL can be too long if lots of time values or geo values are involved;
  # switch to POST
  if (httr::status_code(response) == 414) {
    response <- httr::POST(url, user_agent,
                           headers,
                           body = params)
  }

  report_api_errors(response)


  if (raw) {
    return(response)
  }

  return(httr::content(response, as = "text",
                       encoding = "utf-8"))
}

# Examine httr response object and report errors accordingly
report_api_errors <- function(response) {
  auth <- getOption("covidcast.auth", default = NA)

  # HTTP 429 Too Many Requests may be API rate-limiting
  if (httr::status_code(response) == 429 && is.na(auth)) {
    msg <- xml2::xml_text(xml2::xml_find_all(xml2::read_html(
        httr::content(response, as = "text", encoding = "utf-8")),
        "//p"
        ))
    abort(c("Rate limit exceeded when fetching data from API anonymously. See the \"API keys\" section of the `covidcast_signal()` documentation for information on registering for an API key.",
            "i" = "Message from server:",
            "i" = msg),
          class = "covidcast_rate_limit",
          call = parent.frame())
  }

  # Catch all other kinds of errors
  msg <- "fetch data from API"
  if (is.na(auth)) {
    msg <- paste(msg, "anonymously")
  }

  httr::stop_for_status(response, task = msg)
}

# Convert a Date to the string format expected by the API
date_to_string <- function(mydate, time_type = c("day", "week")) {
  time_type <- match.arg(time_type)

  if (time_type == "day") {
    format(mydate, "%Y%m%d")
  } else {
    weeks <- MMWRweek(mydate)

    # API takes MMWR weeks in YYYYWW format, so force zero-padding as needed.
    return(sprintf("%d%02d", weeks$MMWRyear, weeks$MMWRweek))
  }
}
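
# For example: date_to_string(as.Date("2020-06-01")) gives "20200601", while
# with time_type = "week" the same date maps to its MMWR year and week,
# "202023".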

# Convert dates from API to Date objects. For days, get a Date; for epiweeks,
# get the Date of the first day in the epiweek.
#' @importFrom MMWRweek MMWRweek2Date
api_to_date <- function(str, time_type = c("day", "week")) {
  time_type <- match.arg(time_type)

  if (time_type == "day") {
    return(as.Date(as.character(str), format = "%Y%m%d"))
  } else if (time_type == "week") {
    # Extract year and week number from string.
    years <- as.numeric(substr(str, 1, 4))
    weeks <- as.numeric(substr(str, 5, 6))
    return(MMWRweek2Date(years, weeks))
  }
}
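
# For example: api_to_date("20200601") gives as.Date("2020-06-01"), and
# api_to_date("202023", "week") gives the first day (a Sunday) of MMWR week 23
# of 2020.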
