get_metadata_nhgis: List available data sources from IPUMS NHGIS

View source: R/api_metadata.R

get_metadata_nhgisR Documentation

List available data sources from IPUMS NHGIS

Description

Retrieve information about available NHGIS data sources, including datasets, data tables (summary tables), time series tables, and shapefiles (GIS files).

To retrieve summary metadata for all available data sources of a particular type, use the type argument. To retrieve detailed metadata for a single data source, use the dataset, data_table, or time_series_table argument. See the metadata availability section below for information on the metadata provided for each data type.

For general information, see the NHGIS data source overview and the FAQ.

Learn more about the IPUMS API in vignette("ipums-api") and NHGIS extract definitions in vignette("ipums-api-nhgis").

Usage

get_metadata_nhgis(
  type = NULL,
  dataset = NULL,
  data_table = NULL,
  time_series_table = NULL,
  delay = 0,
  api_key = Sys.getenv("IPUMS_API_KEY")
)

Arguments

type

One of "datasets", "data_tables", "time_series_tables", or "shapefiles" indicating the type of summary metadata to retrieve. Leave NULL if requesting metadata for a single dataset, data_table, or time_series_table.

dataset

Name of an individual dataset for which to retrieve metadata.

data_table

Name of an individual data table for which to retrieve metadata. If provided, an associated dataset must also be specified.

time_series_table

Name of an individual time series table for which to retrieve metadata.

delay

Number of seconds to delay between successive API requests, if multiple requests are needed to retrieve all records.

A delay is highly unlikely to be necessary and is intended only as a fallback in the event that you cannot retrieve all metadata records without exceeding the API rate limit.

Only used if type is provided.

api_key

API key associated with your user account. Defaults to the value of the IPUMS_API_KEY environment variable. See set_ipums_api_key().

Value

If type is provided, a tibble of summary metadata for all data sources of the provided type. Otherwise, a named list of metadata for the specified dataset, data_table, or time_series_table.

Metadata availability

The following sections summarize the metadata fields provided for each data type. Summary metadata include a subset of the fields provided for individual data sources.

Datasets:

  • name: The unique identifier for the dataset. This is the value that is used to refer to the dataset when interacting with the IPUMS API.

  • group: The group of datasets to which the dataset belongs. For instance, 5 separate datasets are part of the "2015 American Community Survey" group.

  • description: A short description of the dataset.

  • sequence: Order in which the dataset will appear in the metadata API and extracts.

  • has_multiple_data_types: Logical value indicating whether multiple data types exist for this dataset. For example, ACS datasets include both estimates and margins of error.

  • data_tables: A tibble containing names, codes, and descriptions for all data tables available for the dataset.

  • geog_levels: A tibble containing names, descriptions, and extent information for the geographic levels available for the dataset. The has_geog_extent_selection field contains logical values indicating whether extent selection is allowed (and required) for the associated geographic level. See geographic_instances below.

  • breakdowns: A tibble containing names, types, descriptions, and breakdown values for all breakdowns available for the dataset.

  • years: A vector of years for which the dataset is available. This field is only present if a dataset is available for multiple years. Note that ACS datasets are not considered to be available for multiple years.

  • geographic_instances: A tibble containing names and descriptions for all valid geographic extents for the dataset. This field is only present if at least one of the dataset's geog_levels allows geographic extent selection.

Data tables:

  • name: The unique identifier for the data table within its dataset. This is the value that is used to refer to the data table when interacting with the IPUMS API.

  • description: A short description of the data table.

  • universe: The statistical population measured by this data table (e.g. persons, families, occupied housing units, etc.)

  • nhgis_code: The code identifying the data table in the extract. Variables in the extract data will include column names prefixed with this code.

  • sequence: Order in which the data table will appear in the metadata API and extracts.

  • dataset_name: Name of the dataset to which this data table belongs.

  • n_variables: Number of variables included in this data table.

  • variables: A tibble containing variable descriptions and codes for the variables included in the data table

Time series tables:

  • name: The unique identifier for the time series table. This is the value that is used to refer to the time series table when interacting with the IPUMS API.

  • description: A short description of the time series table.

  • geographic_integration: The method by which the time series table aligns geographic units across time. "Nominal" integration indicates that geographic units are aligned by name (disregarding changes in unit boundaries). "Standardized" integration indicates that data from multiple time points are standardized to the indicated year's census units. For more information, click here.

  • sequence: Order in which the time series table will appear in the metadata API and extracts.

  • time_series: A tibble containing names and descriptions for the individual time series available for the time series table.

  • years: A tibble containing information on the available data years for the time series table.

  • geog_levels: A tibble containing names and descriptions for the geographic levels available for the time series table.

Shapefiles:

  • name: The unique identifier for the shapefile. This is the value that is used to refer to the shapefile when interacting with the IPUMS API.

  • year: The survey year in which the shapefile's represented areas were used for tabulations, which may be different than the vintage of the represented areas. For more information, click here.

  • geographic_level: The geographic level of the shapefile.

  • extent: The geographic extent covered by the shapefile.

  • basis: The derivation source of the shapefile.

  • sequence: Order in which the shapefile will appear in the metadata API and extracts.

See Also

define_extract_nhgis() to create an IPUMS NHGIS extract definition.

Examples

## Not run: 
library(dplyr)

# Get summary metadata for all available sources of a given data type
get_metadata_nhgis("datasets")

# Filter to identify data sources of interest by their metadata values
all_tsts <- get_metadata_nhgis("time_series_tables")

tsts <- all_tsts %>%
  filter(
    grepl("Children", description),
    grepl("Families", description),
    geographic_integration == "Standardized to 2010"
  )

tsts$name

# Get detailed metadata for a single source with its associated argument:
cs5_meta <- get_metadata_nhgis(time_series_table = "CS5")
cs5_meta$geog_levels

# Use the available values when defining an NHGIS extract request
define_extract_nhgis(
  time_series_tables = tst_spec("CS5", geog_levels = "state")
)

# Detailed metadata is also provided for datasets and data tables
get_metadata_nhgis(dataset = "1990_STF1")
get_metadata_nhgis(data_table = "NP1", dataset = "1990_STF1")

# Iterate over data sources to retrieve detailed metadata for several
# records. For instance, to get variable metadata for a set of data tables:
tables <- c("NP1", "NP2", "NP10")

var_meta <- purrr::map(
  tables,
  function(dt) {
    dt_meta <- get_metadata_nhgis(dataset = "1990_STF1", data_table = dt)

    # This ensures you avoid hitting rate limit for large numbers of tables
    Sys.sleep(1)

    dt_meta$variables
  }
)

## End(Not run)

ipumsr documentation built on Sept. 12, 2024, 7:38 a.m.