read_nhgis: Read tabular data from an NHGIS extract

View source: R/nhgis_read.R

read_nhgisR Documentation

Read tabular data from an NHGIS extract

Description

Read a csv or fixed-width (.dat) file downloaded from the NHGIS extract system.

To read spatial data from an NHGIS extract, use read_ipums_sf().

Usage

read_nhgis(
  data_file,
  file_select = NULL,
  vars = NULL,
  col_types = NULL,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  do_file = NULL,
  var_attrs = c("val_labels", "var_label", "var_desc"),
  remove_extra_header = TRUE,
  verbose = TRUE,
  data_layer = deprecated()
)

Arguments

data_file

Path to a .zip archive containing an NHGIS extract or a single file from an NHGIS extract.

file_select

If data_file is a .zip archive that contains multiple files, an expression identifying the file to load. Accepts a character vector specifying the file name, a tidyselect selection, or an index position. This must uniquely identify a file.

vars

Names of variables to include in the output. Accepts a vector of names or a tidyselect selection. If NULL, includes all variables in the file.

col_types

One of NULL, a cols() specification or a string. If NULL, all column types will be inferred from the values in the first guess_max rows of each column. Alternatively, you can use a compact string representation to specify column types:

  • c = character

  • i = integer

  • n = number

  • d = double

  • l = logical

  • f = factor

  • D = date

  • T = date time

  • t = time

  • ? = guess

  • _ or - = skip

See read_delim() for more details.

n_max

Maximum number of lines to read.

guess_max

For .csv files, maximum number of lines to use for guessing column types. Will never use more than the number of lines read.

do_file

For fixed-width files, path to the .do file associated with the provided data_file. The .do file contains the parsing instructions for the data file.

By default, looks in the same path as data_file for a .do file with the same name. See Details section below.

var_attrs

Variable attributes to add from the codebook (.txt) file included in the extract. Defaults to all available attributes.

See set_ipums_var_attributes() for more details.

remove_extra_header

If TRUE, remove the additional descriptive header row included in some NHGIS .csv files.

This header row is not usually needed as it contains similar information to that included in the "label" attribute of each data column (if var_attrs includes "var_label").

verbose

Logical controlling whether to display output when loading data. If TRUE, displays IPUMS conditions, a progress bar, and column types. Otherwise, all are suppressed.

Will be overridden by readr.show_progress and readr.show_col_types options, if they are set.

data_layer

[Deprecated] Please use file_select instead.

Details

The .do file that is included when downloading an NHGIS fixed-width extract contains the necessary metadata (e.g. column positions and implicit decimals) to correctly parse the data file. read_nhgis() uses this information to parse and recode the fixed-width data appropriately.

If you no longer have access to the .do file, consider resubmitting the extract that produced the data. You can also change the desired data format to produce a .csv file, which does not require additional metadata files to be loaded.

For more about resubmitting an existing extract via the IPUMS API, see vignette("ipums-api", package = "ipumsr").

Value

A tibble containing the data found in data_file

See Also

read_ipums_sf() to read spatial data from an IPUMS extract.

read_nhgis_codebook() to read metadata about an IPUMS NHGIS extract.

ipums_list_files() to list files in an IPUMS extract.

Examples

# Example files
csv_file <- ipums_example("nhgis0972_csv.zip")
fw_file <- ipums_example("nhgis0730_fixed.zip")

# Provide the .zip archive directly to load the data inside:
read_nhgis(csv_file)

# For extracts that contain multiple files, use `file_select` to specify
# a single file to load. This accepts a tidyselect expression:
read_nhgis(fw_file, file_select = matches("ds239"), verbose = FALSE)

# Or an index position:
read_nhgis(fw_file, file_select = 2, verbose = FALSE)

# For CSV files, column types are inferred from the data. You can
# manually specify column types with `col_types`. This may be useful for
# geographic codes, which should typically be interpreted as character values
read_nhgis(csv_file, col_types = list(MSA_CMSAA = "c"), verbose = FALSE)

# Fixed-width files are parsed with the correct column positions
# and column types automatically:
read_nhgis(fw_file, file_select = contains("ts"), verbose = FALSE)

# You can also read in a subset of the data file:
read_nhgis(
  csv_file,
  n_max = 15,
  vars = c(GISJOIN, YEAR, D6Z002),
  verbose = FALSE
)

ipumsr documentation built on Sept. 12, 2024, 7:38 a.m.